CN107948729B - Rich media processing method and device, storage medium and electronic equipment


Info

Publication number: CN107948729B
Application number: CN201711332691.5A
Authority: CN (China)
Prior art keywords: rich media, scene, audio, playing, information
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN107948729A
Inventor: 董治
Current Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Events: application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd; priority to CN201711332691.5A; publication of CN107948729A; application granted; publication of CN107948729B; anticipated expiration.

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/439 Processing of audio elementary streams
    • H04N 21/4394 Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47 End-user applications
    • H04N 21/472 End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N 21/8455 Structuring of content, e.g. decomposing content into time segments involving pointers to the content, e.g. pointers to the I-frames of the video stream

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The application relates to a rich media processing method, a rich media processing device, a storage medium and an electronic device. The method comprises the following steps: acquiring audio information in the rich media; determining scene information contained in the rich media according to the audio information; dividing the rich media into scene types matched with the scene information, and displaying the scene types; and, in response to the user's selection of a scene type, playing the rich media matched with that scene type. The rich media processing method, the rich media processing device, the storage medium and the electronic equipment can improve the flexibility of rich media processing.

Description

Rich media processing method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a rich media processing method and apparatus, a storage medium, and an electronic device.
Background
With the popularization of the shooting function, more and more users record surrounding scenes, or take self-shots, at any time and any place through terminals with a shooting function, forming videos. Users typically send the captured videos to friends or other users through applications such as instant messaging.
When a user clicks to play received rich media, such as a video or other animation with sound, or rich media in the terminal's own album, the rich media is played at the terminal's current volume or the volume last set. However, since the terminal cannot know the specific environment at the time of playing, the rich media may be played at a high volume in a quiet environment, disturbing the surroundings; or it may be played at a low volume in a noisy environment, making the specific sounds in the rich media difficult to hear.
Disclosure of Invention
The embodiment of the application provides a rich media processing method and device, a storage medium and electronic equipment, which can improve the flexibility of rich media processing.
A rich media processing method, comprising:
acquiring audio information in the rich media;
determining scene information contained in the rich media according to the audio information;
dividing the rich media into scene types matched with the scene information, and displaying the scene types;
and responding to the selection of the scene type by the user, and playing the rich media matched with the scene type.
A rich media processing device, the device comprising:
the audio information acquisition module is used for acquiring audio information in the rich media;
the scene information identification module is used for determining scene information contained in the rich media according to the audio information;
the classification module is used for dividing the rich media into scene types matched with the scene information and displaying the scene types;
and the playing module is used for responding to the selection of the scene type by the user and playing the rich media matched with the scene type.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method of any of the embodiments of the application.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method of any one of the embodiments when executing the computer program.
According to the rich media processing method, the audio information in the rich media is obtained, and the scene information contained in the rich media is determined according to the audio information; the rich media is divided into the scene types matched with the scene information, and the scene types are displayed, so that the scene types of the sound in the rich media can be known before the rich media is played; the rich media matched with a scene type is then played in response to the user's selection of that scene type, which improves the flexibility of rich media playing.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a diagram of an application environment of a rich media processing method in one embodiment;
FIG. 2 is a schematic diagram showing an internal configuration of an electronic apparatus according to an embodiment;
FIG. 3 is a flow diagram of a rich media processing method in one embodiment;
FIG. 4A is a diagram of a rich media preview in one embodiment;
FIG. 4B is a diagram of a rich media preview in another embodiment;
FIG. 4C is a diagram of a rich media preview in yet another embodiment;
FIG. 5 is a flow diagram of playing rich media in one embodiment;
FIG. 6 is a flowchart illustrating entering a playing screen of a scene corresponding to a scene type according to a playing instruction and playing the scene in one embodiment;
FIG. 7 is a flowchart of a rich media processing method in another embodiment;
FIG. 8 is a block diagram of a rich media processing device in one embodiment;
FIG. 9 is a block diagram showing the construction of a rich media processing device in another embodiment;
FIG. 10 is a block diagram showing the construction of a rich media processing device in still another embodiment;
FIG. 11 is a block diagram showing a configuration of a rich media processing device in still another embodiment;
FIG. 12 is a block diagram of a portion of the structure of a handset associated with an electronic device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
FIG. 1 is a diagram of an application environment of a rich media processing method in one embodiment. As shown in fig. 1, the application environment includes an electronic device 110 and a server 120. The electronic device 110 is connected to the server 120 through a network, the electronic device 110 includes but is not limited to any terminal such as a mobile phone, a handheld game console, a tablet computer, a personal digital assistant or a wearable device, and the electronic device may also be a server. The server 120 may be an independent server, a server cluster composed of a plurality of servers, or one or more sub-servers in the server cluster. The electronic device 110 may obtain the rich media from the server 120, obtain the rich media stored in the electronic device itself, and perform independent processing on the rich media, or interact with the server to perform processing on the rich media.
In one embodiment, as shown in FIG. 2, a schematic diagram of an internal structure of an electronic device is provided. The electronic device includes a processor, a memory, and a display screen connected by a system bus. The processor is used to provide calculation and control capability and to support the operation of the whole electronic device. The memory is used for storing data, programs, and/or instruction codes; at least one computer program is stored on the memory, and this computer program can be executed by the processor to realize the rich media processing method suitable for the electronic device provided by the embodiments of the application. The memory may include a non-volatile storage medium, such as a magnetic disk, an optical disk, or a Read-Only Memory (ROM), as well as a Random Access Memory (RAM). For example, in one embodiment, the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a database, and a computer program. The database stores data related to implementing the rich media processing method provided in the above embodiments; for example, rich media may be stored. The computer program can be executed by a processor to implement the rich media processing method provided by the above embodiments. The internal memory provides a cached operating environment for the operating system, database, and computer program in the non-volatile storage medium. The display screen can be a touch screen, such as a capacitive screen or an electronic screen, used for displaying visual information such as rich media; it can also be used for detecting touch operations acting on the display screen and generating corresponding instructions.
Those skilled in the art will appreciate that the architecture shown in fig. 2 is a block diagram of only a portion of the architecture associated with the present application, and does not constitute a limitation on the electronic devices to which the present application may be applied; a particular electronic device may include more or fewer components than those shown, combine certain components, or have a different arrangement of components. For example, the electronic device further includes a network interface connected via the system bus, where the network interface may be an Ethernet card or a wireless network card, and is used for communicating with an external electronic device, such as a server, to transmit data such as video.
In one embodiment, as shown in FIG. 3, a rich media processing method is provided. The embodiment mainly takes the application of the method to the electronic device shown in fig. 2 as an example for explanation, and the method includes:
step 302, obtaining audio information in the rich media.
The rich media is a rich media file which needs to be classified. Rich media refers to a combination of one or more forms including streaming media, voice, Flash, and programming languages such as Java, JavaScript, and DHTML. Alternatively, the rich media in this application refers to rich media containing sound information, such as a video, or a gif animation image with sound information. The rich media may be stored in the memory of the electronic device itself, or may be stored in a server in the cloud. The electronic device can extract the rich media from the local memory and can also obtain the rich media from the cloud server.
In one embodiment, the electronic device may retrieve the rich media automatically or in response to a manual operation. For example, the electronic device may receive a classification instruction for the rich media, and retrieve the rich media based on the classification instruction. Alternatively, the electronic equipment can analyze the rich media on the local machine and take videos meeting a preset classification condition as the rich media. For example, a video received through social software such as an instant messaging application may be used as the rich media, and a video whose playing time and/or size is within a preset range may also be used as the rich media. For example, a video with a playing time within 10 minutes, or a video with a size within 100 MB, can be used as the rich media, so as to reduce the processing workload.
The audio information represents the audio contained in the rich media, i.e., the sound that is played out when the rich media is played. The electronic equipment can extract the audio information of the rich media through a preset audio extraction tool, or can call preset recording software to record the sound in the rich media and use the recorded sound as the audio information. For example, a preset audio extraction tool may be invoked, the rich media used as its input, and the tool run to extract the corresponding audio information from the rich media. The extracted audio information may be the complete audio information in the rich media, or only a part of it. Alternatively, whether the audio information in the rich media is extracted completely may be determined according to the video duration of the rich media. For example, when the duration of the video exceeds a preset duration, part of the audio information can be extracted; when the duration is less than the preset duration, the complete audio information is extracted.
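As an illustration of this step, the sketch below extracts the audio track of a rich media file with the ffmpeg command-line tool. ffmpeg is only an assumed stand-in for the "preset audio extraction tool" the text mentions, and the duration cutoff mirrors the partial-extraction case for long videos.

    import subprocess

    def extract_audio(video_path, out_path, max_seconds=None):
        # Pull the audio track out of the rich media; "-vn" drops the video stream.
        cmd = ["ffmpeg", "-y", "-i", video_path, "-vn", "-acodec", "pcm_s16le"]
        if max_seconds is not None:
            # For long videos, extract only part of the audio information.
            cmd += ["-t", str(max_seconds)]
        subprocess.run(cmd + [out_path], check=True)

    # e.g. extract at most the first 10 minutes of a long video:
    # extract_audio("clip.mp4", "clip.wav", max_seconds=600)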
And step 304, determining scene information contained in the rich media according to the audio information.
The scene information represents information characterizing the sound content, sound intensity, and/or sound theme in the rich media to be processed. The sound content represents the specific content of the sound heard when the rich media to be processed is played; it can be, for example, a bird call, the sound of wind, or laughter. The sound intensity indicates how strong the heard sound is, such as the sound in one time period being loud and the sound in another time period being quiet. The sound theme may be a theme divided according to sound intensity and/or sound content. For example, the scene information may be classified as noisy or quiet according to the intensity of the sound, and may be classified into sound themes such as music, character, or nature according to the theme to which the sound content belongs.
The electronic equipment presets several kinds of scene information, as well as the scene information to which different sound contents, intensities, and/or sound types belong, and judges the scene information contained in the rich media by recognizing the sound content, intensity, and/or sound type in the audio information. For example, the strength of the sound signal may be analyzed to determine whether the scene information in the rich media is of the quiet, general, or noisy type; or, by analyzing the specific content of the sound, the theme to which the sound contained in the rich media belongs can be identified. For example, when the audio information contains music, it is judged that the sound theme contained in the rich media belongs to the scene information of the music type; when a voice is contained, it is judged that the rich media contains scene information of the character type; when natural sounds such as wind or the sea are included, it is determined that the sound theme of the rich media belongs to scene information of the nature type.
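A minimal sketch of such a preset mapping is shown below; the label names and table entries are illustrative assumptions, not the patent's actual presets.

    # Hypothetical mapping from recognized sound content to a sound theme.
    CONTENT_TO_THEME = {
        "music": "music",
        "speech": "character",
        "wind": "nature",
        "sea": "nature",
        "bird": "nature",
    }

    def scene_themes(recognized_labels):
        # Collect the sound themes implied by the recognized audio content.
        return {CONTENT_TO_THEME[l] for l in recognized_labels if l in CONTENT_TO_THEME}

    # scene_themes(["sea", "speech"]) -> {"nature", "character"}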
And step 306, dividing the rich media into scene types matched with the scene information, and displaying the scene types.
The electronic device also sets scene types that match different scene information. For example, for quiet, general, or noisy scene information, the corresponding matched scene type is quiet, general, or noisy. The quiet type represents that the sound signal in the rich media is weak or absent, so little or no sound appears during playing; the noisy type represents that very intense sound exists in the rich media, which can easily disturb others if played in a quiet environment; the general type is intermediate between the quiet and noisy types described above.
For example, when the electronic device analyzes that the audio information in the rich media is mainly relatively soft sound, it may determine that the scene type to which the rich media belongs is the general type; when the analyzed audio information is mainly intense sound, it judges that the scene type of the rich media is the noisy type; and when the sound signals in the analyzed audio information are all below a preset signal threshold, it judges that the scene type to which the rich media belongs is the quiet type. It is understood that the scene types can be divided in various ways, and are not limited to the above.
For a scene type into which the rich media is to be partitioned, the electronic device may expose the scene type. For example, the scene type of the corresponding rich media can be displayed on the preview screen of the corresponding rich media, so that the scene type of the corresponding rich media can be known before the rich media is played.
In one embodiment, the electronic device sets corresponding processing modes for different scene types, and processes the rich media according to the corresponding mode. For example, corresponding prompt information can be set for the rich media and displayed; the prompt information prompts the user with the scene type of the rich media, so that the user can decide the playing volume to use according to the actual environment and reduce disturbance to others.
In response to the user's selection of the scene type, the rich media matching the scene type is played, step 308.
The user can select a displayed scene type; in response to the received selection operation, the electronic equipment triggers a playing instruction for the scene type selected by the user, and plays the rich media matched with that scene type. For example, when the selected scene type is the quiet type, the rich media of the quiet type can be played.
For example, as shown in FIGS. 4A-4C, several ways of presenting scene types for rich media are provided. As shown in FIG. 4A, "quiet rich media" indicates that the scene type of the rich media 1-6 is quiet; the "quiet rich media" may take the form of an album, with rich media of the same scene type collected into the same album, for example the "quiet rich media" album. As shown in FIG. 4B, for all rich media in the electronic device in one embodiment, the scene type of the corresponding rich media can be displayed on the thumbnail of each rich media; for example, the scene type is marked on the thumbnails of the rich media 1-6, on the thumbnails of the rich media 7-8 (marked as the quiet type), and on the thumbnails of the rich media 9-11. Or, for the thumbnail display interface of a single rich media, a scene type representing a sound theme may further be marked on its thumbnail. As shown in FIG. 4C, when the scene types to which the rich media 400 belongs include a bird call and laughter, type marks corresponding to these scene types may be set in the thumbnail, such as the laugh mark 402 and the bird call mark 404 in the figure. The rich media 400 can be a video, and can be a quiet rich media such as the rich media 7 or the rich media 8 in FIG. 4B.
According to the rich media processing method, the audio information in the rich media is obtained, and the scene information contained in the rich media is determined according to the audio information; the rich media is divided into the scene types matched with the scene information, and the scene types are displayed, so that the scene types of the sound in the rich media can be known before the rich media is played. The rich media matched with a scene type is then played in response to the user's selection of that scene type, improving the flexibility of playing the rich media.
In one embodiment, step 304 includes: performing audio content identification on the audio information; judging the strength of the sound signal in the rich media according to the identified audio content; and/or determining a subject to which the sound in the rich media belongs based on the identified audio content; step 306 includes: and dividing the rich media into scene types matched with the judgment result.
In this embodiment, the audio content indicates the specific content heard, and the intensity of the sound signal, when the sound in the audio is played. For example, if the sound content in the audio information is the sound of sea waves, the audio content is sea sound; if it is the sound of gunfire, the audio content is a gunshot; if it is laughter, the audio content is laughter, and so on. Alternatively, the intensity of the sound signal may be quiet, general, or noisy. The electronic device sets correspondingly matched scene types according to different sound signal intensities and/or the themes to which different sounds belong. From the recognized sound intensity or theme, the matched scene type can be determined according to this correspondence, and the rich media is divided into the corresponding matched scene type.
In one embodiment, dividing the rich media into scene types matching the judgment result comprises: dividing the rich media into the scene type corresponding to the sound with the strongest signal in the audio content; and/or dividing the rich media into the scene type matched with the theme.
The electronic device can determine which intensity type a scene belongs to according to the intensity of the strongest sound signal in the sound content. Optionally, a first intensity, a second intensity, and a third intensity may be set in order from small to large. When the strongest sound signal in the rich media to be processed exceeds the third intensity, the rich media is divided into the matched noisy scene type; when the strongest sound signal is between the second intensity and the third intensity, it is divided into the general scene type; and when the strongest sound signal is less than the first intensity, it is divided into the quiet scene type.
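The following sketch applies those three thresholds to a decoded waveform. The numeric values are assumptions, since the text only orders the intensities from small to large, and the range between the first and second intensity, which the text does not assign, is mapped to the general type here.

    import numpy as np

    def classify_by_strongest_signal(samples, t1=0.02, t2=0.1, t3=0.4):
        # samples: mono waveform as floats in [-1, 1]; thresholds are illustrative.
        peak = float(np.max(np.abs(samples)))
        if peak > t3:
            return "noisy"
        if peak < t1:
            return "quiet"
        # Everything from the first intensity up to the third is treated as general.
        return "general"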
Further, the electronic equipment can also perform audio content identification on the audio information to detect whether the audio information belongs to one of several predetermined themes, and divide the rich media into the scene type matched with that theme. Optionally, it may be detected whether the audio features of one or more segments of audio in the audio information match the audio features corresponding to several preset sound themes; if so, it is determined that the audio information belongs to the corresponding theme. For example, if the audio features of the audio between 2 minutes and 3 minutes 20 seconds in the audio information match the audio features of a music theme, it is determined that the audio contains the music theme, and the rich media is divided into the scene type corresponding to the matched music type.
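One way to realize this feature matching is sketched next; the crude spectral-envelope feature and the similarity threshold are assumptions standing in for whatever audio features an implementation actually presets.

    import numpy as np

    def spectral_envelope(samples, bands=32):
        # A stand-in "audio feature": band-averaged magnitude spectrum.
        spec = np.abs(np.fft.rfft(samples))
        return np.array([band.mean() for band in np.array_split(spec, bands)])

    def matches_theme(segment, theme_feature, threshold=0.9):
        # Cosine similarity between the segment's feature and a preset theme feature.
        a = spectral_envelope(segment)
        b = theme_feature
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        return sim >= threshold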
In the method, the scene type to which the rich media belongs is determined through the audio content, so that the accuracy of determining the scene type can be improved.
In one embodiment, after step 306, the method further comprises: extracting an audio clip matched with the scene information from the audio information; and forming an audio file from the audio clips. Playing the rich media matched with the scene type then comprises: playing the audio file.
After the audio content included in the audio information is identified, the audio clip corresponding to the audio content can be extracted, and the extracted audio clip is converted into an audio file with a preset format, so that the audio file can be played by adopting related audio playing software.
Alternatively, an audio clip belonging to predetermined audio content may be extracted, and the predetermined audio content may be audio content set by a user in a customized manner, so that the formed audio file is an audio file of interest to the user.
In one embodiment, the electronic device may receive an extraction instruction for audio content, where the instruction may include the selected audio content. An audio clip matched with the selected audio content is extracted, and an audio file is formed from the audio clip. Optionally, the extraction instruction may further include the start time and the end time of the audio segment. The electronic device may extract the audio segment between the start time and the end time from the audio information, and form an audio file from it.
The electronic equipment responds to the selection operation of the user on the separated audio file and plays the audio file according to the selection operation.
For example, when the audio content included in the extracting instruction is the audio content a, the start time and the end time of the audio segment corresponding to the audio content a in the audio information may be acquired, the audio segments between the start time and the end time may be extracted from the audio information, and the audio file may be formed according to the audio segments. And when the clicking operation for the audio file is detected, playing the audio file.
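A stdlib-only sketch of that extraction step follows, cutting the span between the start time and the end time out of a WAV file; a real implementation would more likely go through a media framework.

    import wave

    def clip_audio(src_wav, dst_wav, start_s, end_s):
        # Copy the [start_s, end_s] span of the audio information into its own file.
        with wave.open(src_wav, "rb") as src:
            params = src.getparams()
            rate = src.getframerate()
            src.setpos(int(start_s * rate))
            frames = src.readframes(int((end_s - start_s) * rate))
        with wave.open(dst_wav, "wb") as dst:
            dst.setparams(params)
            dst.writeframes(frames)

    # e.g. the audio content from 2 min 0 s to 3 min 20 s:
    # clip_audio("full.wav", "clip.wav", 120.0, 200.0)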
In one embodiment, after step 306, further comprising: performing video separation on the rich media; forming a video file according to the separated video information; playing the rich media matched with the scene type, comprising: and playing the video file.
Optionally, the electronic device may further perform separation processing on the audio information and the video information of the rich media to separate out the video information therein, and independently form a video file according to the separated video information, so that the video file may be viewed also in an environment where complete silence is required.
In one embodiment, the electronic device may receive an extraction instruction for video information, where the instruction may include the selected video content. A video clip matched with the selected video content is extracted, and a video file is formed from the video clip. Optionally, the extraction instruction may further include the start time and the end time of the video segment. The electronic device may extract the video segment between the start time and the end time from the video information, forming a video file from it. When a click operation on the video file is detected, the video file is played.
The electronic equipment responds to the selection operation of the user on the separated video file and plays the video file according to the selection operation.
For example, when the video content included in the extraction instruction is the video content a, the start time and the end time of the video segment corresponding to the video content a in the video information may be acquired, the video segment between the start time and the end time may be extracted from the video information, and a video file may be formed according to the video segment, so that the formed video file is a video file that is of interest to the user.
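The video case can be sketched the same way; here ffmpeg (again an assumed tool) cuts the segment and drops the audio track, producing a file that can be viewed where silence is required.

    import subprocess

    def clip_video(video_path, out_path, start_s, end_s):
        # "-an" strips the audio; "-c:v copy" keeps the video stream as-is.
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-ss", str(start_s), "-t", str(end_s - start_s),
             "-an", "-c:v", "copy", out_path],
            check=True,
        )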
In one embodiment, before step 308, the method further comprises: a type flag for marking a scene type is set to the rich media.
The type mark is used for marking the scene type. For each scene type, the electronic equipment sets a type mark corresponding to that scene type, and marks the scene type to which the corresponding video belongs through the type mark. For example, the type mark may be "bird call", "gunshot", "laugh", or the like. The electronic equipment can set the type mark at a preset display position before the rich media is played or during playing, so that a user can know the scene type of the rich media when looking at the corresponding type mark.
Optionally, the electronic device may load the set type mark at any position on the thumbnail of the rich media, so that the scene type of the corresponding video can be known through the type mark on the thumbnail before the rich media is played. Or the set type mark can be loaded at any position in the picture of video playing, so that the scene type of the corresponding video can be known in the playing process.
Taking a video as an example of the rich media, FIG. 4C shows a schematic view of a video preview in an embodiment. The thumbnail 400 is the thumbnail of a certain video; when it is determined that the scene types to which the video belongs include a bird call and laughter, type marks corresponding to those scene types can be set in the thumbnail, such as the laugh mark 402 and the bird call mark 404 in the figure.
In one embodiment, as shown in FIG. 5, step 308 comprises:
step 502, receiving a play instruction triggered by acting on the type flag.
Optionally, when the type flag is set, the electronic device further sets a play instruction of a scene corresponding to the type flag. The playing instruction represents a playing instruction for a scene corresponding to the type mark. The electronic device can further set a play button for playing the scene, and when the click operation of the play button is detected, the play instruction of the corresponding scene is triggered.
In one embodiment, the type mark showing the scene type of the rich media can be directly set as the play button; that is, the electronic device can add a play button dedicated to entering the determined scene type for the rich media, and show the type mark on the play button. As shown in FIG. 4C, the laugh mark 402 and the bird call mark 404 can also be used as a play button for the laugh scene and a play button for the bird call scene.
When a click operation on a type mark is detected, a playing instruction for the scene corresponding to the clicked type mark is triggered. The type mark can be displayed before or during playing of the rich media; when the type mark is shown during playing, it can be clicked to switch quickly to the scene corresponding to that mark.
At step 504, scenes corresponding to the type tags in the rich media are identified.
In one embodiment, after the electronic device identifies the audio content corresponding to each audio segment, a correspondence between the type tag determined according to the audio content and the audio segment may be further established. And inquiring corresponding audio clips according to the corresponding relation, and taking video parts in the time period of the audio clips in the rich media as scenes corresponding to the type marks.
Optionally, the electronic device may also record the start time and the end time of each audio segment after identifying the audio content corresponding to the audio segment. The electronic equipment can inquire the starting time and the ending time of the corresponding audio clip according to the playing instruction, and the video part in the time period of the starting time and the ending time is used as the scene corresponding to the type mark.
In an embodiment, the execution sequence between step 502 and step 504 may not be limited, for example, step 504 may be executed before step 502, that is, before the rich media is played, the scene corresponding to each type mark may be identified in advance, so that according to the playing instruction, the playing of the corresponding scene may be performed quickly.
Step 506, entering a playing picture of a scene corresponding to the scene type according to the playing instruction, and playing.
After determining the scene type to be played, the electronic device can enter the playing picture corresponding to the scene type and play the scene, so as to improve the playing flexibility. Optionally, the picture corresponding to the start time of the scene may be entered directly and played, or the picture corresponding to a time earlier than the start time by a preset duration. The preset duration may be any suitable length, for example 5 seconds; that is, according to the play instruction, playback switches to the picture 5 seconds before the start time of the corresponding scene.
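The entry point can be computed directly from the recorded start time of the scene, as in this small sketch with the 5-second lead used in the example:

    def entry_position(scene_start_s, lead_s=5.0):
        # Enter at the scene's start, pulled forward by a preset lead,
        # without running past the beginning of the media.
        return max(0.0, scene_start_s - lead_s)

    # entry_position(180.0) -> 175.0, i.e. 2 min 55 s for a scene starting at 3 min 0 s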
Referring also to FIG. 4C, the laugh mark 402 and the bird call mark 404 thereon can be directly used as play buttons, respectively. When a click operation acting on the laugh mark 402 is detected, a play instruction for the scene corresponding to the laughter can be triggered, and the laughter scene is played according to the play instruction. For example, when the time period corresponding to the laughter scene is 3 minutes 0 seconds to 3 minutes 8 seconds, playback may enter the picture at 3 minutes 0 seconds directly and continue from there, or enter the picture at 2 minutes 55 seconds.
In the above embodiment, a playing instruction for the scene corresponding to the type mark is received, and the playing picture of that scene is entered according to the instruction, so the scene marked by the type mark can be played quickly and accurately, further improving the flexibility of video playing.
In one embodiment, as shown in FIG. 6, step 506 includes:
step 602, obtaining the initial position of the audio content corresponding to the scene type in the audio information according to the playing instruction.
Step 604, determining the incoming playing picture according to the starting position.
The entered playing picture can be the playing picture at the start position, or a playing picture a preset duration earlier than the start position. The preset duration may be any suitable length, such as 5 seconds. For example, when the start position is 2 minutes 5 seconds, playback can enter the picture at 2 minutes 5 seconds according to the start position, or the picture 5 seconds earlier, i.e., at 2 minutes 0 seconds.
Step 606, obtaining the environment volume of the environment where the local computer is located; and determining the playing volume of the rich media according to the environment volume and the scene type.
Step 608, playing the entered playing picture according to the playing volume.
The ambient volume represents the magnitude of the real-time sound in the environment where the electronic device is located. When a playing instruction for the rich media is received from the user, the built-in voice acquisition device can be called to detect the ambient volume of the environment where the electronic equipment is located. The electronic equipment further presets a correspondence among ambient volume, scene type, and playing volume, where the playing volume represents a volume suitable for playing rich media of that scene type at that ambient volume. The playing volume corresponding to the scene type and the ambient volume is queried according to this correspondence, and the rich media is played at that volume. Alternatively, the playing volume may be offered to the user for selection, so that the user may choose to play the rich media at the determined volume. When the selection of the playing volume is received, the rich media is played at that volume, further improving the flexibility of playing the rich media.
In one embodiment, the corresponding relationship may be embodied by a play volume comparison table, that is, a corresponding play volume comparison table is preset in the electronic device, and the comparison table records the corresponding play volume of different scene types in different environmental volumes. The electronic device can directly inquire the playing volume corresponding to the scene type and the environment volume from the comparison table, and the speed for determining the playing volume can be increased.
In one embodiment, the electronic device may preset a volume calculation model of the playback volume, set quantized values corresponding to different scene types, use the quantized values and the environmental volume as inputs of the volume calculation model, and operate the volume calculation model, thereby outputting the calculated playback volume.
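Both variants, the comparison table and the calculation model, can be sketched as follows; every number here (table entries, quantized values, coefficients) is an illustrative assumption, since the text only says that such presets exist.

    # Variant 1: comparison table, (scene type, ambient band) -> playback volume in %.
    VOLUME_TABLE = {
        ("quiet", "quiet"): 30, ("quiet", "noisy"): 50,
        ("general", "quiet"): 35, ("general", "noisy"): 60,
        ("noisy", "quiet"): 25, ("noisy", "noisy"): 70,
    }

    # Variant 2: calculation model over quantized scene values and ambient volume.
    SCENE_QUANTIZED = {"quiet": 0.3, "general": 0.6, "noisy": 1.0}

    def playback_volume(scene_type, ambient_db):
        # A simple linear model; clamp the result to a valid volume range.
        q = SCENE_QUANTIZED.get(scene_type, 0.6)
        return max(0, min(100, int(40 * q + 0.5 * ambient_db)))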
Referring also to FIG. 4C, when it is determined that the scene types of the video include a bird call and laughter, after a play instruction triggered by a click operation on the play button 406 is detected, the ambient volume may be obtained, and the playing volume corresponding to that ambient volume and the bird call and laughter types may be determined, to prompt the user whether to play at that volume. When a selection to play at that volume is received, the rich media is played at the playing volume.
In one embodiment, as shown in FIG. 7, another rich media processing method is provided, the method comprising:
step 702, acquiring audio information in rich media; and performing audio content identification on the audio information.
Alternatively, the rich media may be a video received from a server, such as rich media forwarded by the server through a chat application or the like. The rich media may also be pre-stored on the electronic device. The electronic equipment can automatically initiate the following processing on the acquired rich media, or trigger it according to a received processing instruction for the rich media.
For example, after receiving a video sent by a friend through a chat application and finishing downloading, the electronic device may use the video as a rich media and automatically trigger a process of processing the video as described below.
The electronic device can perform audio extraction on the rich media to extract audio information in the rich media, and analyze the extracted audio information. Wherein, a preset audio extraction tool can be called to extract the audio information.
For the audio information, the electronic device may perform recognition according to a preset audio content recognition model: the audio information is used as the input of the model, and the model is run to obtain the sound content contained in the audio information, together with the position and duration of that sound content in the audio information.
Alternatively, one piece of audio information may contain a plurality of pieces of sound content, such as music, a bird call, or a gunshot. The electronic device may classify different sound contents in advance, and form a scene type according to the classification. For example, when the sound content is natural sounds such as wind sound and sea sound, the sound content can be divided into natural scene types; and dividing the sound content into animal sounds such as dog calls, cat calls and the like into scene types of the animal sounds. The electronic device can determine the scene type to which the corresponding rich media belongs for the identified sound content, as well as the number of sound contents, the proportion of each sound content occupied in the audio information, and the like. For example, when the sound content is only one type, the type to which the sound content belongs may be directly used as the scene type of the video; when the sound content comprises a plurality of sound contents, the occupied proportion of each sound content in the whole audio information can be further detected, and the type to which the sound content exceeding the preset proportion belongs is taken as the scene type of the video. The preset ratio may be any set suitable ratio, and may be 10%, for example.
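A sketch of that proportion rule, with the 10% figure from the example; the segment bookkeeping is an assumed representation:

    def scene_types_by_share(content_spans, total_s, min_share=0.10):
        # content_spans maps a sound content type to its (start_s, end_s) spans.
        types = []
        for content, spans in content_spans.items():
            covered = sum(end - start for start, end in spans)
            if covered / total_s > min_share:
                types.append(content)
        return types

    # scene_types_by_share({"music": [(0, 90)], "bird": [(10, 15)]}, 600.0)
    # -> ["music"]   (the bird call covers under 10% and is dropped)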
Step 704, obtain video frames in the rich media.
A video frame is a still picture constituting the playing picture of the video. The electronic device can further parse the rich media to obtain the video frames constituting the playing picture. All video frames in the rich media can be acquired, or only some of them. For example, one video frame may be extracted every preset number of frames. The preset number may be any suitable fixed number, or may be determined from the playing time of the video and the number of video frames: the more frames a video of the same duration has, the larger the interval can be. For example, if the video playing time is 10 minutes and the video has 6000 frames, one video frame may be extracted every 5 frames or every 8 frames.
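The sampling described here reduces to picking every Nth frame index, e.g.:

    def sampled_frame_indices(frame_count, step):
        # One frame every `step` frames, as in the 6000-frame example above.
        return list(range(0, frame_count, step))

    # len(sampled_frame_indices(6000, 5)) -> 1200 frames to analyse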
In an embodiment, the execution sequence between the step 702 and the step 704 is not limited, for example, the step 702 and the step 704 may be executed simultaneously, or the step 704 may be executed first and then the step 702 may be executed.
Step 706, judging the strength of the sound signal in the rich media according to the identified audio content; and/or determining a topic to which the sound in the rich media belongs based on the identified audio content.
And 708, dividing the rich media into scene types matched with the judgment result, and displaying the scene types.
Optionally, dividing the rich media into scene types corresponding to the strongest sound in the audio content; and/or partitioning rich media into scene types that match the subject matter.
In one embodiment, the scene type of the rich media may be determined jointly by the video frame and the audio content. The electronic equipment can perform combined analysis on each frame of continuously extracted video frames and adjacent audio segments corresponding to the video frames, determine the scene type of the video, and divide the rich media into the corresponding scene types. The electronic equipment can analyze the picture of each frame of video, identify the picture information at different moments in the rich media, and determine the scene type of the video by combining the audio at each moment. By jointly confirming the scene type from the video frame and the audio information, the accuracy of the scene type determination can be further improved.
For example, when a sea exists in the frame and the audio at the time corresponding to the frame also belongs to the sound of the sea, it can be determined that the scene type of the video includes the scene type corresponding to the sea and the sound of the sea.
In step 710, a type flag for marking a scene type is set for the rich media.
A type flag may further be set for the video according to the determined scene type of the rich media. For example, the type mark of the rich media can be recorded as "sea sound", and the type mark can be set on the preview screen of the rich media, or at any position on the interface while previewing the rich media. The user can thus know the scene type of the video through the type mark without playing the video, and form an initial judgment of the sound or pictures the video contains. For example, the type mark may be set at the upper right corner or the lower left corner during preview or playing, which reduces occlusion of the preview or playing picture.
Step 712, a play instruction triggered by the type flag is received. And entering a playing picture of the scene corresponding to the scene type according to the playing instruction, and playing.
Optionally, the electronic device may also set the type flag as a play button that may be triggered to play instructions. The play button may be presented before or during play of the rich media. And when the click operation of the button is received, triggering a playing instruction of the scene corresponding to the mark.
Optionally, the scene may be further determined from the video frames and the sound described above. When the audio clip corresponding to each type mark is detected in the audio information, the electronic device can acquire the video frames adjacent in time according to the position of the audio clip in the whole audio information, detect whether the content in those video frames matches the type mark, and determine from the detection result the starting video frame that matches the type mark. That starting video frame is taken as the starting picture of the scene corresponding to the type mark, and playback enters at that picture, or a preset number of pictures before it.
For example, when the type mark is "sea sound" and the corresponding audio clip spans 2 minutes 3 seconds to 3 minutes in the audio information, it is possible to detect whether the video frames located around 2 minutes 3 seconds match the sea sound, i.e., whether a picture of the sea exists in those frames, and then to find, among the adjacent frames, the frame where the sea first appears, using that frame as the start frame of the corresponding scene. If the start frame is located at 1 minute 58 seconds, playback may enter at 1 minute 58 seconds, or start from a slightly earlier frame, for example at 1 minute 55 seconds.
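A sketch of that backward search is given below; frame_matches is a hypothetical per-frame detector (for the sea example, "does this frame show the sea"), since the text does not name one.

    def scene_start_frame(clip_start_s, fps, frame_matches, search_back_s=10.0):
        # Walk backwards from the frame aligned with the audio clip's start
        # to the earliest nearby frame whose picture still matches the tag.
        idx = int(clip_start_s * fps)
        lo = max(0, idx - int(search_back_s * fps))
        earliest = idx
        for i in range(idx, lo - 1, -1):
            if frame_matches(i):
                earliest = i
            else:
                break
        return earliest

    # With fps=25 and a clip starting at 123 s (2 min 3 s), a run of matching
    # frames reaching back to 118 s yields the frame at 1 min 58 s.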
The scene corresponding to the type mark is determined by combining the audio information and the video frame, so that the scene corresponding to the mark type is more accurate, and the accuracy of entering the scene can be improved.
In one embodiment, as shown in FIG. 8, a rich media processing device is provided. The device includes:
and an audio information obtaining module 802, configured to obtain audio information in the rich media.
A scene information identification module 804, configured to determine scene information included in the rich media according to the audio information;
and the classification module 806 is configured to classify the rich media into scene types matched with the scene information, and display the scene types.
A playing module 808, configured to play the rich media matching the scene type in response to a user selection of the scene type.
In one embodiment, the scene information includes strength information of the sound and/or information of a subject to which the sound belongs.
The scene information identification module 804 is further configured to perform audio content identification on the audio information; judging the strength of the sound signal in the rich media according to the identified audio content; and/or determining a topic to which the sound in the rich media belongs based on the identified audio content.
The classification module 806 is further configured to classify the rich media into scene types matching the determination result.
In one embodiment, the classification module 806 is further configured to classify rich media into scene types corresponding to the strongest sounds in the audio content; and/or partitioning rich media into scene types that match the subject matter.
In one embodiment, as shown in fig. 9, another rich media processing apparatus is provided, the apparatus further comprising:
an audio file generating module 810, configured to extract an audio clip matching the scene information from the audio information; an audio file is formed from the audio clips.
The playing module 808 is also used for playing audio files.
In one embodiment, as shown in fig. 10, there is provided still another rich media processing device, further comprising:
a video file generation module 812 for performing video separation on the rich media; and forming a video file according to the separated video information.
The playing module 808 is also configured to play the video file.
In one embodiment, as shown in fig. 11, there is provided still another rich media processing apparatus, further comprising:
a type flag module 814, configured to set a type flag for marking a scene type for the rich media.
The playing module 808 is further configured to receive a playing instruction triggered by acting on the type flag; and entering a playing picture of the scene corresponding to the scene type according to the playing instruction, and playing.
In one embodiment, the playing module 808 is further configured to obtain a starting position of the audio content corresponding to the scene type in the audio information according to the playing instruction; determining an incoming playing picture according to the initial position; acquiring the environmental volume of the environment where the local computer is located; determining the playing volume of the rich media according to the environment volume and the scene type; and playing the entered playing picture according to the playing volume.
The division of the modules in the rich media processing device is only used for illustration, and in other embodiments, the rich media processing device may be divided into different modules as needed to complete all or part of the functions of the rich media processing device.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the rich media processing method provided by the above embodiments.
An electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and the processor implements the steps of the rich media processing method provided by the embodiments when executing the computer program.
The embodiment of the application also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the steps of the rich media processing method provided by the above embodiments.
The embodiment of the application also provides an electronic device. As shown in fig. 12, for convenience of explanation, only the portions related to the embodiments of the present application are shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiments of the present application. The electronic device may be any terminal device, including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, a vehicle-mounted computer, a wearable device, and the like. Here the electronic device is taken to be a mobile phone as an example:
fig. 12 is a block diagram of a partial structure of a mobile phone related to an electronic device provided in an embodiment of the present application. Referring to fig. 12, the cellular phone includes: radio Frequency (RF) circuit 1210, memory 1220, input unit 1230, display unit 1240, sensor 1250, audio circuit 1260, wireless fidelity (WiFi) module 1270, processor 1280, and power supply 1290. Those skilled in the art will appreciate that the handset configuration shown in fig. 12 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The RF circuit 1210 may be configured to receive and transmit signals during information transmission or a call; it may receive downlink information from a base station and pass it to the processor 1280 for processing, and may also transmit uplink data to the base station. Typically, the RF circuitry includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1210 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System for Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 1220 may be used to store software programs and modules, and the processor 1280 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1220. The memory 1220 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, application programs required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data or an address book), and the like. Further, the memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 1230 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone 1200. Specifically, the input unit 1230 may include a touch panel 1231 and other input devices 1232. The touch panel 1231, also referred to as a touch screen, may collect touch operations performed by the user on or near it (for example, operations performed with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. In one embodiment, the touch panel 1231 may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch-point coordinates, and sends the coordinates to the processor 1280; it can also receive and execute commands sent by the processor 1280. The touch panel 1231 may be implemented as a resistive, capacitive, infrared, or surface acoustic wave panel. The other input devices 1232 may include, but are not limited to, one or more of a physical keyboard and function keys (such as volume control keys and a power key).
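Purely to illustrate the detection-device-to-controller flow just described, here is a small sketch; every name and value in it is hypothetical.

    # Illustrative sketch of the touch pipeline; all names are hypothetical.
    from dataclasses import dataclass

    @dataclass
    class TouchPoint:
        x: int
        y: int

    class TouchController:
        """Converts a raw signal from the touch detection device into coordinates."""
        def to_coordinates(self, raw: dict) -> TouchPoint:
            # scale the sensor grid to screen pixels (assumed 4x scale)
            return TouchPoint(x=raw["col"] * 4, y=raw["row"] * 4)

    class Processor:
        """Receives touch-point coordinates and issues a command."""
        def handle(self, point: TouchPoint) -> str:
            return f"tap at ({point.x}, {point.y})"

    raw_signal = {"col": 30, "row": 120}  # produced by the touch detection device
    point = TouchController().to_coordinates(raw_signal)
    print(Processor().handle(point))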
The display unit 1240 may be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 1240 may include a display panel 1241. In one embodiment, the display panel 1241 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. In one embodiment, the touch panel 1231 may cover the display panel 1241; when the touch panel 1231 detects a touch operation on or near it, the operation is transmitted to the processor 1280 to determine the type of the touch event, and the processor 1280 then provides a corresponding visual output on the display panel 1241 according to that type. Although in fig. 12 the touch panel 1231 and the display panel 1241 are shown as two independent components implementing the input and output functions of the mobile phone, in some embodiments the two may be integrated to implement both functions.
The mobile phone 1200 may also include at least one sensor 1250, such as a light sensor, a motion sensor, or another sensor. Specifically, the light sensor may include an ambient light sensor and a proximity sensor: the ambient light sensor may adjust the brightness of the display panel 1241 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 1241 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an acceleration sensor can detect the magnitude of acceleration in each direction, can detect the magnitude and direction of gravity when the phone is stationary, and can be used in applications that recognize the phone's posture (such as switching between landscape and portrait), in vibration-recognition functions (such as a pedometer or tap detection), and the like. The mobile phone may further be provided with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor.
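As a hedged illustration of how an ambient light reading might drive panel brightness, the logarithmic mapping below is an assumption and not the patent's method.

    # Hypothetical mapping from ambient illuminance to display brightness.
    import math

    def brightness_from_lux(ambient_lux: float) -> float:
        """Return a brightness fraction in [0.05, 1.0] for a given illuminance."""
        # Perceived brightness tracks roughly the logarithm of illuminance,
        # so ramp from 1 lux (dark) up to about 10,000 lux (daylight).
        ramp = math.log10(max(ambient_lux, 1.0)) / 4.0
        return min(1.0, max(0.05, ramp))

    for lux in (0.5, 50, 500, 20_000):
        print(f"{lux:>8} lux -> brightness {brightness_from_lux(lux):.2f}")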
The audio circuit 1260, the speaker 1261, and the microphone 1262 may provide an audio interface between the user and the mobile phone. The audio circuit 1260 may transmit an electrical signal, converted from received audio data, to the speaker 1261, which converts it into a sound signal for output. Conversely, the microphone 1262 converts a collected sound signal into an electrical signal, which the audio circuit 1260 receives and converts into audio data; the audio data is then output to the processor 1280 for processing and subsequently transmitted to another mobile phone via the RF circuit 1210, or output to the memory 1220 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1270, the mobile phone can help the user receive and send e-mail, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 12 shows the WiFi module 1270, it is understood that it is not an essential component of the mobile phone 1200 and may be omitted as needed.
The processor 1280 is the control center of the mobile phone. It connects the various parts of the entire phone through various interfaces and lines, and performs the phone's functions and processes data by running or executing the software programs and/or modules stored in the memory 1220 and calling the data stored in the memory 1220, thereby monitoring the phone as a whole. In one embodiment, the processor 1280 may include one or more processing units. In one embodiment, the processor 1280 may integrate an application processor, which mainly handles the operating system, user interfaces, and applications, and a modem processor, which mainly handles wireless communication. The modem processor may also not be integrated into the processor 1280.
The mobile phone 1200 further includes a power supply 1290 (such as a battery) that supplies power to the various components. Preferably, the power supply is logically connected to the processor 1280 through a power management system, so that charging, discharging, and power-consumption management are handled through the power management system.
In one embodiment, the mobile phone 1200 may also include a camera, a Bluetooth module, and the like.
In the embodiment of the present application, the processor 1280 included in the electronic device implements the steps of the rich media processing method described above when executing the computer program stored in the memory.
Any reference to memory, storage, a database, or another medium used herein may include non-volatile and/or volatile memory. Suitable non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM), and Rambus Dynamic RAM (RDRAM).
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the application. It should be noted that a person skilled in the art may make several variations and improvements without departing from the concept of the present application, and all of these fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (9)

1. A rich media processing method, comprising:
acquiring audio information in the rich media;
determining scene information contained in the rich media according to the audio information;
dividing the rich media into scene types matching the scene information, and displaying the scene types;
extracting audio clips matching the scene information from the audio information, and forming an audio file from the audio clips;
and in response to the user's selection of a scene type, playing the audio file matching the scene type.
2. The method according to claim 1, wherein the scene information includes intensity information of a sound and/or information of a subject to which the sound belongs, and the determining scene information contained in the rich media according to the audio information comprises:
performing audio content identification on the audio information;
judging the strength of the sound signal in the rich media according to the identified audio content; and/or
determining the subject to which a sound in the rich media belongs according to the identified audio content;
and the dividing the rich media into scene types matching the scene information comprises:
dividing the rich media into scene types matching the judgment result.
3. The method according to claim 2, wherein the dividing the rich media into scene types matching the judgment result comprises:
dividing the rich media into a scene type corresponding to the sound with the strongest signal in the audio content; and/or
dividing the rich media into a scene type matching the subject.
4. The method according to claim 1, further comprising, after the dividing the rich media into scene types matching the scene information and displaying the scene types:
performing video separation on the rich media; and
forming a video file from the separated video information;
wherein the playing the audio file matching the scene type comprises: playing the video file.
5. The method according to any one of claims 1 to 4, further comprising, before the playing the audio file matching the scene type in response to the user's selection of the scene type:
setting, for the rich media, a type mark for marking the scene type;
wherein the playing the audio file matching the scene type in response to the user's selection of the scene type comprises:
receiving a playing instruction triggered by an operation on the type mark;
and entering a playing screen of the scene corresponding to the scene type according to the playing instruction, and playing.
6. The method according to claim 5, wherein the entering a playing screen of the scene corresponding to the scene type according to the playing instruction, and playing, comprises:
acquiring, according to the playing instruction, the starting position in the audio information of the audio content corresponding to the scene type;
determining the playing screen to enter according to the starting position;
acquiring the ambient volume of the environment where the device is located;
determining the playing volume of the rich media according to the ambient volume and the scene type;
and playing the entered playing screen at the playing volume.
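For illustration only, and forming no part of the claims: a minimal end-to-end sketch of the claimed flow, in which every function name and the strongest-signal heuristic are assumptions.

    # Non-claim illustration; all names and the heuristic are assumptions.
    def acquire_audio_information(rich_media: dict) -> list:
        return rich_media["audio"]  # list of (start_ms, label, signal_level)

    def determine_scene_information(audio: list) -> str:
        # pick the label of the strongest sound signal, echoing claims 2-3
        return max(audio, key=lambda clip: clip[2])[1]

    def extract_matching_clips(audio: list, scene_type: str) -> list:
        return [clip for clip in audio if clip[1] == scene_type]

    def play_on_selection(clips: list, scene_type: str) -> None:
        print(f"user selected '{scene_type}': playing from {clips[0][0]} ms")

    media = {"audio": [(0, "speech", 0.4), (93_000, "concert", 0.9)]}
    audio = acquire_audio_information(media)
    scene_type = determine_scene_information(audio)  # scene type shown to the user
    play_on_selection(extract_matching_clips(audio, scene_type), scene_type)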
7. A rich media processing apparatus, the apparatus comprising:
the audio information acquisition module is used for acquiring audio information in the rich media;
the scene information identification module is used for determining scene information contained in the rich media according to the audio information;
the classification module is used for dividing the rich media into scene types matching the scene information and displaying the scene types;
the audio file generation module is used for extracting audio clips matching the scene information from the audio information and forming an audio file from the audio clips;
and the playing module is used for playing, in response to the user's selection of a scene type, the audio file matching the scene type.
8. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 6 are implemented when the computer program is executed by the processor.
CN201711332691.5A 2017-12-13 2017-12-13 Rich media processing method and device, storage medium and electronic equipment Active CN107948729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711332691.5A CN107948729B (en) 2017-12-13 2017-12-13 Rich media processing method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN107948729A CN107948729A (en) 2018-04-20
CN107948729B (en) 2020-03-27

Family

ID=61942996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711332691.5A Active CN107948729B (en) 2017-12-13 2017-12-13 Rich media processing method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN107948729B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492126B (en) * 2018-11-02 2022-03-01 廊坊市森淼春食用菌有限公司 Intelligent interaction method and device
WO2020107169A1 (en) * 2018-11-26 2020-06-04 深圳市欢太科技有限公司 Audio mode correction method and apparatus, and electronic device
CN109688475B (en) * 2018-12-29 2020-10-02 深圳Tcl新技术有限公司 Video playing skipping method and system and computer readable storage medium
CN113392238A (en) * 2020-03-13 2021-09-14 北京字节跳动网络技术有限公司 Media file processing method and device, computer readable medium and electronic equipment
CN113810783B (en) * 2020-06-15 2023-08-25 腾讯科技(深圳)有限公司 Rich media file processing method and device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101547346A (en) * 2008-03-24 2009-09-30 展讯通信(上海)有限公司 Method and device for receiving and transmitting description of scene in rich media TV
CN102163201A (en) * 2010-02-24 2011-08-24 腾讯科技(深圳)有限公司 Multimedia file segmentation method, device thereof and code converter
CN104135705A (en) * 2014-06-24 2014-11-05 惠州Tcl移动通信有限公司 Method and system for automatically adjusting multimedia volume according to different scene modes
CN104320670A (en) * 2014-11-17 2015-01-28 东方网力科技股份有限公司 Summary information extracting method and system for network video
CN104469487A (en) * 2014-12-31 2015-03-25 合一网络技术(北京)有限公司 Detection method and device for scene switching points
CN107392666A (en) * 2017-07-24 2017-11-24 北京奇艺世纪科技有限公司 Advertisement data processing method, device and advertisement placement method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160085762A1 (en) * 2014-09-23 2016-03-24 Smoothweb Technologies Ltd. Multi-Scene Rich Media Content Rendering System

Similar Documents

Publication Publication Date Title
CN107948729B (en) Rich media processing method and device, storage medium and electronic equipment
CN109819179B (en) Video editing method and device
US11355157B2 (en) Special effect synchronization method and apparatus, and mobile terminal
CN108496150B (en) Screen capture and reading method and terminal
CN108334608B (en) Link generation method and device of application page, storage medium and electronic equipment
CN105845124B (en) Audio processing method and device
CN109309751B (en) Voice recording method, electronic device and storage medium
CN106131627B (en) A kind of method for processing video frequency, apparatus and system
CN111131884B (en) Video clipping method, related device, equipment and storage medium
CN108062390B (en) Method and device for recommending user and readable storage medium
CN109660728B (en) Photographing method and device
CN109257498B (en) Sound processing method and mobile terminal
CN109257649B (en) Multimedia file generation method and terminal equipment
WO2019042049A1 (en) Image processing method and mobile terminal
CN108763475B (en) Recording method, recording device and terminal equipment
CN110958485A (en) Video playing method, electronic equipment and computer readable storage medium
CN111491205B (en) Video processing method and device and electronic equipment
CN107220059A (en) The display methods and device of application interface
CN106471493B (en) Method and apparatus for managing data
CN106777204B (en) Picture data processing method and device and mobile terminal
CN110750198A (en) Expression sending method and mobile terminal
CN109240486B (en) Pop-up message processing method, device, equipment and storage medium
CN112131438A (en) Information generation method, information display method and device
CN112287317B (en) User information input method and electronic equipment
CN110784762B (en) Video data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 18 Wusha Beach Road, Chang'an Town, Dongguan 523860, Guangdong Province

Applicant after: OPPO Guangdong Mobile Communications Co., Ltd.

Address before: No. 18 Wusha Beach Road, Chang'an Town, Dongguan 523860, Guangdong Province

Applicant before: Guangdong OPPO Mobile Communications Co., Ltd.

GR01 Patent grant