WO2021004362A1 - Audio data processing method and apparatus, and electronic device - Google Patents

Audio data processing method and apparatus, and electronic device

Info

Publication number
WO2021004362A1
WO2021004362A1 (PCT/CN2020/099864; CN2020099864W)
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
data
audio
main
feedback data
Prior art date
Application number
PCT/CN2020/099864
Other languages
French (fr)
Chinese (zh)
Inventor
贾锦杰 (Jia Jinjie)
廖多依 (Liao Duoyi)
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Publication of WO2021004362A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 Querying
    • G06F 16/432 Query formulation
    • G06F 16/433 Query formulation using audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 Querying
    • G06F 16/432 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/43 Querying
    • G06F 16/435 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/437 Administration of user profiles, e.g. generation, initialisation, adaptation, distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/45 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/40 Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F 16/48 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/483 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/165 Management of the audio stream, e.g. setting of volume, audio stream path
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Definitions

  • This application relates to the field of Internet technology, and more specifically, to an audio data processing method and apparatus, an electronic device, and a computer-readable storage medium.
  • An objective of the embodiments of the present application is to provide a new technical solution for processing audio data.
  • an audio data processing method which includes:
  • the audio feedback data and the main audio data are combined to generate combined audio data for playback.
  • the combining the audio feedback data with the main audio data includes:
  • the audio feedback data generated in the set play period is merged with the main audio data.
  • the combining the audio feedback data with the main audio data includes:
  • the audio feedback data is combined with the main audio data through audio track synthesis.
  • the acquiring audio feedback data generated during the playback of the main audio data includes:
  • the generating of the combined audio data for playback includes:
  • the combined audio data is generated for playback by terminal devices that meet the target classification.
  • the method further includes:
  • the target classification corresponding to the terminal device is determined.
  • the set user characteristics include set characteristics corresponding to audio feedback data generated by the user of the terminal device during the playback process of the main audio data.
  • the main audio data is audio data of a video file
  • the method further includes:
  • the audio waveform representing the audio feedback data is displayed in the form of a bullet screen.
  • the acquiring audio feedback data generated during the playback of the main audio data includes:
  • the text comment is converted into corresponding audio data, and at least the converted audio data is used as the audio feedback data.
  • the acquiring audio feedback data generated during the playback of the main audio data includes:
  • the expression feature is converted into corresponding audio data, and at least the converted audio data is used as the audio feedback data.
  • the main audio data is audio data of a live media file.
  • the method further includes:
  • a method for processing audio data which is implemented by a terminal device, and the method includes:
  • the live audio data further includes audio feedback data that the user corresponding to the terminal device feeds back on the main audio data.
  • a method for processing audio data which is implemented by a terminal device, and the method includes:
  • in response to an operation of playing the target media file, playing the target media file, where the target media file includes main audio data;
  • the acquiring live audio data corresponding to the main audio data includes:
  • the method further includes:
  • an audio data processing device including:
  • a data acquisition module for acquiring audio feedback data generated during the playback of the main audio data; and
  • an audio processing module for combining the audio feedback data with the main audio data to generate the combined audio data for playback.
  • an electronic device including the processing device according to the fourth aspect of the present application; or, including:
  • a memory used to store executable instructions;
  • a processor configured to control the electronic device, according to the executable instructions, to execute the processing method according to the first, second, or third aspect of the present application.
  • the electronic device is a terminal device without a display device.
  • the electronic device is a terminal device
  • the terminal device further includes an input device for the corresponding user to input feedback content for the main audio data
  • the feedback content is sent to the processing device or the processor, so that the processing device or the processor generates audio feedback data of the corresponding user for the main audio data according to the feedback content.
  • the electronic device is a terminal device, and the terminal device further includes an audio output device, and the audio output device is configured, under the control of the processing device or the processor, to play the corresponding audio feedback data while playing the main audio data.
  • a computer-readable storage medium stores a computer program that can be read and executed by a computer; when the computer program is run by the computer, it executes the processing method according to the first, second, or third aspect of the present application.
  • the audio data processing method of this embodiment combines the main audio data with the audio feedback data generated during its playback, so that any terminal device, while playing the main audio data, can also play the audio feedback data from other users. In this way, any user listening to the main audio data through a personal terminal device can experience the effect of enjoying it together with other users who post comments while listening, and thus obtain an on-site experience.
  • Fig. 1a is a schematic diagram of an application scenario illustrating the effects of an embodiment of the present application;
  • Fig. 1b is a hardware configuration diagram of an alternative data processing system that can be used to implement the audio data processing method of the embodiment of the present application;
  • Fig. 2 is a schematic flowchart of a processing method according to an embodiment of the present application;
  • Fig. 3 is a schematic diagram of an example of guiding the user to input audio feedback data in the play window of the target media file;
  • Fig. 4 is a schematic diagram of inserting audio feedback data into adjacent free gaps of the main audio data during audio mixing;
  • Fig. 5 is a schematic diagram of an example of guiding the user to input an instruction to turn on the live sound effect function;
  • Fig. 6a is an interactive schematic diagram of a processing method according to an example of the present application;
  • Fig. 6b is an interactive schematic diagram of a processing method according to another example of the present application;
  • Fig. 7 is a schematic flowchart of a processing method according to another embodiment of the present application;
  • Fig. 8 is a schematic flowchart of a processing method according to a third embodiment of the present application;
  • Fig. 9 is a schematic functional block diagram of an audio data processing device according to an embodiment of the present application;
  • Fig. 10a is a schematic block diagram of an electronic device according to an embodiment of the present application;
  • Fig. 10b is a schematic block diagram of an electronic device according to another embodiment of the present application.
  • media files have become the main medium of information transmission.
  • people can not only choose to enjoy the content of media files together with others in the place where the media files are played, but can also use their own terminal devices to independently enjoy the content of the media files in various places.
  • the above media files can be video files that contain audio data and image data.
  • the terminal equipment that supports the playback of video files requires a display device and an audio output device.
  • the above media files can also be pure audio files that only contain audio data; a terminal device supporting the playback of pure audio files requires an audio output device but may not have a display device, for example, a smart speaker.
  • everyone can feel the various voice feedbacks of other people on the media files.
  • these voice feedbacks include feedback of language comments and, for example, feedback of expression characteristics such as happiness, sighs, sadness, and silence, so that people can get a rich and three-dimensional sensory experience on the spot.
  • when a user enjoys the content of a media file through a personal terminal device, the embodiments of the present application enable at least the audio feedback data of others for the same media file to be combined with the main audio data of the media file and played together, so as to obtain a sensory experience equivalent to the live mode.
  • An application scenario is shown in Fig. 1a, where user A, user B, user C, and user D enjoy the content of the same media file in different spaces through their respective terminal devices 1200 at the same time or at different times.
  • User A, User B, and User C all posted language comments during the same set play time period. Due to the spatial separation, each user cannot actually perceive the sound feedback of other users on the media file.
  • with the embodiments of the present application, the audio feedback data of other users for the same media file can be combined and played together with the main audio data of the media file, so that each user can experience the sound feedback of the other users on the media file; this is equivalent to the on-site effect of user A, user B, user C, and user D enjoying the media file through the same terminal device in the same place, as shown in the lower part of Fig. 1a.
  • Fig. 1b is a schematic diagram of the structure of a data processing system to which the audio data processing method according to an embodiment of the present application can be applied.
  • the data processing system 1000 of this embodiment includes a server 1100, a terminal device 1200, and a network 1300.
  • the server 1100 may be, for example, a blade server, a rack server, etc.
  • the server 1100 may also be a server cluster deployed in the cloud, which is not limited here.
  • the server 1100 may include a processor 1110, a memory 1120, an interface device 1130, a communication device 1140, a display device 1150, and an input device 1160.
  • the processor 1110 may be, for example, a central processing unit (CPU) or the like.
  • the memory 1120 includes, for example, ROM (Read Only Memory), RAM (Random Access Memory), nonvolatile memory such as a hard disk, and the like.
  • the interface device 1130 includes, for example, a USB interface, a serial interface, and the like.
  • the communication device 1140 can perform wired or wireless communication, for example.
  • the display device 1150 is, for example, a liquid crystal display.
  • the input device 1160 may include, for example, a touch screen, a keyboard, and the like.
  • the server 1100 can be used to participate in implementing the data processing method according to any embodiment of the present application.
  • the memory 1120 of the server 1100 is used to store instructions, and the instructions are used to control the processor 1110 to operate to support the implementation of the processing method according to any embodiment of the present application.
  • those skilled in the art can design the instructions according to the solution disclosed in this application. How the instructions control the processor to operate is well known in the art, so it will not be described in detail here.
  • the server 1100 in the embodiment of the present application may involve only some of these devices, for example, only the processor 1110 and the memory 1120.
  • the terminal device 1200 may include a processor 1210, a memory 1220, an interface device 1230, a communication device 1240, a display device 1250, an input device 1260, an audio output device 1270, an audio pickup device 1280, and so on.
  • the processor 1210 may be a central processing unit (CPU), a microcontroller unit (MCU), or the like.
  • the memory 1220 includes, for example, ROM (Read Only Memory), RAM (Random Access Memory), nonvolatile memory such as a hard disk, and the like.
  • the interface device 1230 includes, for example, a USB interface, a headphone interface, and the like.
  • the communication device 1240 can perform wired or wireless communication, for example.
  • the display device 1250 is, for example, a liquid crystal display, a touch display, or the like.
  • the input device 1260 may include, for example, a touch screen, a keyboard, and the like.
  • the terminal device 1200 may output audio information through an audio output device 1270, which includes, for example, a speaker.
  • the terminal device 1200 may pick up the voice information input by the user through an audio pickup device 1280, which includes, for example, a microphone.
  • the terminal device 1200 may be a smart phone, a portable computer, a desktop computer, a tablet computer, a wearable device, a smart speaker, a set-top box, a smart TV, a voice recorder, a camcorder, etc. The terminal device 1200 may have a built-in audio output device 1270 to play media files, or may be connected to an external audio output device 1270 to play media files.
  • the terminal device 1200 can be used to participate in implementing the data processing method according to any embodiment of the present application.
  • the memory 1220 of the terminal device 1200 is used to store instructions, and the instructions are used to control the processor 1210 to operate to support the implementation of the processing method according to any embodiment of the present application.
  • those skilled in the art can design the instructions according to the solution disclosed in this application. How the instructions control the processor to operate is well known in the art, so it will not be described in detail here.
  • the terminal device 1200 in the embodiment of the present application may involve only some of these devices, for example, only the processor 1210 and the memory 1220.
  • the network 1300 may be a wireless network or a wired network, and may be a local area network or a wide area network.
  • the terminal device 1200 may communicate with the server 1100 through the network 1300.
  • the data processing system 1000 shown in FIG. 1b is only for explanatory purposes, and is by no means intended to limit the application, its applications or uses.
  • although Fig. 1b shows only one server 1100 and one terminal device 1200, this is not meant to limit their respective numbers.
  • the data processing system 1000 may include multiple servers 1100 and/or multiple terminal devices 1200.
  • the audio data processing method may be implemented by the server 1100 as required, may also be implemented by the terminal device 1200, or jointly implemented by the server 1100 and the terminal device 1200, which is not limited herein.
  • Fig. 2 is a schematic flowchart of a method for processing audio data according to an embodiment of the present application.
  • the processing method of this embodiment may include the following steps S2100 to S2200.
  • Step S2100: Obtain audio feedback data generated during the playback of the main audio data.
  • the main audio data is the audio data of the target media file to be played.
  • the target media file can be a pure audio file or a video file.
  • the target media file can be a live broadcast file or a recorded broadcast file, which is not limited here.
  • in step S2100, all the audio feedback data generated during the playback of the main audio data may be acquired and combined in the following step S2200; alternatively, only the part of the audio feedback data generated during the playback of the main audio data that meets set conditions may be acquired and merged in the following step S2200, which is not limited here.
  • any feedback content published by any user during the playback of the main audio data corresponds to one piece of audio feedback data.
  • the content of any one feedback may be a voice comment, in which case the voice comment itself forms a piece of audio feedback data;
  • the content of any one feedback may also be a text comment, in which case the text comment can be converted into corresponding audio data, and the converted audio data forms a piece of audio feedback data;
  • the content of any one feedback may also be an input expression feature, for example an input emoji or a voice expression, in which case the expression feature can be converted into corresponding audio data, which likewise forms a piece of audio feedback data.
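The three feedback types above can be sketched as a small dispatch routine. This is an illustrative sketch only: the `AudioFeedback` structure, the stand-in `text_to_speech` and `expression_to_audio` converters, and their placeholder outputs are assumptions for illustration, not part of this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AudioFeedback:
    user_id: str
    play_time_s: float    # playback position of the main audio when generated
    samples: List[float]  # PCM samples of the feedback audio

def text_to_speech(text: str) -> List[float]:
    """Stand-in TTS: a real system would call a speech synthesizer here."""
    return [0.0] * (len(text) * 10)  # placeholder samples

def expression_to_audio(expression: str) -> List[float]:
    """Stand-in lookup of a pre-recorded sound for an expression feature."""
    return [0.0] * 100  # placeholder samples

def make_feedback(user_id, play_time_s, kind, payload):
    """Turn one piece of feedback content into one piece of audio feedback data."""
    if kind == "voice":          # voice comment: already audio
        samples = payload
    elif kind == "text":         # text comment: convert via TTS
        samples = text_to_speech(payload)
    elif kind == "expression":   # emoji / voice expression: convert to audio
        samples = expression_to_audio(payload)
    else:
        raise ValueError(f"unknown feedback kind: {kind}")
    return AudioFeedback(user_id, play_time_s, samples)
```

Each branch ends in the same `AudioFeedback` shape, so the later merging step can treat all three feedback types uniformly.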
  • obtaining audio feedback data generated during the playback of the main audio data in step S2100 may include: obtaining voice comments fed back during the playback of the main audio data, and at least using the voice comments as audio feedback data generated during the playback of the main audio data.
  • users can input voice comments through their respective terminal devices.
  • an entry for guiding users to input voice comments (for example: press and hold to send voice comments) can be provided in the play window of the target media file to be played.
  • the terminal device can collect the voice comment through an audio input device such as a microphone to form audio feedback data.
  • the user can also post a voice comment by operating physical buttons provided on the terminal device, which is not limited here.
  • obtaining audio feedback data generated during the playback of the main audio data in step S2100 may also include: obtaining text comments fed back during the playback of the main audio data; converting the text comments into corresponding audio data; and at least using the converted audio data as audio feedback data generated during the playback of the main audio data.
  • when converting a text comment, the user’s voice feature can be obtained according to the corresponding user’s voice collected in advance, and then the text comment is converted based on the voice feature, so that the converted audio data reflects the voice characteristics of the user.
  • the text comment can also be converted based on the default voice feature.
  • the default voice feature can be a voice feature set by the system or a voice feature selected by the user, which is not limited here.
  • the emotional characteristics expressed by the text comment can also be recognized, so that the converted audio data reflects the emotional characteristics intended by the text comment.
  • the user can input the content of the text comment through the physical keyboard, virtual keyboard or touch screen provided by the terminal device, and can also post the text comment by simply selecting the preset text content provided by the terminal device.
  • obtaining the audio feedback data generated during the playback of the main audio data in step S2100 may also include: obtaining the expression characteristics fed back during the playback of the main audio data; converting the expression characteristics into corresponding audio data; and at least using the converted audio data as audio feedback data generated during the playback of the main audio data.
  • the expression features can be pre-stored in the terminal device, and the user can perform emotional feedback on the main audio data being played by selecting the expression features that can express their emotions.
  • the expression features may include symbolic expressions and voice expressions.
  • voice expressions may include spoken-voice expressions and/or sound-effect expressions.
  • Symbolic expressions are symbols, static pictures or dynamic pictures, etc. that express emotions or themes, and are used for users to choose to express their emotions or feelings in the process of voice communication.
  • the corresponding audio data can be converted according to the emotions or feelings expressed by the symbolic expressions.
  • Voice expressions are voice content expressing specific emotions or topics, and are used for users to choose to express their emotions or feelings in the process of voice communication.
  • for voice expression conversion, the voice content in the voice expression can be directly extracted as the converted audio data.
  • the voice content of the voice expression is the voice corresponding to the emotion or theme expressed by the voice expression, and is the voice expression with language content.
  • the voice content of the voice expression may be recorded by a specific person, such as a celebrity or a voice actor, according to a preset theme or content, or may be recorded by the user according to his or her own emotional expression needs.
  • the sound content of the sound effect expression is the sound effect corresponding to the emotional feature of the sound effect expression, and is the sound expression without language content. Users usually expect to express their feelings or emotions through the sound effects generated when the sound expression is played.
  • the sound content of the sound expression can be recorded for various preset themes or emotional expression needs.
  • this step S2100 may obtain all the audio feedback data accumulated for the corresponding playback period, or may obtain only the audio feedback data for the playback period that was generated within a set time.
  • for example, step S2100 may obtain all the audio feedback data accumulated for the first play period, or may obtain only the audio feedback data generated on the current day for the second play period, etc.; there is no limitation here.
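The two acquisition strategies described above, all accumulated feedback for a play period versus only feedback generated after a cutoff time, can be sketched as a single filter. The tuple layout `(period_index, generated_epoch_s, content)` is an assumption for illustration.

```python
# Illustrative filter for step S2100: select the feedback for one play period,
# optionally restricted to feedback generated at or after a cutoff timestamp.
def feedback_for_period(all_feedback, period_index, since_epoch_s=None):
    """all_feedback: list of (period_index, generated_epoch_s, content) tuples."""
    return [
        fb for fb in all_feedback
        if fb[0] == period_index
        and (since_epoch_s is None or fb[1] >= since_epoch_s)
    ]
```

Passing `since_epoch_s=None` corresponds to the accumulated case; passing, say, the start of the current day restricts the merge to recent feedback only.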
  • when the target media file is a live media file, the audio feedback data generated during any playback period is the audio feedback data generated during the playback time corresponding to that playback period. Therefore, for a live media file, the combined audio data will be able to reflect the live effect during the broadcast.
  • this step S2100 may be implemented by a server, such as the server 1100 in Fig. 1b.
  • obtaining audio feedback data generated during the playback of the main audio data in step S2100 may include: obtaining audio feedback data generated by the corresponding user during the playback of the main audio data from each terminal device.
  • this step S2100 may also be implemented by a terminal device, such as the terminal device 1200 in Fig. 1b.
  • in this case, obtaining audio feedback data generated during the playback of the main audio data in step S2100 may include: obtaining, from the server, audio feedback data generated by other users during the playback of the main audio data.
  • Step S2200: Combine the acquired audio feedback data with the main audio data to generate a combined audio file for playback.
  • the combination in this embodiment may use any existing mixing means to mix the audio feedback data with the main audio data to form an audio file mixed with the audio feedback data.
  • the merging in this embodiment may also refer to establishing a temporal correspondence between the audio feedback data and the main audio data to form an audio file embodying the mapping relationship, so that at least the main audio data and the audio feedback data can be played through different channels to achieve the effect of "mixing" for the user.
  • all the audio feedback data can be mixed to occupy one channel, or all the audio feedback data can be processed into multiple audio files occupying multiple channels; this is not limited here, as long as the user can feel the "mixing" effect of the audio feedback data being played together with the main audio data.
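A single-channel variant of the mixing described here can be sketched as summing scaled feedback samples into the main track. The gain value and the sample format (floats in [-1.0, 1.0]) are assumptions for illustration; a real implementation would use any existing mixing means.

```python
def mix(main, feedbacks, feedback_gain=0.5):
    """Mix feedback clips into a copy of the main audio samples.

    main:      list of float samples in [-1.0, 1.0]
    feedbacks: list of (offset_samples, samples) pairs
    Each feedback sample is scaled by feedback_gain, added to the main
    track at its offset, and the result is clipped to [-1.0, 1.0].
    """
    out = list(main)
    for offset, samples in feedbacks:
        for i, s in enumerate(samples):
            j = offset + i
            if j >= len(out):
                break  # drop feedback that runs past the end of the main audio
            out[j] = max(-1.0, min(1.0, out[j] + feedback_gain * s))
    return out
```

The same routine could be run once per channel to produce the multi-channel variant mentioned above.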
  • the merging process of this step can be performed continuously as the target media file is played and audio feedback data is continuously generated, so that the target media file continues to be played based on the continuously generated merged audio file until the playback ends.
  • this step S2200 can be implemented by a server or a terminal device, which is not limited here.
  • combining the acquired audio feedback data with the main audio data in this step S2200 may include: merging each piece of audio feedback data with the main audio data according to the playback period of the main audio data that the audio feedback data corresponded to when it was generated.
  • the play period of the main audio data is divided based on the relative play time of the target media file, where the relative reference point of the relative play time is the start play point of the target media file. For example, playing 0-5 minutes is the first playing period, playing 5-10 minutes is the second playing period, and so on.
  • the length of the playback period can be set according to needs, and the length can be fixed and can also be adjusted adaptively.
  • for example, when the length of the set playing period is 5 minutes, the playing period of the main audio data corresponding to a piece of audio feedback data at the time it is generated may be the 5-10 minute playing period of the target media file;
  • when the length of the set playing period is 2 minutes, the playing period of the main audio data corresponding to the audio feedback data when it is generated may be the 1-2 minute playing period of the target media file.
  • when each terminal device obtains the audio feedback data generated by the corresponding user, it can record the generation time of the audio feedback data and the corresponding playback period.
  • the starting position of each audio feedback data can be set to be aligned with the starting position of the corresponding main audio data play period.
  • the start position of each audio feedback data can be allowed to lag behind the corresponding play period of the main audio data.
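The period bucketing and the two alignment options above can be sketched as follows. The 300-second default mirrors the 5-minute example; the function names are illustrative.

```python
def play_period(play_time_s, period_len_s=300):
    """Index of the playing period (relative to the file's start point) in
    which a piece of audio feedback was generated; 300 s = 5-minute periods."""
    return int(play_time_s // period_len_s)

def merged_start(play_time_s, period_len_s=300, align_to_period_start=True):
    """Start position of the feedback in the merged timeline: either snapped
    to the start of its playing period, or kept lagging at its original
    generation time within that period."""
    if align_to_period_start:
        return play_period(play_time_s, period_len_s) * period_len_s
    return play_time_s
```

With 5-minute periods, feedback generated at the 6th minute of playback (360 s) falls in period 1 (the 5-10 minute period) and is either snapped to 300 s or left at 360 s.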
  • when the user plays the merged audio data through a personal terminal device, the user can feel that the audio feedback of all users changes with the playback of the main audio data, including changes in the number of feedbacks and/or changes in the feedback content, etc., providing a more realistic on-site experience.
  • combining the acquired audio feedback data with the main audio data in this step S2200 may include the following steps S2211 to S2213:
  • Step S2211: Obtain the number of pieces of audio feedback data generated within the set playing period of the main audio data.
  • the set play period can be preset according to real-time requirements. For example, if quantity statistics are performed every 5 minutes, the divided set periods include: the first playing period of 0 to 5 minutes, the playing period of 5 to 10 minutes, ..., and so on.
  • Step S2212: Determine the corresponding merging effect according to the number, where the merging effect at least reflects the volume ratio of each piece of data participating in the merging.
  • the mapping data representing the correspondence between the quantity and the merging effect may be pre-stored, so as to find the merging effect corresponding to the quantity obtained in step S2211 in the mapping data.
  • the combined effect includes a living room scene effect, a theater scene effect, a square scene effect, and so on.
  • in terms of the number of audience members corresponding to each scene: the living room scene is smaller than the theater scene, which in turn is smaller than the square scene.
  • in terms of the volume ratio of the audio feedback data to the main audio data reflected by each scene effect: the living room scene is smaller than the theater scene, which in turn is smaller than the square scene.
  • the corresponding number is, for example, less than or equal to 20 people.
  • each user in such a scene can clearly hear the audio feedback of the other users. Therefore, the volume ratio reflected by the living room scene effect can be set as follows: after merging, while listening to the content of the main audio data, the content of each audio feedback data participating in the merging can also be heard.
  • the corresponding number is, for example, greater than 20 people and less than or equal to 200 people.
  • the various audio feedback in the scene can only be vaguely heard. Therefore, the volume ratio reflected by the theater scene effect can be set as follows: after merging, while listening to the content of the main audio data, the content of each audio feedback data participating in the merging can only be heard vaguely.
  • the corresponding number is, for example, more than 200 people.
  • the audio feedback in the scene is not individually audible, and only a general noise can be heard. Therefore, the volume ratio reflected by the square scene effect can be set as follows: after merging, only the content of the main audio data can be heard clearly, while the audio feedback data participating in the merging is perceived only as the noise of many people giving audio feedback at once.
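  • As an illustration of the count-to-effect mapping above, the following sketch (not from the patent; the thresholds and ratio values are assumptions built on the 20-person and 200-person examples) selects a merging effect together with an aggregate feedback-to-main volume ratio, and derives a per-clip gain so that the aggregate ratio grows from living room to square while each individual voice's share shrinks, which is one possible way to reconcile the per-scene volume ratio with the per-voice audibility:

```python
def select_scene_effect(feedback_count):
    """Map the number of audio feedback items in a set playing period to a
    merging effect. The second value is the assumed aggregate volume ratio
    of all feedback relative to the main audio (illustrative numbers)."""
    if feedback_count <= 20:
        return "living_room", 0.3   # few voices, each clearly audible
    if feedback_count <= 200:
        return "theater", 0.6       # voices only vaguely audible
    return "square", 1.0            # only the noise of the crowd

def per_clip_gain(feedback_count):
    """Gain applied to each individual feedback clip: the aggregate ratio
    is shared among the clips, so single voices blur as the crowd grows."""
    _, ratio = select_scene_effect(feedback_count)
    return ratio / max(1, feedback_count)
```

  • With these illustrative numbers, 15 feedback clips select the living room effect with each clip clearly weighted, while 500 clips select the square effect in which any single clip is nearly inaudible.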
  • in a set playing period during which no audio feedback data is generated, the corresponding part of the merged audio data (the part aligned with that playing period, or lagging slightly behind it) contains only the main audio data.
  • when the terminal device plays this part of the content, the user hears only the audio content of the main audio data without any audio feedback content, and can thus feel the hushed atmosphere of all users enjoying this part of the content in silence.
  • Step S2213 According to the combination effect determined in step S2212, the audio feedback data generated during the set playing period is combined with the main audio data.
  • the combination according to step S2213 can be performed in the corresponding playback period of the main audio data according to the determined combination effect, or the combination according to step S2213 can be performed in the next playback period of the corresponding playback period of the main audio data. Not limited.
  • taking the audio feedback data generated during the 0-to-5-minute playing period of the main audio data as an example: the number obtained in step S2211 is 15, and the merging effect determined from that number in step S2212 is the living room scene effect. Then, in step S2213, the audio feedback data generated during the 0-to-5-minute playing period is merged with the main audio data according to the living room scene effect. Because the merging takes time, the playing time of each audio feedback data item is delayed relative to its generation time.
  • for example, the audio feedback data generated during the 0-to-5-minute playing period may be merged with the part of the main audio data corresponding to the 5-to-10-minute playing period, or with the part corresponding to the 2-to-7-minute playing period. The specific delay depends on the processing speed and on the set sampling interval for reading the audio feedback data, and is not limited here.
  • the merging process of this example can make the merged audio data reflect the impact of the amount of audio feedback data on the auditory effect, and then realize the simulation of the live effect of the corresponding number of audiences on the main audio data. , To enhance the user's on-site experience.
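  • In outline, the delayed merging described in this example could look like the following sketch, where samples are floats in [-1.0, 1.0] and simple additive mixing with clipping stands in for whatever mixing method an implementation actually uses (the delay and gain parameters are assumptions):

```python
def merge_feedback(main, clips, delay, gain):
    """Additively mix each feedback clip into `main`, shifted by `delay`
    samples to model the lag between generation time and playing time.
    `clips` is a list of (start_index, samples) pairs."""
    out = list(main)
    for start, clip in clips:
        for i, s in enumerate(clip):
            j = start + delay + i
            if 0 <= j < len(out):
                # clamp the mixed sample into the valid range
                out[j] = max(-1.0, min(1.0, out[j] + gain * s))
    return out
```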
  • combining the audio feedback data with the main audio data in this step S2200 may include the following steps S2221 to S2222.
  • Step S2221 Detect the idle gap of the main audio data adjacent to each audio feedback data according to the play period of the main audio data corresponding to each audio feedback data when it is generated.
  • This idle gap is a time gap in the main audio data where there is no audio content.
  • the hatched portions indicate parts of the main audio data that have audio content, while the blank portions indicate its idle gaps; audio feedback data can be inserted into these idle gaps.
  • each free gap of the main audio data can be used as a combined slot to perform a combined operation in each combined slot.
  • Step S2222 align each audio feedback data with an adjacent idle gap, and merge the audio feedback data with the main audio data.
  • the start position of each audio feedback data can be aligned with any position within the adjacent idle gap for merging; for example, the start position of each audio feedback data can be aligned with the start position of the adjacent idle gap. This is not limited here.
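  • Detecting the idle gaps of step S2221 amounts to finding runs of near-silent samples; a minimal sketch, in which the silence threshold and the minimum gap length are assumed values rather than values from the text:

```python
def find_idle_gaps(samples, threshold=0.01, min_len=3):
    """Return (start, length) pairs for runs of near-silent samples, which
    can serve as the merging slots for audio feedback data."""
    gaps, start = [], None
    for i, s in enumerate(list(samples) + [1.0]):  # sentinel closes a final gap
        if abs(s) < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                gaps.append((start, i - start))
            start = None
    return gaps
```

  • A feedback clip can then be merged in starting at a gap's start index, aligning its start position with the start of the idle gap.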
  • combining the audio feedback data with the main audio data in step S2200 may include the following steps S2231 to S2232.
  • Step S2231 setting that each data including the main audio data and the audio feedback data occupies a different audio track.
  • Step S2232 Combine the audio feedback data with the main audio data through audio track synthesis.
  • the audio processing according to this example can use audio track synthesis technology to merge audio data, which is beneficial to reduce the difficulty of audio merging and obtain a good merging effect.
  • generating the merged audio data for playback in step S2200 may mean that, for a terminal device currently playing the target media file, the audio file is updated through the merging, and playback then continues from the current playback position in the updated audio file.
  • the generation of the merged audio data for playback in step S2200 may be implemented by the terminal device or the server.
  • generating the combined audio data for playback in this step may include: sending the combined audio data to the terminal device for playback.
  • generating the combined audio data for playback in this step may include: generating the combined audio data to drive the audio output device to play.
  • the audio data processing method of this embodiment merges the main audio data of the target media file selected by the user for playback with the audio feedback data generated during the playback of the main audio data, so as to obtain merged audio data for playback. In this way, when any user plays the target media file through his or her own terminal device, that user obtains the listening effect of enjoying the target media file live together with other people, and thereby an on-site experience.
  • the processing method may further include a step of detecting whether the live sound effect is turned on, so as to execute the above step S2200 in response to the instruction to enable the live sound effect function.
  • the above additional steps in this embodiment can be implemented by the terminal device, that is, the terminal device merges the acquired audio feedback data with the main audio data in response to the user's input to enable the live sound effect function.
  • the instruction may be triggered by the user through a physical button of the terminal device, or may be triggered by a virtual button (control) provided by the application playing the target media file.
  • the instruction may be triggered by a virtual button for enabling live sound effects as shown in FIG. 5.
  • the server, in response to the instruction sent by the terminal device to turn on the live sound effect function, either provides the merged audio data to the terminal device for playback, or provides the terminal device with the audio feedback data to be merged with the main audio data into the merged audio data for playback.
  • the instruction sent by the terminal device may be generated based on the instruction triggered by the user.
  • this embodiment allows the user to choose whether to play the merged audio data; a user who does not wish to hear the audio feedback data can choose to play only the main audio data of the target media file, achieving diversified choices.
  • the audio feedback data participating in the merging may be the same.
  • all audio feedback data generated during the playback of the main audio data may be acquired.
  • alternatively, part of the audio feedback data, filtered according to set filtering conditions, may be acquired and merged; this is not limited here.
  • the audio feedback data participating in the merging may be different, that is, according to user preferences, different audio feedback data can be filtered out for different types of users and merged, so as to obtain a personalized, "a thousand faces for a thousand people" live effect.
  • obtaining the audio feedback data generated during the playback of the main audio data in the above step S2100 may include: obtaining the audio feedback data conforming to a target classification that is generated during the playback of the main audio data.
  • each target classification may be set in advance, which may be a classification based on at least one of the user's age, gender, education, and preferences, for example, five target classifications are set according to the user's age.
  • for example, the audio feedback data generated by users under 20 years old can be obtained in this step S2100 to form the audio feedback data conforming to that target classification.
  • generating the combined audio data for playback in the above step S2200 includes: generating the combined audio data for playback by terminal devices that meet the target classification.
  • a terminal device that conforms to the target classification means a terminal device whose corresponding user, that is, the user who uses the terminal device, conforms to the target classification.
  • step S2110 may include: for each set target classification, the server obtains the audio feedback data conforming to that target classification that is generated during the playback of the target media file.
  • the server may deliver the acquired audio feedback data that meets the target classification to a terminal device that matches the target classification to merge with the main audio data.
  • the server can also merge the audio feedback data with the main audio data after acquiring the audio feedback data that meets the target classification, and send the combined audio data to the terminal device that matches the target classification for playback .
  • step S2110 may include: the terminal device obtains from the server the audio feedback data conforming to the target classification to which its corresponding user belongs, so as to merge it with the main audio data.
  • the target classification to which the corresponding user belongs can be selected by the user from the provided target classifications, or determined according to the user characteristics of that user.
  • different on-site effects can be provided for different types of users, thereby improving the fit between the provided on-site effects and users, and improving user experience.
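  • For the age-based grouping described above, filtering the feedback pool per target classification could be sketched as follows (the text only says five classifications by age; the bracket boundaries and the field names are assumptions):

```python
# Hypothetical age brackets; the text only specifies "five classifications".
AGE_BRACKETS = [(0, 20), (20, 30), (30, 45), (45, 60), (60, 200)]

def target_class(age):
    """Index of the age-based target classification a user falls into."""
    for idx, (lo, hi) in enumerate(AGE_BRACKETS):
        if lo <= age < hi:
            return idx
    return None

def feedback_for_class(feedback_items, cls):
    """Keep only feedback whose author belongs to the given classification."""
    return [f for f in feedback_items if target_class(f["age"]) == cls]
```

  • Each terminal device would then receive (or request) only the slice of the feedback pool matching its user's classification, yielding different merged audio per user group.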
  • the processing method of the present application may further include: acquiring a characteristic value of a set user characteristic corresponding to a terminal device that plays the main audio data; and, according to the characteristic value, determining the target category to which the terminal device belongs.
  • the set user characteristic corresponding to the terminal device refers to the set user characteristic of the user corresponding to the terminal device, that is, the set user characteristic of the user who uses the terminal device.
  • the set user characteristics include any one or more of age, education, gender, hobbies, and preferred language types. The feature values of these set user characteristics can be determined from the user's registration information, from the historical usage data generated by the user in this application (the application providing the target media file), or from the historical usage data generated by the user in other applications; this is not limited here.
  • the set user characteristics may include the set characteristics of the audio feedback data generated by the user during the playback of the main audio data.
  • the set features include, for example, either or both of sound features and emotional features.
  • based on the set features, a corresponding user can be assigned to a target classification of users with similar features, or to a target classification of users with opposite features; this is not limited here.
  • Sound characteristics refer to characteristics related to sound attributes embodied in audio feedback data.
  • the sound characteristics may include volume characteristics, rhythm characteristics, pitch characteristics, and the like.
  • Emotional features refer to the features related to the user's emotions or feelings reflected in the audio feedback data.
  • the emotional features can include the type of emotion, the degree of emotion, and the theme of expression.
  • the emotion type can be a preset type according to human emotion and emotion classification.
  • the emotion type can include anger, happiness, sadness, joy, etc.
  • the emotion degree can reflect the intensity of the corresponding emotion type; for example, the emotion type of anger can include various degrees such as furious, angry, and slightly angry.
  • voice analysis can be performed on the audio feedback data to extract the corresponding volume characteristics and rhythm characteristics.
  • voice signal analysis methods can be used to determine the volume and rhythm of the audio feedback data, and correspondingly obtain the volume characteristics and rhythm characteristics of the audio feedback data.
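  • As a toy stand-in for the voice-signal analysis mentioned here (a real implementation would use proper signal processing), the overall RMS level can serve as the volume feature, and the count of high-energy frames as a crude rhythm feature; the frame size and the 1.5x threshold are assumptions:

```python
import math

def sound_features(samples, frame=4):
    """Crude volume (overall RMS) and rhythm (number of frames whose RMS
    exceeds 1.5x the overall RMS) features of a feedback clip."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    beats = 0
    for i in range(0, n - frame + 1, frame):
        w = samples[i:i + frame]
        if math.sqrt(sum(s * s for s in w) / frame) > 1.5 * rms:
            beats += 1
    return {"volume": rms, "beats": beats}
```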
  • the content of the audio feedback data can be converted into corresponding text, emotional keywords can be extracted from the text according to a pre-built emotional vocabulary, and the emotional keywords can then be structurally analyzed through the emotional structured model to obtain their emotion type and emotion degree as the emotional features of the audio feedback data.
  • the audio feedback data can be passed through a speech recognition engine or speech-to-text tools, plug-ins, etc., to obtain the corresponding text.
  • the emotional vocabulary includes a plurality of emotional words that respectively reflect different human emotions or feelings.
  • these emotional words can be mined manually or by machine in advance to construct the emotional vocabulary.
  • the words obtained by segmenting the text of the audio feedback data can be compared for similarity with the words in the emotional vocabulary through cosine similarity or other methods, and words whose similarity exceeds a preset similarity threshold can be extracted as emotional keywords.
  • the emotional structured model can be a vocabulary model obtained by classifying and structuring the collected emotional vocabulary related to emotions.
  • Each emotion vocabulary included in the emotion structure model has a corresponding emotion type and emotion degree.
  • the emotional words obtained in advance through manual or machine mining can be classified into different levels according to human emotions or feelings.
  • for example, each emotion type forms a major category, and within each major category the emotional words of the same emotion type are further subdivided into different sub-categories according to their emotion degree.
  • the emotional words can also be sorted according to emotion degree to form different classification levels.
  • organizing the emotional vocabulary into this structure yields the emotional structured model.
  • in the structured analysis of an emotional keyword, the emotional word corresponding to the keyword is looked up in the emotional structured model, and the emotion type and emotion degree of that word determine the emotion type and emotion degree of the keyword, which serve as the emotional features of the audio feedback data.
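  • The keyword extraction and structured lookup described above might be sketched as follows, with a toy lexicon standing in for the emotional structured model and character-bigram cosine similarity standing in for the text's unspecified similarity measure (all words, degree numbers, and the threshold are invented for illustration):

```python
# Toy stand-in for the emotional structured model: word -> (type, degree).
EMOTION_MODEL = {
    "furious": ("anger", 3), "angry": ("anger", 2), "annoyed": ("anger", 1),
    "overjoyed": ("happy", 3), "happy": ("happy", 2), "pleased": ("happy", 1),
}

def cosine(a, b):
    """Cosine similarity over character-bigram counts of two words."""
    def grams(w):
        d = {}
        for i in range(len(w) - 1):
            d[w[i:i + 2]] = d.get(w[i:i + 2], 0) + 1
        return d
    da, db = grams(a), grams(b)
    dot = sum(v * db.get(g, 0) for g, v in da.items())
    na = sum(v * v for v in da.values()) ** 0.5
    nb = sum(v * v for v in db.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def emotional_features(words, threshold=0.8):
    """Return (emotion_type, degree) of the first word similar enough to a
    model entry, or None if no emotional keyword is found."""
    for w in words:
        for key, tag in EMOTION_MODEL.items():
            if cosine(w.lower(), key) >= threshold:
                return tag
    return None
```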
  • the required feature value of a set feature can also be determined directly from an expression feature; for example, for an angry expression represented by an expression feature, the feature value of the corresponding emotional feature can be determined directly from that expression feature.
  • this step of this embodiment can be implemented by the server according to the feature values of the set user characteristics that each terminal device provides for its corresponding user, or it can be implemented by the terminal device.
  • in the latter case, each terminal device determines the target classification to which its corresponding user belongs.
  • determining the target classification to which the user or the terminal device belongs according to the feature values of user characteristics can improve the accuracy of the determination, and does not require the user to set the desired target classification through additional operations, thereby achieving intelligent classification.
  • the main audio data is the audio data of the video file.
  • the processing method of this embodiment may further include: displaying the audio waveform representing the audio feedback data in the form of a bullet screen in the video playback window of the video file.
  • the audio waveform representing the audio feedback data is a graphical representation of the audio feedback data.
  • the sound characteristics and emotional characteristics of the audio feedback data may be acquired first, and then the audio waveform is generated according to the sound characteristics and emotional characteristics of the audio feedback data.
  • the display shape of the audio waveform can be set according to the sound characteristics of the audio feedback data.
  • the display shape may include the amplitude of the audio waveform, the interval of the waveform period, and the duration of the waveform.
  • the sound characteristics of the audio feedback data include rhythm characteristics and volume characteristics.
  • the waveform period interval of the audio waveform can be set according to the rhythm reflected by the rhythm features (for example, the faster the rhythm, the shorter the waveform period interval), and the waveform amplitude of the audio waveform can be set according to the volume reflected by the volume features (for example, the louder the volume, the larger the waveform amplitude).
  • the display color of the audio waveform can be set according to the emotional characteristics of the audio feedback data.
  • different types of display colors can be set according to different emotion types.
  • for example, when the emotion type is "angry", the display color is set to red; when the emotion type is "happy", the display color is set to green.
  • for the same emotion type, different emotion degrees can be given different shades of the same display color: for example, when the emotion type is "happy" and the emotion degree is "great joy", the display color is dark green, while when the emotion degree is "a little happy", the display color is light green; and so on.
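  • Combining the shape and color rules above, a hypothetical renderer could derive its display parameters as below; the color codes, the 50 ms period floor, and the degree-to-shade scale are all invented for illustration:

```python
# Shades from light to dark per emotion type, indexed by degree 1..3.
COLORS = {
    "angry": ["#ff9999", "#ff4d4d", "#cc0000"],
    "happy": ["#99e699", "#4dcc4d", "#1f7a1f"],
}

def waveform_style(volume, beats_per_minute, emotion_type, degree):
    """Louder -> larger amplitude; faster rhythm -> shorter waveform
    period; emotion type -> hue; emotion degree -> darker shade."""
    return {
        "amplitude": min(1.0, volume),
        "period_ms": max(50, 60000 // max(1, beats_per_minute)),
        "color": COLORS[emotion_type][max(0, min(2, degree - 1))],
    }
```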
  • by displaying the audio waveform in the form of a bullet screen in the video playback window, the user, while obtaining the on-site auditory effect, can also intuitively perceive the sound features and emotional features of other users through this graphical expression of the audio feedback data.
  • Fig. 6a is an exemplary flowchart of a method for processing audio data according to an example of the present application.
  • the audio feedback data provided by the server to each terminal device may be the same. Therefore, only one terminal device is shown in the figure.
  • the processing method may include the following steps:
  • step S1210 the terminal device 1200 collects the audio feedback data generated by the corresponding user during the playback process of the target media file, that is, during the playback process of the main audio data, and uploads it to the server 1100.
  • the terminal device 1200 shown in the figure may not generate audio feedback data. Instead, other terminal devices 1200 collect the audio feedback data generated by the corresponding user during the playback of the target media file and upload it to the server 1100.
  • step S1110 the server 1100 obtains the audio feedback data uploaded by each terminal device including the terminal device shown in the figure.
  • step S1120 the server 1100 delivers the acquired audio feedback data to each terminal device 1200 that is playing the target media file to merge the audio data.
  • step S1220 the terminal device 1200 obtains audio feedback data provided by the server 1100.
  • step S1230 the terminal device 1200 merges the acquired audio feedback data with the main audio data of the target media file to generate a merged target media file.
  • the terminal device 1200 combines the main audio data and the acquired audio feedback data, for example, by means of audio mixing.
  • Step S1240 When the terminal device 1200 plays the target media file, it plays the merged audio data instead of the main audio data alone. That is, the user corresponding to the terminal device 1200, while listening to the main audio data, can at least also hear the audio feedback data generated by other users during the playback of the main audio data.
  • Fig. 6b is an exemplary flowchart of a method for processing audio data according to another example of the present application.
  • the audio feedback data provided by the server to each terminal device may be different.
  • the figure shows two terminal devices conforming to different target categories, namely terminal device 1200-1 and terminal device 1200-2.
  • the processing method may include the following steps:
  • step S1210-1 the terminal device 1200-1 collects and uploads the audio feedback data generated by the corresponding user during the playback of the target media file to the server 1100.
  • step S1210-2 the terminal device 1200-2 collects and uploads the audio feedback data generated by the corresponding user during the playback of the target media file to the server 1100.
  • the terminal device 1200-1 and/or the terminal device 1200-2 shown in the figure may not generate audio feedback data; instead, other terminal devices 1200 collect the audio feedback data generated by their corresponding users during the playback of the target media file and upload it to the server 1100.
  • step S1110 the server 1100 obtains audio feedback data uploaded by each terminal device including the terminal device 1200-1 and the terminal device 1200-2.
  • step S1120-1 the server 1100 delivers to the terminal device 1200-1 the audio feedback data that is generated during the playback of the target media file and conforms to the target classification to which the terminal device 1200-1 belongs, for merging the audio data.
  • step S1120-2 the server 1100 delivers to the terminal device 1200-2 the audio feedback data that is generated during the playback of the target media file and conforms to the target classification to which the terminal device 1200-2 belongs, for merging the audio data.
  • Step S1220-1 The terminal device 1200-1 obtains the audio feedback data provided by the server 1100.
  • step S1230-1 the terminal device 1200-1 combines the acquired audio feedback data with the main audio data of the target media file to generate combined audio data A.
  • step S1240-1 the terminal device 1200-1 plays the combined audio data A during the process of playing the target media file, where the auditory effect of playing the combined audio data A is: the user corresponding to the terminal device 1200-1 While listening to the main audio data, it is also possible to listen to the audio feedback data that matches the target classification of the terminal device 1200-1.
  • step S1220-2 the terminal device 1200-2 obtains the audio feedback data provided by the server 1100.
  • step S1230-2 the terminal device 1200-2 combines the acquired audio feedback data with the main audio data of the target media file to generate combined audio data B.
  • step S1240-2 the terminal device 1200-2 plays the combined audio data B during the process of playing the target media file, where the auditory effect of playing the combined audio data B is: the user corresponding to the terminal device 1200-2 While listening to the main audio data, it is also possible to listen to the audio feedback data that matches the target classification of the terminal device 1200-2.
  • the merged audio data A and the merged audio data B will be different, achieving a personalized, "a thousand faces for a thousand people" scene effect.
  • FIG. 7 is a schematic flowchart of a method for processing audio data according to this embodiment.
  • the processing method is implemented by a terminal device, such as the terminal device 1200 in FIG. 1. The terminal device in this embodiment may have a display device, or it may be a device without one; it may have its own audio output device, or an external audio output device may be connected wirelessly or by wire.
  • the method of this embodiment may include the following steps S7100 to S7300:
  • step S7100 the terminal device 1200 obtains the main audio data selected to be played.
  • the main audio data selected to be played is the audio data of the target media file selected by the user using the terminal device 1200, and the target media file may be a pure audio file or a video file.
  • step S7200 the terminal device 1200 obtains live audio data corresponding to the main audio data, where the live audio data includes at least other users' audio feedback data for the main audio data.
  • the live audio data may also include audio feedback data generated for the main audio data by the user corresponding to the terminal device 1200. That is, for any terminal device 1200, not only can the audio feedback data of other users participate in the merging of the audio data, but the audio feedback data generated by the user of that terminal device 1200 can also participate in the merging.
  • the acquired live audio data may be live audio data conforming to the target classification of the terminal device 1200, or may be live audio data that is the same for any terminal device 1200, which is not limited herein.
  • the terminal device 1200 may obtain all audio feedback data from the server, including audio feedback data generated by other users, and may also include audio feedback data generated by the user corresponding to the terminal device 1200.
  • the terminal device 1200 may only obtain audio feedback data generated by other users from the server, and obtain audio feedback data generated by the corresponding user locally.
  • after the terminal device 1200 obtains the live audio data, the live audio data and the main audio data are merged to obtain the merged audio data.
  • alternatively, the merging may be performed by the server, and the merged audio data provided to the terminal device.
  • in that case, the above steps S7100 and S7200 amount to acquiring the merged audio data, where the merged audio data includes the main audio data and the live audio data.
  • step S7300 the terminal device 1200 performs a processing operation of playing the corresponding live audio data while playing the main audio data.
  • the processing operation includes the combining process, and driving the audio output device according to the combined audio data to play the corresponding live audio data while playing the main audio data.
  • any one or more of the methods provided in Embodiment 1 of the above method can be used for the merging process, which will not be repeated here.
  • the processing operation includes: according to the merged audio data, the audio output device is driven to play the main audio data while playing the corresponding live audio data.
  • the terminal device 1200 may drive the audio output device according to the merged audio data, for example according to the mixed audio data or according to the correspondence between the main audio data and the live audio data, so that the corresponding live audio data is played along with the main audio data, realizing the live effect of enjoying the target media file together with other people.
  • the terminal device may be a smart phone, a laptop computer, a desktop computer, a tablet computer, a wearable device, a smart speaker, a set-top box, a smart TV, a voice recorder, a camcorder, etc., which are not limited here.
  • the terminal device can play the acquired live audio data along with the main audio data of the target media file while playing the target media file selected by the user, so that the user obtains the live listening experience of the main audio data and the live audio data mixed together. Therefore, according to the processing method of this embodiment, when any user plays the target media file through his or her terminal device, the user obtains the live listening effect of enjoying the target media file together with other people, and thereby an on-site experience.
  • FIG. 8 is a schematic flowchart of a method for processing audio data according to this embodiment.
  • the processing method is implemented by a terminal device, such as the terminal device 1200 in FIG. 1. The terminal device in this embodiment may have a display device, or it may be a device without one; it may have its own audio output device, or an external audio output device may be connected wirelessly or by wire.
  • the processing method of this embodiment may include the following steps S8100 to S8300:
  • Step S8100 In response to the operation of playing the target media file, the terminal device 1200 plays the target media file, where the target media file includes main audio data.
  • Step S8200 Acquire live audio data corresponding to the main audio data, where the live audio data includes at least other users' audio feedback data for the main audio data.
  • the live audio data may also include the audio feedback data for the main audio data from the user corresponding to the terminal device, that is, the local user.
  • the audio feedback data of the local user can be obtained from the server together with the audio feedback data of other users, or it can be obtained locally, which is not limited here.
  • obtaining live audio data corresponding to the main audio data in step S8200 may include: obtaining audio feedback data of other users for the main audio data from the server to form live audio data.
  • obtaining live audio data corresponding to the main audio data in step S8200 may further include: obtaining audio feedback data of the user corresponding to the terminal device for the main audio data from the server or locally to form live audio data.
  • step S8300 the terminal device 1200 performs a processing operation of playing live audio data along with the main audio data of the target media file during the process of playing the target media file.
  • the processing operation may include: the terminal device 1200 performing merging processing, that is, merging the live audio data with the main audio data, and driving the audio output device according to the merged audio data to play the live audio data while playing the main audio data.
  • the merging process can adopt any one or more of the methods provided in Embodiment 1 of the above method, which will not be repeated here.
  • the processing operation may alternatively include: the terminal device 1200 obtaining the combined audio data provided by the server 1100, where the combined audio data is audio data obtained by combining the main audio data and the live audio data, and driving the audio output device according to the combined audio data to play the live audio data while playing the main audio data.
  • in step S8300, the terminal device 1200 drives the audio output device to play the live audio data at the same time as the main audio data according to the combined form, such as a mixed-audio form or a multi-channel form, so that the live audio data is played along with the main audio data, realizing the live effect of enjoying the target media file with others.
  • the processing method may further include: obtaining audio feedback data fed back by the user corresponding to the terminal device for the main audio data; uploading the user's audio feedback data to the server.
  • after the user's audio feedback data is uploaded to the server, the server can send it to the terminal devices of other users, so that other users who are also playing the target media file can receive the user's audio feedback data.
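The upload-and-forward behavior described above can be sketched as a minimal in-memory relay. The class and method names are illustrative assumptions, not the claimed implementation; a real server would use persistent sessions and push channels:

```python
class FeedbackRelay:
    """Minimal sketch of the server-side fan-out: a terminal uploads a
    user's audio feedback for a media file, and the server forwards it
    to every other terminal currently playing that file."""

    def __init__(self):
        self.playing = {}  # media_id -> set of terminal ids

    def register(self, media_id, terminal_id):
        """Record that a terminal is playing the given target media file."""
        self.playing.setdefault(media_id, set()).add(terminal_id)

    def upload(self, media_id, sender_id, feedback):
        # Everyone playing the same target media file, except the sender,
        # receives the sender's audio feedback data.
        return {t: feedback
                for t in self.playing.get(media_id, set())
                if t != sender_id}
```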
  • Fig. 9 is a schematic block diagram of an audio data processing device according to an embodiment of the present application.
  • the processing device 9000 of this embodiment includes a data acquisition module 9100 and an audio processing module 9200.
  • the data acquisition module 9100 is used to acquire audio feedback data generated during the playback of the main audio data.
  • the audio processing module 9200 is used to combine the audio feedback data with the main audio data, and generate the combined audio data for playback.
  • when the audio processing module 9200 merges the audio feedback data with the main audio data, it can be used to: obtain the quantity of audio feedback data generated during a set playing period of the main audio data; determine the corresponding merging effect according to that quantity, where the merging effect at least reflects the volume ratio of each piece of data participating in the merging; and, according to the merging effect, merge the audio feedback data generated during the set playing period with the main audio data.
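The count-dependent merging effect described above can be sketched as follows. This is an illustrative, non-limiting example (the function name, the `base_ratio` cap, and the equal-sharing rule are assumptions, not the claimed implementation): the more feedback clips fall in the set playing period, the lower the volume allotted to each clip relative to the main audio.

```python
def merge_with_count_ratio(main, feedbacks, base_ratio=0.5):
    """Mix feedback clips (lists of float samples) into the main audio.

    Assumed rule: the total feedback loudness is capped at base_ratio of
    the main signal and shared equally among the clips in the period, so
    the volume ratio depends on the quantity of feedback data.
    """
    mixed = list(main)
    if not feedbacks:
        return mixed
    per_clip_gain = base_ratio / len(feedbacks)
    for clip in feedbacks:
        for i in range(min(len(clip), len(mixed))):
            mixed[i] += per_clip_gain * clip[i]
    # Normalize if the sum exceeds full scale, to avoid clipping.
    peak = max(abs(s) for s in mixed)
    if peak > 1.0:
        mixed = [s / peak for s in mixed]
    return mixed
```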
  • when the audio processing module 9200 combines the audio feedback data with the main audio data, it can be used to: detect, according to the play period of the main audio data corresponding to each piece of audio feedback data when it was generated, the idle gaps of the main audio data adjacent to each piece of audio feedback data, and align each piece of audio feedback data with the adjacent idle gap for merging.
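A minimal sketch of the idle-gap alignment described above, assuming amplitude-thresholded silence detection (the threshold, minimum gap length, and function names are illustrative assumptions, not the claimed implementation):

```python
def find_idle_gaps(main, threshold=0.05, min_len=3):
    """Return (start, end) index pairs where the main audio stays below
    an amplitude threshold for at least min_len samples -- candidate
    idle gaps into which a feedback clip can be aligned."""
    gaps, start = [], None
    for i, s in enumerate(main):
        if abs(s) < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                gaps.append((start, i))
            start = None
    if start is not None and len(main) - start >= min_len:
        gaps.append((start, len(main)))
    return gaps


def nearest_gap(gaps, position):
    """Pick the gap whose start is closest to the playback position at
    which the feedback was produced."""
    return min(gaps, key=lambda g: abs(g[0] - position)) if gaps else None
```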
  • when the audio processing module 9200 combines the audio feedback data with the main audio data, it can be used to: set each piece of data, including the main audio data and the audio feedback data, to occupy a different audio track; and merge the audio feedback data with the main audio data through audio track synthesis.
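The track-based merging described above can be illustrated with a minimal sketch in which each track is a plain list of samples and "audio track synthesis" is a sample-wise sum — an assumed simplification of real multi-track mixing, not the claimed implementation:

```python
def synthesize_tracks(tracks):
    """Each entry in `tracks` is one audio track (the main audio, or one
    feedback stream), kept on its own track until this final mix-down.
    The synthesized output is the sample-wise sum, padded to the length
    of the longest track."""
    length = max((len(t) for t in tracks), default=0)
    out = [0.0] * length
    for t in tracks:
        for i, s in enumerate(t):
            out[i] += s
    return out
```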
  • the processing device 9000 may further include a detection module configured to detect whether the live sound effect function is enabled and, in response to an instruction to enable the live sound effect function, notify the audio processing module 9200 to perform the operation of merging the audio feedback data with the main audio data.
  • when the data acquisition module 9100 acquires the audio feedback data generated during the playback of the main audio data, it may: acquire the voice comments fed back during the playback of the main audio data, and use at least the voice comments as audio feedback data.
  • when the data acquisition module 9100 acquires the audio feedback data generated during the playback of the main audio data, it may: acquire the text comments fed back during the playback of the main audio data, convert the text comments into corresponding audio data, and use at least the converted audio data as audio feedback data.
  • when the data acquisition module 9100 acquires the audio feedback data generated during the playback of the main audio data, it may: acquire the expression features fed back during the playback of the main audio data, convert the expression features into corresponding audio data, and use at least the converted audio data as audio feedback data.
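As a hedged illustration of converting expression features into audio feedback data: the mapping table and placeholder waveforms below are invented for the example; a real system would use recorded or synthesized clips for each expression.

```python
# Assumed mapping from expression features to canned audio clips
# (placeholder sample values, not real recordings).
EXPRESSION_SOUNDS = {
    "laugh": [0.2, -0.2, 0.3, -0.3],
    "sigh":  [0.1, 0.05, 0.0, -0.05],
}


def expression_to_audio(expression):
    """Convert an expression feature reported during playback into audio
    feedback data; unknown expressions yield no audio."""
    return EXPRESSION_SOUNDS.get(expression)
```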
  • when the data acquisition module 9100 acquires audio feedback data generated during the playback of the main audio data, it may be used to: acquire audio feedback data, generated during the playback of the main audio data, that meets a target classification, so that the audio processing module 9200 generates merged audio data for playback by terminal devices that meet the target classification.
  • the processing device 9000 may further include a classification module configured to: obtain a characteristic value of a set user characteristic corresponding to a terminal device that plays the main audio data; and, according to the characteristic value, determine the target classification corresponding to the terminal device.
  • the set user characteristics may include: set characteristics of the audio feedback data generated by the user of the terminal device during the playback of the main audio data.
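A minimal sketch of how the classification module might map a set user characteristic to a target classification. The characteristic chosen here (the number of feedback events the user produced during playback) and the class names are assumptions for illustration only:

```python
def classify_terminal(feedback_events, boundaries=(1, 5)):
    """Assign a terminal device to a target classification from the
    characteristic value of a set user characteristic -- here, an
    assumed characteristic: how many feedback events its user produced
    while the main audio data was playing."""
    count = len(feedback_events)
    if count <= boundaries[0]:
        return "quiet"
    if count <= boundaries[1]:
        return "active"
    return "very_active"
```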
  • when the main audio data is the audio data of a video file, the processing device 9000 may further include a display processing module, which is used to display an audio waveform representing the audio feedback data in the form of a bullet screen in the video playback window.
  • the electronic device 100 includes a processing device 9000 according to any embodiment of the present application.
  • the electronic device 100 may include a memory 110 and a processor 120.
  • the memory 110 is used to store executable instructions; the processor 120 is used to run, under the control of the executable instructions, the processing method according to any method embodiment of this application.
  • the electronic device 100 may be a server, such as the server 1100 in FIG. 1, or any terminal device, such as the terminal device 1200 in FIG. 1; it may also include both a server and a terminal device, such as the server 1100 and the terminal device 1200 in FIG. 1, which is not limited here.
  • in some embodiments, the electronic device 100 is a terminal device, which may be a device with a display device or a device without one, for example a set-top box, a smart speaker, etc.
  • in some embodiments, the electronic device 100 is a terminal device that may also include an input device for the corresponding user to post feedback content for the main audio data and send the feedback content to the above processing device 9000 or the processor 120, so that the processing device 9000 or the processor 120 generates the corresponding user's audio feedback data for the main audio data according to the feedback content.
  • the input device may include at least one of an audio input device, a physical keyboard, a virtual keyboard, and a touch screen.
  • the processing device or processor of the terminal device can also be used to control the communication device to send the corresponding user's audio feedback data to the server, so that the server can forward it to the terminal devices of other users, and those users can receive the user's audio feedback data while playing the same target media file.
  • in some embodiments, the electronic device 100 is a terminal device that may also include an audio output device, which is used, under the control of the processing device or the processor, to play the corresponding audio feedback data while playing the main audio data.
  • the terminal device may also be connected to the audio output device in a wired or wireless manner to play the combined audio data.
  • a computer-readable storage medium stores a computer program that can be read and run by a computer; when read and run by the computer, the program executes the audio data processing method described in any of the above embodiments of this application.
  • the computer program product may include a computer readable storage medium loaded with computer readable program instructions for enabling a processor to implement various aspects of the present application.
  • the computer-readable storage medium may be a tangible device that can hold and store instructions used by the instruction execution device.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • A non-exhaustive list of computer-readable storage media includes: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (CD-ROM), digital versatile disk (DVD), memory stick, floppy disk, and mechanical encoding device, such as a punch card or a raised structure in a groove with instructions stored thereon.
  • the computer-readable storage medium used here is not to be interpreted as a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (for example, light pulses through fiber-optic cables), or electrical signals transmitted through wires.
  • the computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to various computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • the network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in the computer-readable storage medium in each computing/processing device.
  • the computer program instructions used to perform the operations of this application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages; the programming languages include object-oriented programming languages such as Smalltalk, C++, etc., and conventional procedural programming languages such as the "C" language or similar programming languages.
  • computer-readable program instructions can be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, over the Internet using an Internet service provider).
  • an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), can be personalized by using the state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to realize various aspects of the present application.
  • these computer-readable program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or another programmable data processing device to produce a machine, so that when these instructions are executed by the processor of the computer or other programmable data processing device, a device that implements the functions/actions specified in one or more blocks of the flowchart and/or block diagram is produced. These computer-readable program instructions can also be stored in a computer-readable storage medium; these instructions make computers, programmable data processing apparatuses, and/or other devices work in a specific manner, so that the computer-readable medium storing the instructions includes an article of manufacture, which includes instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowchart and/or block diagram.
  • each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more executable instructions for implementing the specified logical functions. In some alternative implementations, the functions marked in the blocks may also occur in an order different from that marked in the drawings; for example, two consecutive blocks can actually be executed in parallel, or sometimes in the reverse order, depending on the functions involved.
  • each block in the block diagram and/or flowchart, and the combination of blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Mathematical Physics (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

Disclosed are an audio data processing method and apparatus, and an electronic device. The processing method comprises: acquiring audio feedback data generated during the playing of main audio data (S2100); and merging the audio feedback data with the main audio data to generate merged audio data for playing (S2200). When different users, in different spaces, play a media file containing the main audio data through their respective terminal devices, each user can achieve the on-site effect of enjoying the media file in the same space as the others.

Description

Audio data processing method, device and electronic equipment
This application claims the priority of the Chinese patent application with application number 201910619886.0, entitled "Audio data processing method, device and electronic equipment", filed on July 10, 2019, the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of Internet technology, and more specifically, to an audio data processing method, device, electronic equipment, and computer-readable storage medium.
Background technique
With the rapid development of playback technologies for audio, video, and other media files, applications that provide media file playback services nowadays usually offer users a comment function so that users can post comments while a media file is playing. In the prior art, these comments are arranged linearly: for the recipient of any media file, receiving the media file and receiving the comment content are sensorily separate, so the recipient cannot obtain the sense of presence of multiple people receiving and commenting on the media file together.
Summary of the invention
An objective of the embodiments of the present application is to provide a new technical solution for a processing method of audio data.
According to the first aspect of the present application, there is provided an audio data processing method, which includes:
obtaining audio feedback data generated during the playback of main audio data;
merging the audio feedback data with the main audio data to generate merged audio data for playback.
Optionally, the merging of the audio feedback data with the main audio data includes:
obtaining the quantity of audio feedback data generated within a set playing period of the main audio data;
determining the corresponding merging effect according to the quantity, where the merging effect at least reflects the volume ratio of each piece of data participating in the merging;
according to the merging effect, merging the audio feedback data generated in the set playing period with the main audio data.
Optionally, the merging of the audio feedback data with the main audio data includes:
detecting idle gaps of the main audio data adjacent to each piece of audio feedback data, according to the play period of the main audio data corresponding to each piece of audio feedback data when it was generated;
aligning each piece of audio feedback data with the adjacent idle gap to perform the merging.
Optionally, the merging of the audio feedback data with the main audio data includes:
setting each piece of data, including the main audio data and the audio feedback data, to occupy a different audio track;
merging the audio feedback data with the main audio data through audio track synthesis.
Optionally, the obtaining of audio feedback data generated during the playback of the main audio data includes:
obtaining audio feedback data, generated during the playback of the main audio data, that meets a target classification;
and the generating of merged audio data for playback includes:
generating merged audio data for playback by terminal devices that meet the target classification.
Optionally, the method further includes:
obtaining the characteristic value of a set user characteristic corresponding to a terminal device playing the main audio data;
determining, according to the characteristic value, the target classification corresponding to the terminal device.
Optionally, the set user characteristics include set characteristics of the audio feedback data generated by the user of the terminal device during the playback of the main audio data.
Optionally, the main audio data is the audio data of a video file, and the method further includes:
displaying, in the video playback window of the video file, an audio waveform representing the audio feedback data in the form of a bullet screen.
Optionally, the obtaining of audio feedback data generated during the playback of the main audio data includes:
obtaining voice comments posted during the playback of the main audio data, and using at least the voice comments as the audio feedback data.
Optionally, the obtaining of audio feedback data generated during the playback of the main audio data includes:
obtaining text comments posted during the playback of the main audio data;
converting the text comments into corresponding audio data, and using at least the converted audio data as the audio feedback data.
Optionally, the obtaining of audio feedback data generated during the playback of the main audio data includes:
obtaining expression features posted during the playback of the main audio data;
converting the expression features into corresponding audio data, and using at least the converted audio data as the audio feedback data.
Optionally, the main audio data is the audio data of a live media file.
Optionally, the method further includes:
in response to an instruction to enable the live sound effect function, performing the operation of merging the audio feedback data with the main audio data.
According to the second aspect of the present application, there is also provided an audio data processing method, implemented by a terminal device, the method including:
obtaining the main audio data selected for playback;
obtaining live audio data corresponding to the main audio data, where the live audio data includes at least other users' audio feedback data for the main audio data;
performing the processing operation of playing the live audio data while playing the main audio data.
Optionally, the live audio data further includes audio feedback data for the main audio data fed back by the user corresponding to the terminal device.
According to the third aspect of the present application, there is also provided an audio data processing method, implemented by a terminal device, the method including:
in response to an operation of playing a target media file, playing the target media file, where the target media file includes main audio data;
obtaining live audio data corresponding to the main audio data, where the live audio data includes at least other users' audio feedback data for the main audio data;
performing the processing operation of playing the live audio data along with the main audio data during the playback of the target media file.
Optionally, the obtaining of live audio data corresponding to the main audio data includes:
obtaining other users' audio feedback data for the main audio data from the server, as the live audio data.
Optionally, the method further includes:
obtaining audio feedback data of the user corresponding to the terminal device for the main audio data;
uploading the user's audio feedback data to the server.
According to the fourth aspect of the present application, there is also provided an audio data processing device, including:
a data acquisition module for obtaining audio feedback data generated during the playback of main audio data; and
an audio processing module for merging the audio feedback data with the main audio data to generate merged audio data for playback.
According to the fifth aspect of the present application, there is also provided an electronic device, including the processing device according to the fourth aspect of the present application; or including:
a memory for storing executable instructions;
a processor for running the electronic device, under the control of the executable instructions, to execute the processing method according to the first, second, or third aspect of the present application.
Optionally, the electronic device is a terminal device without a display device.
Optionally, the electronic device is a terminal device, and the terminal device further includes an input device for the corresponding user to input feedback content for the main audio data and send the feedback content to the processing device or processor, so that the processing device or processor generates the corresponding user's audio feedback data for the main audio data according to the feedback content.
Optionally, the electronic device is a terminal device, and the terminal device further includes an audio output device, which is used, under the control of the processing device or the processor, to play the corresponding audio feedback data while playing the main audio data.
According to the sixth aspect of the present application, there is also provided a computer-readable storage medium storing a computer program that can be read and executed by a computer; when read and run by the computer, the computer program executes the processing method according to the first, second, or third aspect of the present application.
One beneficial effect of the embodiments of the present application is that the audio data processing method of the embodiments merges the main audio data with the audio feedback data generated during its playback, so that any terminal device can play audio feedback data from other users while playing the main audio data. In this way, any user listening to the main audio data alone through his or her own terminal device can still obtain the live auditory effect of listening to the main audio data, and commenting on it, together with other users, thereby obtaining a live experience.
Through the following detailed description of exemplary embodiments of the present application with reference to the accompanying drawings, other features of the present application and their advantages will become clear.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate the embodiments of the present application and, together with the description, serve to explain the principles of the present application:
Figure 1a is a schematic diagram of an application scenario illustrating the effects of an embodiment of the present application;
Figure 1b is a hardware configuration diagram of an alternative data processing system that can be used to implement the audio data processing method of an embodiment of the present application;
Figure 2 is a schematic flowchart of a processing method according to an embodiment of the present application;
Figure 3 is a schematic diagram of an example of guiding the user to input audio feedback data in the play window of the target media file;
Figure 4 is a schematic diagram of inserting audio feedback data into adjacent idle gaps of the main audio data during mixing;
Figure 5 is a schematic diagram of an example of guiding the user to input an instruction to enable the live sound effect function;
Figure 6a is an interaction diagram of a processing method according to one example of the present application;
Figure 6b is an interaction diagram of a processing method according to another example of the present application;
Figure 7 is a schematic flowchart of a processing method according to another embodiment of the present application;
Figure 8 is a schematic flowchart of a processing method according to a third embodiment of the present application;
Figure 9 is a schematic functional block diagram of an audio data processing device according to an embodiment of the present application;
Figure 10a is a schematic block diagram of an electronic device according to an embodiment of the present application;
Figure 10b is a schematic block diagram of an electronic device according to another embodiment of the present application.
Detailed Description
Various exemplary embodiments of the present application will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present application.
The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present application or its applications or uses.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate such techniques, methods, and devices should be regarded as part of this specification.
In all the examples shown and discussed herein, any specific value should be interpreted as merely exemplary rather than limiting; other examples of the exemplary embodiments may therefore have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item has been defined in one drawing, it need not be discussed further in subsequent drawings.
At present, media files have become a primary medium of information transmission. With the development of Internet technology, people can not only enjoy the content of a media file together with others at the venue where it is played, but can also enjoy it alone in various places through their own terminal devices. The media file may be a video file containing both audio data and image data, in which case the terminal device that plays it must have a display device and an audio output device; it may also be a pure audio file containing only audio data, in which case the terminal device must have an audio output device but need not have a display device, for example a smart speaker. In the live mode, where many people enjoy the content together, everyone can perceive the various sound reactions of the others to the media file. These reactions include spoken comments as well as expressive reactions such as delight, sighs, sadness, or complete silence, giving people a rich, immersive sensory experience on site. In the online mode, where individuals watch alone through their own terminal devices, users can currently only post text comments over the network and cannot obtain the on-site sensory experience; yet this mode offers a convenience that the live mode cannot match.
To compensate for the online mode's lack of a live sensory experience, the embodiments of the present application can, when a user enjoys the content of a media file through a personal terminal device, merge at least the audio feedback data of other users for the same media file and play it together with the main audio data of the media file, thereby providing a sensory experience comparable to the live mode. In an example application scenario shown in Fig. 1a, user A, user B, user C, and user D enjoy the content of the same media file in different spaces through their respective terminal devices 1200, at the same time or at different times. User A, user B, and user C each posted a spoken comment during the same set play period. Because the users are spatially separated, none of them can actually perceive the others' sound reactions to the media file. Through the processing of the embodiments of the present application, however, while the media file is played through each user's terminal device 1200, the audio feedback data of the other users for the same media file can be merged and played together with the main audio data. Each user can thus perceive the other users' sound reactions, just as if user A, user B, user C, and user D were enjoying the media file together through the same terminal device in the same place, as shown in the lower part of Fig. 1a.
<Hardware Configuration>
Fig. 1b is a schematic structural diagram of a data processing system to which the audio data processing method according to an embodiment of the present application can be applied.
As shown in Fig. 1b, the data processing system 1000 of this embodiment includes a server 1100, a terminal device 1200, and a network 1300.
The server 1100 may be, for example, a blade server or a rack server; it may also be a server cluster deployed in the cloud, which is not limited here.
As shown in Fig. 1b, the server 1100 may include a processor 1110, a memory 1120, an interface device 1130, a communication device 1140, a display device 1150, and an input device 1160. The processor 1110 may be, for example, a central processing unit (CPU). The memory 1120 includes, for example, ROM (read-only memory), RAM (random-access memory), and non-volatile memory such as a hard disk. The interface device 1130 includes, for example, a USB interface and a serial interface. The communication device 1140 can perform, for example, wired or wireless communication. The display device 1150 is, for example, a liquid-crystal display. The input device 1160 may include, for example, a touch screen and a keyboard.
In this embodiment, the server 1100 can participate in implementing the data processing method according to any embodiment of the present application.
As applied in the embodiments of the present application, the memory 1120 of the server 1100 stores instructions for controlling the processor 1110 to operate so as to support the implementation of the processing method according to any embodiment of the present application. Technical personnel can design the instructions according to the solution disclosed in this application. How instructions control a processor to operate is well known in the art and will not be described in detail here.
Those skilled in the art should understand that, although multiple devices of the server 1100 are shown in Fig. 1b, the server 1100 of the embodiments of the present application may involve only some of them, for example only the processor 1110 and the memory 1120.
As shown in Fig. 1b, the terminal device 1200 may include a processor 1210, a memory 1220, an interface device 1230, a communication device 1240, a display device 1250, an input device 1260, an audio output device 1270, an audio pickup device 1280, and so on. The processor 1210 may be a central processing unit (CPU), a microcontroller (MCU), or the like. The memory 1220 includes, for example, ROM (read-only memory), RAM (random-access memory), and non-volatile memory such as a hard disk. The interface device 1230 includes, for example, a USB interface and a headphone interface. The communication device 1240 can perform, for example, wired or wireless communication. The display device 1250 is, for example, a liquid-crystal display or a touch display. The input device 1260 may include, for example, a touch screen and a keyboard. The terminal device 1200 can output audio information through the audio output device 1270, which includes, for example, a speaker, and can pick up voice information input by the user through the audio pickup device 1280, which includes, for example, a microphone.
The terminal device 1200 may be a smartphone, a portable computer, a desktop computer, a tablet computer, a wearable device, a smart speaker, a set-top box, a smart TV, a voice recorder, a camcorder, or the like. The terminal device 1200 may have its own audio output device 1270 for playing media files, or may be connected to an audio output device 1270 for playing media files.
In this embodiment, the terminal device 1200 can participate in implementing the data processing method according to any embodiment of the present application.
As applied in the embodiments of the present application, the memory 1220 of the terminal device 1200 stores instructions for controlling the processor 1210 to operate so as to support the implementation of the processing method according to any embodiment of the present application. Technical personnel can design the instructions according to the solution disclosed in this application. How instructions control a processor to operate is well known in the art and will not be described in detail here.
Those skilled in the art should understand that, although multiple devices of the terminal device 1200 are shown in Fig. 1b, the terminal device 1200 of the embodiments of the present application may involve only some of them, for example only the processor 1210 and the memory 1220.
The network 1300 may be a wireless or a wired network, and may be a local area network or a wide area network. The terminal device 1200 can communicate with the server 1100 through the network 1300.
The data processing system 1000 shown in Fig. 1b is merely explanatory and is by no means intended to limit the present application, its applications, or its uses. For example, although Fig. 1b shows only one server 1100 and one terminal device 1200, this does not limit their respective numbers: the data processing system 1000 may include multiple servers 1100 and/or multiple terminal devices 1200.
The audio data processing method according to any embodiment of the present application may be implemented by the server 1100, by the terminal device 1200, or jointly by the server 1100 and the terminal device 1200 as needed, which is not limited here.
<Method Embodiment 1>
Fig. 2 is a schematic flowchart of an audio data processing method according to an embodiment of the present application.
As shown in Fig. 2, the processing method of this embodiment may include the following steps S2100 to S2200.
Step S2100: obtain audio feedback data generated during the playback of the main audio data.
In this embodiment, the main audio data is the audio data of the target media file being played. The target media file may be a pure audio file or a video file, and may be a live-broadcast file or a recorded file, which is not limited here.
In this embodiment, step S2100 may obtain all the audio feedback data generated during the playback of the main audio data for merging in the following step S2200, or may obtain, according to set conditions, only part of the audio feedback data generated during the playback of the main audio data for that merging, which is not limited here.
In this embodiment, each piece of feedback content posted by any user during the playback of the main audio data, that is, during the playback of the target media file, corresponds to one piece of audio feedback data. For example, the feedback content may be a voice comment, which directly forms a piece of audio feedback data. It may also be a text comment, which can be converted into corresponding audio data; the converted audio data then forms a piece of audio feedback data. It may further be an input expression feature, such as an input emoticon or sound expression, which can likewise be converted into corresponding audio data to form a piece of audio feedback data.
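As a purely illustrative sketch (not part of the claimed embodiments; all field and function names are assumptions), one piece of audio feedback data described above could be modeled as a record that tags its source type and the playback position at which it was produced:

```python
from dataclasses import dataclass


# Hypothetical record for one piece of audio feedback; the fields are
# illustrative, not taken from the specification.
@dataclass
class AudioFeedback:
    user_id: str
    source: str           # "voice", "text", or "expression"
    audio: bytes          # audio payload after any conversion
    play_offset_s: float  # position in the main audio when feedback was made


def from_text_comment(user_id: str, text: str, offset_s: float) -> AudioFeedback:
    """Wrap a text comment as audio feedback; a real system would run
    text-to-speech here (stubbed with the raw UTF-8 bytes)."""
    synthesized = text.encode("utf-8")  # placeholder for TTS output
    return AudioFeedback(user_id, "text", synthesized, offset_s)


fb = from_text_comment("user_a", "great scene!", 312.5)
```

A voice comment would be stored the same way with `source="voice"` and the microphone capture as the payload, so the merging step in S2200 can treat all three feedback types uniformly.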
In an example, obtaining the audio feedback data generated during the playback of the main audio data in step S2100 may include: obtaining voice comments fed back during the playback of the main audio data, and using at least those voice comments as the audio feedback data generated during the playback of the main audio data.
In this example, users can input voice comments through their respective terminal devices. Taking Fig. 3 as an example, an entry for guiding the user to input a voice comment (for example, "press and hold to send a voice comment") can be provided in the play window of the target media file. The user posts a voice comment through that entry, for example by long-pressing it, and the terminal device collects the voice comment through an audio input device such as a microphone to form the audio feedback data.
In this example, the user may also post a voice comment by operating a physical button of the terminal device, which is not limited here.
In an example, obtaining the audio feedback data generated during the playback of the main audio data in step S2100 may also include: obtaining text comments fed back during the playback of the main audio data; converting the text comments into corresponding audio data; and using at least the converted audio data as the audio feedback data generated during the playback of the main audio data.
In this example, when a text comment is converted into corresponding audio data, the user's voice characteristics can be obtained from the user's previously collected voice, and the conversion can then be performed according to those characteristics so that the converted audio data reflects the user's voice. Alternatively, the conversion can use default voice characteristics, which may be set by the system or selected by the user, and are not limited here.
In this example, when a text comment is converted into corresponding audio data, the emotional characteristics expressed by the text comment can also be recognized, so that the converted audio data reflects the emotion the text comment is intended to express.
In this example, the user can input the content of the text comment through a physical keyboard, a virtual keyboard, or a touch screen provided by the terminal device, or can post a text comment by simply selecting preset text content provided by the terminal device.
In an example, obtaining the audio feedback data generated during the playback of the main audio data in step S2100 may further include: obtaining expression features fed back during the playback of the main audio data; converting the expression features into corresponding audio data; and using at least the converted audio data as the audio feedback data generated during the playback of the main audio data.
In this example, the expression features can be pre-stored in the terminal device, and the user can give emotional feedback on the main audio data being played by selecting an expression feature that conveys the user's emotion. Expression features may include symbol expressions and sound expressions; sound expressions may in turn include voice expressions and/or sound-effect expressions.
A symbol expression is a symbol, static picture, or animated picture that conveys an emotion or a theme, provided for the user to express his or her feelings during voice interaction. A symbol expression can be converted into corresponding audio data according to the emotion or feeling it conveys.
A sound expression is sound content that conveys a specific emotion or theme, provided for the user to express his or her feelings during voice interaction. For a sound expression, its sound content can be extracted directly as the converted audio data.
The sound content of a voice expression is speech corresponding to the emotion or theme the expression conveys; it is a sound expression with language content. It may be recorded by a specific person, such as a celebrity, a star, or a voice actor, according to a preset theme or content, or recorded by the user according to the user's own expressive needs.
Users usually expect to express their feelings or emotions through the language content heard when a voice expression is played.
The sound content of a sound-effect expression is a sound effect corresponding to the emotional characteristics of the expression; it is a sound expression without language content. Users usually expect to express their feelings or emotions through the sound effect produced when it is played. The sound content of a sound-effect expression can be recorded for various preset themes or expressive needs.
In this embodiment, for any play period of the target media file divided at any time interval, step S2100 may obtain all the audio feedback data accumulated for that play period, or may obtain only the audio feedback data for that play period generated within a set time, which is not limited here.
For example, the target media file is divided at 5-minute intervals: minutes 0 to 5 form the first play period, minutes 5 to 10 form the second play period, and so on. Taking the first play period as an example, step S2100 may obtain all the audio feedback data accumulated for that play period, or only the audio feedback data for that play period generated on the current day, and so on, which is not limited here.
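The period division above can be sketched as a simple offset-to-index mapping (an illustration only; the function name and defaults are assumptions, with 5-minute periods expressed in seconds):

```python
def play_period_index(offset_s: float, period_len_s: float = 300.0) -> int:
    """Map a playback offset (seconds from the start of the target media
    file) to the index of its play period: with 5-minute periods,
    period 0 covers minutes 0-5, period 1 covers minutes 5-10, and so on."""
    if offset_s < 0:
        raise ValueError("offset must be non-negative")
    return int(offset_s // period_len_s)
```

A terminal device could call this with the playback position at which a comment was posted to record the play period that the feedback belongs to.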
As another example, the target media file is a live-broadcast media file. The audio feedback data generated in any play period is then the audio feedback data generated during the broadcast time corresponding to that play period; for a live media file, the merged audio data can therefore reflect the live atmosphere while the broadcast is in progress.
In an example, step S2100 may be carried out with the participation of a server, for example the server 1100 in Fig. 1b. In this example, obtaining the audio feedback data generated during the playback of the main audio data in step S2100 may include: obtaining, from each terminal device, the audio feedback data generated by the corresponding user during the playback of the main audio data.
In an example, step S2100 may be carried out with the participation of a terminal device, for example the terminal device 1200 in Fig. 1b. In this example, obtaining the audio feedback data generated during the playback of the main audio data in step S2100 may include: obtaining, from the server, the audio feedback data generated by other users during the playback of the main audio data.
Step S2200: merge the obtained audio feedback data with the main audio data to generate a merged audio file for playback.
The merging in this embodiment may use any existing mixing technique to mix the audio feedback data with the main audio data, forming an audio file in which the audio feedback data is mixed in.
The merging in this embodiment may also mean establishing a temporal correspondence between the audio feedback data and the main audio data, forming an audio file that embodies this mapping, so that at least the main audio data and the audio feedback data can be played through different channels and the user still perceives a "mixed" result. Here, for the audio feedback part, all the audio feedback data may be mixed down to occupy one channel, or processed into multiple audio files occupying multiple channels; this is not limited, as long as the user perceives the "mixing" effect of the audio feedback data being played together with the main audio data.
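The first kind of merging above, additive mixing, can be illustrated with a minimal sketch that sums samples and clips to the signed 16-bit range (an assumption-laden toy: samples are plain integers rather than a real PCM buffer, and a production mixer would also resample and apply gain):

```python
def mix_into(main, feedback, start):
    """Additively mix a feedback clip into the main track beginning at
    sample index `start`, clipping each sum to the signed 16-bit range.
    Feedback samples that run past the end of the main track are dropped."""
    out = list(main)
    for i, sample in enumerate(feedback):
        j = start + i
        if j >= len(out):
            break
        out[j] = max(-32768, min(32767, out[j] + sample))
    return out
```

The second kind of merging, by contrast, would leave `main` and `feedback` as separate channel streams and only store the `start` offsets, letting the player align them at playback time.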
In this embodiment, for any user who is receiving the target media file, the merging of this step can continue as the target media file plays and audio feedback data keeps being generated, so that playback of the target media file continues from the continuously generated merged audio file until playback ends.
In an example, step S2200 may be carried out by the server or by the terminal device, which is not limited here.
In an example, merging the obtained audio feedback data with the main audio data in step S2200 may include: merging each piece of audio feedback data with the main audio data according to the play period of the main audio data during which that piece of audio feedback data was generated.
In this example, the play periods of the main audio data are divided based on the relative play time of the target media file, where the reference point of the relative play time is the starting play point of the target media file. For example, minutes 0 to 5 form the first play period, minutes 5 to 10 form the second play period, and so on.
In this example, the length of the play period can be set as needed; it may be fixed or adaptively adjusted.
For example, if the set play period length is 5 minutes, the play period of the main audio data corresponding to a piece of audio feedback data may be the period from minute 5 to minute 10 of the target media file. As another example, if the set play period length is 2 minutes, the play period of the main audio data corresponding to a piece of audio feedback data may be the period from minute 1 to minute 2 of the target media file.
In this example, when each terminal device obtains the audio feedback data generated by its corresponding user, it can record the generation time of the audio feedback data and the corresponding play period.
In this example, when the audio data is merged, at least for part of the audio feedback data, the starting position of each piece of audio feedback data can be aligned with the starting position of the corresponding play period of the main audio data.
In this example, when the audio data is merged, at least for part of the audio feedback data, the starting position of each piece of audio feedback data may be allowed to lag behind the corresponding play period of the main audio data. Through the merging of this example, when the user plays the merged audio data through a personal terminal device, the user can perceive how all users' audio feedback changes as the main audio data plays, including changes in the amount of feedback and/or in its content, providing a more realistic live experience.
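The alignment rule described above, a feedback clip starts at the start of its play period and may be shifted later but never earlier, can be sketched as follows (illustrative only; the function name, the 5-minute default period, and the 44.1 kHz sample rate are assumptions):

```python
def feedback_start_sample(period_index, lag_s=0.0,
                          period_len_s=300.0, sample_rate=44100):
    """Start sample for a feedback clip within the merged track: the start
    of the play period it was generated in, optionally shifted later by a
    lag, never earlier."""
    if lag_s < 0:
        raise ValueError("feedback may lag its play period, not precede it")
    return int((period_index * period_len_s + lag_s) * sample_rate)
```

The returned sample index is exactly what a mixer would use as the insertion point when combining that clip with the main audio data.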
In an example, merging the obtained audio feedback data with the main audio data in step S2200 may include the following steps S2211 to S2213.
Step S2211: obtain the number of pieces of audio feedback data generated within a set play period of the main audio data.
The set play period can be configured in advance according to real-time requirements. For example, if the count is taken in 5-minute units, the resulting set periods include: a first play period from minute 0 to minute 5, a play period from minute 5 to minute 10, and so on.
Step S2212: determine the corresponding merging effect according to that number, where the merging effect at least reflects the volume ratio of the pieces of data participating in the merge.
In step S2212, mapping data representing the correspondence between the number and the merging effect can be pre-stored, so that the merging effect corresponding to the number obtained in step S2211 can be looked up in the mapping data.
For example, the merging effects include a living-room scene effect, a theater scene effect, a square scene effect, and so on. The numbers corresponding to these scenes satisfy: living-room scene < theater scene < square scene. The volume ratios reflected by these scene effects satisfy: for the ratio of the volume of the audio feedback data to that of the main audio data, living-room scene < theater scene < square scene.
For the living-room scene effect, the corresponding number is, for example, at most 20 people. In a living-room scene, every user in the scene can clearly hear the audio feedback of the other users, so the volume ratio reflected by the living-room scene effect can be set such that, after merging, the content of each piece of merged audio feedback data can be heard on top of the content of the main audio data.
For the theater scene effect, the corresponding number is, for example, more than 20 and at most 200 people. In a theater scene, the various audio feedback in the scene is only faintly audible, so the volume ratio reflected by the theater scene effect can be set such that, after merging, the content of each piece of merged audio feedback data can be heard only faintly on top of the content of the main audio data.
For the square scene effect, the corresponding number is, for example, more than 200 people. In a square scene, individual audio feedback is unintelligible and only a general hubbub can be heard, so the volume ratio reflected by the square scene effect can be set such that, after merging, only the content of the main audio data can be made out while the content of the individual merged audio feedback data cannot be distinguished; that is, in the square scene, one perceives only the murmur of many people giving audio feedback.
In addition, if no audio feedback data is produced for a given playback period during the playback of the main audio data, the part of the merged audio data corresponding to (or lagging behind) that playback period will contain only the main audio data. When a terminal device plays that part, the user hears only the audio content of the main audio data, without any audio feedback content, and can thus experience the live atmosphere of an audience enjoying that part in complete silence.
Step S2213: according to the merging effect determined in step S2212, merge the audio feedback data produced during the set playback period with the main audio data.
In this example, the merging of step S2213 may be performed, according to the determined merging effect, in the corresponding playback period of the main audio data, or in the playback period following the corresponding playback period, which is not limited here.
Take obtaining the quantity of audio feedback data produced during the 0–5 minute playback period of the main audio data as an example: the quantity obtained in step S2211 is 15, and the merging effect determined from that quantity in step S2212 is the living-room scene effect. In step S2213, the audio feedback data produced during the 0–5 minute period is then merged with the main audio data according to the living-room scene effect. Because the merging process delays the playback time of a piece of audio feedback data relative to the time it was produced, the feedback produced during the 0–5 minute period may be merged with the part of the main audio data corresponding to the 5–10 minute period, or with the part corresponding to the 2–7 minute period, and so on. The specific delay depends on the processing speed and on the set sampling interval for reading the audio feedback data, and is not limited here.
According to steps S2211–S2213 above, the merging in this example makes the merged audio data reflect the influence of the quantity of audio feedback data on the auditory effect, thereby simulating the live effect of a corresponding number of audience members giving audio feedback on the main audio data and enhancing the user's sense of being on site.
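Steps S2211–S2213 can be sketched as follows; the thresholds (20 and 200 people), the gain values, and the function names are illustrative assumptions, not part of the disclosed method:

```python
# Hypothetical sketch of steps S2211-S2213; scene thresholds and volume
# ratios are illustrative assumptions.

SCENES = [  # (max feedback count, scene name, feedback-to-main volume ratio)
    (20, "living_room", 1.0),        # feedback clearly audible
    (200, "theater", 0.3),           # feedback only faintly audible
    (float("inf"), "square", 0.05),  # only an indistinct murmur
]

def pick_scene(feedback_count):
    """S2212: look up the merging effect corresponding to the quantity."""
    for max_count, name, ratio in SCENES:
        if feedback_count <= max_count:
            return name, ratio

def merge(main_samples, feedback_tracks):
    """S2213: mix each feedback track into the main audio at the scene ratio."""
    scene, ratio = pick_scene(len(feedback_tracks))  # S2211: quantity = len()
    mixed = list(main_samples)
    for track in feedback_tracks:
        for i, sample in enumerate(track):
            if i < len(mixed):
                mixed[i] += ratio * sample
    return scene, mixed
```

With 15 feedback tracks, `pick_scene` selects the living-room scene and the feedback is mixed at full relative volume, matching the example above.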
In an example, merging the audio feedback data with the main audio data in step S2200 may include the following steps S2221 to S2222.
Step S2221: according to the playback period of the main audio data to which each piece of audio feedback data corresponds at the time it was produced, detect the idle gaps of the main audio data adjacent to each piece of audio feedback data.
An idle gap is a time gap in the main audio data that contains no audio content.
Taking FIG. 4 as an example, in the data stream of the main audio data in FIG. 4, the hatched parts indicate audio content and the blank parts indicate the idle gaps present in the main audio data; audio feedback data can be inserted into these idle gaps.
Through step S2221, each idle gap of the main audio data can serve as a merge slot, so that a merge operation is performed in each merge slot.
Step S2222: align each piece of audio feedback data with an adjacent idle gap, and merge the audio feedback data with the main audio data.
The alignment in step S2222 may align the start position of each piece of audio feedback data with any position within the adjacent idle gap for merging; for example, the start position of each piece of audio feedback data may be aligned with the start position of the adjacent idle gap, which is not limited here.
According to steps S2221–S2222 above, merging the data by aligning each piece of audio feedback data with an adjacent idle gap of the main audio data reduces, as far as possible, the influence of the audio feedback data on the main audio data.
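A minimal sketch of steps S2221–S2222, assuming the main audio is a list of float samples; the silence threshold and minimum gap length are illustrative assumptions:

```python
# Illustrative sketch of steps S2221-S2222: find runs of near-silent
# samples in the main audio, then align a feedback clip with the start
# of the gap nearest to where the feedback was produced.

def find_idle_gaps(samples, threshold=0.01, min_len=3):
    """S2221: return (start, length) of runs with no audio content."""
    gaps, start = [], None
    for i, s in enumerate(samples + [1.0]):  # sentinel closes a trailing gap
        if abs(s) <= threshold:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                gaps.append((start, i - start))
            start = None
    return gaps

def insert_at_gap(main, feedback, produced_at):
    """S2222: mix the feedback in, aligned with the nearest gap's start."""
    gaps = find_idle_gaps(main)
    if not gaps:
        return main
    start, _ = min(gaps, key=lambda g: abs(g[0] - produced_at))
    out = list(main)
    for i, s in enumerate(feedback):
        if start + i < len(out):
            out[start + i] += s
    return out
```

Here each detected gap plays the role of a merge slot, and alignment is with the gap's start position, one of the options described above.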
In an example, merging the audio feedback data with the main audio data in step S2200 may include the following steps S2231 to S2232.
Step S2231: arrange for each piece of data, including the main audio data and the audio feedback data, to occupy a different audio track.
For example, if 10 pieces of audio feedback data are produced during the set playback period of the main audio data, merging these 10 pieces with the main audio data amounts to merging 11 pieces of audio data. In this case, 11 audio tracks can be set up so that each piece of data occupies its own track for merging.
Step S2232: merge the audio feedback data with the main audio data through audio track synthesis.
According to steps S2231–S2232 above, the audio processing of this example can use audio-track synthesis technology to merge the audio data, which helps reduce the difficulty of audio merging while achieving a good merging result.
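Steps S2231–S2232 can be sketched as follows, assuming each track is a list of float samples; a real audio-track synthesis engine would additionally handle resampling, channel layout, and clipping:

```python
# Minimal sketch of steps S2231-S2232: one track per piece of data,
# synthesized by summing the tracks sample by sample.
from itertools import zip_longest

def synthesize(main_track, feedback_tracks):
    """S2231 places each piece of data on its own track; S2232 mixes them."""
    tracks = [main_track] + feedback_tracks
    return [sum(samples) for samples in zip_longest(*tracks, fillvalue=0.0)]
```

`zip_longest` pads shorter feedback tracks with silence so the output keeps the length of the longest track.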
The various merging examples enumerated above, which are non-limiting, may be used individually or in any combination as needed.
In this embodiment, generating the merged audio data for playback in step S2200 may be: for a terminal device currently playing the target media file, after each update of the audio file by merging, playback of the updated audio file continues from the current playback position.
Generating the merged audio data for playback in step S2200 may be implemented with the participation of the terminal device or of the server.
In an example implemented with the participation of the server, generating the merged audio data for playback in this step may include: delivering the merged audio data to the terminal device for playback.
In an example implemented with the participation of the terminal device, generating the merged audio data for playback in this step may include: generating the merged audio data to drive an audio output apparatus to play it.
According to steps S2100–S2200 above, the audio data processing method of this embodiment merges the main audio data of the target media file that the user has chosen to play with the audio feedback data produced during the playback of the main audio data, so as to obtain merged audio data for playback. In this way, any user playing the target media file on their own terminal device can obtain the auditory effect of enjoying the target media file together with other people, and thus obtain a live experience.
In one embodiment, the user may be allowed to choose whether to enable the live sound effect. Therefore, in this embodiment, the processing method may further include a step of detecting whether the live sound effect is enabled, so that the operation in step S2200 of merging the acquired audio feedback data with the main audio data is performed in response to an instruction to enable the live-sound-effect function.
The above additional step in this embodiment may be implemented with the participation of the terminal device; that is, the terminal device merges the acquired audio feedback data with the main audio data in response to a user-input instruction to enable the live-sound-effect function. The instruction may be triggered by the user through a physical button of the terminal device, or through a virtual button (control) provided by the application playing the target media file; for example, the instruction may be triggered through a virtual button for enabling the live sound effect as shown in FIG. 5.
The above additional step in this embodiment may also be implemented with the participation of the server; that is, in response to an instruction sent by the terminal device to enable the live-sound-effect function, the server provides the merged audio data to the terminal device for playback, or provides the audio feedback data to the terminal device to be merged with the main audio data to form merged audio data for playback. The instruction sent by the terminal device may be generated based on an instruction triggered by the user.
This embodiment allows the user to choose whether the merged audio data should be played; a user who does not wish to hear the audio feedback data may instead choose to play only the main audio data of the target media file, enabling diversified choices.
In one embodiment, the audio feedback data participating in the merge may be the same for every user receiving the target media file. In step S2100, all audio feedback data produced during the playback of the main audio data may be acquired for merging, or a filtered subset of the audio feedback data may be acquired for merging according to set filtering conditions, which is not limited here.
In another embodiment, the audio feedback data participating in the merge may differ for different types of users; that is, according to user preferences, different audio feedback data may be filtered and merged for different types of users, producing a live effect personalized to each user.
In this other embodiment, acquiring the audio feedback sound data produced during the playback of the main audio data in step S2100 above may include: acquiring the audio feedback data, produced during the playback of the main audio data, that matches a target classification.
In this embodiment, the target classifications may be set in advance, for example based on at least one of the user's age, gender, education, preferences, and so on; for instance, five target classifications may be set according to user age.
For example, for a target classification of users aged 20 or younger (inclusive), the audio feedback data produced by such users may be acquired in step S2100 to form the audio feedback data matching that target classification.
In this other embodiment, generating the merged audio data for playback in step S2200 above includes: generating the merged audio data for playback by terminal devices matching the target classification.
Where the target classification is based on user attributes, a terminal device matching the target classification means that the user corresponding to the terminal device, i.e. the user using the terminal device, matches the target classification.
In one example, this other embodiment may be implemented with the participation of the server. In this example, step S2110 may include: for each set target classification, the server acquires the audio feedback data, produced during the playback of the target media file, that matches the corresponding target classification.
Further, the server may deliver the acquired audio feedback data matching a target classification to the terminal devices matching that target classification for merging with the main audio data.
Further, after acquiring the audio feedback data matching a target classification, the server may also merge this audio feedback data with the main audio data and deliver the merged audio data to the terminal devices matching that target classification for playback.
In one example, this other embodiment may also be implemented with the participation of the terminal device. In this example, step S2110 may include: the terminal device acquires from the server the audio feedback data matching the target classification to which its corresponding user belongs, for merging with the main audio data.
In this example, the target classification to which the corresponding user belongs, i.e. the target classification matching the terminal device, may be selected and determined by the user from the provided target classifications, or may be determined according to the user characteristics of the corresponding user.
According to the processing method of this embodiment, different live effects can be provided for different types of users, thereby improving the fit between the provided live effect and the user and enhancing the user experience.
In one embodiment, the processing method of the present application may further include: acquiring feature values of set user features corresponding to a terminal device playing the main audio data; and determining, according to the feature values, the target classification to which the terminal device belongs.
In this embodiment, the set user features corresponding to a terminal device refer to the set user features of the user corresponding to the terminal device, i.e. the set user features of the user using the terminal device.
In one example, the set user features include any one or more of age, education, gender, hobbies, and preferred language style. The feature values of these set user features may be determined from the user's registration information, from the historical usage data produced by the user using the present application (the application providing the target media file), or from the historical usage data produced by the user using other applications, which is not limited here.
In one example, the set user features may include set features of the audio feedback data produced by the user during the playback of the main audio data, for example any one or both of sound features and emotion features. In this example, the corresponding user may be assigned, according to these set features, to a target classification with a similar language style, or to a target classification with an opposite language style, which is not limited here.
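Determining the target classification from a feature value can be sketched as follows, using age and five classifications as in the example above; the brackets and labels are purely illustrative assumptions:

```python
# Hypothetical sketch: map a user's feature value (here, age) to one of
# five preset target classifications. Brackets and labels are invented.

AGE_BRACKETS = [(20, "age<=20"), (30, "21-30"), (45, "31-45"),
                (60, "46-60"), (float("inf"), "60+")]

def classify(features):
    """Return the target classification for the given feature values."""
    age = features["age"]
    for upper, label in AGE_BRACKETS:
        if age <= upper:
            return label
```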
Sound features are features related to sound attributes embodied in the audio feedback data, and may include volume features, tempo features, pitch features, and so on.
Emotion features are features related to the user's emotions or feelings embodied in the audio feedback data, and may include emotion type, emotion degree, expression subject, and so on. The emotion types may be preset according to classifications of human emotions and moods; for example, the emotion types may include anger, happiness, sadness, joy, and so on, and the emotion degree may be the degree of the corresponding emotion type. For example, the emotion type of anger may include different degrees of anger such as fury, anger, and mild annoyance.
When extracting the sound features of the audio feedback data, speech analysis may be performed on the audio feedback data to extract the corresponding volume features, tempo features, and so on. For example, common speech-signal analysis means may be used to determine the volume level, tempo, and so on of the audio feedback data, thereby obtaining its volume features, tempo features, and so on.
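As a minimal sketch of such speech-signal analysis, assuming the feedback is a list of float samples, root-mean-square energy can serve as a volume feature and the zero-crossing rate as a crude activity cue; both are common illustrative choices, not the specific means prescribed here:

```python
# Illustrative sound-feature extraction: RMS energy as the volume
# feature, zero-crossing rate as a rough activity/tempo indicator.
import math

def sound_features(samples):
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))  # volume
    crossings = sum(1 for a, b in zip(samples, samples[1:])
                    if (a < 0) != (b < 0))  # sign changes across the clip
    return {"volume": rms, "activity": crossings / len(samples)}
```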
When extracting the emotion features of the audio feedback data, the content of the audio feedback data may be converted into corresponding text, emotion keywords may be extracted from the text according to a pre-built emotion lexicon, and the emotion keywords may be structurally analyzed through an emotion structuring model to obtain their emotion type and emotion degree as the emotion features of the audio feedback data.
For example, the audio feedback data may be passed through a speech recognition engine, or a speech-to-text tool or plug-in, to obtain the corresponding text.
The emotion lexicon includes multiple emotion words that each reflect a different human emotion or mood. For example, these emotion words may be mined manually or by machine to build the emotion lexicon in advance.
According to the emotion lexicon, the words obtained by segmenting the text of the audio feedback data may be compared for similarity with the emotion words included in the lexicon, for example by cosine similarity, and the emotion words whose similarity exceeds a preset similarity threshold are extracted as emotion keywords.
The emotion structuring model may be a lexical model obtained by classifying and structurally organizing the collected emotion-related words. Each emotion word included in the emotion structuring model has a corresponding emotion type and emotion degree.
In one example, the emotion words previously obtained by manual or machine mining may be classified at different levels according to human emotions or moods. For example, each emotion type forms a major category, each major category includes the emotion words belonging to that emotion type, and within each major category the words are further subdivided into sub-categories by emotion degree; within each sub-category, the emotion words may be sorted by emotion degree. This forms a hierarchy of classification levels by which the emotion words are organized, yielding the emotion structuring model.
By structurally analyzing an emotion keyword through the emotion structuring model, the emotion word corresponding to the emotion keyword can be found in the model, and the emotion type and emotion degree of the keyword are determined according to the emotion type and emotion degree of that word, thereby obtaining the emotion features of the audio feedback data.
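The keyword lookup against the emotion structuring model can be sketched as follows; the lexicon entries and degrees are invented for illustration, and exact word matching stands in for the cosine-similarity comparison described above:

```python
# Hypothetical sketch of the emotion-feature pipeline: keywords are
# matched against a structured lexicon mapping each emotion word to its
# (emotion type, emotion degree). All entries below are illustrative.

EMOTION_MODEL = {
    "furious": ("anger", 3), "angry": ("anger", 2), "annoyed": ("anger", 1),
    "overjoyed": ("happiness", 3), "happy": ("happiness", 2),
}

def emotion_features(text):
    """Extract emotion keywords from text and look up type and degree."""
    keywords = [w for w in text.lower().split() if w in EMOTION_MODEL]
    return [(w,) + EMOTION_MODEL[w] for w in keywords]
```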
Where the audio feedback data takes the form of an expression feature, the required feature value of the set feature may be determined directly from that expression feature; for example, for an angry expression represented by an expression feature, the feature value of the corresponding emotion feature can be determined directly from it.
This step of this embodiment may be implemented by the server according to the feature values of the set user features of the corresponding user provided by the terminal device, or may be implemented by the terminal device. In the example implemented by the terminal device, each terminal device determines the target classification to which its corresponding user belongs.
According to the processing method of this embodiment, determining the target classification to which a user or terminal device belongs from the feature values of user features can improve the accuracy of determining the target classification, and frees the user from setting the desired target classification through additional operations, thereby achieving intelligent classification.
In one embodiment, the main audio data is the audio data of a video file. In this case, the processing method of this embodiment may further include: displaying, in the video playback window of the video file, audio waveforms representing the audio feedback data in the form of a bullet screen.
An audio waveform representing audio feedback data is a graphical expression of that audio feedback data, for example the audio waveforms displayed in the playback window shown in FIG. 5.
In one example, the sound features and emotion features of the audio feedback data may first be acquired, and the audio waveform may then be generated according to those sound features and emotion features.
In one example, the display shape of the audio waveform may be set according to the sound features of the audio feedback data.
In this example, the display shape may include the amplitude of the audio waveform, the waveform period interval, the waveform duration, and so on. For example, where the sound features of the audio feedback data include a tempo feature and a volume feature, the waveform period interval may be set according to the tempo reflected by the tempo feature (e.g. the faster the tempo, the shorter the period interval), and the waveform amplitude may be set according to the volume reflected by the volume feature (e.g. the louder the volume, the larger the amplitude).
In one example, the display color of the audio waveform may be set according to the emotion features of the audio feedback data.
In this example, different display colors may be set for different emotion types: for instance, the emotion type "anger" is displayed in red and the emotion type "happiness" in green. For different degrees of the same emotion type, different shades of the same color may be set: for example, for the emotion type "happiness", the degree "overjoyed" is displayed in dark green while the degree "slightly happy" is displayed in light green, and so on.
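Mapping the sound and emotion features to the waveform's display attributes can be sketched as follows; the color table and scaling constants are illustrative assumptions:

```python
# Hypothetical sketch: derive waveform display attributes from the
# feedback's sound and emotion features. Color names and constants are
# invented for illustration.

COLOR = {("happiness", "high"): "dark green",
         ("happiness", "low"): "light green",
         ("anger", "high"): "dark red",
         ("anger", "low"): "light red"}

def waveform_style(volume, tempo, emotion_type, emotion_degree):
    return {
        "amplitude": volume * 10.0,  # louder -> larger waveform amplitude
        "period": 1.0 / tempo,       # faster tempo -> shorter period interval
        "color": COLOR.get((emotion_type, emotion_degree), "gray"),
    }
```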
According to the processing method of this embodiment, displaying the audio waveforms in the video playback window in the form of a bullet screen lets the user, while obtaining the live auditory effect, also intuitively perceive the sound features and emotion features of other users through the graphical expression of the audio feedback data.
<Example 1>
FIG. 6a is an exemplary flowchart of a method for processing audio data according to one example of the present application. In this example, the audio feedback data provided by the server to each terminal device may be the same; therefore, only one terminal device is shown in the figure. In this example, the processing method may include the following steps.
Step S1210: the terminal device 1200 collects the audio feedback data produced by its corresponding user during the playback of the target media file, i.e. during the playback of the main audio data, and uploads it to the server 1100.
In other examples, the terminal device 1200 shown in the figure may produce no audio feedback data; instead, other terminal devices 1200 collect the audio feedback data produced by their corresponding users during the playback of the target media file and upload it to the server 1100.
Step S1110: the server 1100 acquires the audio feedback data uploaded by the terminal devices, including the terminal device shown in the figure.
Step S1120: the server 1100 delivers the acquired audio feedback data to each terminal device 1200 currently playing the target media file for merging of the audio data.
Step S1220: the terminal device 1200 acquires the audio feedback data provided by the server 1100.
Step S1230: the terminal device 1200 merges the acquired audio feedback data with the main audio data of the target media file to produce the merged target media file.
The terminal device 1200 merges the main audio data and the acquired audio feedback data, for example by means of audio mixing.
Step S1240: when playing the target media file, the terminal device 1200 plays the merged audio data instead of the main audio data alone; that is, while listening to the main audio data, the user corresponding to the terminal device 1200 can at least also hear the audio feedback data produced by other users during the playback of the main audio data.
<例子2><Example 2>
Fig. 6b is an exemplary flowchart of an audio data processing method according to another example of the present application. In this example, the audio feedback data that the server provides to each terminal device may differ. The figure shows two terminal devices belonging to different target categories, terminal device 1200-1 and terminal device 1200-2. In this example, the processing method may include the following steps:
Step S1210-1: the terminal device 1200-1 collects the audio feedback data produced by its user during playback of the target media file and uploads it to the server 1100.
Step S1210-2: the terminal device 1200-2 collects the audio feedback data produced by its user during playback of the target media file and uploads it to the server 1100.
In other examples, the terminal device 1200-1 and/or the terminal device 1200-2 shown in the figure may produce no audio feedback data; instead, other terminal devices 1200 collect the audio feedback data produced by their users during playback of the target media file and upload it to the server 1100.
Step S1110: the server 1100 obtains the audio feedback data uploaded by the terminal devices, including the terminal device 1200-1 and the terminal device 1200-2.
Step S1120-1: the server 1100 delivers the user sound data that was produced during playback of the target media file and that matches the target category of the terminal device 1200-1 to the terminal device 1200-1 for audio data merging.
Step S1120-2: the server 1100 delivers the user sound data that was produced during playback of the target media file and that matches the target category of the terminal device 1200-2 to the terminal device 1200-2 for audio data merging.
Step S1220-1: the terminal device 1200-1 obtains the audio feedback data provided by the server 1100.
Step S1230-1: the terminal device 1200-1 merges the acquired audio feedback data with the main audio data of the target media file to produce combined audio data A.
Step S1240-1: while playing the target media file, the terminal device 1200-1 plays the combined audio data A. The resulting listening effect is that the user of the terminal device 1200-1, while listening to the main audio data, also hears the audio feedback data that matches the target category of the terminal device 1200-1.
Step S1220-2: the terminal device 1200-2 obtains the audio feedback data provided by the server 1100.
Step S1230-2: the terminal device 1200-2 merges the acquired audio feedback data with the main audio data of the target media file to produce combined audio data B.
Step S1240-2: while playing the target media file, the terminal device 1200-2 plays the combined audio data B. The resulting listening effect is that the user of the terminal device 1200-2, while listening to the main audio data, also hears the audio feedback data that matches the target category of the terminal device 1200-2.
In this example, because the terminal device 1200-1 and the terminal device 1200-2 belong to different target categories, the combined audio data A differs from the combined audio data B, giving every user a personalized live effect.
<Method Embodiment 2>
Fig. 7 is a schematic flowchart of the audio data processing method according to this embodiment. The processing method is implemented by a terminal device, for example the terminal device 1200 in Fig. 1. The terminal device in this embodiment may or may not have a display apparatus, and it may have its own audio output apparatus or connect to an external audio output apparatus wirelessly or by wire.
As shown in Fig. 7, the method of this embodiment may include the following steps S7100 to S7300:
Step S7100: the terminal device 1200 obtains the main audio data selected for playback.
The main audio data selected for playback is the audio data of the target media file selected by the user of the terminal device 1200; the target media file may be a pure audio file or a video file.
Step S7200: the terminal device 1200 obtains the live audio data corresponding to the main audio data, where the live audio data includes at least other users' audio feedback data for the main audio data.
The live audio data may also include the audio feedback data that the user of the terminal device 1200 produced for the main audio data. That is, for any terminal device 1200, not only the audio feedback data of other users but also the audio feedback data produced by the user of that terminal device 1200 may participate in the merging of audio data.
In step S7200, the acquired live audio data may be live audio data matching the target category of the terminal device 1200, or live audio data that is the same for every terminal device 1200; this is not limited here.
In one example, the terminal device 1200 may obtain all audio feedback data from the server, including audio feedback data produced by other users and, possibly, the audio feedback data produced by the user of the terminal device 1200 itself.
In another example, the terminal device 1200 may obtain only the audio feedback data produced by other users from the server, and obtain the audio feedback data produced by its own user locally.
In one example, after obtaining the live audio data, the terminal device 1200 merges the live audio data with the main audio data to obtain the combined audio data.
In another example, the merging may be performed by the server, which then provides the result to the terminal device. In that case, steps S7100 and S7200 amount to obtaining the combined audio data, which includes the main audio data and the live audio data.
Step S7300: the terminal device 1200 performs a processing operation of playing the corresponding live audio data while playing the main audio data.
In the example in which the terminal device 1200 merges the live audio data with the main audio data, this processing operation includes the merging itself, as well as driving the audio output apparatus according to the combined audio data to play the corresponding live audio data while the main audio data plays. The merging may use any one or more of the approaches provided in method embodiment 1 above, which are not repeated here.
In the example in which the terminal device 1200 directly receives the combined audio data provided by the server 1100, the processing operation includes: driving the audio output apparatus according to the combined audio data to play the corresponding live audio data while the main audio data plays.
In this step, the terminal device 1200 may drive the audio output apparatus according to the combined audio data, for example according to the mixed audio data or according to the correspondence between the main audio data and the live audio data, so that the corresponding live audio data plays while the main audio data plays, achieving the live effect of enjoying the target media file together with other people.
In this embodiment, the terminal device may be a smartphone, a portable computer, a desktop computer, a tablet computer, a wearable device, a smart speaker, a set-top box, a smart TV, a voice recorder, a camcorder, and so on; this is not limited here.
According to the processing method of this embodiment, while playing the target media file selected by the user, the terminal device can play the acquired live audio data along with the main audio data of the target media file, so that the user hears the main audio data and the live audio data mixed together. Therefore, when any user plays the target media file through his or her own terminal device, the user obtains the listening effect of enjoying the target media file together with other people, and thereby obtains a live experience.
<Method Embodiment 3>
Fig. 8 is a schematic flowchart of the audio data processing method according to this embodiment. The processing method is implemented by a terminal device, for example the terminal device 1200 in Fig. 1. The terminal device in this embodiment may or may not have a display apparatus, and it may have its own audio output apparatus or connect to an external audio output apparatus wirelessly or by wire.
As shown in Fig. 8, the processing method of this embodiment may include the following steps S8100 to S8300:
Step S8100: in response to an operation of playing a target media file, the terminal device 1200 plays the target media file, where the target media file includes main audio data.
Step S8200: obtain the live audio data corresponding to the main audio data, where the live audio data includes at least other users' audio feedback data for the main audio data.
In step S8200, the live audio data may also include audio feedback data for the main audio data from the user of the terminal device, that is, the local user. The local user's audio feedback data may be obtained from the server together with the audio feedback data of other users, or obtained locally; this is not limited here.
In one example, obtaining the live audio data corresponding to the main audio data in step S8200 may include: obtaining other users' audio feedback data for the main audio data from the server to form the live audio data.
In one example, obtaining the live audio data corresponding to the main audio data in step S8200 may further include: obtaining, from the server or locally, the audio feedback data of the terminal device's user for the main audio data to form the live audio data.
Step S8300: while playing the target media file, the terminal device 1200 performs a processing operation of playing the live audio data along with the main audio data of the target media file. In one example, this processing operation may include: the terminal device 1200 merges the live audio data with the main audio data, and drives the audio output apparatus according to the combined audio data to play the live audio data while the main audio data plays. The merging may use any one or more of the approaches provided in method embodiment 1 above, which are not repeated here.
In another example, the processing operation may include: the terminal device 1200 obtains the combined audio data provided by the server 1100, where the combined audio data is obtained by merging the main audio data with the live audio data, and drives the audio output apparatus according to the combined audio data to play the live audio data while the main audio data plays.
In step S8300, the terminal device 1200 drives the audio output apparatus according to the form of the merge, for example a mixed-down form or a multi-channel form, to play the live audio data while the main audio data plays, so that during playback of the target media file the corresponding live audio data accompanies the main audio data, achieving the live effect of enjoying the target media file together with other people.
In one embodiment, the processing method may further include: obtaining the audio feedback data that the terminal device's user feeds back for the main audio data, and uploading that user's audio feedback data to the server.
According to this embodiment, after the user's audio feedback data is uploaded to the server, the server can send it to the terminal devices of other users, so that other users who are also playing the target media file can receive this user's audio feedback data.
<Apparatus Embodiment>
Fig. 9 is a schematic block diagram of an audio data processing apparatus according to an embodiment of the present application.
As shown in Fig. 9, the processing apparatus 9000 of this embodiment includes a data acquisition module 9100 and an audio processing module 9200.
The data acquisition module 9100 is configured to obtain the audio feedback data produced during playback of the main audio data.
The audio processing module 9200 is configured to merge the audio feedback data with the main audio data and generate the combined audio data for playback.
In one embodiment, when merging the audio feedback data with the main audio data, the audio processing module 9200 may be configured to: obtain the quantity of audio feedback data produced within a set playback period of the main audio data; determine a corresponding merging effect according to that quantity, where the merging effect reflects at least the volume ratio of the data participating in the merge; and merge the audio feedback data produced within the set playback period with the main audio data according to the merging effect.
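One way such a count-dependent merging effect could be realized is sketched below. The concrete thresholds and gain values are invented for illustration and are not specified by the application; the idea is only that the more feedback clips a period contains, the more each individual clip is attenuated so the main audio stays intelligible.

```python
def merging_effect(feedback_count):
    """Pick a volume ratio (main_gain, feedback_gain) from the number of
    feedback clips produced in the set playback period.

    Thresholds and gains below are illustrative assumptions only.
    """
    if feedback_count == 0:
        return 1.0, 0.0                   # nothing to mix in
    if feedback_count <= 5:
        return 1.0, 0.5                   # few clips: each can be fairly loud
    if feedback_count <= 50:
        return 1.0, 0.2                   # a crowd: attenuate each clip
    return 1.0, 1.0 / feedback_count      # huge crowd: keep the total level bounded

print(merging_effect(3))    # (1.0, 0.5)
print(merging_effect(200))  # (1.0, 0.005)
```

The returned pair can then be applied as per-source gains during the actual sample mixing.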
In one embodiment, when merging the audio feedback data with the main audio data, the audio processing module 9200 may be configured to: detect, according to the playback period of the main audio data during which each piece of audio feedback data was produced, the idle gaps of the main audio data adjacent to that piece of audio feedback data; and align each piece of audio feedback data with its adjacent idle gap for merging.
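The gap-alignment idea can be sketched as follows, under illustrative assumptions (float samples, a fixed silence threshold, and a minimum gap length): scan the main audio for runs of near-silent samples, then shift each feedback clip's start position to the nearest such gap.

```python
def find_gaps(samples, threshold=0.05, min_len=3):
    """Return (start, length) for each run of near-silent samples."""
    gaps, start = [], None
    for i, s in enumerate(samples):
        if abs(s) < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                gaps.append((start, i - start))
            start = None
    if start is not None and len(samples) - start >= min_len:
        gaps.append((start, len(samples) - start))
    return gaps

def align_to_gap(feedback_pos, gaps):
    """Shift a feedback clip's start position to the nearest idle gap."""
    if not gaps:
        return feedback_pos  # no gap available: keep the original position
    return min((g[0] for g in gaps), key=lambda s: abs(s - feedback_pos))

main = [0.9, 0.8, 0.0, 0.01, 0.02, 0.7, 0.9, 0.0, 0.0, 0.01, 0.0]
gaps = find_gaps(main)
print(gaps)                   # [(2, 3), (7, 4)]
print(align_to_gap(6, gaps))  # 7
```

In practice the gap search would run on energy per frame rather than raw samples, but the alignment step is the same.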
In one embodiment, when merging the audio feedback data with the main audio data, the audio processing module 9200 may be configured to: assign each piece of data, including the main audio data and each piece of audio feedback data, its own audio track; and merge the audio feedback data with the main audio data through audio track synthesis.
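A minimal sketch of this track-based variant is shown below; the plain dict used as a track container is an assumption for illustration, where a real implementation would use an audio engine's track objects.

```python
def build_tracks(main, feedback_clips):
    """Give the main audio and each feedback clip its own named track."""
    tracks = {"main": main}
    for n, clip in enumerate(feedback_clips):
        tracks[f"feedback-{n}"] = clip
    return tracks

def synthesize(tracks):
    """Down-mix all tracks by summing time-aligned samples."""
    length = max(len(t) for t in tracks.values())
    out = [0.0] * length
    for t in tracks.values():
        for i, s in enumerate(t):
            out[i] += s
    # clamp to the valid sample range
    return [max(-1.0, min(1.0, s)) for s in out]

tracks = build_tracks([0.25, 0.25, 0.25], [[0.125], [0.0, 0.25]])
print(sorted(tracks))      # ['feedback-0', 'feedback-1', 'main']
print(synthesize(tracks))  # [0.375, 0.5, 0.25]
```

Keeping each source on its own track until the final synthesis makes per-source gains (such as the volume ratio above) easy to apply independently.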
In one embodiment, the processing apparatus 9000 may further include a detection module configured to detect whether the live sound effect function is enabled, and, in response to an instruction to enable the live sound effect function, notify the audio processing module 9200 to perform the operation of merging the audio feedback data with the main audio data.
In one embodiment, when obtaining the audio feedback data produced during playback of the main audio data, the data acquisition module 9100 may be configured to: obtain the voice comments fed back during playback of the main audio data, and use at least the voice comments as the audio feedback data.
In one embodiment, when obtaining the audio feedback data produced during playback of the main audio data, the data acquisition module 9100 may be configured to: obtain the text comments fed back during playback of the main audio data; convert the text comments into corresponding audio data; and use at least the converted audio data as the audio feedback data.
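The text-comment path can be sketched as follows. The `text_to_speech` function here is a hypothetical stand-in that only fabricates silent samples of a plausible length; a real system would call an actual speech synthesis engine at this point.

```python
def text_to_speech(text):
    """Hypothetical TTS stand-in: returns placeholder PCM samples whose
    length grows with the text length (a real engine returns real speech)."""
    return [0.0] * (len(text) * 160)  # assumed ~160 samples per character

def comments_to_feedback(comments):
    """Convert text comments fed back during playback into audio clips
    that can join the merge like any other audio feedback data."""
    return [text_to_speech(c) for c in comments if c.strip()]

clips = comments_to_feedback(["great song!", "  ", "encore"])
print(len(clips))     # 2 (the blank comment is dropped)
print(len(clips[0]))  # 1760
```

Once converted, these clips are indistinguishable from recorded voice comments as far as the merging step is concerned.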
In one embodiment, when obtaining the audio feedback data produced during playback of the main audio data, the data acquisition module 9100 may be configured to: obtain the expression features fed back during playback of the main audio data; convert the expression features into corresponding audio data; and use at least the converted audio data as the audio feedback data.
In one embodiment, when obtaining the audio feedback data produced during playback of the main audio data, the data acquisition module 9100 may be configured to obtain the audio feedback data produced during playback of the main audio data that matches a target category, so that the audio processing module 9200 generates combined audio data for playback by terminal devices matching that target category.
In one embodiment, the processing apparatus 9000 may further include a classification module configured to: obtain the value of a set user characteristic corresponding to a terminal device that plays the main audio data; and determine, according to that value, the target category to which the terminal device belongs.
In one embodiment, the set user characteristic may include a set characteristic of the audio feedback data produced by the terminal device's user during playback of the main audio data.
In one embodiment, the main audio data is the audio data of a video file, and the processing apparatus 9000 may further include a display processing module configured to display, in a display window, audio waveforms representing the audio feedback data in the form of bullet comments.
<Device Embodiment>
This embodiment provides an electronic device. As shown in Fig. 10a, the electronic device 100 includes the processing apparatus 9000 according to any embodiment of the present application.
In another embodiment, as shown in Fig. 10b, the electronic device 100 may include a memory 110 and a processor 120. The memory 110 is configured to store executable instructions; the processor 120 is configured to execute, under the control of the executable instructions, the processing method of any method embodiment of the present application.
In this embodiment, the electronic device 100 may be a server, for example the server 1100 in Fig. 1, or any terminal device, for example the terminal device 1200 in Fig. 1, or may include both a server and a terminal device, for example the server 1100 and the terminal device 1200 in Fig. 1; this is not limited here.
In one embodiment, the electronic device 100 is a terminal device, which may be a device with a display apparatus or a device without one, for example a set-top box or a smart speaker.
In one embodiment, the electronic device 100 is a terminal device that may further include an input apparatus through which the corresponding user publishes feedback content for the main audio data. The input apparatus sends the feedback content to the processing apparatus 9000 or the processor 120, which generates the user's audio feedback data for the main audio data from that feedback content.
The input apparatus may include at least one of an audio input apparatus, a physical keyboard, a virtual keyboard, and a touch screen.
Further, the processing apparatus or processor of the terminal device may also control a communication apparatus to send the corresponding user's audio feedback data to the server, so that the server can send it to the terminal devices of other users. In this way, other users receive this user's audio feedback data while playing the same target media file.
In one embodiment, the electronic device 100 is a terminal device that may further include an audio output apparatus, which, under the control of the processing apparatus or processor, plays the corresponding audio feedback data while the main audio data plays. Of course, in other embodiments, the terminal device may instead connect to an audio output apparatus by wire or wirelessly to play the combined audio data.
<Medium Embodiment>
This embodiment further provides a computer-readable storage medium storing a computer program that can be read and run by a computer, where the computer program, when read and run by the computer, executes the audio data processing method of any of the above embodiments of the present application.
The present application may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present application.
The computer-readable storage medium may be a tangible device that can hold and store instructions used by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, and a mechanical encoding device, for example a punch card or a raised structure in a groove with instructions stored thereon, as well as any suitable combination of the foregoing. The computer-readable storage medium used here is not to be construed as a transient signal itself, such as a radio wave or another freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or another transmission medium (for example, a light pulse through a fiber-optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described here may be downloaded from the computer-readable storage medium to each computing/processing device, or downloaded to an external computer or external storage device over a network, for example the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in the computer-readable storage medium in that computing/processing device.
The computer program instructions used to perform the operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-related instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, for example a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), may be personalized using the state information of the computer-readable program instructions, and the electronic circuit may execute the computer-readable program instructions to implement various aspects of the present application.
这里参照根据本申请实施例的方法、装置(系统)和计算机程序产品的流程图和/或 框图描述了本申请的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。Here, various aspects of the present application are described with reference to the flowcharts and/or block diagrams of the methods, devices (systems) and computer program products according to the embodiments of the present application. It should be understood that each block of the flowcharts and/or block diagrams and combinations of blocks in the flowcharts and/or block diagrams can be implemented by computer-readable program instructions.
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理器,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理器执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。These computer-readable program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, or other programmable data processing device, thereby producing a machine such that when these instructions are executed by the processor of the computer or other programmable data processing device , A device that implements the functions/actions specified in one or more blocks in the flowchart and/or block diagram is produced. It is also possible to store these computer-readable program instructions in a computer-readable storage medium. These instructions make computers, programmable data processing apparatuses, and/or other devices work in a specific manner, so that the computer-readable medium storing instructions includes An article of manufacture, which includes instructions for implementing various aspects of the functions/actions specified in one or more blocks in the flowchart and/or block diagram.
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。It is also possible to load computer-readable program instructions onto a computer, other programmable data processing device, or other equipment, so that a series of operation steps are executed on the computer, other programmable data processing device, or other equipment to produce a computer-implemented process , So that the instructions executed on the computer, other programmable data processing apparatus, or other equipment realize the functions/actions specified in one or more blocks in the flowcharts and/or block diagrams.
附图中的流程图和框图显示了根据本申请的多个实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。对于本领域技术人员来说公知的是,通过硬件方式实现、通过软件方式实现以及通过软件和硬件结合的方式实现都是等价的。The flowcharts and block diagrams in the drawings show the possible implementation of the system architecture, functions, and operations of the system, method, and computer program product according to multiple embodiments of the present application. In this regard, each block in the flowchart or block diagram may represent a module, program segment, or part of an instruction, and the module, program segment, or part of an instruction contains one or more functions for implementing the specified logical function. Executable instructions. In some alternative implementations, the functions marked in the block may also occur in a different order from the order marked in the drawings. For example, two consecutive blocks can actually be executed in parallel, or they can sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagram and/or flowchart, and the combination of the blocks in the block diagram and/or flowchart, can be implemented by a dedicated hardware-based system that performs the specified functions or actions Or it can be realized by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present application have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical applications, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present application is defined by the appended claims.

Claims (24)

  1. A method for processing audio data, comprising:
    acquiring audio feedback data generated during playback of main audio data;
    merging the audio feedback data with the main audio data to generate merged audio data for playback.
  2. The processing method according to claim 1, wherein merging the audio feedback data with the main audio data comprises:
    acquiring the quantity of audio feedback data generated within a set playback period of the main audio data;
    determining a corresponding merging effect according to the quantity, wherein the merging effect reflects at least the volume ratio of the data items participating in the merge;
    merging, according to the merging effect, the audio feedback data generated within the set playback period with the main audio data.
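The quantity-driven merging effect of claim 2 can be sketched as follows. This is an illustrative interpretation only: the claim prescribes no concrete formula, so the linear mapping from feedback quantity to feedback-track gain, the `max_count` normalizer, and the 0.5 gain cap are all hypothetical choices.

```python
def feedback_gain(feedback_count, max_count=50):
    """Map the number of feedback clips in a playback period to a
    feedback-track gain (hypothetical linear ramp, capped at 0.5)."""
    return min(feedback_count / max_count, 1.0) * 0.5

def merge_samples(main, feedback, feedback_count):
    """Mix main-audio and feedback samples using the volume ratio
    derived from the feedback quantity: more feedback, louder crowd."""
    g = feedback_gain(feedback_count)
    return [(1.0 - g) * m + g * f for m, f in zip(main, feedback)]

# 25 feedback clips in this period -> feedback gain 0.25, main gain 0.75.
mixed = merge_samples([0.8, -0.4], [0.2, 0.6], feedback_count=25)
```

The key property the claim asks for is only that the volume ratio is a function of the quantity; any monotone mapping would serve equally well in this sketch.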
  3. The processing method according to claim 1, wherein merging the audio feedback data with the main audio data comprises:
    detecting idle gaps in the main audio data adjacent to each item of audio feedback data, according to the playback period of the main audio data during which each item of audio feedback data was generated;
    aligning each item of audio feedback data with its adjacent idle gap to perform the merge.
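One way to read claim 3 is silence detection followed by nearest-gap alignment. The sketch below is a hypothetical interpretation: the amplitude threshold, the minimum gap length, and the "nearest gap start" alignment rule are assumptions, not requirements stated in the claim.

```python
def find_idle_gaps(samples, rate, threshold=0.01, min_len=0.5):
    """Return (start, end) times, in seconds, of runs where |sample|
    stays below threshold for at least min_len seconds."""
    gaps, start = [], None
    for i, s in enumerate(samples + [1.0]):  # loud sentinel closes a trailing run
        if abs(s) < threshold:
            if start is None:
                start = i
        elif start is not None:
            if (i - start) / rate >= min_len:
                gaps.append((start / rate, i / rate))
            start = None
    return gaps

def nearest_gap(gaps, t):
    """Pick the idle gap whose start is closest to feedback time t."""
    return min(gaps, key=lambda g: abs(g[0] - t))

# Toy signal at 10 samples/s: loud, quiet, loud, quiet.
samples = [0.5] * 5 + [0.0] * 6 + [0.5] * 5 + [0.0] * 7
gaps = find_idle_gaps(samples, rate=10)
target = nearest_gap(gaps, t=2.0)  # feedback produced at t = 2.0 s
```

A feedback clip would then be shifted so it begins at `target[0]`, keeping it out of the way of the main audio.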
  4. The processing method according to claim 1, wherein merging the audio feedback data with the main audio data comprises:
    assigning each data item, including the main audio data and the audio feedback data, its own separate audio track;
    merging the audio feedback data with the main audio data through audio track synthesis.
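The per-track merging of claim 4 can be sketched as a sample-wise sum over tracks. This is a minimal illustration, not the claimed implementation; real track synthesis would also handle resampling, clipping, and per-track gain.

```python
def synthesize_tracks(tracks):
    """Sum per-track sample lists into one mixed track, padding shorter
    tracks with silence so every track spans the full duration."""
    length = max(len(t) for t in tracks)
    padded = [t + [0.0] * (length - len(t)) for t in tracks]
    return [sum(col) for col in zip(*padded)]

main_track = [0.3, 0.3, 0.3, 0.3]   # the main audio data on its own track
feedback_a = [0.1, 0.1]             # a short cheer on a second track
feedback_b = [0.0, 0.2, 0.2]        # a voice comment on a third track
mixed_track = synthesize_tracks([main_track, feedback_a, feedback_b])
```

Keeping each item on its own track, as the claim does, means any clip can later be re-timed or re-weighted without touching the others.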
  5. The processing method according to claim 1, wherein acquiring the audio feedback data generated during playback of the main audio data comprises:
    acquiring audio feedback data, generated during playback of the main audio data, that matches a target classification;
    and wherein generating the merged audio data for playback comprises:
    generating merged audio data for playback by terminal devices matching the target classification.
  6. The processing method according to claim 5, further comprising:
    acquiring the feature value of a set user feature corresponding to a terminal device playing the main audio data;
    determining, according to the feature value, the target classification corresponding to the terminal device.
  7. The processing method according to claim 6, wherein the set user feature comprises a set feature of the audio feedback data generated, during playback of the main audio data, by the user corresponding to the terminal device.
  8. The processing method according to claim 1, wherein the main audio data is audio data of a video file, and the method further comprises:
    displaying, in the video playback window of the video file, an audio waveform representing the audio feedback data in the form of a bullet-screen comment.
  9. The processing method according to claim 1, wherein acquiring the audio feedback data generated during playback of the main audio data comprises:
    acquiring voice comments fed back during playback of the main audio data, and using at least the voice comments as the audio feedback data.
  10. The processing method according to claim 1, wherein acquiring the audio feedback data generated during playback of the main audio data comprises:
    acquiring text comments fed back during playback of the main audio data;
    converting the text comments into corresponding audio data, and using at least the converted audio data as the audio feedback data.
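In practice the conversion in claim 10 would call a text-to-speech engine; the sketch below substitutes a placeholder synthesizer (a hypothetical 160-samples-per-character silent waveform) so that only the data flow of the claim is shown, not any real TTS API.

```python
def text_to_audio(comment, tts=None):
    """Convert one text comment into audio feedback data. A real system
    would pass a TTS engine as `tts`; the default stub just emits a
    placeholder waveform whose length tracks the comment length."""
    synthesize = tts or (lambda text: [0.0] * (len(text) * 160))
    return synthesize(comment)

def collect_feedback(text_comments):
    """Convert every text comment fed back during playback into audio
    data, to be merged with the main audio as audio feedback data."""
    return [text_to_audio(c) for c in text_comments]

clips = collect_feedback(["nice!", "encore"])
```

Injecting the synthesizer as a parameter keeps the conversion step testable while leaving the choice of TTS engine open, as the claim itself does.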
  11. The processing method according to claim 1, wherein acquiring the audio feedback data generated during playback of the main audio data comprises:
    acquiring expression features fed back during playback of the main audio data;
    converting the expression features into corresponding audio data, and using at least the converted audio data as the audio feedback data.
  12. The processing method according to claim 1, wherein the main audio data is audio data of a live-streaming media file.
  13. The processing method according to any one of claims 1 to 12, further comprising:
    performing the operation of merging the audio feedback data with the main audio data in response to an instruction to enable a live sound effect function.
  14. A method for processing audio data, implemented by a terminal device, the method comprising:
    acquiring main audio data selected for playback;
    acquiring live audio data corresponding to the main audio data, wherein the live audio data comprises at least other users' audio feedback data on the main audio data;
    performing a processing operation of playing the live audio data while playing the main audio data.
  15. The processing method according to claim 14, wherein the live audio data further comprises audio feedback data on the main audio data from the user corresponding to the terminal device.
  16. A method for processing audio data, implemented by a terminal device, the method comprising:
    playing a target media file in response to an operation of playing the target media file, wherein the target media file comprises main audio data;
    acquiring live audio data corresponding to the main audio data, wherein the live audio data comprises at least other users' audio feedback data on the main audio data;
    performing, during playback of the target media file, a processing operation of playing the live audio data along with the main audio data.
  17. The processing method according to claim 16, wherein acquiring the live audio data corresponding to the main audio data comprises:
    acquiring, from a server, other users' audio feedback data on the main audio data as the live audio data.
  18. The processing method according to claim 16, further comprising:
    acquiring audio feedback data on the main audio data from the user corresponding to the terminal device;
    uploading the user's audio feedback data to a server.
  19. An apparatus for processing audio data, comprising:
    a data acquisition module configured to acquire audio feedback data generated during playback of main audio data; and an audio processing module configured to merge the audio feedback data with the main audio data to generate merged audio data for playback.
  20. An electronic device, comprising the processing apparatus according to claim 19; or comprising:
    a memory configured to store executable instructions;
    a processor configured to, under control of the executable instructions, operate the electronic device to perform the processing method according to any one of claims 1 to 18.
  21. The electronic device according to claim 20, wherein the electronic device is a terminal device without a display apparatus.
  22. The electronic device according to claim 20, wherein the electronic device is a terminal device further comprising an input apparatus, the input apparatus being configured to allow a corresponding user to input feedback content on the main audio data and to send the feedback content to the processing apparatus or processor, so that the processing apparatus or processor generates, according to the feedback content, the corresponding user's audio feedback data on the main audio data.
  23. The electronic device according to claim 20, wherein the electronic device is a terminal device further comprising an audio output apparatus, the audio output apparatus being configured to play, under control of the processing apparatus or the processor, the corresponding audio feedback data while playing the main audio data.
  24. A computer-readable storage medium storing a computer program readable and executable by a computer, the computer program being configured to, when read and run by the computer, perform the processing method according to any one of claims 1 to 18.
PCT/CN2020/099864 2019-07-10 2020-07-02 Audio data processing method and apparatus, and electronic device WO2021004362A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910619886.0A CN112287129A (en) 2019-07-10 2019-07-10 Audio data processing method and device and electronic equipment
CN201910619886.0 2019-07-10

Publications (1)

Publication Number Publication Date
WO2021004362A1 true WO2021004362A1 (en) 2021-01-14

Family

ID=74114394

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099864 WO2021004362A1 (en) 2019-07-10 2020-07-02 Audio data processing method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN112287129A (en)
WO (1) WO2021004362A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112954579A (en) * 2021-01-26 2021-06-11 腾讯音乐娱乐科技(深圳)有限公司 Method and device for reproducing on-site listening effect

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN113160819B (en) * 2021-04-27 2023-05-26 北京百度网讯科技有限公司 Method, apparatus, device, medium, and product for outputting animation

Citations (5)

Publication number Priority date Publication date Assignee Title
CN1293784A (en) * 1998-03-13 2001-05-02 西门子共同研究公司 Apparatus and method for collaborative dynamic video annotation
US20120210348A1 (en) * 2008-03-20 2012-08-16 Verna IP Holdings, LLC. System and methods providing sports event related media to internet-enabled devices synchronized with a live broadcast of the sports event
CN103150325A (en) * 2012-09-25 2013-06-12 圆刚科技股份有限公司 Multimedia comment system and multimedia comment method
US20170163697A1 (en) * 2005-12-16 2017-06-08 At&T Intellectual Property Ii, L.P. Real-time media control for audio and multimedia conferencing services
JP2017219573A (en) * 2016-06-03 2017-12-14 デジタルセンセーション株式会社 Information processing device, information processing method, and program

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
US8887190B2 (en) * 2009-05-28 2014-11-11 Harris Corporation Multimedia system generating audio trigger markers synchronized with video source data and related methods
US20100306232A1 (en) * 2009-05-28 2010-12-02 Harris Corporation Multimedia system providing database of shared text comment data indexed to video source data and related methods
KR20180105810A (en) * 2017-03-16 2018-10-01 네이버 주식회사 Method and system for generating content using audio comment



Also Published As

Publication number Publication date
CN112287129A (en) 2021-01-29

Similar Documents

Publication Publication Date Title
Vryzas et al. Speech emotion recognition for performance interaction
US11043216B2 (en) Voice feedback for user interface of media playback device
US11003708B2 (en) Interactive music feedback system
US20180077440A1 (en) System and method of creating, analyzing, and categorizing media
JP2017229060A (en) Methods, programs and devices for representing meeting content
CN113709561A (en) Video editing method, device, equipment and storage medium
CN108847214A (en) Method of speech processing, client, device, terminal, server and storage medium
WO2021004362A1 (en) Audio data processing method and apparatus, and electronic device
WO2022253157A1 (en) Audio sharing method and apparatus, device, and medium
US20210232624A1 (en) Interactive Music Feedback System
Vryzas et al. Speech emotion recognition adapted to multimodal semantic repositories
US9286943B2 (en) Enhancing karaoke systems utilizing audience sentiment feedback and audio watermarking
WO2022218027A1 (en) Audio playing method and apparatus, and computer-readable storage medium and electronic device
CN114501103B (en) Live video-based interaction method, device, equipment and storage medium
CN111726696A (en) Application method, device and equipment of sound barrage and readable storage medium
Liem et al. Multimedia technologies for enriched music performance, production, and consumption
CN110324702B (en) Information pushing method and device in video playing process
JP6367748B2 (en) Recognition device, video content presentation system
US20230030502A1 (en) Information play control method and apparatus, electronic device, computer-readable storage medium and computer program product
CN112995530A (en) Video generation method, device and equipment
CN111914115A (en) Sound information processing method and device and electronic equipment
Liaw et al. Live stream highlight detection using chat messages
KR102472921B1 (en) User interfacing method for visually displaying acoustic signal and apparatus thereof
CN113918755A (en) Display method and device, storage medium and electronic equipment
Roininen et al. Modeling the timing of cuts in automatic editing of concert videos

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20837295

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 20837295

Country of ref document: EP

Kind code of ref document: A1