CN114466241A - Display device and audio processing method - Google Patents

Display device and audio processing method Download PDF

Info

Publication number
CN114466241A
CN114466241A (publication) · CN202210102840.3A (application)
Authority
CN
China
Prior art keywords
audio data
gain
voice
sound
vocal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210102840.3A
Other languages
Chinese (zh)
Inventor
王海盈
邢文峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Visual Technology Co Ltd
Original Assignee
Hisense Visual Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Visual Technology Co Ltd filed Critical Hisense Visual Technology Co Ltd
Priority to CN202210102840.3A priority Critical patent/CN114466241A/en
Publication of CN114466241A publication Critical patent/CN114466241A/en
Priority to PCT/CN2022/101859 priority patent/WO2023142363A1/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4398Processing of audio elementary streams involving reformatting operations of audio signals
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/36Accompaniment arrangements
    • G10H1/361Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/366Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems with means for modifying or correcting the external signal, e.g. pitch correction, reverberation, changing a singer's voice
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272Voice signal separating

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The application relates to a display device and an audio processing method, applied in the field of audio processing technology. The display device comprises: a controller configured to obtain song audio data and perform voice separation on the song audio data to obtain original singing voice audio data and accompaniment audio data; a microphone configured to collect, in real time, singing voice audio data input by a user; the controller further configured to determine an original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in that time period, perform gain processing on the original singing voice audio data in the time period according to the original singing gain to obtain target voice audio data, and merge the accompaniment audio data, the target voice audio data and the singing voice audio data in the time period and perform sound effect enhancement processing to obtain target audio data; and an audio output interface configured to output the target audio data. The application can improve the accompaniment effect when singing.

Description

Display device and audio processing method
Technical Field
The present application relates to the field of audio processing technologies, and in particular, to a display device and an audio processing method.
Background
Currently, the karaoke function of a television is usually implemented in a karaoke APP. A karaoke APP is feature-rich and offers a good user experience, but its media resources are limited. For example, suppose the original singer A of a song is male and a cover singer B is female. When a female user C wants to sing the song, the karaoke APP may contain only the accompaniment video of the original singer A and no accompaniment video of singer B, so a suitable accompaniment cannot be found.
Since the waveform of the human voice is the same or similar in the left and right channels of a song, in the related art the human voice in a stereo song is canceled by subtracting one channel from the other. However, this method may also remove the bass (i.e., the frequency band below 400 Hz) of a song. The bass part of many songs consists mainly of drums or bass guitar, whose waveforms are essentially the same in the left and right channels, so eliminating the human voice also eliminates the bass part of the music. The resulting accompaniment sounds weak, there is no sense of accompaniment when singing, and the user experience is poor.
Disclosure of Invention
In order to solve the above technical problem, the present application provides a display device, an audio processing method, a storage medium, and a program product.
According to a first aspect of the present application, there is provided a display device including:
a controller configured to: acquiring song audio data, and performing voice separation on the song audio data to obtain original singing voice audio data and accompaniment audio data;
the controller further configured to: determining an original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period;
according to the original singing gain, gain processing is carried out on the original singing voice audio data in the time period to obtain target voice audio data;
the accompaniment audio data, the target voice audio data and the singing voice audio data in the time period are combined, and sound effect enhancement processing is carried out to obtain target audio data;
an audio output interface configured to: output the target audio data.
In some embodiments, the original singing gain is less than or equal to a preset gain threshold.
In some embodiments, the controller is configured to: if the energy of the user's singing voice audio data is smaller than a preset energy threshold, setting the original singing gain to the preset gain threshold;
and if the energy of the user's singing voice audio data is greater than or equal to the preset energy threshold, determining the original singing gain according to the energy ratio between the energy of the user's singing voice audio data and the energy of the original singing voice audio data, such that the original singing gain is smaller than the preset gain threshold.
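For illustration only, the gain rule above can be sketched as follows (a minimal Python sketch, assuming audio is handled as floating-point sample arrays and "energy" is mean squared amplitude; the threshold values and the ratio-to-gain mapping are assumptions, not values taken from this disclosure):

```python
import numpy as np

def original_gain(orig_vocal: np.ndarray, singer_vocal: np.ndarray,
                  energy_threshold: float = 1e-4,
                  gain_threshold: float = 1.0) -> float:
    """Illustrative original-vocal gain for one time period.

    Energy is computed as mean squared amplitude; the thresholds and the
    ratio-based mapping below are assumptions for illustration only.
    """
    singer_energy = float(np.mean(singer_vocal ** 2))
    orig_energy = float(np.mean(orig_vocal ** 2)) + 1e-12  # avoid division by zero

    if singer_energy < energy_threshold:
        # The user is not singing: keep the original vocal at the preset gain threshold.
        return gain_threshold
    # The user is singing: attenuate the original vocal based on the energy ratio,
    # keeping the gain strictly below the preset gain threshold.
    ratio = singer_energy / orig_energy
    return gain_threshold / (1.0 + ratio)
```

The gained original vocal for the period, i.e. the target voice audio data, would then be obtained as gain * orig_vocal.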
In some embodiments, the controller is further configured to:
acquiring the original singing gain corresponding to the previous time period, and if the original singing gain corresponding to the current time period is the same as that of the previous time period, extending the time period, with the extended time period kept smaller than a first time threshold;
if the original singing gain corresponding to the current time period differs from that of the previous time period, shortening the time period, with the shortened time period kept greater than a second time threshold, wherein the first time threshold is greater than the second time threshold.
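A minimal sketch of the period adaptation described above, assuming the original singing gain is recomputed once per period (the step size and the two time thresholds are illustrative assumptions):

```python
def adapt_period(period_s: float, gain_now: float, gain_prev: float,
                 first_threshold_s: float = 1.0,   # upper bound on the period (assumed)
                 second_threshold_s: float = 0.1,  # lower bound on the period (assumed)
                 step_s: float = 0.05) -> float:
    """Lengthen the analysis period when the gain is stable across periods and
    shorten it when the gain changes, staying between the two thresholds."""
    if gain_now == gain_prev:
        return min(period_s + step_s, first_threshold_s - step_s)
    return max(period_s - step_s, second_threshold_s + step_s)
```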
In some embodiments, the controller is further configured to:
generating first vocal accompaniment audio data according to the original singing voice audio data in each time period;
determining a vocal accompaniment gain according to the energy of the user's singing voice audio data collected in the time period; wherein the vocal accompaniment gain is positively correlated with that energy;
performing gain processing on the first vocal accompaniment audio data using the vocal accompaniment gain to obtain second vocal accompaniment audio data; wherein the energy of the second vocal accompaniment audio data is less than the energy of the user's singing voice audio data;
the controller is specifically configured to: merge the accompaniment audio data, the second vocal accompaniment audio data, the target voice audio data and the user's singing voice audio data in the time period, and perform sound effect enhancement processing to obtain the target audio data.
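As a rough illustration of the backing-vocal gain described above (a Python sketch; the saturating mapping and the 0.9 safety factor are assumptions, chosen only so that the gained backing vocal stays quieter than the user's voice):

```python
import numpy as np

def accompaniment_gain(first_backing: np.ndarray, singer_vocal: np.ndarray,
                       max_gain: float = 0.5) -> float:
    """Illustrative vocal accompaniment gain: grows with the singer's energy and
    is capped so the gained backing vocal has less energy than the singer."""
    singer_energy = float(np.mean(singer_vocal ** 2))
    backing_energy = float(np.mean(first_backing ** 2)) + 1e-12
    # Positively correlated with the singer's energy (simple saturating mapping).
    g = max_gain * singer_energy / (singer_energy + 1e-4)
    # Energy of g * backing is g**2 * backing_energy, so cap g below this limit.
    limit = float(np.sqrt(singer_energy / backing_energy))
    return min(g, 0.9 * limit)
```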
In some embodiments, the controller is configured to: acquiring a plurality of different delays and the gain corresponding to each delay;
for each delay, performing delay processing on the original singing voice audio data in each time period according to that delay to obtain first delayed audio data;
performing gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data;
and merging the plurality of second delayed audio data to obtain the first vocal accompaniment audio data.
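A possible reading of this delay-and-gain construction is a simple chorus effect: several delayed, attenuated copies of the original vocal are summed. The sketch below assumes PCM sample arrays; the specific delay and gain values are illustrative.

```python
import numpy as np

def make_backing_vocal(orig_vocal: np.ndarray, sr: int,
                       delays_ms=(15.0, 25.0, 40.0),
                       gains=(0.6, 0.5, 0.4)) -> np.ndarray:
    """Sum delayed copies (first delayed audio data) after gain processing
    (second delayed audio data) to form the first vocal accompaniment."""
    out = np.zeros_like(orig_vocal)
    for delay_ms, gain in zip(delays_ms, gains):
        n = int(sr * delay_ms / 1000.0)
        if n >= len(orig_vocal):
            continue  # delay longer than the period: skip this copy
        delayed = np.zeros_like(orig_vocal)
        delayed[n:] = orig_vocal[:len(orig_vocal) - n]  # shift right by n samples
        out += gain * delayed
    return out
```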
In some embodiments, the controller is configured to: determining the sound zone to which the original singing voice audio data belongs;
and performing pitch-up or pitch-down processing on the original singing voice audio data according to the sound zone to obtain the first vocal accompaniment audio data.
In some embodiments, the controller is configured to: if the sound zone is a low zone, performing pitch-down processing on the original singing voice audio data to obtain the first vocal accompaniment audio data;
if the sound zone is a high zone, performing pitch-up processing on the original singing voice audio data to obtain the first vocal accompaniment audio data;
if the sound zone is a middle zone, performing pitch-up processing and pitch-down processing on the original singing voice audio data to obtain first human-voice audio data and second human-voice audio data respectively;
and using the first human-voice audio data and the second human-voice audio data together as the first vocal accompaniment audio data.
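For illustration, the register-dependent pitch shifting could look like the sketch below. It uses librosa's pitch shifting purely as a convenient example; the one-octave (±12 semitone) shift amounts and the zone labels are assumptions, not values from this disclosure.

```python
import numpy as np
import librosa  # used here only as one convenient way to pitch-shift

def backing_from_sound_zone(orig_vocal: np.ndarray, sr: int, zone: str) -> list:
    """Return the pitch-shifted copies used as first vocal accompaniment audio data."""
    if zone == "low":
        return [librosa.effects.pitch_shift(orig_vocal, sr=sr, n_steps=-12)]
    if zone == "high":
        return [librosa.effects.pitch_shift(orig_vocal, sr=sr, n_steps=12)]
    # Middle zone: produce both an up-shifted and a down-shifted copy.
    return [librosa.effects.pitch_shift(orig_vocal, sr=sr, n_steps=12),
            librosa.effects.pitch_shift(orig_vocal, sr=sr, n_steps=-12)]
```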
According to a second aspect of the present application, there is provided an audio processing method comprising:
acquiring song audio data, and performing voice separation on the song audio data to obtain original singing voice audio data and accompaniment audio data;
determining an original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period;
according to the original singing gain, gain processing is carried out on the original singing voice audio data in the time period to obtain target voice audio data;
and combining the accompaniment audio data, the target voice audio data and the singing voice audio data in the time period, and performing sound effect enhancement processing to obtain and output the target audio data.
In some embodiments, the original singing gain is less than or equal to a preset gain threshold.
In some embodiments, the determining of the original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period includes:
if the energy of the user's singing voice audio data is smaller than a preset energy threshold, setting the original singing gain to the preset gain threshold;
and if the energy of the user's singing voice audio data is greater than or equal to the preset energy threshold, determining the original singing gain according to the energy ratio between the energy of the user's singing voice audio data and the energy of the original singing voice audio data, such that the original singing gain is smaller than the preset gain threshold.
In some embodiments, the method further comprises:
acquiring the original singing gain corresponding to the previous time period, and if the original singing gain corresponding to the current time period is the same as that of the previous time period, extending the time period, with the extended time period kept smaller than a first time threshold;
if the original singing gain corresponding to the current time period differs from that of the previous time period, shortening the time period, with the shortened time period kept greater than a second time threshold, wherein the first time threshold is greater than the second time threshold.
In some embodiments, the method further comprises:
generating first vocal accompaniment audio data according to the original singing voice audio data in each time period;
determining a vocal accompaniment gain according to the energy of the user's singing voice audio data collected in the time period; wherein the vocal accompaniment gain is positively correlated with that energy;
performing gain processing on the first vocal accompaniment audio data using the vocal accompaniment gain to obtain second vocal accompaniment audio data; wherein the energy of the second vocal accompaniment audio data is less than the energy of the user's singing voice audio data;
and the merging of the accompaniment audio data, the target voice audio data and the singing voice audio data in the time period and the sound effect enhancement processing to obtain the target audio data specifically include:
merging the accompaniment audio data, the second vocal accompaniment audio data, the target voice audio data and the user's singing voice audio data in the time period, and performing sound effect enhancement processing to obtain the target audio data.
In some embodiments, the generating of the first vocal accompaniment audio data according to the original singing voice audio data in each time period includes:
acquiring a plurality of different delays and the gain corresponding to each delay;
for each delay, performing delay processing on the original singing voice audio data in each time period according to that delay to obtain first delayed audio data;
performing gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data;
and merging the plurality of second delayed audio data to obtain the first vocal accompaniment audio data.
In some embodiments, the generating of the first vocal accompaniment audio data according to the original singing voice audio data in each time period includes:
determining the sound zone to which the original singing voice audio data belongs;
and performing pitch-up or pitch-down processing on the original singing voice audio data according to the sound zone to obtain the first vocal accompaniment audio data.
In some embodiments, the performing of the pitch-up or pitch-down processing on the original singing voice audio data according to the sound zone includes:
if the sound zone is a low zone, performing pitch-down processing on the original singing voice audio data to obtain the first vocal accompaniment audio data;
if the sound zone is a high zone, performing pitch-up processing on the original singing voice audio data to obtain the first vocal accompaniment audio data;
if the sound zone is a middle zone, performing pitch-up processing and pitch-down processing on the original singing voice audio data to obtain first human-voice audio data and second human-voice audio data respectively;
and using the first human-voice audio data and the second human-voice audio data together as the first vocal accompaniment audio data.
According to a third aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the audio processing method of the second aspect.
According to a fourth aspect of the present application, there is provided a computer program product which, when run on a computer, causes the computer to perform the audio processing method of the second aspect.
Compared with the related art, the technical scheme provided by some embodiments of the application has the following advantages:
For song audio data, the original singing voice audio data and the accompaniment audio data can be obtained through voice separation. Thus, karaoke can be realized in this way for any song, even one not contained in a karaoke APP. The original singing gain is determined according to the energy of the user's singing voice audio data collected in real time and the energy of the original singing voice audio data, and gain processing is performed on the original singing voice audio data according to the original singing gain to obtain the target voice audio data. Because the original singing gain is determined according to the energy of the user's singing voice audio data and the energy of the original singing voice audio data, merging the target voice audio data into the accompaniment audio data effectively merges the original singing voice into the accompaniment according to how the user is singing: for example, the original singing voice audio data may be merged into the accompaniment audio data entirely, or merged in after being attenuated. This improves the accompaniment effect when the user sings and improves the user experience.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
In order to more clearly illustrate some embodiments of the present application or the technical solutions in the related art, the drawings needed in the description of the embodiments or the related art are briefly described below; other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to one or more embodiments of the present application;
fig. 2 is a block diagram of a hardware configuration of a display apparatus 200 according to one or more embodiments of the present application;
fig. 3 is a block diagram of a hardware configuration of the control apparatus 100 according to one or more embodiments of the present application;
fig. 4 is a schematic diagram of a software configuration in a display device 200 according to one or more embodiments of the present application;
FIG. 5 is a schematic illustration of an icon control interface display of an application in a display device 200 according to one or more embodiments of the present application;
FIG. 6A is a diagram illustrating a system architecture of an audio processing method according to some embodiments of the present application;
FIG. 6B is a schematic diagram of an audio processing method according to some embodiments of the present application;
FIG. 7 is a schematic illustration of sound separation;
FIG. 8 is a schematic diagram of an audio processing method according to some embodiments of the present application;
FIG. 9A is a schematic view of the speaker placement angles of a standard recording studio or home audio system;
FIG. 9B is a schematic view of the angle of the television speaker;
FIG. 9C is a schematic diagram of changing the power distribution of the television speakers;
FIG. 10 is a schematic representation of the function f (x) according to some embodiments of the present application;
FIG. 11A is a diagram illustrating a system architecture of an audio processing method according to some embodiments of the present application;
FIG. 11B is a schematic diagram of an audio processing method according to some embodiments of the present application;
FIG. 12 is a schematic diagram of an audio processing method according to some embodiments of the present application;
FIG. 13A is a diagram illustrating a system architecture of an audio processing method according to some embodiments of the present application;
FIG. 13B is a schematic diagram of an audio processing method according to some embodiments of the present application;
FIG. 14 is a schematic diagram of a speaker distribution;
FIG. 15A is a diagram illustrating a system architecture of an audio processing method according to some embodiments of the present application;
FIG. 15B is a schematic diagram of an audio processing method according to some embodiments of the present application;
FIG. 16 is a schematic illustration of a time domain transformation of original vocal audio data according to some embodiments of the present application;
fig. 17 is a schematic diagram of frequency domain transformation of original vocal audio data according to some embodiments of the present application.
FIG. 18 is a flow chart of a method of audio processing in some embodiments of the present application;
FIG. 19 is a flow chart of a method of audio processing in some embodiments of the present application;
FIG. 20 is a flow chart of an audio processing method in some embodiments of the present application;
FIG. 21 is a flow chart of an audio processing method according to some embodiments of the present application.
Detailed Description
To make the objects, embodiments and advantages of the present application clearer, exemplary embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. It should be understood that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of them.
All other embodiments obtained by a person skilled in the art from the exemplary embodiments described herein without inventive effort fall within the scope of the appended claims. In addition, while the disclosure herein is presented in terms of one or more exemplary examples, it should be appreciated that individual aspects of the disclosure may also constitute a complete embodiment on their own. It should be noted that the brief descriptions of terms in the present application are only for convenience in understanding the embodiments described below and are not intended to limit the embodiments of the present application; unless otherwise indicated, these terms should be understood in their ordinary and customary meaning.
Fig. 1 is a schematic diagram of an operation scenario between a display device and a control apparatus according to one or more embodiments of the present application. As shown in fig. 1, a user may operate the display device 200 through a mobile terminal 300 or the control apparatus 100. The control apparatus 100 may be a remote controller, which communicates with the display device through infrared protocol communication, Bluetooth protocol communication, or other wireless or wired methods to control the display device 200. The user may input user commands through keys on the remote controller, voice input, control panel input, etc. to control the display apparatus 200. In some embodiments, mobile terminals, tablets, computers, laptops, and other smart devices may also be used to control the display device 200.
In some embodiments, the mobile terminal 300 may install a software application that connects and communicates with the display device 200 through a network communication protocol, for the purpose of one-to-one control operation and data communication. The audio and video content displayed on the mobile terminal 300 can also be transmitted to the display device 200 for synchronous display. The display device 200 can also perform data communication with the server 400 through multiple communication modes, and may be communicatively connected through a local area network (LAN), a wireless local area network (WLAN), or other networks. The server 400 may provide various content and interactions to the display apparatus 200. The display device 200 may be a liquid crystal display, an OLED display, or a projection display device. In addition to the broadcast receiving television function, the display apparatus 200 may also provide a smart network TV function with computer support.
Fig. 2 exemplarily shows a block diagram of a configuration of the control apparatus 100 according to an exemplary embodiment. As shown in fig. 2, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface 140, a memory, and a power supply. The control apparatus 100 may receive an input operation instruction from a user and convert the operation instruction into an instruction recognizable and responsive by the display device 200, serving as an interaction intermediary between the user and the display device 200. The communication interface 130 is used for communicating with the outside, and includes at least one of a WIFI chip, a bluetooth module, NFC, or an alternative module. The user input/output interface 140 includes at least one of a microphone, a touch pad, a sensor, a key, or an alternative module.
Fig. 3 shows a hardware configuration block diagram of the display apparatus 200 according to an exemplary embodiment. The display apparatus 200 as shown in fig. 3 includes at least one of a tuner demodulator 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, an external memory, a power supply, and a user interface 280. The controller includes a central processor, a video processor, an audio processor, a graphics processor, a RAM, a ROM, and first to nth interfaces for input/output. The display 260 may be at least one of a liquid crystal display, an OLED display, a touch display, and a projection display, and may also be a projection device with a projection screen. The tuner demodulator 210 receives broadcast television signals in a wired or wireless manner and demodulates audio/video signals, as well as EPG data signals, from a plurality of wireless or wired broadcast television signals. The detector 230 is used to collect signals of the external environment or of interaction with the outside. The controller 250 and the tuner demodulator 210 may be located in different separate devices; that is, the tuner demodulator 210 may also be located in a device external to the main device where the controller 250 is located, such as an external set-top box.
In some embodiments, the controller 250 controls the operation of the display device and responds to user operations through various software control programs stored in an external memory. The controller 250 controls the overall operation of the display apparatus 200. A user may input a user command on a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI). Alternatively, the user may input the user command by inputting a specific sound or gesture, and the user input interface receives the user input command by recognizing the sound or gesture through the sensor.
In some embodiments, a "user interface" is a media interface for interaction and information exchange between an application or operating system and a user that enables conversion between an internal form of information and a form that is acceptable to the user. A commonly used presentation form of the User Interface is a Graphical User Interface (GUI), which refers to a User Interface related to computer operations and displayed in a graphical manner. It may be an interface element such as an icon, a window, a control, etc. displayed in the display screen of the electronic device, where the control may include at least one of an icon, a button, a menu, a tab, a text box, a dialog box, a status bar, a navigation bar, a Widget, etc. visual interface elements.
Fig. 4 is a schematic diagram of a software configuration in a display device 200 according to one or more embodiments of the present application. As shown in fig. 4, the system is divided into four layers, which are, from top to bottom, an Application (Applications) layer (referred to as the "application layer"), an Application Framework layer (referred to as the "framework layer"), the Android runtime and system library layer (referred to as the "system runtime library layer"), and the kernel layer. The kernel layer comprises at least one of the following drivers: an audio driver, a display driver, a Bluetooth driver, a camera driver, a WiFi driver, a USB driver, an HDMI driver, sensor drivers (such as fingerprint sensor, temperature sensor, pressure sensor, etc.) and a power driver.
Fig. 5 is a schematic diagram illustrating an icon control interface display of an application program in the display device 200 according to one or more embodiments of the present application, as shown in fig. 5, an application layer includes at least one application program that can display a corresponding icon control in a display, for example: the system comprises a live television application icon control, a video on demand application icon control, a media center application icon control, an application center icon control, a game application icon control and the like. The live television application program can provide live television through different signal sources. A video-on-demand application may provide video from different storage sources. Unlike live television applications, video on demand provides a video display from some storage source. The media center application program can provide various applications for playing multimedia contents. The application program center can provide and store various application programs.
The implementation of the present application in the Android system is shown in fig. 6A. The Android system mainly includes an application layer, middleware and a core layer, and the implementation logic may reside in the middleware, which includes: an audio decoder, a sound separation module, a gain control module, a merging module, a sound effect enhancement module and an audio output interface. The audio decoder performs audio decoding on a signal source input through a broadcast signal, the network, USB, HDMI, or the like to obtain audio data. The sound separation module performs sound separation on the decoded audio data; for example, human-voice audio and background audio can be separated by a human voice separation method. The gain control module obtains the user's sound control mode for the display device and applies different gain processing to the human-voice audio and the background audio respectively, so as to enhance the human-voice audio or the background audio. The merging module merges the gain-processed human-voice audio and background audio to obtain merged audio data, and the sound effect enhancement module performs sound effect enhancement processing on the merged audio data to obtain target audio data. The audio output interface outputs the target audio data.
It should be noted that the above implementation logic may be implemented in the core layer instead of the middleware, or in both the middleware and the core layer; for example, the audio decoder and the sound separation module may be implemented in the middleware, and the modules following the sound separation module may be implemented in the core layer.
Fig. 6B is a schematic diagram of an audio processing method according to some embodiments of the present application, corresponding to fig. 6A. After the audio decoder decodes the acquired sound signal, first audio data is obtained. The sound separation module can perform sound separation on the first audio data through a pre-trained neural network model using AI (artificial intelligence) technology, obtaining first target audio data and first background audio data. For example, the human voice may be separated as the first target audio data through a human voice separation model, or the car sound may be separated as the first target audio data through a pre-trained car-sound separation model; the audio data other than the first target audio data constitutes the first background audio data. The gain control module obtains a first gain and a second gain according to the sound control mode, where the values of the first gain and the second gain are not equal. Gain processing is performed on the first target audio data according to the first gain to obtain second target audio data, and gain processing is performed on the first background audio data according to the second gain to obtain second background audio data. The second target audio data and the second background audio data are merged, and second audio data is obtained and output after sound effect enhancement processing. In the present application, unequal gain processing of the first target audio data and the first background audio data enhances one of them, which improves the effect of sound effect enhancement.
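The flow of fig. 6B can be summarized in a short sketch (assuming the AI separation model and the sound-effect enhancement chain are supplied elsewhere; they are represented here only as callables, and the dB-to-linear conversion is a standard assumption):

```python
import numpy as np

def process_frame(first_audio: np.ndarray, separate, enhance,
                  g1_db: float, g2_db: float) -> np.ndarray:
    """Separate, apply unequal gains, merge, then enhance (fig. 6B).

    `separate` stands in for the pre-trained separation model and `enhance`
    for the AGC/DRC/EQ/virtual-surround chain; neither is defined here.
    """
    target, background = separate(first_audio)          # first target / first background audio data
    target = target * (10.0 ** (g1_db / 20.0))           # apply first gain (dB -> linear)
    background = background * (10.0 ** (g2_db / 20.0))   # apply second gain (dB -> linear)
    merged = target + background
    return enhance(merged)                                # second audio data
```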
The following first describes a display device according to some embodiments of the present application.
In some embodiments, the display device 200 may be a terminal device with a display function, such as a television, a smart phone, a computer, a learning machine, and the like. The display device 200 includes:
a controller 250 configured to: perform sound separation on the acquired first audio data to obtain first target audio data and first background audio data.
The first audio data refers to audio data including at least two mixed sounds, for example, the first audio data may include human voice and background music, the human voice is separated through a human voice separation model trained in advance, and the sounds other than the human voice are the background sounds. At this time, the first target audio data is the voice, and the first background audio data is the background sound.
Referring to fig. 7, fig. 7 is a schematic diagram of sound separation. Sounds in daily life and in film and television works are mixtures of various sounds; for example, in fig. 7, sound signal 1 is the sound of an instrument and sound signal 2 is the sound of a person singing. The mixed sound signal mixes the instrument sound and the singing during recording and audio/video production. A traditional sound effect algorithm based on fixed logic operations cannot separate the two sounds from the mixed sound signal, but sound separation can be realized by means of AI technology, obtaining audio 1, which approximates the instrument, and audio 2, which approximates the human voice.
Alternatively, the first audio data includes multiple mixed sounds such as human voice, car sound, gunshot sound and background music. The human voice can be separated through a human voice separation model, the car sound through a pre-trained car-sound separation model, and the gunshot sound through a pre-trained gunshot-sound separation model. The remaining sounds in the first audio data, other than the separated human voice, car sound and gunshot sound, are taken as the background sound. In this case, the first target audio data may include the human voice, the car sound and the gunshot sound, and the first background audio data is the background sound.
The user may select the voice control mode according to his own preference, and the first gain and the second gain may be determined according to the voice control mode. A controller 250 configured to: performing gain processing on the first target audio data according to the first gain to obtain second target audio data; and performing gain processing on the first background audio data according to the second gain to obtain second background audio data. That is, the first target audio data and the first background audio data are subjected to different-magnitude gain processing to enhance the first target audio data or the first background audio data. And then, merging the second target audio data and the second background audio data, and performing sound effect enhancement processing to obtain second audio data.
It can be understood that if the first gain and the second gain are both 0dB, the signal after combining the second target audio data and the second background audio data is highly similar to the signal before sound separation. And carrying out sound effect enhancement processing on the combined signal through a sound effect enhancement algorithm to obtain second audio data. The sound effect enhancement algorithm includes, but is not limited to, AGC (automatic gain control), DRC (Dynamic range compression), EQ (equalizer), virtual surround, and the like.
An audio output interface 270 configured to: output the second audio data.
In some embodiments, the controller 250 is configured to: determine the type of the sound effect enhancement mode corresponding to the first audio data according to the sound control mode corresponding to the display device; the type of the sound effect enhancement mode indicates the type of sound that the user wants to enhance, and the first gain and the second gain corresponding to that type are determined according to the sound control mode corresponding to the display device. Different types of sound effect enhancement mode correspond to different first and second gains.
In some embodiments, the type of the sound effect enhancement mode corresponding to the first audio data may be determined according to the sound control mode. The type of the sound effect enhancement mode indicates the type of sound that the user wants to enhance, and different types may use different methods to determine the first gain and the second gain. Therefore, after the type of the sound effect enhancement mode is determined, the first gain and the second gain corresponding to that type can be determined according to the sound control mode. For example, the types of the sound effect enhancement mode may include a sound enhancement mode, indicating that the user wants to enhance the first target audio data, and a background enhancement mode, indicating that the user wants to enhance the first background audio data.
In some embodiments, the controller 250 is configured to: if the type of the sound effect enhancement mode corresponding to the first audio data is the sound enhancement mode, i.e., the first target audio data is to be enhanced, the first gain is larger than the second gain. If the type of the sound effect enhancement mode corresponding to the first audio data is the background enhancement mode, the first background audio data is to be enhanced, and the first gain is smaller than the second gain.
Assuming that the first gain is G1 and the second gain is G2: if the user wants to enhance the first target audio data, the first target audio data may be enhanced without changing the first background audio data, i.e., G1 may be a value greater than 0 dB and G2 equal to 0 dB. If the user wants to enhance the first background audio data, the first background audio data may be enhanced without changing the first target audio data, i.e., G1 is equal to 0 dB and G2 is a value greater than 0 dB.
In some embodiments, to ensure that no positive gain occurs, which could cause the audio signal to clip, G1 and G2 may be limited to values no greater than 0 dB. If the type of the sound effect enhancement mode corresponding to the first audio data is the sound enhancement mode, the first gain is set to 0 dB and the second gain is determined according to the sound control mode, where the second gain is less than 0 dB. In this way, the first target audio data is enhanced relatively by attenuating the first background audio data without changing the first target audio data. If the type of the sound effect enhancement mode corresponding to the first audio data is the background enhancement mode, the first gain is determined according to the sound control mode and the second gain is set to 0 dB, where the first gain is less than 0 dB. In this way, the first background audio data is enhanced relatively by attenuating the first target audio data without changing the first background audio data.
In some embodiments, the display device corresponds to a plurality of preset sound definition control modes and/or a plurality of preset sound effect modes. The user can adjust the degree of the clearness of the voice according to own needs and hobbies, and selects a target sound definition control mode from a plurality of preset sound definition control modes, wherein each preset sound definition control mode has a corresponding numerical value. For example, the preset sound definition control modes are divided into a plurality of different grades, and each grade corresponds to a different numerical value. The user can also select a target sound effect mode from a plurality of preset sound effect modes (such as a standard mode, a music mode, a movie mode, etc.), wherein each preset sound effect mode has a corresponding numerical value.
The preset sound definition control mode represents the sound definition degree of the display device, and can comprise a plurality of different levels. If the preset sound definition control mode corresponds to a value M1, the user can adjust the definition of the sound through the menu, and to simplify the calculation, the menu adjustment value can be normalized to a value within [0,1], that is, M1 is a value greater than or equal to 0 and less than or equal to 1. Assuming that 0.5 represents a default value of the display device at the time of factory shipment, greater than 0.5 represents a higher degree of clarity of the sound, and less than 0.5 represents a lower degree of clarity of the sound.
The preset sound effect mode represents the sound effect mode of the display device, and can comprise standard sound effect, music sound effect, movie sound effect, news sound effect and the like. If the value corresponding to the preset sound effect mode is M2, M2 may also be a normalized value, assuming that the value of M2 in the standard mode is 0.5, the value of M2 in the music mode is 0.6, the value of M2 in the movie mode is 0.7, and the value of M2 in the news mode is 0.8.
The sound control mode corresponding to the display device comprises: a target sound definition control mode and/or a target sound effect mode; the target sound definition control mode is one of the plurality of preset sound definition control modes, and the target sound effect mode is one of the plurality of preset sound effect modes. The controller 250 is configured to: determine the type of the sound effect enhancement mode corresponding to the first audio data according to the first value corresponding to the target sound definition control mode and/or the second value corresponding to the target sound effect mode. That is, a value can be obtained from the first value and/or the second value, and the type of the sound effect enhancement mode can be determined from that value. Further, the first gain and the second gain corresponding to the type of the sound effect enhancement mode are determined based on the first value and/or the second value.
In some embodiments, a third value may be derived from the first value and the second value, and the type of the sound effect enhancement mode may be determined based on the third value. In the normalized scenario, a third value equal to 1 indicates that neither the first target audio data nor the first background audio data is enhanced. When the third value is greater than 1, the first target audio data is enhanced; when the third value is less than 1, the first background audio data is enhanced. In some embodiments, the third value T may be expressed as the following formula:
T=(2×M1)×(2×M2) (1)
It is understood that if different default values of M1 and M2 are used in the standard mode, the expression for the third value T may differ accordingly.
For example, when the user does not adjust the sound control mode of the display device, the first value corresponding to the target sound definition control mode is 0.5, the second value corresponding to the target sound effect mode is also 0.5, and at this time, T is equal to 1, and both the first gain G1 and the second gain G2 may be 0dB, that is, the first target audio data and the first background audio data are not subjected to gain processing.
If the user adjusts the sound control mode of the display device, assume the first value corresponding to the target sound definition control mode is 0.7 and the second value corresponding to the target sound effect mode is 0.8. The value of T is then greater than 1, indicating that the first target audio data is enhanced. As previously mentioned, G1 and G2 are both values not greater than 0 dB, so G1 may be set to 0 dB and G2 to a value less than 0 dB. In some embodiments, G2 may be expressed as the following formula:
(Formula (2) appears only as an image in the original publication and is not reproduced here.)
Of course, the method of determining G2 is not limited to this; for example, simple variations of formula (2) may also be used.
Conversely, if after the user adjusts the sound control mode of the display device the value of T is less than 1, the first background audio data is enhanced. In this case, G2 may be set to 0 dB and G1 to a value less than 0 dB. In some embodiments, G1 may be expressed as the following formula:
(Formula (3) appears only as an image in the original publication and is not reproduced here.)
Of course, the method of determining G1 is not limited to this; for example, simple variations of formula (3) may also be used.
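Since formulas (2) and (3) appear only as images in the original publication, the exact mappings are not reproduced here. The sketch below therefore computes T per formula (1) and uses an assumed logarithmic attenuation merely to illustrate the selection logic (G1 fixed at 0 dB and G2 negative when T > 1, and vice versa when T < 1):

```python
import math

def gains_from_mode(m1: float, m2: float):
    """Return (G1, G2) in dB from the normalized menu values M1 and M2."""
    t = (2.0 * m1) * (2.0 * m2)              # formula (1)
    g1_db, g2_db = 0.0, 0.0
    if t > 1.0:
        # Sound enhancement mode: keep G1 at 0 dB, attenuate the background.
        g2_db = -20.0 * math.log10(t)        # assumed mapping, not formula (2)
    elif t < 1.0:
        # Background enhancement mode: keep G2 at 0 dB, attenuate the target.
        g1_db = -20.0 * math.log10(1.0 / t)  # assumed mapping, not formula (3)
    return g1_db, g2_db
```

For example, with M1 = 0.7 and M2 = 0.8 as above, T = 2.24, and this sketch would leave G1 at 0 dB and set G2 to roughly -7 dB.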
Referring to fig. 8, fig. 8 is a schematic diagram of an audio processing method according to some embodiments of the present application. In the stereo display device, after the audio decoder decodes, the audio data of the left and right channels are independently processed by human voice separation, gain processing and sound effect enhancement processing, and then sent to the corresponding loudspeakers.
Most display-device loudspeakers are located at the bottom of the device and fire downward. Because the distance between the two loudspeakers is short (generally about 0.3-0.8 m) while the viewing distance is generally about 2.0-2.5 m, the angle subtended by the loudspeakers is only 8-14°. The human limit of direction resolution is about 5°; that is, the angular separation of the display device's two speakers is close to that limit. When ordinary stereo content is produced (in a standard recording studio), the angle between the left and right channels is 60°. Referring to fig. 9A, fig. 9A is a schematic diagram of the speaker placement angles of a standard recording studio or home audio system; it can be seen that the angle between the left and right channels is 60°. When a sound source is produced, a sound is not placed in only one channel; both channels carry it. When the creator wants the sound to appear on the left, the sound in the left channel is made louder than in the right, and conversely, when the creator wants the sound on the right, the right channel is louder than the left.
However, such content is produced based on a 60° angle. Referring to fig. 9B, fig. 9B is a schematic diagram of the angle of the television speakers. At this angle, the virtual sound image of all sound elements is narrowed, contrary to the creator's intention based on 60° speakers. When the angle between the two loudspeakers shrinks to 8-14°, if the left and right channels keep their original mix, the sound image perceived by the audience is blurred and the audience has difficulty hearing the direction of the sound.
To improve the sense of direction, the ratio of a sound's signal between the left and right speakers can be changed without changing the physical conditions of the speakers. For example, if the energy distribution of a certain sound between the left and right channels in the source is 7:3, changing it to 8:2 or 9:1 can enhance the sense of direction of the sound field. Referring to fig. 9C, fig. 9C is a schematic diagram of changing the power distribution of the television speakers; it can be seen that after the energy distribution is changed, the car sounds closer to the left speaker in the viewer's subjective perception.
In general, the background music used to set the atmosphere in film and television works has essentially the same energy, or even the same signal, in the left and right channels, whereas typical direction-bearing sounds are distributed unevenly across the channels to convey direction; such sounds include, but are not limited to, human voices, gunshots, car sounds and airplane sounds. If the energy of the left and right channels were still measured as a whole and the energy ratio of the two channels simply changed, the center of background music whose sound image is centered would be shifted, so that approach is not preferable.
In some embodiments, the first audio data includes at least one third target audio data belonging to a preset sound type (e.g., a sound type representing a sense of direction), and the third target audio data includes, but is not limited to, a human voice, a gun sound, a car sound, an airplane sound, and the like.
To solve the above problem, the controller 250 is further configured to: at least one third target audio data and third background audio data are separated from the first audio data.
As described above, the first audio data refers to audio data containing at least two mixed sounds. Human voice, gunshot sound, car sound and the like can be separated from the first audio data through different trained neural network models. Each third target audio data is one type of such audio data; the first audio data may include one or more third target audio data, and the audio data other than the third target audio data in the first audio data is the third background audio data. For example, when the first audio data includes a human voice and a car sound, the first audio data includes two third target audio data, namely the human voice and the car sound, and the sounds other than these are the background sound. The following processing may be performed for each kind of third target audio data.
Since the third target audio data is used to express the sense of azimuth, the third target audio data includes audio data of at least two different channels (e.g., a first channel and a second channel). In some embodiments, the first channel and the second channel may be a left channel and a right channel, respectively. For example, the third target audio data includes two channels of audio data, i.e., a first channel initial target audio data and a second channel initial target audio data. The first channel initial target audio data and the second channel initial target audio data may be left channel audio data and right channel audio data, respectively. For another example, the first channel initial background audio data and the second channel initial background audio data may be left channel initial background audio data and right channel initial background audio data, respectively.
It is to be understood that the energies of the first channel initial target audio data and the second channel initial target audio data in the third target audio data are different, and therefore, a first energy value of the first channel initial target audio data and a second energy value of the second channel initial target audio data of a single third target audio data may be obtained, and a third gain corresponding to the first channel initial target audio data and a fourth gain corresponding to the second channel initial target audio data may be determined according to the first energy value and the second energy value.
Performing gain processing on the first channel initial target audio data according to a third gain to obtain first channel first gain audio data, namely the gain-processed first channel audio data; performing gain processing on the second channel initial target audio data according to the fourth gain to obtain second channel first gain audio data, namely gain-processed second channel audio data; wherein the third gain and the fourth gain are determined based on the first energy value and the second energy value. In this way, the third target audio data can be subjected to the gain processing on the first channel initial target audio data according to the third gain and the second channel initial target audio data according to the fourth gain, respectively, and the sense of direction of the third target audio data can be further improved. Meanwhile, the center of the third background audio data may not be changed.
For example, if the first energy value of the first channel initial target audio data is greater than the second energy value of the second channel initial target audio data, the third gain may be greater than the fourth gain, for example, the third gain may be set to a value greater than 0dB, and the fourth gain may be set to 0dB, that is, no gain processing is performed on the second channel initial target audio data. If the first energy value is equal to the second energy value, indicating that the two energies are equal, the third gain is equal to the fourth gain, or no processing may be performed. If the first energy value is smaller than the second energy value, the third gain may be smaller than the fourth gain, for example, the third gain is set to 0dB, that is, no gain processing is performed on the initial target audio data of the first channel, and the fourth gain is set to a value greater than 0 dB.
In some embodiments, to ensure that no positive gain occurs resulting in a break in the audio signal, the third gain may be set to 0dB if the first energy value is greater than the second energy value, and the fourth gain may be determined based on the first energy value and the second energy value, wherein the fourth gain is less than 0 dB. Performing gain processing on the initial target audio data of the first sound channel according to the third gain to obtain first gain audio data of the first sound channel; and performing gain processing on the second channel initial target audio data according to the fourth gain to obtain second channel first gain audio data.
If the first energy value is less than the second energy value, a third gain may be determined based on the first energy value and the second energy value, the third gain being less than 0dB, and the fourth gain being set to 0 dB. Performing gain processing on the initial target audio data of the first sound channel according to the third gain to obtain first gain audio data of the first sound channel; and performing gain processing on the second channel initial target audio data according to the fourth gain to obtain second channel first gain audio data.
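A minimal sketch of this per-channel gain selection is shown below (Python). The attenuation value atten_db is a placeholder introduced here for illustration; in the embodiments above the negative gain would instead be derived from the first and second energy values.

```python
import numpy as np

def channel_gains_db(first_energy, second_energy, atten_db=3.0):
    # Hypothetical helper: never apply a positive gain; instead attenuate
    # the channel with the lower energy so the louder channel stands out.
    if first_energy > second_energy:
        third_gain, fourth_gain = 0.0, -atten_db   # keep first channel, attenuate second
    elif first_energy < second_energy:
        third_gain, fourth_gain = -atten_db, 0.0   # keep second channel, attenuate first
    else:
        third_gain = fourth_gain = 0.0             # equal energy: no relative change
    return third_gain, fourth_gain

def apply_gain_db(samples, gain_db):
    # Convert a dB gain into a linear factor and scale the samples.
    return np.asarray(samples, dtype=float) * (10.0 ** (gain_db / 20.0))
```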
Finally, the first channel first gain audio data is merged with the first channel initial background audio data of the third background audio data and subjected to sound effect enhancement processing to obtain first channel first enhanced audio data; and the second channel first gain audio data is merged with the second channel initial background audio data of the third background audio data and subjected to sound effect enhancement processing to obtain second channel first enhanced audio data.
By obtaining the first energy value of the first channel initial target audio data and the second energy value of the second channel initial target audio data of the third target audio data, the energy relationship between the two channels can be analyzed, and different gain processing can be applied to the first channel initial target audio data and the second channel initial target audio data according to that relationship. As a result, the audio data of the higher-energy channel is made more prominent, the sense of direction of the sound is better conveyed, and the sound effect enhancement is improved.
It should be noted that, in the case that the third target audio data includes audio data of more channels, the processing procedure is similar to this, and is not described herein again.
The audio output interface 270 includes: a first output interface and a second output interface; the first output interface is configured to: outputting first channel first enhancement audio data; the second output interface is configured to: outputting the second channel first enhanced audio data.
In some embodiments, the third target audio data and the third background audio data may also be gain-processed in consideration of the sound control mode, the first energy value, and the second energy value at the same time. A controller 250 further configured to: and determining a fifth gain and a sixth gain corresponding to the single third target audio data according to the sound control mode corresponding to the display device, the first energy value and the second energy value. The fifth gain and the sixth gain are gains corresponding to the first channel initial target audio data and the second channel initial target audio data of the third target audio data, respectively. The fifth gain and the sixth gain may be different.
Determining a seventh gain according to a sound control mode corresponding to the display device; the seventh gain is used for performing gain processing on the first channel initial background audio data and the second channel initial background audio data, that is, performing the same gain processing on the first channel initial background audio data and the second channel initial background audio data.
And then, performing gain processing on the initial target audio data of the first channel according to a fifth gain to obtain second gain audio data of the first channel, namely the audio data of the first channel after the gain processing. Performing gain processing on the second channel initial target audio data according to the sixth gain to obtain second channel second gain audio data, namely the second channel audio data after the gain processing; and respectively performing gain processing on the first channel initial background audio data and the second channel initial background audio data according to the seventh gain to obtain first channel gain background audio data (namely the background audio data of the first channel after the gain processing) and second channel gain background audio data (namely the background audio data of the second channel after the gain processing).
It should be noted that the first channel second gain audio data and the first channel first gain audio data are both first channel audio data obtained by performing gain processing on the first channel initial target audio data, and the difference is that corresponding gain values are different during the gain processing. Similarly, the second channel second gain audio data and the second channel first gain audio data are both second channel audio data obtained by performing gain processing on the second channel initial target audio data, and the difference is that the corresponding gain values are different during the gain processing.
The audio output interface 270 includes: a first output interface and a second output interface; the first output interface is configured to: outputting the first channel second enhanced audio data; the second output interface is configured to: outputting second channel second enhanced audio data.
In some embodiments, the controller 250 is configured to: determining the type of a sound effect enhancement mode corresponding to the first audio data according to a sound control mode corresponding to the display equipment; and determining the energy magnitude relation of the left channel and the right channel according to the first energy value of the first channel initial target audio data and the second energy value of the second channel initial target audio data. Determining a fifth gain and a sixth gain corresponding to the type of the sound effect enhancement mode and the relationship between the left channel energy and the right channel energy according to the sound control mode, the first energy value and the second energy value corresponding to the display equipment; and determining a seventh gain corresponding to the type of the sound effect enhancement mode and the relation between the energy of the left channel and the energy of the right channel according to the sound control mode corresponding to the display equipment.
The types of the sound effect enhancement modes are different, and the gain processing modes for the third target audio data and the third background audio data are different. The left and right channel energy magnitude relations are different, and the gain processing modes of the first channel initial target audio data and the second channel initial target audio data are also different. The type of the sound-effect enhancement mode is used for determining whether to enhance the third target audio data or the third background audio data, and the left-right channel energy magnitude relation is used for determining whether to enhance the first channel initial target audio data or the second channel initial target audio data. Therefore, different types of the sound-effect enhancement modes and the relationship between the energy levels of the left channel and the right channel correspond to different fifth gain, sixth gain and seventh gain.
For example, if the type of the sound-effect enhancement mode is the sound enhancement mode, the fifth gain and the sixth gain are both larger than the seventh gain, and if the first energy is larger than the second energy, the fifth gain is larger than the sixth gain. The fifth gain may be equal to the sixth gain if the first energy is equal to the second energy. The fifth gain is less than the sixth gain if the first energy is less than the second energy.
If the type of the sound-effect enhancement mode is the background enhancement mode, both the fifth gain and the sixth gain are smaller than the seventh gain, and if the first energy is larger than the second energy, the fifth gain is larger than the sixth gain. The fifth gain may be equal to the sixth gain if the first energy is equal to the second energy. The fifth gain is less than the sixth gain if the first energy is less than the second energy.
In some embodiments, in the sound enhancement mode, the third value T may be greater than 1. Assume the first energy value is PL and the second energy value is PR.

If PL is greater than PR, the fifth gain may be equal to 0 dB, and the sixth gain and the seventh gain may each be less than 0 dB. For example, with the fifth gain G1L equal to 0 dB, the sixth gain and the seventh gain may each be expressed by a formula that is reproduced only as an image in the original publication.

If the third value T is greater than 1 and PL is less than or equal to PR, the sixth gain is equal to 0 dB, and the fifth gain and the seventh gain are both less than 0 dB. For example, with the sixth gain G1R equal to 0 dB, the fifth gain and the seventh gain may each be expressed by a formula that is reproduced only as an image in the original publication.

If the third value T is less than or equal to 1 and PL is greater than PR, the fifth gain and the sixth gain are both less than 0 dB, and the seventh gain is equal to 0 dB. For example, the fifth gain may be expressed as the following equation:

G1L = 20 × log T (8)

the sixth gain may be expressed by a formula that is reproduced only as an image in the original publication, and the seventh gain G2 = 0 dB.

If the third value T is less than or equal to 1 and PL is less than or equal to PR, the fifth gain and the sixth gain are both less than 0 dB, and the seventh gain is equal to 0 dB. For example, the fifth gain may be expressed by a formula that is reproduced only as an image in the original publication, the sixth gain may be expressed as the following equation:

G1R = 20 × log T (11)

and the seventh gain G2 = 0 dB.

Where x is between (0.5, 1), f(x) > x; where x is between (0, 0.5), f(x) < x; and where x equals 0.5, f(x) = 0.5. Referring to fig. 10, fig. 10 is a schematic diagram of the function f(x) in some embodiments of the present application; it can be seen that the trend of f(x) with x satisfies the above relationship. The trend of f(x) with x is not limited to this and may be, for example, exponential, parabolic, or a combination of several forms, as long as the above relationship is satisfied.
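As an illustration only, one candidate curve satisfying these constraints is a smoothstep-style polynomial; the application only constrains the trend of f(x), not this particular form:

```python
import numpy as np

def f(x):
    # Smoothstep-style candidate: 3x^2 - 2x^3.
    # For x in (0.5, 1) it returns a value above x, for x in (0, 0.5) a value
    # below x, and f(0.5) == 0.5, matching the stated trend of f(x).
    x = np.asarray(x, dtype=float)
    return 3.0 * x**2 - 2.0 * x**3

# Quick check of the stated relationship:
# f(0.25) ≈ 0.156 < 0.25, f(0.5) == 0.5, f(0.75) ≈ 0.844 > 0.75
```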
The manner of determining the fifth gain, the sixth gain, and the seventh gain is not limited to this, and for example, a simple modification of the above formula may be used. The fifth gain, the sixth gain, and the seventh gain may be equal to or greater than 0 dB.
The controller 250 is configured to: combining the first channel second gain audio data with the first channel gain background audio data, and performing sound effect enhancement processing to obtain and output first channel second enhancement audio data; and combining the second channel second gain audio data and the second channel gain background audio data, and performing sound effect enhancement processing to obtain and output second channel second enhanced audio data.
The method and the device can also determine the gain values corresponding to the first channel initial target audio data and the second channel initial target audio data by simultaneously considering the sound control mode and the energy relationship between the first channel initial target audio data and the second channel initial target audio data, thereby further improving the effect of sound effect enhancement.
As mentioned above, the voice separation algorithm usually relies on artificial intelligence technology. When the audio is first processed by this artificial intelligence technology and then by the sound effect enhancement, the total processing time may become long, so that the sound is output at the speaker later than the corresponding image is displayed, i.e., the sound and the picture become out of sync. In order to solve this problem, the application also provides the following solution.
The implementation of this scheme in the android system can be as shown in fig. 11A. The android system mainly includes an application layer, a middleware, and a core layer, and the implementation logic can be in the middleware. The middleware includes: an audio decoder, a sound separation module, a sound effect enhancement module, a gain control module, a time delay module, a merging module, and an audio output interface. The audio decoder is used for performing audio decoding processing on a signal source input through a broadcast signal, a network, a USB, an HDMI, or the like to obtain audio data. The sound separation module is configured to perform sound separation on the decoded audio data; for example, a human voice audio can be separated by a human voice separation method. The sound effect enhancement module is used for performing sound effect enhancement processing on the decoded audio data. The gain control module can acquire the sound control mode set by the user for the display device and perform different gain processing on the separated audio and the sound-effect-enhanced audio, respectively. Since the time consumed by sound separation and the time consumed by sound effect enhancement are usually different, the time delay module can perform delay processing on the two gain-processed audio data. The merging module is used for merging the two gain-processed audio data to obtain merged audio data. The audio output interface is used for outputting the merged audio data.
It should be noted that the above implementation logic may also be implemented in the core layer instead of the middleware. Alternatively, it may be split between the middleware and the core layer; for example, the audio decoder and the sound separation module may be implemented in the middleware and the remaining modules may be implemented in the core layer.
Corresponding to fig. 11A, fig. 11B is a schematic diagram of an audio processing method according to some embodiments of the present application. After the audio decoder decodes the acquired sound signal, first audio data may be obtained. The sound separation module can realize sound separation of the first audio data through a pre-trained neural network model by an AI technology to obtain first target audio data. The first target audio data may be a human voice, a car voice, or the like. Meanwhile, second audio data can be obtained after the sound effect enhancement processing is carried out on the first audio data. The gain control module can obtain a first gain and a second gain according to the sound control mode, and the values of the first gain and the second gain are not equal. And performing gain processing on the first target audio data according to the first gain to obtain second target audio data, and performing gain processing on the second audio data according to the second gain to obtain third audio data. And determining to delay the second target audio data or delay the third target audio data according to the time length consumed by the sound separation module and the time length consumed by the sound effect enhancement module. Thereafter, the second target audio data and the third audio data are combined.
It can be seen that only one sound, i.e., the first target audio data, can be separated by sound separation without separating the background sound, thereby reducing the time consumed for sound separation. And the sound separation and the sound effect enhancement are processed in parallel instead of in series, so that the time consumed by the whole audio processing flow can be further shortened, and the effect of sound and picture synchronization is improved.
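A rough sketch of this parallel arrangement is shown below (Python). Here separate and enhance are placeholders standing in for the sound separation model and the sound effect enhancement chain, and the delay alignment described later is omitted:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_first_audio(first_audio, separate, enhance, first_gain_db, second_gain_db):
    # Run sound separation and sound effect enhancement in parallel rather
    # than in series, gain each branch, then merge the two results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        sep = pool.submit(separate, first_audio)    # -> first target audio data
        enh = pool.submit(enhance, first_audio)     # -> second audio data
        first_target = np.asarray(sep.result(), dtype=float)
        second_audio = np.asarray(enh.result(), dtype=float)
    second_target = first_target * 10 ** (first_gain_db / 20)   # gained separated sound
    third_audio = second_audio * 10 ** (second_gain_db / 20)    # gained enhanced mix
    n = min(len(second_target), len(third_audio))
    return second_target[:n] + third_audio[:n]                  # merged (fourth) audio data
```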
Based on this, some embodiments of the present application further provide a display apparatus 200 including:
the controller 250, may be further configured to: and respectively carrying out sound separation and effect enhancement processing on the acquired first audio data to obtain first target audio data and second audio data.
The first audio data refers to audio data containing at least two mixed sounds, and for example, the first audio data may include human voice, background music, and the like. The first target audio data generally refers to audio data that the user wants to enhance, and may be human voice or other sounds, for example, suitable for watching movies, videos, listening to music, and the like. The voice can be separated through the voice separation model trained in advance, and at the moment, the first target audio data is the voice. Or the first audio data includes various mixed sounds such as human voice, car voice, gunshot sound, background music and the like, the car voice can be separated through a car voice separation model which is trained in advance, and at this time, the first target audio data is the car voice. In the sound separation process, only one kind of sound (the first target audio data) may be separated. The time consumed for the separation process can be reduced compared to separating out a plurality of sounds.
The method and the device can also carry out sound effect enhancement processing on the first audio data. In order to reduce the total audio processing time, the sound effect enhancement and the sound separation can be processed in parallel instead of in series, which further shortens the time consumed by the whole audio processing flow and thus improves sound and picture synchronization. The sound effect enhancement algorithm includes, but is not limited to, automatic gain control, dynamic range control, equalization, virtual surround, etc.
Performing gain processing on the first target audio data according to the first gain to obtain second target audio data; and performing gain processing on the second audio data according to the second gain to obtain third audio data, wherein the first gain and the second gain are determined according to a sound control mode corresponding to the display equipment. And respectively carrying out gain processing on the first target audio data and the second audio data through different gains so as to improve the overall effect of sound effect enhancement.
In some embodiments, the display device corresponds to a plurality of preset sound definition control modes and/or a plurality of preset sound effect modes; each preset sound definition control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value. The user can adjust the sound control mode of the display device according to the needs and preferences of the user. After the display device acquires the sound control mode set by the user, the sound control mode corresponding to the display device comprises the following steps: a target sound definition control mode and/or a target sound effect mode; the target sound definition control mode is one of multiple preset sound definition control modes, and the target sound effect mode is one of multiple preset sound effect modes. Therefore, the first gain and the second gain are determined according to the first value corresponding to the target sound definition control mode and/or the second value corresponding to the target sound effect mode, wherein the first gain can be larger than the second gain.
As previously mentioned, the first target audio data generally refers to the audio data that the user wants to enhance. Therefore, in the case where the types of the sound effect enhancement mode include the sound enhancement mode and the background enhancement mode, this applies to the sound enhancement mode scenario. Under normalization, a third value is obtained from the first value and the second value, and the first target audio data is enhanced when the third value is greater than 1. In some embodiments, the third value T may be expressed as T = (2 × M1) × (2 × M2); it will be appreciated that the values of M1 and M2 in the standard mode are different, so the expression for the third value T may also be different.
In some embodiments, to ensure that no positive gain is applied (which could cause the audio signal to clip), the first gain and the second gain may be equal to or less than 0 dB. For example, the first gain may be set to 0 dB, and the second gain may be determined according to the first value corresponding to the target sound definition control mode and/or the second value corresponding to the target sound effect mode, so that the second gain is smaller than 0 dB. It should be noted that the determination of the first gain and the second gain may refer to the description in the foregoing embodiments and is not repeated here.
Since the sound separation and the sound effect enhancement of the first audio data can be processed in parallel, and the time consumed by the sound separation usually differs from the time consumed by the sound effect enhancement, directly merging the second target audio data and the third audio data would leave the two sound signals misaligned and cause an echo-like artifact.
In order to solve the problem, the second target audio data or the third audio data may be subjected to delay processing so as to synchronize the second target audio data and the third audio data; and merging the second target audio data and the third audio data to obtain fourth audio data. Therefore, the problems of echo and the like caused by the fact that the sound signals cannot be overlapped can be avoided.
An audio output interface 270 configured to: outputting the fourth audio data.
In some embodiments, the controller 250 is configured to: acquiring a first time length consumed during sound separation and a second time length consumed during sound effect enhancement processing; and carrying out time delay processing on the second target audio data or the third audio data according to the first time length and the second time length. That is, the time length consumed for the sound separation and the sound effect enhancement processing may be directly counted, and if the time length consumed for the sound separation is short, the delay processing may be performed on the second target audio data; if the time length consumed by the sound effect enhancement processing is short, the delay processing can be performed on the third audio data, and finally the second target audio data and the third audio data are synchronized.
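Assuming the two processing durations can be measured directly, the delay compensation might look like the following sketch; the measured durations and the sample rate are inputs supplied by the caller:

```python
import numpy as np

def align_by_latency(second_target, third_audio, sep_ms, enh_ms, sample_rate):
    # Pad the branch that finished earlier so both branches line up in time.
    gap = int(round(abs(sep_ms - enh_ms) * sample_rate / 1000.0))
    pad = np.zeros(gap, dtype=np.asarray(second_target).dtype)
    if sep_ms < enh_ms:
        second_target = np.concatenate([pad, second_target])  # separation was faster
    else:
        third_audio = np.concatenate([pad, third_audio])      # enhancement was faster
    return second_target, third_audio
```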
When the operation units for sound separation and sound effect enhancement are dedicated, or the system resources are sufficient, the first time length and the second time length can be measured in advance and treated as one or more fixed values. In practical scenarios, however, the sound separation algorithm usually does not have a dedicated unit on the chip of the display device but shares the AI processor with image algorithms, so the running time of sound separation is often not a fixed value and fluctuates, measured in practice to be within ±20 ms. For the system architecture shown in fig. 6A, this fluctuation may affect sound and picture synchronization, but the sound-picture delay a human can usually tolerate is within ±30 ms, so the fluctuation is acceptable. In the system architecture shown in fig. 11A, however, the same sound is processed in two links and then merged, and an error of more than ±5 ms between the two copies of the same sound causes obvious sound quality problems; therefore, accurate alignment is required.
Since there is a case where the same sound is processed in both links in the system architecture shown in fig. 11A, there is a certain correlation between the first target audio data and the second audio data. In some embodiments, the controller 250 is configured to: determining a time difference between the first target audio data and the second audio data according to a correlation between the first target audio data and the second audio data; and performing time delay processing on the second target audio data or the third audio data according to the time difference.
In some cases, if the time duration consumed by the sound separation and the sound effect enhancement processing cannot be directly counted or the statistics are inaccurate, the correlation between the first target audio data and the second audio data may be analyzed. And determining the time difference between the first target audio data and the second audio data according to the correlation, and further performing delay processing.
In some embodiments, the correlation between the first target audio data and the second audio data may be compared by a time-domain window function. A controller 250 configured to: acquiring a first audio segment of first target audio data in a time period t, wherein the first audio segment can be an audio segment with any time length t; acquiring a second audio segment of second audio data in the time period t (namely, the time is the same as that of the first audio segment), a plurality of third audio segments before the second audio segment and a plurality of fourth audio segments after the second audio segment; and the time length corresponding to the third audio frequency band and the fourth audio frequency band is equal to the time length of the time period t.
Determining the correlation between the first audio segment and the second audio segment, the correlation between the first audio segment and the third audio segment and the correlation between the first audio segment and the fourth audio segment, and determining the audio segment with the highest correlation; the time difference between the audio piece with the highest correlation and the first audio piece is determined as the time difference between the first target audio data and the second audio data.
That is, a segment is cut from the first target audio data and recorded as w; meanwhile, the same window is used to cut several segments of the second audio data in the same time period, recorded as w(x). The convolution values of all data in w and each w(x) are calculated one by one to obtain the correlation between w and each w(x). The time difference between the w(x) with the highest correlation and w is determined as the time difference between the first target audio data and the second audio data.
Or, a section of the second audio data may be cut, and meanwhile, a plurality of sections of the first target audio data in the same time period are cut by using the same window, and correlation calculation is performed in the same manner as above to determine the time difference between the first target audio data and the second audio data.
It should be noted that the window width has a direct bearing on the accuracy of the delay calculation: with a window width of t, the calculation accuracy is also t, but the smaller t is, the larger the corresponding amount of computation. In addition, if point-by-point calculation over the data within t is too expensive, calculating on alternate points can cut the amount of computation in half; the appropriate accuracy can be chosen according to the computing capability of the processor.
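A simplified version of this windowed correlation search, under the assumption that both signals are plain sample arrays, could look like the sketch below; the alternate-point variant mentioned above is shown as a comment:

```python
import numpy as np

def estimate_time_offset(target, enhanced, start, win, max_shift):
    # Fix a window w of the separated target signal and slide an equally
    # sized window over the enhanced signal; the shift with the highest
    # correlation value estimates the time difference between the branches.
    target = np.asarray(target, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    w = target[start:start + win]
    best_shift, best_score = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        lo = start + shift
        if lo < 0 or lo + win > len(enhanced):
            continue
        s = enhanced[lo:lo + win]
        score = float(np.dot(w, s))                # correlation of w with this window
        # score = float(np.dot(w[::2], s[::2]))    # alternate-point variant, roughly half the work
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift  # in samples; positive means the enhanced branch lags the separated target
```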
In a common stereo television, the sounds of the left and right channels are separated independently. With the method shown in the system architecture of fig. 8, the two separated audio data are merged after gain processing with the first gain and the second gain, and are sent to the corresponding speakers after sound effect enhancement processing. Although this architecture is simple, the audio data of both the left and right channels must be run through the sound separation algorithm, and since the sound separation usually uses the same physical processor and the operations overlap in time, the requirement on the AI processing capability of the chip is high. It can be seen that reducing the amount of computation spent on sound separation determines whether the present solution can be applied to more display devices.
Referring to fig. 12, fig. 12 is a schematic diagram of an audio processing method according to some embodiments of the present application. As shown in fig. 12, besides each being subjected to sound effect enhancement processing and gain processing, the left channel audio data and the right channel audio data output from the audio decoder are also merged into one signal and subjected to sound separation, and the separated first target audio data is gain-processed. The sound signals of the two links are then subjected to delay processing, and the sound signal from the sound separation link is finally superimposed onto the left channel and the right channel of the sound effect enhancement link, respectively. In this way, the amount of computation for sound separation can be cut in half, making the scheme far more feasible in practice.
In some embodiments, the first audio data comprises first channel initial audio data and second channel initial audio data. That is, the first audio data may include two channels of audio data, for example, the first channel initial audio data and the second channel initial audio data may be left channel audio data and right channel audio data contained in the first audio data.
A controller 250 configured to: and respectively carrying out sound effect enhancement processing on the first channel initial audio data and the second channel initial audio data to obtain first channel sound effect enhanced audio data (namely the first channel audio data after sound effect enhancement) and second channel sound effect enhanced audio data (namely the second channel audio data after sound effect enhancement).
It should be noted that, in the sound separation process, the first audio data (i.e., the audio data obtained by combining the first channel initial audio data and the second channel initial audio data) may be directly subjected to sound separation to obtain the first target audio data, so that the amount of operation for sound separation is reduced by half.
The first target audio data can be subjected to gain processing according to the first gain to obtain second target audio data; and respectively carrying out gain processing on the first channel sound effect enhanced audio data and the second channel sound effect enhanced audio data according to the second gain to obtain first channel target audio data and second channel target audio data.
Carrying out time delay processing on the second target audio data or the first channel target audio data so as to synchronize the second target audio data with the first channel target audio data; and performing time delay processing on the second target audio data or the second channel target audio data to synchronize the second target audio data and the second channel target audio data.
Similarly, the duration of the sound separation and the duration of the sound enhancement processing are typically different, and therefore, the delay processing may be performed before the merging. In some embodiments of the present application, a first time duration consumed by sound separation, a second time duration consumed by performing sound effect enhancement processing on the first channel initial audio data, and a third time duration consumed by performing sound effect enhancement processing on the second channel initial audio data may also be counted. According to the first time length and the second time length, carrying out time delay processing on the second target audio data or the first channel target audio data; and carrying out time delay processing on the second target audio data or the second channel target audio data according to the first time length and the third time length.
Or, the correlation between the first target audio data and the first channel sound effect enhanced audio data may also be determined, and the second target audio data or the first channel target audio data may be subjected to delay processing according to the correlation; and determining the correlation between the first target audio data and the second channel sound effect enhanced audio data, and carrying out time delay processing on the second target audio data or the second channel target audio data according to the correlation.
It can be understood that the second time period consumed for performing the sound effect enhancement processing on the first channel initial audio data and the third time period consumed for performing the sound effect enhancement processing on the second channel initial audio data are generally equal to each other, or have a small difference, which can be ignored. Therefore, in order to reduce the amount of computation, only the time consumed by one of the sound effect enhancement processes may be counted. Alternatively, the correlation between the first target audio data and the first channel effect-enhanced audio data (second channel effect-enhanced audio data) may be determined.
Then, merging the second target audio data with the first channel target audio data and the second channel target audio data respectively to obtain first channel merged audio data and second channel merged audio data;
the audio output interface 270 includes: a first output interface and a second output interface; the first output interface is configured to: outputting the first channel-merged audio data; the second output interface is configured to: and outputting the second channel merged audio data.
As described above, the sound separation can be realized by the artificial intelligence technique, and in the case where the first audio data includes the first channel initial audio data and the second channel initial audio data, if the sound separation and the effect enhancement processing are respectively performed on both the first channel initial audio data and the second channel initial audio data, the sound separation consumes a large amount of computation, and thus, the requirement on the processing capability of the chip in the display device is high. In order to solve the problem, the first channel initial audio data and the second channel initial audio data may be merged, that is, the first audio data is directly subjected to sound separation, and the separated first target audio data is subjected to gain processing to obtain the second target audio data. And merging the second target audio data with the first channel target audio data and the second channel target audio data respectively. Therefore, the calculation amount of sound separation can be reduced by half, so that the scheme can be realized under the condition that the processing capacity of the chip is not very high, and the applicability of the scheme is improved.
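A sketch of this downmix-then-separate arrangement is shown below, again with separate and enhance as placeholders for the separation model and the effect chain; the delay alignment discussed above is omitted for brevity:

```python
import numpy as np

def separate_once_for_stereo(left, right, separate, enhance, first_gain_db, second_gain_db):
    # Merge the two channels so the AI separation runs only once, then add
    # the gained voice back onto each effect-enhanced channel.
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    mono = 0.5 * (left + right)                              # merged first audio data
    voice = separate(mono) * 10 ** (first_gain_db / 20)      # second target audio data
    out_left = enhance(left) * 10 ** (second_gain_db / 20)   # first channel target audio data
    out_right = enhance(right) * 10 ** (second_gain_db / 20) # second channel target audio data
    n = min(len(voice), len(out_left), len(out_right))
    return out_left[:n] + voice[:n], out_right[:n] + voice[:n]
```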
With the improvement of the AI computing power of chips, machine learning is widely applied in the image and sound fields, and combinations of the two appear in many scenarios. The application also provides a solution for improving the stereo effect of sound. The implementation in the android system can be as shown in fig. 13A. The android system mainly includes an application layer, a middleware, and a core layer, and the implementation logic can be in the middleware. The middleware may include: an audio decoder, an image decoder, a voice separation module, a sound distribution module, a merging module, a sound effect enhancement module, and an audio output interface. The audio decoder is used for performing audio decoding processing on a signal source input through a broadcast signal, a network, a USB, an HDMI, or the like to obtain audio data. The voice separation module is used for respectively performing voice separation on the decoded left channel audio data and the decoded right channel audio data to obtain left channel voice audio data, left channel background audio data, right channel voice audio data and right channel background audio data. The sound distribution module is used for performing lip motion detection on the image decoded and output by the image decoder so as to determine the weight of the human voice audio and the weight of the background audio output by each audio output interface. The merging module is used for merging the human voice audio and the background audio according to the weight of the human voice audio and the weight of the background audio to obtain merged audio data. The sound effect enhancement module is used for performing sound effect enhancement processing on the merged audio data to obtain audio data after sound effect enhancement. The audio output interface is used for outputting the audio data after sound effect enhancement.
It should be noted that the above implementation logic may also be implemented in the core layer instead of the middleware. Alternatively, it may be split between the middleware and the core layer; for example, the audio decoder and the voice separation module may be implemented in the middleware and the other modules may be implemented in the core layer.
Fig. 13B is a schematic diagram of an audio processing method according to some embodiments of the present application, corresponding to fig. 13A. The audio decoder can decode and output left channel audio data and right channel audio data, and can respectively perform voice separation on the left channel audio data and the right channel audio data to obtain left channel voice audio data, left channel background audio data, right channel voice audio data and right channel background audio data. For example, the vocal separation of the left channel audio data and the vocal separation of the right channel audio data can be achieved through the pre-trained neural network model through the AI technique. And merging the left channel voice audio data and the right channel voice audio data to obtain the target voice audio data.
Meanwhile, the image decoder can decode to obtain images of the time of the left channel audio data and the right channel audio data, perform lip movement detection on the images, and determine the weight of the target human voice audio data at each audio output interface according to the lip movement detection result. And determining the weight of the audio output interface outputting the left channel background audio data and the right channel background audio data according to the coordinates of the audio output interface. And then, according to the weight of the target human voice audio data at each audio output interface, the audio output interface outputs the weight of the left channel background audio data and the right channel background audio data, and the human voice audio and the background audio are combined. And finally, performing sound effect enhancement processing on the combined audio and outputting the audio.
It can be seen that, for a stereo display device, after the voice separation is performed on the left channel audio data and the right channel audio data, the separated left channel voice audio data and right channel voice audio data can be merged. Then, according to the position where the person is speaking in the image, the human voice weight corresponding to each audio output interface, i.e., the weight with which the human voice audio is output, is adjusted; and according to the positions of the audio output interfaces, the weight of the background audio output by each audio output interface is adjusted. This enhances the stereoscopic impression of the sound and improves the user's viewing experience.
In some embodiments of the present application, a display device 200 includes: a controller 250 and a plurality of audio output interfaces 270;
a controller 250 configured to: and respectively carrying out voice separation on the acquired first channel audio data and second channel audio data to obtain first channel first person voice audio data and first channel first background audio data, and second channel first person voice audio data and second channel first background audio data.
The first channel audio data and the second channel audio data are audio data of two different channels acquired at the same time, and the first channel audio data and the second channel audio data can enable sound to have a stereoscopic impression. For example, the first channel audio data and the second channel audio data may be left channel audio data and right channel audio data, respectively.
For the first channel audio data, the first channel first person audio data and the first channel first background audio data may be obtained through person sound separation (e.g., artificial intelligence technology). The first channel first person audio data refers to a person voice in the first channel audio data, and the number of the first channel first person audio data may be multiple, that is, the person voices of multiple persons may be extracted. The audio data excluding the first-channel first-person audio data is first-channel first background audio data. Similarly, the second channel audio data may be subjected to voice separation to obtain second channel first-person audio data and second channel first background audio data.
And merging the first-person audio data of the first sound channel and the first-person audio data of the second sound channel to obtain the target person sound audio data.
In some embodiments of the present application, the separated first channel first person voice audio data and second channel first person voice audio data are not directly assigned to the first channel and the second channel for merging with the background audio; instead, they are first merged directly to obtain the target person voice audio data. The output of the target person voice audio data at each audio output interface is then allocated according to the position where the person is speaking in the image.
It should be noted that, if the human voice audio of a plurality of people is included, for each person, the first channel first human voice audio data and the second channel first human voice audio data corresponding to the person are merged to obtain the target human voice audio data of the person. The distribution method of the target voice audio data of each person is similar, and the target voice audio data of one person is taken as an example for explanation.
A controller 250 configured to: and if the lip motion coordinate in the screen of the display equipment is detected, determining the human voice weight corresponding to the audio output interface according to the lip motion coordinate and the coordinate of the single audio output interface.
In the display device, in addition to the audio data decoded by the audio decoder, the image decoder may also decode corresponding image data. Under the condition of sound-picture synchronization, image data corresponding to audio can be acquired simultaneously. Here, image data at the time when the first-channel audio data and the second-channel audio data are present may be acquired.
In the case where the human voice audio is extracted by the human voice separation, the image data usually has a corresponding human image. Therefore, lip movement detection can be performed on the image data to obtain lip movement coordinates, that is, position coordinates of the lips of the person. For example, the presence or absence of lip information and the presence or absence of lip motion in the image data may be detected by artificial intelligence techniques. If there are lips that are moving, lip motion coordinates can be detected.
The lip movement coordinates indicate the location in the image where the character is speaking in the screen, while the coordinates of the plurality of audio output interfaces represent the location where the audio is output. It can be understood that, when the lip motion coordinate is closer to the audio output interface, the voice weight corresponding to the audio output interface is also larger. The larger the human voice weight is, the larger the energy of the human voice audio output by the audio output interface is.
In some embodiments, the controller 250 is configured to: for each audio output interface, determining a corresponding area of the audio output interface in the screen according to the coordinates of the audio output interface; if the lip motion coordinate is located in the area corresponding to the audio output interface, determining the human voice weight corresponding to the audio output interface as a first numerical value; and if the lip motion coordinate is positioned outside the area corresponding to the audio output interface, determining the human voice weight corresponding to the audio output interface as a second numerical value, wherein the second numerical value is smaller than the first numerical value.
In some embodiments of the present application, corresponding regions may be divided in the screen for each audio output interface in advance according to coordinates of each audio output interface. It can be understood that, when the lip motion coordinate is closer to the region corresponding to the audio output interface, the human voice weight corresponding to the audio output interface is also larger.
For example, the screen is divided into a left area and a right area, and the lower left and the lower right of the screen each include a speaker. The lip motion coordinates may be the pixel coordinates (x, y) of the actual point; assume the played video has a row resolution of L and a column resolution of C. The lip motion coordinates can then be normalized as follows:

x' = x ÷ C, y' = y ÷ L (12)

If x' is less than 0.5, the lip motion coordinate is in the left area; if x' is greater than 0.5, it is in the right area.
If the lip movement coordinate is in the left area of the screen, the human voice weight corresponding to the speaker at the lower left of the screen and the human voice weight corresponding to the speaker at the lower right of the screen may be set to 1 and 0, respectively, that is, the target human voice audio data is output through the speaker at the lower left of the screen, and the target human voice audio data is not output through the speaker at the lower right of the screen. Alternatively, the human voice weight corresponding to the speaker at the lower left of the screen and the human voice weight corresponding to the speaker at the lower right of the screen may be set to 0.8 and 0.2, respectively, and may be determined specifically with reference to the specific position of the lip motion coordinate in the left region. The closer the lip movement coordinate is to the left side of the left area, the larger the difference between the human voice weight corresponding to the loudspeaker at the lower left of the screen and the human voice weight corresponding to the loudspeaker at the lower right of the screen is; the closer the lip motion coordinate is to the right of the left region, i.e., to the middle of the screen, the smaller the difference between the vocal weight corresponding to the speaker at the lower left of the screen and the vocal weight corresponding to the speaker at the lower right of the screen.
Referring to fig. 14, fig. 14 is a schematic diagram of a speaker distribution. It can be seen that the display device includes four speakers, at the lower left, lower right, upper left and upper right of the screen, respectively. The corresponding areas of the four speakers in the screen are shown in fig. 14: the lower left area, the lower right area, the upper left area and the upper right area of the screen. If the lip motion coordinate is located in the upper left area, the human voice weights corresponding to the four speakers at the lower left, lower right, upper left and upper right can be respectively: 0, 0, 1 and 0. Alternatively, the human voice weights corresponding to the four speakers at the lower left, lower right, upper left and upper right may be 0.2, 0, 0.8 and 0, etc., so that the final subjective auditory impression is located at the upper left of the screen.
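As a toy illustration of the four-speaker case, the weight assignment could be sketched as follows. The image coordinate origin is assumed to be at the top-left corner, and the hard 0/1 split could be softened to values such as 0.8/0.2 as described above:

```python
def vocal_weights_from_lip_motion(x, y, columns, rows):
    # Normalise the lip-motion pixel coordinate (cf. equation (12)) and give
    # the full vocal weight to the speaker whose screen quadrant contains it.
    x_norm, y_norm = x / columns, y / rows
    horizontal = "left" if x_norm < 0.5 else "right"
    vertical = "upper" if y_norm < 0.5 else "lower"   # origin assumed at top-left
    weights = {"lower_left": 0.0, "lower_right": 0.0,
               "upper_left": 0.0, "upper_right": 0.0}
    weights[f"{vertical}_{horizontal}"] = 1.0
    return weights

# Example: a lip-motion coordinate in the upper-left quadrant
# -> {"lower_left": 0.0, "lower_right": 0.0, "upper_left": 1.0, "upper_right": 0.0}
```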
In some embodiments, the screen comprises: a middle region and a non-middle region. A controller 250 configured to: and if the lip motion coordinate is positioned in the non-middle area, determining the human voice weights respectively corresponding to the plurality of audio output interfaces according to the lip motion coordinate and the coordinates of the plurality of audio output interfaces. That is, the human voice weights corresponding to the plurality of audio output interfaces may be determined according to the above method.
And if the lip movement coordinate is positioned in the middle area, determining the human voice weights respectively corresponding to the plurality of audio output interfaces according to the coordinate of the plurality of audio output interfaces and the attribute information of the plurality of audio output interfaces, wherein the attribute information comprises the volume and/or the orientation. That is, when the lip movement coordinate is located in the middle area of the screen, the voice weight corresponding to each audio output interface can be flexibly configured according to the volume, orientation, position relationship and the like of the audio output interface, so that the final effect is preferably that the subjective listening sensation is located at the center of the screen.
For example, for the speakers shown in FIG. 14, the speakers below the screen are oriented downward and the speakers above the screen are oriented upward. On the basis of the orientation, the larger the volume of the speaker is, the smaller the vocal gain corresponding to the speaker is, and the smaller the volume of the speaker is, the larger the vocal gain corresponding to the speaker is. In this way, the subjective sense of hearing can be centered on the screen. Or, if the volume of the four speakers is the same, the human voice gains corresponding to the four speakers may be the same.
If the distribution of the plurality of speakers around the screen is not uniform, and the orientation of each speaker is not directly below or above, the human voice weight can be determined according to the position relationship, the orientation and the volume of the specific reference plurality of speakers, so that the subjective auditory sensation can be positioned in the middle of the screen. It is understood that the human voice weight corresponding to each speaker may include a variety of different situations.
A controller 250 configured to: and determining the first channel first background audio data and/or the second channel first background audio data corresponding to the audio output interface according to the coordinates of the audio output interface.
For the background audio data, since the background audio data is not related to human voice, it can be directly determined according to the coordinates of the audio output interface whether the audio output interface outputs the first background audio data of the first channel, the first background audio data of the second channel, the first background audio data of the first channel and the first background audio data of the second channel.
In some embodiments, the screen comprises a left area and a right area. If the coordinates of the audio output interface correspond to the left area, it is determined that the audio output interface corresponds to the first channel first background audio data; if the coordinates of the audio output interface correspond to the right area, it is determined that the audio output interface corresponds to the second channel first background audio data. If the lower left and the lower right of the screen each include a speaker corresponding to the left and right areas respectively, the speaker at the lower left of the screen may output the first channel first background audio data and the speaker at the lower right of the screen may output the second channel first background audio data.
In some embodiments, the screen comprises: a left region, a middle region, and a right region; a controller 250 configured to: if the coordinates of the audio output interface correspond to the left area, determining that the audio output interface corresponds to first background audio data of a first sound channel; if the coordinates of the audio output interface correspond to the right area, determining that the audio output interface corresponds to second channel first background audio data; and if the coordinates of the audio output interface correspond to the middle area, determining that the audio output interface corresponds to the first channel first background audio data and the second channel first background audio data.
For example, the lower left, middle and lower right of the screen each include a speaker corresponding to the left, middle and right regions, the speaker at the lower left of the screen may output the first channel first background audio data, the speaker at the lower middle of the screen may output the first channel first background audio data and the second channel first background audio data at the same time, and the speaker at the lower right of the screen may output the second channel first background audio data.
A controller 250 configured to: and combining the product of the target human voice audio data and the human voice weight corresponding to the audio output interface, and the first background audio data of the first sound channel and/or the first background audio data of the second sound channel corresponding to the audio output interface, and performing sound effect enhancement processing to obtain the audio data corresponding to the audio output interface.
After the human voice audio (i.e., the product of the target human voice audio data and the human voice weight corresponding to the audio output interface) and the background audio (i.e., the first background audio data of the first channel and/or the first background audio data of the second channel) corresponding to each audio output interface are determined, the human voice audio and the background audio can be merged and subjected to sound effect enhancement processing to obtain the audio data corresponding to the audio output interface.
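The per-interface mix can be pictured as in the following sketch, where enhance is a placeholder for the sound effect enhancement chain and backgrounds holds whichever background channel(s) the interface was mapped to:

```python
import numpy as np

def audio_for_interface(target_voice, vocal_weight, backgrounds, enhance):
    # Weighted target voice plus the background channel(s) assigned to this
    # output interface, followed by sound effect enhancement.
    parts = [vocal_weight * np.asarray(target_voice, dtype=float)]
    parts += [np.asarray(bg, dtype=float) for bg in backgrounds]
    n = min(len(p) for p in parts)
    mixed = np.sum([p[:n] for p in parts], axis=0)
    return enhance(mixed)
```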
A single audio output interface 270 configured to: and outputting the audio data corresponding to the audio output interface.
In some embodiments, after performing the human voice separation on the left channel audio data and the right channel audio data, different gain processing may be performed on the human voice audio and the background audio to enhance the human voice audio or the background audio.
The controller 250 is further configured to: respectively performing gain processing on first channel first person audio data and second channel first person audio data according to the first gain to obtain first channel second person audio data and second channel second person audio data; respectively carrying out gain processing on the first channel first background audio data and the second channel first background audio data according to a second gain to obtain first channel second background audio data and second channel second background audio data; wherein the first gain and the second gain are determined according to a sound control mode corresponding to the display device.
It should be noted that the first channel first person audio data and the second channel first person audio data both belong to a person audio and may correspond to the same first gain, and the first channel first background audio data and the second channel first background audio data both belong to a background audio and may correspond to the same second gain.
In some embodiments, the display device corresponds to a plurality of preset sound definition control modes and/or a plurality of preset sound effect modes; each preset sound definition control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value; the sound control mode includes: a target sound definition control mode and/or a target sound effect mode; the target sound definition control mode is one of a plurality of preset sound definition control modes, and the target sound effect mode is one of a plurality of preset sound effect modes; a controller 250 configured to: and determining a first gain and a second gain according to a first numerical value corresponding to the target sound definition control mode and/or a second numerical value corresponding to the target sound effect mode.
It can be seen that the user can control the sound control mode of the display device according to his/her own preference, and further, the controller 250 can determine how to gain-process the first channel first-person sound audio data and the second channel first-person sound audio data and how to gain-process the first channel first-background audio data and the second channel first-background audio data according to the sound control mode.
It should be noted that, the determination method of the first gain and the second gain is the same as the determination method of the first gain and the second gain in the foregoing embodiment, and specific reference may be made to the description in the foregoing embodiment, which is not described herein again.
A controller 250 configured to: merging the first channel second voice audio data and the second channel second voice audio data to obtain target voice audio data; determining the first channel second background audio data and/or the second channel second background audio data corresponding to the audio output interface according to the coordinates of the audio output interface aiming at each audio output interface; and combining the product of the target human voice audio data and the human voice weight corresponding to the audio output interface, and the first channel second background audio data and/or the second channel second background audio data corresponding to the audio output interface, and performing sound effect enhancement processing to obtain the audio data corresponding to the audio output interface.
In some embodiments, no person is included in the image data, or even if a person is included in the image data, the lips of the person are not displayed, such as displaying only the side face of the person, the back of the person, and so on. Alternatively, even if the lips of the person are displayed, the lips of the person are not moving, and the lip motion coordinates cannot be detected. A controller 250 further configured to: if the lip movement coordinate is not detected, for each audio output interface, the human sound weights respectively corresponding to the audio output interfaces can be directly determined according to the ratio of the energy of the first-person audio data of the first channel to the energy of the first-person audio data of the second channel and the coordinate of the audio output interface.
For example, if the lower left and lower right of the screen each include a speaker and the ratio of the energy of the left channel vocal audio data to the energy of the right channel vocal audio data is greater than 1, the vocal weight corresponding to the speaker located at the lower left of the screen may be greater than the vocal weight corresponding to the speaker located at the lower right of the screen. If the ratio of the energy of the left channel vocal audio data to the energy of the right channel vocal audio data is 0.6:0.4, the vocal weight corresponding to the speaker at the lower left of the screen may be 0.6, and the vocal weight corresponding to the speaker at the lower right of the screen may be 0.4. Alternatively, in order to enhance the sense of orientation of the sound, the human voice weight for the speaker at the lower left of the screen may be 0.7, and the human voice weight for the speaker at the lower right of the screen may be 0.3.
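A minimal sketch of how such weights could be derived is given below, assuming per-period energies computed as sums of squared samples; the optional emphasis offset that turns 0.6/0.4 into 0.7/0.3 is an illustrative parameter, not part of the embodiment.

```python
import numpy as np

def vocal_weights(left_vocal, right_vocal, emphasis=0.1):
    """Split the merged vocal between the lower-left and lower-right speakers
    according to the energy ratio of the separated channel vocal data."""
    e_left = float(np.sum(np.square(left_vocal)))
    e_right = float(np.sum(np.square(right_vocal)))
    total = e_left + e_right
    if total == 0.0:
        return 0.5, 0.5                          # no vocal energy: split evenly
    w_left = e_left / total                      # e.g. 0.6 when the ratio is 0.6:0.4
    w_right = 1.0 - w_left
    if w_left > w_right:                         # exaggerate the dominant side to
        w_left = min(w_left + emphasis, 1.0)     # strengthen the sense of direction
        w_right = 1.0 - w_left
    elif w_right > w_left:
        w_right = min(w_right + emphasis, 1.0)
        w_left = 1.0 - w_right
    return w_left, w_right
```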
Currently, the karaoke function of a television is usually provided by a singing APP. A singing APP has rich functions and a good user experience, but its media resources are limited. For example, the original singer A of a song is male, and a cover singer B is female. When a female user C wants to sing the song, the singing APP may contain only the accompaniment video of original singer A and no accompaniment video of singer B, so a suitable accompaniment cannot be found. Alternatively, the human voice in a stereo song can be eliminated by subtracting the two channels. However, this method sometimes loses the bass in the song, and the resulting accompaniment is weak and lacks the feel of a live accompaniment, giving a poor user experience.
Therefore, some embodiments of the present application further provide a technical solution in which the human voice in a playing song is removed through voice separation, so that the user can find a favorite song without relying on a singing APP, for example by playing a familiar song through an online music player, or playing audio/video content purchased by the user through the television. Then, when the voice-elimination function is turned on, the original vocal in the audio can be removed, enabling singing without the limitation of media resources. Meanwhile, the original vocal can be fully or partially added back to the accompaniment according to the energy of the singer's voice collected by the microphone, so that a singer with a modest singing level is not left with a poor singing experience.
The implementation of this technical scheme in the android system can be as shown in fig. 15A, where the android system mainly includes an application layer, a middleware, and a core layer, and the implementation logic can be in the middleware, and the middleware includes: the voice-frequency voice playing system comprises an audio decoder, a voice separating module, an audio input interface, an original singing volume control module, a merging module, a sound effect enhancing module, a gain control module, a time delay module and an audio output interface. The audio decoder is used for performing audio decoding processing on a signal source input through a broadcast signal, a network, a USB, an HDMI, or the like to obtain audio data. The voice separation module is used for carrying out voice separation on the decoded audio data and separating the original voice audio and the accompaniment audio. The audio input interface is used for receiving singing audio input by a user, and the original singing volume control module determines the size of the original singing audio combined with the accompaniment audio, namely the target vocal audio, according to the singing audio and the separated original vocal audio. The merging module is used for merging the accompaniment audio, the singing audio and the target voice audio to obtain merged audio data. The sound effect enhancing module is used for carrying out sound effect enhancing processing on the combined audio data, and the audio output interface is used for outputting the audio data after the sound effect enhancing processing.
It should be noted that the above implementation logic may also be implemented in the core layer instead of the middleware. Alternatively, it can be split between the middleware and the core layer; for example, the audio decoder and the voice separation module can be implemented in the middleware and the other modules in the core layer.
Fig. 15B is a schematic diagram of an audio processing method according to some embodiments of the present application, corresponding to fig. 15A. After the audio decoder decodes the audio data of a song, original vocal audio data and accompaniment audio data are obtained through voice separation. Meanwhile, the microphone collects the singer's vocal audio data input by the user, and the target vocal audio data can be determined according to the singer's vocal audio data and the original vocal audio data, i.e., how much of the original vocal audio data is merged into the accompaniment audio data. The singer's vocal audio data, the target vocal audio data, and the accompaniment audio data are then merged and output after sound effect enhancement processing.
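As an illustration of this flow, a minimal sketch of one processing period is given below, assuming floating-point sample buffers and treating the voice separation, original-singing volume control, and sound effect enhancement modules as pluggable functions; all names are illustrative and not part of the actual middleware.

```python
import numpy as np

def karaoke_period(decoded_song, mic_frame, separate, original_gain_db, enhance):
    """One processing period of the karaoke path sketched in Figs. 15A/15B."""
    original_vocal, accompaniment = separate(decoded_song)    # voice separation
    g = original_gain_db(original_vocal, mic_frame)           # gain (dB) for the original vocal
    target_vocal = original_vocal * (10.0 ** (g / 20.0))      # scaled original vocal
    merged = accompaniment + target_vocal + mic_frame         # merging module
    return enhance(merged)                                     # sound effect enhancement
```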
Some embodiments of the present application also provide a display device 200, including:
a controller 250 configured to: acquire song audio data, and perform voice separation on the song audio data to obtain original vocal audio data and accompaniment audio data.
The song audio data may be any song, including songs included in the singing APP and songs not included in the singing APP. By separating the voice of the song audio data, for example, the original vocal audio data and the accompaniment audio data can be separated by artificial intelligence technology. It can be seen that for any song, the corresponding accompaniment audio data can be separated.
A controller 250 further configured to: determining an original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period; and performing gain processing on the original vocal audio data in the time period according to the original vocal gain to obtain the target vocal audio data.
In the process of singing, a user sings the song through an audio input interface (such as a microphone), and the singer's vocal audio data is collected. The user may sing off-key or with insufficient pitch accuracy. In addition, since voice separation runs in real time on the main chip of the display device, the separation may not be perfectly clean or may introduce occasional noise. To address this, when the user is not singing or is singing softly, the separated original vocal audio is fully or partially merged into the accompaniment to set the atmosphere of a live performance; when the user is detected to be singing, the original vocal audio can be reduced or muted through the original-vocal volume control so that the user's own voice is played as the main part.
Since each song lasts a relatively long time, the audio data may be processed in preset time periods. That is, the audio data of the respective time periods is processed sequentially in time order. The time period may be, for example, 0.8 seconds or 1 second.
For each time period, the original singing gain can be obtained according to the energy of the original vocal audio data and the energy of the singer's vocal audio data; gain processing is then performed on the original vocal audio data using the original singing gain to obtain the target vocal audio data, i.e., the audio data to be merged into the accompaniment audio data.
In some embodiments, the original singing gain is less than or equal to a preset gain threshold. For example, the preset gain threshold may be 0.1 dB, 0 dB, -0.1 dB, etc. When the preset gain threshold is equal to 0 dB, the original singing gain is less than or equal to 0 dB: if the original singing gain equals 0 dB, all of the original vocal audio data is merged into the accompaniment audio data; if the original singing gain is less than 0 dB, only part of the original vocal audio data is merged into the accompaniment audio data. When the preset gain threshold is less than 0 dB, the original singing gain is also less than 0 dB, and part of the original vocal audio data is merged into the accompaniment audio data. When the preset gain threshold is greater than 0 dB, the original vocal audio data may be enhanced before being merged into the accompaniment audio data.
In some embodiments, the controller 250 is configured to: if the energy of the singer's vocal audio data is smaller than a preset energy threshold (a small energy value), the user can be considered not to be singing, and the original singing gain can be set to the preset gain threshold, for example 0 dB, i.e., the original vocal audio data is used directly as the target vocal audio data. If the energy of the singer's vocal audio data is greater than or equal to the preset energy threshold, the user can be considered to have started singing, and the original singing gain is determined according to the energy ratio between the energy of the singer's vocal audio data and the energy of the original vocal audio data, so that the original singing gain is smaller than the preset gain threshold, i.e., the original vocal audio data is attenuated before being used as the target vocal audio data.
In some embodiments, in order to keep the sound merged into the accompaniment audio data relatively stable rather than varying continuously with the volume of the singer's vocal audio data, a correspondence between the energy ratio (the energy of the singer's vocal audio data divided by the energy of the original vocal audio data) and the original singing gain w may be pre-established, so that the gain takes the same value within a given energy-ratio range. For example, if the energy ratio is less than or equal to 0.25, meaning the singer's vocal energy is small, w = 0 dB and all of the original vocal audio data may be merged into the accompaniment audio data; if the energy ratio is greater than 0.25 and less than 0.75, meaning the singer's vocal energy is moderate, w = -6 dB and part of the original vocal audio data may be merged into the accompaniment audio data; if the energy ratio is greater than or equal to 0.75, meaning the singer's vocal energy is large, the original vocal audio data may be muted entirely and only the singer's vocal audio data is played.
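A minimal sketch of this mapping is given below, assuming energies are computed per time period and treating the mute case as negative infinity in dB; the thresholds follow the example above and the function name is illustrative.

```python
def original_singing_gain_db(singer_energy, original_energy, energy_threshold=1e-4):
    """Map the singer/original energy ratio to the original singing gain w (dB)."""
    if singer_energy < energy_threshold:      # user not singing: keep the original vocal
        return 0.0
    ratio = singer_energy / max(original_energy, 1e-12)
    if ratio <= 0.25:                         # quiet singing: merge the full original vocal
        return 0.0
    if ratio < 0.75:                          # moderate singing: attenuate the original vocal
        return -6.0
    return float("-inf")                      # loud singing: mute the original vocal
```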
The controller 250 is configured to: merge the accompaniment audio data, the target vocal audio data, and the singer's vocal audio data in the time period, and perform sound effect enhancement processing to obtain the target audio data. That is, on the basis of merging the accompaniment audio data and the singer's vocal audio data, the target vocal audio data (all or part of the original vocal audio data) is merged as well, so that the finally output target audio data is richer and sounds better.
An audio output interface 270 configured to: the target audio data is output.
In some embodiments of the present application, accompaniment audio data can be obtained through voice separation for any song, so the user is not limited by media resources when singing. Moreover, whether all or only part of the original vocal audio data is added to the accompaniment audio data can be determined according to the user's singing level, which improves the user's singing experience.
In some embodiments, the controller 250 is further configured to: obtain the original singing gain corresponding to the previous time period. If the original singing gain corresponding to the current time period is the same as that of the previous time period, the energy ratio between the singer's vocal energy and the original vocal energy falls within the same range in both periods, indicating that the user is singing steadily and is familiar with the song. The time period can then be extended to reduce the processing frequency, as long as the extended time period remains smaller than a first time threshold (for example, 2 seconds). That is, the processing frequency of the above process is reduced, instead of frequently re-merging the target vocal audio data obtained from the original vocal audio data into the accompaniment audio data during the singing. Of course, the time period cannot be extended indefinitely, so that an overly long time period does not affect the final singing effect.
If the original singing gain corresponding to the current time period differs from that of the previous time period, the user's volume is changing while singing, the user is not keeping time with the original vocal, and situations such as not knowing the song or singing inaccurately may occur. In this case the time period is shortened, i.e., the target vocal audio data is updated and merged into the accompaniment audio data more quickly, as long as the shortened time period remains greater than a second time threshold (for example, 0.25 seconds), where the first time threshold is greater than the second time threshold.
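The adjustment can be sketched as follows; the doubling/halving step is an assumption, only the two time thresholds come from the text.

```python
def next_time_period(period_s, gain_now_db, gain_prev_db,
                     min_period_s=0.25, max_period_s=2.0, step=2.0):
    """Lengthen the processing period while the original singing gain is stable,
    shorten it when the gain changes, within the two time thresholds."""
    if gain_now_db == gain_prev_db:
        return min(period_s * step, max_period_s)   # stable singing: fewer updates
    return max(period_s / step, min_period_s)       # unstable singing: react faster
```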
Compared with eliminating the original vocal by simply subtracting the left and right channel audio data, the above audio processing can improve the accompaniment effect during singing. However, professional singing APPs contain many professional accompaniment libraries in addition to libraries obtained by subtracting the left and right channels. Such accompaniment libraries are not produced by cancelling the original vocal; instead, the accompaniment is recorded on its own track when the music is recorded. For many songs, in addition to the accompaniment there are also harmony vocals from professional backing singers. Voice separation as described in some embodiments of the present application identifies and removes all voices; although this approximates the effect of a dedicated accompaniment track, the backing singers' harmonies are removed as well, so the remaining accompaniment lacks atmosphere. In addition, voice separation strips out the signal components with vocal characteristics from the original audio signal; however, the human voice and the instruments overlap in the frequency domain, so instrument sound overlapping the voice is stripped out along with it.
To address this problem, the separated original vocal audio data can be transformed to obtain vocal accompaniment audio data, which is merged into the accompaniment in a certain proportion to compensate for the hollow feeling of the accompaniment. The proportion is tied to the energy of the singer's vocal audio data: it becomes larger as that energy increases and smaller as the singing becomes quieter.
In some embodiments, to compensate for the backing vocals that are removed while the voices are separated, the controller 250 is further configured to: generate first vocal accompaniment audio data according to the original vocal audio data in each time period.
As described above, if the energy of the singer's vocal audio data is less than the preset energy threshold, the user has not started singing or is singing very quietly, and the original vocal audio data may be fully merged into the accompaniment audio data. In that case, the first vocal accompaniment audio data need not be generated. Therefore, in some embodiments, the first vocal accompaniment audio data is generated from the original vocal audio data in each time period only when the energy of the singer's vocal audio data is greater than or equal to the preset energy threshold.
In some embodiments, the original vocal audio data may be transformed in the time domain to generate the first vocal accompaniment audio data. The controller 250 is configured to: acquire a plurality of different delays and the gain corresponding to each delay; for each delay, perform delay processing on the original vocal audio data in the time period according to that delay to obtain first delayed audio data; perform gain processing on the first delayed audio data according to the gain corresponding to that delay to obtain second delayed audio data; and merge the plurality of second delayed audio data to obtain the first vocal accompaniment audio data.
Referring to fig. 16, fig. 16 is a schematic diagram of time domain transformation of original vocal audio data according to some embodiments of the present application.
A plurality of different delays and the gain corresponding to each delay are obtained; these may be preset. The delays may be equally spaced, and the gain decreases as the delay increases, so the gains corresponding to the successive delays decrease. For example, T1 = 10 ms, T2 = 20 ms, T3 = 30 ms, ..., and gain 1 = 0 dB, gain 2 = -6 dB, gain 3 = -10 dB, ....
For each delay, the original vocal audio data in the time period is delayed according to that delay to obtain first delayed audio data, and the first delayed audio data is gain-processed according to the gain corresponding to that delay to obtain second delayed audio data. For example, for T1, the original vocal audio data may be delayed by 10 ms to obtain first delayed audio data, which is then gain-processed at 0 dB to obtain second delayed audio data. The second delayed audio data corresponding to T2, T3, ... can be obtained in the same manner.
And then combining the plurality of second delayed audio data to obtain the first vocal accompaniment audio data.
In this way, copies of the original vocal with different delays and different gains are added together to form a reverberation effect similar to that of a room or a stadium. The original vocal then sounds as if many people were singing the song together, giving it the feel of a chorus.
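A minimal sketch of this time-domain transformation is given below, assuming mono floating-point samples at sample rate sr; the delay and gain values follow the example above and would normally be tuning parameters.

```python
import numpy as np

def first_vocal_accompaniment(vocal, sr, delays_ms=(10, 20, 30),
                              gains_db=(0.0, -6.0, -10.0)):
    """Sum several delayed, attenuated copies of the original vocal (Fig. 16)."""
    out = np.zeros(len(vocal), dtype=np.float64)
    for delay_ms, gain_db in zip(delays_ms, gains_db):
        n = int(sr * delay_ms / 1000.0)              # delay in samples
        delayed = np.zeros_like(out)
        if n < len(vocal):
            delayed[n:] = vocal[:len(vocal) - n]     # shift the vocal by the delay
        out += delayed * (10.0 ** (gain_db / 20.0))  # apply the per-tap gain
    return out
```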
In some embodiments, the original vocal audio data may instead be transformed in the frequency domain to generate the first vocal accompaniment audio data. The controller 250 is configured to: determine the register (vocal range) to which the original vocal audio data belongs; and perform pitch-up processing or pitch-down processing on the original vocal audio data according to that register to obtain the first vocal accompaniment audio data. In this way, a vocal accompaniment in a different vocal part from the original vocal can be formed. For example, professional performances have professional backing-vocal teams that do not sing in the same part as the lead vocal, such as a third higher or a third lower.
Referring to fig. 17, fig. 17 is a schematic diagram of frequency-domain transformation of the original vocal audio data according to some embodiments of the present application. Through fundamental-frequency analysis, the register to which the original vocal audio data belongs can be determined. The fundamental-frequency analysis performs an FFT (fast Fourier transform) on the human voice and finds the first peak; the frequency of that peak is the fundamental frequency. The singer's pitch is known from the fundamental frequency; for example, the frequency of middle C, i.e., "do", is 261.6 Hz. From the calculated pitch of the current voice, the frequency corresponding to a rise or fall of a given interval can be computed.
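A minimal sketch of this fundamental-frequency estimation is shown below, assuming a mono frame of samples; using the strongest low-frequency FFT bin is a simplification of "finding the first peak", and the 50 Hz floor is an assumption.

```python
import numpy as np

def fundamental_frequency_hz(frame, sr):
    """Estimate the fundamental frequency of a vocal frame from its FFT spectrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    mask = freqs > 50.0                       # ignore DC and very low frequencies
    peak = np.argmax(np.where(mask, spectrum, 0.0))
    return freqs[peak]                        # e.g. ~261.6 Hz for middle C ("do")
```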
It should be noted that raising or lowering the pitch differs somewhat between registers, which can be seen from the operation itself. For example, the principle of the algorithm for a rise or fall of a third can be illustrated on a piano keyboard. If the current original vocal audio data belongs to the register of middle C, i.e., C4, a rise of a third lands on the white key E4, a total of 4 semitones; that is, the pitch of the current voice is raised by a factor of 2^(4/12). If the pitch of the current original vocal audio data is B3, a rise of a third lands on D4, a total of 3 semitones; that is, the frequency is raised by a factor of 2^(3/12).
In some embodiments of the present application, pitch-up processing or pitch-down processing may be performed on the original vocal audio data according to the singing habits of ordinary singers. Specifically, non-professional singers usually cannot sing low enough in the bass or high enough in the treble. Thus, in some embodiments, to address this problem, the controller 250 is configured to: if the register is a low register, perform pitch-down processing on the original vocal audio data to obtain the first vocal accompaniment audio data; if the register is a high register, perform pitch-up processing on the original vocal audio data to obtain the first vocal accompaniment audio data; if the register is a middle register, perform both pitch-up processing and pitch-down processing on the original vocal audio data to obtain first vocal audio data and second vocal audio data respectively, and take the first vocal audio data and the second vocal audio data together as the first vocal accompaniment audio data.
Specifically, when the original vocal audio data is below a certain low pitch, the pitch-down operation is enabled, and when it is above a certain high pitch, the pitch-up operation is enabled. For example, above C5 the pitch-up operation is enabled: the gain of the pitch-down operation is set to its minimum, i.e., mute, and the gain of the pitch-up operation is set to 0 dB, so the generated first vocal accompaniment audio data contains the pitch-raised audio data. Conversely, below C4 the pitch-down operation is enabled: the gain of the pitch-down operation is set to 0 dB and the gain of the pitch-up operation is set to its minimum, i.e., mute, so the generated first vocal accompaniment audio data contains the pitch-lowered audio data. When the voice is between C4 and C5, the gains of the pitch-up and pitch-down operations can both be -6 dB, so the generated first vocal accompaniment audio data contains both the pitch-raised and the pitch-lowered audio data.
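A sketch of this register-based selection and the semitone frequency ratio is shown below; the C4/C5 boundary frequencies and the use of -inf dB for mute are assumptions consistent with the example above.

```python
def semitone_ratio(semitones):
    """Frequency ratio for a pitch shift of `semitones` semitones,
    e.g. 2**(4/12) for the 4-semitone rise from C4 to E4."""
    return 2.0 ** (semitones / 12.0)

def pitch_shift_gains_db(fundamental_hz, c4_hz=261.6, c5_hz=523.2):
    """Return (pitch_up_gain_db, pitch_down_gain_db) from the vocal register."""
    if fundamental_hz > c5_hz:            # high register: keep only the raised copy
        return 0.0, float("-inf")
    if fundamental_hz < c4_hz:            # low register: keep only the lowered copy
        return float("-inf"), 0.0
    return -6.0, -6.0                     # middle register: mix both copies
```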
It should be noted that if the first vocal accompaniment audio data were merged into the accompaniment audio data according to the energy of the original vocal audio data, the style and tone of the original accompaniment could be affected. The purpose of the vocal accompaniment is to enrich and beautify the vocal when the vocal is present. Therefore, the energy of the vocal accompaniment audio data finally merged into the accompaniment audio data may be kept smaller than the energy of the singer's vocal audio data, for example 12 dB less.
Thus, after generating the first vocal accompaniment audio, the controller 250 is configured to: determining vocal accompaniment gain according to the energy of the voice data of the singing person collected in the time period; wherein, the vocal accompaniment gain is positively correlated with the energy of the vocal voice data collected in the time period; gain processing is carried out on the first vocal accompaniment audio data through vocal accompaniment gain to obtain second vocal accompaniment audio data; wherein the energy of the second vocal accompaniment audio data is less than the energy of the vocal audio data.
It can be understood that the larger the energy of the singer's vocal audio data, the larger the energy of the vocal accompaniment audio data finally merged into the accompaniment audio data may be; therefore, the vocal accompaniment gain is positively correlated with the energy of the singer's vocal audio data collected in the time period. Assuming the energy of the singer's vocal audio data is E, the vocal accompaniment gain m can be calculated as m = E - 12, so that the energy of the second vocal accompaniment audio data obtained with this gain is less than the energy of the singer's vocal audio data. Of course, the method for calculating the vocal accompaniment gain is not limited to this; simple modifications of the above formula are also possible.
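A minimal sketch of one way to realise this is shown below, assuming energies are expressed in dB and using the 12 dB margin from the example above; the exact scaling is an assumption, not the only possible formula.

```python
import numpy as np

def second_vocal_accompaniment(first_accompaniment, singer_frame, margin_db=12.0):
    """Scale the first vocal accompaniment so it tracks the live vocal but stays
    `margin_db` below it (m = E - 12 in the example above)."""
    e_singer_db = 10.0 * np.log10(np.mean(np.square(singer_frame)) + 1e-12)
    e_accomp_db = 10.0 * np.log10(np.mean(np.square(first_accompaniment)) + 1e-12)
    gain_db = (e_singer_db - margin_db) - e_accomp_db   # keep it 12 dB under the vocal
    return first_accompaniment * (10.0 ** (gain_db / 20.0))
```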
The controller 250 is configured to: merge the accompaniment audio data, the second vocal accompaniment audio data, the target vocal audio data, and the singer's vocal audio data in the time period, and perform sound effect enhancement processing to obtain the target audio data.
In this way, on top of the accompaniment audio data, the singer's vocal audio data and the target vocal audio data, the second vocal accompaniment audio data is further added. This avoids the problem that the backing vocals in the song are stripped out during voice separation and the accompaniment sounds poor as a result, so the overall accompaniment effect is improved and the user's singing experience is ultimately enhanced.
Corresponding to the display device embodiments, the present application also provides audio processing methods. It is understood that the methods shown in figs. 18 to 21 may include more or fewer steps in actual implementation, and the order of the steps may differ, as long as the audio processing method provided in the embodiments of the present application is implemented.
Referring to fig. 18, fig. 18 is a flowchart of an audio processing method according to some embodiments of the present application, which may include the following steps:
step S1810, performing sound separation on the acquired first audio data to obtain first target audio data and first background audio data.
Step S1820, perform gain processing on the first target audio data according to the first gain to obtain second target audio data, and perform gain processing on the first background audio data according to the second gain to obtain second background audio data. Wherein the first gain and the second gain are determined according to a sound control mode corresponding to the display device.
Step S1830, the second target audio data and the second background audio data are merged, and sound effect enhancement processing is performed to obtain and output second audio data.
In the audio processing method, after the first target audio data and the first background audio data are separated from the first audio data, gain processing may be performed on the first target audio data according to the first gain to obtain the second target audio data, and on the first background audio data according to the second gain to obtain the second background audio data. The second target audio data and the second background audio data are then merged and subjected to sound effect enhancement processing to obtain and output the second audio data. Because the first gain and the second gain are determined according to the sound control mode corresponding to the display device, applying unequal gains to the first target audio data and the first background audio data before merging them allows the target audio or the background audio to be enhanced according to the user's current viewing needs, which improves the sound effect enhancement.
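A minimal sketch of this flow is shown below, treating the separation and sound effect enhancement modules as pluggable functions on floating-point sample buffers; the function names and dB-based gains are illustrative.

```python
def process_first_audio(first_audio, separate, enhance,
                        target_gain_db, background_gain_db):
    """Steps S1810-S1830: separate, apply the mode-dependent gains, merge, enhance."""
    target, background = separate(first_audio)                      # S1810
    target2 = target * (10.0 ** (target_gain_db / 20.0))            # S1820
    background2 = background * (10.0 ** (background_gain_db / 20.0))
    return enhance(target2 + background2)                            # S1830
```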
In some embodiments, the audio processing method further includes:
determining the type of a sound effect enhancement mode corresponding to the first audio data according to the sound control mode;
according to the sound control mode, a first gain and a second gain corresponding to the type of the sound-effect enhancement mode are determined.
In some embodiments, the display device corresponds to a plurality of preset sound definition control modes and/or a plurality of preset sound effect modes; each preset sound definition control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value;
the sound control mode includes: a target sound definition control mode and/or a target sound effect mode; the target sound definition control mode is one of a plurality of preset sound definition control modes, and the target sound effect mode is one of a plurality of preset sound effect modes;
determining the type of the sound effect enhancement mode corresponding to the first audio data according to the sound control mode, wherein the type comprises the following steps:
determining the type of a sound effect enhancement mode corresponding to the first audio data according to a first numerical value corresponding to the target sound definition control mode and/or a second numerical value corresponding to the target sound effect mode;
determining a first gain and a second gain corresponding to the type of the sound-effect enhancement mode according to the sound control mode, including:
and determining a first gain and a second gain corresponding to the type of the sound-effect enhancement mode according to the first numerical value and/or the second numerical value.
In some embodiments, determining the first gain and the second gain corresponding to the type of the sound-enhancement mode according to the sound control mode includes:
if the type of the sound effect enhancement mode corresponding to the first audio data is the sound enhancement mode, the first gain is larger than the second gain;
if the type of the sound effect enhancement mode corresponding to the first audio data is the background enhancement mode, the first gain is smaller than the second gain.
In some embodiments, the first audio data includes at least one third target audio data belonging to a preset sound type;
the audio processing method further comprises:
separating at least one third target audio data and third background audio data from the first audio data;
acquiring a first energy value of first channel initial target audio data and a second energy value of second channel initial target audio data of single third target audio data;
performing gain processing on the initial target audio data of the first sound channel according to the third gain to obtain first gain audio data of the first sound channel; performing gain processing on the initial target audio data of the second channel according to the fourth gain to obtain first gain audio data of the second channel; wherein the third gain and the fourth gain are determined based on the first energy value and the second energy value;
combining the first channel initial background audio data of the first channel first gain audio data and the third background audio data, and performing sound effect enhancement processing to obtain and output first channel first enhanced audio data;
and merging the second channel initial background audio data of the second channel first gain audio data and the third background audio data, and performing sound effect enhancement processing to obtain and output second channel first enhanced audio data.
In some embodiments, the audio processing method further includes:
determining a fifth gain and a sixth gain corresponding to a single third target audio data according to the sound control mode, the first energy value and the second energy value;
determining a seventh gain according to the sound control mode;
performing gain processing on the initial target audio data of the first sound channel according to the fifth gain to obtain second gain audio data of the first sound channel; performing gain processing on the second channel initial target audio data according to the sixth gain to obtain second channel second gain audio data;
respectively carrying out gain processing on the first channel initial background audio data and the second channel initial background audio data according to a seventh gain to obtain first channel gain background audio data and second channel gain background audio data;
combining the first channel second gain audio data with the first channel gain background audio data, and performing sound effect enhancement processing to obtain and output first channel second enhancement audio data;
and combining the second channel second gain audio data and the second channel gain background audio data, and performing sound effect enhancement processing to obtain and output second channel second enhanced audio data.
In some embodiments, determining a fifth gain and a sixth gain corresponding to a single third target audio data according to the sound control mode, the first energy value, and the second energy value includes:
determining the type of a sound effect enhancement mode corresponding to the first audio data according to the sound control mode;
determining the relation of the left and right channel energy according to a first energy value of the first channel initial target audio data and a second energy value of the second channel initial target audio data;
determining a fifth gain and a sixth gain corresponding to the type of the sound effect enhancement mode and the relation between the left channel energy and the right channel energy according to the sound control mode, the first energy value and the second energy value;
determining a seventh gain according to the sound control mode, comprising:
and determining a seventh gain corresponding to the type of the sound effect enhancement mode and the relationship between the left and right channel energies according to the sound control mode.
Referring to fig. 19, fig. 19 is a flowchart of an audio processing method according to some embodiments of the present application, which may include the following steps:
step S1910, sound separation and sound effect enhancement are performed on the acquired first audio data, respectively, to obtain first target audio data and second audio data.
Step S1920, performing gain processing on the first target audio data according to the first gain to obtain second target audio data, and performing gain processing on the second audio data according to the second gain to obtain third audio data, where the first gain and the second gain are determined according to a sound control mode corresponding to the display device.
Step S1930, performing delay processing on the second target audio data or the third audio data to synchronize the second target audio data and the third audio data.
Step S1940, the second target audio data and the third audio data are combined, and fourth audio data is obtained and output.
In the audio processing method of some embodiments of the present application, since the sound separation algorithm only separates the target sound and does not also separate the background sound, the time consumed by the sound separation can be reduced by about half. Moreover, the sound separation and the sound effect enhancement can be processed in parallel instead of in series, which further shortens the time consumed by the whole audio processing flow and improves audio-video synchronization. In addition, the delay processing of the second target audio data or the third audio data can be applied to whichever of the sound-effect-enhancement branch and the sound-separation branch takes less time, so that the second target audio data and the third audio data are synchronized before being merged, avoiding an echo problem; the sound effect enhancement is therefore not degraded while the audio-video synchronization is improved.
In some embodiments, the delaying the second target audio data or the third audio data comprises:
acquiring a first time length consumed during sound separation and a second time length consumed during sound effect enhancement processing;
and carrying out time delay processing on the second target audio data or the third audio data according to the first time length and the second time length.
In some embodiments, the delaying the second target audio data or the third audio data comprises:
determining a time difference between the first target audio data and the second audio data according to a correlation between the first target audio data and the second audio data;
and performing time delay processing on the second target audio data or the third audio data according to the time difference.
In some embodiments, determining a time difference between the first target audio data and the second audio data based on a correlation between the first target audio data and the second audio data comprises:
acquiring a first audio segment of first target audio data in a time period t;
acquiring a second audio segment of the second audio data in the time period t, a plurality of third audio segments before the second audio segment, and a plurality of fourth audio segments after the second audio segment; the time length of each third audio segment and each fourth audio segment is equal to the time length of the time period t;
determining the correlation between the first audio segment and the second audio segment, the correlation between the first audio segment and the third audio segment and the correlation between the first audio segment and the fourth audio segment, and determining the audio segment with the highest correlation;
the time difference between the audio piece with the highest correlation and the first audio piece is determined as the time difference between the first target audio data and the second audio data.
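A minimal sketch of this correlation search is shown below, assuming mono sample arrays and a normalised dot product as the correlation measure; the search width and naming are illustrative.

```python
import numpy as np

def time_difference_samples(target, enhanced, start, seg_len, search=5):
    """Find the segment of `enhanced` best correlated with the `target` segment
    at `start`, among the aligned segment and `search` neighbours on each side.
    Returns the offset in samples of the best-matching segment."""
    ref = target[start:start + seg_len]
    best_shift, best_corr = 0, -np.inf
    for k in range(-search, search + 1):
        s = start + k * seg_len
        if s < 0 or s + seg_len > len(enhanced):
            continue
        cand = enhanced[s:s + seg_len]
        corr = np.dot(ref, cand) / (np.linalg.norm(ref) * np.linalg.norm(cand) + 1e-12)
        if corr > best_corr:
            best_corr, best_shift = corr, k * seg_len
    return best_shift
```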
In some embodiments, the first audio data comprises first channel initial audio data and second channel initial audio data;
performing sound effect enhancement processing on the first audio data to obtain second audio data includes:
respectively performing sound effect enhancement processing on the first channel initial audio data and the second channel initial audio data to obtain first channel sound effect enhanced audio data and second channel sound effect enhanced audio data;
performing gain processing on the second audio data according to the second gain to obtain third audio data, including:
respectively carrying out gain processing on the first channel sound effect enhanced audio data and the second channel sound effect enhanced audio data according to the second gain to obtain first channel target audio data and second channel target audio data;
delaying the second target audio data or the third audio data to synchronize the second target audio data and the third audio data, including:
carrying out time delay processing on the second target audio data or the first channel target audio data so as to synchronize the second target audio data with the first channel target audio data; performing time delay processing on the second target audio data or the second channel target audio data to synchronize the second target audio data with the second channel target audio data;
merging the second target audio data and the third audio data to obtain fourth audio data, including:
and respectively merging the second target audio data with the first channel target audio data and the second channel target audio data to obtain first channel merged audio data and second channel merged audio data.
In some embodiments, the display device corresponds to a plurality of preset sound definition control modes and/or a plurality of preset sound effect modes; each preset sound definition control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value;
the sound control mode includes: a target sound definition control mode and/or a target sound effect mode; the target sound definition control mode is one of a plurality of preset sound definition control modes, and the target sound effect mode is one of a plurality of preset sound effect modes; the audio processing method further comprises:
and determining a first gain and a second gain according to a first value corresponding to the target sound definition control mode and/or a second value corresponding to the target sound effect mode, wherein the first gain is greater than the second gain.
In some embodiments, determining the first gain and the second gain according to the first value corresponding to the target sound definition control mode and/or the second value corresponding to the target sound effect mode includes:
setting the first gain to 0 dB;
and determining a second gain according to a first value corresponding to the target sound definition control mode and/or a second value corresponding to the target sound effect mode, so that the second gain is smaller than 0 dB.
Referring to fig. 20, fig. 20 is a flowchart of an audio processing method applied to a display device according to some embodiments of the present application, which may include the following steps:
step S2010, performing voice separation on the acquired first channel audio data and second channel audio data respectively to obtain first channel first-person voice audio data and first channel first background audio data, and second channel first-person voice audio data and second channel first background audio data.
In step S2020, the first-channel first-person audio data and the second-channel first-person audio data are merged to obtain target person audio data.
Step S2030, obtaining image data of a moment when the first channel audio data and the second channel audio data are located, performing lip movement detection on the image data, and if a lip movement coordinate in the screen of the display device is detected, determining human voice weights corresponding to the plurality of audio output interfaces according to the lip movement coordinate and coordinates of the plurality of audio output interfaces of the display device.
Step S2040, for each audio output interface, determining, according to the coordinates of the audio output interface, that the audio output interface corresponds to the first channel first background audio data and/or the second channel first background audio data.
Step S2050 is to combine the product of the target human voice audio data and the human voice weight corresponding to the audio output interface, and the first background audio data of the first channel and/or the first background audio data of the second channel corresponding to the audio output interface, perform sound effect enhancement processing to obtain audio data corresponding to the audio output interface, and output the audio data through the audio output interface.
In the audio processing method according to some embodiments of the present application, in a stereo scene, after voice separation is performed on the first channel audio data and the second channel audio data, the separated first channel first person voice audio data and second channel first person voice audio data may first be merged to obtain the target human voice audio data, which is used as the human voice audio to be output. Then, the human voice weight corresponding to each audio output interface, i.e., the weight with which the human voice audio is output, is adjusted according to the speaking position of the person in the image, and the weight of the background audio output by each audio output interface is adjusted according to the position of that audio output interface, thereby enhancing the stereoscopic impression of the sound and improving the user's viewing experience.
In some embodiments, the audio processing method further includes:
respectively performing gain processing on first channel first person audio data and second channel first person audio data according to the first gain to obtain first channel second person audio data and second channel second person audio data;
respectively carrying out gain processing on the first channel first background audio data and the second channel first background audio data according to a second gain to obtain first channel second background audio data and second channel second background audio data; wherein, the first gain and the second gain are determined according to a sound control mode corresponding to the display device;
merging the first-person audio data of the first sound channel and the first-person audio data of the second sound channel to obtain target person sound audio data, wherein the merging step comprises the following steps:
merging the first channel second voice audio data and the second channel second voice audio data to obtain target voice audio data;
for each audio output interface, determining the first channel first background audio data and/or the second channel first background audio data corresponding to the audio output interface according to the coordinates of the audio output interface, including:
for each audio output interface, determining second background audio data of a first channel and/or second background audio data of a second channel corresponding to the audio output interface according to the coordinates of the audio output interface;
combining the product of the target human voice audio data and the human voice weight corresponding to the audio output interface, and the first background audio data of the first sound channel and/or the first background audio data of the second sound channel corresponding to the audio output interface, and performing sound effect enhancement processing to obtain the audio data corresponding to the audio output interface, including:
and combining the product of the target human voice audio data and the human voice weight corresponding to the audio output interface, and the first channel second background audio data and/or the second channel second background audio data corresponding to the audio output interface, and performing sound effect enhancement processing to obtain the audio data corresponding to the audio output interface.
In some embodiments, the sound effect processing method further includes:
and if the lip movement coordinate is not detected, determining the human sound weights respectively corresponding to the audio output interfaces according to the ratio of the energy of the first-person audio data of the first sound channel to the energy of the first-person audio data of the second sound channel and the coordinate of the audio output interface aiming at each audio output interface.
In some embodiments, the screen comprises: a left region, a middle region, and a right region; determining the audio output interface to correspond to the first channel first background audio data and/or the second channel first background audio data according to the coordinates of the audio output interface, including:
if the coordinates of the audio output interface correspond to the left area, determining that the audio output interface corresponds to first background audio data of a first sound channel;
if the coordinates of the audio output interface correspond to the right area, determining that the audio output interface corresponds to second channel first background audio data;
and if the coordinates of the audio output interface correspond to the middle area, determining that the audio output interface corresponds to the first channel first background audio data and the second channel first background audio data.
In some embodiments, the screen comprises: a middle region and a non-middle region; according to the lip movement coordinate and the coordinate of a plurality of audio output interfaces of the display device, determining the human voice weight corresponding to the audio output interfaces respectively, comprising the following steps:
if the lip motion coordinate is located in the non-middle area, determining the human voice weights respectively corresponding to the audio output interfaces according to the lip motion coordinate and the coordinates of the audio output interfaces;
and if the lip movement coordinate is positioned in the middle area, determining the human voice weights respectively corresponding to the plurality of audio output interfaces according to the coordinate of the plurality of audio output interfaces and the attribute information of the plurality of audio output interfaces, wherein the attribute information comprises the volume and/or the orientation.
In some embodiments, for each audio output interface, determining a corresponding area of the audio output interface in the screen according to the coordinates of the audio output interface;
if the lip motion coordinate is located in the area corresponding to the audio output interface, determining the human voice weight corresponding to the audio output interface as a first numerical value;
and if the lip motion coordinate is positioned outside the area corresponding to the audio output interface, determining the human voice weight corresponding to the audio output interface as a second numerical value, wherein the second numerical value is smaller than the first numerical value.
In some embodiments, the display device corresponds to a plurality of preset sound definition control modes and/or a plurality of preset sound effect modes; each preset sound definition control mode has a corresponding numerical value, and each preset sound effect mode has a corresponding numerical value;
the sound control mode includes: a target sound definition control mode and/or a target sound effect mode; the target sound definition control mode is one of a plurality of preset sound definition control modes, and the target sound effect mode is one of a plurality of preset sound effect modes; the audio processing method further comprises:
and determining a first gain and a second gain according to a first numerical value corresponding to the target sound definition control mode and/or a second numerical value corresponding to the target sound effect mode.
Some embodiments of the present application further provide an audio processing method that, by separating the human voice, allows singing without being limited by media resources. Meanwhile, the original vocal can be fully or partially added to the accompaniment according to the energy of the singer's voice collected by the microphone, so that a singer with a modest singing level is not left with a poor singing experience.
Referring to fig. 21, fig. 21 is a flowchart of an audio processing method applied to a display device according to some embodiments of the present application, which may include the following steps:
step S2110, song audio data are obtained, and voice separation is carried out on the song audio data to obtain original singing voice audio data and accompaniment audio data.
Step S2120, determining original singing gain according to energy of the original singing voice audio data in each time period and energy of the singing voice audio data collected in the time period, and performing gain processing on the original singing voice audio data in the time period according to the original singing gain to obtain target voice audio data.
Step S2130, merging the accompaniment audio data, the target vocal audio data, and the singing vocal audio data in each time period, and performing sound effect enhancement processing to obtain and output the target audio data.
According to the sound effect processing method of some embodiments of the application, original singing voice audio data and accompaniment audio data can be obtained through voice separation performed on the song audio data. Thus, any song can be sung with this method, even songs not included in the singing APP. The original singing gain is determined according to the energy of the singing voice audio data collected in real time and the energy of the original singing voice audio data, and gain processing is performed on the original singing voice audio data according to the original singing gain to obtain the target voice audio data. Because the original singing gain is determined from the energy of the singing voice audio data and the energy of the original singing voice audio data, merging the target voice audio data into the accompaniment audio data means that the original singing voice is merged into the accompaniment according to the user's actual singing: for example, all of the original singing voice audio data may be merged into the accompaniment audio data, or only part of it may be merged, thereby improving the accompaniment effect while the user sings and improving the user experience.
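As a non-limiting illustration, the three steps above can be sketched in Python as follows. Here separate_vocals is a placeholder for any voice-separation algorithm, the sound effect enhancement is omitted, and the inline gain rule is only one assumed choice; a more detailed gain sketch follows the gain embodiments below.

import numpy as np

def frame_energy(frame):
    # Mean-square energy of one audio frame.
    return float(np.mean(np.square(frame))) if len(frame) else 0.0

def separate_vocals(song_audio):
    # Placeholder for a real vocal/accompaniment separation algorithm; dummy
    # outputs are returned only so that this sketch runs on its own.
    return 0.5 * song_audio, 0.5 * song_audio

def process_song(song_audio, mic_audio, frame_len, gain_threshold=1.0, energy_threshold=1e-4):
    song_audio = np.asarray(song_audio, dtype=float)
    mic_audio = np.asarray(mic_audio, dtype=float)
    original_vocal, accompaniment = separate_vocals(song_audio)          # step S2110
    out = []
    for start in range(0, len(original_vocal), frame_len):               # one time period per iteration
        ov = original_vocal[start:start + frame_len]
        ac = accompaniment[start:start + frame_len]
        mv = mic_audio[start:start + frame_len]
        if len(mv) < len(ov):                                            # pad a short microphone frame
            mv = np.pad(mv, (0, len(ov) - len(mv)))
        if frame_energy(mv) < energy_threshold:                          # step S2120: original singing gain
            gain = gain_threshold
        else:
            gain = min(gain_threshold, frame_energy(ov) / (frame_energy(ov) + frame_energy(mv)))
        target_vocal = gain * ov                                         # gain processing
        out.append(ac + target_vocal + mv)                               # step S2130: merge (enhancement omitted)
    return np.concatenate(out) if out else np.zeros(0)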
In some embodiments, the original singing gain is less than or equal to a preset gain threshold.
In some embodiments, determining the original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period includes:
if the energy of the singing voice audio data is smaller than a preset energy threshold, setting the original singing gain to the preset gain threshold;
and if the energy of the singing voice audio data is greater than or equal to the preset energy threshold, determining the original singing gain according to the energy ratio between the energy of the singing voice audio data and the energy of the original singing voice audio data, so that the original singing gain is smaller than the preset gain threshold.
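A sketch of this gain rule is given below. The embodiment does not fix the exact mapping from the energy ratio to the gain, so the ratio-based formula and the two threshold values used here are merely assumed choices.

import numpy as np

PRESET_GAIN_THRESHOLD = 1.0      # hypothetical preset gain threshold
PRESET_ENERGY_THRESHOLD = 1e-4   # hypothetical preset energy threshold

def determine_original_gain(original_vocal_frame, singer_vocal_frame):
    singer_energy = float(np.mean(np.square(singer_vocal_frame)))
    original_energy = float(np.mean(np.square(original_vocal_frame)))
    if singer_energy < PRESET_ENERGY_THRESHOLD:
        # The singer is (nearly) silent: output the original vocal at the preset gain threshold.
        return PRESET_GAIN_THRESHOLD
    # The singer is audible: derive the gain from the energy ratio and keep it
    # strictly below the preset gain threshold.
    ratio = original_energy / (original_energy + singer_energy)
    return min(ratio, 0.99 * PRESET_GAIN_THRESHOLD)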
In some embodiments, the sound effect processing method further includes:
obtaining an original singing gain corresponding to a previous time period, and if the original singing gain corresponding to the current time period is the same as the original singing gain corresponding to the previous time period, extending the time period, the extended time period remaining smaller than a first time threshold;
if the original singing gain corresponding to the current time period is different from the original singing gain corresponding to the previous time period, shortening the time period, the shortened time period remaining greater than a second time threshold, wherein the first time threshold is greater than the second time threshold.
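One possible realisation of this adaptive time period is sketched below; the step size and both time thresholds are hypothetical values.

FIRST_TIME_THRESHOLD = 1.0    # seconds, hypothetical upper bound for the time period
SECOND_TIME_THRESHOLD = 0.1   # seconds, hypothetical lower bound for the time period

def adapt_time_period(period, current_gain, previous_gain, step=0.05):
    if current_gain == previous_gain:
        # Stable gain: extend the period as long as it stays below the first time threshold.
        return period + step if period + step < FIRST_TIME_THRESHOLD else period
    # Changing gain: shorten the period as long as it stays above the second time threshold.
    return period - step if period - step > SECOND_TIME_THRESHOLD else period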
In some embodiments, the sound effect processing method further includes:
generating first vocal accompaniment audio data according to the original singing voice audio data in each time period;
determining a vocal accompaniment gain according to the energy of the singing voice audio data collected in the time period; wherein the vocal accompaniment gain is positively correlated with the energy of the singing voice audio data collected in the time period;
performing gain processing on the first vocal accompaniment audio data with the vocal accompaniment gain to obtain second vocal accompaniment audio data; wherein the energy of the second vocal accompaniment audio data is less than the energy of the singing voice audio data;
the step of merging the accompaniment audio data, the target voice audio data, and the singing voice audio data in the time period and performing sound effect enhancement processing to obtain the target audio data specifically includes:
merging the accompaniment audio data, the second vocal accompaniment audio data, the target voice audio data, and the singing voice audio data in the time period, and performing sound effect enhancement processing to obtain the target audio data.
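A minimal sketch of the vocal accompaniment gain and of the extended merge is given below; the scaling constants and the particular way the gain grows with the singer's energy are assumptions made only for illustration.

import numpy as np

def vocal_accompaniment_gain(singer_vocal_frame, max_gain=0.5, ref_energy=1e-3):
    # Positively correlated with the singer's energy, and capped so that the
    # harmony remains quieter than the singer's own voice.
    singer_energy = float(np.mean(np.square(singer_vocal_frame)))
    return max_gain * min(1.0, singer_energy / (singer_energy + ref_energy))

def merge_with_harmony(accompaniment, first_harmony, target_vocal, singer_vocal):
    gain = vocal_accompaniment_gain(singer_vocal)
    second_harmony = gain * first_harmony    # second vocal accompaniment audio data
    # Sound effect enhancement is omitted from this sketch.
    return accompaniment + second_harmony + target_vocal + singer_vocal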
In some embodiments, generating the first vocal accompaniment audio data from the original singing voice audio data in each time period comprises:
acquiring a plurality of different delays and a gain corresponding to each delay;
for each delay, performing delay processing on the original singing voice audio data in each time period according to the delay to obtain first delayed audio data;
performing gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data;
and merging the plurality of second delayed audio data to obtain the first vocal accompaniment audio data.
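The delay-and-gain harmony generation above can be sketched as follows; the delay values, the per-delay gains, and the sample rate are hypothetical examples.

import numpy as np

DELAYS_MS = [30, 60, 90]        # hypothetical delays
DELAY_GAINS = [0.6, 0.4, 0.25]  # hypothetical gain for each delay

def first_vocal_accompaniment_by_delay(original_vocal_frame, sample_rate=48000):
    frame = np.asarray(original_vocal_frame, dtype=float)
    harmony = np.zeros_like(frame)
    for delay_ms, gain in zip(DELAYS_MS, DELAY_GAINS):
        shift = int(sample_rate * delay_ms / 1000)
        if shift >= len(frame):
            continue
        delayed = np.zeros_like(frame)                 # first delayed audio data
        delayed[shift:] = frame[:len(frame) - shift]
        harmony += gain * delayed                      # gain processing yields the second delayed audio data
    return harmony                                     # merged second delayed data = first vocal accompaniment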
In some embodiments, generating the first vocal accompaniment audio data from the original singing voice audio data in each time period comprises:
determining a sound zone to which the original singing voice audio data belongs;
and performing tone-rising processing or tone-falling processing on the original singing voice audio data according to the sound zone to obtain the first vocal accompaniment audio data.
In some embodiments, performing the tone-rising processing or the tone-falling processing on the original singing voice audio data according to the sound zone includes:
if the sound zone is a low sound zone, performing tone-falling processing on the original singing voice audio data to obtain the first vocal accompaniment audio data;
if the sound zone is a high sound zone, performing tone-rising processing on the original singing voice audio data to obtain the first vocal accompaniment audio data;
if the sound zone is a middle sound zone, performing tone-rising processing and tone-falling processing on the original singing voice audio data to obtain first vocal audio data and second vocal audio data respectively;
and using the first vocal audio data and the second vocal audio data as the first vocal accompaniment audio data.
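A sketch of the register-dependent pitch shifting is given below. The register boundaries, the one-octave shift, and the naive resampling-based shift (which also changes the frame duration) are illustrative assumptions; a practical system would use a duration-preserving pitch-shift algorithm and a pitch detector for the sound zone decision.

import numpy as np

LOW_REGISTER_HZ = 180    # hypothetical upper bound of the low sound zone
HIGH_REGISTER_HZ = 320   # hypothetical lower bound of the high sound zone

def naive_pitch_shift(frame, semitones):
    # Crude pitch shift by resampling; for illustration only.
    factor = 2.0 ** (semitones / 12.0)
    frame = np.asarray(frame, dtype=float)
    positions = np.arange(0, len(frame), factor)
    return np.interp(positions, np.arange(len(frame)), frame)

def first_vocal_accompaniment_by_register(original_vocal_frame, fundamental_hz):
    if fundamental_hz < LOW_REGISTER_HZ:                        # low sound zone: tone-falling
        return [naive_pitch_shift(original_vocal_frame, -12)]
    if fundamental_hz > HIGH_REGISTER_HZ:                       # high sound zone: tone-rising
        return [naive_pitch_shift(original_vocal_frame, +12)]
    return [naive_pitch_shift(original_vocal_frame, +12),       # middle sound zone: both shifts
            naive_pitch_shift(original_vocal_frame, -12)]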
The specific details of each step of the above method have been described in detail in the corresponding display device embodiments, and therefore are not repeated here.
Some embodiments of the present application further provide a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, each process of the above audio processing method is implemented and the same technical effect can be achieved, and to avoid repetition, the details are not repeated here.
The computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
The present application provides a computer program product comprising a computer program which, when run on a computer, causes the computer to implement the audio processing method described above.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the foregoing discussion is not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments, with various modifications as are suited to the particular use contemplated.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and these modifications or substitutions do not depart from the spirit of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A display device, comprising:
a controller configured to: acquiring song audio data, and performing voice separation on the song audio data to obtain original singing voice audio data and accompaniment audio data;
the controller further configured to: determining an original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period;
according to the original singing gain, gain processing is carried out on the original singing voice audio data in the time period to obtain target voice audio data;
the accompaniment audio data, the target voice audio data and the singing voice audio data in the time period are combined, and sound effect enhancement processing is carried out to obtain target audio data;
an audio output interface configured to: and outputting the target audio data.
2. The display device according to claim 1, wherein the original singing gain is less than or equal to a preset gain threshold.
3. The display device according to claim 2, wherein the controller is configured to: if the energy of the singing voice audio data is smaller than a preset energy threshold, setting the original singing gain to the preset gain threshold;
and if the energy of the singing voice audio data is greater than or equal to the preset energy threshold, determining the original singing gain according to the energy ratio between the energy of the singing voice audio data and the energy of the original singing voice audio data, so that the original singing gain is smaller than the preset gain threshold.
4. The display device of claim 1, wherein the controller is further configured to:
acquiring an original singing gain corresponding to a previous time period, and if the original singing gain corresponding to the current time period is the same as the original singing gain corresponding to the previous time period, extending the time period, the extended time period remaining smaller than a first time threshold;
if the original singing gain corresponding to the current time period is different from the original singing gain corresponding to the previous time period, shortening the time period, the shortened time period remaining greater than a second time threshold, wherein the first time threshold is greater than the second time threshold.
5. The display device according to claim 1, wherein the controller is further configured to:
generating first vocal accompaniment audio data according to the original vocal voice audio data in each time period;
determining a vocal accompaniment gain according to the energy of the singing voice audio data collected in the time period; wherein the vocal accompaniment gain is positively correlated with the energy of the singing voice audio data collected in the time period;
performing gain processing on the first vocal accompaniment audio data with the vocal accompaniment gain to obtain second vocal accompaniment audio data; wherein the energy of the second vocal accompaniment audio data is less than the energy of the singing voice audio data;
the controller configured to: combining the accompaniment audio data, the second vocal accompaniment audio data, the target voice audio data, and the singing voice audio data in the time period, and performing sound effect enhancement processing to obtain the target audio data.
6. The display device according to claim 5, wherein the controller is configured to: acquiring a plurality of different delays and gains corresponding to the delays;
for each delay, performing delay processing on the original singing voice audio data in each time period according to the delay to obtain first delayed audio data;
performing gain processing on the first delayed audio data according to the gain corresponding to the delay to obtain second delayed audio data;
and combining the plurality of second delayed audio data to obtain the first vocal accompaniment audio data.
7. The display device according to claim 5, wherein the controller is configured to: determining a sound zone to which the original vocal voice audio data belongs;
and performing tone-rising processing or tone-falling processing on the original vocal voice audio data according to the sound zone to obtain first vocal accompaniment audio data.
8. The display device according to claim 7, wherein the controller is configured to: if the sound zone is a low sound zone, performing tone-falling processing on the original vocal voice audio data to obtain first vocal accompaniment audio data;
if the sound zone is a high sound zone, performing tone-rising processing on the original vocal voice audio data to obtain first vocal accompaniment audio data;
if the sound zone is a middle sound zone, performing tone-rising processing and tone-falling processing on the original vocal voice audio data to obtain first vocal audio data and second vocal audio data respectively;
and using the first vocal audio data and the second vocal audio data as the first vocal accompaniment audio data.
9. A method of audio processing, the method comprising:
acquiring song audio data, and performing voice separation on the song audio data to obtain original singing voice audio data and accompaniment audio data;
determining an original singing gain according to the energy of the original singing voice audio data in each time period and the energy of the singing voice audio data collected in the time period;
according to the original singing gain, gain processing is carried out on the original singing voice audio data in the time period to obtain target voice audio data;
and combining the accompaniment audio data, the target voice audio data and the singing voice audio data in the time period, and performing sound effect enhancement processing to obtain and output the target audio data.
10. The method of claim 9, further comprising:
generating first vocal accompaniment audio data according to the original singing voice audio data in each time period; wherein the energy of the first vocal accompaniment audio data is less than that of the original singing voice audio data;
determining a vocal accompaniment gain according to the energy of the singing voice audio data collected in the time period; wherein the vocal accompaniment gain is positively correlated with the energy of the singing voice audio data collected in the time period;
performing gain processing on the first vocal accompaniment audio data with the vocal accompaniment gain to obtain second vocal accompaniment audio data;
wherein the step of combining the accompaniment audio data, the target voice audio data, and the singing voice audio data in the time period and performing sound effect enhancement processing to obtain the target audio data specifically comprises:
combining the accompaniment audio data, the second vocal accompaniment audio data, the target voice audio data, and the singing voice audio data in the time period, and performing sound effect enhancement processing to obtain the target audio data.
CN202210102840.3A 2022-01-27 2022-01-27 Display device and audio processing method Pending CN114466241A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210102840.3A CN114466241A (en) 2022-01-27 2022-01-27 Display device and audio processing method
PCT/CN2022/101859 WO2023142363A1 (en) 2022-01-27 2022-06-28 Display device and audio processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210102840.3A CN114466241A (en) 2022-01-27 2022-01-27 Display device and audio processing method

Publications (1)

Publication Number Publication Date
CN114466241A (en) 2022-05-10

Family

ID=81410848

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210102840.3A Pending CN114466241A (en) 2022-01-27 2022-01-27 Display device and audio processing method

Country Status (1)

Country Link
CN (1) CN114466241A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015060203A (en) * 2013-09-20 2015-03-30 Casio Computer Co., Ltd. Device, method, and program for synchronizing musical performance data and audio data
CN107093419A (en) * 2016-02-17 2017-08-25 Guangzhou Kugou Computer Technology Co., Ltd. A kind of dynamic vocal accompaniment method and apparatus
CN107705778A (en) * 2017-08-23 2018-02-16 Tencent Music Entertainment (Shenzhen) Co., Ltd. Audio-frequency processing method, device, storage medium and terminal
CN109743461A (en) * 2019-01-29 2019-05-10 Guangzhou Kugou Computer Technology Co., Ltd. Audio data processing method, device, terminal and storage medium
CN110688082A (en) * 2019-10-10 2020-01-14 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Method, device, equipment and storage medium for determining adjustment proportion information of volume
CN110856009A (en) * 2019-11-27 2020-02-28 Guangzhou Huaduo Network Technology Co., Ltd. Network karaoke system, audio and video playing method of network karaoke and related equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142363A1 (en) * 2022-01-27 2023-08-03 Hisense Visual Technology Co., Ltd. Display device and audio processing method

Similar Documents

Publication Publication Date Title
US11363314B2 (en) Network-based processing and distribution of multimedia content of a live musical performance
US10790919B1 (en) Personalized real-time audio generation based on user physiological response
KR101958664B1 (en) Method and apparatus for providing various audio environment in multimedia contents playback system
EP3108672B1 (en) Content-aware audio modes
US11011187B2 (en) Apparatus for generating relations between feature amounts of audio and scene types and method therefor
CN114615534A (en) Display device and audio processing method
CN114466242A (en) Display device and audio processing method
EP3539015B1 (en) Methods, systems, and media for modifying the presentation of video content on a user device based on a consumption of the user device
JP2015518182A (en) Method and apparatus for 3D audio playback independent of layout and format
US20210327458A1 (en) Apparatus That Identifies A Scene Type and Method for Identifying a Scene Type
WO2018017878A1 (en) Network-based processing and distribution of multimedia content of a live musical performance
US10154346B2 (en) Dynamically adjust audio attributes based on individual speaking characteristics
CN114598917B (en) Display device and audio processing method
CN114466241A (en) Display device and audio processing method
JP6568351B2 (en) Karaoke system, program and karaoke audio playback method
WO2023142363A1 (en) Display device and audio processing method
KR102155743B1 (en) System for contents volume control applying representative volume and method thereof
US20130022204A1 (en) Location detection using surround sound setup
Francombe et al. Determination and validation of mix parameters for modifying envelopment in object-based audio
WO2023162508A1 (en) Signal processing device, and signal processing method
US20230042477A1 (en) Reproduction control method, control system, and program
US20240155192A1 (en) Control device, control method, and recording medium
JP6474292B2 (en) Karaoke equipment
WO2024004651A1 (en) Audio playback device, audio playback method, and audio playback program
WO2023062865A1 (en) Information processing apparatus, method, and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination