WO2024077452A1 - Audio processing method, apparatus, device and storage medium - Google Patents

Audio processing method, apparatus, device and storage medium

Info

Publication number
WO2024077452A1
WO2024077452A1 (PCT/CN2022/124432)
Authority
WO
WIPO (PCT)
Prior art keywords
loudness
frequency domain
time window
frequency
fade
Prior art date
Application number
PCT/CN2022/124432
Other languages
English (en)
French (fr)
Inventor
杨亚斌
党正军
漆原
刘佳泽
Original Assignee
广州酷狗计算机科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州酷狗计算机科技有限公司 filed Critical 广州酷狗计算机科技有限公司
Priority to PCT/CN2022/124432 priority Critical patent/WO2024077452A1/zh
Priority to CN202280003707.0A priority patent/CN115956270A/zh
Publication of WO2024077452A1 publication Critical patent/WO2024077452A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/06 - Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L 21/10 - Transforming into visible information
    • G10L 21/14 - Transforming into visible information by displaying frequency domain information
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/90 - Pitch determination of speech signals

Definitions

  • the present application relates to the field of audio processing, and in particular to an audio processing method, apparatus, device and storage medium.
  • When music is played, a spectrum graph of the music is displayed.
  • the music spectrum can be displayed as a bar graph, and users can visually feel the rhythm of the music by watching the ups and downs of the bar graph.
  • In the related art, the spectrum graph is obtained by performing short-time Fourier transform on the audio data to obtain the spectrum data of each frame, and then smoothing the spectrum data.
  • the spectrum graph of each frame of the audio data can be played synchronously, thereby displaying the rhythm of the audio data.
  • However, the spectrum graph obtained by the method in the related art does not match the music effect heard by the human ear.
  • the embodiments of the present application provide an audio processing method, device, equipment and storage medium, which can make the spectrum diagram more suitable for human hearing.
  • the technical solution is as follows.
  • an audio processing method comprising:
  • performing short-time Fourier transform on audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data includes a frequency and an amplitude corresponding to the frequency;
  • calculating loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness includes a frequency and a loudness corresponding to the frequency; and
  • using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
  • an audio processing device comprising:
  • a processing module configured to perform short-time Fourier transform on the audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data includes a frequency and an amplitude corresponding to the frequency;
  • a loudness module configured to calculate loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness includes a frequency and a loudness corresponding to the frequency;
  • the windowing module is used to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set by using a fade-in and fade-out function to obtain a second frequency domain loudness set.
  • a computer device comprising: a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the audio processing method described above.
  • a computer-readable storage medium in which at least one instruction, at least one program, a code set or an instruction set is stored, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the audio processing method described above.
  • a computer program product or a computer program is provided, the computer program product or the computer program including computer instructions, the computer instructions being stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the audio processing method provided in the above optional implementation.
  • the frequency domain data of each time window can be obtained, and the frequency domain data identifies the amplitude distribution of the audio data at each frequency in the time window. Then the loudness is calculated based on the amplitude in the frequency domain data, and then the frequency domain loudness of each time window is obtained.
  • the frequency domain loudness can represent the loudness perception of the human ear to the sound waves of each frequency in the current time window. Furthermore, since the human ear has a weaker perception of high-frequency and low-frequency sound waves, the beginning and end of the audio loudness of each time window are faded in and out and windowed, so that the loudness value gradually decreases from the middle to both sides.
  • the frequency domain loudness of each time window calculated by the above method is more in line with the loudness distribution heard by the human ear when the audio data is played. According to the frequency domain loudness, the relevant display effect during audio playback can be produced to make the display effect closer to the human ear hearing effect.
  • FIG. 1 is a block diagram of a computer device provided by an exemplary embodiment of the present application.
  • FIG. 2 is a flow chart of an audio processing method provided by another exemplary embodiment of the present application.
  • FIG. 3 is a flow chart of an audio processing method provided by another exemplary embodiment of the present application.
  • FIG. 4 is a schematic diagram of an audio processing method provided by another exemplary embodiment of the present application.
  • FIG. 5 is a schematic diagram of an audio processing method provided by another exemplary embodiment of the present application.
  • FIG. 6 is a block diagram of an audio processing device provided by another exemplary embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a server provided by another exemplary embodiment of the present application.
  • FIG. 8 is a block diagram of a terminal provided by another exemplary embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a computer device 101 provided in an exemplary embodiment of the present application.
  • the computer device 101 may be a terminal or a server.
  • the terminal may include at least one of a digital camera, a smart phone, a laptop computer, a desktop computer, a tablet computer, an intelligent speaker, and an intelligent robot.
  • the terminal may also be a device with an audio system, for example, MP3 (Moving Picture Experts Group Audio Layer III), MP4 (Moving Picture Experts Group 4), a speaker, an intelligent speaker, a vehicle-mounted computer, a headset, a smart home device, etc.
  • the audio processing method provided in the present application may be applied to an application with an audio processing function, which may be: a music player, a video player, a small video player, an audio editing program, a video editing program, a social program, a life service program, a shopping program, a live broadcast program, a forum program, an information program, a life program, an office program, etc.
  • a client of the application is installed on the terminal.
  • an audio processing algorithm is stored on the terminal, and when the client needs to use the audio processing function provided by the embodiment of the present application, the client can call the audio processing algorithm to complete the audio processing.
  • the audio processing process can be completed by the terminal or by the server.
  • the terminal and the server are connected to each other via a wired or wireless network.
  • the terminal includes a first memory and a first processor.
  • An audio processing algorithm is stored in the first memory; the audio processing algorithm is called and executed by the first processor to implement the audio processing method provided in the present application.
  • the first memory may include but is not limited to the following: random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), and electrically erasable programmable read-only memory (EEPROM).
  • the first processor may be composed of one or more integrated circuit chips.
  • the first processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP).
  • the first processor may implement the audio processing method provided in the present application by running a program or code.
  • the server includes a second memory and a second processor.
  • the second memory stores an audio processing algorithm; the audio processing algorithm is called by the second processor to implement the audio processing method provided in the present application.
  • the second memory may include but is not limited to the following: RAM, ROM, PROM, EPROM, EEPROM.
  • the second processor may be a general-purpose processor, such as a CPU or NP.
  • Fig. 2 shows a flow chart of an audio processing method provided by an exemplary embodiment of the present application.
  • the method can be executed by a computer device, for example, a terminal or a server as shown in Fig. 1.
  • the method includes the following steps.
  • Step 210: Perform short-time Fourier transform on the audio data to obtain a frequency domain data set, in which each time window corresponds to a set of frequency domain data, and the frequency domain data includes a frequency and the amplitude corresponding to the frequency.
  • the audio data may be PCM (Pulse Code Modulation) audio data, or the audio data may be audio data in other audio formats.
  • the short-time Fourier transform divides the audio data into multiple time windows in time, performs Fourier transform on the audio data in each time window, and obtains the frequency domain data of each time window.
  • the frequency domain data of multiple time windows constitute a frequency domain data set. That is, the frequency domain data set is three-dimensional data composed of time window, frequency, and amplitude.
  • the frequency domain data set includes the frequency domain data of multiple time windows, and the frequency domain data of each time window includes the amplitude at each frequency.
  • the frequency domain data may also include the real part and the imaginary part at each frequency, and the amplitude and phase are calculated based on the real part and the imaginary part. In the embodiment of the present application, the loudness is calculated using the amplitude.
  • For example, each time window is one frame (30 ms, 60 ms or 100 ms).
  • Each frame of audio data is Fourier transformed to obtain frequency domain data of each frame, and multiple frames of frequency domain data constitute a frequency domain data set.
  • a frame of frequency domain data includes the amplitude of the frame of audio data at at least one frequency.
  • the length of the time window can be set arbitrarily, for example, to two frames, one second, 1ms, etc.
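  • As an illustrative sketch only (the embodiments do not prescribe a programming language or library; Python with NumPy is assumed here, and the function name and the 30 ms default are invented for illustration), step 210 can be realized as follows:

    import numpy as np

    def stft_frames(audio, sample_rate, frame_ms=30):
        # Split mono PCM samples into fixed-length time windows and apply a
        # Fourier transform to each window (a minimal short-time Fourier transform).
        frame_len = int(sample_rate * frame_ms / 1000)
        n_frames = len(audio) // frame_len
        # Frequency axis shared by every time window.
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
        spectra = np.stack([
            np.fft.rfft(audio[i * frame_len:(i + 1) * frame_len])
            for i in range(n_frames)
        ])
        # np.abs(spectra[i]) is the amplitude of window i at each frequency, so
        # (window index, freqs, amplitude) together form the frequency domain data set.
        return freqs, spectra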
  • Step 220 Calculate loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set.
  • Each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness.
  • the frequency domain loudness includes frequency and loudness corresponding to the frequency.
  • the amplitude in the frequency domain data set is calculated as the loudness, and the loudness replaces the amplitude in the frequency domain data set to obtain a first frequency domain loudness set.
  • For example, loudness = 20lg(amplitude). When the amplitude of the first frame (first time window) at 20 Hz (hertz) is 10, the corresponding calculated loudness is 20, so the loudness of the first frame (first time window) at 20 Hz in the first frequency domain loudness set is 20.
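  • A minimal sketch of this amplitude-to-loudness conversion, under the same Python/NumPy assumption (the floor guard against lg(0) is an added safeguard, not part of the text above):

    import numpy as np

    def to_loudness(amplitudes, floor=1e-12):
        # loudness = 20 * lg(amplitude); e.g. an amplitude of 10 maps to a loudness of 20.
        return 20.0 * np.log10(np.maximum(np.abs(amplitudes), floor))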
  • the relationship between the first frequency domain loudness set and the frequency domain data set is: the time window remains unchanged, the frequency remains unchanged, and the amplitude is replaced by the loudness.
  • For example, if the frequency domain data set includes frequency domain data of 30 time windows, the first frequency domain loudness set also includes frequency domain loudness of 30 time windows, and the time windows in the frequency domain data set are in one-to-one correspondence with the time windows in the first frequency domain loudness set.
  • the sound pressure can also be calculated according to the amplitude in the frequency domain data set to obtain a first frequency domain sound pressure set.
  • Each time window in the first frequency domain sound pressure set corresponds to a set of frequency domain sound pressures, and the frequency domain sound pressure includes the frequency and the sound pressure corresponding to the frequency. Then, the "loudness" related terms in the subsequent steps can be replaced by "sound pressure" accordingly.
  • Step 230 Use a fade-in/fade-out function to perform head-to-tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
  • the fade-in and fade-out function may be a function whose starting point and end point are 0 (or approach 0), and which first gradually increases and then gradually decreases.
  • For example, the fade-in and fade-out function may be the broken-line function connecting (0,0), (1,1), (2,1) and (3,0).
  • the fade function can be used to perform full-band windowing processing on the frequency domain loudness of each time window.
  • Windowing processing refers to multiplying the windowing function (fade function) by the windowed data (frequency domain loudness).
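  • For example, the broken-line fade above can be resampled across the whole band and multiplied into one window's loudness values; a sketch under the same Python/NumPy assumption (non-negative loudness values are assumed so that the multiplication only attenuates):

    import numpy as np

    def trapezoid_fade(n_bins):
        # Broken-line fade through (0,0), (1,1), (2,1), (3,0), resampled to n_bins.
        x = np.linspace(0.0, 3.0, n_bins)
        return np.interp(x, [0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 1.0, 0.0])

    # Full-band windowing: multiply the windowing function by the windowed data.
    # windowed = loudness_frame * trapezoid_fade(len(loudness_frame))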
  • the fade-in and fade-out functions may also be two functions: a fade-in function and a fade-out function.
  • the fade-in function is a gradually increasing function whose starting point is 0 (or approaches 0).
  • the fade-out function is a gradually decreasing function whose end point is 0 (or approaches 0).
  • the fade-in function can be used to perform windowing processing on the head (a frequency band starting from the starting point) of the frequency domain loudness of each time window
  • the fade-out function can be used to perform windowing processing on the tail (a frequency band ending at the end point) of the frequency domain loudness of each time window.
  • the window length (frequency band length) of the windowing can be set arbitrarily; the window lengths of the head and the tail can be the same or different, the head window lengths of different time windows can be the same or different, and the tail window lengths of different time windows can be the same or different.
  • the method provided in this embodiment can obtain frequency domain data of each time window by performing short-time Fourier transform on audio data, and the frequency domain data identifies the amplitude distribution of audio data at each frequency in the time window. Then, the loudness is calculated based on the amplitude in the frequency domain data, and then the frequency domain loudness of each time window is obtained.
  • the frequency domain loudness can represent the loudness perception of the human ear to the sound waves of each frequency in the current time window. Furthermore, since the human ear has a weaker perception of high-frequency and low-frequency sound waves, the beginning and end of the audio loudness of each time window are faded in and out and windowed, so that the loudness value gradually decreases from the middle to both sides.
  • the frequency domain loudness of each time window calculated by the above method is more consistent with the loudness distribution heard by the human ear when the audio data is played. According to the frequency domain loudness, the relevant display effect during audio playback can be produced to make the display effect closer to the human ear hearing effect.
  • Fig. 3 shows a flow chart of an audio processing method provided by an exemplary embodiment of the present application.
  • the method can be executed by a computer device, for example, a terminal or a server as shown in Fig. 1.
  • the method includes the following steps.
  • Step 210: Perform short-time Fourier transform on the audio data to obtain a frequency domain data set, in which each time window corresponds to a set of frequency domain data, and the frequency domain data includes a frequency and the amplitude corresponding to the frequency.
  • For example, if a time window is one frame and the audio data includes 1000 frames, performing short-time Fourier transform on the audio data yields a frequency domain data set consisting of 1000 frames of frequency domain data.
  • One frame of frequency domain data can form a dot graph/line graph/bar graph with the horizontal axis being frequency (for example, the value range is 0-20000Hz) and the vertical axis being amplitude.
  • the frequency domain data set includes 1000 dot graphs/line graphs/bar graphs of frequency domain data.
  • Step 220 Calculate loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set.
  • Each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness.
  • the frequency domain loudness includes frequency and loudness corresponding to the frequency.
  • the frequency domain data set of 1000 frames of frequency domain data in step 210 is converted into a first frequency domain loudness set, and the first frequency domain loudness set includes 1000 frames of frequency domain loudness.
  • One frame of frequency domain loudness can constitute a dot graph/line graph/bar graph with the horizontal axis being frequency (for example, a value range of 0-20000 Hz) and the vertical axis being loudness.
  • the first frequency domain loudness set includes 1000 dot graphs/line graphs/bar graphs of frequency domain loudness.
  • Step 221 Perform A-weighted filtering and Mel-scale conversion on the first frequency domain loudness set.
  • the frequency domain loudness of each time window in the first frequency domain loudness set is further processed, and the further processing includes A-weighted filtering and Mel scale conversion.
  • A-weighted filtering simulates the human ear's loudness response to a 40-phon pure tone: when a signal passes through, its low-frequency and lower mid-frequency components (below 1000 Hz) are strongly attenuated. The characteristic curve of A-weighted filtering is close to the hearing characteristics of the human ear.
  • the A-weighting filter can be used to process each frequency domain loudness in the first frequency domain loudness set.
  • Mel scale conversion is used to convert frequency into Mel scale.
  • Most human ears can recognize frequencies between 20 Hz and 20,000 Hz, but the human ear's perception of frequency (measured in Hz) is not a simple linear relationship.
  • the human ear is most sensitive to medium and low frequency sounds (around 1000 Hz).
  • When the sound frequency increases from 1000 Hz to 2000 Hz, the human ear does not perceive the frequency as doubling. For this reason, the Mel scale is often used to re-quantify the human ear's perception of frequency.
  • the horizontal axis of frequency domain loudness is frequency, and the value range is 0-20000Hz. Then, through the Mel scale conversion, 0-20000Hz is converted to the Mel scale.
  • the loudness corresponding to the Mel scale is the sum of the loudness in the frequency loudness interval corresponding to the Mel scale.
  • For example, if Mel scale 1 corresponds to a frequency band, the loudness of Mel scale 1 is the sum of the loudness values corresponding to all frequencies in that band. In this way, the first frequency domain loudness set is more consistent with the human ear's perception of loudness.
  • the Mel scale conversion can be completed by using a Mel scale filter, and the frequency domain loudness in the first frequency domain loudness set after A-weighted filtering is input into the Mel scale filter to obtain the first frequency domain loudness set after Mel scale conversion.
  • the first frequency domain loudness set used in subsequent steps may be the first frequency domain loudness set obtained in step 220, or may refer to the first frequency domain loudness set processed by A-weighted filtering, or may refer to the first frequency domain loudness set converted by Mel scale, or may refer to the first frequency domain loudness set processed by A-weighted filtering and converted by Mel scale.
  • A-weighted filtering and Mel-scale conversion can be used to obtain frequency domain loudness that is more consistent with human auditory changes.
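  • The embodiments name A-weighted filtering and Mel scale conversion without fixing formulas; the sketch below uses the standard IEC 61672 A-weighting curve and the common 2595*lg(1 + f/700) Mel mapping as one plausible realization (Python/NumPy assumed):

    import numpy as np

    def a_weight_db(freqs):
        # IEC 61672 A-weighting gain in dB (roughly 0 dB at 1 kHz); add it to
        # each frequency's loudness in dB to apply the filter.
        f2 = np.asarray(freqs, dtype=float) ** 2
        ra = (12194.0 ** 2 * f2 ** 2) / (
            (f2 + 20.6 ** 2)
            * np.sqrt((f2 + 107.7 ** 2) * (f2 + 737.9 ** 2))
            * (f2 + 12194.0 ** 2)
        )
        return 20.0 * np.log10(np.maximum(ra, 1e-30)) + 2.0

    def mel_band_loudness(freqs, loudness, n_bands=64):
        # Convert Hz to Mel, then sum the loudness inside equal-width Mel bands,
        # matching the "sum over the corresponding frequency interval" rule.
        mels = 2595.0 * np.log10(1.0 + np.asarray(freqs, dtype=float) / 700.0)
        edges = np.linspace(mels.min(), mels.max(), n_bands + 1)
        idx = np.clip(np.digitize(mels, edges) - 1, 0, n_bands - 1)
        return np.array([loudness[idx == b].sum() for b in range(n_bands)])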
  • However, the data obtained in this way is still uneven and jittery.
  • Therefore, data smoothing, error consistency, and fade-in/fade-out are achieved through the intra-frame and inter-frame data processing in the following steps.
  • Step 222: When the distribution of the frequency domain loudness satisfies the concentrated distribution condition, reduce the loudness values of the frequency domain loudness that are lower than the average loudness.
  • the concentrated distribution condition is used to determine whether the distribution of frequency domain loudness is concentrated around the average loudness.
  • the concentrated distribution condition is that the loudness variance is less than a first threshold and the difference between the loudness expectation and the average loudness is less than a second threshold.
  • the values of the first threshold and the second threshold are determined according to actual needs.
  • average loudness, expected loudness and loudness variance are calculated based on frequency domain loudness; when the loudness variance is less than a first threshold and the difference between the expected loudness and the average loudness is less than a second threshold, the value of the frequency domain loudness whose loudness is lower than the average loudness is reduced.
  • the way to reduce the value of the loudness in the frequency domain that is lower than the average loudness can be: subtracting a fixed value, multiplying by a coefficient, subtracting a gradient value, etc. For example, all loudnesses lower than the average loudness are multiplied by 0.5.
  • As shown in (1) in FIG. 4, the frequency domain loudness of the first time window of the first frequency domain loudness set is concentrated near the average loudness, so the loudness below the average loudness is reduced to obtain (2) in FIG. 4, thereby highlighting the high-loudness part and making it fit the hearing effect of the human ear.
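  • A sketch of step 222 (Python/NumPy assumed). The embodiments distinguish "average loudness" from "loudness expectation" without defining either, so the expectation below is a loudness-weighted mean, which is an assumption; the 0.5 factor follows the example above, and the thresholds are left as parameters:

    import numpy as np

    def suppress_below_average(loudness, var_thresh, diff_thresh, factor=0.5):
        # Concentrated distribution condition: small variance and an
        # expectation close to the average loudness.
        avg = loudness.mean()
        weights = loudness - loudness.min()
        total = weights.sum()
        expectation = (loudness * weights).sum() / total if total else avg
        if loudness.var() < var_thresh and abs(expectation - avg) < diff_thresh:
            # Scale down every loudness value below the average.
            loudness = np.where(loudness < avg, loudness * factor, loudness)
        return loudness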
  • Step 230 Use a fade-in/fade-out function to perform head-to-tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
  • a fade-in function is used to perform windowing processing on the head of the frequency domain loudness of each time window in the first frequency domain loudness set, where the head is a part with a frequency lower than a first frequency threshold; and a fade-out function is used to perform windowing processing on the tail of the frequency domain loudness of each time window in the first frequency domain loudness set, where the tail is a part with a frequency higher than a second frequency threshold.
  • the window size and position of the head and tail can be set arbitrarily. For example, they can be set according to the value range of the horizontal axis: from the minimum of the range to the first frequency threshold is the head, and from the second frequency threshold to the maximum of the range is the tail. They can also be determined from the frequency/Mel scale range over which the loudness is greater than 0. For example, suppose the horizontal axis of the frequency domain loudness ranges over 0-100, the loudness over 0-10 is 0, the loudness at 10 is 1, the loudness at 90 is 1, and the loudness over 90-100 is 0.
  • Then the frequency/Mel scale range with loudness greater than 0 is 10-90: the head runs from the minimum of that range to the first frequency threshold, and the tail runs from the second frequency threshold to the maximum of that range.
  • the window lengths of the head and tail can be the same or different.
  • the head can be 0-100 Hz and the tail can be 19900-20000 Hz.
  • the head can be 0-10 and the tail can be 90-100.
  • the fade-in function is the portion of the sine function from 0 to π/2, and the fade-out function is the portion of the cosine function from 0 to π/2.
  • the fade-in function is scaled to the same length as the horizontal axis window of the head, and then the fade-in function is multiplied by the frequency domain loudness of the head to obtain the head after windowing.
  • the fade-out function is scaled to the same length as the horizontal axis window of the tail, and then the fade-out function is multiplied by the frequency domain loudness of the tail to obtain the tail after windowing.
  • (1) in FIG. 5 shows the frequency domain loudness of the first time window, where the horizontal axis is frequency/Mel scale and the vertical axis is loudness.
  • a fade-in function is used to perform windowing processing on the head, and a fade-out function is used to perform windowing processing on the tail.
  • the frequency domain loudness shown in (2) in FIG. 5 is then obtained.
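  • A sketch of this head/tail windowing with the sine and cosine fades (Python/NumPy assumed; head_bins and tail_bins are free parameters because the window lengths are left open above):

    import numpy as np

    def fade_head_tail(loudness, head_bins, tail_bins):
        out = np.asarray(loudness, dtype=float).copy()
        # Fade-in: the sine over [0, pi/2] rises from 0 to 1.
        out[:head_bins] *= np.sin(np.linspace(0.0, np.pi / 2.0, head_bins))
        # Fade-out: the cosine over [0, pi/2] falls from 1 to 0.
        out[-tail_bins:] *= np.cos(np.linspace(0.0, np.pi / 2.0, tail_bins))
        return out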
  • Step 231 Perform intra-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
  • a polynomial smoothing algorithm is used to smooth the frequency domain loudness of each time window. For example, if the window includes 10 loudnesses of 0-100 Hz, the j-th loudness is smoothed using the j-2-th loudness, j-1-th loudness, j+1-th loudness, and j+2-th loudness, and each loudness in the time window is smoothed in turn.
  • j is an integer greater than 2.
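  • The embodiments name only a polynomial smoother over the j-2..j+2 neighbors; a 5-point quadratic Savitzky-Golay filter is one standard polynomial smoother with exactly that footprint and is used in the sketch below (Python with SciPy assumed):

    import numpy as np
    from scipy.signal import savgol_filter

    def smooth_within_window(loudness):
        # Fit a quadratic through each loudness value and its two neighbors
        # on each side; assumes the window holds at least 5 loudness values.
        return savgol_filter(np.asarray(loudness, dtype=float),
                             window_length=5, polyorder=2)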
  • Step 232: Perform inter-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
  • a sliding weighted smoothing filter is used to smooth the frequency domain loudness of the i-th time window based on the frequency domain loudness of the (i-1)-th, i-th, and (i+1)-th time windows, where i is a positive integer.
  • Falling loudness: the smoothed loudness c3 of the i-th time window at the x-th frequency = f3 × a + (1 - a) × (1 - a) × f2 + (1 - a) × a × f1, where f3 is the loudness of the (i+1)-th time window at the x-th frequency, f2 is the pre-smoothing loudness of the i-th time window at the x-th frequency, f1 is the loudness of the (i-1)-th time window at the x-th frequency, and a is the drop coefficient.
  • Rising loudness: the smoothed loudness c3 of the i-th time window at the x-th frequency = f3 × b + (1 - b) × b × f2 + (1 - b) × (1 - b) × f1, with f1, f2 and f3 as above and b the rise coefficient.
  • the drop coefficient a and the rise coefficient b can be set according to the requirements.
  • the value range of a and b is 0 to 1.
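  • A sketch of step 232 that uses the two weighted forms above verbatim (Python/NumPy assumed). The embodiments do not state when the drop form versus the rise form applies; choosing per frequency by whether the next window is louder than the current one is an assumption, as are the example coefficient values:

    import numpy as np

    def smooth_between_windows(f1, f2, f3, a=0.3, b=0.7):
        # f1, f2, f3: loudness of windows i-1, i, i+1 at each frequency;
        # a is the drop coefficient and b the rise coefficient (both in 0..1).
        falling = f3 * a + (1 - a) * (1 - a) * f2 + (1 - a) * a * f1
        rising = f3 * b + (1 - b) * b * f2 + (1 - b) * (1 - b) * f1
        return np.where(f3 >= f2, rising, falling)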
  • Step 233 Generate a playback display effect of the audio data based on the second frequency domain loudness set, where the playback display effect includes at least one of a spectrum diagram of the audio data, a background image playback effect, a lyrics playback effect, a music fountain effect, and a lighting effect.
  • the frequency domain loudness in the second frequency domain loudness set obtained after the above steps is spectrum data that conforms to human hearing, so the playback display effect shown when the audio data is played can be generated based on the frequency domain loudness in the second frequency domain loudness set.
  • a spectrum graph that moves with the rhythm of music can be generated when music is played.
  • the background image playback effect can be controlled according to the audio loudness, for example, the image display duration, image zoom size, image playback speed, etc.
  • the lighting effect can be controlled according to the frequency domain loudness.
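  • For instance, a spectrum bar graph only needs each window's loudness values scaled to bar heights; a minimal sketch (Python/NumPy assumed; a real renderer would add per-band ballistics, colors, and so on):

    import numpy as np

    def bar_heights(loudness, max_height=100.0):
        # Normalize one window's loudness to [0, 1], then scale to bar heights.
        values = np.asarray(loudness, dtype=float)
        lo, hi = values.min(), values.max()
        span = hi - lo if hi > lo else 1.0
        return (values - lo) / span * max_height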
  • the above steps of processing the audio data may be arbitrarily deleted, the order may be adjusted, and combinations may be made to obtain new embodiments.
  • the method provided in this embodiment can obtain frequency domain data of each time window by performing short-time Fourier transform on audio data, and the frequency domain data identifies the amplitude distribution of audio data at each frequency in the time window. Then, loudness is calculated based on the amplitude in the frequency domain data, and then the frequency domain loudness of each time window is obtained.
  • Frequency domain loudness can represent the loudness perception of human hearing to each frequency sound wave in the current time window. Further, A-weighted filtering and Mel scale conversion are performed on the frequency domain loudness. Then, for the time window where the loudness distribution is concentrated in the average loudness, the loudness below the average loudness is reduced to highlight the high loudness part.
  • the intra-frame data and the inter-frame data are smoothed to make the visual presentation of the frequency domain loudness values smoother.
  • the frequency domain loudness of each time window calculated by the above method is more consistent with the loudness distribution heard by the human ear when the audio data is played. According to the frequency domain loudness, the relevant display effect during audio playback can be produced to make the display effect closer to the human ear hearing effect, improve the visual and auditory matching degree when the user views the playback display effect, and improve the viewing experience.
  • FIG6 shows a schematic diagram of the structure of an audio processing device provided by an exemplary embodiment of the present application.
  • the device can be implemented as all or part of a computer device through software, hardware, or a combination of both.
  • the device includes:
  • a processing module 401 is used to perform a short-time Fourier transform on the audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data includes a frequency and an amplitude corresponding to the frequency;
  • a loudness module 402 is configured to calculate loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness includes a frequency and a loudness corresponding to the frequency;
  • the windowing module 403 is configured to perform head-to-tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set by using a fade-in/fade-out function to obtain a second frequency domain loudness set.
  • the windowing module 403 is used to perform windowing processing on the head of the frequency domain loudness of each time window in the first frequency domain loudness set by using a fade-in function, wherein the head is a part whose frequency is lower than a first frequency threshold;
  • the windowing module 403 is configured to perform windowing processing on the tail of the frequency domain loudness of each time window in the first frequency domain loudness set by using a fade-out function, wherein the tail is a part whose frequency is higher than a second frequency threshold.
  • the fade-in function is the 0 to π/2 portion of a sine function
  • the fade-out function is the 0 to π/2 portion of the cosine function.
  • the device further comprises:
  • the reducing module 406 is configured to reduce the loudness values of the frequency domain loudness that are lower than the average loudness when the distribution of the frequency domain loudness meets the concentrated distribution condition.
  • the reduction module 406 is used to calculate the average loudness, loudness expectation and loudness variance based on the frequency domain loudness;
  • the reducing module 406 is configured to reduce the value of the frequency domain loudness that is lower than the average loudness when the loudness variance is smaller than a first threshold and the difference between the expected loudness and the average loudness is smaller than a second threshold.
  • each time window in the second frequency domain loudness set corresponds to a group of frequency domain loudness; the device further includes:
  • the smoothing module 405 is configured to perform intra-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
  • each time window in the second frequency domain loudness set corresponds to a group of frequency domain loudness; the device further includes:
  • the smoothing module 405 is configured to perform inter-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
  • the smoothing module 405 is used to adopt a sliding smoothing weighted filtering algorithm to smooth the frequency domain loudness of the i-th time window based on the frequency domain loudness of the i-1th time window, the i-th time window, and the i+1th time window, where i is a positive integer.
  • the processing module 401 is used to perform A-weighted filtering and Mel-scale conversion on the first frequency-domain loudness set.
  • the device further comprises:
  • the display module 404 is used to generate a playback display effect of the audio data based on the second frequency domain loudness set, wherein the playback display effect includes at least one of a spectrum diagram of the audio data, a background image playback effect, a lyrics playback effect, and a music fountain effect.
  • FIG7 is a schematic diagram of the structure of a server provided by an embodiment of the present application.
  • the server 800 includes a central processing unit (CPU) 801, a system memory 804 including a random access memory (RAM) 802 and a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801.
  • the server 800 also includes a basic input/output system (I/O system) 806 that helps transmit information between various devices in the computer, and a large-capacity storage device 807 for storing an operating system 813, application programs 814 and other program modules 815.
  • the basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or a keyboard, for the user to input information.
  • the display 808 and the input device 809 are connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805.
  • the basic input/output system 806 may also include an input/output controller 810 for receiving and processing inputs from a plurality of other devices such as a keyboard, a mouse, or an electronic stylus.
  • the input/output controller 810 also provides output to a display screen, a printer, or other types of output devices.
  • the mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805.
  • the mass storage device 807 and its associated computer readable medium provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer readable medium (not shown) such as a hard disk or a CD-ROM (Compact Disc Read-Only Memory) drive.
  • Computer readable media may include computer storage media and communication media.
  • Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, cassettes, tapes, disk storage or other magnetic storage devices.
  • the server 800 can also be connected to a remote computer on the network through a network such as the Internet. That is, the server 800 can be connected to the network 812 through the network interface unit 811 connected to the system bus 805, or the network interface unit 811 can be used to connect to other types of networks or remote computer systems (not shown).
  • the present application also provides a terminal, which includes a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the audio processing method provided by each of the above method embodiments. It should be noted that the terminal can be the terminal provided in FIG. 8 below.
  • FIG8 shows a block diagram of a terminal 900 provided by an exemplary embodiment of the present application.
  • the terminal 900 may be a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a laptop computer or a desktop computer.
  • the terminal 900 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
  • the terminal 900 includes a processor 901 and a memory 902 .
  • the processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, etc.
  • the processor 901 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 901 may also include a main processor and a coprocessor.
  • the main processor is a processor for processing data in the awake state, also known as a CPU (Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state.
  • the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen.
  • the processor 901 may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.
  • the memory 902 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 902 may also include a high-speed random access memory, and a non-volatile memory, such as one or more disk storage devices, flash memory storage devices.
  • the non-transitory computer-readable storage medium in the memory 902 is used to store at least one instruction, which is executed by the processor 901 to implement the audio processing method provided in the method embodiments of the present application.
  • the terminal 900 may also optionally include: a peripheral device interface 903 and at least one peripheral device.
  • the processor 901, the memory 902 and the peripheral device interface 903 may be connected via a bus or a signal line.
  • Each peripheral device may be connected to the peripheral device interface 903 via a bus, a signal line or a circuit board.
  • the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908 and a power supply 909.
  • the peripheral device interface 903 may be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 901 and the memory 902.
  • the processor 901, the memory 902, and the peripheral device interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902, and the peripheral device interface 903 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
  • the radio frequency circuit 904 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals.
  • the radio frequency circuit 904 communicates with the communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 904 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 904 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module (SIM) card, and the like.
  • the radio frequency circuit 904 can communicate with other terminals through at least one wireless communication protocol.
  • the wireless communication protocol includes, but is not limited to: the World Wide Web, a metropolitan area network, an intranet, various generations of mobile communication networks (2G, 3G, 4G and 5G), a wireless local area network and/or a WiFi (Wireless Fidelity) network.
  • the radio frequency circuit 904 may also include circuits related to NFC (Near Field Communication), which is not limited in this application.
  • the display screen 905 is used to display the UI (User Interface).
  • the UI may include graphics, text, icons, videos and any combination thereof.
  • the display screen 905 also has the ability to collect touch signals on the surface of the display screen 905 or above the surface.
  • the touch signal can be input to the processor 901 as a control signal for processing.
  • the display screen 905 can also be used to provide virtual buttons and/or virtual keyboards, also known as soft buttons and/or soft keyboards.
  • there may be one display screen 905, disposed on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, respectively disposed on different surfaces of the terminal 900 or in a folded design; in still other embodiments, the display screen 905 may be a flexible display screen disposed on a curved or folded surface of the terminal 900. The display screen 905 may even be set in a non-rectangular irregular shape, that is, a special-shaped screen.
  • the display screen 905 can be made of materials such as LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode) and the like.
  • the camera component 906 is used to capture images or videos.
  • the camera component 906 includes a front camera and a rear camera.
  • the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • in some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blur function, or the main camera and the wide-angle camera can be fused to realize panoramic shooting and VR (Virtual Reality) shooting, or other fused shooting functions.
  • the camera component 906 may also include a flash.
  • the flash can be a single-color temperature flash or a dual-color temperature flash.
  • a dual-color temperature flash refers to a combination of a warm light flash and a cold light flash, which can be used for light compensation at different color temperatures.
  • the audio circuit 907 may include a microphone and a speaker.
  • the microphone is used to collect sound waves from the user and the environment, convert the sound waves into electrical signals, and input them into the processor 901 for processing, or input them into the radio frequency circuit 904 to implement voice communication.
  • the microphone may also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert the electrical signal from the processor 901 or the radio frequency circuit 904 into sound waves.
  • the speaker may be a traditional film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for purposes such as ranging.
  • the audio circuit 907 may also include a headphone jack.
  • Positioning component 908 is used to locate the current geographic location of the terminal 900 to implement navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the US GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
  • the power supply 909 is used to power various components in the terminal 900.
  • the power supply 909 can be an alternating current, a direct current, a disposable battery, or a rechargeable battery.
  • the rechargeable battery can be a wired rechargeable battery or a wireless rechargeable battery.
  • a wired rechargeable battery is a battery charged through a wired line
  • a wireless rechargeable battery is a battery charged through a wireless coil.
  • the rechargeable battery can also be used to support fast charging technology.
  • the terminal 900 further includes one or more sensors 910 , including but not limited to: an acceleration sensor 911 , a gyroscope sensor 912 , a pressure sensor 913 , a fingerprint sensor 914 , an optical sensor 915 , and a proximity sensor 916 .
  • the acceleration sensor 911 can detect the magnitude of acceleration on the three coordinate axes of the coordinate system established by the terminal 900.
  • the acceleration sensor 911 can be used to detect the components of gravity acceleration on the three coordinate axes.
  • the processor 901 can control the display screen 905 to display the user interface in a landscape view or a portrait view according to the gravity acceleration signal collected by the acceleration sensor 911.
  • the acceleration sensor 911 can also be used to collect game or user motion data.
  • the gyro sensor 912 can detect the body direction and rotation angle of the terminal 900, and can cooperate with the acceleration sensor 911 to collect the user's 3D actions on the terminal 900.
  • the processor 901 can implement the following functions based on the data collected by the gyro sensor 912: motion sensing (such as changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
  • the pressure sensor 913 can be set on the side frame of the terminal 900 and/or the lower layer of the display screen 905.
  • when the pressure sensor 913 is disposed on the side frame of the terminal 900, it can detect the user's grip signal on the terminal 900, and the processor 901 performs left/right hand recognition or a shortcut operation according to the grip signal collected by the pressure sensor 913.
  • when the pressure sensor 913 is disposed on the lower layer of the display screen 905, the processor 901 controls the operability controls on the UI according to the user's pressure operation on the display screen 905.
  • the operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
  • the fingerprint sensor 914 is used to collect the user's fingerprint, and the processor 901 identifies the user based on the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 itself identifies the user based on the collected fingerprint. When the user's identity is identified as trusted, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, and changing settings.
  • the fingerprint sensor 914 can be set on the front, back or side of the terminal 900. When a physical button or a manufacturer logo is set on the terminal 900, the fingerprint sensor 914 can be integrated with the physical button or the manufacturer logo.
  • the optical sensor 915 is used to collect the ambient light intensity.
  • the processor 901 can control the display brightness of the display screen 905 according to the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced.
  • the processor 901 can also dynamically adjust the shooting parameters of the camera component 906 according to the ambient light intensity collected by the optical sensor 915.
  • the proximity sensor 916 also called a distance sensor, is usually arranged on the front panel of the terminal 900.
  • the proximity sensor 916 is used to collect the distance between the user and the front of the terminal 900.
  • when the proximity sensor 916 detects that the distance between the user and the front of the terminal 900 gradually decreases, the processor 901 controls the display screen 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance between the user and the front of the terminal 900 gradually increases, the processor 901 controls the display screen 905 to switch from the screen-off state to the screen-on state.
  • those skilled in the art can understand that the structure shown in FIG. 8 does not constitute a limitation on the terminal 900, which may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.
  • the memory also includes one or more programs, which are stored in the memory and include programs for performing the audio processing method provided in the embodiment of the present application.
  • the present application also provides a computer device, which includes: a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, at least one program, code set or instruction set is loaded and executed by the processor to implement the audio processing method provided by the above-mentioned method embodiments.
  • the present application also provides a computer-readable storage medium, which stores at least one instruction, at least one program, code set or instruction set.
  • the at least one instruction, at least one program, code set or instruction set is loaded and executed by a processor to implement the audio processing method provided by the above-mentioned method embodiments.
  • the present application also provides a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium.
  • a processor of a computer device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the audio processing method provided in the above optional implementation.

Landscapes

  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Stereophonic System (AREA)
  • Tone Control, Compression And Expansion, Limiting Amplitude (AREA)

Abstract

The present application discloses an audio processing method, apparatus, device and storage medium, and relates to the field of audio processing. The method comprises: performing short-time Fourier transform on audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data comprises a frequency and an amplitude corresponding to the frequency; calculating loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness comprises a frequency and a loudness corresponding to the frequency; and using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set. The method makes the spectrum graph better fit human hearing.

Description

Audio processing method, apparatus, device and storage medium
Technical Field
The present application relates to the field of audio processing, and in particular to an audio processing method, apparatus, device and storage medium.
Background
When music is played, a spectrum graph of the music is displayed. The spectrum graph can be displayed as a bar graph, and by watching the bars rise and fall the user visually senses the rhythm of the music.
In the related art, the spectrum graph is obtained as follows: short-time Fourier transform is performed on the audio data to obtain the spectrum data of each frame, and the spectrum data is smoothed to obtain the spectrum graph. While the audio data is played, the spectrum graph of each frame of the audio data can be played synchronously, thereby displaying the rhythm of the audio data.
However, the spectrum graph obtained by the method in the related art does not match the music effect actually heard by the human ear.
Summary
The embodiments of the present application provide an audio processing method, apparatus, device and storage medium, which can make the spectrum graph better fit human hearing. The technical solution is as follows.
According to one aspect of the present application, an audio processing method is provided, the method comprising:
performing short-time Fourier transform on audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data comprises a frequency and an amplitude corresponding to the frequency;
calculating loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness comprises a frequency and a loudness corresponding to the frequency; and
using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
According to another aspect of the present application, an audio processing apparatus is provided, the apparatus comprising:
a processing module, configured to perform short-time Fourier transform on audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data comprises a frequency and an amplitude corresponding to the frequency;
a loudness module, configured to calculate loudness according to the amplitude in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness comprises a frequency and a loudness corresponding to the frequency; and
a windowing module, configured to use a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
According to another aspect of the present application, a computer device is provided, the computer device comprising a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to implement the audio processing method described above.
According to another aspect of the present application, a computer-readable storage medium is provided, wherein the storage medium stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to implement the audio processing method described above.
According to another aspect of the embodiments of the present disclosure, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the audio processing method provided in the above optional implementations.
The beneficial effects brought by the technical solutions provided in the embodiments of the present application include at least the following:
By performing short-time Fourier transform on the audio data, the frequency domain data of each time window can be obtained, and the frequency domain data indicates the amplitude distribution of the audio data at each frequency within the time window. Loudness is then calculated from the amplitude in the frequency domain data, yielding the frequency domain loudness of each time window. The frequency domain loudness can represent the human ear's perceived loudness of the sound waves at each frequency in the current time window. Further, since the human ear perceives high-frequency and low-frequency sound waves more weakly, the head and tail of each time window's frequency domain loudness are windowed with a fade-in and fade-out, so that the loudness values gradually decrease from the middle toward both sides. The frequency domain loudness of each time window calculated by the above method better matches the loudness distribution heard by the human ear when the audio data is played, and the display effects shown during audio playback can be produced from this frequency domain loudness, making the display effects closer to the human ear's hearing.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a block diagram of a computer device provided by an exemplary embodiment of the present application;
Fig. 2 is a flowchart of an audio processing method provided by another exemplary embodiment of the present application;
Fig. 3 is a flowchart of an audio processing method provided by another exemplary embodiment of the present application;
Fig. 4 is a schematic diagram of an audio processing method provided by another exemplary embodiment of the present application;
Fig. 5 is a schematic diagram of an audio processing method provided by another exemplary embodiment of the present application;
Fig. 6 is a block diagram of an audio processing apparatus provided by another exemplary embodiment of the present application;
Fig. 7 is a schematic structural diagram of a server provided by another exemplary embodiment of the present application;
Fig. 8 is a block diagram of a terminal provided by another exemplary embodiment of the present application.
Detailed Description
To make the objectives, technical solutions and advantages of the present application clearer, the implementations of the present application are described in further detail below with reference to the drawings.
Fig. 1 shows a schematic diagram of a computer device 101 provided by an exemplary embodiment of the present application. The computer device 101 may be a terminal or a server.
The terminal may include at least one of a digital camera, a smartphone, a laptop, a desktop computer, a tablet, a smart speaker, and a smart robot. Optionally, the terminal may also be a device with audio playback capability, for example, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group 4) player, a speaker, a smart speaker, an in-vehicle computer, earphones, or a smart home device. In an optional implementation, the audio processing method provided by the present application can be applied in an application with an audio processing function, which may be a music player, a video player, a short-video player, an audio editor, a video editor, a social application, a lifestyle-service application, a shopping application, a live-streaming application, a forum application, a news application, a lifestyle application, an office application, and the like. Optionally, a client of the application is installed on the terminal.
Exemplarily, an audio processing algorithm is stored on the terminal. When the client needs to use the audio processing function provided by the embodiments of the present application, the client can call the audio processing algorithm to complete the audio processing. Exemplarily, the audio processing may be performed by the terminal or by the server.
The terminal and the server are connected to each other through a wired or wireless network.
The terminal includes a first memory and a first processor. The first memory stores an audio processing algorithm, which is called and executed by the first processor to implement the audio processing method provided by the present application. The first memory may include but is not limited to: random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), and electrically erasable programmable read-only memory (EEPROM).
The first processor may consist of one or more integrated circuit chips. Optionally, the first processor may be a general-purpose processor, such as a central processing unit (CPU) or a network processor (NP). Optionally, the first processor implements the audio processing method provided by the present application by running a program or code.
The server includes a second memory and a second processor. The second memory stores an audio processing algorithm, which is called by the second processor to implement the audio processing method provided by the present application. Optionally, the second memory may include but is not limited to RAM, ROM, PROM, EPROM, and EEPROM. Optionally, the second processor may be a general-purpose processor, such as a CPU or an NP.
Fig. 2 shows a flowchart of an audio processing method provided by an exemplary embodiment of the present application. The method may be executed by a computer device, for example, the terminal or server shown in Fig. 1. The method includes the following steps.
Step 210: perform short-time Fourier transform on audio data to obtain a frequency domain data set, where each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data includes frequencies and the amplitudes corresponding to the frequencies.
The audio data may be PCM (Pulse Code Modulation) audio data, or audio data in another audio format.
The short-time Fourier transform splits the audio data in time into multiple time windows and performs Fourier transform on the audio data within each time window to obtain the frequency domain data of each time window. The frequency domain data of the multiple time windows form the frequency domain data set. That is, the frequency domain data set is three-dimensional data consisting of time window, frequency, and amplitude: it includes the frequency domain data of multiple time windows, and the frequency domain data of each time window includes the amplitude at each frequency. Exemplarily, the frequency domain data may instead include the real and imaginary parts at each frequency, from which the amplitude and phase are computed; in the embodiments of the present application, the amplitude is used to compute the loudness.
For example, each time window is one frame (30 ms, 60 ms or 100 ms). Fourier transform is performed on each frame of the audio signal to obtain the frequency domain data of that frame, and the frames of frequency domain data form the frequency domain data set. One frame of frequency domain data includes the amplitude of that frame of audio data at at least one frequency. Of course, the length of the time window can be set arbitrarily, for example, to two frames, one second, or 1 ms.
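As a concrete illustration of this step, the sketch below computes such a frequency domain data set with scipy; it is an assumption for illustration rather than the patent's implementation, and the function name `frequency_domain_dataset` and the 30 ms default window are invented here.

```python
import numpy as np
from scipy.signal import stft

def frequency_domain_dataset(pcm, sample_rate, window_ms=30):
    # Split the audio into time windows and Fourier-transform each one;
    # stft returns complex values (real and imaginary parts) per frequency.
    nperseg = int(sample_rate * window_ms / 1000)
    freqs, times, spectrum = stft(pcm, fs=sample_rate, nperseg=nperseg)
    # Amplitude is derived from the real and imaginary parts, as the
    # description notes; phase is discarded for loudness computation.
    amplitudes = np.abs(spectrum)
    return freqs, times, amplitudes  # axes: frequency x time window
```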
Step 220: compute loudness from the amplitudes in the frequency domain data set to obtain a first frequency domain loudness set, where each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness includes frequencies and the loudness corresponding to the frequencies.
The amplitudes in the frequency domain data set are converted into loudness, and the loudness replaces the amplitudes in the frequency domain data set, yielding the first frequency domain loudness set.
For example, loudness = 20·lg(amplitude). When the amplitude of the first frame (first time window) at 20 Hz is 10, the corresponding computed loudness is 20, so in the first frequency domain loudness set the loudness of the first frame (first time window) at 20 Hz is 20.
The relationship between the first frequency domain loudness set and the frequency domain data set is: the time windows are unchanged, the frequencies are unchanged, and the amplitudes are replaced by loudness. For example, if the frequency domain data set includes the frequency domain data of 30 time windows, the first frequency domain loudness set also includes the frequency domain loudness of 30 time windows, and the time windows in the two sets correspond one to one.
Optionally, sound pressure may instead be computed from the amplitudes in the frequency domain data set, yielding a first frequency domain sound pressure set. Each time window in the first frequency domain sound pressure set corresponds to a set of frequency domain sound pressure, which includes frequencies and the sound pressure corresponding to the frequencies. In that case, "loudness" in the subsequent steps can be replaced with "sound pressure" accordingly.
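A minimal sketch of the amplitude-to-loudness conversion described above; the `eps` guard against log(0) is an addition for numerical safety and is not part of the stated formula.

```python
import numpy as np

def to_loudness(amplitudes, eps=1e-12):
    # loudness = 20 * lg(amplitude); an amplitude of 10 maps to a
    # loudness of 20, matching the example in the text.
    return 20.0 * np.log10(np.maximum(amplitudes, eps))
```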
Step 230: use a fade-in and fade-out function to perform head and tail windowing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
The fade-in and fade-out function may be a function whose start and end points are 0 or tend to 0, and which first gradually increases and then gradually decreases. For example, it may be the polyline function through the points (0,0), (1,1), (2,1), (3,0).
When the fade-in and fade-out function is a single continuous function, it can be used to window the full frequency band of the frequency domain loudness of each time window. Windowing means multiplying the data to be windowed (the frequency domain loudness) by the windowing function (the fade-in and fade-out function).
Optionally, the fade-in and fade-out function may instead be two functions: a fade-in function and a fade-out function. The fade-in function is a gradually increasing function whose starting point is 0 or tends to 0; the fade-out function is a gradually decreasing function whose end point is 0 or tends to 0.
The fade-in function can be used to window the head of the frequency domain loudness of each time window (a frequency band starting from the starting point), and the fade-out function to window the tail (a frequency band ending at the end point). The window length (frequency band length) can be set arbitrarily; the head and tail window lengths may be the same or different, the head window lengths of different time windows may be the same or different, and the tail window lengths of different time windows may be the same or different.
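The head/tail windowing could be sketched as follows, assuming the loudness of one time window is a 1-D array ordered by frequency; the head and tail lengths and the fade callables are left as parameters, mirroring the text's note that they can be set arbitrarily.

```python
import numpy as np

def window_head_tail(loudness, head_len, tail_len, fade_in, fade_out):
    # Multiply the head by a rising curve and the tail by a falling
    # curve; the middle band is left untouched.
    out = np.asarray(loudness, dtype=float).copy()
    out[:head_len] *= fade_in(head_len)
    out[len(out) - tail_len:] *= fade_out(tail_len)
    return out
```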
In summary, in the method provided by this embodiment, short-time Fourier transform on the audio data yields the frequency domain data of each time window, which indicates the amplitude distribution of the audio data over the frequencies within that time window. Loudness is then computed from the amplitudes, yielding the frequency domain loudness of each time window, which represents how the human ear perceives the loudness of the sound waves at each frequency within the current time window. Further, since the human ear is less sensitive to high-frequency and low-frequency sound waves, fade-in and fade-out windowing is applied to the head and tail of the loudness of each time window, so that the loudness values gradually decrease from the middle toward both ends. The frequency domain loudness of each time window computed in this way better matches the loudness distribution the ear hears when the audio data is played, and display effects for audio playback can be produced from it, so that the displayed effects come closer to what the ear hears.
Fig. 3 shows a flowchart of an audio processing method provided by an exemplary embodiment of the present application. The method may be executed by a computer device, for example, the terminal or server shown in Fig. 1. The method includes the following steps.
Step 210: perform short-time Fourier transform on audio data to obtain a frequency domain data set, where each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data includes frequencies and the amplitudes corresponding to the frequencies.
For example, if one time window is one frame and the audio data includes 1000 frames, short-time Fourier transform on the audio data yields a frequency domain data set consisting of 1000 frames of frequency domain data. One frame of frequency domain data can be drawn as a scatter/line/bar chart with frequency on the horizontal axis (ranging, for example, from 0 to 20000 Hz) and amplitude on the vertical axis; the frequency domain data set then comprises 1000 such charts.
Step 220: compute loudness from the amplitudes in the frequency domain data set to obtain a first frequency domain loudness set, where each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness includes frequencies and the loudness corresponding to the frequencies.
For example, the 1000-frame frequency domain data set from the example in step 210 is converted into the first frequency domain loudness set, which includes 1000 frames of frequency domain loudness. One frame of frequency domain loudness can be drawn as a scatter/line/bar chart with frequency on the horizontal axis (ranging, for example, from 0 to 20000 Hz) and loudness on the vertical axis; the first frequency domain loudness set then comprises 1000 such charts.
Step 221: perform A-weighting filtering and Mel-scale conversion on the first frequency domain loudness set.
Optionally, after the first frequency domain loudness set is obtained in step 220, the frequency domain loudness of each time window in the first frequency domain loudness set is further processed; the further processing includes A-weighting filtering and Mel-scale conversion.
A-weighting filtering simulates the human ear's loudness perception of a 40-phon pure tone: when a signal passes through it, the low and middle frequencies (below 1000 Hz) are strongly attenuated. The A-weighting characteristic curve is close to the hearing characteristics of the human ear. For example, an A-weighting filter can be applied to each frequency domain loudness in the first frequency domain loudness set.
Mel-scale conversion converts frequency to the Mel scale. The frequency range most human ears can recognize is 20 to 20000 Hz, but the ear's perception of frequency in Hz is not simply linear. For example, the ear is most sensitive to low-to-mid frequencies (around 1000 Hz), and when the frequency rises from 1000 Hz to 2000 Hz, the ear does not perceive the frequency as having doubled. For this reason, the Mel scale is commonly used to re-quantify how the ear perceives frequency.
For example, if the horizontal axis of the frequency domain loudness is frequency ranging from 0 to 20000 Hz, Mel-scale conversion maps 0 to 20000 Hz onto the Mel scale. The loudness at a Mel value is the sum of the loudness within the frequency interval corresponding to that Mel value. For example, if Mel value 1 corresponds to a frequency band, the loudness at Mel value 1 is the sum of the loudness of all frequencies within that band. This makes the first frequency domain loudness set better match the ear's perception of loudness.
The Mel-scale conversion formula is: Mel = 2595·lg(1 + frequency/700). For example, 6300 Hz converts to a Mel value of 2595.
Mel-scale conversion can be done with a Mel-scale filter bank: the frequency domain loudness in the A-weighted first frequency domain loudness set is fed into the Mel-scale filter bank to obtain the Mel-converted first frequency domain loudness set.
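The Mel formula above translates directly to code. The band-pooling function below, which sums the loudness falling into each Mel interval as the description states, is a simplified stand-in for a full Mel filter bank; the A-weighting curve itself is standardized (IEC 61672) and omitted here for brevity. Function names and the uniform band edges are assumptions for illustration.

```python
import numpy as np

def hz_to_mel(freq_hz):
    # Mel = 2595 * lg(1 + f / 700); hz_to_mel(6300.0) gives 2595,
    # matching the example in the text.
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)

def pool_to_mel_bands(freqs_hz, loudness, n_bands):
    # Sum the loudness of all frequencies inside each Mel interval.
    mel = hz_to_mel(np.asarray(freqs_hz, dtype=float))
    edges = np.linspace(mel.min(), mel.max(), n_bands + 1)
    band = np.clip(np.digitize(mel, edges) - 1, 0, n_bands - 1)
    loudness = np.asarray(loudness, dtype=float)
    return np.array([loudness[band == b].sum() for b in range(n_bands)])
```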
Optionally, the first frequency domain loudness set used in the subsequent steps may be the one obtained in step 220, the one after A-weighting filtering, the one after Mel-scale conversion, or the one after both A-weighting filtering and Mel-scale conversion.
After A-weighting filtering and Mel-scale conversion, frequency domain loudness that better matches human hearing is obtained, but the data is ragged and jittery. This embodiment therefore applies intra-window and inter-window processing to smooth the data, keep its variation visually pleasing, and apply the fade-in and fade-out.
Step 222: when the distribution of the frequency domain loudness satisfies a concentrated distribution condition, lower the loudness values that are below the average loudness.
The concentrated distribution condition is used to judge whether the frequency domain loudness is concentrated around the average loudness. For example, the condition is that the loudness variance is less than a first threshold and the difference between the loudness expectation and the average loudness is less than a second threshold. The values of the first and second thresholds are determined according to actual needs.
The maximum loudness, minimum loudness, average loudness, loudness expectation, and loudness variance of the frequency domain loudness of each time window are computed to judge the distribution of the loudness in that time window. If the distribution is concentrated around the average loudness, the loudness values below the average are pulled down, thereby highlighting the high-loudness parts of the current time window.
Optionally, the average loudness, loudness expectation, and loudness variance are computed from the frequency domain loudness; when the loudness variance is less than the first threshold and the difference between the loudness expectation and the average loudness is less than the second threshold, the loudness values below the average loudness are lowered.
The values below the average loudness can be lowered by subtracting a fixed value, multiplying by a coefficient, subtracting a gradient value, and so on. For example, all loudness values below the average are multiplied by 0.5.
For example, part (1) of Fig. 4 shows the frequency domain loudness of the first time window of the first frequency domain loudness set, whose loudness is concentrated around the average. The loudness values below the average are lowered to obtain part (2) of Fig. 4, which highlights the high-loudness parts and better matches human hearing.
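A hedged sketch of this concentration check follows. The text distinguishes an "average loudness" from a "loudness expectation" without specifying how the expectation is weighted, so the loudness-weighted mean below is an assumption, as are the threshold parameters and the 0.5 factor (the latter taken from the example above).

```python
import numpy as np

def suppress_below_average(loudness, var_threshold, diff_threshold,
                           factor=0.5):
    loudness = np.asarray(loudness, dtype=float)
    avg = loudness.mean()
    # Assumed weighting for the "expectation"; not specified in the text.
    weights = loudness / loudness.sum()
    expectation = float((weights * loudness).sum())
    concentrated = (loudness.var() < var_threshold
                    and abs(expectation - avg) < diff_threshold)
    if not concentrated:
        return loudness
    out = loudness.copy()
    out[out < avg] *= factor  # e.g. multiply below-average bins by 0.5
    return out
```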
Step 230: use a fade-in and fade-out function to perform head and tail windowing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
Optionally, a fade-in function is used to window the head of the frequency domain loudness of each time window in the first frequency domain loudness set, the head being the part whose frequency is below a first frequency threshold; and a fade-out function is used to window the tail, the tail being the part whose frequency is above a second frequency threshold.
The size and position of the head and tail windows can be set arbitrarily. For example, they can be set from the value range of the horizontal axis: the head runs from the minimum of the range to the first frequency threshold, and the tail from the second frequency threshold to the maximum of the range. They can also be determined from the frequency/Mel range in which the loudness is greater than 0. For example, if the horizontal axis ranges from 0 to 100 but the loudness is 0 over 0 to 10, 1 at 10, 1 at 90, and 0 over 90 to 100, then the frequency/Mel range with loudness greater than 0 is 10 to 90; the head runs from the minimum of that range to the first frequency threshold, and the tail from the second frequency threshold to the maximum of that range. The head and tail window lengths may be the same or different.
For example, if the frequency range of the frequency domain loudness is 0 to 20000 Hz, the head may be 0 to 100 Hz and the tail 19900 to 20000 Hz. Or, if the Mel range is 0 to 100, the head may be 0 to 10 and the tail 90 to 100.
Optionally, the fade-in function is the 0-to-π/2 portion of the sine function, and the fade-out function is the 0-to-π/2 portion of the cosine function.
The fade-in function is scaled to the same horizontal window length as the head and multiplied with the head of the frequency domain loudness to obtain the windowed head. The fade-out function is scaled to the same horizontal window length as the tail and multiplied with the tail to obtain the windowed tail.
For example, part (1) of Fig. 5 shows the frequency domain loudness of the first time window, with frequency/Mel on the horizontal axis and loudness on the vertical axis. The fade-in function windows the head and the fade-out function windows the tail, producing the frequency domain loudness shown in part (2) of Fig. 5.
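With the sine/cosine quarter-wave curves just described, the fades plug directly into the head/tail sketch given under step 230 above; resampling to the window length is done with np.linspace. The curve helper names are invented for illustration.

```python
import numpy as np

def fade_in_curve(n):
    # Sine over [0, pi/2]: rises from 0 to 1 across the head window.
    return np.sin(np.linspace(0.0, np.pi / 2.0, n))

def fade_out_curve(n):
    # Cosine over [0, pi/2]: falls from 1 to 0 across the tail window.
    return np.cos(np.linspace(0.0, np.pi / 2.0, n))

# Usage with the earlier sketch, e.g. a 0-10 head and 90-100 tail:
# windowed = window_head_tail(loudness, 10, 10, fade_in_curve, fade_out_curve)
```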
Step 231: perform intra-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
Optionally, a polynomial smoothing algorithm is used to smooth the frequency domain loudness within each time window. For example, if a window contains 10 loudness values over 0 to 100 Hz, the j-th loudness value is smoothed using the (j-2)-th, (j-1)-th, (j+1)-th, and (j+2)-th loudness values, and each loudness value in the time window is smoothed in turn in this way, where j is an integer greater than 2.
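One plausible realization of this five-point polynomial smoothing is a Savitzky-Golay filter, which fits a local polynomial over each (j-2 ... j+2) neighborhood; whether the patent intends exactly this filter is an assumption, and the demo array is a placeholder.

```python
import numpy as np
from scipy.signal import savgol_filter

loudness = np.random.rand(128)  # one time window's loudness, for demo
# window_length=5 covers the j-2 .. j+2 neighborhood described above.
smoothed = savgol_filter(loudness, window_length=5, polyorder=2)
```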
Step 232: perform inter-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
Optionally, a sliding weighted smoothing filter is used to smooth the frequency domain loudness of the i-th time window based on the frequency domain loudness of the (i-1)-th, i-th, and (i+1)-th time windows, where i is a positive integer.
If the loudness f3 of the (i+1)-th time window at the x-th frequency is less than the loudness f2 of the i-th time window at the x-th frequency, the smoothed loudness of the i-th time window at the x-th frequency is c3 = f3·a + (1-a)·(1-a)·f2 + (1-a)·a·f1, where a is the fall coefficient and f1 is the loudness of the (i-1)-th time window at the x-th frequency.
If the loudness f3 of the (i+1)-th time window at the x-th frequency is greater than or equal to the loudness f2 of the i-th time window at the x-th frequency, the smoothed loudness of the i-th time window at the x-th frequency is c3 = f3·b + (1-b)·b·f2 + (1-b)·(1-b)·f1, where b is the rise coefficient.
The fall coefficient a and the rise coefficient b can be set as needed; both take values between 0 and 1.
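The two recurrences transcribe directly to code, applied element-wise across frequencies; the coefficient values below are placeholders, since the text only constrains a and b to lie between 0 and 1.

```python
import numpy as np

def smooth_between_windows(f1, f2, f3, a=0.3, b=0.6):
    # f1, f2, f3: loudness arrays of windows i-1, i, i+1 over the same
    # frequencies; a is the fall coefficient, b the rise coefficient.
    f1, f2, f3 = (np.asarray(f, dtype=float) for f in (f1, f2, f3))
    falling = f3 < f2
    c_fall = f3 * a + (1 - a) * (1 - a) * f2 + (1 - a) * a * f1
    c_rise = f3 * b + (1 - b) * b * f2 + (1 - b) * (1 - b) * f1
    return np.where(falling, c_fall, c_rise)
```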
Step 233: generate a playback display effect of the audio data based on the second frequency domain loudness set, the playback display effect including at least one of a spectrogram of the audio data, a background-image playback effect, a lyric playback effect, a musical fountain effect, and a lighting effect.
Optionally, in the second frequency domain loudness set obtained after the above processing, the frequency domain loudness is spectrum data that matches human hearing. The playback display effects shown while the audio data is played can therefore be generated from the frequency domain loudness in the second frequency domain loudness set.
For example, a spectrogram that moves with the music can be generated. Or the background-image playback effect can be controlled according to the audio loudness, for example, controlling image display duration, image zoom, and image playback speed. Or the lighting effect can be controlled according to the frequency domain loudness. These playback display effects give users a visual music playback experience, improve the consistency between the music heard and the displayed effects, and improve the user's music-viewing experience.
Optionally, the above audio processing steps can be arbitrarily omitted, reordered, or combined to obtain new embodiments.
In summary, in the method provided by this embodiment, short-time Fourier transform on the audio data yields the frequency domain data of each time window, which indicates the amplitude distribution of the audio data over the frequencies within that time window. Loudness is then computed from the amplitudes, yielding the frequency domain loudness of each time window, which represents how the human ear perceives the loudness of the sound waves at each frequency within the current time window. Further, A-weighting filtering and Mel-scale conversion are applied to the frequency domain loudness. Then, for time windows whose loudness is concentrated around the average, the loudness values below the average are lowered to highlight the high-loudness parts. Fade-in and fade-out windowing is then applied to the head and tail of the loudness of each time window, so that the loudness values gradually decrease from the middle toward both ends. Finally, intra-frame and inter-frame smoothing make the visual presentation of the loudness values smoother. The frequency domain loudness of each time window computed in this way better matches the loudness distribution the ear hears during playback, and display effects produced from it come closer to the auditory experience, improving the match between what users see and hear and their viewing experience.
The following are apparatus embodiments of the present application. For details not described in the apparatus embodiments, refer to the corresponding descriptions in the method embodiments above, which are not repeated here.
Fig. 6 shows a schematic structural diagram of an audio processing apparatus provided by an exemplary embodiment of the present application. The apparatus can be implemented as all or part of a computer device through software, hardware, or a combination of the two, and includes:
a processing module 401, configured to perform short-time Fourier transform on audio data to obtain a frequency domain data set, where each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data includes frequencies and the amplitudes corresponding to the frequencies;
a loudness module 402, configured to compute loudness from the amplitudes in the frequency domain data set to obtain a first frequency domain loudness set, where each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness includes frequencies and the loudness corresponding to the frequencies;
a windowing module 403, configured to use a fade-in and fade-out function to perform head and tail windowing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
In an optional embodiment, the windowing module 403 is configured to use a fade-in function to window the head of the frequency domain loudness of each time window in the first frequency domain loudness set, the head being the part whose frequency is below a first frequency threshold;
the windowing module 403 is configured to use a fade-out function to window the tail of the frequency domain loudness of each time window in the first frequency domain loudness set, the tail being the part whose frequency is above a second frequency threshold.
In an optional embodiment, the fade-in function is the 0-to-π/2 portion of the sine function;
the fade-out function is the 0-to-π/2 portion of the cosine function.
In an optional embodiment, the apparatus further includes:
a lowering module 406, configured to lower the loudness values in the frequency domain loudness that are below the average loudness when the distribution of the frequency domain loudness satisfies a concentrated distribution condition.
In an optional embodiment, the lowering module 406 is configured to compute the average loudness, the loudness expectation, and the loudness variance from the frequency domain loudness;
the lowering module 406 is configured to lower the loudness values in the frequency domain loudness that are below the average loudness when the loudness variance is less than a first threshold and the difference between the loudness expectation and the average loudness is less than a second threshold.
In an optional embodiment, each time window in the second frequency domain loudness set corresponds to a set of frequency domain loudness; the apparatus further includes:
a smoothing module 405, configured to perform intra-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
In an optional embodiment, each time window in the second frequency domain loudness set corresponds to a set of frequency domain loudness; the apparatus further includes:
a smoothing module 405, configured to perform inter-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
In an optional embodiment, the smoothing module 405 is configured to use a sliding weighted smoothing filter to smooth the frequency domain loudness of the i-th time window based on the frequency domain loudness of the (i-1)-th, i-th, and (i+1)-th time windows, where i is a positive integer.
In an optional embodiment, the processing module 401 is configured to perform A-weighting filtering and Mel-scale conversion on the first frequency domain loudness set.
In an optional embodiment, the apparatus further includes:
a display module 404, configured to generate a playback display effect of the audio data based on the second frequency domain loudness set, the playback display effect including at least one of a spectrogram of the audio data, a background-image playback effect, a lyric playback effect, and a musical fountain effect.
Fig. 7 is a schematic structural diagram of a server provided by an embodiment of the present application. Specifically, the server 800 includes a central processing unit (CPU) 801, a system memory 804 including a random access memory (RAM) 802 and a read-only memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806 that helps transfer information between devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or keyboard, for user input. The display 808 and the input device 809 are both connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from multiple other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, printer, or other types of output devices.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include computer-readable media (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, computer-readable media may include computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state storage technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, cassettes, magnetic tape, disk storage, or other magnetic storage devices. Of course, those skilled in the art will know that computer storage media are not limited to the above. The system memory 804 and the mass storage device 807 above may be collectively referred to as memory.
According to various embodiments of the present application, the server 800 may also be run by connecting to a remote computer on a network such as the Internet. That is, the server 800 may be connected to a network 812 through a network interface unit 811 connected to the system bus 805, or the network interface unit 811 may be used to connect to other types of networks or remote computer systems (not shown).
The present application also provides a terminal, which includes a processor and a memory. The memory stores at least one instruction, which is loaded and executed by the processor to implement the audio processing method provided by the above method embodiments. It should be noted that the terminal may be the terminal provided in Fig. 8 below.
Fig. 8 shows a structural block diagram of a terminal 900 provided by an exemplary embodiment of the present application. The terminal 900 may be a smartphone, a tablet, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, or a desktop computer. The terminal 900 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
Generally, the terminal 900 includes a processor 901 and a memory 902.
The processor 901 may include one or more processing cores, such as a 4-core or 8-core processor. The processor 901 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 901 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 901 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 902 is used to store at least one instruction, which is executed by the processor 901 to implement the audio processing method provided by the method embodiments of the present application.
In some embodiments, the terminal 900 may optionally further include a peripheral interface 903 and at least one peripheral. The processor 901, the memory 902, and the peripheral interface 903 may be connected by a bus or signal line, and each peripheral may be connected to the peripheral interface 903 by a bus, signal line, or circuit board. Specifically, the peripherals include at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning component 908, and a power supply 909.
The peripheral interface 903 can be used to connect at least one I/O (Input/Output) related peripheral to the processor 901 and the memory 902. In some embodiments, the processor 901, the memory 902, and the peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of them may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 904 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 904 communicates with communication networks and other communication devices through electromagnetic signals, converting electrical signals into electromagnetic signals for transmission or converting received electromagnetic signals into electrical signals. Exemplarily, the radio frequency circuit 904 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 904 can communicate with other terminals through at least one wireless communication protocol, including but not limited to the World Wide Web, metropolitan area networks, intranets, the generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which is not limited in the present application.
The display screen 905 is used to display a UI (User Interface), which may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch screen, it can also collect touch signals on or above its surface, which may be input to the processor 901 as control signals for processing. In this case, the display screen 905 can also provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, arranged on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, arranged on different surfaces of the terminal 900 or in a folded design; in still other embodiments, the display screen 905 may be a flexible display arranged on a curved or folded surface of the terminal 900. The display screen 905 may even be set into a non-rectangular irregular shape, that is, a shaped screen, and may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
The camera assembly 906 is used to capture images or video. Exemplarily, the camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is arranged on the front panel of the terminal and the rear camera on the back. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so as to fuse the main camera and the depth-of-field camera for a background blur function, fuse the main camera and the wide-angle camera for panoramic and VR (Virtual Reality) shooting, or provide other fused shooting functions. In some embodiments, the camera assembly 906 may also include a flash, which may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 907 may include a microphone and a speaker. The microphone collects sound waves from the user and the environment and converts them into electrical signals that are input to the processor 901 for processing or to the radio frequency circuit 904 for voice communication. For stereo collection or noise reduction, there may be multiple microphones, arranged at different parts of the terminal 900; the microphone may also be an array microphone or an omnidirectional microphone. The speaker converts electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a traditional film speaker or a piezoelectric ceramic speaker; a piezoelectric ceramic speaker can convert electrical signals not only into sound waves audible to humans but also into sound waves inaudible to humans for purposes such as ranging. In some embodiments, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic position of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the GPS (Global Positioning System) of the United States, China's BeiDou system, or the Galileo system.
The power supply 909 supplies power to the components of the terminal 900. The power supply 909 may be alternating current, direct current, disposable batteries, or rechargeable batteries. When the power supply 909 includes a rechargeable battery, it may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charging technology.
In some embodiments, the terminal 900 also includes one or more sensors 910, including but not limited to: an acceleration sensor 911, a gyroscope sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration along the three axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 can detect the components of gravitational acceleration along the three axes. The processor 901 can control the display screen 905 to display the user interface in landscape or portrait view based on the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 can also collect motion data for games or for the user.
The gyroscope sensor 912 can detect the body orientation and rotation angle of the terminal 900 and can cooperate with the acceleration sensor 911 to collect the user's 3D actions on the terminal 900. Based on the data collected by the gyroscope sensor 912, the processor 901 can implement functions such as motion sensing (for example, changing the UI according to the user's tilt operation), image stabilization during shooting, game control, and inertial navigation.
The pressure sensor 913 may be arranged on the side frame of the terminal 900 and/or under the display screen 905. When arranged on the side frame, it can detect the user's grip signal on the terminal 900, and the processor 901 performs left/right-hand recognition or shortcut operations based on the grip signal collected by the pressure sensor 913. When arranged under the display screen 905, the processor 901 controls the operable controls on the UI according to the user's pressure operations on the display screen 905. The operable controls include at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 914 collects the user's fingerprint; either the processor 901 identifies the user's identity from the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user's identity from the collected fingerprint. When the identity is identified as trusted, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, making payments, changing settings, and so on. The fingerprint sensor 914 may be arranged on the front, back, or side of the terminal 900; when the terminal 900 has a physical button or manufacturer logo, the fingerprint sensor 914 may be integrated with it.
The optical sensor 915 collects ambient light intensity. In one embodiment, the processor 901 controls the display brightness of the display screen 905 according to the ambient light intensity collected by the optical sensor 915: when the ambient light is strong, the display brightness is increased; when it is weak, the display brightness is decreased. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
The proximity sensor 916, also called a distance sensor, is generally arranged on the front panel of the terminal 900 and collects the distance between the user and the front of the terminal 900. In one embodiment, when the proximity sensor 916 detects that this distance gradually decreases, the processor 901 controls the display screen 905 to switch from the on state to the off state; when it detects that the distance gradually increases, the processor 901 controls the display screen 905 to switch from the off state to the on state.
Those skilled in the art can understand that the structure shown in Fig. 8 does not constitute a limitation on the terminal 900, which may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
The memory further includes one or more programs, which are stored in the memory and contain instructions for performing the audio processing method provided in the embodiments of the present application.
The present application also provides a computer device, which includes a processor and a memory, where the memory stores at least one instruction, at least one program, a code set or an instruction set that is loaded and executed by the processor to implement the audio processing method provided by the above method embodiments.
The present application also provides a computer-readable storage medium, which stores at least one instruction, at least one program, a code set or an instruction set that is loaded and executed by a processor to implement the audio processing method provided by the above method embodiments.
The present application also provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the audio processing method provided in the above optional implementations.
It should be understood that "multiple" mentioned herein means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the associated objects before and after it.
Those of ordinary skill in the art can understand that all or part of the steps of the above embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only optional embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall be included in the protection scope of the present application.

Claims (13)

  1. An audio processing method, characterized in that the method comprises:
    performing short-time Fourier transform on audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data comprises frequencies and the amplitudes corresponding to the frequencies;
    computing loudness from the amplitudes in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness comprises frequencies and the loudness corresponding to the frequencies;
    using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
  2. The method according to claim 1, characterized in that using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set comprises:
    using a fade-in function to perform windowing on the head of the frequency domain loudness of each time window in the first frequency domain loudness set, the head being the part whose frequency is below a first frequency threshold;
    using a fade-out function to perform windowing on the tail of the frequency domain loudness of each time window in the first frequency domain loudness set, the tail being the part whose frequency is above a second frequency threshold.
  3. The method according to claim 2, characterized in that the fade-in function is the 0-to-π/2 portion of the sine function;
    the fade-out function is the 0-to-π/2 portion of the cosine function.
  4. The method according to any one of claims 1 to 3, characterized in that before using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set, the method further comprises:
    lowering the loudness values in the frequency domain loudness that are below the average loudness when the distribution of the frequency domain loudness satisfies a concentrated distribution condition.
  5. The method according to claim 4, characterized in that lowering the loudness values in the frequency domain loudness that are below the average loudness when the distribution of the frequency domain loudness satisfies a concentrated distribution condition comprises:
    computing the average loudness, the loudness expectation, and the loudness variance from the frequency domain loudness;
    lowering the loudness values in the frequency domain loudness that are below the average loudness when the loudness variance is less than a first threshold and the difference between the loudness expectation and the average loudness is less than a second threshold.
  6. The method according to any one of claims 1 to 3, characterized in that each time window in the second frequency domain loudness set corresponds to a set of frequency domain loudness;
    after using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set, the method further comprises:
    performing intra-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
  7. The method according to any one of claims 1 to 3, characterized in that each time window in the second frequency domain loudness set corresponds to a set of frequency domain loudness;
    after using a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set, the method further comprises:
    performing inter-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set.
  8. The method according to claim 7, characterized in that performing inter-window data smoothing on the frequency domain loudness of each time window in the second frequency domain loudness set comprises:
    using a sliding weighted smoothing filter algorithm to smooth the frequency domain loudness of the i-th time window based on the frequency domain loudness of the (i-1)-th, i-th, and (i+1)-th time windows, where i is a positive integer.
  9. The method according to claim 8, characterized in that after computing loudness from the amplitudes in the frequency domain data set to obtain the first frequency domain loudness set, the method further comprises:
    performing A-weighting filtering and Mel-scale conversion on the first frequency domain loudness set.
  10. The method according to any one of claims 1 to 3, characterized in that the method further comprises:
    generating a playback display effect of the audio data based on the second frequency domain loudness set, the playback display effect comprising at least one of a spectrogram of the audio data, a background-image playback effect, a lyric playback effect, a musical fountain effect, and a lighting effect.
  11. An audio processing apparatus, characterized in that the apparatus comprises:
    a processing module, configured to perform short-time Fourier transform on audio data to obtain a frequency domain data set, wherein each time window in the frequency domain data set corresponds to a set of frequency domain data, and the frequency domain data comprises frequencies and the amplitudes corresponding to the frequencies;
    a loudness module, configured to compute loudness from the amplitudes in the frequency domain data set to obtain a first frequency domain loudness set, wherein each time window in the first frequency domain loudness set corresponds to a set of frequency domain loudness, and the frequency domain loudness comprises frequencies and the loudness corresponding to the frequencies;
    a windowing module, configured to use a fade-in and fade-out function to perform head and tail windowing processing on the frequency domain loudness of each time window in the first frequency domain loudness set to obtain a second frequency domain loudness set.
  12. A computer device, characterized in that the computer device comprises a processor and a memory, the memory storing at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by the processor to implement the audio processing method according to any one of claims 1 to 10.
  13. A computer-readable storage medium, characterized in that the storage medium stores at least one instruction, at least one program, a code set or an instruction set, which is loaded and executed by a processor to implement the audio processing method according to any one of claims 1 to 10.
PCT/CN2022/124432 2022-10-10 2022-10-10 Audio processing method, apparatus, device and storage medium WO2024077452A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/124432 WO2024077452A1 (zh) 2022-10-10 2022-10-10 Audio processing method, apparatus, device and storage medium
CN202280003707.0A CN115956270A (zh) 2022-10-10 2022-10-10 Audio processing method, apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/124432 WO2024077452A1 (zh) 2022-10-10 2022-10-10 Audio processing method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2024077452A1 true WO2024077452A1 (zh) 2024-04-18

Family

ID=87283056

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/124432 WO2024077452A1 (zh) 2022-10-10 2022-10-10 Audio processing method, apparatus, device and storage medium

Country Status (2)

Country Link
CN (1) CN115956270A (zh)
WO (1) WO2024077452A1 (zh)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101645268A (zh) * 2009-08-19 2010-02-10 Li Song Computer real-time analysis system for singing and performance
CN103971689A (zh) * 2013-02-04 2014-08-06 Tencent Technology (Shenzhen) Co., Ltd. Audio recognition method and apparatus
US20150205569A1 (en) * 2007-10-23 2015-07-23 Adobe Systems Incorporated Automatically correcting audio data
US20150221321A1 (en) * 2014-02-06 2015-08-06 OtoSense, Inc. Systems and methods for identifying a sound event
CN105118523A (zh) * 2015-07-13 2015-12-02 Nubia Technology Co., Ltd. Audio processing method and apparatus
CN106068008A (zh) * 2016-08-15 2016-11-02 GoerTek Technology Co., Ltd. Distortion test method for audio playback devices
US20180114537A1 (en) * 2016-02-16 2018-04-26 Red Pill VR, Inc. Real-time adaptive audio source separation
US20210191687A1 (en) * 2019-12-23 2021-06-24 Dolby Laboratories Licensing Corporation Inter-channel audio feature measurement and usages
CN113593602A (zh) * 2021-07-19 2021-11-02 Shenzhen Thunderbird Network Media Co., Ltd. Audio processing method and apparatus, electronic device and storage medium
US20210343309A1 (en) * 2020-05-01 2021-11-04 Systèmes De Contrôle Actif Soft Db Inc. System and a method for sound recognition
CN114566172A (zh) * 2022-02-25 2022-05-31 Beijing Kanshi High-Tech Co., Ltd. Audio data processing method and apparatus, storage medium and electronic device

Also Published As

Publication number Publication date
CN115956270A (zh) 2023-04-11

Similar Documents

Publication Publication Date Title
CN107967706B (zh) Multimedia data processing method and apparatus, and computer-readable storage medium
WO2021008055A1 (zh) Video synthesis method and apparatus, terminal, and storage medium
CN111326132B (zh) Audio processing method and apparatus, storage medium, and electronic device
CN109547848B (zh) Loudness adjustment method and apparatus, electronic device, and storage medium
CN110764730A (zh) Method and apparatus for playing audio data
US11482237B2 (en) Method and terminal for reconstructing speech signal, and computer storage medium
CN108335703B (zh) Method and apparatus for determining accent positions in audio data
CN110618805B (zh) Method and apparatus for adjusting device power, electronic device, and medium
CN110688082B (zh) Method, apparatus, device, and storage medium for determining volume adjustment ratio information
US20220174206A1 (en) Camera movement control method and apparatus, device, and storage medium
CN109524016B (zh) Audio processing method and apparatus, electronic device, and storage medium
CN110708630B (zh) Method, apparatus, device, and storage medium for controlling earphones
WO2021139535A1 (zh) Method, apparatus, system, device, and storage medium for playing audio
CN110139143B (zh) Virtual item display method and apparatus, computer device, and storage medium
CN110931048A (zh) Speech endpoint detection method and apparatus, computer device, and storage medium
CN109243479B (zh) Audio signal processing method and apparatus, electronic device, and storage medium
WO2019237667A1 (zh) Method and apparatus for playing audio data
CN109065068B (zh) Audio processing method and apparatus, and storage medium
US20230014836A1 (en) Method for chorus mixing, apparatus, electronic device and storage medium
CN113963707A (zh) Audio processing method, apparatus, device, and storage medium
CN112133332B (zh) Method, apparatus, and device for playing audio
CN111984222A (zh) Method and apparatus for adjusting volume, electronic device, and readable storage medium
CN109360582B (zh) Audio processing method and apparatus, and storage medium
WO2024077452A1 (zh) Audio processing method, apparatus, device and storage medium
CN110808021A (zh) Audio playback method and apparatus, terminal, and storage medium