CN117809660A - Terminal equipment and voice print feature-based audio processing method - Google Patents
- Publication number
- CN117809660A (application number CN202310939391.2A)
- Authority
- CN
- China
- Prior art keywords
- audio
- voiceprint
- target
- initial
- domain signal
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
The application provides a terminal device and an audio processing method based on voiceprint features. In the method, the terminal device acquires initial audio and a target voiceprint feature, where the target voiceprint feature is the voiceprint feature used to generate the target audio data; the initial audio is separated by a voiceprint encoder to obtain a frequency domain signal, which is sent to a voiceprint decoder; the voiceprint decoder restores the frequency domain signal to a time domain signal; and finally, target audio with the target voiceprint feature is generated based on the time domain signal and the target voiceprint feature. When audio processing is performed with this method, no recognition of text content is required: the target audio with the target voiceprint feature can be output directly from the target voiceprint feature, which reduces the training time of the encoder, the decoder and the like and removes the dependence on the integrity of the initial audio, thereby solving the problems of an inefficient audio processing process and limited applicable scenarios.
Description
Technical Field
The application relates to the technical field of voice conversion algorithms, in particular to a terminal device and an audio processing method based on voiceprint features.
Background
Terminal devices such as smart televisions and mobile phones can output media such as video and audio. For example, a terminal device may be provided with song-recording software or entertainment software, through which audio processing such as voiceprint-based voice changing can be realized, where a voiceprint is a distinctive acoustic feature formed between the fundamental and the harmonics of the speech in a user's audio.
In some embodiments, audio processing may be achieved by deep learning on audio. Audio data can be classified into parallel data and non-parallel data according to its data type. In the process of performing audio processing on parallel data, taking voice-changing processing as an example, given a large amount of the speaker's initial audio and of audio-processed target audio, vectors related only to the text content can be generated by labelling the words in the audio, and matching of the audio content in the parallel data is completed through these vectors. The end result is that, regardless of what the speaker says, target audio containing the same text content can eventually be output; however, this approach is premised on having a sufficient amount of initial audio and target audio. When audio processing is performed on non-parallel data, the text content in the initial audio, such as "the weather is good today", can be extracted by a text content encoder, and the target language features of the voice-changed target audio can then be extracted by a language feature encoder; these are two separate processes. After the separation is completed, the text content and the target language features can be re-encoded by a decoder to finally generate the audio-processed target audio.
However, in the above processing, the text content encoder, the language feature encoder and the decoder all need to be trained in order to obtain accurately processed target audio. When audio processing is performed on parallel or non-parallel data in the manner described above, both the acquisition of the initial audio data and the training of the text content encoder, the language feature encoder and the decoder are very difficult and time consuming, resulting in an inefficient audio processing process. Moreover, the above processing method is only suitable for scenarios in which the speaker's initial audio is complete audio, not for incomplete audio, so the applicable scenarios are limited.
Disclosure of Invention
Some embodiments of the present application provide a terminal device and an audio processing method based on voiceprint features, so as to solve the problems of low efficiency and limited applicable scenarios in the audio processing process.
In a first aspect, some embodiments of the present application provide a terminal device, including:
a voiceprint encoder, configured to separate initial voiceprint features in initial audio;
a voiceprint decoder, configured to output target audio with target voiceprint features;
a controller, configured to:
Acquiring initial audio and acquiring target voiceprint features, wherein the target voiceprint features are voiceprint features for generating target audio data;
separating the initial audio by a voiceprint encoder to obtain a frequency domain signal, and transmitting the frequency domain signal to a voiceprint decoder;
restoring the frequency domain signal to a time domain signal by the voiceprint decoder;
generating target audio with the target voiceprint feature based on the time domain signal and the target voiceprint feature.
In some embodiments, the controller performs the step of separating the initial audio by a voiceprint encoder resulting in a frequency domain signal, further configured to:
extracting the speech signal in the initial audio to obtain an initial speech signal;
performing high-frequency enhancement preprocessing, framing and windowing, and a discrete Fourier transform on the initial speech signal in sequence, so as to convert the initial speech signal into a frequency domain signal.
In some embodiments, the controller is further configured to:
filtering the frequency domain signal using a mel filter bank;
taking the logarithm of the filtered frequency domain signal to obtain logarithmic energy;
and performing a cosine transform on the logarithmic energy to obtain Mel cepstral coefficients, so as to perform deep learning on the initial audio through the Mel cepstral coefficients.
In some embodiments, the controller performs the step of restoring the frequency domain signal to a time domain signal by the voiceprint decoder, further configured to:
receiving the frequency domain signal by the voiceprint decoder;
performing frequency shift on the frequency domain signal according to the target voiceprint characteristics to obtain a fitting frequency signal;
performing amplitude matching on the fitted frequency signal to generate the timbre and tone of the target voiceprint feature;
performing an inverse discrete Fourier transform on the timbre and the tone, and performing time-domain resampling on the timbre and the tone after the inverse discrete Fourier transform;
performing frame stitching on the time-domain resampled timbre and tone to restore the frequency domain signal to a time domain signal.
In some embodiments, the controller performs the step of frequency shifting the frequency domain signal according to the target voiceprint feature to obtain a fitted frequency signal, further configured to:
acquiring a target fundamental frequency of the target voiceprint feature;
and performing frequency shift matching on the frequency domain signal based on the target fundamental frequency to fit a fitted frequency signal with the same amplitude as the frequency domain signal and the same frequency as the target fundamental frequency.
In some embodiments, the controller performs an amplitude matching on the fitted frequency signal, generating timbres and tones of the target voiceprint feature, further configured to:
acquiring a first harmonic in the fitted frequency signal;
acquiring a second harmonic in the target voiceprint feature and the amplitude of the second harmonic relative to the target fundamental frequency;
performing a weighted calculation on the first harmonic based on the second harmonic and the amplitude to generate the timbre and tone of the target voiceprint feature.
In some embodiments, the controller is further configured to:
acquiring the amplitude values of the initial audio under different harmonics;
performing weighted calculation on the amplitude by fitting a nonlinear function to perform training on the initial audio so as to obtain a classification result of the initial audio;
and outputting the classification result.
In some embodiments, the controller is further configured to:
detecting an audio category of the initial audio, wherein the audio category comprises voice audio and interference audio;
if the audio category is the interference audio, setting the terminal device to a standby state;
if the audio category is the human voice audio, starting the voiceprint encoder, and uploading an initial voiceprint feature set in the initial audio, wherein the initial voiceprint feature set is a set of voiceprint features in the initial audio;
performing deep learning on the initial voiceprint feature set, and performing classification preservation on the learning results of the deep learning.
In some embodiments, the controller performs the step of performing deep learning on the initial voiceprint feature set and performing classification preservation on a learning result of the deep learning, further configured to:
analyzing the initial voiceprint feature set to obtain initial voiceprint features in the initial audio;
labeling the initial voiceprint features by a deep learning algorithm;
classifying the marked initial voiceprint features to obtain a classified learning result;
and storing the learning result according to the classified category.
In a second aspect, some embodiments of the present application provide a voiceprint feature-based audio processing method, which may be applied to the terminal device of the first aspect, where the terminal device includes a voiceprint encoder, a voiceprint decoder, and a controller, and the voiceprint feature-based audio processing method includes:
acquiring initial audio and acquiring target voiceprint features, wherein the target voiceprint features are voiceprint features for generating target audio data;
separating the initial audio by a voiceprint encoder to obtain a frequency domain signal, and transmitting the frequency domain signal to a voiceprint decoder;
restoring the frequency domain signal to a time domain signal by the voiceprint decoder;
generating target audio with the target voiceprint feature based on the time domain signal and the target voiceprint feature.
As can be seen from the above technical solutions, some embodiments of the present application provide a terminal device and an audio processing method based on voiceprint features. The method includes: the terminal device acquires initial audio and a target voiceprint feature, where the target voiceprint feature is the voiceprint feature used to generate the target audio data; the initial audio is separated by a voiceprint encoder to obtain a frequency domain signal, which is sent to a voiceprint decoder; the voiceprint decoder restores the frequency domain signal to a time domain signal; and finally, target audio with the target voiceprint feature is generated based on the time domain signal and the target voiceprint feature. When audio processing is performed with this method, no recognition of text content is required: the target audio with the target voiceprint feature can be output directly from the target voiceprint feature, which reduces the training time of the encoder, the decoder and the like and removes the dependence on the integrity of the initial audio, thereby solving the problems of an inefficient audio processing process and limited applicable scenarios.
Drawings
In order to more clearly illustrate some embodiments of the present application or technical solutions in the prior art, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an operation scenario between a terminal device and a control device provided in some embodiments of the present application;
fig. 2 is a block diagram of a hardware configuration of a terminal device according to some embodiments of the present application;
FIG. 3 is a block diagram of a hardware configuration of a control device provided in some embodiments of the present application;
fig. 4 is a schematic diagram of software configuration in a terminal device according to some embodiments of the present application;
FIG. 5 is a schematic diagram of parallel data and non-parallel data provided in some embodiments of the present application;
FIG. 6 is a flow chart of audio processing performed on parallel data according to some embodiments of the present application;
FIG. 7 is a flow chart of performing audio processing on non-parallel data according to some embodiments of the present application;
FIG. 8 is a schematic diagram of a process for training a speech feature encoder via non-parallel data provided in some embodiments of the present application;
Fig. 9 is a flowchart of a method for executing audio processing by a terminal device based on voiceprint features according to some embodiments of the present application;
FIG. 10 is a schematic diagram of a process performed by a voiceprint encoder provided in some embodiments of the present application on initial audio;
FIG. 11 is a flowchart illustrating a method for recovering a frequency domain signal into a time domain signal by a voiceprint decoder according to some embodiments of the present application;
FIG. 12 is a schematic diagram of a process for recovering a frequency domain signal into a time domain signal by a voiceprint decoder according to some embodiments of the present application;
fig. 13 is a schematic view of a scenario in which a terminal device performs classification on voiceprint features according to some embodiments of the present application;
fig. 14 is a schematic flow chart of a terminal device performing deep learning on an initial voiceprint feature set according to some embodiments of the present disclosure;
fig. 15 is a flowchart of an audio processing method based on voiceprint features according to some embodiments of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of some embodiments of the present application more clear, the technical solutions of some embodiments of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application.
It should be noted that the brief description of the terms in some embodiments of the present application is only for convenience in understanding the embodiments described below, and is not intended to limit the implementation of some embodiments of the present application. Unless otherwise indicated, these terms should be construed in their ordinary and customary meaning.
The terms first, second, third and the like in the description, in the claims and in the above-described figures are used for distinguishing between similar objects or entities and are not necessarily intended to describe a particular sequence or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements explicitly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware or/and software code that is capable of performing the function associated with that element.
Fig. 1 is a schematic diagram of an operation scenario between a terminal device and a control device provided in some embodiments of the present application. As shown in fig. 1, a user may operate the terminal device 200 through the mobile terminal 300 and the control device 100.
In some embodiments, the control device 100 may be a remote controller, and the communication between the remote controller and the terminal device includes infrared protocol communication or bluetooth protocol communication, and other short-range communication modes, etc., to control the terminal device 200 in a wireless mode or other wired mode. The user can control the terminal device 200 by inputting user instructions through keys on a remote controller, voice input, control panel input, etc.
In some embodiments, the mobile terminal 300 may install a software application with the terminal device 200, and implement connection communication through a network communication protocol for the purpose of one-to-one control operation and data communication. The audio/video content displayed on the mobile terminal 300 can also be transmitted to the terminal device 200, so as to realize the synchronous display function.
As also shown in fig. 1, the terminal device 200 also exchanges data with the server 400 through a variety of communication means. The terminal device 200 may be allowed to establish communication connections through a local area network (LAN), a wireless local area network (WLAN), and other networks.
In addition to the broadcast-receiving television function, the terminal device 200 may further provide a smart network television function with computer support, including but not limited to a network television, a smart television, an Internet Protocol television (IPTV), and the like.
Fig. 2 is a block diagram of a hardware configuration of a terminal device according to some embodiments of the present application.
In some embodiments, terminal device 200 includes at least one of a modem 210, a communicator 220, a detector 230, an external device interface 240, a controller 250, a display 260, an audio output interface 270, memory, a power supply, a user interface.
In some embodiments, the detector 230 is used to collect signals of the external environment or interaction with the outside.
In some embodiments, the display 260 includes a display screen component for presenting pictures and a driving component for driving image display, and is used to receive image signals output by the controller and to display video content, image content, menu manipulation interface components, a user manipulation UI interface, and the like.
In some embodiments, communicator 220 is a component for communicating with external devices or servers according to various communication protocol types.
In some embodiments, the controller 250 controls the operation of the terminal device and responds to the user's operations by various software control programs stored on the memory. The controller 250 controls the overall operation of the terminal device 200.
In some embodiments, a user may input a user command through a Graphical User Interface (GUI) displayed on the display 260, and the user input interface receives the user input command through the Graphical User Interface (GUI).
In some embodiments, user interface 280 is an interface that may be used to receive control inputs.
Fig. 3 is a block diagram of a hardware configuration of a control device according to some embodiments of the present application. As shown in fig. 3, the control device 100 includes a controller 110, a communication interface 130, a user input/output interface, a memory, and a power supply.
The control device 100 is configured to control the terminal device 200; it can receive a user's input operation instruction and convert the operation instruction into an instruction that the terminal device 200 can recognize and respond to, acting as an intermediary for interaction between the user and the terminal device 200.
In some embodiments, the control device 100 may be a smart device. Such as: the control apparatus 100 may install various applications of the control terminal apparatus 200 according to user demands.
In some embodiments, as shown in fig. 1, a mobile terminal 300 or other intelligent electronic device may serve a similar function as the control device 100 after installing an application that manipulates the terminal device 200.
The controller 110 includes a processor 112, RAM 113, ROM 114, a communication interface 130, and a communication bus. The controller 110 is used to control the running and operation of the control device 100, the communication and cooperation among its internal components, and external and internal data processing functions.
The communication interface 130 enables communication of control signals and data signals with the terminal device 200 under the control of the controller 110. The communication interface 130 may include at least one of a WiFi chip 131, a bluetooth module 132, an NFC module 133, and other near field communication modules.
A user input/output interface 140, wherein the input interface includes at least one of a microphone 141, a touchpad 142, a sensor 143, keys 144, and other input interfaces.
In some embodiments, the control device 100 includes at least one of a communication interface 130 and an input-output interface 140. The control device 100 is provided with a communication interface 130, such as WiFi, Bluetooth or NFC modules, which may encode the user input instruction and send it to the terminal device 200 through the WiFi protocol, the Bluetooth protocol, or the NFC protocol.
A memory 190 for storing various operation programs, data and applications for driving and controlling the control device 100 under the control of the controller. The memory 190 may store various control signal instructions input by a user.
A power supply 180 for providing operating power support for the various elements of the control device 100 under the control of the controller.
Fig. 4 is a schematic diagram of the software configuration in a terminal device according to some embodiments of the present application. In some embodiments, the system is divided into four layers, from top to bottom: an application layer (Applications), an application framework layer (Application Framework), an Android runtime (Android Runtime) and system library layer (system runtime layer), and a kernel layer.
In some embodiments, at least one application program is running in the application program layer, and these application programs may be a Window (Window) program of an operating system, a system setting program, a clock program, a camera application, and the like; or may be an application developed by a third party developer.
The framework layer provides an application programming interface (Application Programming Interface, API) and a programming framework for the applications of the application layer. The application framework layer includes a number of predefined functions. The application framework layer acts as a processing center that decides the actions of the applications in the application layer.
As shown in fig. 4, the application framework layer in the embodiment of the present application includes a manager (manager), a Content Provider (Content Provider), a View System (View System), and the like.
In some embodiments, the activity manager is used to manage the lifecycle of the individual applications, as well as the usual navigation back functionality.
In some embodiments, a window manager is used to manage all window programs.
In some embodiments, the system runtime layer provides support for the layer above it, i.e., the framework layer. When the framework layer is used, the Android operating system runs the C/C++ libraries contained in the system runtime layer to implement the functions required by the framework layer.
In some embodiments, the kernel layer is a layer between hardware and software. As shown in fig. 4, the kernel layer contains at least one of the following drivers: audio drive, display drive, bluetooth drive, camera drive, WIFI drive, USB drive, HDMI drive, sensor drive (e.g., fingerprint sensor, temperature sensor, touch sensor, pressure sensor, etc.), and the like.
In some embodiments, the kernel layer further includes a power driver module for power management.
In some embodiments, the software programs and/or modules corresponding to the software architecture in fig. 4 are stored in the first memory or the second memory shown in fig. 2 or fig. 3.
Based on the above terminal device 200, media such as video, audio, etc. can be output. For example, the terminal device 200 may be provided with song recording software or entertainment software, and audio processing such as voice print sound changing can be implemented through the song recording software or the entertainment software, where the voice print is a special acoustic feature formed between a fundamental wave and a harmonic wave of a language feature in user audio.
In some embodiments, audio processing may be achieved by deep learning on audio. Audio data can be classified into parallel data and non-parallel data according to its data type. Fig. 5 is a schematic diagram of parallel data and non-parallel data provided in some embodiments of the present application. As shown in fig. 5, in parallel data, the speaker, i.e., the user, outputs a section of audio, for example "the weather is good today", and the corresponding audio-processed target audio contains the same text content, "the weather is good today". In non-parallel data, the audio output by the user is "the weather is good today", but the corresponding target audio is other audio with different content, such as "let's go climb the mountain". That is, in parallel data, the initial audio before audio processing and the modified target audio contain the same text content, whereas in non-parallel data they contain different text content. The following describes the process of implementing audio processing for parallel data and non-parallel data, respectively.
Fig. 6 is a schematic flow chart of audio processing performed on parallel data according to some embodiments of the present application. As shown in fig. 6, in some embodiments, when audio processing is performed on parallel data, the initial audio of the speaker may first be extracted to generate text; a search is then performed based on the generated text to produce a vector set 1 containing only the text, a search that presupposes a large amount of audio information from the speaker; next, a vector set 2 containing only the same text is matched according to vector set 1, which presupposes a large amount of audio-processed target audio; finally, the audio signal of the target audio is generated from vector set 2. Therefore, in the process of performing audio processing on parallel data, given a large amount of the speaker's initial audio and of audio-processed target audio, vectors related only to the text content can be generated by labelling the words in the audio, and matching of the audio content in the parallel data is completed through these vectors. The end result is that, regardless of what the speaker says, target audio containing the same text content can eventually be output; however, this approach is premised on having a sufficient amount of initial audio and target audio.
In an actual audio processing application scenario, most of the data that needs audio processing is non-parallel data. In some embodiments, feature separation may be used when performing audio processing on non-parallel data. Feature separation starts from a section of voice audio: the text content is extracted from the speaker's initial audio, the language features are extracted from the audio-processed target audio, the text content and the language features are re-encoded, and the audio processing is completed once the encoding is done.
Fig. 7 is a schematic flow chart of audio processing performed on non-parallel data according to some embodiments of the present application. As shown in fig. 7, when audio processing is performed on non-parallel data, the text content in the initial audio, such as "the weather is good today", may be extracted by a text content encoder, and the target language features of the voice-changed target audio may then be extracted by a language feature encoder; these are two separate processes. After the separation is completed, the text content and the target language features can be re-encoded by a decoder to finally generate the audio-processed target audio.
In the above processing, the text content encoder, the language feature encoder and the decoder all need to be trained in order to obtain accurately processed target audio. Fig. 8 is a schematic diagram of the process of training a speech feature encoder through non-parallel data according to some embodiments of the present application. As shown in fig. 8, after feature extraction and model training are performed on multiple initial audios of different speakers under different conditions, the voice, speech-rate and intonation features of multiple speakers can be obtained, and a model library carrying the voice, speech rate and intonation of different speakers is finally generated. The training process is in fact a fitting process, and its premise is that a sufficient amount of initial audio is available. Training of the decoder, in turn, can only begin after both the text content encoder and the language feature encoder have been trained.
It follows that the above-described approach to performing audio processing on parallel or non-parallel data, whether in terms of acquiring the initial audio data or of training the text content encoder, the language feature encoder and the decoder, is very difficult and time consuming, resulting in an inefficient audio processing process. Moreover, the above processing method is only suitable for scenarios in which the speaker's initial audio is complete audio, not for incomplete audio, so the applicable scenarios are limited.
To address the problems of inefficiency of the audio processing process and limited applicable scenarios, some embodiments of the present application provide a terminal device 200, in some embodiments, the terminal device 200 may include a voiceprint encoder 201, a voiceprint decoder 202, and a controller 250. Wherein the voiceprint encoder 201 is configured to perform a separation of initial voiceprint features in the speaker's initial audio, the voiceprint decoder 202 is configured to output target audio with target voiceprint features, and the controller 250 is configured to perform an audio processing method based on the voiceprint features. When the terminal device 200 performs audio processing, it does not rely on the identification of text content, and can directly output the target audio with target voiceprint features through the target voiceprint features, so as to reduce the training time of the encoder, decoder, etc., and not limited by the integrity of the initial audio, thereby solving the problems of low efficiency of audio processing and limited applicable scenes.
In order to facilitate understanding of the technical solutions in some embodiments of the present application, the following details of each step are described with reference to some specific embodiments and the accompanying drawings. Fig. 9 is a flowchart of a method for a terminal device to execute audio processing based on voiceprint features according to some embodiments of the present application, as shown in fig. 9, in some embodiments, when the terminal device 200 executes the audio processing method based on voiceprint features, the method may include the following steps S1 to S4, which are specifically as follows:
step S1: the terminal device 200 acquires initial audio and acquires target voiceprint features.
To implement the audio processing function for the initial audio, in some embodiments, the terminal device 200 may acquire the initial audio and acquire a target voiceprint feature, wherein the target voiceprint feature is a voiceprint feature used to generate the target audio data.
Illustratively, the initial audio is the unprocessed audio input by the speaker, i.e., the user, and the target audio is the processed audio carrying the target voiceprint feature. When performing audio processing, the target voiceprint feature of the target audio may be set first; it may include, for example, a sound quality feature, a timbre feature, a tone feature, and the like, so that the features to be contained in the target audio are known. After step S1 is completed, the following step S2 may be executed.
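The patent does not define a concrete data structure for the target voiceprint feature; the following Python sketch is purely illustrative of how such a feature could be represented for the processing steps described below, and the field names are assumptions rather than terms from the text.

```python
# Illustrative only: one possible container for a target voiceprint feature.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TargetVoiceprintFeature:
    target_f0: float                                             # target fundamental frequency in Hz
    harmonic_ratios: List[float] = field(default_factory=list)   # harmonic amplitudes relative to the fundamental (timbre)
    quality_tags: List[str] = field(default_factory=list)        # e.g. sound quality / tone labels
```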
Step S2: the terminal device 200 separates the initial audio by the voiceprint encoder 201 to obtain a frequency domain signal, and transmits the frequency domain signal to the voiceprint decoder 202.
After the initial audio is obtained, in some embodiments, the terminal device 200 may separate the initial audio by the voiceprint encoder 201 to obtain a frequency domain signal, and send the frequency domain signal to the voiceprint decoder 202. In some embodiments, the voiceprint encoder 201 may have the following functions: separating voiceprint features such as the voice signal in the initial audio for decoding by the voiceprint decoder 202 in the next step, and collecting, uploading and saving the separated voiceprint features as a data sample set of voiceprint features, so as to facilitate later classification of the audio. When separating the initial audio, the terminal device 200 may first extract the speech signal in the initial audio to obtain an initial speech signal, and then sequentially perform high-frequency enhancement preprocessing, framing and windowing, and a discrete Fourier transform on the initial speech signal to convert it into a frequency domain signal.
After the voice signal in the initial audio is separated, high-frequency enhancement preprocessing, framing and windowing, and a discrete Fourier transform can be performed on it in sequence. The high-frequency enhancement preprocessing strengthens the high-frequency part of the voice signal, framing and windowing segment the voice signal, and the discrete Fourier transform converts the time domain signal into a frequency domain signal. To ensure the integrity of the original audio data, in some embodiments, after the frequency domain signal is obtained, it may be sent to the voiceprint decoder 202 for use by the voiceprint decoder 202 in subsequently generating the target audio.
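As a rough illustration of the encoder-side chain just described (high-frequency enhancement, framing and windowing, discrete Fourier transform), the following NumPy sketch shows one conventional way these steps are often implemented; the pre-emphasis coefficient, frame length and hop length are assumptions, not values given in the patent.

```python
# Illustrative sketch, not the patent's implementation.
import numpy as np

def encode_to_frequency_domain(speech, sample_rate=16000,
                               frame_ms=25, hop_ms=10, pre_emphasis=0.97):
    # High-frequency enhancement (pre-emphasis): y[n] = x[n] - a * x[n-1]
    emphasized = np.append(speech[0], speech[1:] - pre_emphasis * speech[:-1])

    frame_len = int(sample_rate * frame_ms / 1000)   # 400 samples at 16 kHz / 25 ms
    hop_len = int(sample_rate * hop_ms / 1000)       # 160 samples at 16 kHz / 10 ms
    window = np.hamming(frame_len)

    # Framing and windowing: overlapping frames, each multiplied by a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len   # assumes len(speech) >= frame_len
    frames = np.stack([
        emphasized[i * hop_len: i * hop_len + frame_len] * window
        for i in range(n_frames)
    ])

    # Discrete Fourier transform of each frame (one-sided complex spectrum)
    return np.fft.rfft(frames, axis=1)
```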
In order to perform deep learning on the initial audio, in some embodiments, the terminal device 200 may also perform the following processing through the voiceprint encoder 201. Fig. 10 is a schematic diagram of the processing performed by the voiceprint encoder on the initial audio according to some embodiments of the present application. As shown in fig. 10, the terminal device 200 may further filter the frequency domain signal using a Mel filter bank, take the logarithm of the filtered frequency domain signal to obtain logarithmic energy, and perform a cosine transform on the logarithmic energy to obtain Mel cepstral coefficients, so as to perform deep learning on the initial audio through the Mel cepstral coefficients.
Illustratively, once the frequency domain signal has been filtered by the Mel filter bank, the original audio can no longer be completely restored; therefore, the frequency domain signal needs to be transmitted to the voiceprint decoder 202 before the filtering is performed. The frequency domain signal obtained from the discrete Fourier transform is then input to the Mel filter bank. In this way, on the one hand, the integrity of the initial audio data can be ensured, and on the other hand, the voiceprint characteristics of different users can be reflected through the Mel filtering, so that different users can be distinguished and a data basis is provided for the subsequent voiceprint decoder 202. After step S2 is completed, the following step S3 may be executed.
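The Mel filter bank, logarithm and cosine transform steps described above correspond to the standard Mel-cepstral (MFCC) pipeline. The sketch below, continuing from the frame spectra of the previous sketch, shows one common formulation; the filter count, coefficient count and library calls are assumptions for illustration.

```python
# Illustrative sketch of deriving Mel cepstral coefficients from frame spectra.
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with center frequencies evenly spaced on the Mel scale
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc(spectrum, sample_rate=16000, n_filters=26, n_ceps=13):
    power = np.abs(spectrum) ** 2                      # power spectrum per frame
    n_fft = 2 * (spectrum.shape[1] - 1)
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    filtered = power @ fbank.T                         # Mel filter bank filtering
    log_energy = np.log(filtered + 1e-10)              # take the logarithm
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_ceps]  # cosine transform
```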
Step S3: the terminal device 200 restores the frequency domain signal to the time domain signal through the voiceprint decoder 202.
After the terminal device 200 obtains the frequency domain signal and sends it to the voiceprint decoder 202, the frequency domain signal may be restored to a time domain signal by the voiceprint decoder 202. Fig. 11 is a schematic flow chart of restoring a frequency domain signal to a time domain signal by a voiceprint decoder according to some embodiments of the present application. As shown in fig. 11, when the terminal device 200 restores the frequency domain signal to the time domain signal, the voiceprint decoder 202 may first receive the frequency domain signal; the frequency domain signal is then frequency-shifted according to the target voiceprint feature to obtain a fitted frequency signal; amplitude matching is performed on the fitted frequency signal to generate the timbre and tone of the target voiceprint feature; an inverse discrete Fourier transform is performed on the timbre and tone, followed by time-domain resampling; and finally, frame stitching is performed on the time-domain resampled timbre and tone to restore the frequency domain signal to a time domain signal.
For example, fig. 12 is a schematic diagram of a process of restoring a frequency domain signal into a time domain signal by a voiceprint decoder according to some embodiments of the present application, as shown in fig. 12, after receiving the frequency domain signal, the voiceprint decoder 202 may sequentially perform frequency shifting, amplitude matching, inverse discrete fourier transform, time domain resampling, and frame splicing on the frequency domain signal, and finally restore the frequency domain signal of the initial audio into the time domain signal.
In some embodiments, in performing the frequency shift, the frequency domain signal may be frequency shifted by the target voiceprint feature to obtain a fitted frequency signal as follows. First, the terminal device 200 may acquire a target fundamental frequency of a target voiceprint feature, and then perform frequency shift matching on a frequency domain signal based on the target fundamental frequency to fit a fitted frequency signal having the same amplitude as the frequency domain signal and the same frequency as the target fundamental frequency. For example, the terminal device 200 may first perform a frequency shift matching process against the target fundamental frequency, and then fit a fitted frequency signal with the same amplitude as the frequency domain signal of the initial audio but with the same frequency as the target fundamental frequency after the audio processing, where the purpose of the frequency shift is to make the frequency domain signal of the initial audio conform to the target voiceprint feature after the audio processing.
In the above frequency shift process, the fundamental frequency is merely the reference for the shift: after the shift, not only the fundamental frequency but also the higher-order components such as the second harmonic and the third harmonic change, and the target fundamental frequency is used here only as an illustration. For example, the second harmonic and the third harmonic may be integer multiples of the target fundamental frequency; assuming the target fundamental frequency is 200 Hz, the second harmonic would be 400 Hz, twice the target fundamental frequency, and the third harmonic would be 600 Hz, three times the target fundamental frequency. That is, all frequencies shift during the frequency shift, and as the fundamental frequency moves, the harmonics of each order follow it so that the result corresponds to the target voiceprint feature.
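The following sketch illustrates the frequency-shift idea on a simplified harmonic representation: each harmonic is moved to the corresponding integer multiple of the target fundamental frequency while its amplitude is kept, matching the 200/400/600 Hz example above. The function and parameter names are assumptions, and a real implementation would operate on the full frame spectra rather than an explicit harmonic list.

```python
# Illustrative sketch of re-pitching a harmonic spectrum to a target fundamental.
import numpy as np

def shift_to_target_f0(harmonic_freqs, harmonic_amps, source_f0, target_f0):
    """harmonic_freqs/amps: frequencies (Hz) and amplitudes of the source harmonics;
    returns the fitted frequencies and the unchanged amplitudes."""
    orders = np.rint(np.asarray(harmonic_freqs) / source_f0).astype(int)  # 1, 2, 3, ...
    fitted_freqs = orders * target_f0      # every order follows the new fundamental
    return fitted_freqs, np.asarray(harmonic_amps)

# With a target fundamental of 200 Hz, the second harmonic moves to 400 Hz
# and the third to 600 Hz, as in the example in the text.
freqs, amps = shift_to_target_f0([150.0, 300.0, 450.0], [1.0, 0.5, 0.25],
                                 source_f0=150.0, target_f0=200.0)
print(freqs)   # [200. 400. 600.]
```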
In order to obtain the timbre and tone of the target voiceprint feature, when the terminal device 200 performs amplitude matching on the fitted frequency signal to generate the timbre and tone of the target voiceprint feature, the terminal device 200 may first obtain the first harmonic in the fitted frequency signal, then obtain the second harmonic in the target voiceprint feature and the amplitude of the second harmonic relative to the target fundamental frequency, and then perform a weighted calculation on the first harmonic based on the second harmonic and the amplitude to generate the timbre and tone of the target voiceprint feature.
For example, after the first harmonic in the fitted frequency signal is obtained, a weighted calculation may be performed on the first harmonic of the speaker's initial audio against the amplitude of the second harmonic in the target voiceprint feature relative to the target fundamental frequency; when multiple harmonics are present, each harmonic can be weighted against the amplitude of the corresponding harmonic of the target voiceprint feature relative to the target fundamental frequency. After the frequency shift and amplitude matching are completed, the timbre and tone of the target voiceprint feature are obtained.
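Continuing the simplified harmonic representation, the sketch below shows one plausible reading of the amplitude-matching step: the shifted source harmonics are weighted by the target voiceprint's harmonic-to-fundamental amplitude ratios so that the result carries the target timbre. All names are assumptions, not terms from the patent.

```python
# Illustrative sketch of per-harmonic amplitude matching.
import numpy as np

def match_amplitudes(fitted_amps, target_harmonic_ratios):
    """fitted_amps[k]: amplitude of the (k+1)-th source harmonic after the shift.
    target_harmonic_ratios[k]: amplitude of the (k+1)-th target harmonic relative
    to the target fundamental. Returns weighted amplitudes carrying the target timbre."""
    fitted_amps = np.asarray(fitted_amps, dtype=float)
    ratios = np.asarray(target_harmonic_ratios, dtype=float)
    n = min(len(fitted_amps), len(ratios))
    return fitted_amps[:n] * ratios[:n]   # weighted calculation per harmonic
```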
To ensure a consistent speech rate before and after audio processing, in some embodiments, after the timbre and tone of the target voiceprint feature are acquired, an inverse discrete Fourier transform may be performed on them, followed by time-domain resampling. Because the target fundamental frequency has changed, outputting the target audio directly after only the inverse discrete Fourier transform would result in different speaking speeds before and after the audio processing; therefore, to keep the speech rate consistent, time-domain resampling needs to be performed on the timbre and tone so that the speaking speed before and after audio processing remains unchanged.
To restore the resampled timbre and tone to a time domain signal, in some embodiments, the terminal device 200 may perform frame stitching on the time-domain resampled timbre and tone, finally restoring the frequency domain signal of the original audio to a time domain signal. After step S3 is completed, the following step S4 may be executed.
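A minimal sketch of the decoder back end described above: an inverse discrete Fourier transform per frame, time-domain resampling so the speaking speed stays unchanged, and overlap-add frame stitching back into a time domain signal. The resampling ratio, hop length and use of SciPy are assumptions for illustration only.

```python
# Illustrative sketch of inverse DFT, time-domain resampling and frame stitching.
import numpy as np
from scipy.signal import resample

def frames_to_time_domain(frame_spectra, hop_len, speed_ratio=1.0):
    # Inverse discrete Fourier transform of each frame
    frames = np.fft.irfft(frame_spectra, axis=1)

    # Time-domain resampling: compensate the duration change caused by the
    # fundamental-frequency shift so the speaking speed before/after matches
    if speed_ratio != 1.0:
        new_len = int(round(frames.shape[1] * speed_ratio))
        frames = resample(frames, new_len, axis=1)
        hop_len = int(round(hop_len * speed_ratio))

    # Frame stitching (overlap-add) back into one time domain signal
    frame_len = frames.shape[1]
    out = np.zeros(hop_len * (frames.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop_len: i * hop_len + frame_len] += frame
    return out
```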
Step S4: the terminal device 200 generates target audio with target voiceprint features based on the time domain signal and the target voiceprint features.
To generate the target audio with the target voiceprint feature, the terminal device 200 can generate the target audio with the target voiceprint feature based on the restored time-domain signal and the target voiceprint feature. In some embodiments, the time domain signal may be parsed by the voiceprint decoder 202, after which the parsed time domain signal and the target voiceprint features are re-encoded to reproduce the audio processed target audio.
It should be noted that, in the whole audio processing process, no training needs to be performed on the text content encoder, the language feature encoder or the decoder, so implementing audio processing through the terminal device 200 shortens the training time and simplifies the operation flow. Moreover, the whole audio processing process no longer depends on the text content, and the target audio can be generated using only the voiceprint features. For example, the terminal device 200 may fit the target voiceprint feature in combination with the frequency domain signal of the initial audio, so that audio processing does not depend on text content; by applying voiceprint feature extraction to the audio processing scenario, the processed target audio can be output directly using only the voiceprint features.
As can be seen from the above technical solution, the terminal device 200 provided in the above embodiments acquires initial audio and a target voiceprint feature, where the target voiceprint feature is the voiceprint feature used to generate the target audio data; separates the initial audio by the voiceprint encoder 201 to obtain a frequency domain signal and sends it to the voiceprint decoder 202; restores the frequency domain signal to a time domain signal by the voiceprint decoder 202; and finally generates target audio with the target voiceprint feature based on the time domain signal and the target voiceprint feature. When the terminal device performs audio processing, it does not rely on recognition of text content and can output the target audio with the target voiceprint feature directly from the target voiceprint feature, which reduces the training time of the encoder, the decoder and the like, is not limited by the integrity of the initial audio, and solves the problems of an inefficient audio processing process and limited applicable scenarios.
In order to distinguish between different categories in the voiceprint features, in some embodiments, the terminal device 200 may further obtain the magnitudes of the initial audio at different harmonics, and then perform weighted calculation on the magnitudes by fitting a nonlinear function to perform training on the initial audio, to obtain a classification result of the initial audio, and output the classification result. Thus, different categories in the classification can be distinguished according to the output classification result.
For example, in a practical application scenario, the voiceprint characteristics of a user are not constant; for instance, they differ significantly between unvoiced and voiced speech. Fig. 13 is a schematic view of a scenario in which a terminal device performs classification on voiceprint features according to some embodiments of the present application. As shown in fig. 13, in order to distinguish voiceprint features, in some embodiments, the terminal device 200 may use a three-layer structure, in which the output of each neuron, i.e., each cell in the figure, does not affect the outputs within the same layer but does affect the output of the next layer. Taking the distinction between unvoiced and voiced sound in voiceprint feature 1 as an example, the first layer may be the input layer, where each cell represents a neuron and each neuron corresponds to a coefficient of voiceprint feature 1, i.e., the amplitude of the speaker's initial audio at a different harmonic. The second layer may be a hidden layer, which fits a nonlinear function, weights the coefficients of voiceprint feature 1, and performs the classification work. The third layer may be an output layer whose number of outputs matches the classification result. For example, if the classification result of voiceprint feature 1 includes both unvoiced and voiced sound, the output layer has 2 outputs.
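A minimal sketch of the three-layer structure described above, assuming a small fully connected network: the input layer takes the amplitudes of the initial audio at different harmonics, a hidden layer applies a nonlinear function and weighting, and the output layer has one unit per class (for example unvoiced versus voiced). The layer sizes and the random, untrained weights are assumptions for illustration.

```python
# Illustrative sketch of the three-layer voiceprint classification structure.
import numpy as np

rng = np.random.default_rng(0)

def classify_voiceprint(harmonic_amps, n_hidden=8, n_classes=2, weights=None):
    x = np.asarray(harmonic_amps, dtype=float)    # input layer: amplitudes at different harmonics
    if weights is None:
        # Untrained random weights, for illustration only; in practice these
        # would be obtained by training on labelled voiceprint data.
        weights = {
            "w1": rng.normal(size=(len(x), n_hidden)),
            "b1": np.zeros(n_hidden),
            "w2": rng.normal(size=(n_hidden, n_classes)),
            "b2": np.zeros(n_classes),
        }
    hidden = np.tanh(x @ weights["w1"] + weights["b1"])    # nonlinear hidden layer
    logits = hidden @ weights["w2"] + weights["b2"]        # output layer, one unit per class
    probs = np.exp(logits) / np.exp(logits).sum()          # softmax over classes
    return probs   # e.g. [p(unvoiced), p(voiced)]
```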
To further distinguish different categories within the voiceprint features and prevent interference from non-human sounds, in some embodiments, the terminal device 200 may also perform the following functions. Fig. 14 is a schematic flow chart of a terminal device performing deep learning on an initial voiceprint feature set according to some embodiments of the present application. As shown in fig. 14, the terminal device 200 may first detect the audio category of the initial audio, where the audio category includes human voice audio and interference audio. If the audio category is interference audio, the terminal device 200 is set to a standby state; if the audio category is human voice audio, the voiceprint encoder 201 is started and an initial voiceprint feature set in the initial audio is uploaded, where the initial voiceprint feature set is the set of voiceprint features in the initial audio and includes at least age, gender and sound quality features. Deep learning is then performed on the initial voiceprint feature set, and the learning results are classified and stored. In the step of performing deep learning on the initial voiceprint feature set and classifying and storing the learning results, the terminal device 200 may parse the initial voiceprint feature set to obtain the initial voiceprint features in the initial audio, label the initial voiceprint features through a deep learning algorithm, classify the labelled initial voiceprint features to obtain classified learning results, and finally store the learning results according to their categories.
For example, after the target voiceprint feature is set, a voice detection module may detect whether the initial audio contains human speech, i.e., determine whether someone is actually speaking, so as to prevent sound sources such as environmental noise from being encoded and uploaded, that is, to prevent interference from non-human sounds. After the different categories within the voiceprint features have been distinguished, deep learning can be performed on the voiceprint features, for example by classifying and labelling the voiceprint features of different groups of people and storing the results by category, so that the database can be conveniently searched and called. In an actual use scenario, the number of output classes of the output layer may also be increased; for example, parameters such as the speaker's age and gender may be added in addition to unvoiced and voiced sound.
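The patent does not describe how the voice detection module decides between human voice and interference; the sketch below uses a simple energy and zero-crossing-rate check purely as an assumed placeholder for that decision, with thresholds chosen arbitrarily for illustration.

```python
# Illustrative sketch only; not the patent's detection module.
import numpy as np

def detect_audio_category(audio, energy_threshold=1e-3, zcr_range=(0.02, 0.35)):
    energy = float(np.mean(np.asarray(audio, dtype=float) ** 2))
    zero_crossing_rate = np.mean(np.abs(np.diff(np.sign(audio))) > 0)
    if energy < energy_threshold:
        return "interference"          # too quiet: treat as noise, go to standby
    if zcr_range[0] <= zero_crossing_rate <= zcr_range[1]:
        return "human_voice"           # start the voiceprint encoder
    return "interference"
```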
In some embodiments, the target voiceprint feature may be retrieved from the database and input to the voiceprint decoder 202, and the audio-processed target audio is finally output by an output module in the terminal device 200. In the audio processing of the embodiments of the present application, the process is divided into encoding and decoding, and the encoding and decoding results may be stored in different media, which improves algorithm efficiency and allows flexible product forms; alternatively, they may be stored in the same medium for easy retrieval, which is not specifically limited in this application. It should be noted that the terminal device 200 in the present application includes, but is not limited to, products with microphone hardware, such as televisions, hand-held microphones and earphones, or storage media that implement audio processing through voiceprint features in software form.
Based on the terminal device 200 described above, some embodiments of the present application further provide an audio processing method based on voiceprint features, which can be applied to the terminal device 200 of the foregoing embodiments. Fig. 15 is a flowchart of the audio processing method based on voiceprint features according to some embodiments of the present application. As shown in Fig. 15, in some embodiments the method may include the following steps S1 to S4:
step S1: the terminal device 200 acquires initial audio and acquires target voiceprint features.
To implement the audio processing function for the initial audio, in some embodiments, the terminal device 200 may acquire the initial audio and acquire a target voiceprint feature, wherein the target voiceprint feature is a voiceprint feature used to generate the target audio data.
Illustratively, the initial audio is the unprocessed audio input by the speaker, i.e. the user, and the target audio is the processed audio carrying the target voiceprint feature. Before performing the audio processing, the target voiceprint feature of the target audio may be set first; it may include, for example, a sound quality feature, a timbre feature and a tone feature, so that the features the target audio will carry are known in advance.
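As a hedged sketch of what such a target voiceprint feature could look like in code, the structure below groups a target fundamental frequency (tone), relative harmonic amplitudes (timbre) and a sound-quality score; the field names and types are illustrative assumptions rather than terms defined by this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TargetVoiceprintFeature:
    fundamental_hz: float                 # target fundamental frequency ("tone")
    harmonic_amplitudes: List[float]      # amplitudes relative to the fundamental ("timbre")
    sound_quality: float = 1.0            # assumed normalized quality score

# Example: a target voice with a 220 Hz fundamental and a gently decaying harmonic series.
target_feature = TargetVoiceprintFeature(220.0, [1.0, 0.6, 0.3, 0.15])
```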
Step S2: the terminal device 200 separates the initial audio by the voiceprint encoder 201 to obtain a frequency domain signal, and transmits the frequency domain signal to the voiceprint decoder 202.
After the initial audio is obtained, in some embodiments the terminal device 200 may separate the initial audio through the voiceprint encoder 201 to obtain a frequency domain signal and send the frequency domain signal to the voiceprint decoder 202. In some embodiments, the voiceprint encoder 201 separates the voiceprint-bearing components, such as the speech signal, from the initial audio so that the voiceprint decoder 202 can decode them in the next step, and collects, uploads and saves the separated voiceprint features as a data sample set for later classification of the audio. When separating the initial audio, the terminal device 200 may first extract the speech signal in the initial audio to obtain an initial speech signal, and then sequentially perform high-frequency enhancement preprocessing, framing and windowing, and a discrete Fourier transform on the initial speech signal to convert it into a frequency domain signal.
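A minimal sketch of this encoder front end is given below, assuming the high-frequency enhancement is a standard pre-emphasis filter, a Hamming window, 25 ms frames with a 10 ms hop at 16 kHz, and NumPy's real FFT as the discrete Fourier transform; none of these parameter choices are specified by the application.

```python
import numpy as np

def encode_to_frequency_domain(signal: np.ndarray, frame_len: int = 400,
                               hop: int = 160, pre_emphasis: float = 0.97) -> np.ndarray:
    """Pre-emphasis (high-frequency boost), framing with a Hamming window,
    then a per-frame discrete Fourier transform; expects len(signal) >= frame_len."""
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)       # frequency domain signal, one row per frame

spectrum = encode_to_frequency_domain(np.random.randn(16000))   # e.g. 1 s of audio at 16 kHz
```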
To enable deep learning on the initial audio, in some embodiments the terminal device 200 may also perform the following processing through the voiceprint encoder 201: filter the frequency domain signal with a Mel filter bank, take the logarithm of the filtered frequency domain signal to obtain log energy, and finally apply a cosine transform to the log energy to obtain Mel-cepstrum coefficients, so that deep learning can be performed on the initial audio through these coefficients.
Illustratively, once the frequency domain signal has been filtered by the Mel filter bank, the original audio can no longer be completely restored; the frequency domain signal therefore needs to be transmitted to the voiceprint decoder 202 before the filtering is performed, and only then, after the discrete Fourier transform, is it fed into the Mel filter bank. In this way, the integrity of the initial audio data is ensured on the one hand, and on the other hand the Mel filtering reflects the voiceprint characteristics of different users, so that different users can be distinguished and a data basis is provided for the subsequent voiceprint decoder 202.
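For illustration, the following sketch computes Mel-cepstrum coefficients from the per-frame spectrum produced above: Mel filtering, taking the logarithm, then a discrete cosine transform. The triangular filter-bank construction, the use of SciPy's `dct`, and the numbers of filters and coefficients are common defaults assumed here, not values given in the application.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular Mel filters spanning 0 Hz .. sample_rate / 2 (standard construction)."""
    mel_max = 2595.0 * np.log10(1.0 + (sample_rate / 2) / 700.0)
    mel_points = np.linspace(0.0, mel_max, n_filters + 2)
    hz_points = 700.0 * (10.0 ** (mel_points / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mfcc_from_spectrum(spectrum: np.ndarray, sample_rate: int = 16000,
                       n_filters: int = 26, n_coeffs: int = 13) -> np.ndarray:
    n_fft = (spectrum.shape[1] - 1) * 2
    power = (np.abs(spectrum) ** 2) / n_fft                   # per-frame power spectrum
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    filtered = np.maximum(power @ fbank.T, 1e-10)             # Mel filtering
    log_energy = np.log(filtered)                             # take the logarithm
    return dct(log_energy, type=2, axis=1, norm='ortho')[:, :n_coeffs]  # cosine transform
```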
Step S3: the terminal device 200 restores the frequency domain signal to the time domain signal through the voiceprint decoder 202.
After the terminal device 200 obtains the frequency domain signal and sends it to the voiceprint decoder 202, the frequency domain signal may be restored to a time domain signal by the voiceprint decoder 202. When restoring the signal, the terminal device 200 may receive the frequency domain signal through the voiceprint decoder 202, perform a frequency shift on it according to the target voiceprint feature to obtain a fitted frequency signal, perform amplitude matching on the fitted frequency signal to generate the timbre and tone of the target voiceprint feature, perform an inverse discrete Fourier transform on the timbre and tone, perform time-domain resampling on the result, and finally perform frame splicing on the time-domain-resampled timbre and tone to restore the frequency domain signal to a time domain signal.
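A skeleton of this decoding path is sketched below. The frequency shift is modelled here as a simple remapping of spectral bins toward the target fundamental and the frame splicing as overlap-add; the amplitude matching and the speech-rate-preserving resampling are covered by the separate sketches further below. These are assumed simplifications, not the exact algorithm of this application.

```python
import numpy as np

def decode_to_time_domain(spectrum: np.ndarray, source_f0: float,
                          target_f0: float, hop: int = 160) -> np.ndarray:
    """Frequency shift toward the target fundamental, inverse DFT per frame,
    then frame splicing by overlap-add back into a time domain signal."""
    ratio = target_f0 / source_f0
    n_bins = spectrum.shape[1]
    src_bins = np.clip((np.arange(n_bins) / ratio).astype(int), 0, n_bins - 1)
    shifted = spectrum[:, src_bins]                      # "fitted" frequency signal
    frames = np.fft.irfft(shifted, axis=1)               # inverse discrete Fourier transform
    frame_len = frames.shape[1]
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):                   # frame splicing (overlap-add)
        out[i * hop:i * hop + frame_len] += frame
    return out
```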
That is, after receiving the frequency domain signal, the voiceprint decoder 202 sequentially performs frequency shifting, amplitude matching, an inverse discrete Fourier transform, time-domain resampling and frame splicing, and finally restores the frequency domain signal of the original audio to a time domain signal. To obtain the timbre and tone of the target voiceprint feature, when performing amplitude matching on the fitted frequency signal, the terminal device 200 may first acquire the first harmonic in the fitted frequency signal, then acquire the second harmonic in the target voiceprint feature and its amplitude relative to the target fundamental frequency, and then perform a weighted calculation on the first harmonic based on the second harmonic and that amplitude, so as to generate the timbre and tone of the target voiceprint feature.
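The amplitude-matching step could look roughly like the sketch below: the first harmonic of the fitted spectral frame is taken as a reference and the harmonics are weighted by the target voiceprint's amplitudes relative to its fundamental. The bin rounding, the phase handling and the helper's interface are assumptions made for illustration only.

```python
import numpy as np

def match_harmonic_amplitudes(fitted_frame: np.ndarray, target_f0: float,
                              target_relative_amps: list,
                              sample_rate: int = 16000) -> np.ndarray:
    """Weight the fundamental and its harmonics of one fitted spectral frame by the
    target voiceprint's relative harmonic amplitudes, keeping each bin's phase."""
    out = fitted_frame.copy()
    n_bins = len(out)
    bin_hz = sample_rate / (2 * (n_bins - 1))             # Hz per rfft bin
    f0_bin = min(int(round(target_f0 / bin_hz)), n_bins - 1)
    base = np.abs(out[f0_bin])                            # first-harmonic amplitude
    for k, rel_amp in enumerate(target_relative_amps, start=1):
        b = k * f0_bin
        if 0 < b < n_bins:
            out[b] = base * rel_amp * np.exp(1j * np.angle(out[b]))
    return out
```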
To keep the speech rate consistent before and after audio processing, in some embodiments, after the timbre and tone of the target voiceprint feature are acquired, an inverse discrete Fourier transform may be performed on them, followed by time-domain resampling. Because the fundamental frequency is changed to the target fundamental frequency, directly outputting the target audio after only the inverse discrete Fourier transform would make the speaking speed differ before and after processing; time-domain resampling of the timbre and tone is therefore required so that the speaking speed remains unchanged.
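As a minimal sketch of this resampling step, assuming plain linear interpolation, the reconstructed signal (or each reconstructed frame) can be stretched or compressed back to its original number of samples so that the speaking speed stays the same:

```python
import numpy as np

def resample_to_length(signal: np.ndarray, original_len: int) -> np.ndarray:
    """Linear-interpolation resampling so the processed audio keeps its original duration."""
    x_old = np.linspace(0.0, 1.0, num=len(signal))
    x_new = np.linspace(0.0, 1.0, num=original_len)
    return np.interp(x_new, x_old, signal)
```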
Step S4: the terminal device 200 generates target audio with target voiceprint features based on the time domain signal and the target voiceprint features.
Finally, the terminal device 200 generates the target audio with the target voiceprint feature based on the restored time domain signal and the target voiceprint feature. In some embodiments, the time domain signal may be parsed by the voiceprint decoder 202, after which the parsed time domain signal and the target voiceprint feature are re-encoded to reproduce the processed target audio.
As can be seen from the above technical solutions, the above embodiments provide an audio processing method based on voiceprint features: the terminal device acquires initial audio and a target voiceprint feature, where the target voiceprint feature is the voiceprint feature used to generate the target audio data; separates the initial audio through the voiceprint encoder to obtain a frequency domain signal and sends it to the voiceprint decoder; restores the frequency domain signal to a time domain signal through the voiceprint decoder; and generates target audio with the target voiceprint feature based on the time domain signal and the target voiceprint feature. When performing audio processing, the method does not rely on recognizing text content; the target audio with the target voiceprint feature can be output directly from the target voiceprint feature, which reduces the training time of the encoder, the decoder and the like, is not limited by the integrity of the initial audio, and solves the problems of low audio processing efficiency and limited applicable scenarios.
Identical or similar parts among the embodiments in this specification may refer to one another and are not described again herein.
It will be apparent to those skilled in the art that the techniques of the embodiments of the present invention may be implemented by software plus a necessary general-purpose hardware platform. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence or the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods of the embodiments or of certain parts of the embodiments of the present invention.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.
The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.
Claims (10)
1. A terminal device, comprising:
voiceprint encoder: configured to perform a separation of initial voiceprint features in initial audio;
voiceprint decoder: configured to output target audio with target voiceprint features;
the controller is configured to:
acquiring initial audio and acquiring target voiceprint features, wherein the target voiceprint features are voiceprint features for generating target audio data;
separating the initial audio by a voiceprint encoder to obtain a frequency domain signal, and transmitting the frequency domain signal to a voiceprint decoder;
restoring the frequency domain signal to a time domain signal by the voiceprint decoder;
generating target audio with the target voiceprint feature based on the time domain signal and the target voiceprint feature.
2. The terminal device of claim 1, wherein the controller performs the step of separating the initial audio by a voiceprint encoder to obtain a frequency domain signal, further configured to:
extracting a voice signal in the initial audio to obtain an initial voice signal;
sequentially performing high-frequency enhancement preprocessing, framing and windowing, and discrete Fourier transformation on the initial speech signal to convert the initial speech signal into a frequency domain signal.
3. The terminal device of claim 2, wherein the controller is further configured to:
filtering the frequency domain signal using a mel filter bank;
taking the logarithm of the filtered frequency domain signal to obtain logarithm energy;
and performing cosine transform on the logarithmic energy to obtain a Mel cepstrum coefficient so as to perform deep learning on the initial audio through the Mel cepstrum coefficient.
4. The terminal device of claim 2, wherein the controller performs the step of recovering the frequency domain signal to a time domain signal by the voiceprint decoder, and is further configured to:
receiving the frequency domain signal by the voiceprint decoder;
performing frequency shift on the frequency domain signal according to the target voiceprint characteristics to obtain a fitting frequency signal;
performing amplitude matching on the fitted frequency signal to generate the timbre and tone of the target voiceprint feature;
performing an inverse discrete Fourier transform on the timbre and the tone, and performing time-domain resampling on the timbre and the tone after the inverse discrete Fourier transform;
performing frame splicing on the time-domain resampled timbre and tone to restore the frequency domain signal to a time domain signal.
5. The terminal device of claim 4, wherein the controller performs the step of frequency shifting the frequency domain signal according to the target voiceprint feature to obtain a fitted frequency signal, further configured to:
acquiring a target fundamental frequency of the target voiceprint feature;
and performing frequency shift matching on the frequency domain signal based on the target fundamental frequency to fit a fitted frequency signal with the same amplitude as the frequency domain signal and the same frequency as the target fundamental frequency.
6. The terminal device of claim 5, wherein the controller performs an amplitude matching on the fitted frequency signal, generating timbre and tone of the target voiceprint feature, further configured to:
acquiring a first harmonic in the fitted frequency signal;
acquiring a second harmonic in the target voiceprint feature and an amplitude of the second harmonic relative to the target fundamental frequency;
performing a weighted calculation on the first harmonic based on the second harmonic and the amplitude to generate the timbre and tone of the target voiceprint feature.
7. The terminal device of claim 1, wherein the controller is further configured to:
acquiring the amplitude values of the initial audio under different harmonics;
performing weighted calculation on the amplitude by fitting a nonlinear function to perform training on the initial audio so as to obtain a classification result of the initial audio;
and outputting the classification result.
8. The terminal device of claim 1, wherein the controller is further configured to:
detecting an audio category of the initial audio, wherein the audio category comprises voice audio and interference audio;
if the audio category is the interference audio, setting the terminal device to a standby state;
if the audio category is the human voice audio, starting the voiceprint encoder, and uploading an initial voiceprint feature set in the initial audio, wherein the initial voiceprint feature set is a set of voiceprint features in the initial audio;
performing deep learning on the initial voiceprint feature set, and performing classification saving on the learning results of the deep learning.
9. The terminal device of claim 8, wherein the controller performs deep learning on the initial voiceprint feature set and classification saving on a learning result of the deep learning, and is further configured to:
analyzing the initial voiceprint feature set to obtain initial voiceprint features in the initial audio;
labeling the initial voiceprint features by a deep learning algorithm;
classifying the marked initial voiceprint features to obtain a classified learning result;
and storing the learning result according to the classified category.
10. An audio processing method based on voiceprint features, applied to a terminal device, wherein the terminal device comprises a voiceprint encoder, a voiceprint decoder and a controller, characterized in that the method comprises:
acquiring initial audio and acquiring target voiceprint features, wherein the target voiceprint features are voiceprint features for generating target audio data;
separating the initial audio by a voiceprint encoder to obtain a frequency domain signal, and transmitting the frequency domain signal to a voiceprint decoder;
restoring the frequency domain signal to a time domain signal by the voiceprint decoder;
generating target audio with the target voiceprint feature based on the time domain signal and the target voiceprint feature.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310939391.2A CN117809660A (en) | 2023-07-27 | 2023-07-27 | Terminal equipment and voice print feature-based audio processing method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310939391.2A CN117809660A (en) | 2023-07-27 | 2023-07-27 | Terminal equipment and voice print feature-based audio processing method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117809660A true CN117809660A (en) | 2024-04-02 |
Family
ID=90430790
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310939391.2A Pending CN117809660A (en) | 2023-07-27 | 2023-07-27 | Terminal equipment and voice print feature-based audio processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117809660A (en) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110853618B (en) | Language identification method, model training method, device and equipment | |
WO2023222088A1 (en) | Voice recognition and classification method and apparatus | |
CN112949708B (en) | Emotion recognition method, emotion recognition device, computer equipment and storage medium | |
WO2023222089A1 (en) | Item classification method and apparatus based on deep learning | |
CN114333865B (en) | Model training and tone conversion method, device, equipment and medium | |
CN114999441B (en) | Avatar generation method, apparatus, device, storage medium, and program product | |
CN110909131A (en) | Model generation method, emotion recognition method, system, device and storage medium | |
WO2024055752A9 (en) | Speech synthesis model training method, speech synthesis method, and related apparatuses | |
WO2024140430A1 (en) | Text classification method based on multimodal deep learning, device, and storage medium | |
CN115798459B (en) | Audio processing method and device, storage medium and electronic equipment | |
CN113555032A (en) | Multi-speaker scene recognition and network training method and device | |
CN116186258A (en) | Text classification method, equipment and storage medium based on multi-mode knowledge graph | |
WO2024114303A1 (en) | Phoneme recognition method and apparatus, electronic device and storage medium | |
CN117440183A (en) | Method and device for reversely generating video script based on existing video | |
CN117809700A (en) | Terminal equipment and method for detecting voice ending end point | |
CN114999440B (en) | Avatar generation method, apparatus, device, storage medium, and program product | |
CN117809660A (en) | Terminal equipment and voice print feature-based audio processing method | |
CN116486789A (en) | Speech recognition model generation method, speech recognition method, device and equipment | |
CN111916074A (en) | Cross-device voice control method, system, terminal and storage medium | |
CN117153166B (en) | Voice wakeup method, equipment and storage medium | |
CN109273003A (en) | Sound control method and system for automobile data recorder | |
CN117809619A (en) | Display equipment and voice conversion method | |
US20240112676A1 (en) | Apparatus performing based on voice recognition and artificial intelligence and method for controlling thereof | |
CN114141259A (en) | Voice conversion method, device, equipment, storage medium and program product | |
CN118675521A (en) | Display device and voice recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||