WO2024050802A1 - Speech signal processing method, neural network training method, and device - Google Patents

Speech signal processing method, neural network training method, and device

Info

Publication number
WO2024050802A1
WO2024050802A1 · PCT/CN2022/117989 · CN2022117989W
Authority
WO
WIPO (PCT)
Prior art keywords
transfer function
bone conduction
speech signal
training
neural network
Prior art date
Application number
PCT/CN2022/117989
Other languages
English (en)
French (fr)
Inventor
张立斌 (Zhang Libin)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2022/117989
Publication of WO2024050802A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0272 - Voice signal separating
    • G10L21/0308 - Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04R - LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 - Circuits for transducers, loudspeakers or microphones

Definitions

  • the present application relates to the field of signal processing, and in particular to a speech signal processing method, a neural network training method and equipment.
  • Air conduction speech signals are most commonly used to transmit voice signals.
  • The transmission process is shown in Figure 1, in which the users on side A and side B conduct a voice call on their mobile phones; side A speaking and side B listening is taken as the example.
  • The transmission process is: (1) the user on side A speaks and emits a voice signal S; (2) the air conduction microphone on side A's phone collects the voice signal S; (3) side A's phone encodes and compresses the voice signal S and transmits it over the wireless network to side B's phone; (4) side B's phone decodes the received signal to recover the voice signal S; (5) side B's phone plays the voice signal S through its air conduction speaker.
  • When the above air conduction speech signal is transmitted, the collected speech signal S contains not only the voice of the user on side A but also environmental noise; if the environmental noise is loud, it degrades the listening experience of the user on side B. Therefore, in some high-noise scenes, such as mines and rescue operations, bone conduction voice signals are generally used for information transmission, because bone conduction microphones pick up less environmental noise and have a high audio signal-to-noise ratio, which largely guarantees call quality in these scenes.
  • (1) The user on side A speaks and emits a voice signal S1.
  • (2) The bone conduction microphone on side A's head-mounted device (such as earphones) is close to the face of the user on side A and collects, from skin vibration, the audio signal S2 of the user on side A while speaking (this may also be called the bone conduction voice signal S2).
  • This bone conduction voice signal S2 is sent to side A's phone (e.g., over the Bluetooth connection between the head-mounted device and the phone); (3) side A's phone encodes and compresses the bone conduction voice signal S2 and transmits it over the wireless network to side B's phone; (4) side B's phone decodes the received signal to recover the bone conduction voice signal S2; (5) side B's phone plays the bone conduction voice signal S2 through its air conduction speaker.
  • speech signal = excitation source * excitation channel (where * denotes convolution: the speech signal is the excitation source convolved with the excitation channel).
  • The excitation sources of both air conduction and bone conduction speech signals are the vocal cords, but their excitation channels differ: the excitation channel of the air conduction speech signal is the pharyngeal cavity, oral cavity, etc., while that of the bone conduction speech signal is muscle, bone, etc. Because the two excitation channels are different, the resulting sounds differ considerably: the air conduction speech signal, whose excitation channel transmits through air, has little distortion and high brightness.
  • The excitation channel of the bone conduction speech signal transmits through solids and soft tissue, so it has large distortion, sounds darker than the air conduction speech signal, and is less comfortable to listen to; more seriously, the opposite end (such as side B) may find the voice from side A unnatural, and in severe cases cannot tell who is speaking, which reduces call efficiency.
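The source-filter relation above can be sketched numerically. The following is a minimal NumPy illustration, not the patent's implementation: the excitation sequence and the two channel impulse responses are invented toy values, chosen only to show that the same excitation convolved with different excitation channels yields different signals.

```python
import numpy as np

# Source-filter model: speech = excitation source * excitation channel,
# where * is discrete convolution. All values below are illustrative.
rng = np.random.default_rng(0)

excitation = rng.standard_normal(256)     # e(t): excitation source (stand-in for glottal pulses)
air_channel = np.array([1.0, 0.6, 0.3, 0.1])        # toy air conduction channel impulse response
bone_channel = np.array([1.0, 0.9, 0.7, 0.5, 0.3])  # toy bone conduction channel impulse response

air_speech = np.convolve(excitation, air_channel)    # same excitation source ...
bone_speech = np.convolve(excitation, bone_channel)  # ... different excitation channel

print(air_speech.shape, bone_speech.shape)
```

Because the two channels differ, `air_speech` and `bone_speech` differ, which is exactly why a mapping between the two transfer functions is needed.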
  • Embodiments of the present application provide a speech signal processing method, a neural network training method, and a device for mapping the transfer function of the bone conduction speech signal (i.e., the first transfer function) to the transfer function of the air conduction speech signal (i.e., the second transfer function) based on a trained neural network.
  • This mapping achieves timbre compensation from the bone conduction speech signal to the air conduction speech signal, making the bone conduction speech signal brighter and more comfortable to listen to, which helps guarantee call efficiency in high-noise scenes such as mining and rescue.
  • embodiments of the present application first provide a speech signal processing method, which can be used in the field of signal processing.
  • The method includes: first, obtaining a bone conduction speech signal to be processed, which may be called the first bone conduction speech signal.
  • excitation parameters are extracted based on the first bone conduction speech signal.
  • a transfer function of the first bone conduction voice signal (which may also be referred to as a first transfer function) is determined based on the first bone conduction voice signal.
  • The first transfer function is input into the trained neural network, which outputs a second transfer function: the predicted transfer function of the air conduction speech signal (it may also be called the predictive transfer function).
  • the first air conduction speech signal corresponding to the first bone conduction speech signal is obtained.
  • In this way, the transfer function of the bone conduction speech signal (i.e., the first transfer function) is mapped to the transfer function of the air conduction speech signal (i.e., the second transfer function).
  • The trained neural network is obtained by training a neural network on a training data set based on the target loss function, where the training data set includes multiple pieces of training data, and the training data include the first real transfer function of the bone conduction speech signal, obtained from the audio signal emitted by the sound source and collected by the bone conduction microphone.
  • the output of the neural network is a predicted transfer function.
  • the predicted transfer function corresponds to the second true transfer function of the air conduction speech signal.
  • the second true transfer function is obtained based on the audio signal emitted by the sound source collected by the air conduction microphone.
  • Since the transfer-function mapping is performed by a trained neural network, the mapping quality can be improved, and flexibility achieved, by increasing the scale of the mapping network.
  • the target loss function may be: an error value between the predicted transfer function and the second true transfer function.
  • The method of determining the first transfer function of the first bone conduction voice signal from the first bone conduction voice signal may be: performing a deconvolution operation on the first bone conduction voice signal using the excitation parameters, to obtain the first transfer function of the first bone conduction speech signal.
  • The method of obtaining the first air conduction speech signal corresponding to the first bone conduction speech signal may be: performing a convolution operation on the second transfer function and the excitation parameters, to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal.
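The deconvolution and convolution steps can be sketched in code. This is a minimal NumPy sketch under stated assumptions, not the patent's implementation: spectral division is assumed as the deconvolution method (the patent does not fix one), an identity mapping stands in for the trained neural network, and all signal lengths are arbitrary.

```python
import numpy as np

def deconvolve(signal, excitation, n_fft=512, eps=1e-8):
    """Estimate a transfer function by spectral division: H = S / E in the
    frequency domain (one common realization of deconvolution)."""
    S = np.fft.rfft(signal, n_fft)
    E = np.fft.rfft(excitation, n_fft)
    return S / (E + eps)          # eps guards against near-zero bins

def convolve(transfer_fn, excitation, n_fft=512):
    """Re-synthesize a time-domain signal from a transfer function and the excitation."""
    E = np.fft.rfft(excitation, n_fft)
    return np.fft.irfft(transfer_fn * E, n_fft)

rng = np.random.default_rng(1)
excitation = rng.standard_normal(512)                      # extracted excitation (stand-in)
bone_signal = np.convolve(excitation, [1.0, 0.8, 0.4])[:512]  # toy first bone conduction signal

h1 = deconvolve(bone_signal, excitation)   # first transfer function
h2 = h1                                    # stand-in for the trained NN mapping h1 -> h2
air_signal = convolve(h2, excitation)      # first air conduction speech signal
print(air_signal.shape)
```

With the identity stand-in, deconvolving and re-convolving recovers the input almost exactly; in the actual method, `h2` would come from the trained neural network, so `air_signal` would carry the air conduction timbre instead.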
  • The method of obtaining the first bone conduction voice signal may be: first collecting the audio signal through a bone conduction microphone, and then performing noise reduction on the collected audio signal with a spectral subtraction noise reduction algorithm to obtain the first bone conduction speech signal.
  • Because the first bone conduction voice signal has undergone noise reduction processing, the spectrum of the voice signal is purer.
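The spectral subtraction step can be sketched as follows. This is a toy illustration of the algorithm family the text names, not the patent's exact algorithm: the noise estimate, FFT size, and spectral floor are all invented parameters.

```python
import numpy as np

def spectral_subtraction(noisy, noise_estimate, n_fft=512, floor=0.01):
    """Minimal spectral subtraction: subtract an estimated noise magnitude
    spectrum from the noisy magnitude spectrum, keep the noisy phase."""
    N = np.fft.rfft(noisy, n_fft)
    noise_mag = np.abs(np.fft.rfft(noise_estimate, n_fft))
    # Spectral floor keeps magnitudes non-negative after subtraction.
    mag = np.maximum(np.abs(N) - noise_mag, floor * np.abs(N))
    return np.fft.irfft(mag * np.exp(1j * np.angle(N)), n_fft)

rng = np.random.default_rng(2)
clean = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)   # toy "speech" tone
noise = 0.1 * rng.standard_normal(512)                     # toy environmental noise
denoised = spectral_subtraction(clean + noise, noise)
print(denoised.shape)
```

In practice the noise spectrum would be estimated from speech-free frames rather than known exactly as here.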
  • the excitation parameter includes a fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
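Extracting a fundamental frequency (from which the harmonics follow as integer multiples) can be sketched with a toy autocorrelation estimator; the patent does not specify an estimator, and the sampling rate, search range, and test tone below are all invented for illustration.

```python
import numpy as np

def estimate_f0(signal, fs, f0_min=80.0, f0_max=400.0):
    """Toy autocorrelation pitch estimator: the lag of the autocorrelation
    peak inside the allowed range gives the fundamental period."""
    sig = signal - np.mean(signal)
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]  # lags 0..len-1
    lo, hi = int(fs / f0_max), int(fs / f0_min)              # lag search range
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
f0 = 200.0
# A "voiced" test signal: fundamental plus harmonics at integer multiples of f0.
sig = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in (1, 2, 3))

est = estimate_f0(sig, fs)
harmonics = [est * k for k in (2, 3, 4)]   # harmonics of the fundamental
print(round(est, 1))                       # → 200.0
```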
  • the second aspect of the embodiment of the present application provides a method for training a neural network.
  • The method includes: first, collecting, over a certain period of time, the audio signals emitted by a sound source (such as user A) through a bone conduction microphone and an air conduction microphone; n (n ≥ 2) bone conduction speech signals and n air conduction speech signals emitted by user A while speaking can be collected.
  • A real transfer function corresponding to each bone conduction speech signal (which can be called the first real transfer function) can then be obtained; from the n bone conduction speech signals, a total of n first real transfer functions are obtained.
  • Similarly, a real transfer function corresponding to each air conduction speech signal (which can be called the second real transfer function) can be obtained; from the n air conduction speech signals, a total of n second real transfer functions are obtained.
  • The n first real transfer functions are used as the training data set of the neural network, and the neural network is trained with the target loss function until the training termination condition is reached, yielding the trained neural network; the target loss function is obtained based on the second real transfer function.
  • Since the transfer-function mapping is performed by a trained neural network, the mapping quality can be improved, and flexibility achieved, by increasing the scale of the mapping network.
  • the bone conduction microphone and the air conduction microphone are deployed on the same device, and the training process of the neural network is performed on the device.
  • Since the bone conduction microphone, the air conduction microphone, and the training process of the neural network are all located on the same device, online training can be realized, which provides flexibility.
  • the same device includes: a head-mounted device, such as a headset.
  • the target loss function may be: an error value between the predicted transfer function and the second true transfer function.
  • Reaching the training termination condition includes but is not limited to: the value of the target loss function reaches a preset threshold; or the target loss function begins to converge; or the number of training iterations reaches a preset number; or the training duration reaches a preset duration; or an instruction to terminate training is received.
  • the third aspect of the embodiments of the present application provides an execution device, which has the function of implementing the method of the above-mentioned first aspect or any possible implementation manner of the first aspect.
  • This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • the fourth aspect of the embodiments of the present application provides a training device, which has the function of implementing the method of the above-mentioned second aspect or any of the possible implementation methods of the second aspect.
  • This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions.
  • The fifth aspect of the embodiments of the present application provides an execution device, which may include a memory, a processor, and a bus system; the memory is used to store a program, and the processor is used to call the program stored in the memory to execute the method of the first aspect or any possible implementation of the first aspect.
  • The sixth aspect of the embodiments of the present application provides a training device, which may include a memory, a processor, and a bus system; the memory is used to store a program, and the processor is used to call the program stored in the memory to execute the method of the second aspect or any possible implementation of the second aspect.
  • a seventh aspect of the present application provides a computer-readable storage medium.
  • The computer-readable storage medium stores instructions which, when run on a computer, enable the computer to execute the method of the above first aspect or any possible implementation of the first aspect, or the method of the above second aspect or any possible implementation of the second aspect.
  • The eighth aspect of the embodiments of the present application provides a computer program which, when run on a computer, enables the computer to execute the method of the above first aspect or any possible implementation of the first aspect, or the method of the above second aspect or any possible implementation of the second aspect.
  • a ninth aspect of the embodiment of the present application provides a chip.
  • the chip includes at least one processor and at least one interface circuit.
  • the interface circuit is coupled to the processor.
  • The at least one interface circuit is used to perform a transceiver function and send instructions to the at least one processor; the at least one processor is used to run a computer program or instructions, and has the function of implementing the method of the above first aspect or any possible implementation of the first aspect, or of the above second aspect or any possible implementation of the second aspect.
  • This function can be realized by hardware, by software, or by a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the above functions.
  • the interface circuit is used to communicate with other modules outside the chip. For example, the interface circuit can send the neural network trained on the chip to a target device (such as a headset, a mobile phone, a personal computer, etc.).
  • Figure 1 is a schematic diagram of a real-time call system based on a mobile phone provided by an embodiment of the present application
  • Figure 2 is another schematic diagram of a mobile phone-based real-time call system provided by an embodiment of the present application
  • Figure 3 is a schematic diagram of the system architecture of the speech signal processing system provided by an embodiment of the present application;
  • Figure 4 is a schematic flow chart of a neural network training method provided by an embodiment of the present application.
  • Figure 5 is an example diagram of a training process provided by the embodiment of the present application.
  • Figure 6 is a schematic flow chart of a voice signal processing method provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a training device provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of another execution device provided by an embodiment of the present application.
  • Figure 10 is a schematic structural diagram of another training device provided by an embodiment of the present application.
  • Embodiments of the present application provide a speech signal processing method, a neural network training method, and a device for mapping the transfer function of the bone conduction speech signal (i.e., the first transfer function) to the transfer function of the air conduction speech signal (i.e., the second transfer function) based on a trained neural network.
  • This mapping achieves timbre compensation from the bone conduction speech signal to the air conduction speech signal, making the bone conduction speech signal brighter and more comfortable to listen to, which helps guarantee call efficiency in high-noise scenes such as mining and rescue.
  • A neural network can be composed of neural units and can be understood as a network with an input layer, hidden layers, and an output layer: generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN).
  • The work of each layer in a neural network can be described mathematically. At the physical level, the work of each layer can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/reducing the dimension; 2. enlarging/shrinking; 3. rotation; 4. translation; 5. bending.
  • Here "space" refers to the set of all individuals of a given kind, and W is the weight matrix of each layer of the neural network; each value in this matrix represents the weight value of one neuron in that layer.
  • W determines the spatial transformation from the input space to the output space mentioned above, that is, the W of each layer of the neural network controls how to transform the space.
  • The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained neural network; therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
  • In training a deep neural network, a loss function (also called an objective function) is used to measure the difference between the predicted value and the target value.
  • The error back propagation (BP) algorithm can be used to modify the parameters in the initial neural network model so that its reconstruction error loss becomes smaller and smaller: the input signal is propagated forward until the output, which produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error-loss information so that the error loss converges.
  • The backpropagation algorithm is thus a backward pass dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
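The forward-pass / error-loss / backward-update cycle can be illustrated on the smallest possible case, a single linear layer; the data, sizes, and learning rate below are all invented for illustration, and the update direction is proportional to the gradient of the mean squared error loss.

```python
import numpy as np

# Forward-propagate, measure the error loss, then update the weight matrix W
# by moving against the loss gradient, so the loss shrinks over iterations.
rng = np.random.default_rng(3)
X = rng.standard_normal((64, 4))     # toy inputs
W_true = rng.standard_normal((4, 2)) # "ideal" weight matrix that generated the targets
Y = X @ W_true                       # toy targets

W = np.zeros((4, 2))                 # weight matrix to be learned
lr = 0.05
losses = []
for _ in range(200):
    pred = X @ W                               # forward propagation
    err = pred - Y
    losses.append(float(np.mean(err ** 2)))    # error loss
    grad = 2 * X.T @ err / len(X)              # proportional to the loss gradient w.r.t. W
    W -= lr * grad                             # backward update of the weight matrix

print(losses[0] > losses[-1])                  # → True
```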
  • the speech signal processing system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250 and a data acquisition device 260 (such as a bone conduction microphone and an air conduction microphone).
  • The execution device 210 includes a calculation module 211.
  • the data collection device 260 is used to obtain a large-scale data set required by the user (i.e., a training data set, which may also be called a training set, and the training set includes training data), and stores the training set in the database 230.
  • The training device 220 trains the neural network 201 constructed in this application based on the training set maintained in the database 230; the trained neural network 201 obtained by training is then used on the execution device 210.
  • the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
  • the data storage system 250 may be placed in the execution device 210 , or the data storage system 250 may be an external memory relative to the execution device 210 .
  • The trained neural network 201 trained by the training device 220 can be applied to different systems or devices (i.e., the execution device 210): specifically, a terminal device or a head-mounted device, such as a headset, mobile phone, tablet, computer, or cloud server.
  • the execution device 210 can be configured with an I/O interface 212 for data interaction with external devices, and the “user” can input data to the I/O interface 212 through the client device 240 .
  • The client device 240 may be a bone conduction microphone; the bone conduction voice signal it collects is input to the execution device 210.
  • The execution device 210 first calculates the real transfer function of the bone conduction voice signal (i.e., the first transfer function described below) and then inputs this real transfer function into the calculation module 211.
  • The calculation module 211 processes the input real transfer function to obtain the predicted transfer function (i.e., the second transfer function described below), which is then saved on the storage medium of the execution device 210 for use in subsequent downstream tasks.
  • the client device 240 can also be integrated in the execution device 210.
  • For example, when the execution device 210 is a head-mounted device (such as a headset), the bone conduction voice signal can be obtained directly through the head-mounted device: it can be collected by the bone conduction microphone on the head-mounted device, or received by the head-mounted device from another device; the source of the bone conduction voice signal is not limited here.
  • The calculation module 211 in the head-mounted device then processes the real transfer function of the bone conduction voice signal to obtain the predicted transfer function, and saves the obtained predicted transfer function.
  • the product forms of the execution device 210 and the client device 240 are not limited here.
  • The data collection device 260 and/or the training device 220 can also be integrated in the execution device 210. For example, when the execution device 210 is a head-mounted device (such as a headset), the timbres of the bone conduction speech signals collected may differ depending on who wears it, so the data collection device 260 and/or the training device 220 can be integrated into the execution device 210.
  • When user A wears the head-mounted device, the voice of user A is collected through the data collection device 260 (such as a bone conduction microphone), the neural network 201 is trained through the training device 220 (the real transfer function of the air conduction speech signal is obtained from the air conduction speech signal collected by the air conduction microphone), and the trained neural network 201 is used directly for user A's subsequent applications; similarly, when user B wears the head-mounted device, the data collection device 260 (such as a bone conduction microphone) collects the voice of user B, the neural network 201 is trained through the training device 220, and the trained neural network 201 is used directly for user B's subsequent applications. This makes the trained neural network 201 more accurate and adaptable when different users use the execution device 210, providing flexibility.
  • Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
  • the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • In Figure 3, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 can also be placed in the execution device 210.
  • In Figure 3, the client device 240 is an external device relative to the execution device 210; in other cases, the client device 240 can also be integrated in the execution device 210. Likewise, in Figure 3 the training device 220 is external to the execution device 210 but can also be integrated in it, and the data acquisition device 260 is external to the execution device 210 but can also be integrated in it, and so on. This application does not limit this.
  • The embodiments of the present application involve two stages: the training stage and the application stage (which can also be called the inference stage). The following describes, from these two stages respectively, the specific processes of the neural network training method and the speech signal processing method provided by the embodiments of the present application.
  • the training stage refers to the process in which the data collection device 260 in FIG. 3 collects training data, and the training device 220 uses the training data to perform training operations on the neural network 201.
  • Figure 4 is a schematic flowchart of a neural network training method provided by an embodiment of the present application. The method may specifically include the following steps:
  • the audio signals emitted by the sound source can be collected through bone conduction microphones and air conduction microphones within a certain period of time.
  • n (n ≥ 2) bone conduction speech signals and n air conduction speech signals emitted by user A when speaking can be collected.
  • n first real transfer functions and n second real transfer functions are obtained respectively.
  • The first real transfer functions correspond one-to-one to the bone conduction speech signals, and the second real transfer functions correspond one-to-one to the air conduction speech signals.
  • The real transfer function h_b(t) corresponding to each bone conduction speech signal s(t) can be obtained (it can be called the first real transfer function h_b(t)); from the n bone conduction speech signals s(t), a total of n first real transfer functions h_b(t) are obtained.
  • Similarly, the real transfer function h_y(t) corresponding to each air conduction speech signal y(t) can be obtained (it can be called the second real transfer function h_y(t)); from the n air conduction speech signals y(t), a total of n second real transfer functions h_y(t) are obtained.
  • The n first real transfer functions h_b(t) are used as the training data set of the neural network, and the neural network is trained using the target loss function L until the training termination condition is reached, thereby obtaining the trained neural network; the target loss function is obtained based on the second real transfer function.
  • the target loss function L may be the error value between the predicted transfer function output by the neural network and the second true transfer function.
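The target loss can be sketched concretely; the patent only calls it an "error value" between the predicted and the second true transfer function, so the mean squared error used here, and the toy transfer-function values, are illustrative assumptions.

```python
import numpy as np

def target_loss(predicted_tf, true_tf):
    """Error value between the predicted transfer function and the second
    true transfer function; mean squared error is one plausible choice."""
    return float(np.mean(np.abs(predicted_tf - true_tf) ** 2))

h_pred = np.array([1.0, 0.5, 0.25, 0.1])   # toy predicted transfer function h_c
h_true = np.array([1.0, 0.6, 0.20, 0.1])   # toy second true transfer function h_y

print(target_loss(h_pred, h_true))          # ≈ 0.003125
```

Training drives this value down until a termination condition (threshold, convergence, etc.) is met.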
  • The training process can be: the first real transfer function h_b(t) is taken as the input of the neural network, whose output is the predicted transfer function h_c(t); the network is then iteratively trained with the target loss function L until the training termination condition is reached.
  • Figure 5 takes a recurrent neural network (RNN) as an example to illustrate the relationship between the input data (i.e., the training data) and the output (i.e., the predicted transfer function) of the RNN.
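A minimal RNN forward pass over transfer-function frames can be sketched as follows; the frame count, feature dimension, hidden size, and random weights are all invented for illustration, and this shows only the input/output relationship, not the patent's network.

```python
import numpy as np

# Toy RNN forward pass: each time step consumes one frame of the first
# transfer function h_b and emits one frame of the predicted transfer
# function h_c, with a recurrent state carrying information across frames.
rng = np.random.default_rng(4)
n_frames, dim, hidden = 10, 8, 16
h_b = rng.standard_normal((n_frames, dim))     # input frames (stand-in for h_b(t))

W_xh = rng.standard_normal((dim, hidden)) * 0.1    # input-to-hidden weights
W_hh = rng.standard_normal((hidden, hidden)) * 0.1 # hidden-to-hidden (recurrent) weights
W_hy = rng.standard_normal((hidden, dim)) * 0.1    # hidden-to-output weights

h = np.zeros(hidden)
h_c = np.empty_like(h_b)                       # output frames (stand-in for h_c(t))
for t in range(n_frames):
    h = np.tanh(h_b[t] @ W_xh + h @ W_hh)      # update recurrent state
    h_c[t] = h @ W_hy                          # emit predicted frame

print(h_c.shape)
```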
  • reaching training termination conditions includes but is not limited to:
  • The value of the target loss function reaches a preset threshold: after configuring the target loss function, a threshold (for example, 0.03) can be set for it in advance. After each round of training, it is judged whether the value of the target loss function obtained in that round reaches the threshold; if not, training continues, and if it does, training is terminated, and the network parameter values of the neural network determined in that round are taken as the final network parameter values of the trained neural network.
  • The target loss function begins to converge: after configuring the target loss function, the neural network can be trained iteratively; if the difference between the value of the target loss function obtained in the current round and that obtained in the previous round is within a preset range (for example, 0.01), the target loss function is considered to have converged and training can be terminated, with the network parameter values determined in the current round taken as the final network parameter values of the trained neural network.
  • The number of iterations for training the neural network can be pre-configured (for example, 100). After configuring the target loss function, the neural network is trained iteratively, and after each round the network parameter values of that round are stored, until the number of training iterations reaches the preset number. Test data are then used to verify the neural network obtained in each round, and the network parameter values with the best performance are taken as the final network parameter values of the neural network.
  • the training duration reaches the preset duration.
  • the iteration time for training the neural network can be pre-configured (for example, 5 minutes). After configuring the target loss function, the neural network can be iteratively trained. After each round of training, The values of the network parameters of the neural network of the corresponding round are stored until the training iteration length reaches the preset length. After that, the test data is used to verify the neural network obtained in each round, and the one with the best performance is selected. The values of the network parameters are used as the final network parameter values of the neural network.
  • a training switch can be set in advance to trigger the generation of training start and training termination instructions.
  • the training switch When the training switch is turned on, the training start instruction is triggered, and the neural network starts iterative training.
  • the training switch is turned off, a command to terminate training is triggered, and the neural network stops training.
  • the period from when the training switch is turned on to when it is turned off is the training time of the neural network.
  • the values of the network parameters of the neural network for that round will be stored until training.
  • the test data is used to verify the neural network obtained in each round, and the value of the network parameter with the best performance is selected as the final network parameter value of the neural network.
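The termination conditions above can be sketched as a single training loop. The following Python sketch is only illustrative; the function name, default values, and the `stop_flag` callable standing in for the training switch are assumptions, not details from the application:

```python
import time

def train_until_done(step_fn, loss_threshold=0.03, converge_delta=0.01,
                     max_rounds=100, max_seconds=300.0, stop_flag=lambda: False):
    """Run step_fn (one training round, returning that round's loss) until
    one of the termination conditions fires; returns (losses, reason)."""
    losses, start = [], time.time()
    for round_idx in range(max_rounds):
        if stop_flag():                                   # training switch turned off
            return losses, "terminate_instruction"
        losses.append(step_fn(round_idx))
        if losses[-1] <= loss_threshold:                  # loss reached preset threshold
            return losses, "threshold"
        if len(losses) > 1 and abs(losses[-1] - losses[-2]) <= converge_delta:
            return losses, "converged"                    # loss has stopped changing
        if time.time() - start >= max_seconds:            # preset training duration
            return losses, "duration"
    return losses, "max_rounds"                           # preset iteration count
```

In a real setting, `step_fn` would run one optimization round and return the target loss, and the best-performing parameters among the stored rounds would be selected afterwards, as described above.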
  • Order of collection and training: n bone conduction speech signals s(t) and n air conduction speech signals y(t) may be collected first, and the neural network then trained iteratively on the collected signals (that is, all required training data are collected before training). Alternatively, one bone conduction speech signal s(t) and one air conduction speech signal y(t) may be collected and used to train the neural network once; if the current network does not meet the training termination condition, data collection and training are repeated until the termination condition is reached (that is, as many collections as there are iterations). This application does not limit the order between data collection and training.
  • The bone conduction microphone and the air conduction microphone can be deployed on the same device; for example, both can be deployed on a head-mounted device (such as an earphone), and the training process of the neural network can also be performed on that device, i.e., the training device is that same device. In this case, the training process of the above-mentioned neural network is an online training process.
  • The online training process can be implemented as follows: the wearer puts on the head-mounted device (such as earphones) and turns on the online training switch, triggering the training start instruction. The wearer speaks, and the head-mounted device simultaneously collects the wearer's bone conduction speech signal s(t) and air conduction speech signal y(t) through the bone conduction microphone and the air conduction microphone. The head-mounted device takes the first real transfer function h_b(t) corresponding to the bone conduction speech signal s(t) as the input of the neural network and the predicted transfer function h_c(t) as its output, and training proceeds iteratively. When the online training switch is turned off, the head-mounted device is triggered to terminate training (that is, the training termination condition is obtaining the training termination instruction); the period from when the online training switch is turned on to when it is turned off is the training time of the neural network. The network parameters obtained in the last round before the switch is turned off (or the best-performing network parameters among all rounds) are saved as the final neural network.
  • Alternatively, the online training process of the neural network need not be performed on the same device where the bone conduction microphone and the air conduction microphone are deployed: the online training module can be deployed on the headset or on other devices, such as mobile phones, computers, and cloud servers.
  • the training process of the neural network may also be an offline training process (that is, the neural network is trained in advance).
  • the implementation process of the offline training process can be:
  • the audio signals emitted by the sound source are collected through the bone conduction microphone and the air conduction microphone to obtain the bone conduction speech signal s(t) and the air conduction speech signal y(t).
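Under the generation model used throughout this application (signal = excitation convolved with a transfer function), one illustrative way to turn such paired recordings into training data is to deconvolve each pair by a shared excitation estimate in the frequency domain. The plain-FFT representation, `eps` regularization, and helper names below are assumptions for illustration, not the application's prescribed implementation:

```python
import numpy as np

def build_pairs(bone_signals, air_signals, excitations, n_fft=256, eps=1e-8):
    """Deconvolve each bone/air recording of the same source by the shared
    excitation to get (first real transfer function, second real transfer
    function) pairs; the pairs form the training data set."""
    pairs = []
    for s, y, e in zip(bone_signals, air_signals, excitations):
        E = np.fft.rfft(e, n_fft) + eps
        h_b = np.fft.rfft(s, n_fft) / E   # first real transfer function
        h_c = np.fft.rfft(y, n_fft) / E   # second real transfer function
        pairs.append((h_b, h_c))
    return pairs

def target_loss(pred, true):
    """Target loss: error between the network's predicted transfer function
    and the second real transfer function."""
    return float(np.mean(np.abs(pred - true) ** 2))
```

During training, the network would be fed each h_b and its output compared to the paired h_c through `target_loss`.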
  • the application stage refers to the process in which the execution device 210 in FIG. 3 uses the trained neural network 201 to process the input data.
  • Figure 6 is a schematic flowchart of a speech signal processing method provided by an embodiment of the present application. The method may specifically include the following steps:
  • First, a bone conduction speech signal s(t) to be processed is obtained; this bone conduction speech signal s(t) may be called the first bone conduction speech signal s(t).
  • The excitation parameters e(t) are extracted based on the first bone conduction speech signal s(t); e(t) may include the fundamental frequency of the first bone conduction speech signal s(t) and the harmonics of the fundamental frequency, which can be obtained by analyzing the speech signal with the linear predictive coding (LPC) method.
  • One implementation of obtaining the first bone conduction speech signal s(t) is: first, an audio signal x(t) is collected by the bone conduction microphone (this may also be called the bone conduction audio signal x(t)); then noise reduction is performed on the audio signal x(t) to obtain the first bone conduction speech signal s(t). The noise reduction may involve techniques such as voice activity detection (VAD) and the fast Fourier transform (FFT).
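The LPC analysis mentioned above, used here to separate the excitation from the signal, can be sketched with the autocorrelation method (Levinson-Durbin recursion). This is a minimal illustration: the frame length, order, and absence of windowing are simplifying assumptions, not choices taken from the application:

```python
import numpy as np

def lpc_coeffs(frame, order):
    """LPC coefficients a = [1, a1, ..., ap] via the autocorrelation
    method (Levinson-Durbin recursion)."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]  # update previous coefficients
        a[i] = k                             # new reflection coefficient
        err *= (1.0 - k * k)                 # prediction error shrinks each order
    return a

def lpc_excitation(frame, order=12):
    """Inverse-filter the frame with its LPC polynomial; the residual
    approximates the excitation (pitch pulses and their harmonics)."""
    a = lpc_coeffs(frame, order)
    return np.convolve(frame, a)[:len(frame)]
```

For voiced speech, the residual returned by `lpc_excitation` carries the fundamental frequency and its harmonics, while the discarded LPC envelope corresponds to the transfer function.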
  • The transfer function h_b(t) of the first bone conduction speech signal s(t) is determined from the first bone conduction speech signal s(t); it may also be called the first transfer function h_b(t).
  • The first transfer function h_b(t) of the first bone conduction speech signal s(t) is input into the trained neural network, which outputs the second transfer function h_c(t). The second transfer function h_c(t) is the predicted transfer function of the air conduction speech signal, and may therefore also be called the predicted transfer function h_c(t).
  • The trained neural network is the one obtained through the training stage described above; that is, it is obtained by training the neural network on the training data set based on the target loss function, where the training data set includes multiple training data (including the first real transfer function of the bone conduction speech signal), and the first real transfer function is obtained from the audio signal emitted by the sound source and collected by the bone conduction microphone. The output of the neural network is a predicted transfer function, which corresponds to a second real transfer function of the air conduction speech signal, and the second real transfer function is obtained from the audio signal emitted by the sound source and collected by the air conduction microphone.
  • the target loss function may be an error value between the predicted transfer function output by the neural network and the second true transfer function.
  • According to the second transfer function h_c(t) and the excitation parameters e(t), the first air conduction speech signal s'(t) corresponding to the first bone conduction speech signal s(t) is obtained.
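The processing chain above (deconvolve by the excitation to get h_b, map it with the trained network, convolve the result back with the excitation) can be sketched in the frequency domain. The FFT representation, the `eps` regularization, and the `net` callable standing in for the trained neural network are illustrative assumptions:

```python
import numpy as np

def timbre_compensate(s, e, net, n_fft=256, eps=1e-8):
    """Map a bone conduction signal s to a predicted air conduction signal,
    given its excitation e and a trained mapping `net`."""
    S = np.fft.rfft(s, n_fft)
    E = np.fft.rfft(e, n_fft)
    h_b = S / (E + eps)          # deconvolution: s = e * h_b
    h_c = net(h_b)               # trained network maps h_b -> predicted h_c
    return np.fft.irfft(h_c * E, n_fft)[:len(s)]  # convolution: s' = e * h_c
```

With an identity `net`, the pipeline reconstructs the input signal, which is a convenient sanity check before plugging in a real trained mapping.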
  • subsequent processing is performed based on the timbre compensated speech signal (i.e., the first air conduction speech signal s’(t)), including but not limited to:
  • Voice calls: enhance the brightness of the transmitted voice, achieve better perceived voice quality, ensure the far end's perception is not degraded, and improve the perceived brightness and comfort of the call voice.
  • Speech recognition: improve the accuracy of speech recognition on bone conduction speech.
  • Figure 7 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • The execution device 700 may specifically include: an acquisition module 701, a first determination module 702, a calculation module 703, and a second determination module 704. The acquisition module 701 is used to acquire the first bone conduction speech signal and extract the excitation parameters from it; the first determination module 702 is used to determine the first transfer function of the first bone conduction speech signal from the first bone conduction speech signal; the calculation module 703 is used to input the first transfer function into the trained neural network to obtain the output second transfer function, which is the predicted transfer function of the air conduction speech signal; and the second determination module 704 is used to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal from the second transfer function and the excitation parameters.
  • The trained neural network is obtained by training the neural network on a training data set based on the target loss function; the training data set includes multiple training data, and the training data include a first real transfer function of the bone conduction speech signal, obtained from the audio signal emitted by the sound source and collected by the bone conduction microphone. The output of the neural network is a predicted transfer function, which corresponds to the second real transfer function of the air conduction speech signal, obtained from the audio signal emitted by the sound source and collected by the air conduction microphone.
  • the target loss function includes: an error value between the predicted transfer function and the second true transfer function.
  • The first determination module 702 is specifically configured to use the excitation parameters to perform a deconvolution operation on the first bone conduction speech signal to obtain the first transfer function of the first bone conduction speech signal.
  • The second determination module 704 is specifically configured to perform a convolution operation on the second transfer function and the excitation parameters to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal.
  • The acquisition module 701 is specifically configured to collect an audio signal through a bone conduction microphone and perform noise reduction on the audio signal to obtain the first bone conduction speech signal.
  • the excitation parameters include the fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
  • the embodiment of the present application also provides a training device. Please refer to Figure 8 for details.
  • Figure 8 is a schematic diagram of a training device provided by the embodiment of the present application.
  • The training device 800 may specifically include: a collection module 801, a calculation module 802, and an iteration module 803.
  • The collection module 801 is used to collect the audio signals emitted by the sound source through the bone conduction microphone and the air conduction microphone respectively, obtaining n bone conduction speech signals and n air conduction speech signals, n ≥ 2;
  • The calculation module 802 is used to obtain n first real transfer functions and n second real transfer functions from the n bone conduction speech signals and the n air conduction speech signals respectively; the first real transfer functions correspond one-to-one with the bone conduction speech signals, and the second real transfer functions correspond one-to-one with the air conduction speech signals.
  • The iteration module 803 is used to take the n first real transfer functions as the training data set of the neural network and train the neural network using the target loss function until the training termination condition is reached, obtaining the trained neural network.
  • the target loss function is obtained based on the second true transfer function.
  • The bone conduction microphone and the air conduction microphone are deployed in the training device 800.
  • the training device 800 includes: a head-mounted device.
  • the output of the neural network is a predicted transfer function
  • the target loss function includes: an error value between the predicted transfer function and the second true transfer function
  • Reaching the training termination condition includes: the value of the target loss function reaching a preset threshold; or the target loss function starting to converge; or the number of training rounds reaching a preset number; or the training duration reaching a preset duration; or a training termination instruction being obtained.
  • FIG. 9 is a schematic structural diagram of the training device provided by the embodiment of the present application.
  • The training device 800 described in the embodiment corresponding to Figure 8 may be deployed on the training device 900 to implement the functions of the training device 800 in that embodiment.
  • the training device 900 is implemented by one or more servers.
  • Training devices 900 may differ considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 922 and memory 932, and one or more storage media 930 (such as one or more mass storage devices) storing applications 942 or data 944.
  • the memory 932 and the storage medium 930 may be short-term storage or persistent storage.
  • the program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device 900 .
  • the central processor 922 may be configured to communicate with the storage medium 930 and execute a series of instruction operations in the storage medium 930 on the training device 900 .
  • the training device 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input and output interfaces 958, and/or, one or more operating systems 941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and more.
  • the central processor 922 is used to execute the training method of the neural network in the corresponding embodiment of Figure 4.
  • Specifically, the central processor 922 can be used to: first, collect the audio signals emitted by the sound source through the bone conduction microphone and the air conduction microphone respectively, obtaining n bone conduction speech signals and n air conduction speech signals, n ≥ 2; then, obtain n first real transfer functions and n second real transfer functions from the n bone conduction speech signals and the n air conduction speech signals respectively, where the first real transfer functions correspond one-to-one with the bone conduction speech signals and the second real transfer functions correspond one-to-one with the air conduction speech signals; and finally, take the n first real transfer functions as the training data set of the neural network and train the neural network using the target loss function, which is obtained based on the second real transfer functions, until the training termination condition is reached, obtaining the trained neural network.
  • Figure 10 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
  • The execution device 1000 can be embodied as various terminal devices, such as head-mounted devices (e.g., earphones), mobile phones, tablets, laptops, etc., without limitation here.
  • the execution device 700 described in the corresponding embodiment of FIG. 7 may be deployed on the execution device 1000 to implement the functions of the execution device 700 in the corresponding embodiment of FIG. 7 .
  • the execution device 1000 may include: a receiver 1001, a transmitter 1002, a processor 1003 and a memory 1004 (the number of processors 1003 in the execution device 1000 may be one or more, one processor is taken as an example in Figure 10 ), wherein the processor 1003 may include an application processor 10031 and a communication processor 10032.
  • the receiver 1001, the transmitter 1002, the processor 1003 and the memory 1004 may be connected through a bus or other means.
  • Memory 1004 may include read-only memory and random access memory and provides instructions and data to processor 1003 .
  • a portion of memory 1004 may also include non-volatile random access memory (NVRAM).
  • The memory 1004 stores operating instructions executable by the processor, executable modules or data structures, or a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
  • The processor 1003 controls the operations of the execution device 1000.
  • The various components of the execution device 1000 are coupled together through a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and so on; for clarity, however, the various buses are all referred to as the bus system in the figure.
  • the method disclosed in the corresponding embodiment of FIG. 6 of this application can be applied to the processor 1003, or implemented by the processor 1003.
  • The processor 1003 may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by an integrated logic circuit in hardware in the processor 1003 or by instructions in the form of software.
  • The above-mentioned processor 1003 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the processor 1003 can implement or execute each method, step and logic block diagram disclosed in the embodiment corresponding to Figure 6 of this application.
  • a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
  • the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1004.
  • the processor 1003 reads the information in the memory 1004 and completes the steps of the above method in combination with its hardware.
  • The receiver 1001 may be configured to receive input numeric or character information and generate signal inputs related to the relevant settings and function control of the execution device 1000.
  • The transmitter 1002 can be used to output numeric or character information through a first interface; the transmitter 1002 can also be used to send instructions to a disk group through the first interface to modify data in the disk group; the transmitter 1002 may also include a display device such as a display screen.
  • the processor 1003 is configured to process the first transfer function of the input first bone conduction speech signal through a trained neural network to obtain a corresponding second transfer function.
  • the trained neural network can be obtained through the training method corresponding to Figure 4 of this application. For details, please refer to the description in the method embodiments shown above in this application, and will not be described again here.
  • An embodiment of the present application also provides a computer-readable storage medium storing a program for signal processing which, when run on a computer, causes the computer to perform the steps performed by the training device described in the embodiment shown in Figure 4, or causes the computer to perform the steps performed by the execution device described in the embodiment shown in Figure 6.
  • the training device, execution device, etc. provided by the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor.
  • The communication unit may be, for example, an input/output interface, a pin, or a circuit, etc.
  • The processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the training device performs the steps performed by the training device described in the embodiment shown in Figure 4, or so that the chip in the execution device performs the steps performed by the execution device described in the embodiment shown in Figure 6.
  • the storage unit is a storage unit within the chip, such as a register, cache, etc.
  • The storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
  • the device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be implemented as one or more communication buses or signal lines.
  • The present application can be implemented by software plus the necessary general-purpose hardware; of course, it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and so on. In general, any function performed by a computer program can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software program implementation is the better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk, and includes several instructions to cause a computer device (which can be a personal computer, training device, or execution device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available media may be magnetic media (such as floppy disks, hard disks, magnetic tapes), optical media (such as DVDs), or semiconductor media (such as solid state disks (SSD)), etc.


Abstract

This application discloses a speech signal processing method, a neural network training method, and devices, applied in the signal processing field, including: obtaining a first bone conduction speech signal, extracting excitation parameters from it, and determining its corresponding first transfer function; then inputting the first transfer function into a trained neural network to obtain a second transfer function (i.e., the predicted transfer function of the air conduction speech signal), the trained neural network being obtained by training a neural network on a training data set based on a target loss function; and finally obtaining, from the second transfer function and the excitation parameters, a first air conduction speech signal corresponding to the first bone conduction speech signal. Based on the trained neural network, this application maps the transfer function of a bone conduction speech signal to the transfer function of an air conduction speech signal, achieving timbre compensation from the bone conduction speech signal to the air conduction speech signal, making the bone conduction speech signal sound brighter and more comfortable and improving call efficiency.

Description

A Speech Signal Processing Method, a Neural Network Training Method, and Devices
Technical Field
This application relates to the field of signal processing, and in particular to a speech signal processing method, a neural network training method, and related devices.
Background
In real-time call scenarios, speech is mostly transmitted as air conduction speech signals. The transmission process can be as shown in Figure 1, where user A and user B hold a voice call via mobile phones. Taking user A speaking and user B listening as an example, the process is: (1) user A speaks, producing a speech signal S; (2) the conduction microphone on phone A collects the speech signal S; (3) phone A encodes and compresses the speech signal S and transmits it to phone B over a wireless network; (4) phone B decodes the received signal to recover the speech signal S; (5) phone B plays the speech signal S through its air conduction loudspeaker.
When the above air conduction speech signal is transmitted, the speech signal S contains not only user A's speech but also environmental noise collected at the same time; if the environmental noise is loud, user B's listening experience suffers. Therefore, in some high-noise scenarios, such as mining and rescue, bone conduction speech signals are generally used to convey information, because a bone conduction microphone collects little environmental noise and the audio signal-to-noise ratio is high, which largely guarantees call efficiency in these scenarios. The transmission process is shown in Figure 2, where user A and user B hold a voice call via mobile phones. Taking user A speaking and user B listening (assuming the B side is unchanged) as an example, the process is: (1) user A speaks, producing a speech signal S1; (2) the bone conduction microphone on user A's head-mounted device (such as an earphone) rests against user A's face and, from the vibration of the skin, collects the audio signal S2 produced when user A speaks (which may also be called the bone conduction speech signal S2); this bone conduction speech signal S2 is sent to phone A (for example, over the Bluetooth connection between the head-mounted device and the phone); (3) phone A encodes and compresses the bone conduction speech signal S2 and transmits it to phone B over a wireless network; (4) phone B decodes the received signal to recover the bone conduction speech signal S2; (5) phone B plays the bone conduction speech signal S2 through its air conduction loudspeaker.
Both air conduction and bone conduction speech signals are generated by an excitation source convolved with an excitation channel (i.e., excitation source * excitation channel = speech signal), but the two differ in their specific generation principles: the excitation source of both is the vocal cords, but the excitation channel of an air conduction speech signal is the pharyngeal cavity, oral cavity, etc., while that of a bone conduction speech signal is muscle, bone, etc. Because the two excitation channels differ, the final listening experience differs considerably: since the excitation channel of an air conduction speech signal transmits through air, it has little distortion and high brightness, whereas the excitation channel of a bone conduction speech signal transmits through solids and soft tissue, causing large distortion; the sound is darker than an air conduction speech signal and less comfortable to listen to. Worse, the far end (e.g., the B side) hears an unnatural version of user A's voice and in severe cases may not even be able to tell who is speaking, which reduces call efficiency.
Summary
Embodiments of this application provide a speech signal processing method, a neural network training method, and devices, used to map the transfer function of a bone conduction speech signal (i.e., the first transfer function) to the transfer function of an air conduction speech signal (i.e., the second transfer function) based on a trained neural network, so as to achieve timbre compensation from the bone conduction speech signal to the air conduction speech signal, making the bone conduction speech signal sound brighter and more comfortable and guaranteeing call efficiency in high-noise scenarios such as mining and rescue.
Based on this, embodiments of this application provide the following technical solutions:
In a first aspect, embodiments of this application first provide a speech signal processing method usable in the signal processing field. The method includes: first, obtaining a bone conduction speech signal to be processed, which may be called the first bone conduction speech signal; then, extracting excitation parameters based on the first bone conduction speech signal; then, determining the transfer function of the first bone conduction speech signal (which may also be called the first transfer function) from the first bone conduction speech signal. After the first transfer function of the first bone conduction speech signal is obtained, the first transfer function is input into a trained neural network, which outputs a second transfer function; the second transfer function is the predicted transfer function of the air conduction speech signal, and may therefore also be called the predicted transfer function. Finally, the first air conduction speech signal corresponding to the first bone conduction speech signal is obtained from the second transfer function output by the neural network and the previously determined excitation parameters.
In the above embodiments of this application, based on the trained neural network, the transfer function of the bone conduction speech signal (i.e., the first transfer function) is mapped to the transfer function of an air conduction speech signal (i.e., the second transfer function) to achieve timbre compensation from the bone conduction speech signal to the air conduction speech signal, making the bone conduction speech signal sound brighter and more comfortable, closer to the auditory perception of an air conduction speech signal.
In a possible implementation of the first aspect, the trained neural network is obtained by training a neural network on a training data set based on a target loss function, where the training data set includes multiple training data, the training data include a first real transfer function of a bone conduction speech signal, and the first real transfer function is obtained from the audio signal emitted by a sound source and collected by a bone conduction microphone. The output of the neural network is a predicted transfer function, which corresponds to a second real transfer function of an air conduction speech signal, and the second real transfer function is obtained from the audio signal emitted by the sound source and collected by an air conduction microphone.
The above embodiments of this application describe how the trained neural network is obtained; with a network-based approach, the mapping quality can be improved by increasing the scale of the mapping network, which provides flexibility.
In a possible implementation of the first aspect, the target loss function may be: the error value between the predicted transfer function and the second real transfer function.
The above embodiments of this application describe one implementation of the target loss function, which is simple and easy to apply.
In a possible implementation of the first aspect, the first transfer function of the first bone conduction speech signal may be determined from the first bone conduction speech signal by using the excitation parameters to perform a deconvolution operation on the first bone conduction speech signal, obtaining the first transfer function of the first bone conduction speech signal.
The above embodiments of this application describe one practicable way of obtaining the first transfer function.
In a possible implementation of the first aspect, the first air conduction speech signal corresponding to the first bone conduction speech signal may be obtained from the second transfer function and the excitation parameters by performing a convolution operation on the second transfer function and the excitation parameters, obtaining the first air conduction speech signal corresponding to the first bone conduction speech signal.
The above embodiments of this application describe one practicable way of obtaining the first air conduction speech signal.
In a possible implementation of the first aspect, the first bone conduction speech signal may be obtained as follows: first, an audio signal is collected through a bone conduction microphone; then noise reduction is performed on the collected audio signal, for example with a spectral subtraction noise reduction algorithm, to obtain the first bone conduction speech signal.
In the above embodiments of this application, the first bone conduction speech signal has undergone noise reduction, so the signal spectrum of the speech signal is purer.
In a possible implementation of the first aspect, the excitation parameters include the fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
The above embodiments of this application describe the components included in the extracted excitation parameters, which is practicable.
In a second aspect, embodiments of this application provide a neural network training method. The method includes: first, within a certain period, the audio signal emitted by a sound source (e.g., user A) can be collected through a bone conduction microphone and an air conduction microphone respectively; for example, n (n ≥ 2) bone conduction speech signals and n air conduction speech signals emitted while user A speaks may be collected. Then, from each bone conduction speech signal, the real transfer function corresponding to that bone conduction speech signal (which may be called the first real transfer function) can be obtained, so that the n bone conduction speech signals yield a total of n first real transfer functions. Similarly, from each air conduction speech signal, the real transfer function corresponding to that air conduction speech signal (which may be called the second real transfer function) can be obtained, so that the n air conduction speech signals yield a total of n second real transfer functions. The n first real transfer functions are used as the training data set of the neural network, and the neural network is trained with a target loss function until a training termination condition is reached, yielding the trained neural network; the target loss function is obtained based on the second real transfer functions.
The above embodiments of this application describe how the trained neural network is obtained; with a network-based approach, the mapping quality can be improved by increasing the scale of the mapping network, which provides flexibility.
In a possible implementation of the second aspect, the bone conduction microphone and the air conduction microphone are deployed on the same device, and the training process of the neural network is performed on that device.
In the above embodiments of this application, the bone conduction microphone, the air conduction microphone, and the training process of the neural network are all on the same device, which enables online training and provides flexibility.
In a possible implementation of the second aspect, the same device includes: a head-mounted device, such as an earphone.
The above embodiments of this application describe one practicable form of the same device.
In a possible implementation of the second aspect, the target loss function may be: the error value between the predicted transfer function and the second real transfer function.
The above embodiments of this application describe one implementation of the target loss function, which is simple and easy to apply.
In a possible implementation of the second aspect, reaching the training termination condition includes but is not limited to: the value of the target loss function reaching a preset threshold; or the target loss function starting to converge; or the number of training rounds reaching a preset number; or the training duration reaching a preset duration; or an instruction to terminate training being obtained.
The above embodiments of this application describe several implementations of reaching the training termination condition, which can be selected according to the actual application scenario and thus have broad applicability.
In a third aspect, embodiments of this application provide an execution device having the function of implementing the method of the first aspect or any possible implementation of the first aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a fourth aspect, embodiments of this application provide a training device having the function of implementing the method of the second aspect or any possible implementation of the second aspect. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function.
In a fifth aspect, embodiments of this application provide an execution device that may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to call the program stored in the memory to execute the method of the first aspect or any possible implementation of the first aspect of the embodiments of this application.
In a sixth aspect, embodiments of this application provide a training device that may include a memory, a processor, and a bus system, where the memory is used to store a program and the processor is used to call the program stored in the memory to execute the method of the second aspect or any possible implementation of the second aspect of the embodiments of this application.
In a seventh aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, enable the computer to execute the method of the first aspect or any possible implementation of the first aspect, or enable the computer to execute the method of the second aspect or any possible implementation of the second aspect.
In an eighth aspect, embodiments of this application provide a computer program that, when run on a computer, enables the computer to execute the method of the first aspect or any possible implementation of the first aspect, or enables the computer to execute the method of the second aspect or any possible implementation of the second aspect.
In a ninth aspect, embodiments of this application provide a chip including at least one processor and at least one interface circuit, the interface circuit being coupled to the processor; the at least one interface circuit is used to perform transceiving functions and send instructions to the at least one processor, and the at least one processor is used to run a computer program or instructions having the function of implementing the method of the first aspect or any possible implementation of the first aspect, or having the function of implementing the method of the second aspect or any possible implementation of the second aspect. The function may be implemented by hardware, by software, or by a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the above function. In addition, the interface circuit is used to communicate with modules outside the chip; for example, the interface circuit may send the neural network trained on the chip to a target device (e.g., an earphone, a mobile phone, a personal computer, etc.).
Brief Description of the Drawings
Figure 1 is a schematic diagram of a mobile-phone-based real-time call system provided by an embodiment of this application;
Figure 2 is another schematic diagram of a mobile-phone-based real-time call system provided by an embodiment of this application;
Figure 3 is a schematic diagram of a system architecture of a speech signal processing system provided by an embodiment of this application;
Figure 4 is a schematic flowchart of a neural network training method provided by an embodiment of this application;
Figure 5 is an example diagram of a training process provided by an embodiment of this application;
Figure 6 is a schematic flowchart of a speech signal processing method provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of an execution device provided by an embodiment of this application;
Figure 8 is a schematic structural diagram of a training device provided by an embodiment of this application;
Figure 9 is a schematic structural diagram of another training device provided by an embodiment of this application;
Figure 10 is a schematic structural diagram of another execution device provided by an embodiment of this application.
具体实施方式
本申请实施例提供了一种语音信号的处理方法、神经网络的训练方法及设备,用于基于训练后的神经网络,将骨传导语音信号的传递函数(即第一传递函数)映射为气传导语音信号的传递函数(即第二传递函数),以实现骨传导语音信号到气传导语音信号的音色补偿,使得骨传导语音信号的听音效果更加明亮、舒适,可以保证在如矿山、救援等高噪音场景下的通话效率。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
For a better understanding of the solutions in embodiments of this application, related terms and concepts of neural networks that may be involved are introduced first. It should be understood that the explanations of related concepts may be constrained by the specific circumstances of embodiments of this application, but this does not mean that this application is limited to those circumstances; the specific circumstances may also differ between embodiments, and no limitation is imposed here.
(1) Neural network
A neural network may be composed of neural units, and may specifically be understood as a network with an input layer, hidden layers, and an output layer; in general, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN). The work of each layer in a neural network can be described by the mathematical expression y = a(W·x + b). From a physical point of view, the work of each layer can be understood as completing a transformation from input space to output space (that is, from the row space to the column space of a matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. scaling up/down; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are performed by W·x, operation 4 by "+b", and operation 5 by "a()". The word "space" is used here because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. W is the weight matrix of each layer of the neural network, and each value in the matrix represents the weight of one neuron in that layer. The matrix W determines the spatial transformation from input space to output space described above; that is, the W of each layer controls how space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
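The per-layer operation described above can be sketched numerically. The following is a minimal illustration of y = a(W·x + b) in NumPy; the shapes and the tanh activation are assumptions made for the example, not taken from this application:

```python
import numpy as np

def layer_forward(x, W, b, activation=np.tanh):
    """One layer of a neural network: y = a(W @ x + b).

    W @ x performs the dimension change / scaling / rotation (operations 1-3),
    "+ b" performs the translation (operation 4), and the nonlinearity a()
    performs the "bending" (operation 5).
    """
    return activation(W @ x + b)

# Example: map a 3-dimensional input to a 2-dimensional output (dimension lowering).
rng = np.random.default_rng(0)
x = rng.standard_normal(3)        # input vector
W = rng.standard_normal((2, 3))   # weight matrix of this layer
b = rng.standard_normal(2)        # bias vector
y = layer_forward(x, W, b)
print(y.shape)  # (2,)
```

Stacking several such calls, each with its own W and b, gives the input-hidden-output structure described above.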
(2) Loss function
In the process of training a neural network, because it is hoped that the output of the network is as close as possible to the value actually desired, the predicted value of the current network can be compared with the truly desired target value, and the weight matrix of each layer can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the neural network). For example, if the predicted value of the network is too high, the weight matrix is adjusted so that it predicts lower, and the adjustment continues until the neural network can predict the truly desired target value. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function or objective function, an important equation for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network becomes a process of reducing this loss as much as possible.
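As a hedged illustration of what a loss function measures (the mean absolute error used here is just one common choice; this application later uses the error between a predicted and a real transfer function):

```python
import numpy as np

def loss(predicted, target):
    # Mean absolute error: the higher the value, the further the
    # prediction is from the truly desired target.
    return float(np.mean(np.abs(predicted - target)))

target = np.array([1.0, 2.0, 3.0])
far_guess = np.array([3.0, 0.0, 1.0])    # a poor prediction
near_guess = np.array([1.1, 1.9, 3.0])   # a better prediction

# Training drives the network from high-loss predictions toward low-loss ones.
print(loss(far_guess, target), loss(near_guess, target))
```

A perfect prediction gives a loss of exactly zero, which is the limiting case training aims toward.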
(3) Back propagation algorithm
In the training process of a neural network, the back propagation (BP) algorithm can be used to revise the parameter values in the initial neural network model, so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, forward propagation of the input signal to the output produces an error loss, and the parameters of the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward movement dominated by the error loss, aiming to obtain optimal parameters of the neural network model, such as the weight matrices.
The embodiments of this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art will appreciate that, as technology develops and new scenarios emerge, the technical solutions provided in embodiments of this application are equally applicable to similar technical problems.
To facilitate understanding of this solution, the system architecture of the speech signal processing system provided in embodiments of this application is first introduced with reference to FIG. 3, which is a schematic diagram of a system architecture of a speech signal processing system according to an embodiment of this application. In FIG. 3, the speech signal processing system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250, and a data collection device 260 (such as a bone conduction microphone and an air conduction microphone); the execution device 210 includes a computation module 211. The data collection device 260 is configured to obtain the large-scale data set needed by the user (that is, the training data set, also called the training set, which includes training data) and store the training set in the database 230; the training device 220 trains the neural network 201 constructed in this application based on the training set maintained in the database 230. The trained neural network 201 is then applied on the execution device 210. The execution device 210 may invoke data, code, and so on in the data storage system 250, and may also store data, instructions, and so on in the data storage system 250. The data storage system 250 may be placed in the execution device 210, or may be external memory relative to the execution device 210.
The trained neural network 201 obtained through training by the training device 220 may be applied to different systems or devices (that is, the execution device 210), which may specifically be an end-side device or a head-mounted device, for example, a headset, a mobile phone, a tablet, a computer, or a cloud server. In FIG. 3, the execution device 210 may be configured with an I/O interface 212 for data interaction with external devices, and a "user" may input data to the I/O interface 212 through the client device 240. For example, the client device 240 may be a bone conduction microphone; the bone-conduction speech signal collected by this client device is input to the execution device 210, which first computes the real transfer function of the bone-conduction speech signal (that is, the first transfer function described below), then inputs this real transfer function to the computation module 211, which processes it to obtain a predicted transfer function (that is, the second transfer function described below). The predicted transfer function is then saved on a storage medium of the execution device 210 for subsequent downstream tasks.
In addition, in some implementations of this application, the client device 240 may also be integrated in the execution device 210. For example, when the execution device 210 is a head-mounted device (such as a headset), the bone-conduction speech signal can be obtained directly through the head-mounted device (for example, collected by a bone conduction microphone on the head-mounted device itself, or received by the head-mounted device from another device; the source of the bone-conduction speech signal is not limited here). The computation module 211 in the head-mounted device then processes the real transfer function of the bone-conduction speech signal to obtain the predicted transfer function, and saves the obtained predicted transfer function. The product forms of the execution device 210 and the client device 240 are not limited here.
It should also be noted that, in some other implementations of this application, the data collection device 260 and/or the training device 220 may also be integrated in the execution device 210. For example, when the execution device 210 is a head-mounted device (such as a headset), the timbre of the bone-conduction speech signals collected when different people wear the head-mounted device may differ, so the data collection device 260 and/or the training device 220 can be integrated in the execution device 210. When user A wears the head-mounted device, the data collection device 260 (such as a bone conduction microphone) collects user A's speech, the training device 220 trains the neural network 201 (the real transfer function of the air-conduction speech signal is obtained based on the air-conduction speech signal collected by the air conduction microphone), and the trained neural network 201 is directly used for user A's subsequent applications. Similarly, when user B wears the head-mounted device, the data collection device 260 (such as a bone conduction microphone) collects user B's speech, the training device 220 trains the neural network 201, and the trained neural network 201 is directly used for user B's subsequent applications. This makes the trained neural network 201 more accurate and adaptable when different users use the execution device 210, providing flexibility.
It is worth noting that FIG. 3 is merely a schematic diagram of one system architecture provided by an embodiment of this application, and the positional relationships between the devices, components, modules, and so on shown in the figure do not constitute any limitation. For example, in FIG. 3 the data storage system 250 is external memory relative to the execution device 210, but in other cases the data storage system 250 may be placed in the execution device 210; in FIG. 3 the client device 240 is an external device relative to the execution device 210, but in other cases the client device 240 may be integrated in the execution device 210; in FIG. 3 the training device 220 is an external device relative to the execution device 210, but in other cases the training device 220 may be integrated in the execution device 210; in FIG. 3 the data collection device 260 is an external device relative to the execution device 210, but in other cases the data collection device 260 may be integrated in the execution device 210; and so on. This application imposes no limitation in this regard.
Since the application of a neural network is generally divided into two phases, a training phase and an application phase (also called an inference phase), the specific flows of the neural network training method and the speech signal processing method provided in embodiments of this application are described below for each of these two phases.
I. Training phase
In embodiments of this application, the training phase refers to the process in FIG. 3 in which the data collection device 260 collects training data and the training device 220 performs training operations on the neural network 201 using the training data. Specifically, referring to FIG. 4, which is a schematic flowchart of a neural network training method according to an embodiment of this application, the method may include the following steps:
401. Collect audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, to obtain n bone-conduction speech signals and n air-conduction speech signals, n ≥ 2.
First, audio signals emitted by a sound source (for example, user A) can be collected over some duration through a bone conduction microphone and an air conduction microphone respectively; for example, n (n ≥ 2) bone-conduction speech signals s(t) and n air-conduction speech signals y(t) emitted while user A speaks can be collected.
402. Obtain n first real transfer functions and n second real transfer functions from the n bone-conduction speech signals and the n air-conduction speech signals respectively, where the first real transfer functions correspond one-to-one with the bone-conduction speech signals, and the second real transfer functions correspond one-to-one with the air-conduction speech signals.
Afterwards, the real transfer function h_b(t) corresponding to each bone-conduction speech signal s(t) (which may be called the first real transfer function h_b(t)) can be obtained from each bone-conduction speech signal s(t); n bone-conduction speech signals s(t) thus yield a total of n first real transfer functions h_b(t). Similarly, the real transfer function h_y(t) corresponding to each air-conduction speech signal y(t) (which may be called the second real transfer function h_y(t)) can be obtained from each air-conduction speech signal y(t); n air-conduction speech signals y(t) thus yield a total of n second real transfer functions h_y(t).
403. Use the n first real transfer functions as the training data set of the neural network, and train the neural network using a target loss function until a training termination condition is reached, to obtain a trained neural network, where the target loss function is obtained based on the second real transfer functions.
The n first real transfer functions h_b(t) are used as the training data set of the neural network, and the neural network is trained using a target loss function L until a training termination condition is reached, thereby obtaining the trained neural network; the target loss function is obtained based on the second real transfer functions. As an example, the target loss function L may be the error value between the predicted transfer function output by the neural network and the second real transfer function.
Specifically, the training process may be as follows: the first real transfer function h_b(t) is used as the input of the neural network, and the output of the neural network is the predicted transfer function h_c(t); training is then iterated continuously until the training termination condition is reached, where the target loss function is L = |h_c(t) - h_y(t)|. As an example, FIG. 5 uses recurrent neural networks (RNNs) to illustrate the relationship between the input data (that is, the training data) and the output (that is, the predicted transfer function) of the RNNs.
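The training loop above can be sketched in miniature. In the following hedged illustration, a single linear layer stands in for the network (an assumption for compactness; FIG. 5 uses an RNN), synthetic vectors stand in for the transfer functions, and the loss L = |h_c(t) − h_y(t)| serves as the termination criterion; for a differentiable update, the gradient step uses the squared error while the absolute error is monitored for termination:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8                                       # assumed length of each transfer function
A_true = rng.standard_normal((dim, dim))      # synthetic "bone-to-air" mapping to recover
h_b = rng.standard_normal((100, dim))         # n first real transfer functions h_b(t)
h_y = h_b @ A_true.T                          # n second real transfer functions h_y(t)

W = np.zeros((dim, dim))                      # network parameters to be learned
lr = 0.1
loss = np.inf
for step in range(500):
    h_c = h_b @ W.T                           # predicted transfer function h_c(t)
    err = h_c - h_y
    loss = float(np.mean(np.abs(err)))        # target loss L = |h_c(t) - h_y(t)|
    if loss < 1e-4:                           # termination: loss reaches a preset threshold
        break
    grad = 2.0 * err.T @ h_b / len(h_b)       # squared-error gradient w.r.t. W
    W -= lr * grad                            # back-propagation-style parameter update
```

After the loop terminates, W has effectively learned the mapping from the first real transfer functions to the second.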
It should be noted that, in some implementations of this application, reaching the training termination condition includes but is not limited to:
(1) The target loss function reaches a preset threshold.
After the target loss function is configured, a threshold (for example, 0.03) can be set for it in advance. During iterative training of the neural network, after each round of training it is judged whether the value of the target loss function obtained in the current round reaches the threshold; if not, training continues, and if the preset threshold is reached, training is terminated, and the network parameter values of the neural network determined in the current round are used as the network parameter values of the finally trained neural network.
(2) The target loss function begins to converge.
After the target loss function is configured, the neural network can be trained iteratively. If the difference between the value of the target loss function obtained in the current round and that obtained in the previous round is within a preset range (for example, within 0.01), the target loss function is considered to have converged and training can be terminated, and the network parameter values of the neural network determined in the current round are used as the network parameter values of the finally trained neural network.
(3) The number of training iterations reaches a preset count.
In this manner, the number of iterations for training the neural network (for example, 100) can be configured in advance. After the target loss function is configured, the neural network can be trained iteratively, and after each round of training the network parameter values of the neural network for that round are stored, until the number of training iterations reaches the preset count. Afterwards, test data is used to verify the neural networks obtained in each round, and the network parameter values with the best performance are selected as the final network parameter values of the neural network.
(4) The training duration reaches a preset duration.
In this manner, the iteration duration for training the neural network (for example, 5 minutes) can be configured in advance. After the target loss function is configured, the neural network can be trained iteratively, and after each round of training the network parameter values of the neural network for that round are stored, until the training duration reaches the preset duration. Afterwards, test data is used to verify the neural networks obtained in each round, and the network parameter values with the best performance are selected as the final network parameter values of the neural network.
(5) A training termination instruction is received.
In this manner, a training switch can be set in advance to trigger the generation of training start and training termination instructions. When the training switch is enabled, a training start instruction is triggered and the neural network starts iterative training; when the training switch is disabled, a training termination instruction is triggered and the neural network stops training. The period from the switch being enabled to being disabled is the training duration of the neural network. After the target loss function is configured, the neural network can be trained iteratively by turning the training switch on or off; after each round of training, the network parameter values of the neural network for that round are stored, until the training switch is turned off. Afterwards, test data is used to verify the neural networks obtained in each round, and the network parameter values with the best performance are selected as the final network parameter values of the neural network.
It should also be noted that, in embodiments of this application, the n bone-conduction speech signals s(t) and n air-conduction speech signals y(t) may all be collected at the beginning, and the neural network then trained iteratively based on these collected signals (that is, the required training data is collected first and the neural network trained afterwards); alternatively, the neural network may be trained once each time one bone-conduction speech signal s(t) and one air-conduction speech signal y(t) are collected, and if the neural network after the current round has not reached the training termination condition, the process of collecting training data once and performing one round of training is repeated until the training termination condition is reached (that is, data is collected as many times as iterations are needed). This application imposes no limitation on the order between data collection and training.
It should be noted that, in some implementations of this application, the bone conduction microphone and the air conduction microphone may be deployed on the same device, for example both on a head-mounted device (such as a headset), and the training process of the neural network may also be carried out on that device; that is, the training device is that device. In this case, the training process of the neural network described above is an online training process. As an example, the online training process may be implemented as follows:
① First, the wearer puts on the head-mounted device (such as a headset).
② The online training switch is enabled; when the wearer turns on the switch, the head-mounted device is triggered to start training.
③ With online training enabled, the wearer speaks, and the head-mounted device simultaneously collects the wearer's bone-conduction speech signal s(t) and air-conduction speech signal y(t) through the bone conduction microphone and the air conduction microphone.
④ The online training switch is disabled; when the wearer turns off the switch, the head-mounted device is triggered to terminate training (that is, the training termination condition is receiving a training termination instruction). The period from the online training switch being turned on to being turned off is the training duration of the neural network. Within this duration, the head-mounted device uses the first real transfer function h_b(t) corresponding to the bone-conduction speech signal s(t) as the input of the neural network, with the predicted transfer function h_c(t) as the output, and iterates training continuously; the network parameters obtained in the last round when the online training switch is turned off (or the best network parameters among all rounds) are saved as the final network parameters of the neural network.
It should be noted that, in some implementations of this application, the online training process of the neural network may also be carried out somewhere other than the device on which the bone conduction microphone and the air conduction microphone are deployed. As an example, suppose the online training process of the neural network takes place in an online training module and the device is a headset; the online training module may be deployed on the headset, or on another device such as a mobile phone, a computer, or a cloud server: a. if it is deployed on the headset, the system needs to send the air conduction microphone's signal to the headset; b. if it is deployed on another device, the system needs to send the bone conduction microphone's signal to that device. Accordingly, the trained neural network may be saved on the headset or on another device: a. if it is saved on the headset, then after the other device has trained the neural network, it needs to send the trained neural network to the headset, which saves it; b. if it is saved on another device, then after the headset has trained the neural network, it can send the trained neural network to the other device for saving, and obtain the trained neural network from the other device when it needs to use it.
It should be noted that, in some implementations of this application, the training process of the neural network may also be an offline training process (that is, the neural network is trained in advance). As an example, the offline training process may be implemented as follows:
① Simultaneously collect audio signals emitted by a sound source through the bone conduction microphone and the air conduction microphone, to obtain the bone-conduction speech signal s(t) and the air-conduction speech signal y(t).
② Use the first real transfer function h_b(t) corresponding to the bone-conduction speech signal s(t) as the input of the neural network, with the predicted transfer function h_c(t) as the output; iterate training continuously until the training termination condition is reached, and save the network parameters obtained in the last round (or the best network parameters among all rounds) as the final network parameters of the neural network.
II. Application phase
In embodiments of this application, the application phase refers to the process in FIG. 3 in which the execution device 210 processes input data using the trained neural network 201. Specifically, referring to FIG. 6, which is a schematic flowchart of a speech signal processing method according to an embodiment of this application, the method may include the following steps:
601. Obtain a first bone-conduction speech signal, and extract excitation parameters based on the first bone-conduction speech signal.
First, the bone-conduction speech signal s(t) to be processed is obtained; it may be called the first bone-conduction speech signal s(t). Excitation parameters e(t) are then extracted based on the first bone-conduction speech signal s(t), where the excitation parameters e(t) may include the fundamental frequency of the first bone-conduction speech signal s(t) and the harmonics of the fundamental frequency, and can be obtained by analysis using the linear predictive coding (LPC) method for speech signals.
Optionally, in some implementations of this application, the first bone-conduction speech signal s(t) may be obtained as follows: first, an audio signal x(t) (which may also be called the bone-conduction audio signal x(t)) is collected by the bone conduction microphone, and the audio signal x(t) is then denoised to obtain the first bone-conduction speech signal s(t). As an example, denoising can be performed with a spectral subtraction algorithm, which includes the following steps: 1) compute the noise spectrum of the audio signal x(t): first perform voice activity detection (VAD) analysis on the audio signal x(t), take the non-speech part n(t), and perform a fast Fourier transform (FFT) on that non-speech part n(t) to obtain N(w), the background noise spectrum; 2) denoise: perform an FFT on the audio signal x(t) to obtain X(w), then subtract the background noise spectrum N(w) to obtain the spectrum S(w) of the clean speech signal, that is, S(w) = X(w) - N(w); then apply an inverse fast Fourier transform (IFFT) to S(w) to obtain the first bone-conduction speech signal s(t), which is the denoised clean speech signal.
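The spectral-subtraction steps above can be sketched as follows. This is an illustrative toy, not this application's implementation: a pure tone stands in for speech, the noise-only stretch is taken as given rather than found by VAD, the whole signal is processed as one frame, and the subtraction is applied to magnitudes with the noisy phase retained:

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 8000
t = np.arange(fs) / fs

clean = np.sin(2 * np.pi * 200 * t)            # stand-in for the clean speech s(t)
x = clean + 0.3 * rng.standard_normal(fs)      # noisy microphone signal x(t)
noise_only = 0.3 * rng.standard_normal(fs)     # "non-speech part" n(t), as if from VAD

# 1) Noise spectrum: FFT of the non-speech part gives N(w).
N_mag = np.abs(np.fft.rfft(noise_only))
# 2) Denoise: S(w) = X(w) - N(w) on magnitudes (floored at 0), keeping the noisy phase.
X = np.fft.rfft(x)
S_mag = np.maximum(np.abs(X) - N_mag, 0.0)
S = S_mag * np.exp(1j * np.angle(X))
s_hat = np.fft.irfft(S, n=len(x))              # IFFT -> denoised signal s(t)

err_noisy = np.mean((x - clean) ** 2)          # noise power before subtraction
err_denoised = np.mean((s_hat - clean) ** 2)   # residual noise power after subtraction
```

The residual error after subtraction is substantially smaller than the original noise power, which is the effect the denoising step relies on.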
602. Determine a first transfer function of the first bone-conduction speech signal from the first bone-conduction speech signal.
Afterwards, the transfer function h_b(t) of the first bone-conduction speech signal s(t), which may also be called the first transfer function h_b(t), is determined from the first bone-conduction speech signal s(t). Specifically, the obtained excitation parameters e(t) can be used to perform a deconvolution operation on the first bone-conduction speech signal s(t) to obtain the first transfer function h_b(t) of the first bone-conduction speech signal s(t), that is, h_b(t) = e^{-1}(t) * s(t).
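The deconvolution h_b(t) = e^{-1}(t) * s(t) can be illustrated in the frequency domain, where convolution becomes multiplication and deconvolution becomes division. In this hedged sketch, the synthetic excitation, the 4-tap transfer function, and the small regularizer `eps` are all assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(3)
e = rng.standard_normal(256)                  # excitation e(t)
h_true = np.array([1.0, 0.5, 0.25, 0.125])    # "true" transfer function (synthetic)
s = np.convolve(e, h_true)                    # observed signal s(t) = e(t) * h_b(t)

n = len(s)
E = np.fft.rfft(e, n)
S = np.fft.rfft(s, n)
eps = 1e-8                                    # regularizer against near-zero bins (assumption)
H = S * np.conj(E) / (np.abs(E) ** 2 + eps)   # ~ S(w) / E(w)
h_est = np.fft.irfft(H, n)[: len(h_true)]     # back to the time domain: h_est ~ h_true
```

Zero-padding both FFTs to the length of s(t) makes the circular division equivalent to undoing the linear convolution, so `h_est` recovers the transfer function to numerical precision in this noise-free setting.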
603. Input the first transfer function into the trained neural network to obtain an output second transfer function, where the second transfer function is a predicted transfer function of an air-conduction speech signal.
After the first transfer function h_b(t) of the first bone-conduction speech signal s(t) is obtained, the first transfer function h_b(t) is input into the trained neural network, which outputs the second transfer function h_c(t); the second transfer function h_c(t) is the predicted transfer function of an air-conduction speech signal and may therefore also be called the predicted transfer function h_c(t).
It should be noted that, in embodiments of this application, the trained neural network is the one obtained through the training phase described above; that is, the trained neural network is obtained by training the neural network with a training data set based on a target loss function, where the training data set includes multiple training data (including first real transfer functions of bone-conduction speech signals), the first real transfer functions being obtained based on the audio signal emitted by a sound source and collected by a bone conduction microphone; the output of the neural network is a predicted transfer function, which corresponds to a second real transfer function of an air-conduction speech signal, the second real transfer function being obtained based on the audio signal emitted by the sound source and collected by an air conduction microphone. Optionally, the target loss function may be the error value between the predicted transfer function output by the neural network and the second real transfer function.
604. Obtain, from the second transfer function and the excitation parameters, a first air-conduction speech signal corresponding to the first bone-conduction speech signal.
Finally, the first air-conduction speech signal s'(t) corresponding to the first bone-conduction speech signal s(t) is obtained from the second transfer function h_c(t) output by the neural network and the previously determined excitation parameters e(t). Specifically, a convolution operation can be performed on the second transfer function h_c(t) and the excitation parameters e(t) to obtain the first air-conduction speech signal s'(t) corresponding to the first bone-conduction speech signal s(t), that is, s'(t) = e(t) * h_c(t).
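Step 604 reduces to a single convolution. The following toy sketch illustrates it; the excitation length and the 3-tap stand-in for the network's predicted h_c(t) are assumptions of the example, since the real h_c(t) comes from the trained network:

```python
import numpy as np

rng = np.random.default_rng(4)
e = rng.standard_normal(128)            # excitation parameters e(t) from step 601
h_c = np.array([0.9, 0.4, 0.1])         # predicted transfer function h_c(t) (stand-in)
s_air = np.convolve(e, h_c)             # s'(t) = e(t) * h_c(t)
print(s_air.shape)  # (130,)
```

The full convolution has length 128 + 3 - 1 = 130, and its first sample is simply e(0)·h_c(0), matching the definition of linear convolution.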
It should be noted that, in some implementations of this application, subsequent processing is performed based on the timbre-compensated speech signal (that is, the first air-conduction speech signal s'(t)), including but not limited to:
1) Voice calls: the brightness of call speech can be enhanced, achieving better perceptual speech quality, ensuring that the other party's perception is unaffected, and improving the perceived brightness and comfort of call speech.
2) Speech recognition: improving the accuracy of speech recognition on bone-conducted speech.
3) Voiceprint recognition: improving the accuracy of voiceprint recognition on bone-conducted speech.
On the basis of the foregoing embodiments, to better implement the foregoing solutions of embodiments of this application, related devices for implementing those solutions are provided below. Referring specifically to FIG. 7, which is a schematic structural diagram of an execution device according to an embodiment of this application, the execution device 700 may include: an obtaining module 701, a first determining module 702, a computation module 703, and a second determining module 704, where the obtaining module 701 is configured to obtain a first bone-conduction speech signal and extract excitation parameters based on the first bone-conduction speech signal; the first determining module 702 is configured to determine a first transfer function of the first bone-conduction speech signal from the first bone-conduction speech signal; the computation module 703 is configured to input the first transfer function into a trained neural network to obtain an output second transfer function, the second transfer function being a predicted transfer function of an air-conduction speech signal; and the second determining module 704 is configured to obtain, from the second transfer function and the excitation parameters, a first air-conduction speech signal corresponding to the first bone-conduction speech signal.
In a possible design, the trained neural network is obtained by training a neural network with a training data set based on a target loss function; the training data set includes multiple training data, the training data including first real transfer functions of bone-conduction speech signals, the first real transfer functions being obtained based on the audio signal emitted by a sound source and collected by a bone conduction microphone; the output of the neural network is a predicted transfer function, which corresponds to a second real transfer function of an air-conduction speech signal, the second real transfer function being obtained based on the audio signal emitted by the sound source and collected by an air conduction microphone.
In a possible design, the target loss function includes: the error value between the predicted transfer function and the second real transfer function.
In a possible design, the first determining module 702 is specifically configured to: perform a deconvolution operation on the first bone-conduction speech signal using the excitation parameters, to obtain the first transfer function of the first bone-conduction speech signal.
In a possible design, the second determining module 704 is specifically configured to: perform a convolution operation on the second transfer function and the excitation parameters, to obtain the first air-conduction speech signal corresponding to the first bone-conduction speech signal.
In a possible design, the obtaining module 701 is specifically configured to: collect an audio signal through a bone conduction microphone, and denoise the audio signal to obtain the first bone-conduction speech signal.
In a possible design, the excitation parameters include the fundamental frequency of the first bone-conduction speech signal and the harmonics of the fundamental frequency.
It should be noted that the information exchange and execution processes between the modules/units of the execution device 700 are based on the same concept as the method embodiment corresponding to FIG. 6 of this application; for details, refer to the descriptions in the foregoing method embodiments, which are not repeated here.
An embodiment of this application further provides a training device. Referring specifically to FIG. 8, which is a schematic diagram of a training device according to an embodiment of this application, the training device 800 may include: a collection module 801, a computation module 802, and an iteration module 803, where the collection module 801 is configured to collect audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, to obtain n bone-conduction speech signals and n air-conduction speech signals, n ≥ 2; the computation module 802 is configured to obtain n first real transfer functions and n second real transfer functions from the n bone-conduction speech signals and the n air-conduction speech signals respectively, the first real transfer functions corresponding one-to-one with the bone-conduction speech signals and the second real transfer functions corresponding one-to-one with the air-conduction speech signals; and the iteration module 803 is configured to use the n first real transfer functions as the training data set of a neural network and train the neural network using a target loss function until a training termination condition is reached, to obtain a trained neural network, the target loss function being obtained based on the second real transfer functions.
In a possible design, the bone conduction microphone and the air conduction microphone are deployed in the training device 800.
In a possible design, the training device 800 includes: a head-mounted device.
In a possible design, the output of the neural network is a predicted transfer function, and the target loss function includes: the error value between the predicted transfer function and the second real transfer function.
In a possible design, reaching the training termination condition includes: the value of the target loss function reaching a preset threshold; or the target loss function beginning to converge; or the number of training iterations reaching a preset count; or the training duration reaching a preset duration; or a training termination instruction being received.
It should be noted that the information exchange and execution processes between the modules/units of the training device 800 are based on the same concept as the method embodiment corresponding to FIG. 4 of this application; for details, refer to the descriptions in the foregoing method embodiments, which are not repeated here.
Next, another training device provided by an embodiment of this application is introduced. Referring to FIG. 9, which is a schematic structural diagram of a training device according to an embodiment of this application, the training device 800 described in the embodiment corresponding to FIG. 8 may be deployed on the training device 900 to implement the functions of the training device 800. Specifically, the training device 900 is implemented by one or more servers and may vary considerably depending on configuration or performance; it may include one or more central processing units (CPU) 922 and memory 932, and one or more storage media 930 (for example, one or more mass storage devices) storing application programs 942 or data 944. The memory 932 and the storage medium 930 may be transient or persistent storage. The program stored in the storage medium 930 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations for the training device 900. Furthermore, the central processing unit 922 may be configured to communicate with the storage medium 930 and execute, on the training device 900, the series of instruction operations in the storage medium 930.
The training device 900 may further include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input/output interfaces 958, and/or one or more operating systems 941, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In this embodiment of this application, the central processing unit 922 is configured to perform the neural network training method of the embodiment corresponding to FIG. 4. For example, the central processing unit 922 may be configured to: first, collect audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, to obtain n bone-conduction speech signals and n air-conduction speech signals, n ≥ 2; then obtain n first real transfer functions and n second real transfer functions from the n bone-conduction speech signals and the n air-conduction speech signals respectively, the first real transfer functions corresponding one-to-one with the bone-conduction speech signals and the second real transfer functions corresponding one-to-one with the air-conduction speech signals; and finally, use the n first real transfer functions as the training data set of the neural network and train the neural network using a target loss function until a training termination condition is reached, to obtain a trained neural network, the target loss function being obtained based on the second real transfer functions.
It should be noted that the specific manner in which the central processing unit 922 performs the foregoing steps is based on the same concept as the method embodiment corresponding to FIG. 4 of this application, and the technical effects it brings are the same as those of the foregoing embodiments; for details, refer to the descriptions in the foregoing method embodiments, which are not repeated here.
Next, an execution device provided by an embodiment of this application is introduced. Referring to FIG. 10, which is a schematic structural diagram of an execution device according to an embodiment of this application, the execution device 1000 may be embodied as various terminal devices, such as a head-mounted device (for example, a headset), a mobile phone, a tablet, or a laptop, which is not limited here. The execution device 700 described in the embodiment corresponding to FIG. 7 may be deployed on the execution device 1000 to implement the functions of the execution device 700. Specifically, the execution device 1000 may include: a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004 (the number of processors 1003 in the execution device 1000 may be one or more, with one processor taken as an example in FIG. 10), where the processor 1003 may include an application processor 10031 and a communication processor 10032. In some embodiments of this application, the receiver 1001, the transmitter 1002, the processor 1003, and the memory 1004 may be connected by a bus or in another manner.
The memory 1004 may include read-only memory and random access memory, and provides instructions and data to the processor 1003. A portion of the memory 1004 may also include non-volatile random access memory (NVRAM). The memory 1004 stores processor-executable operation instructions, executable modules, or data structures, or subsets or extended sets thereof, where the operation instructions may include various operation instructions for implementing various operations.
The processor 1003 controls the operation of the execution device 1000. In a specific application, the components of the execution device 1000 are coupled together through a bus system, which, in addition to a data bus, may include a power bus, a control bus, a status signal bus, and so on. For clarity, however, the various buses are all referred to as the bus system in the figure.
The method disclosed in the embodiment corresponding to FIG. 6 of this application may be applied to the processor 1003 or implemented by the processor 1003. The processor 1003 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the foregoing method may be completed by integrated logic circuits of hardware in the processor 1003 or by instructions in the form of software. The processor 1003 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1003 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiment corresponding to FIG. 6 of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, and so on. The steps of the methods disclosed in embodiments of this application may be directly embodied as being completed by a hardware decoding processor, or completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register. The storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004 and completes the steps of the foregoing method in combination with its hardware.
The receiver 1001 may be configured to receive input numeric or character information and generate signal inputs related to the settings and function control of the execution device 1000. The transmitter 1002 may be configured to output numeric or character information through a first interface; the transmitter 1002 may also be configured to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 1002 may also include a display device such as a display screen.
In this embodiment of this application, in one case, the processor 1003 is configured to process, through the trained neural network, the first transfer function of the input first bone-conduction speech signal to obtain the corresponding second transfer function. The trained neural network may be obtained through the training method corresponding to FIG. 4 of this application; for details, refer to the descriptions in the foregoing method embodiments, which are not repeated here.
An embodiment of this application further provides a computer-readable storage medium storing a program for signal processing that, when run on a computer, causes the computer to perform the steps performed by the training device as described in the embodiment shown in FIG. 4, or causes the computer to perform the steps performed by the execution device as described in the embodiment shown in FIG. 6.
The training device, execution device, and so on provided in embodiments of this application may specifically be chips. A chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit may execute computer-executable instructions stored in a storage unit, so that the chip in the training device performs the steps performed by the training device as described in the embodiment shown in FIG. 4, or so that the chip in the execution device performs the steps performed by the execution device as described in the embodiment shown in FIG. 6.
Optionally, the storage unit is a storage unit within the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip within the wireless access device, such as read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or random access memory (RAM).
It should also be noted that the device embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solutions of the embodiments. In addition, in the accompanying drawings of the device embodiments provided in this application, the connection relationships between modules indicate that they have communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and so on. In general, any function completed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. For this application, however, software program implementation is the better implementation in most cases. Based on this understanding, the technical solutions of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, including several instructions that cause a computer device (which may be a personal computer, a training device, an execution device, or the like) to perform the methods described in the embodiments of this application.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center integrating one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)), etc.

Claims (29)

  1. A speech signal processing method, comprising:
    obtaining a first bone-conduction speech signal, and extracting excitation parameters based on the first bone-conduction speech signal;
    determining a first transfer function of the first bone-conduction speech signal from the first bone-conduction speech signal;
    inputting the first transfer function into a trained neural network to obtain an output second transfer function, wherein the second transfer function is a predicted transfer function of an air-conduction speech signal; and
    obtaining, from the second transfer function and the excitation parameters, a first air-conduction speech signal corresponding to the first bone-conduction speech signal.
  2. The method according to claim 1, wherein
    the trained neural network is obtained by training a neural network with a training data set based on a target loss function;
    the training data set comprises multiple training data, the training data comprising a first real transfer function of a bone-conduction speech signal, the first real transfer function being obtained based on an audio signal emitted by a sound source and collected by a bone conduction microphone; and
    the output of the neural network is a predicted transfer function, the predicted transfer function corresponding to a second real transfer function of an air-conduction speech signal, the second real transfer function being obtained based on the audio signal emitted by the sound source and collected by an air conduction microphone.
  3. The method according to claim 2, wherein the target loss function comprises:
    the error value between the predicted transfer function and the second real transfer function.
  4. The method according to any one of claims 1 to 3, wherein determining the first transfer function of the first bone-conduction speech signal from the first bone-conduction speech signal comprises:
    performing a deconvolution operation on the first bone-conduction speech signal using the excitation parameters, to obtain the first transfer function of the first bone-conduction speech signal.
  5. The method according to any one of claims 1 to 4, wherein obtaining, from the second transfer function and the excitation parameters, the first air-conduction speech signal corresponding to the first bone-conduction speech signal comprises:
    performing a convolution operation on the second transfer function and the excitation parameters, to obtain the first air-conduction speech signal corresponding to the first bone-conduction speech signal.
  6. The method according to any one of claims 1 to 5, wherein obtaining the first bone-conduction speech signal comprises:
    collecting an audio signal through a bone conduction microphone; and
    denoising the audio signal to obtain the first bone-conduction speech signal.
  7. The method according to any one of claims 1 to 6, wherein the excitation parameters comprise the fundamental frequency of the first bone-conduction speech signal and the harmonics of the fundamental frequency.
  8. A neural network training method, comprising:
    collecting audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, to obtain n bone-conduction speech signals and n air-conduction speech signals, n ≥ 2;
    obtaining n first real transfer functions and n second real transfer functions from the n bone-conduction speech signals and the n air-conduction speech signals respectively, the first real transfer functions corresponding one-to-one with the bone-conduction speech signals and the second real transfer functions corresponding one-to-one with the air-conduction speech signals; and
    using the n first real transfer functions as the training data set of a neural network, and training the neural network using a target loss function until a training termination condition is reached, to obtain a trained neural network, the target loss function being obtained based on the second real transfer functions.
  9. The method according to claim 8, wherein the bone conduction microphone and the air conduction microphone are deployed on the same device, and the training process of the neural network is performed on the device.
  10. The method according to claim 9, wherein the device comprises:
    a head-mounted device.
  11. The method according to any one of claims 8 to 10, wherein the output of the neural network is a predicted transfer function, and the target loss function comprises:
    the error value between the predicted transfer function and the second real transfer function.
  12. The method according to any one of claims 8 to 11, wherein reaching the training termination condition comprises:
    the value of the target loss function reaching a preset threshold;
    or,
    the target loss function beginning to converge;
    or,
    the number of training iterations reaching a preset count;
    or,
    the training duration reaching a preset duration;
    or,
    a training termination instruction being received.
  13. An execution device, comprising:
    an obtaining module, configured to obtain a first bone-conduction speech signal and extract excitation parameters based on the first bone-conduction speech signal;
    a first determining module, configured to determine a first transfer function of the first bone-conduction speech signal from the first bone-conduction speech signal;
    a computation module, configured to input the first transfer function into a trained neural network to obtain an output second transfer function, the second transfer function being a predicted transfer function of an air-conduction speech signal; and
    a second determining module, configured to obtain, from the second transfer function and the excitation parameters, a first air-conduction speech signal corresponding to the first bone-conduction speech signal.
  14. The device according to claim 13, wherein
    the trained neural network is obtained by training a neural network with a training data set based on a target loss function;
    the training data set comprises multiple training data, the training data comprising a first real transfer function of a bone-conduction speech signal, the first real transfer function being obtained based on an audio signal emitted by a sound source and collected by a bone conduction microphone; and
    the output of the neural network is a predicted transfer function, the predicted transfer function corresponding to a second real transfer function of an air-conduction speech signal, the second real transfer function being obtained based on the audio signal emitted by the sound source and collected by an air conduction microphone.
  15. The device according to claim 13, wherein the target loss function comprises:
    the error value between the predicted transfer function and the second real transfer function.
  16. The device according to any one of claims 13 to 15, wherein the first determining module is specifically configured to:
    perform a deconvolution operation on the first bone-conduction speech signal using the excitation parameters, to obtain the first transfer function of the first bone-conduction speech signal.
  17. The device according to any one of claims 13 to 16, wherein the second determining module is specifically configured to:
    perform a convolution operation on the second transfer function and the excitation parameters, to obtain the first air-conduction speech signal corresponding to the first bone-conduction speech signal.
  18. The device according to any one of claims 13 to 17, wherein the obtaining module is specifically configured to:
    collect an audio signal through a bone conduction microphone; and
    denoise the audio signal to obtain the first bone-conduction speech signal.
  19. The device according to any one of claims 13 to 18, wherein the excitation parameters comprise the fundamental frequency of the first bone-conduction speech signal and the harmonics of the fundamental frequency.
  20. A training device, comprising:
    a collection module, configured to collect audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, to obtain n bone-conduction speech signals and n air-conduction speech signals, n ≥ 2;
    a computation module, configured to obtain n first real transfer functions and n second real transfer functions from the n bone-conduction speech signals and the n air-conduction speech signals respectively, the first real transfer functions corresponding one-to-one with the bone-conduction speech signals and the second real transfer functions corresponding one-to-one with the air-conduction speech signals; and
    an iteration module, configured to use the n first real transfer functions as the training data set of a neural network and train the neural network using a target loss function until a training termination condition is reached, to obtain a trained neural network, the target loss function being obtained based on the second real transfer functions.
  21. The device according to claim 20, wherein the bone conduction microphone and the air conduction microphone are deployed on the training device.
  22. The device according to claim 21, wherein the training device comprises:
    a head-mounted device.
  23. The device according to any one of claims 20 to 22, wherein the output of the neural network is a predicted transfer function, and the target loss function comprises:
    the error value between the predicted transfer function and the second real transfer function.
  24. The device according to any one of claims 20 to 23, wherein reaching the training termination condition comprises:
    the value of the target loss function reaching a preset threshold;
    or,
    the target loss function beginning to converge;
    or,
    the number of training iterations reaching a preset count;
    or,
    the training duration reaching a preset duration;
    or,
    a training termination instruction being received.
  25. An execution device, comprising a processor and a memory, the processor being coupled to the memory, wherein
    the memory is configured to store a program; and
    the processor is configured to execute the program in the memory, so that the execution device performs the method according to any one of claims 1 to 7.
  26. A training device, comprising a processor and a memory, the processor being coupled to the memory, wherein
    the memory is configured to store a program; and
    the processor is configured to execute the program in the memory, so that the training device performs the method according to any one of claims 8 to 12.
  27. A computer-readable storage medium comprising a program, wherein when the program is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 12.
  28. A computer program product comprising instructions, wherein when the product is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 12.
  29. A chip, comprising a processor and a data interface, wherein the processor reads, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1 to 12.
PCT/CN2022/117989 2022-09-09 2022-09-09 Speech signal processing method, neural network training method, and device WO2024050802A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/117989 WO2024050802A1 (zh) 2022-09-09 2022-09-09 Speech signal processing method, neural network training method, and device


Publications (1)

Publication Number Publication Date
WO2024050802A1 true WO2024050802A1 (zh) 2024-03-14

Family

ID=90192541


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112017687A (zh) * 2020-09-11 2020-12-01 歌尔科技有限公司 Speech processing method, apparatus and medium for a bone conduction device
CN112565977A (zh) * 2020-11-27 2021-03-26 大象声科(深圳)科技有限公司 Training method for a high-frequency signal reconstruction model, and high-frequency signal reconstruction method and apparatus
CN112599145A (zh) * 2020-12-07 2021-04-02 天津大学 Bone-conducted speech enhancement method based on generative adversarial networks



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22957772

Country of ref document: EP

Kind code of ref document: A1