WO2024050802A1 - A speech signal processing method, a neural network training method, and a device - Google Patents
- Publication number
- WO2024050802A1 (PCT/CN2022/117989)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- transfer function
- bone conduction
- speech signal
- training
- neural network
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R1/00—Details of transducers, loudspeakers or microphones
- H04R1/10—Earpieces; Attachments therefor ; Earphones; Monophonic headphones
- H04R1/1083—Reduction of ambient noise
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02165—Two microphones, one receiving mainly the noise signal and the other one mainly the speech signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04R—LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
- H04R2460/00—Details of hearing devices, i.e. of ear- or headphones covered by H04R1/10 or H04R5/033 but not provided for in any of their subgroups, or of hearing aids covered by H04R25/00 but not provided for in any of its subgroups
- H04R2460/13—Hearing devices using bone conduction transducers
Definitions
- the present application relates to the field of signal processing, and in particular to a speech signal processing method, a neural network training method and equipment.
- air-conduction speech signals are mostly used for transmission of voice signals.
- the transmission process can be shown in Figure 1, in which the user on side A and the user on side B conduct a voice call based on mobile phones; side A speaking and side B listening is taken as an example.
- the above transmission process is: (1) the user on side A speaks and sends out a voice signal S; (2) the air conduction microphone on the side-A mobile phone collects the voice signal S; (3) the side-A mobile phone encodes and compresses the voice signal S and transmits it to the side-B mobile phone based on the wireless network; (4) the side-B mobile phone decodes and restores the received signal to obtain the voice signal S; (5) the side-B mobile phone plays the voice signal S based on its air conduction speaker.
- when the above-mentioned air conduction speech signal is transmitted, the speech signal S will, in addition to the voice of the user on side A, also pick up environmental noise. If the environmental noise is large, it will affect the listening effect for the user on side B. Therefore, in some high-noise scenes, such as mines and rescues, bone conduction voice signals are generally used for information transmission. This is because bone conduction microphones collect less environmental noise and have a high audio signal-to-noise ratio, which helps guarantee call quality in these scenes.
- the user on side A speaks and emits a voice signal S1;
- the bone conduction microphone on the head-mounted device (such as earphones) on side A is close to the face of the user on side A, and collects, based on the vibration of the skin, the audio signal S2 of the user on side A when speaking (which can also be called the bone conduction voice signal S2).
- this bone conduction voice signal S2 is sent to the side-A mobile phone (for example, through the Bluetooth connection between the head-mounted device and the mobile phone); (3) the side-A mobile phone encodes and compresses the bone conduction voice signal S2 and transmits it to the side-B mobile phone based on the wireless network; (4) the side-B mobile phone decodes and restores the received signal to obtain the bone conduction voice signal S2; (5) the side-B mobile phone plays the bone conduction voice signal S2 based on its air conduction speaker.
- speech signal = excitation source * excitation channel, where * denotes convolution (the excitation source convolved with the excitation channel).
- the excitation sources of both air conduction and bone conduction speech signals are the vocal cords, but the excitation channels differ: the excitation channels of air conduction speech signals are the pharyngeal cavity, oral cavity, etc., while the excitation channels of bone conduction speech signals are muscles, bones, etc. Because the two excitation channels are different, the final listening sound differs considerably: the air conduction speech signal has small distortion and high brightness because its excitation channel transmits through air.
- the excitation channel of the bone conduction speech signal transmits through solid and soft bodies, which causes large distortion; the sound is darker than the air conduction speech signal and has poor listening comfort. More seriously, it will cause the opposite end (such as side B) to hear the voice of side A as unnatural, and in severe cases it may not be possible to tell who is speaking, thus affecting the efficiency of the call.
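The source-filter relation above (a speech signal is the excitation source convolved with the excitation channel) can be sketched numerically. The sample rate, the impulse-train excitation, and the decaying impulse response below are all illustrative stand-ins, not values from the application:

```python
import numpy as np

# Source-filter model: observed speech s(t) = excitation e(t) convolved
# with the channel impulse response h(t). All constants are illustrative.
fs = 8000                       # sample rate in Hz (assumed)
t = np.arange(0, 0.1, 1 / fs)   # 100 ms of samples

# Toy excitation: an impulse train at a 100 Hz fundamental (vocal-fold pulses)
e = np.zeros_like(t)
e[:: fs // 100] = 1.0

# Toy channel: a short decaying impulse response standing in for h_b(t) or h_y(t)
h = np.exp(-np.arange(64) / 8.0)

# speech = excitation * channel (convolution), truncated to the frame length
s = np.convolve(e, h)[: len(t)]
```

Swapping in a different `h` while keeping `e` fixed is exactly the timbre change the method exploits: the excitation stays the same, only the channel differs between bone and air conduction.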
- Embodiments of the present application provide a speech signal processing method, a neural network training method, and a device for mapping the transfer function of the bone conduction speech signal (i.e., the first transfer function) to the transfer function of the air conduction speech signal (i.e., the second transfer function) based on the trained neural network, so as to achieve timbre compensation from the bone conduction speech signal to the air conduction speech signal. This makes the listening effect of the bone conduction speech signal brighter and more comfortable, which can ensure call efficiency in high-noise scenes such as mining and rescue.
- embodiments of the present application first provide a speech signal processing method, which can be used in the field of signal processing.
- the method includes: first, obtaining a bone conduction speech signal to be processed.
- this bone conduction speech signal can be called the first bone conduction speech signal.
- excitation parameters are extracted based on the first bone conduction speech signal.
- a transfer function of the first bone conduction voice signal (which may also be referred to as a first transfer function) is determined based on the first bone conduction voice signal.
- the first transfer function is input to the trained neural network, which outputs a second transfer function, i.e., the predicted transfer function of the air conduction speech signal; it can also be called a predictive transfer function.
- the first air conduction speech signal corresponding to the first bone conduction speech signal is obtained.
- that is, the mapping is from the transfer function of the bone conduction speech signal (i.e., the first transfer function) to the transfer function of the air conduction speech signal (i.e., the second transfer function).
- the trained neural network is obtained by training the neural network with a training data set based on the target loss function, where the training data set includes multiple pieces of training data, and the training data includes the first real transfer function of the bone conduction speech signal, which is obtained based on the audio signal emitted by the sound source and collected by the bone conduction microphone.
- the output of the neural network is a predicted transfer function.
- the predicted transfer function corresponds to the second true transfer function of the air conduction speech signal.
- the second true transfer function is obtained based on the audio signal emitted by the sound source collected by the air conduction microphone.
- because the mapping is performed by a trained neural network, the mapping quality can be improved by increasing the scale of the mapping network, which provides flexibility.
- the target loss function may be: an error value between the predicted transfer function and the second true transfer function.
- the method of determining the first transfer function of the first bone conduction voice signal according to the first bone conduction voice signal may be: performing a deconvolution operation on the first bone conduction voice signal using the excitation parameters, to obtain the first transfer function of the first bone conduction speech signal.
- the method of obtaining the first air conduction speech signal corresponding to the first bone conduction speech signal may be: performing a convolution operation on the second transfer function and the excitation parameters, to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal.
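The deconvolution and convolution operations described above can be sketched in the frequency domain, where deconvolution becomes spectral division and convolution becomes spectral multiplication. The function names and the regularizing `eps` are hypothetical choices; the application does not fix a concrete algorithm:

```python
import numpy as np

def estimate_transfer_function(signal, excitation, eps=1e-8):
    """Deconvolution sketch: divide the signal spectrum by the excitation
    spectrum to recover the channel's transfer function (eps avoids
    division by near-zero bins)."""
    S = np.fft.rfft(signal)
    E = np.fft.rfft(excitation)
    return S / (E + eps)

def synthesize(transfer_function, excitation):
    """Convolution sketch: multiply the (mapped) transfer function back
    onto the excitation spectrum and return to the time domain."""
    E = np.fft.rfft(excitation)
    return np.fft.irfft(transfer_function * E, n=len(excitation))
```

In the method's terms: `estimate_transfer_function` would yield the first transfer function from the first bone conduction speech signal, and `synthesize` would apply the neural network's second transfer function to the same excitation to obtain the first air conduction speech signal.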
- the method of obtaining the first bone conduction voice signal may be: first, collecting the audio signal through a bone conduction microphone, and then performing noise reduction on the collected audio signal with a subtractive noise reduction algorithm, to obtain the first bone conduction speech signal.
- the first bone conduction voice signal has undergone noise reduction processing, so the spectrum of the voice signal will be purer.
- the excitation parameter includes a fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
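As a sketch of extracting the fundamental frequency named above, a simple autocorrelation pitch estimator could look like the following; the function name, search range, and sample rate are assumptions and not part of the application (harmonics would then be integer multiples of the returned F0):

```python
import numpy as np

def fundamental_frequency(signal, fs, f_min=60.0, f_max=400.0):
    """Estimate the fundamental frequency (F0) by finding the lag at which
    the autocorrelation peaks within a plausible pitch range."""
    signal = signal - np.mean(signal)
    # Autocorrelation for non-negative lags
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(fs / f_max)          # shortest period considered
    lag_max = int(fs / f_min)          # longest period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return fs / lag

# Toy check on a pure 120 Hz tone
fs = 8000
t = np.arange(0, 0.2, 1 / fs)
f0 = fundamental_frequency(np.sin(2 * np.pi * 120 * t), fs)
```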
- the second aspect of the embodiment of the present application provides a method for training a neural network.
- the method includes: first, collecting, through a bone conduction microphone and an air conduction microphone, the audio signals emitted by a sound source (such as user A) within a certain period of time; n (n ≥ 2) bone conduction speech signals and n air conduction speech signals emitted by user A when speaking can be collected.
- based on each bone conduction speech signal, a corresponding real transfer function (which can be called the first real transfer function) can be obtained; a total of n first real transfer functions can be obtained from the n bone conduction speech signals.
- a real transfer function (which can be called a second real transfer function) corresponding to each air conduction speech signal can be obtained.
- a total of n second real transfer functions can be obtained from the n air conduction speech signals. The n first real transfer functions are used as the training data set of the neural network, and the target loss function is used to train the neural network until the training termination condition is reached, thereby obtaining the trained neural network.
- the target loss function is obtained based on the second real transfer function.
- because the mapping is performed by a trained neural network, the mapping quality can be improved by increasing the scale of the mapping network, which provides flexibility.
- the bone conduction microphone and the air conduction microphone are deployed on the same device, and the training process of the neural network is performed on the device.
- the bone conduction microphone, the air conduction microphone, and the training process of the neural network are all on the same device, which realizes online training and provides flexibility.
- the same device includes: a head-mounted device, such as a headset.
- the target loss function may be: an error value between the predicted transfer function and the second true transfer function.
- reaching the training termination condition includes but is not limited to: the value of the target loss function reaches a preset threshold; or the target loss function begins to converge; or the number of training iterations reaches a preset number; or the training duration reaches a preset duration; or an instruction to terminate the training is obtained.
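The termination conditions listed above could be checked with a helper like this sketch. Every name and numeric default is illustrative (the 0.03 threshold and 0.01 convergence range echo the examples this document gives for those conditions, not fixed requirements):

```python
def should_stop(loss_history, *, threshold=0.03, converge_eps=0.01,
                max_steps=100, elapsed_s=0.0, max_s=300.0,
                stop_requested=False):
    """Return True when any of the listed termination conditions holds."""
    if not loss_history:
        return stop_requested
    if loss_history[-1] <= threshold:            # loss reached preset threshold
        return True
    if (len(loss_history) >= 2
            and abs(loss_history[-1] - loss_history[-2]) <= converge_eps):
        return True                              # loss has begun to converge
    if len(loss_history) >= max_steps:           # iteration budget exhausted
        return True
    if elapsed_s >= max_s:                       # time budget exhausted
        return True
    return stop_requested                        # explicit stop instruction
```

A training loop would call `should_stop` once per round with the accumulated loss values.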
- the third aspect of the embodiments of the present application provides an execution device, which has the function of implementing the method of the above-mentioned first aspect or any possible implementation manner of the first aspect.
- This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
- the hardware or software includes one or more modules corresponding to the above functions.
- the fourth aspect of the embodiments of the present application provides a training device, which has the function of implementing the method of the above-mentioned second aspect or any of the possible implementation methods of the second aspect.
- This function can be implemented by hardware, or it can be implemented by hardware executing corresponding software.
- the hardware or software includes one or more modules corresponding to the above functions.
- the fifth aspect of the embodiment of the present application provides an execution device, which may include a memory, a processor, and a bus system.
- the memory is used to store programs, and the processor is used to call the program stored in the memory to execute the method of the first aspect of the embodiment of the present application or any possible implementation of the first aspect.
- the sixth aspect of the embodiment of the present application provides a training device, which may include a memory, a processor, and a bus system.
- the memory is used to store programs, and the processor is used to call the program stored in the memory to execute the method of the second aspect of the embodiment of the present application or any possible implementation of the second aspect.
- a seventh aspect of the present application provides a computer-readable storage medium.
- the computer-readable storage medium stores instructions which, when run on a computer, enable the computer to execute the method of the above-mentioned first aspect or any possible implementation of the first aspect, or enable the computer to execute the method of the above-mentioned second aspect or any possible implementation of the second aspect.
- the eighth aspect of the embodiment of the present application provides a computer program that, when run on a computer, enables the computer to execute the above-mentioned first aspect or any of its possible implementation methods, or enables the computer to execute the above-mentioned second aspect or any of its possible implementation methods.
- a ninth aspect of the embodiment of the present application provides a chip.
- the chip includes at least one processor and at least one interface circuit.
- the interface circuit is coupled to the processor.
- the at least one interface circuit is used to perform a transceiver function and send instructions to the at least one processor, and the at least one processor is used to run computer programs or instructions, which have the function of implementing the above-mentioned first aspect or any possible implementation of the first aspect, or the function of implementing the above-mentioned second aspect or any possible implementation of the second aspect. That function can be realized by hardware, by software, or by a combination of hardware and software, and the hardware or software includes one or more modules corresponding to the above functions.
- the interface circuit is used to communicate with other modules outside the chip. For example, the interface circuit can send the neural network trained on the chip to a target device (such as a headset, a mobile phone, a personal computer, etc.).
- Figure 1 is a schematic diagram of a real-time call system based on a mobile phone provided by an embodiment of the present application
- Figure 2 is another schematic diagram of a mobile phone-based real-time call system provided by an embodiment of the present application
- Figure 3 is a system architecture schematic diagram of the speech signal processing system provided by the embodiment of the present application.
- Figure 4 is a schematic flow chart of a neural network training method provided by an embodiment of the present application.
- Figure 5 is an example diagram of a training process provided by the embodiment of the present application.
- Figure 6 is a schematic flow chart of a voice signal processing method provided by an embodiment of the present application.
- Figure 7 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
- Figure 8 is a schematic structural diagram of a training device provided by an embodiment of the present application.
- Figure 9 is a schematic structural diagram of another execution device provided by an embodiment of the present application.
- Figure 10 is a schematic structural diagram of another training device provided by an embodiment of the present application.
- Embodiments of the present application provide a speech signal processing method, a neural network training method, and a device for mapping the transfer function of the bone conduction speech signal (i.e., the first transfer function) to the transfer function of the air conduction speech signal (i.e., the second transfer function) based on the trained neural network, so as to achieve timbre compensation from the bone conduction speech signal to the air conduction speech signal. This makes the listening effect of the bone conduction speech signal brighter and more comfortable, which can ensure call efficiency in high-noise scenes such as mining and rescue.
- Neural networks can be composed of neural units. Specifically, a neural network can be understood as having an input layer, hidden layers, and an output layer: generally the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN).
- the work of each layer in a neural network can be described mathematically. From a physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (a collection of input vectors). These five operations include: 1. raising/reducing the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. bending. Here, "space" refers to the collection of all individuals of this type of thing, and W is the weight matrix of each layer of the neural network, where each value in the matrix represents the weight value of a neuron in that layer.
- W determines the spatial transformation from the input space to the output space mentioned above, that is, the W of each layer of the neural network controls how to transform the space.
- the purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
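The role of the weight matrix W described above can be illustrated with a single layer computing an input-space-to-output-space transformation; the sizes, the random initialization, and the tanh nonlinearity are arbitrary choices for the sketch:

```python
import numpy as np

# One layer computes y = activation(W @ x + b): W is the weight matrix that
# training ultimately has to learn. Shapes here are illustrative only.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # maps a 3-dim input space to a 4-dim output space
b = np.zeros(4)                   # bias vector
x = rng.standard_normal(3)        # one input vector

y = np.tanh(W @ x + b)            # dimension change + linear map + nonlinear "bending"
```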
- the loss function may also be called the objective function.
- the error back propagation (BP) algorithm can be used to modify the parameters in the initial neural network model so that the reconstruction error loss of the model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output, which produces an error loss, and the parameters in the initial neural network model are updated by backpropagating the error loss information, so that the error loss converges.
- the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
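The forward-then-backward loop described above can be sketched on a one-layer linear model trained by gradient descent; the model, synthetic data, and learning rate are illustrative and not from the application:

```python
import numpy as np

# Minimal backpropagation sketch: forward pass produces an error loss, the
# gradient of that loss is propagated back to update the weight matrix W.
rng = np.random.default_rng(1)
x = rng.standard_normal((32, 3))          # 32 input vectors
true_W = np.array([[1.0, -2.0, 0.5]])     # "ground truth" weights to recover
y = x @ true_W.T                          # targets (no noise)

W = np.zeros((1, 3))                      # initial model parameters
lr = 0.1                                  # learning rate
for _ in range(200):
    pred = x @ W.T                        # forward propagation
    err = pred - y
    loss = np.mean(err ** 2)              # error loss at the output
    grad = 2 * err.T @ x / len(x)         # backpropagated gradient dL/dW
    W -= lr * grad                        # update so the error loss shrinks
```

After the loop, `W` has converged to the weights that minimize the loss, which is exactly the "optimal parameters such as the weight matrix" the passage refers to.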
- the speech signal processing system 200 includes an execution device 210, a training device 220, a database 230, a client device 240, a data storage system 250 and a data acquisition device 260 (such as a bone conduction microphone and an air conduction microphone).
- the execution device 210 includes a calculation module 211.
- the data collection device 260 is used to obtain a large-scale data set required by the user (i.e., a training data set, which may also be called a training set, and the training set includes training data), and stores the training set in the database 230.
- the training device 220 trains the neural network 201 constructed in this application based on the training set maintained in the database 230.
- the trained neural network 201 obtained by training is then used on the execution device 210.
- the execution device 210 can call data, codes, etc. in the data storage system 250, and can also store data, instructions, etc. in the data storage system 250.
- the data storage system 250 may be placed in the execution device 210 , or the data storage system 250 may be an external memory relative to the execution device 210 .
- the trained neural network 201 trained by the training device 220 can be applied to different systems or devices (i.e., the execution device 210). Specifically, it can be a terminal device or a head-mounted device, such as a headset, a mobile phone, a tablet, a computer, a cloud server, etc.
- the execution device 210 can be configured with an I/O interface 212 for data interaction with external devices, and the “user” can input data to the I/O interface 212 through the client device 240 .
- the client device 240 may be a bone conduction microphone. The bone conduction voice signal collected by the client device is input to the execution device 210.
- the execution device 210 first calculates the real transfer function of the bone conduction voice signal (i.e., the first transfer function described below), and then inputs the real transfer function into the calculation module 211.
- the calculation module 211 processes the input real transfer function to obtain the predicted transfer function (i.e., the second transfer function described below), and the predicted transfer function is then saved on the storage medium of the execution device 210 for use in subsequent downstream tasks.
- the client device 240 can also be integrated in the execution device 210.
- when the execution device 210 is a head-mounted device (such as a headset), the bone conduction voice signal can be obtained directly through the head-mounted device: for example, it can be collected through the bone conduction microphone on the head-mounted device, or received by the head-mounted device from another device, etc. The source of the bone conduction voice signal is not limited here.
- the calculation module 211 in the head-mounted device processes the real transfer function of the bone conduction voice signal to obtain the predicted transfer function, and saves the obtained predicted transfer function.
- the product forms of the execution device 210 and the client device 240 are not limited here.
- the data collection device 260 and/or the training device 220 can also be integrated in the execution device 210. For example, when the execution device 210 is a head-mounted device (such as a headset), the timbres of the bone conduction speech signals collected when different people wear the head-mounted device may differ; therefore, the data collection device 260 and/or the training device 220 can be integrated into the execution device 210.
- when user A wears the head-mounted device, the voice of user A is collected through the data collection device 260 (such as a bone conduction microphone), the neural network 201 is trained through the training device 220 (the real transfer function of the air conduction speech signal being obtained from the air conduction speech signal collected by the air conduction microphone), and the trained neural network 201 is directly used for subsequent applications of user A; similarly, when user B wears the head-mounted device, the data collection device 260 (such as a bone conduction microphone) collects the voice of user B, the neural network 201 is trained through the training device 220, and the trained neural network 201 is directly used for subsequent applications of user B. This makes the trained neural network 201 more accurate and adaptable when different users use the execution device 210, and provides flexibility.
- Figure 3 is only a schematic diagram of a system architecture provided by an embodiment of the present application.
- the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
- in Figure 3, the data storage system 250 is an external memory relative to the execution device 210; in other cases, the data storage system 250 can also be placed in the execution device 210.
- in Figure 3, the client device 240 is an external device relative to the execution device 210; in other cases, the client device 240 can also be integrated in the execution device 210.
- in Figure 3, the training device 220 is an external device relative to the execution device 210; in other cases, the training device 220 can also be integrated in the execution device 210.
- in Figure 3, the data acquisition device 260 is an external device relative to the execution device 210; in other cases, the data acquisition device 260 can also be integrated in the execution device 210, and so on. This application does not limit this.
- based on the training stage and the application stage (which can also be called the inference stage), the following describes, from these two stages respectively, the specific processes of the neural network training method and the speech signal processing method provided by the embodiments of the present application.
- the training stage refers to the process in which the data collection device 260 in FIG. 3 collects training data, and the training device 220 uses the training data to perform training operations on the neural network 201.
- Figure 4 is a schematic flowchart of a neural network training method provided by an embodiment of the present application. The method may specifically include the following steps:
- the audio signals emitted by the sound source can be collected through bone conduction microphones and air conduction microphones within a certain period of time.
- n (n ≥ 2) bone conduction speech signals and n air conduction speech signals emitted by user A when speaking can be collected.
- n first real transfer functions and n second real transfer functions are obtained respectively.
- the first real transfer functions correspond one-to-one to the bone conduction speech signals, and the second real transfer functions correspond one-to-one to the air conduction speech signals.
- based on each bone conduction speech signal s(t), the corresponding real transfer function h_b(t) (which can be called the first real transfer function h_b(t)) can be obtained; a total of n first real transfer functions h_b(t) can be obtained from the n bone conduction speech signals s(t).
- the real transfer function h_y(t) corresponding to each air conduction speech signal y(t) (which can be called the second real transfer function h_y(t)) can also be further obtained based on each air conduction speech signal y(t); a total of n second real transfer functions h_y(t) can be obtained from the n air conduction speech signals y(t).
- the n first real transfer functions are used as the training data set of the neural network, and the target loss function is used to train the neural network until the training termination condition is reached, obtaining the trained neural network. The target loss function is obtained based on the second real transfer function.
- the n first real transfer functions h_b(t) are used as the training data set of the neural network, and the neural network is trained using the target loss function L until the training termination condition is reached, thereby obtaining the trained neural network. The target loss function is obtained based on the second real transfer function.
- the target loss function L may be the error value between the predicted transfer function output by the neural network and the second true transfer function.
- the training process can be: taking the first real transfer function h_b(t) as the input of the neural network, with the output of the neural network being the predicted transfer function h_c(t), and then iteratively training with the target loss function L until the training termination condition is reached.
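One concrete choice for the target loss function L described above is the mean squared error between the predicted transfer function h_c(t) and the second real transfer function h_y(t); the application only requires "an error value" between the two, so MSE here is an assumption:

```python
import numpy as np

def target_loss(h_pred, h_true):
    """Error value between the predicted transfer function h_c(t) and the
    second real transfer function h_y(t); mean squared error is one common
    choice for such an error value."""
    h_pred = np.asarray(h_pred, dtype=float)
    h_true = np.asarray(h_true, dtype=float)
    return np.mean((h_pred - h_true) ** 2)
```

During training, `target_loss` would be evaluated once per round and driven toward the termination threshold.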
- Figure 5 takes recurrent neural networks (RNNs) as an example to illustrate the relationship between the input data (i.e., training data) and the output (i.e., the predicted transfer function) of the RNN.
- reaching training termination conditions includes but is not limited to:
- the target loss function reaches the preset threshold.
- after configuring the target loss function, a threshold (for example, 0.03) can be set for it in advance. After each round of training, whether the value of the target loss function obtained in that round reaches the threshold is judged: if not, training continues; if it reaches the preset threshold, training is terminated, and the values of the network parameters of the neural network determined in the current round of training are used as the network parameter values of the final trained neural network.
- after configuring the target loss function, the neural network can be trained iteratively. If the difference between the value of the target loss function obtained in the current round and its value in the previous round falls within a preset range (for example, 0.01), the target loss function is considered to have converged and training can be terminated; the network parameter values determined in the current round are then used as the final network parameter values of the trained neural network.
- the number of training iterations can be pre-configured (for example, 100). After configuring the target loss function, the neural network is trained iteratively, and the network parameter values of each round are stored until the preset number of iterations is reached. Test data is then used to validate the neural network obtained in each round, and the network parameter values with the best performance are selected as the final network parameter values of the neural network.
- the training duration reaches the preset duration.
- the training duration can be pre-configured (for example, 5 minutes). After configuring the target loss function, the neural network is trained iteratively, and the network parameter values of each round are stored until the training duration reaches the preset duration. Test data is then used to validate the neural network obtained in each round, and the network parameter values with the best performance are selected as the final network parameter values of the neural network.
- a training switch can be set in advance to trigger the generation of training start and training termination instructions.
- when the training switch is turned on, a training start instruction is triggered and the neural network begins iterative training.
- when the training switch is turned off, a training termination instruction is triggered and the neural network stops training.
- the period from when the training switch is turned on to when it is turned off is the training time of the neural network.
- after each round of training, the network parameter values of the neural network for that round are stored until training terminates.
- the test data is then used to verify the neural network obtained in each round, and the network parameter values with the best performance are selected as the final network parameter values of the neural network.
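The termination conditions listed above can be gathered into a single check. A minimal sketch, assuming a helper named `should_stop`; the default values are the examples given in the text (0.03, 0.01, 100 rounds, 5 minutes), not fixed requirements:

```python
import time

# Hedged sketch of the termination checks described above: loss threshold,
# loss convergence, iteration budget, wall-clock budget, and a manual stop
# flag (the "training switch"). Function name and defaults are assumptions.

def should_stop(loss, prev_loss, iteration, start_time, stop_requested,
                loss_threshold=0.03, converge_delta=0.01,
                max_iters=100, max_seconds=300.0):
    if stop_requested:                          # training switch turned off
        return True
    if loss <= loss_threshold:                  # loss reached preset threshold
        return True
    if prev_loss is not None and abs(prev_loss - loss) <= converge_delta:
        return True                             # loss has converged
    if iteration >= max_iters:                  # iteration budget exhausted
        return True
    if time.monotonic() - start_time >= max_seconds:
        return True                             # duration budget exhausted
    return False
```

A training loop would call this once per round, passing the current and previous loss values and the round index.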
- all n bone conduction speech signals s(t) and n air conduction speech signals y(t) may be collected at the beginning, and the neural network then trained iteratively on the collected signals (that is, collect all required training data first, then train); alternatively, one bone conduction speech signal s(t) and one air conduction speech signal y(t) may be collected and used for one round of training, with collection and training repeated until the training termination condition is reached (that is, as many collections as iterations are needed). This application does not limit the order between data collection and training.
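The two orderings described above can be sketched as follows; `collect_pair`, `train_step`, and `done` are hypothetical placeholders for microphone capture, one training iteration, and the termination check, not names from the text:

```python
# Illustrative sketch of the two data-collection orderings: collect everything
# first and then train, versus interleave one collection with one training step.

def train_batch_first(n, collect_pair, train_step, done):
    pairs = [collect_pair() for _ in range(n)]   # collect all n (s(t), y(t)) pairs first
    while not done():
        for s, y in pairs:
            train_step(s, y)                     # then train on the full set

def train_interleaved(collect_pair, train_step, done):
    while not done():                            # collect one pair per training round
        s, y = collect_pair()
        train_step(s, y)
```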
- the bone conduction microphone and the air conduction microphone can be deployed on the same device, for example both on a head-mounted device (such as an earphone), and the training process of the neural network can also be performed on that device; that is, the training device is that device.
- the training process of the above-mentioned neural network is an online training process.
- the implementation process of the online training process can be:
- the wearer puts on the head-mounted device (such as headphones).
- the wearer speaks, and the head-mounted device simultaneously collects the wearer's bone conduction speech signal s(t) and air conduction speech signal y(t) through the bone conduction microphone and air conduction microphone.
- when the online training switch is turned off, the head-mounted device is triggered to terminate the training (that is, the training termination condition is receipt of the training termination instruction).
- the period from when the online training switch is turned on to off is the training time of the neural network.
- during this period, the head-mounted device uses the first real transfer function h b (t) corresponding to the bone conduction speech signal s(t) as the input of the neural network, whose output is the predicted transfer function h c (t), and training iterates; when the online training switch is turned off, the network parameters obtained in the last round (or the optimal network parameters across all rounds) are saved as the final parameters of the neural network.
- the online training process of the neural network may not be performed on the same device where the bone conduction microphone and the air conduction microphone are deployed.
- the online training module can be deployed on the headset or on other devices, such as mobile phones, computers, and cloud servers.
- the training process of the neural network may also be an offline training process (that is, the neural network is trained in advance).
- the implementation process of the offline training process can be:
- the audio signals emitted by the sound source are collected through the bone conduction microphone and the air conduction microphone to obtain the bone conduction speech signal s(t) and the air conduction speech signal y(t).
- the application stage refers to the process in which the execution device 210 in FIG. 3 uses the trained neural network 201 to process the input data.
- Figure 6 is a schematic flowchart of a speech signal processing method provided by an embodiment of the present application. The method may specifically include the following steps:
- a bone conduction speech signal s(t) to be processed is obtained.
- the bone conduction speech signal s(t) may be called a first bone conduction speech signal s(t).
- the excitation parameter e(t) is extracted based on the first bone conduction speech signal s(t), where the excitation parameter e(t) may include the fundamental frequency of the first bone conduction speech signal s(t) and the harmonics of the fundamental frequency, which can be analyzed using the linear predictive coding (LPC) method of speech signal analysis.
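A minimal sketch of the LPC analysis mentioned here, assuming the standard autocorrelation method with the Levinson-Durbin recursion; the function names and the use of the prediction residual as the excitation are illustrative choices, not prescribed by the text:

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC via the Levinson-Durbin recursion."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                    # reflection coefficient
        new_a = a.copy()
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)              # remaining prediction-error power
    return a, err

def excitation(frame, order=10):
    """Inverse-filter the frame with its LPC polynomial: the residual e(t)."""
    a, _ = lpc(frame, order)
    return np.convolve(frame, a, mode="full")[:len(frame)]
```

The residual carries the fundamental-frequency (pitch) structure of the frame; a full implementation would additionally estimate the fundamental frequency and its harmonics from it.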
- the implementation of obtaining the first bone conduction speech signal s(t) may be: first, collect the audio signal x(t) with the bone conduction microphone (this signal may be called the bone conduction audio signal x(t)), and then perform noise reduction on the audio signal x(t) to obtain the first bone conduction speech signal s(t).
- the noise reduction may involve, for example, voice activity detection (VAD) and the fast Fourier transform (FFT).
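One possible form of the FFT-based noise-reduction step is spectral subtraction. This sketch assumes the first frame of x(t) is noise-only (a crude stand-in for VAD); it is an illustration under those assumptions, not the method the text prescribes:

```python
import numpy as np

def denoise(x, frame=256):
    """Frame-wise spectral subtraction; assumes x[:frame] is noise-only."""
    noise_mag = np.abs(np.fft.rfft(x[:frame]))            # noise magnitude estimate
    out = np.zeros(len(x))
    for start in range(0, len(x) - frame + 1, frame):
        spec = np.fft.rfft(x[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)   # subtract the noise floor
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```

A production implementation would use overlapping windows and a VAD-driven, continuously updated noise estimate; the fixed non-overlapping frames here keep the sketch short.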
- the transfer function h b (t) of the first bone conduction speech signal s(t) is determined according to the first bone conduction speech signal s(t); it may also be called the first transfer function h b (t).
- the second transfer function is the predicted transfer function of the air conduction speech signal.
- the first transfer function h b (t) of the first bone conduction speech signal s(t) is input to the trained neural network, which outputs the second transfer function h c (t); since h c (t) is the predicted transfer function of the air conduction speech signal, it can also be called the predicted transfer function h c (t).
- the trained neural network is the one obtained through the training stage above; that is, it is obtained by training the neural network with a training data set based on the target loss function. The training data set includes multiple training data (each including the first real transfer function of a bone conduction speech signal), and the first real transfer function is obtained from the audio signal emitted by the sound source as collected by the bone conduction microphone. The output of the neural network is a predicted transfer function, which corresponds to a second real transfer function of the air conduction speech signal; the second real transfer function is obtained from the audio signal emitted by the sound source as collected by the air conduction microphone.
- the target loss function may be an error value between the predicted transfer function output by the neural network and the second real transfer function.
- according to the second transfer function h c (t) and the excitation parameter e(t), the first air conduction speech signal s'(t) corresponding to the first bone conduction speech signal s(t) is obtained.
- subsequent processing is performed based on the timbre-compensated speech signal (i.e., the first air conduction speech signal s'(t)), including but not limited to:
- Voice calls: enhancing the brightness of the call voice yields better perceptual quality, ensures that the other party's listening experience is unaffected, and improves the perceived brightness and comfort of the call voice.
- Speech recognition: improving the accuracy of speech recognition on bone conduction speech.
- Figure 7 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
- the execution device 700 may specifically include: an acquisition module 701, a first determination module 702, a calculation module 703 and a second determination module 704. The acquisition module 701 is used to acquire the first bone conduction speech signal and extract the excitation parameters based on it; the first determination module 702 is used to determine the first transfer function of the first bone conduction speech signal according to that signal; the calculation module 703 is used to input the first transfer function into the trained neural network to obtain the output second transfer function, which is the predicted transfer function of the air conduction speech signal; the second determination module 704 is used to obtain, according to the second transfer function and the excitation parameter, the first air conduction speech signal corresponding to the first bone conduction speech signal.
- the trained neural network is obtained by training the neural network with a training data set based on the target loss function. The training data set includes multiple training data, each including the first real transfer function of a bone conduction speech signal; the first real transfer function is obtained from the audio signal emitted by the sound source as collected by the bone conduction microphone. The output of the neural network is a predicted transfer function, which corresponds to the second real transfer function of the air conduction speech signal; the second real transfer function is obtained from the audio signal emitted by the sound source as collected by the air conduction microphone.
- the target loss function includes: an error value between the predicted transfer function and the second real transfer function.
- the first determination module 702 is specifically configured to: perform a deconvolution operation on the first bone conduction speech signal using the excitation parameter, to obtain the first transfer function of the first bone conduction speech signal.
- the second determination module 704 is specifically configured to: perform a convolution operation on the second transfer function and the excitation parameter, to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal.
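The deconvolution and convolution operations of the two determination modules can be sketched in the frequency domain. The regularized spectral division (the `eps` term) and the circular (FFT-length) convolution are implementation choices assumed here, not taken from the text:

```python
import numpy as np

# Hedged sketch: deconvolve a speech signal by its excitation to estimate a
# transfer function, and convolve a (predicted) transfer function with the
# same excitation to synthesise a speech signal.

def deconvolve(signal, excitation, eps=1e-8):
    n = len(signal)
    S = np.fft.rfft(signal, n=n)
    E = np.fft.rfft(excitation, n=n)
    H = S * np.conj(E) / (np.abs(E) ** 2 + eps)   # regularised spectral division
    return np.fft.irfft(H, n=n)

def convolve(transfer, excitation):
    n = len(transfer)
    return np.fft.irfft(np.fft.rfft(transfer, n=n) * np.fft.rfft(excitation, n=n), n=n)
```

With these definitions, `deconvolve(convolve(h, e), e)` recovers `h` up to the regularisation error, mirroring how the device first estimates h b (t) from s(t) and e(t), and later reconstructs s'(t) from h c (t) and e(t).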
- the acquisition module 701 is specifically configured to collect audio signals through a bone conduction microphone, perform noise reduction on the audio signals, and obtain the first bone conduction voice signal.
- the excitation parameters include the fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
- the embodiment of the present application also provides a training device. Please refer to Figure 8 for details.
- Figure 8 is a schematic diagram of a training device provided by the embodiment of the present application.
- the training device 800 may specifically include: a collection module 801, a calculation module 802 and an iteration module 803.
- the collection module 801 is used to collect the audio signals emitted by the sound source through the bone conduction microphone and the air conduction microphone respectively, and obtain n bone conduction speech signals and n air conduction speech signals, n ⁇ 2;
- the calculation module 802 is used to obtain n first real transfer functions and n second real transfer functions based on the n bone conduction speech signals and the n air conduction speech signals respectively; the first real transfer functions correspond one-to-one to the bone conduction speech signals, and the second real transfer functions correspond one-to-one to the air conduction speech signals;
- the iteration module 803 is used to take the n first real transfer functions as the training data set of the neural network and train the neural network with the target loss function until the training termination condition is reached, obtaining the trained neural network.
- the target loss function is obtained based on the second real transfer function.
- the bone conduction microphone and the air conduction microphone are deployed in the training device 800.
- the training device 800 includes: a head-mounted device.
- the output of the neural network is a predicted transfer function.
- the target loss function includes: an error value between the predicted transfer function and the second real transfer function.
- reaching the training termination condition includes: the value of the target loss function reaches a preset threshold; or the target loss function begins to converge; or the number of training iterations reaches a preset number; or the training duration reaches a preset duration; or a training termination instruction is obtained.
- FIG. 9 is a schematic structural diagram of the training device provided by the embodiment of the present application.
- the training apparatus 800 described in the embodiment corresponding to Figure 8 may be deployed on the training device 900 to implement the functions of the training device 800 in that embodiment.
- the training device 900 is implemented by one or more servers.
- the training device 900 may vary greatly due to differences in configuration or performance, and may include one or more central processing units (CPUs) 922, memory 932, and one or more storage media 930 (such as one or more mass storage devices) storing applications 942 or data 944.
- the memory 932 and the storage medium 930 may be short-term storage or persistent storage.
- the program stored in the storage medium 930 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the training device 900 .
- the central processor 922 may be configured to communicate with the storage medium 930 and execute a series of instruction operations in the storage medium 930 on the training device 900 .
- the training device 900 may also include one or more power supplies 926, one or more wired or wireless network interfaces 950, one or more input and output interfaces 958, and/or, one or more operating systems 941, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM and more.
- the central processor 922 is used to execute the training method of the neural network in the corresponding embodiment of Figure 4.
- specifically, the central processor 922 can be used to: first, collect the audio signals emitted by the sound source through the bone conduction microphone and the air conduction microphone respectively, obtaining n bone conduction speech signals and n air conduction speech signals, n ≥ 2; then, obtain n first real transfer functions and n second real transfer functions based on the n bone conduction speech signals and the n air conduction speech signals respectively, where the first real transfer functions correspond one-to-one to the bone conduction speech signals and the second real transfer functions correspond one-to-one to the air conduction speech signals; finally, take the n first real transfer functions as the training data set of the neural network and train the neural network with the target loss function, which is obtained based on the second real transfer functions, until the training termination condition is reached, obtaining the trained neural network.
- Figure 10 is a schematic structural diagram of an execution device provided by an embodiment of the present application.
- the execution device 1000 can be embodied as various terminal devices, such as head-mounted wearable devices (e.g., headphones), mobile phones, tablets, laptops, etc.; no limitation is imposed here.
- the execution device 700 described in the corresponding embodiment of FIG. 7 may be deployed on the execution device 1000 to implement the functions of the execution device 700 in the corresponding embodiment of FIG. 7 .
- the execution device 1000 may include: a receiver 1001, a transmitter 1002, a processor 1003 and a memory 1004 (the number of processors 1003 in the execution device 1000 may be one or more, one processor is taken as an example in Figure 10 ), wherein the processor 1003 may include an application processor 10031 and a communication processor 10032.
- the receiver 1001, the transmitter 1002, the processor 1003 and the memory 1004 may be connected through a bus or other means.
- Memory 1004 may include read-only memory and random access memory and provides instructions and data to processor 1003 .
- a portion of memory 1004 may also include non-volatile random access memory (NVRAM).
- the memory 1004 stores operating instructions, executable modules or data structures, a subset thereof, or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
- the processor 1003 controls the operations of the execution device 1000.
- various components of the execution device 1000 are coupled together through a bus system, where in addition to a data bus, the bus system may also include a power bus, a control bus, a status signal bus, etc.
- bus system may also include a power bus, a control bus, a status signal bus, etc.
- for clarity, the various buses are collectively referred to as the bus system in the figure.
- the method disclosed in the corresponding embodiment of FIG. 6 of this application can be applied to the processor 1003, or implemented by the processor 1003.
- the processor 1003 may be an integrated circuit chip with signal processing capabilities. During the implementation process, each step of the above method can be completed by instructions in the form of hardware integrated logic circuits or software in the processor 1003.
- the above-mentioned processor 1003 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the processor 1003 can implement or execute each method, step and logic block diagram disclosed in the embodiment corresponding to Figure 6 of this application.
- a general-purpose processor may be a microprocessor or the processor may be any conventional processor, etc.
- the steps of the method disclosed in conjunction with the embodiments of the present application can be directly implemented by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other mature storage media in this field.
- the storage medium is located in the memory 1004.
- the processor 1003 reads the information in the memory 1004 and completes the steps of the above method in combination with its hardware.
- the receiver 1001 may be configured to receive input numeric or character information and generate signal inputs related to the relevant settings and function control of the execution device 1000.
- the transmitter 1002 can be used to output numeric or character information through the first interface; it can also be used to send instructions to a disk group through the first interface to modify data in the disk group; the transmitter 1002 can also include a display device such as a display screen.
- the processor 1003 is configured to process the first transfer function of the input first bone conduction speech signal through a trained neural network to obtain a corresponding second transfer function.
- the trained neural network can be obtained through the training method corresponding to Figure 4 of this application. For details, please refer to the description in the method embodiments shown above in this application, and will not be described again here.
- An embodiment of the present application also provides a computer-readable storage medium storing a program for signal processing which, when run on a computer, causes the computer to perform the steps performed by the training device described in the embodiment shown in Figure 4, or causes the computer to perform the steps performed by the execution device described in the embodiment shown in Figure 6.
- the training device, execution device, etc. provided by the embodiment of the present application may specifically be a chip.
- the chip includes: a processing unit and a communication unit.
- the processing unit may be, for example, a processor.
- the communication unit may be, for example, an input/output interface, a pin, or a circuit.
- the processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the training device performs the steps performed by the training device described in the embodiment shown in Figure 4, or so that the chip in the execution device performs the steps performed by the execution device described in the embodiment shown in Figure 6.
- the storage unit is a storage unit within the chip, such as a register, cache, etc.
- alternatively, the storage unit may be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM), etc.
- the device embodiments described above are only illustrative.
- the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- the connection relationship between modules indicates that there are communication connections between them, which can be implemented as one or more communication buses or signal lines.
- the present application can be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, etc. In general, all functions performed by computer programs can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits or dedicated circuits. For this application, however, a software implementation is the better implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product.
- the computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk or optical disk, and includes several instructions to cause a computer device (which can be a personal computer, training device, execution device, etc.) to execute the methods described in the various embodiments of this application.
- the computer program product includes one or more computer instructions.
- the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center by wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
- the computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a training device or data center integrating one or more available media.
- the available media may be magnetic media (such as floppy disks, hard disks, or magnetic tapes), optical media (such as DVDs), or semiconductor media (such as solid-state drives (SSDs)), etc.
Claims (29)
- 1. A speech signal processing method, characterized by comprising: acquiring a first bone conduction speech signal and extracting excitation parameters based on the first bone conduction speech signal; determining a first transfer function of the first bone conduction speech signal according to the first bone conduction speech signal; inputting the first transfer function into a trained neural network to obtain an output second transfer function, the second transfer function being a predicted transfer function of an air conduction speech signal; and obtaining, according to the second transfer function and the excitation parameters, a first air conduction speech signal corresponding to the first bone conduction speech signal.
- 2. The method according to claim 1, characterized in that the trained neural network is obtained by training a neural network with a training data set based on a target loss function; the training data set comprises multiple training data, the training data comprising a first real transfer function of a bone conduction speech signal, the first real transfer function being obtained based on an audio signal emitted by a sound source and collected by a bone conduction microphone; the output of the neural network is a predicted transfer function, the predicted transfer function corresponding to a second real transfer function of an air conduction speech signal, the second real transfer function being obtained based on the audio signal emitted by the sound source and collected by an air conduction microphone.
- 3. The method according to claim 2, characterized in that the target loss function comprises: an error value between the predicted transfer function and the second real transfer function.
- 4. The method according to any one of claims 1-3, characterized in that determining the first transfer function of the first bone conduction speech signal according to the first bone conduction speech signal comprises: performing a deconvolution operation on the first bone conduction speech signal using the excitation parameters to obtain the first transfer function of the first bone conduction speech signal.
- 5. The method according to any one of claims 1-4, characterized in that obtaining, according to the second transfer function and the excitation parameters, the first air conduction speech signal corresponding to the first bone conduction speech signal comprises: performing a convolution operation on the second transfer function and the excitation parameters to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal.
- 6. The method according to any one of claims 1-5, characterized in that acquiring the first bone conduction speech signal comprises: collecting an audio signal through a bone conduction microphone; and performing noise reduction on the audio signal to obtain the first bone conduction speech signal.
- 7. The method according to any one of claims 1-6, characterized in that the excitation parameters comprise a fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
- 8. A neural network training method, characterized by comprising: collecting audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, obtaining n bone conduction speech signals and n air conduction speech signals, n ≥ 2; obtaining n first real transfer functions and n second real transfer functions according to the n bone conduction speech signals and the n air conduction speech signals respectively, the first real transfer functions corresponding one-to-one to the bone conduction speech signals and the second real transfer functions corresponding one-to-one to the air conduction speech signals; and taking the n first real transfer functions as a training data set of a neural network and training the neural network with a target loss function until a training termination condition is reached, obtaining a trained neural network, the target loss function being obtained based on the second real transfer functions.
- 9. The method according to claim 8, characterized in that the bone conduction microphone and the air conduction microphone are deployed on the same device, and the training process of the neural network is performed on that device.
- 10. The method according to claim 9, characterized in that the device comprises: a head-mounted device.
- 11. The method according to any one of claims 8-10, characterized in that the output of the neural network is a predicted transfer function, and the target loss function comprises: an error value between the predicted transfer function and the second real transfer function.
- 12. The method according to any one of claims 8-11, characterized in that reaching the training termination condition comprises: the value of the target loss function reaching a preset threshold; or the target loss function beginning to converge; or the number of training iterations reaching a preset number; or the training duration reaching a preset duration; or a training termination instruction being obtained.
- 13. An execution device, characterized by comprising: an acquisition module, used to acquire a first bone conduction speech signal and extract excitation parameters based on the first bone conduction speech signal; a first determination module, used to determine a first transfer function of the first bone conduction speech signal according to the first bone conduction speech signal; a calculation module, used to input the first transfer function into a trained neural network to obtain an output second transfer function, the second transfer function being a predicted transfer function of an air conduction speech signal; and a second determination module, used to obtain, according to the second transfer function and the excitation parameters, a first air conduction speech signal corresponding to the first bone conduction speech signal.
- 14. The device according to claim 13, characterized in that the trained neural network is obtained by training a neural network with a training data set based on a target loss function; the training data set comprises multiple training data, the training data comprising a first real transfer function of a bone conduction speech signal, the first real transfer function being obtained based on an audio signal emitted by a sound source and collected by a bone conduction microphone; the output of the neural network is a predicted transfer function, the predicted transfer function corresponding to a second real transfer function of an air conduction speech signal, the second real transfer function being obtained based on the audio signal emitted by the sound source and collected by an air conduction microphone.
- 15. The device according to claim 13, characterized in that the target loss function comprises: an error value between the predicted transfer function and the second real transfer function.
- 16. The device according to any one of claims 13-15, characterized in that the first determination module is specifically configured to: perform a deconvolution operation on the first bone conduction speech signal using the excitation parameters to obtain the first transfer function of the first bone conduction speech signal.
- 17. The device according to any one of claims 13-16, characterized in that the second determination module is specifically configured to: perform a convolution operation on the second transfer function and the excitation parameters to obtain the first air conduction speech signal corresponding to the first bone conduction speech signal.
- 18. The device according to any one of claims 13-17, characterized in that the acquisition module is specifically configured to: collect an audio signal through a bone conduction microphone; and perform noise reduction on the audio signal to obtain the first bone conduction speech signal.
- 19. The device according to any one of claims 13-18, characterized in that the excitation parameters comprise a fundamental frequency of the first bone conduction speech signal and harmonics of the fundamental frequency.
- 20. A training device, characterized by comprising: a collection module, used to collect audio signals emitted by a sound source through a bone conduction microphone and an air conduction microphone respectively, obtaining n bone conduction speech signals and n air conduction speech signals, n ≥ 2; a calculation module, used to obtain n first real transfer functions and n second real transfer functions according to the n bone conduction speech signals and the n air conduction speech signals respectively, the first real transfer functions corresponding one-to-one to the bone conduction speech signals and the second real transfer functions corresponding one-to-one to the air conduction speech signals; and an iteration module, used to take the n first real transfer functions as a training data set of a neural network and train the neural network with a target loss function until a training termination condition is reached, obtaining a trained neural network, the target loss function being obtained based on the second real transfer functions.
- 21. The device according to claim 20, characterized in that the bone conduction microphone and the air conduction microphone are deployed on the training device.
- 22. The device according to claim 21, characterized in that the training device comprises: a head-mounted device.
- 23. The device according to any one of claims 20-22, characterized in that the output of the neural network is a predicted transfer function, and the target loss function comprises: an error value between the predicted transfer function and the second real transfer function.
- 24. The device according to any one of claims 20-23, characterized in that reaching the training termination condition comprises: the value of the target loss function reaching a preset threshold; or the target loss function beginning to converge; or the number of training iterations reaching a preset number; or the training duration reaching a preset duration; or a training termination instruction being obtained.
- 25. An execution device, comprising a processor and a memory, the processor being coupled to the memory, characterized in that the memory is configured to store a program, and the processor is configured to execute the program in the memory, causing the execution device to perform the method according to any one of claims 1-7.
- 26. A training device, comprising a processor and a memory, the processor being coupled to the memory, characterized in that the memory is configured to store a program, and the processor is configured to execute the program in the memory, causing the training device to perform the method according to any one of claims 8-12.
- 27. A computer-readable storage medium comprising a program, characterized in that, when the program runs on a computer, it causes the computer to perform the method according to any one of claims 1-12.
- 28. A computer program product comprising instructions, characterized in that, when it runs on a computer, it causes the computer to perform the method according to any one of claims 1-12.
- 29. A chip, characterized in that the chip comprises a processor and a data interface, the processor reading, through the data interface, instructions stored in a memory to perform the method according to any one of claims 1-12.
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202280098739.3A CN119654674A (zh) | 2022-09-09 | 2022-09-09 | Speech signal processing method, neural network training method and device |
PCT/CN2022/117989 WO2024050802A1 (zh) | 2022-09-09 | 2022-09-09 | Speech signal processing method, neural network training method and device |
EP22957772.1A EP4576080A1 (en) | 2022-09-09 | 2022-09-09 | Speech signal processing method, neural network training method and device |
US19/073,622 US20250210036A1 (en) | 2022-09-09 | 2025-03-07 | Speech Signal Processing Method, Neural Network Training Method, and Device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/117989 WO2024050802A1 (zh) | 2022-09-09 | 2022-09-09 | Speech signal processing method, neural network training method, and device |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US19/073,622 Continuation US20250210036A1 (en) | 2022-09-09 | 2025-03-07 | Speech Signal Processing Method, Neural Network Training Method, and Device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024050802A1 true WO2024050802A1 (zh) | 2024-03-14 |
Family
ID=90192541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/117989 WO2024050802A1 (zh) | 2022-09-09 | 2022-09-09 | Speech signal processing method, neural network training method, and device |
Country Status (4)
Country | Link |
---|---|
US (1) | US20250210036A1 (zh) |
EP (1) | EP4576080A1 (zh) |
CN (1) | CN119654674A (zh) |
WO (1) | WO2024050802A1 (zh) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112017687A (zh) * | 2020-09-11 | 2020-12-01 | Goertek Technology Co., Ltd. | Speech processing method, apparatus, and medium for a bone conduction device |
CN112565977A (zh) * | 2020-11-27 | 2021-03-26 | Elevoc (Shenzhen) Technology Co., Ltd. | Training method for a high-frequency signal reconstruction model, and high-frequency signal reconstruction method and apparatus |
CN112599145A (zh) * | 2020-12-07 | 2021-04-02 | Tianjin University | Bone conduction speech enhancement method based on a generative adversarial network |
2022
- 2022-09-09 EP EP22957772.1A patent/EP4576080A1/en active Pending
- 2022-09-09 CN CN202280098739.3A patent/CN119654674A/zh active Pending
- 2022-09-09 WO PCT/CN2022/117989 patent/WO2024050802A1/zh active Application Filing
2025
- 2025-03-07 US US19/073,622 patent/US20250210036A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
US20250210036A1 (en) | 2025-06-26 |
EP4576080A1 (en) | 2025-06-25 |
CN119654674A (zh) | 2025-03-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US12293770B2 (en) | Voice signal dereverberation processing method and apparatus, computer device and storage medium | |
TWI730584B (zh) | Keyword detection method and related apparatus | |
US10433075B2 (en) | Low latency audio enhancement | |
CN110310623B (zh) | Sample generation method, model training method, apparatus, medium, and electronic device | |
CN103714824B (zh) | Audio processing method, apparatus, and terminal device | |
WO2018068396A1 (zh) | Voice quality evaluation method and apparatus | |
WO2019099699A1 (en) | Interactive system for hearing devices | |
WO2019242414A1 (zh) | Speech processing method and apparatus, storage medium, and electronic device | |
CN111833896A (zh) | Speech enhancement method, system, apparatus, and storage medium fusing a feedback signal | |
Zhang et al. | Multi-channel multi-frame ADL-MVDR for target speech separation | |
WO2014114049A1 (zh) | Speech recognition method and apparatus | |
CN112017687B (zh) | Speech processing method, apparatus, and medium for a bone conduction device | |
US12075215B2 (en) | Method, apparatus and system for neural network hearing aid | |
WO2021114847A1 (zh) | Network call method and apparatus, computer device, and storage medium | |
WO2019119593A1 (zh) | Speech enhancement method and apparatus | |
CN108198566B (zh) | Information processing method and apparatus, electronic device, and storage medium | |
US20240194220A1 (en) | Position detection method, apparatus, electronic device and computer readable storage medium | |
CN118354237A (zh) | Wake-up method, apparatus, device, and storage medium for a MEMS earphone | |
CN114822566A (zh) | Audio signal generation method and system, and non-transitory computer-readable medium | |
CN114783455A (zh) | Method, apparatus, electronic device, and computer-readable medium for speech noise reduction | |
WO2025007866A1 (zh) | Speech enhancement method and apparatus, electronic device, and storage medium | |
WO2024050802A1 (zh) | Speech signal processing method, neural network training method, and device | |
CN113393863B (zh) | Speech evaluation method, apparatus, and device | |
CN115410586B (zh) | Audio processing method and apparatus, electronic device, and storage medium | |
CN111863006A (zh) | Audio signal processing method, audio signal processing apparatus, and earphone | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22957772; Country of ref document: EP; Kind code of ref document: A1 |
| WWE | Wipo information: entry into national phase | Ref document number: 2022957772; Country of ref document: EP |
| ENP | Entry into the national phase | Ref document number: 2022957772; Country of ref document: EP; Effective date: 20250320 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| WWP | Wipo information: published in national office | Ref document number: 2022957772; Country of ref document: EP |