CN117558266B - Model training method, device, equipment and computer readable storage medium


Info

Publication number
CN117558266B
CN117558266B (application number CN202410049459.4A)
Authority
CN
China
Prior art keywords
noise
microphone
reverberation
training
speaking object
Prior art date
Legal status
Active
Application number
CN202410049459.4A
Other languages
Chinese (zh)
Other versions
CN117558266A (en)
Inventor
王雄
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202410049459.4A
Publication of CN117558266A
Application granted
Publication of CN117558266B

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The application provides a model training method, device, equipment and computer readable storage medium. The method comprises the following steps: reading an initial training voice sample from an external memory and loading it into an internal memory; performing real-time simulation and convolution processing in the internal memory based on room impulse response simulation parameters and reverberation noise simulation parameters to obtain a sample reverberation result and a noise reverberation result; and generating an extended training voice sample from the sample reverberation result and the noise reverberation result, and training a voice recognition model with the initial training voice sample and the extended training voice sample to obtain a trained voice recognition model. In this way, the time cost of data expansion can be reduced, the model training efficiency can be improved, and the robustness of the trained voice recognition model can be improved.

Description

Model training method, device, equipment and computer readable storage medium
Technical Field
The present application relates to artificial intelligence technology, and in particular, to a model training method, apparatus, device, and computer readable storage medium.
Background
Voice recognition is a technology that converts a continuous voice stream into text and is widely applied in fields such as human-computer interaction and audio transcription. Its effect in each application field mainly depends on the recognition accuracy of the voice recognition model, and higher accuracy brings a better product experience. Most mainstream voice recognition models are neural-network-based models obtained by training on a large amount of collected voice recognition data, so the distribution of the training data has a great influence on the accuracy of the voice recognition model, and training data whose distribution better fits the real use scenes is more beneficial to the training of the voice recognition model.
However, most voice training data are collected in low-noise, low-reverberation scenes; even when, to ensure data extensibility or in consideration of data collection cost, data collection and recording are performed in some high-noise, high-reverberation scenes, these scenes are limited and fixed in number, whereas the number of scenes encountered when the model is actually used is nearly infinite. The resulting distribution difference between the use-scene data and the training data severely challenges the robustness of the voice recognition model.
Disclosure of Invention
The embodiments of the application provide a model training method, a model training apparatus, a device, and a computer readable storage medium, which can complete the operations of RIR simulation and generation, convolutional reverberation, and noise superposition in real time during the training process, thereby shortening the model training time and improving the robustness of the model.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a model training method, which comprises the following steps:
reading an initial training voice sample from an external memory, and loading the initial training voice sample into an internal memory;
acquiring preset room impulse response simulation parameters and reverberation noise simulation parameters;
performing real-time simulation in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain first room impulse response voice data corresponding to the speaking object position and second room impulse response voice data corresponding to the noise source position;
performing convolution processing on the initial training voice sample and the first room impulse response voice data in the internal memory to obtain a sample reverberation result;
acquiring noise audio data, and performing convolution processing on the noise audio data and the second room impulse response voice data in the internal memory to obtain a noise reverberation result;
and generating an extended training voice sample for the initial training voice sample by using the sample reverberation result and the noise reverberation result in the internal memory, and training a preset voice recognition model by using the initial training voice sample and the extended training voice sample to obtain a trained voice recognition model.
The embodiment of the application provides a model training device, including:
the first acquisition module is used for reading the initial training voice sample from the external memory and loading the initial training voice sample into the internal memory;
the second acquisition module is used for acquiring preset room impulse response simulation parameters and reverberation noise simulation parameters;
the real-time simulation module is used for performing real-time simulation in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain first room impulse response voice data corresponding to the speaking object position and second room impulse response voice data corresponding to the noise source position;
the first convolution module is used for performing convolution processing on the initial training voice sample and the first room impulse response voice data in the internal memory to obtain a sample reverberation result;
the second convolution module is used for acquiring noise audio data, and performing convolution processing on the noise audio data and the second room impulse response voice data in the internal memory to obtain a noise reverberation result;
the model training module is used for generating an extended training voice sample for the initial training voice sample by using the sample reverberation result and the noise reverberation result in the internal memory, and training a preset voice recognition model by using the initial training voice sample and the extended training voice sample to obtain a trained voice recognition model.
An embodiment of the present application provides an electronic device, including:
a memory for storing computer executable instructions;
and the processor is used for realizing the model training method provided by the embodiment of the application when executing the computer executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium, which stores a computer program or computer executable instructions for implementing the model training method provided by the embodiment of the application when being executed by a processor.
The embodiment of the application provides a computer program product, which comprises a computer program or computer executable instructions, and the computer program or the computer executable instructions realize the model training method provided by the embodiment of the application when being executed by a processor.
The embodiment of the application has the following beneficial effects:
First, an initial training voice sample is read from an external memory (hard disk) and loaded into an internal memory (RAM), and preset room impulse response simulation parameters and reverberation noise simulation parameters are obtained. Then, real-time simulation is performed directly in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain first room impulse response voice data corresponding to the speaking object position and second room impulse response voice data corresponding to the noise source position, and convolution processing is further performed on the first and second room impulse response voice data in the internal memory to obtain a sample reverberation result and a noise reverberation result, respectively. Finally, an extended training voice sample is generated from the sample reverberation result and the noise reverberation result. In this way, data enhancement of the training voice samples is realized directly in the internal memory, the time cost caused by simulating the extended data in advance is avoided, and the robustness of the trained voice recognition model is ensured while the model training duration is shortened.
Drawings
Fig. 1 is a schematic diagram of a network architecture of a speech recognition system 100 according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a server 400 provided in an embodiment of the present application;
FIG. 3A is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 3B is a schematic flow chart of real-time simulation of room impulse responses in an internal memory according to an embodiment of the present application;
FIG. 3C is a schematic flow chart of determining target size information and wall absorption coefficients of a simulated acoustic environment according to an embodiment of the present application;
FIG. 3D is a schematic diagram of an implementation flow of determining a location of a speaking object according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an implementation flow for generating an extended training speech sample and training a speech recognition model according to an embodiment of the present application;
FIG. 5 is a schematic flow chart of an implementation of simulation by a scene caching mechanism according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of still another implementation of the model training method provided in the embodiment of the present application;
FIG. 7 is a schematic flow chart of an implementation of parallelized real-time data simulation provided by an embodiment of the present application;
fig. 8 is a schematic flow chart of another implementation of simulation by a scene buffering mechanism according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail with reference to the accompanying drawings, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and co-operating with other related parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Likewise, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
Unless defined otherwise, all technical and scientific terms used in the embodiments of the present application have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the embodiments of the application is for the purpose of describing the embodiments of the application only and is not intended to be limiting of the application.
Before further describing embodiments of the present application in detail, the terms and expressions that are referred to in the embodiments of the present application are described, and are suitable for the following explanation.
1) Automatic speech recognition (Automatic Speech Recognition, ASR) is the conversion of lexical content in human speech into computer readable inputs such as keys, binary codes, or character sequences.
2) Room impulse response (Room Impulse Response, RIR), a system response characterizing a room system, may be used for room equalization and calculation of room acoustic parameters, among other purposes.
3) Reverberation time (Reverberation Time 60, RT60), which refers to the duration required for the acoustic energy in a room to decrease by 60 dB after the sound source stops emitting, is used to measure the degree of room reverberation: the larger the RT60 of an RIR, the stronger the reverberation, and vice versa.
In order to better understand the model training method provided in the embodiment of the present application, first, a model training method and existing drawbacks in the related art will be described.
In the related art, voice recognition processing is generally performed by a voice recognition model, and the voice recognition model must be obtained by training on a large number of training voice samples. The distribution of the training voice samples therefore has a great influence on the accuracy of the voice recognition model, and training data whose distribution better fits the real use scenes is more beneficial to the training of the voice recognition model. Data augmentation of the training data is therefore required. The mainstream scheme performs simulated expansion of noise, reverberation, and scenes on the data, and is implemented in the following steps:
step one, noise data acquisition: various types of noise common in the environment are collected through various modes to form a noise set.
Step two, room impulse response acquisition: room impulse responses are obtained through real recording or generated through simulation, and are used to superimpose reverberation on the data.
Step three, data expansion: the collected noise is added to the existing training data, reverberation is added to the data using the RIRs, and the data is thus amplified to several times its original size and stored on the hard disk, replacing the existing training data for model training.
The technical drawbacks of the above solution are mainly manifested in the following points:
1) RIR generation: collecting real RIRs is costly because it places extremely high requirements on the environment and requires precise equipment and instruments, and small-batch collection has limited effect because of its small scale. Therefore, RIRs are currently usually obtained through simulation. However, because the generation process involves a complex signal-processing flow, the amount of computation is large and difficult to add to a real-time training flow, so the RIRs are usually pre-generated, stored on the hard disk, and then read and used during simulation. This brings additional hard-disk read/write time and increases the time required for training, and the pre-generation approach is also limited by the number of RIRs.
2) Limitations of the data expansion scheme: most common data expansion approaches simply add reverberation to the original data and add noise to the training data at a certain signal-to-noise ratio. The fineness of this approach is limited and the breadth of the data distribution is insufficient, so in some cases it cannot truly simulate the data distribution of the actual use scenes, which limits the robustness of the model.
Based on this, the embodiments of the present application provide a model training method, apparatus, device, computer readable storage medium, and computer program product, which can complete the operations of RIR generation, convolutional reverberation, and noise superposition in real time during the training process, avoiding the time cost of simulating the extended data in advance. In the following, an exemplary application in which the device is implemented as a server will be described.
Referring to fig. 1, fig. 1 is a schematic diagram of a network architecture of a speech recognition system 100 provided in an embodiment of the present application, as shown in fig. 1, where the network architecture includes a terminal 200, a network 300, and a server 400, in order to support a speech recognition application, the terminal 200 is connected to the server 400 through the network 300, and the network 300 may be a wide area network or a local area network, or a combination of the two.
The terminal 200 is configured to collect voice data and send the collected voice data to the server 400; the server 400 recognizes the voice data using the trained voice recognition model to obtain a voice recognition result and sends the voice recognition result to the terminal 200, and the terminal 200 displays the voice recognition result on the graphical interface 210.
When training the voice recognition model, the server 400 may first, in response to a real-time simulation start instruction, read the initial training voice sample from the external memory, load the initial training voice sample into the internal memory, and then obtain preset room impulse response simulation parameters and reverberation noise simulation parameters;
perform real-time simulation in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain first room impulse response voice data corresponding to the speaking object position and second room impulse response voice data corresponding to the noise source position; perform convolution processing on the initial training voice sample and the first room impulse response voice data to obtain a sample reverberation result; acquire noise audio data, and perform convolution processing on the noise audio data and the second room impulse response voice data to obtain a noise reverberation result; generate an extended training voice sample from the sample reverberation result and the noise reverberation result, and train a preset voice recognition model with the initial training voice sample and the extended training voice sample to obtain a trained voice recognition model.
In some embodiments, the server 400 may further send the trained speech recognition model to the terminal 200 after obtaining the trained speech recognition model, and the terminal 200 performs offline speech recognition on the collected speech data using the trained speech recognition model and displays the speech recognition result on the graphical interface 210.
In some embodiments, the server 400 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDNs), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal 200 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a smart television, a car terminal, etc. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a server 400 provided in an embodiment of the present application, and the server 400 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in server 400 are coupled together by bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general purpose processor (for example a microprocessor or any conventional processor), a digital signal processor (Digital Signal Processor, DSP), another programmable logic device, discrete gate or transistor logic, discrete hardware components, or the like.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The non-volatile Memory may be a Read Only Memory (ROM), and the volatile Memory may be a random access Memory (Random Access Memory, RAM). The memory 450 described in the embodiments herein is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: bluetooth, wireless compatibility authentication (WiFi), and universal serial bus (Universal Serial Bus, USB), etc.;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows a model training apparatus 455 stored in a memory 450, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the first acquisition module 4551, the second acquisition module 4552, the real-time simulation module 4553, the first convolution module 4554, the second convolution module 4555, and the model training module 4556 are logical, and thus may be arbitrarily combined or further split according to the implemented functions. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware, and by way of example, the apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the model training method provided by the embodiments of the present application, e.g., the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processor, DSP), programmable logic devices (Programmable Logic Device, PLD), complex programmable logic devices (Complex Programmable Logic Device, CPLD), field programmable gate arrays (Field-Programmable Gate Array, FPGA), or other electronic components.
The model training method provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the server provided by the embodiment of the present application.
In the following, the model training method provided in the embodiment of the present application is described, and as mentioned above, the electronic device implementing the model training method in the embodiment of the present application may be a terminal, a server, or a combination of both. The execution subject of the respective steps will not be repeated hereinafter.
Referring to fig. 3A, fig. 3A is a schematic flow chart of a model training method according to an embodiment of the present application, which will be described with reference to the steps shown in fig. 3A, where the main body of the steps in fig. 3A is a server.
In step 101, initial training speech samples are read from an external memory and loaded into an internal memory.
In some embodiments, the external memory stores large amounts of program and data information such as system files, large files, and databases; it is located outside the scope of the host and is therefore called external memory, or external storage for short. Common external memories are floppy drives, disks, hard disks, optical storage, and the like. The initial training voice sample may be voice data acquired by a voice acquisition device, or dialogue voice data from film and television works, and may be voice data that has not undergone noise or reverberation processing. Voice data is time-series data that records the change of sound along the time axis; the sound signal at each instant, typically captured by a microphone, is sampled in digital form to form a time series. After the initial training voice sample is read from the external memory, it is loaded into the internal memory for subsequent real-time simulation processing in the internal memory. The internal memory is directly connected to the CPU; its storage capacity is smaller but its speed is high. It is used to temporarily store operation data in the CPU and data exchanged with external memories such as the hard disk, and serves as the bridge between the external memory and the CPU; all programs in the computer run in the internal memory.
In step 102, preset room impulse response simulation parameters and reverberation noise simulation parameters are obtained.
In some embodiments, the room impulse response simulation parameters include size constraint information of the simulated acoustic environment, reverberation time constraint information, signal-to-noise ratio constraint information, microphone constraint information, and speaking object constraint information. The size constraint information of the simulated acoustic environment includes: the maximum height, minimum height, maximum width, minimum width, maximum length, and minimum length of the simulated acoustic environment. The reverberation time constraint information includes: the RT60 distribution center value, RT60 distribution width, RT60 minimum value, and RT60 maximum value. The signal-to-noise ratio constraint information includes: the signal-to-noise ratio distribution center value, signal-to-noise ratio distribution width, signal-to-noise ratio minimum value, and signal-to-noise ratio maximum value. The microphone constraint information includes the first wall critical distance, microphone maximum height, microphone minimum height, microphone array type, number of microphones, and microphone spacing. The speaking object constraint information includes: the near-field speaking object distance distribution center value, near-field speaking object distance distribution width, near-field speaking object distance minimum value, near-field speaking object distance maximum value, far-field speaking object distance distribution center value, far-field speaking object distance distribution width, far-field speaking object distance minimum value, far-field speaking object distance maximum value, speaking object height maximum value, speaking object height minimum value, speaking object far-field rate, and speaking object wall critical distance.
In the embodiments of the application, the richness of these room impulse response simulation parameters and reverberation noise simulation parameters improves the refinement of the real-time simulation and ensures that high-quality extended training voice samples are obtained.
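For ease of understanding only, a minimal sketch of how such simulation parameters might be grouped is given below (Python); the patent does not prescribe any concrete data structure, and all field names and example values here are assumptions.

    # Illustrative sketch only: one possible grouping of the simulation parameters
    # described above. All field names and example values are assumptions.
    from dataclasses import dataclass

    @dataclass
    class RoomSimParams:
        # size constraint information of the simulated acoustic environment (meters)
        min_length: float = 3.0
        max_length: float = 15.0
        min_width: float = 2.0
        max_width: float = 10.0
        min_height: float = 2.0
        max_height: float = 5.0
        # reverberation time constraint information (seconds)
        rt60_center: float = 0.4
        rt60_width: float = 0.4
        rt60_min: float = 0.1
        rt60_max: float = 1.0
        # microphone constraint information
        mic_wall_margin: float = 0.5      # first wall critical distance (m)
        mic_min_height: float = 0.7
        mic_max_height: float = 1.8
        num_mics: int = 1
        mic_spacing: float = 0.0
        # signal-to-noise ratio constraint information (dB)
        snr_center: float = 15.0
        snr_width: float = 20.0
        snr_min: float = 0.0
        snr_max: float = 30.0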
In step 103, real-time simulation is performed in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain first room impulse response voice data corresponding to the speaking object position and second room impulse response voice data corresponding to the noise source position.
In some embodiments, referring to fig. 3B, step 103 may be implemented by steps 1031 to 1036 described below, which are specifically described below.
In step 1031, target size information and wall absorption coefficients of the simulated acoustic environment are determined in an internal memory based on the size constraint information and reverberation time constraint information of the simulated acoustic environment.
In some embodiments, referring to fig. 3C, step 1031 may be implemented by steps 311 through 315 described below, which are specifically described below.
In step 311, a target length of the simulated acoustic environment is determined in the internal memory based on the maximum length and the minimum length in the size constraint information of the simulated acoustic environment.
In some embodiments, a random number may be determined from a minimum length to a maximum length using an average random distribution function, the random number being determined as a target length for simulating an acoustic environment.
Illustratively, the maximum length is 15 meters, the minimum length is 3 meters, the random number 8 is determined from 3 to 15, and then the target length of the simulated acoustic environment is determined to be 8 meters.
In step 312, a target width of the simulated acoustic environment is determined based on the maximum width and the minimum width in the size constraint information of the simulated acoustic environment.
In some embodiments, a random number may be determined from the minimum width to the maximum width using an average random distribution function, the random number being determined as a target width for simulating an acoustic environment.
Illustratively, the maximum width is 10 meters, the minimum width is 2 meters, the random number 5 is determined from 2 to 10, and then the target width of the simulated acoustic environment is determined to be 5 meters.
In step 313, a target height of the simulated acoustic environment is determined based on the maximum height and the minimum height in the size constraint information of the simulated acoustic environment.
In some embodiments, a random number may be determined from a minimum height to a maximum height using an average random distribution function, the random number being determined as a target height for simulating an acoustic environment.
By way of example, the maximum height is 5 meters, the minimum height is 2 meters, the random number 3 is determined from 2 to 5, and then the target height of the simulated acoustic environment is determined to be 3 meters.
In step 314, a target reverberation time is determined based on the reverberation time constraint information.
Wherein the reverberation time constraint information includes: RT60 distribution center value, RT60 distribution width, RT60 minimum value, RT60 maximum value. In some embodiments, the target reverberation time may be determined using a truncated normal distribution with a mean of the RT60 distribution center value, a variance of the RT60 distribution width/2, and the target reverberation time between the RT60 minimum and the RT60 maximum.
In step 315, a wall absorption coefficient is determined based on the target length, target width, target height, and target reverberation time of the simulated acoustic environment.
In some embodiments, the surface area and volume of the simulated acoustic environment may be determined based on the target length, target width, and target height of the simulated acoustic environment, and the wall absorption coefficient may then be derived by inverting the Sabine formula. The Sabine formula is shown in formula (1-1):
RT60 = 0.161 × V / (α × S) (1-1);
where α is the wall absorption coefficient, S is the surface area of the simulated acoustic environment, V is the volume of the simulated acoustic environment, and RT60 is the target reverberation time, so the wall absorption coefficient can be determined according to formula (1-2):
α = 0.161 × V / (S × RT60) (1-2).
It should be noted that, all the steps 311 to 315 are performed in the internal memory.
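As an illustration of steps 311 to 315, a possible Python sketch is given below; it assumes the RoomSimParams sketch above, and the specific sampling calls (numpy, scipy.stats.truncnorm) are implementation assumptions rather than part of the patent.

    # Sketch of steps 311-315: uniform sampling of the room dimensions, truncated-normal
    # sampling of RT60, and inversion of the Sabine formula for the wall absorption coefficient.
    import numpy as np
    from scipy.stats import truncnorm

    def sample_room_and_absorption(p, rng):
        length = rng.uniform(p.min_length, p.max_length)
        width = rng.uniform(p.min_width, p.max_width)
        height = rng.uniform(p.min_height, p.max_height)
        # truncated normal: mean = RT60 distribution center value, std = distribution width / 2
        mu, sigma = p.rt60_center, p.rt60_width / 2
        a, b = (p.rt60_min - mu) / sigma, (p.rt60_max - mu) / sigma
        rt60 = float(truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng))
        surface = 2 * (length * width + length * height + width * height)
        volume = length * width * height
        alpha = 0.161 * volume / (surface * rt60)   # formula (1-2)
        return (length, width, height), rt60, alpha

For example, sample_room_and_absorption(RoomSimParams(), np.random.default_rng(0)) would yield one candidate room and its absorption coefficient per call; in the described scheme this sampling happens entirely in the internal memory.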
In step 1032, a microphone reference position and a microphone target position are determined based on the target size information and microphone constraint information of the simulated acoustic environment.
In some embodiments, the microphone constraint information includes a first wall critical distance, a microphone maximum height, a microphone minimum height, a microphone array type, a number of microphones, and a microphone pitch, and step 1032, when implemented, first determines a microphone reference position based on target size information of the simulated acoustic environment, the first wall critical distance, the microphone maximum height, and the microphone minimum height. The microphone reference location includes a microphone reference x-coordinate, a microphone reference y-coordinate, and a microphone reference z-coordinate. In some embodiments, a random number may be determined from a first wall critical distance to (target length-first wall critical distance) based on a uniform random distribution function, the random number is determined as a microphone reference x-coordinate, a random number is determined from the first wall critical distance to (target width-first wall critical distance) based on a uniform random distribution function, the random number is determined as a microphone reference y-coordinate, and a random number is determined from a microphone minimum height to a microphone maximum height based on a uniform random distribution function, and the random number is determined as a microphone reference z-coordinate.
A microphone target location is then determined based on the microphone reference position, the microphone array type, the number of microphones, and the microphone spacing. If the number of microphones is 1, the microphone array type and the microphone spacing are empty, and the microphone reference position is directly determined as the microphone target position. If the number of microphones is at least 2, the microphone array type and the microphone spacing are not empty; the microphone array type may be linear or circular, and the microphone spacing may be, for example, 0.5 m. When determining the microphone target positions based on the microphone reference position, the microphone array type, the number of microphones, and the microphone spacing, the respective microphone target positions are determined in sequence based on the microphone spacing, with the microphone reference position as the geometric center.
Illustratively, the microphone reference position is (3, 3, 1.5), the microphone array type is linear and arranged transversely, the number of microphones is 3, and the microphone spacing is 0.3 m; the heights of the microphones are then the same, and the target positions of the three microphones are (2.7, 3, 1.5), (3, 3, 1.5), and (3.3, 3, 1.5), respectively.
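Continuing the sketch above, a possible illustration of step 1032 (microphone reference position and a linear array of target positions) is as follows; the axis along which the array is laid out and the handling of other array types are assumptions.

    # Sketch of step 1032: sample a microphone reference position inside the wall margins,
    # then lay out a linear array centered on it (the single-microphone case returns the
    # reference position itself as the target position).
    import numpy as np

    def sample_microphone_positions(room_dim, p, rng):
        length, width, _ = room_dim
        ref = np.array([
            rng.uniform(p.mic_wall_margin, length - p.mic_wall_margin),
            rng.uniform(p.mic_wall_margin, width - p.mic_wall_margin),
            rng.uniform(p.mic_min_height, p.mic_max_height),
        ])
        if p.num_mics == 1:
            return ref, ref[None, :]
        # linear array along the x axis with the reference position as geometric center
        offsets = (np.arange(p.num_mics) - (p.num_mics - 1) / 2) * p.mic_spacing
        mics = np.stack([ref + np.array([o, 0.0, 0.0]) for o in offsets])
        return ref, mics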
In step 1033, a speaking object location is determined based on the microphone reference location and the speaking object constraint information.
In some embodiments, the speaking object constraint information includes a second wall critical distance, a speaking object far field rate, speaking object far field distance constraint information, speaking object near field distance constraint information, and speaking object height constraint information, and step 1033 may be implemented by steps 331 to 336 described below, which are described in detail below.
In step 331, a first random number is generated.
In some embodiments, the first random number may be generated from between 0 and 1 using a uniform random distribution function, and may be, for example, 0.3.
In step 332, it is determined whether the speaking subject far-field rate is less than the first random number.
The speaking object far field rate is a real number which is preset and is between 0 and 1, and is used for representing the probability that the speaking object is in the far field. Step 333 is entered when the far field rate of the speaking object is less than the first random number, and step 334 is entered when the far field rate of the speaking object is greater than or equal to the first random number.
In step 333, the distance between the speaking object and the microphone reference location is determined using the speaking object far field distance constraint information.
In some embodiments, speaking object far-field distance constraint information includes: a far-field talker distance distribution center value, a far-field talker distribution width, a far-field talker distance minimum value, and a far-field talker distance maximum value. Wherein the far-field speaker distance distribution center value characterizes a center value of a truncated normal distribution compliant with the microphone distance when the speaker is far-field (e.g., outside 1 m). The far-field speaking object distribution width represents the width of a truncated normal distribution obeyed by the speaking object at the far-field (outside 1 m) and the microphone distance, the far-field speaking object distance minimum represents the minimum value of the truncated normal distribution obeyed by the speaking object at the far-field (outside 1 m) and the microphone distance, and the far-field speaking object distance maximum represents the maximum value of the truncated normal distribution obeyed by the speaking object at the far-field (outside 1 m) and the microphone distance.
Step 333, when implemented, determines the distance between the speaking object and the microphone reference location using a truncated normal distribution with a mean value of the far-field speaking object distance distribution center value and a variance of the far-field speaking object distribution width/2, and the distance is between the far-field speaking object distance minimum value and the far-field speaking object distance maximum value.
In step 334, the distance between the speaking object and the microphone reference location is determined using the speaking object near-field distance constraint information.
In some embodiments, speaking object near field distance constraint information includes: near-field speaking object distance distribution center value, near-field speaking object distribution width, near-field speaking object distance minimum value, and near-field speaking object distance maximum value. Wherein the near field speaker distance profile center value characterizes a center value of a truncated normal profile obeying with the microphone distance when the speaker is in the near field (e.g., within 1 m). The near-field speaking object distribution width represents the width of a truncated normal distribution obeyed by the near-field speaking object and the microphone distance, the near-field speaking object distance minimum represents the minimum of the truncated normal distribution obeyed by the near-field speaking object and the microphone distance, and the near-field speaking object distance maximum represents the maximum of the truncated normal distribution obeyed by the near-field speaking object and the microphone distance.
When implemented, step 334 determines a distance between the speaking object and the microphone reference location using a truncated normal distribution with a mean value of the near-field speaking object distance distribution center value and a variance of the near-field speaking object distribution width/2, and the distance is between a near-field speaking object distance minimum value and a near-field speaking object distance maximum value.
In step 335, the azimuth of the speaking subject relative to the microphone is randomly generated.
In some embodiments, the azimuth of the speaking object relative to the microphone reference location may be generated from between 0 and 360 ° using a uniform random distribution function. The azimuth angle may be 30 ° by way of example.
In step 336, a speaking object location is determined based on the microphone reference location, the distance between the speaking object and the microphone reference location, the azimuth, the speaking object height constraint information.
In some embodiments, the speaking object position includes an x-coordinate of the speaking object, a y-coordinate of the speaking object, and a z-coordinate of the speaking object, and then step 336, when implemented, may be implemented by steps one through five described below, as described in more detail below.
Step one, based on the distance and azimuth between the speaking object and the microphone reference position, determining a first offset distance between the speaking object and the microphone reference position in the x-axis direction and a second offset distance between the speaking object and the microphone reference position in the y-axis direction.
In some embodiments, the product of the distance and the cosine of the azimuth angle is determined as a first offset distance and the product of the distance and the sine of the azimuth angle is determined as a second offset distance.
And step two, determining the initial x coordinate of the speaking object based on the x coordinate in the microphone reference position and the first offset distance.
In some embodiments, the x-coordinate in the microphone reference position is superimposed by a first offset distance, i.e. the sum of the x-coordinate in the microphone reference position and the first offset distance is determined as the initial x-coordinate of the speaking object.
And step three, determining an initial y coordinate of the speaking object based on the y coordinate in the microphone reference position and the second offset distance.
In some embodiments, the y-coordinate in the microphone reference position is superimposed by a second offset distance, i.e. the sum of the y-coordinate in the microphone reference position and the second offset distance is determined as the initial y-coordinate of the speaking object.
And step four, determining the z coordinate of the speaking object based on the speaking object height constraint information.
In some embodiments, the speaker height constraint information includes a speaker maximum height value and a speaker minimum height value. When this step is implemented, a random number may be determined from between the minimum height value of the speaking object and the maximum height value of the speaking object using a uniform random distribution function, and the random number is determined as the z-coordinate of the speaking object.
And fifthly, when at least one of the initial x coordinate and the initial y coordinate needs to be corrected based on the initial x coordinate, the initial y coordinate and the second wall critical distance, correcting the at least one of the initial x coordinate and the initial y coordinate which need to be corrected, and obtaining at least one of the corrected x coordinate and the corrected y coordinate.
In some embodiments, it is first determined whether the initial x-coordinate and the initial y-coordinate are within the settable region of the speaking object. For example, assuming the simulated acoustic environment is a cuboid and the vertex at the bottom left corner of the cuboid has coordinates (0, 0, 0), the coordinates of the eight vertices of the simulated acoustic environment are determined according to its target length, target width, and target height. Knowing the eight vertex coordinates and the second wall critical distance, the x-coordinate value range of the settable region of the speaking object is determined as (second wall critical distance, target length - second wall critical distance), and the y-coordinate value range is determined as (second wall critical distance, target width - second wall critical distance). Whether the initial x-coordinate needs to be corrected is then determined by checking whether it lies within the x-coordinate value range, and whether the initial y-coordinate needs to be corrected is determined by checking whether it lies within the y-coordinate value range.
When the initial x-coordinate does not lie within the x-coordinate value range, it is determined that the initial x-coordinate needs to be corrected; when the initial y-coordinate does not lie within the y-coordinate value range, it is determined that the initial y-coordinate needs to be corrected. The correction of the initial x-coordinate is described as an example. If the initial x-coordinate is smaller than the second wall critical distance, the second wall critical distance is determined as the corrected x-coordinate; if the initial x-coordinate is larger than (target length - second wall critical distance), then (target length - second wall critical distance) is determined as the corrected x-coordinate. Similarly, when correcting the initial y-coordinate, if the initial y-coordinate is smaller than the second wall critical distance, the second wall critical distance is determined as the corrected y-coordinate; if the initial y-coordinate is larger than (target width - second wall critical distance), then (target width - second wall critical distance) is determined as the corrected y-coordinate.
For example, the simulated acoustic environment has a target length of 5 meters, a target width of 3 meters, and a second wall critical distance of 0.3 meters, so the x-coordinate value range is (0.3, 4.7) and the y-coordinate value range is (0.3, 2.7). If the initial x-coordinate is -2 and the initial y-coordinate is 8, both need to be corrected: because the initial x-coordinate is smaller than the second wall critical distance of 0.3, the corrected x-coordinate is 0.3; because the initial y-coordinate is larger than the target width minus the second wall critical distance, i.e. 2.7, the corrected y-coordinate is 2.7.
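The speaking object position of steps 331 to 336 can be illustrated with the following sketch; the far-field decision follows the comparison described in step 332, and parameter names such as speaker_far_field_rate and speaker_wall_margin are assumptions, not terms defined by the patent.

    # Sketch of steps 331-336: decide near/far field, sample distance and azimuth,
    # project onto the x/y plane, sample the height, and clamp to the wall margin.
    import numpy as np
    from scipy.stats import truncnorm

    def sample_speaker_position(room_dim, mic_ref, p, rng):
        length, width, _ = room_dim
        r = rng.uniform(0.0, 1.0)                     # first random number (step 331)
        far_field = p.speaker_far_field_rate < r      # step 332 as described in the text
        if far_field:
            mu, sigma = p.far_dist_center, p.far_dist_width / 2
            lo, hi = p.far_dist_min, p.far_dist_max
        else:
            mu, sigma = p.near_dist_center, p.near_dist_width / 2
            lo, hi = p.near_dist_min, p.near_dist_max
        a, b = (lo - mu) / sigma, (hi - mu) / sigma
        dist = float(truncnorm.rvs(a, b, loc=mu, scale=sigma, random_state=rng))
        azimuth = np.deg2rad(rng.uniform(0.0, 360.0))  # step 335
        x = mic_ref[0] + dist * np.cos(azimuth)        # first offset distance
        y = mic_ref[1] + dist * np.sin(azimuth)        # second offset distance
        z = rng.uniform(p.speaker_min_height, p.speaker_max_height)
        # step five: clamp to the region allowed by the second wall critical distance
        x = float(np.clip(x, p.speaker_wall_margin, length - p.speaker_wall_margin))
        y = float(np.clip(y, p.speaker_wall_margin, width - p.speaker_wall_margin))
        return np.array([x, y, z])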
In step 1034, a noise source location is determined based on target size information of the simulated acoustic environment and the reverberation noise simulation parameters.
In some embodiments, the reverberation noise simulation parameters include a third wall critical distance, noise source height constraint information, the noise source location including an x-coordinate of the noise source, a y-coordinate of the noise source, and a z-coordinate of the noise source, and accordingly, step 1034, when implemented, may be implemented by steps one through three described below, in particular.
Step one, determining the x coordinate of a noise source based on the target length and the third wall critical distance in the target size information of the simulated acoustic environment.
In some embodiments, a random number may be determined from the third wall critical distance to (target length-third wall critical distance) using a uniform random distribution function, the random number being determined as the noise source x-coordinate.
And step two, determining the y coordinate of the noise source based on the target width and the third wall critical distance in the target size information of the simulated acoustic environment.
In some embodiments, a random number may be determined from the third wall critical distance to (target width-third wall critical distance) based on a uniform random distribution function, the random number being determined as the noise source y-coordinate.
And thirdly, determining the z coordinate of the noise source based on the noise height constraint information.
In some embodiments, the noise height constraint information includes a noise source minimum height and a noise source maximum height, and step three, when implemented, may determine a random number from between the noise source minimum height and the noise source maximum height based on a uniform random distribution function and determine the random number as the noise source z-coordinate.
It should be noted that the reverberation noise simulation parameters include a maximum number of noise sources and a minimum number of noise sources, and the target number of noise sources may be determined from between the minimum number of noise sources and the maximum number of noise sources based on a uniform random distribution function, and then, for each noise source, the position of each noise source is determined through the above steps one to three.
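A possible sketch of step 1034, including the sampling of the target number of noise sources noted above, is given below; the field names are again assumptions.

    # Sketch of step 1034: sample the number of noise sources, then a position for each.
    import numpy as np

    def sample_noise_positions(room_dim, p, rng):
        length, width, _ = room_dim
        count = int(rng.integers(p.noise_min_count, p.noise_max_count + 1))
        positions = []
        for _ in range(count):
            positions.append([
                rng.uniform(p.noise_wall_margin, length - p.noise_wall_margin),   # x
                rng.uniform(p.noise_wall_margin, width - p.noise_wall_margin),    # y
                rng.uniform(p.noise_min_height, p.noise_max_height),              # z
            ])
        return np.array(positions)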
In step 1035, first room impulse response speech data corresponding to the speaking object position is generated based on the target size information of the simulated acoustic environment, the microphone target position, and the speaking object position.
In some embodiments, after the target size information of the simulated acoustic environment, the microphone target position, and the speaking object position are determined, in step 1035 the first room impulse response voice data may be generated according to a preset RIR generating function using the image source (mirror) method, generated using a ray tracing method, or obtained using a hybrid method in the simulation. When the first room impulse response voice data is determined using the hybrid method, initial first room impulse response voice data is first generated by the image source method at a relatively low order, and the RIR details are then completed by the ray tracing method to form the final first room impulse response voice data. The maximum order of the image source method needs to be set in the simulation process; the larger the order, the larger the amount of RIR computation, and the maximum order may be, for example, 15 or 17.
In step 1036, second room impulse response speech data corresponding to the noise source location is generated from the size information of the simulated acoustic environment, the microphone target location, and the noise source location.
Similar to the implementation of step 1035, when generating the second room impulse response voice data corresponding to the noise source position, the second room impulse response voice data may be generated according to a preset RIR generating function from the size information of the simulated acoustic environment, the microphone target position, and the noise source position using the image source method, generated using a ray tracing method, or obtained using a hybrid method. When the second room impulse response voice data is determined using the hybrid method, initial second room impulse response voice data is first generated by the image source method at a relatively low order, and the RIR details are then completed by the ray tracing method to form the final second room impulse response voice data.
It should be noted that, all of the steps 1031 to 1036 are performed in the internal memory, so that the data simulation efficiency can be improved.
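Steps 1035 and 1036 do not prescribe a particular simulation library. Purely as an illustration, the open-source pyroomacoustics package supports the image source method, ray tracing, and a hybrid of the two, and a sketch under that assumption might look as follows (the scattering coefficient and maximum order used here are assumed values).

    # Illustrative sketch only: generating the speaker RIRs and noise-source RIRs with a
    # hybrid image-source + ray-tracing simulation in pyroomacoustics.
    import numpy as np
    import pyroomacoustics as pra

    def simulate_rirs(room_dim, alpha, mic_positions, speaker_pos, noise_positions, fs=16000):
        room = pra.ShoeBox(room_dim, fs=fs,
                           materials=pra.Material(alpha, 0.1),  # absorption, assumed scattering
                           max_order=15, ray_tracing=True, air_absorption=True)
        room.add_microphone_array(np.asarray(mic_positions).T)  # expects shape (3, n_mics)
        room.add_source(list(speaker_pos))                       # source 0: speaking object
        for pos in noise_positions:                              # sources 1..N: noise sources
            room.add_source(list(pos))
        room.compute_rir()
        n_mics = len(mic_positions)
        speaker_rirs = [room.rir[m][0] for m in range(n_mics)]
        noise_rirs = [[room.rir[m][s + 1] for m in range(n_mics)]
                      for s in range(len(noise_positions))]
        return speaker_rirs, noise_rirs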
The following description continues with reference to fig. 3A, following step 103.
In step 104, the initial training speech sample is convolved with the first room impulse response speech data in the internal memory to obtain a sample reverberation result.
In some embodiments, the initial training speech sample is normalized in the internal memory to obtain a normalized initial training speech sample, and then the normalized initial training speech sample and the first room impulse response speech data are convolved to obtain a sample reverberation result.
In step 105, noise audio data is acquired, and convolution processing is performed on the noise audio data and the second room impulse response speech data in the internal memory, so as to obtain a noise reverberation result.
In some embodiments, the maximum number of noise sources and the minimum number of noise sources are included in the reverberation noise simulation parameters, a target number of noise sources is first determined from between the minimum number of noise sources and the maximum number of noise sources based on a uniform random distribution function, and then the noise audio data of the target number is acquired. Illustratively, the target number is 5, then at this time 5 noise audio data are randomly selected from the pre-established noise data set. Then, similar to step 104, each piece of noise audio data is normalized respectively, normalized noise audio data is obtained correspondingly, and then convolution calculation is performed on each piece of normalized noise audio data and the second room impulse response voice data, so as to obtain a noise reverberation result corresponding to each noise source. If the target number of noise sources is 5, then in this step, 5 noise reverberation results are obtained.
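Steps 104 and 105 can be illustrated with the following sketch; the use of peak normalization, of scipy.signal.fftconvolve, and of truncating the convolution output to the original length are assumptions of this illustration rather than requirements of the patent.

    # Sketch of steps 104-105: normalize the signals, then convolve them in memory with
    # the corresponding room impulse responses (single-microphone case for brevity).
    import numpy as np
    from scipy.signal import fftconvolve

    def peak_normalize(x):
        peak = np.max(np.abs(x))
        return x / peak if peak > 0 else x

    def reverberate(speech, speaker_rir, noise_signals, noise_rirs):
        sample_reverb = fftconvolve(peak_normalize(speech), speaker_rir)[:len(speech)]
        noise_reverbs = [fftconvolve(peak_normalize(n), r)[:len(speech)]
                         for n, r in zip(noise_signals, noise_rirs)]
        return sample_reverb, noise_reverbs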
In step 106, an extended training speech sample for the initial training speech sample is generated in the internal memory by using the sample reverberation result and the noise reverberation result, and a preset speech recognition model is trained by using the initial training speech sample and the extended training speech sample, so as to obtain a trained speech recognition model.
In some embodiments, referring to fig. 4, step 106 may be implemented by steps 1061 through 1068 described below, which are described in detail below.
In step 1061, a preset snr parameter and a normalization parameter are obtained in the internal memory, and a target snr is determined based on the snr parameter, and a target normalization value is determined based on the normalization parameter.
In some embodiments, the signal-to-noise ratio parameters include a signal-to-noise ratio distribution center value, a signal-to-noise ratio distribution width, a signal-to-noise ratio minimum value and a signal-to-noise ratio maximum value. When the target signal-to-noise ratio is determined based on the signal-to-noise ratio parameters, a random number is determined by using a truncated normal distribution function with a mean of the signal-to-noise ratio distribution center value and a variance of the signal-to-noise ratio distribution width/2, where the random number lies in the value range [signal-to-noise ratio minimum value, signal-to-noise ratio maximum value]; this random number is determined as the target signal-to-noise ratio. The normalization parameters include a maximum normalization value and a minimum normalization value; when the target normalization value is determined based on the normalization parameters, a random number between the minimum normalization value and the maximum normalization value is determined by using a uniform random distribution function, and this random number is determined as the target normalization value.
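A small sketch of step 1061 follows, assuming scipy.stats.truncnorm and treating half the distribution width as the scale parameter of the truncated normal; the numeric parameter values are placeholders, not values prescribed by this embodiment.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_target_snr(center, width, snr_min, snr_max):
    # Truncated normal with mean = center, scale = width / 2, restricted to [snr_min, snr_max].
    scale = width / 2.0
    a, b = (snr_min - center) / scale, (snr_max - center) / scale
    return float(truncnorm.rvs(a, b, loc=center, scale=scale))

def sample_target_norm(norm_min, norm_max, rng):
    # Uniform random value between the minimum and maximum normalization values.
    return float(rng.uniform(norm_min, norm_max))

rng = np.random.default_rng(0)
target_snr = sample_target_snr(center=10.0, width=20.0, snr_min=-5.0, snr_max=25.0)
target_norm = sample_target_norm(0.3, 0.9, rng)
```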
In step 1062, weighting coefficients for the noise reverberation result are determined based on the target signal-to-noise ratio and the sample reverberation result.
In some embodiments, the sample power of the sample reverberation result is first determined, and the target noise power is then determined based on the target signal-to-noise ratio and the sample power. The noise power of the noise reverberation results is also determined: if there are multiple noise reverberation results, the noise power of each noise reverberation result is determined and the noise powers are added to obtain the total noise power. The ratio of the target noise power to the total noise power is determined as the weight coefficient of the noise reverberation results.
In step 1063, a weighted reverberation result is determined based on the sample reverberation result, the noise reverberation result, and the weight coefficients.
In some embodiments, the product of the noise reverberation result and the weight coefficient is determined as a weighted noise reverberation result, and the sample reverberation result is added to the weighted noise reverberation result to obtain a summation result, then a signal strength maximum in the summation result is determined, and the summation result is divided by the signal strength maximum to obtain a weighted reverberation result.
In step 1064, the weighted reverberation result is normalized based on the target normalization value, resulting in an extended training speech sample.
In some embodiments, assuming the target normalization value is G, the weighted reverberation result is multiplied by G, and the product is determined as the extended training speech sample for the initial training speech sample.
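The following sketch strings steps 1062 to 1064 together. It follows the text literally in using the power ratio itself as the amplitude weight applied to the noise (an implementation concerned with exact power might take its square root instead), and the function name is an assumption.

```python
import numpy as np

def mix_sample_and_noise(sample_reverb, noise_reverbs, target_snr_db, target_norm):
    # Step 1062: weight coefficient from the target SNR, sample power and total noise power.
    sample_power = np.mean(sample_reverb ** 2)
    target_noise_power = sample_power / (10.0 ** (target_snr_db / 10.0))
    total_noise_power = sum(np.mean(n ** 2) for n in noise_reverbs)
    weight = target_noise_power / total_noise_power

    # Step 1063: weighted sum of sample and noise, then divide by the peak of the sum.
    length = max(len(sample_reverb), max(len(n) for n in noise_reverbs))
    mix = np.zeros(length)
    mix[:len(sample_reverb)] += sample_reverb
    for n in noise_reverbs:
        mix[:len(n)] += weight * n
    mix /= np.max(np.abs(mix))

    # Step 1064: scale by the target normalization value G to get the extended sample.
    return target_norm * mix
```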
The data enhancement of one initial training speech sample is completed through steps 1061 to 1064, yielding an extended training speech sample for that initial training speech sample. Step 201 may then be executed to determine the target size information of the simulated acoustic environment, the microphone target position, the speaking object position, the first room impact response voice data and the second room impact response voice data used when generating the extended training speech sample as acoustic scene data, and to add the acoustic scene data to a scene buffer queue.
It should be noted that the data enhancement of the initial training speech samples in steps 1061 to 1064 described above is performed in a real-time simulation thread. In an actual application process, to further improve the data simulation efficiency, multiple real-time simulation threads may be established, the multiple real-time simulation threads forming a simulation thread pool; when performing data simulation on the multiple initial training speech samples included in a batch, the multiple real-time simulation threads in the simulation thread pool are used to simulate the initial training speech samples of the batch in parallel. The size of the thread pool can be determined according to the computing resources of the training environment. For example, the thread pool may include 5 real-time simulation threads, so that 5 initial training speech samples can be simulated at the same time, further improving the efficiency of real-time simulation and data expansion.
In step 1065, the first batch of extended training speech samples is buffered in a sample sharing queue.
In step 1066, in the model training thread, the pre-set speech recognition model is trained using the first set of initial training speech samples and the extended training speech samples.
After the data expansion of the initial training voice samples of the first batch is completed, a model training thread can be started, the model training thread carries out model training, and meanwhile, the data simulation and data expansion are carried out in a real-time simulation thread, so that the parallel processing of training and simulation is realized, and the model training efficiency is improved.
In step 1067, the initial training voice samples of the subsequent batch are obtained, simulation processing is performed on the initial training voice samples of the subsequent batch in parallel in the real-time simulation thread, the extended training voice samples of the subsequent batch are correspondingly obtained, and the extended training voice samples of the subsequent batch are cached in the sample sharing queue.
In some embodiments, to further improve the data simulation efficiency, multiple real-time simulation threads may be established, so that the same batch of initial training speech samples is simulated in parallel by the multiple real-time simulation threads. That is, after the initial training speech samples of a subsequent batch are obtained, the initial training speech samples of the same batch are distributed to the multiple real-time simulation threads, each thread simulates the initial training speech samples distributed to it in parallel, the extended training speech samples for the initial training speech samples of the same batch are obtained, and the extended training speech samples are cached in the sample sharing queue.
In step 1068, in the model training thread, training the speech recognition model is continued using the initial training speech samples and the extended training speech samples of the subsequent batch until a trained speech recognition model is obtained.
In the model training method provided by the embodiment of the application, an initial training speech sample is first read from an external memory (hard disk) and loaded into an internal memory (memory), and preset room impact response simulation parameters and reverberation noise simulation parameters are obtained. Real-time simulation is then performed directly in the internal memory based on the room impact response simulation parameters and the reverberation noise simulation parameters to obtain first room impact response voice data corresponding to the speaking object position and second room impact response voice data corresponding to the noise source position. The initial training speech sample is convolved with the first room impact response voice data and the noise audio data is convolved with the second room impact response voice data in the internal memory, so as to obtain a sample reverberation result and a noise reverberation result, respectively. Finally, an extended training speech sample is generated using the sample reverberation result and the noise reverberation result, and the speech recognition model is trained using the initial training speech sample and the extended training speech sample to obtain a trained speech recognition model. In addition, in the model training process, the training thread and the real-time simulation thread are executed in parallel, so that the model training efficiency can be further improved.
In some embodiments, steps 201 through 209 shown in FIG. 5 may also be performed after the generation of the augmented training speech samples, as described below in connection with FIG. 5.
In step 201, target size information of the simulated acoustic environment, microphone target position, speaking object position, first room impulse response voice data and second room impulse response voice data when generating the extended training voice sample are determined as acoustic scene data, and the acoustic scene data is added to a scene buffer queue.
In step 202, when a new extended training speech sample needs to be generated for a new initial training speech sample, a randomly generated new scene simulation rate is obtained.
In some embodiments, a random number between 0 and 1 may be generated using a uniform random distribution function, which is determined as a new scene simulation rate. Illustratively, the new scene simulation rate may be 0.4.
In step 203, it is determined whether the new scene simulation rate is less than a preset simulation rate threshold.
The simulation rate threshold may be a real number between 0 and 1; for example, the simulation rate threshold may be 0.1, 0.2, etc. When the new scene simulation rate is smaller than the preset simulation rate threshold, no new scene simulation is required, and step 204 is performed; when the new scene simulation rate is greater than or equal to the preset simulation rate threshold, new scene data simulation is required, and step 206 is performed.
In step 204, target acoustic scene data is randomly acquired from a scene cache queue.
In some embodiments, the target acoustic scene data includes target size information of the simulated acoustic environment, microphone target locations, speaking object locations, first room impulse response speech data, and second room impulse response speech data.
In step 205, new extended training speech samples for the new initial training speech samples are generated in the internal memory using the new initial training speech samples, the first room impulse response speech data and the second room impulse response speech data in the target acoustic scene data.
In some embodiments, step 205, when implemented, performs normalization processing on the new initial training voice sample to obtain a normalized new initial training voice sample, and then performs convolution calculation on the normalized new initial training voice sample and the first room impact response voice data to obtain a new sample reverberation result; and obtaining new noise source audio data, carrying out normalization processing on the new noise source audio data to obtain normalized new noise source audio data, and carrying out convolution calculation on the normalized new noise source audio data and second room impulse response voice data to obtain a new noise reverberation result. Thereafter, referring to steps 1061-1064, new augmented training speech samples are generated based on the new sample reverberation results and the new noise reverberation results.
In step 206, real-time simulation is performed in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters, so as to obtain new first room impulse response voice data corresponding to the speaking object position and new second room impulse response voice data corresponding to the noise source position.
In step 207, the new initial training speech sample is convolved with the new first room impulse response speech data in the internal memory to obtain a new sample reverberation result.
In step 208, new noise audio data is acquired, and the new noise audio data is convolved with the new second room impulse response speech data in the internal memory to obtain a new noise reverberation result.
In step 209, a new augmented training speech sample for the new initial training speech sample is generated in the internal memory using the new sample reverberation result and the new noise reverberation result.
It should be noted that the implementation procedures of steps 206 to 209 are similar to the implementation procedures of steps 103 to 106, except that the parameter values randomly determined by using the uniform random distribution function or the truncated normal distribution function are different, and reference may be made to the implementation procedures of steps 103 to 106 in the implementation.
In the embodiment of steps 201 to 209, after an extended training speech sample is generated, the target size information of the simulated acoustic environment, the microphone target position, the speaking object position, the first room impulse response voice data and the second room impulse response voice data used when generating that extended training speech sample are stored in the scene buffer queue as acoustic scene data. When a new extended training speech sample needs to be generated, a new scene simulation rate is randomly generated. If it is determined based on the new scene simulation rate that simulation needs to be performed again, new first room impulse response voice data and new second room impulse response voice data are simulated in real time in the internal memory, and a new extended training speech sample is generated based on the new initial training speech sample and the newly simulated first and second room impulse response voice data. If it is determined based on the new scene simulation rate that simulation does not need to be performed again, target acoustic scene data can be randomly determined from the scene buffer queue, and a new extended training speech sample is generated directly based on the new initial training speech sample and the first room impulse response voice data and the second room impulse response voice data in the target acoustic scene data. In this way the real-time simulation time and the model training time can be further shortened, improving the model training efficiency.
In the following, an exemplary application of the embodiments of the present application in a practical application scenario will be described.
The model training method provided by the embodiment of the application can be applied to the training of speech recognition models, for example, the speech recognition model used in the offline speech recognition algorithm of a cloud intelligent voice assistant solution, or the speech recognition model used in an online speech recognition algorithm. Through the refined real-time data simulation method, the noise and reverberation distribution of real scenes is simulated, so that data expansion of the training speech samples is achieved, the robustness of the speech recognition model is improved, and the user experience is effectively improved. In the model training process, the training flow is optimized through scene caching and parallelization strategies, reducing the training cost.
Fig. 6 is a schematic diagram of another implementation flow of the model training method provided in the embodiment of the present application. As shown in fig. 6, the training code first reads a training corpus X from the hard disk and loads it into memory, then obtains a simulated corpus X' in memory through the data real-time simulation module 601; the simulated corpus X' remains in memory for subsequent model training.
In the embodiment of the present application, the refinement of the data real-time simulation module 601 in the simulation flow is reflected in the richness of the parameters used in real-time RIR simulation and reverberation noise simulation, wherein the list of the parameters used is shown in table 1:
Table 1: real-time data simulation parameter table
The RIR simulation flow is illustrated and can be described as three steps:
the first step: and (5) constructing an acoustic environment.
First, the length, width and height of the room are generated, wherein the length of the room is a random value determined by a uniform random distribution function from the minimum value of the length of the room to the maximum value of the length of the room, the width of the room is a random value determined by a uniform random distribution function from the minimum value of the width of the room to the maximum value of the width of the room, and the height of the room is a random value determined by a uniform random distribution function from the minimum value of the height of the room to the maximum value of the height of the room.
Based on the relevant parameters of RT60 in table 1 (RT60 distribution center value, RT60 distribution width, RT60 minimum value, RT60 maximum value), RT60 values are randomly generated by a truncated normal distribution:
wherein, the distribution rule of the truncated normal distribution is:
P(Y) ~ N(RT60 distribution center value, RT60 distribution width/2), when RT60 minimum value < Y < RT60 maximum value;
P(Y) = 0, when Y > RT60 maximum value or Y < RT60 minimum value;
where P(Y) denotes the probability of the random value Y. The wall absorption coefficient can then be solved from the Sabine formula, yielding a simulated acoustic environment with definite length, width, height and wall absorption coefficient.
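A sketch of this first step under the above description is shown below; the parameter dictionary keys are illustrative assumptions, the truncated normal is realized by simple rejection sampling, and the Sabine relation RT60 = 0.161·V/(S·α) is used to recover the absorption coefficient.

```python
import numpy as np

def sample_room_and_absorption(p, rng):
    # Room length, width and height drawn uniformly from their [min, max] ranges.
    length = rng.uniform(p["len_min"], p["len_max"])
    width = rng.uniform(p["wid_min"], p["wid_max"])
    height = rng.uniform(p["hei_min"], p["hei_max"])

    # RT60 from a truncated normal: redraw until the value lies in (rt60_min, rt60_max).
    while True:
        rt60 = rng.normal(p["rt60_center"], p["rt60_width"] / 2.0)
        if p["rt60_min"] < rt60 < p["rt60_max"]:
            break

    # Sabine formula RT60 = 0.161 * V / (S * alpha)  =>  alpha = 0.161 * V / (S * RT60).
    volume = length * width * height
    surface = 2.0 * (length * width + length * height + width * height)
    alpha = 0.161 * volume / (surface * rt60)
    return (length, width, height), rt60, alpha

params = {"len_min": 3.0, "len_max": 8.0, "wid_min": 3.0, "wid_max": 6.0,
          "hei_min": 2.5, "hei_max": 4.0, "rt60_center": 0.5, "rt60_width": 0.4,
          "rt60_min": 0.2, "rt60_max": 1.0}
room, rt60, alpha = sample_room_and_absorption(params, np.random.default_rng(0))
```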
The second step: determining the microphone position and the sound source positions.
First, a reference position of the microphone is generated from the room length, width and height and the microphone position parameters according to the following formulas (2-1) to (2-3):
(2-1);
(2-2);
(2-3);
Here, the x, y and z axes respectively correspond to the length, width and height of the room. If the number of microphones is 1, the reference position is determined as the microphone target position; if array microphone data is to be generated, the reference position is taken as the geometric center, and the coordinate position of each microphone in the microphone array is determined according to the microphone array type, the number of microphones and the microphone spacing.
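Since the bodies of formulas (2-1) to (2-3) are not reproduced above, the following sketch is only one plausible reading of them: the reference position is kept at least the first wall critical distance away from each wall, its height is drawn between the microphone minimum and maximum heights, and a linear array is laid out around the reference position. All names and numeric values are illustrative assumptions.

```python
import numpy as np

def sample_mic_reference(room, wall_dist, h_min, h_max, rng):
    # Reference x/y kept at least wall_dist from every wall; z drawn from [h_min, h_max].
    length, width, height = room
    x = rng.uniform(wall_dist, length - wall_dist)
    y = rng.uniform(wall_dist, width - wall_dist)
    z = rng.uniform(h_min, min(h_max, height))
    return np.array([x, y, z])

def linear_array_positions(center, n_mics, spacing):
    # Microphones placed along the x axis, centred on the reference position.
    offsets = (np.arange(n_mics) - (n_mics - 1) / 2.0) * spacing
    return np.stack([center + np.array([dx, 0.0, 0.0]) for dx in offsets])

rng = np.random.default_rng(0)
mic_ref = sample_mic_reference((6.0, 4.0, 3.0), wall_dist=0.5, h_min=0.6, h_max=1.5, rng=rng)
mic_array = linear_array_positions(mic_ref, n_mics=4, spacing=0.05)
```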
Subsequently, a random value of 0 to 1 subject to a uniform distribution is regenerated and compared with the far-field rate of the speaker, and if the far-field rate of the speaker is smaller than the random value, the relative distance d between the speaker and the microphone is determined according to the formula (2-4):
(2-4);
if the speaker far field rate is greater than or equal to the random value, determining a relative distance between the speaker and the microphone according to equation (2-5):
(2-5);
At this time, according to polar coordinate theory, an angle between 0 and 360 degrees is randomly generated to represent the azimuth angle of the speaker relative to the microphone in the xOy plane. The speaker x-axis coordinate is then determined according to equation (2-7):
(2-7);
determining speaker y-axis coordinates according to equation (2-8):
(2-8);
determining speaker z-axis coordinates according to formulas (2-9):
(2-9);
The speaker coordinates are then corrected according to the speaker wall critical distance. When correcting the speaker position, it is first determined, based on the speaker x-axis and y-axis coordinates, whether the speaker position lies outside the room; if so, at least one of the speaker x coordinate and y coordinate is corrected based on the speaker wall critical distance to obtain a corrected speaker position, such that the corrected speaker position lies inside the room and its distance to each wall is greater than or equal to the speaker wall critical distance.
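A sketch of the speaker-position step follows, under the same caveat that formulas (2-4) to (2-9) are not reproduced above: the far-field/near-field decision, the polar-coordinate placement and the wall correction follow the surrounding text, while the clamping-based correction and all parameter names are assumptions.

```python
import numpy as np

def sample_speaker_position(mic_ref, room, p, rng):
    length, width, _ = room
    # Far-field vs. near-field decision against the speaker far-field rate.
    if p["far_field_rate"] < rng.uniform(0.0, 1.0):
        d = rng.uniform(p["far_min"], p["far_max"])      # far-field distance range
    else:
        d = rng.uniform(p["near_min"], p["near_max"])    # near-field distance range

    # Azimuth in the xOy plane and the resulting x/y coordinates.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    x = mic_ref[0] + d * np.cos(theta)
    y = mic_ref[1] + d * np.sin(theta)
    z = rng.uniform(p["spk_h_min"], p["spk_h_max"])

    # Correction: pull coordinates back inside the room, respecting the wall critical distance.
    margin = p["spk_wall_dist"]
    x = float(np.clip(x, margin, length - margin))
    y = float(np.clip(y, margin, width - margin))
    return np.array([x, y, z])
```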
Finally, the number of noise sources is generated between the minimum number of noise sources and the maximum number of noise sources in a uniform random manner, and the position coordinates of the i-th noise source can then be determined by the following formulas (2-10) to (2-12):
(2-10);
(2-11);
(2-12);
at this time, the microphone position coordinates, speaker position coordinates, and noise source position coordinates in the room can be obtained.
And a third step of: an impulse response is generated.
After the acoustic environment, the microphone positions and the sound source positions are determined, in the embodiment of the application the RIR voice data are obtained in simulation by the hybrid method: initial RIR voice data are generated at a low order by the mirror-image method, and the RIR details are then complemented by the ray tracing method to form the final RIR voice data. The higher the order, the more detailed the RIR but the larger the computation; a typical maximum order used in the embodiment of the application is 17.
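The embodiment does not name a particular RIR engine; as one assumed implementation, the open-source pyroomacoustics library supports exactly this image-source plus ray-tracing hybrid. The sketch below relies on that library (a recent version is assumed for the API shown) and is not the implementation described in this application.

```python
import numpy as np
import pyroomacoustics as pra

def simulate_rirs(room_dim, rt60, mic_positions, source_positions, fs=16000):
    # Wall absorption from the Sabine relation; image-source order capped at 17,
    # with ray tracing enabled to fill in the late-reverberation detail.
    e_absorption, _ = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(e_absorption),
                       max_order=17, ray_tracing=True, air_absorption=True)
    room.add_microphone_array(np.array(mic_positions).T)   # expects shape (3, n_mics)
    for pos in source_positions:
        room.add_source(pos)
    room.compute_rir()
    # room.rir[m][s] is the impulse response from source s to microphone m.
    return room.rir
```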
The steps of reverberation and noise simulation are as follows:
1) The first step: noise data is selected.
The corresponding number of noise clips is extracted from the noise data set according to the number of noise sources determined in the RIR generation step.
2) And a second step of: generating a signal to noise ratio, which can be achieved by the following equation (2-13):
signal to noise ratio = truncated normal distribution (signal to noise ratio distribution center value, signal to noise ratio distribution width, signal to noise ratio minimum, signal to noise ratio maximum) (2-13);
The third step: RIR convolution and noise accumulation. For the speaker, the data used is the speech recognition training data sample x(t) currently being processed in the training corpus X; if the RIR speech data corresponding to the speaker position is r_s(t), the sample reverberation result is determined according to formula (2-14):
(2-14);
Here, Norm denotes normalization of the speech signal amplitude, and in formula (2-14) the normalized speech sample is convolved with r_s(t). Similarly, for the noise source positions: if N noise sources are selected, the noise signals extracted in the first step are n_1(t), …, n_N(t) and the RIR corresponding to the i-th noise source position is r_n^i(t); the noise reverberation result of the i-th noise is then determined according to equation (2-15):
(2-15);
Finally, a normalization value G is generated between the minimum normalization value and the maximum normalization value using a uniform random distribution, and the speech recognition training data sample x'(t) after reverberation and noise simulation is determined by the following formulas (2-16) to (2-18):
(2-16);
(2-17);
(2-18);
where σ is calculated, through a signal-to-noise ratio calculation, from the speech signal at the speaker position and the noise signals corresponding to the noise source positions.
In general, RIR computation is relatively time-consuming, so without optimization and acceleration the time spent on real-time data simulation in the original serial training flow would increase the overall training time. To solve this problem, the embodiment of the present application designs a parallelized real-time data simulation strategy, whose flow is shown in FIG. 7. The parallelization is realized in two aspects. In the first aspect, the training thread 701 and the real-time simulation thread 702 are parallelized: the batch (Batch) data used by each training step is continuously simulated in the real-time simulation thread 702 and cached in a shared queue, the training thread then reads each simulated batch from the shared queue in order, and trained data are released from the queue; as long as the data simulation thread produces data faster than the training thread consumes it, the overall training speed differs little from training without real-time data simulation. In the second aspect, a data simulation thread pool is designed: since neither the batches nor the individual samples within a batch depend on each other during simulation, the data simulation flow of each batch can be distributed to the thread pool for parallel processing, and the size of the thread pool can be determined according to the computing resources of the training environment.
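A minimal sketch of this two-level parallelization, assuming Python's standard threading primitives, is shown below; simulate_one and train_step are placeholders for the per-utterance simulation and one training step of the speech recognition model.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

def simulate_one(utterance):
    # Placeholder for the per-utterance real-time simulation described above.
    return utterance

def train_step(batch):
    # Placeholder for one optimizer step of the speech recognition model.
    pass

shared_queue = queue.Queue(maxsize=32)       # batches ready for the training thread
pool = ThreadPoolExecutor(max_workers=5)     # real-time simulation thread pool

def producer(batches):
    # Real-time simulation side: the utterances in a batch are independent,
    # so the thread pool fans the work out and the finished batch is queued.
    for batch in batches:
        futures = [pool.submit(simulate_one, utt) for utt in batch]
        shared_queue.put([f.result() for f in futures])
    shared_queue.put(None)                   # sentinel: no more batches

def trainer():
    # Training side: consume simulated batches as they become available.
    while True:
        batch = shared_queue.get()
        if batch is None:
            break
        train_step(batch)

threading.Thread(target=producer, args=([["utt-1", "utt-2"]],), daemon=True).start()
trainer()
```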
In addition, while ensuring the refinement and diversity of the simulation, the embodiment of the application designs a scene buffering mechanism, whose principle is shown in fig. 8. Here the acoustic scene refers to the various parameters and RIR signals generated in the RIR simulation flow, since these elements form the basic acoustic scene data. As shown in fig. 8, the execution steps of the scene buffering mechanism include:
step 801, generating a random number uniformly distributed between 0 and 1.
Step 802, judging whether the new scene simulation rate R_new is less than the random number.
If the new scene simulation rate R_new is less than the random number, step 803 is entered; if the new scene simulation rate R_new is greater than or equal to the random number, step 804 is entered.
Step 803, randomly extracting target acoustic scene data.
In some embodiments, target acoustic scene data is randomly extracted from an existing acoustic scene buffer queue, expanded training speech samples are formed based on the target acoustic scene data, and model training is performed by using the expanded training speech samples.
At step 804, the simulation generates new acoustic scene data.
In some embodiments, after the new acoustic scene data is generated by simulation, the newly generated acoustic scene data is pushed into an acoustic scene buffer queue, and expanded training speech samples are formed using the new acoustic scene data, and model training is performed using the expanded training speech samples.
The acoustic scene buffer queue follows the first-in first-out principle, and the queue size can be set according to actual conditions; in the embodiment of the application, a typical value of the queue size is 1000, and R_new may be 0.1.
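A compact sketch of this scene buffering mechanism of fig. 8, assuming the typical values above, is given below; simulate_new_scene stands in for the full RIR simulation flow, and the dictionary layout of a scene is illustrative only.

```python
import random
from collections import deque

scene_cache = deque(maxlen=1000)   # FIFO acoustic scene buffer queue
R_NEW = 0.1                        # new scene simulation rate

def simulate_new_scene():
    # Placeholder for the full RIR simulation flow that produces acoustic scene data.
    return {"room": None, "mic": None, "speaker": None, "rirs": None}

def get_acoustic_scene():
    # Steps 801/802: draw a uniform random number and compare it with R_NEW.
    if scene_cache and R_NEW < random.random():
        return random.choice(list(scene_cache))   # step 803: reuse a cached scene
    scene = simulate_new_scene()                  # step 804: simulate a new scene
    scene_cache.append(scene)
    return scene
```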
The data refinement real-time simulation method provided by the embodiment of the application can generate speech recognition training data with finely simulated reverberation and noise while remaining fast, effectively improving the robustness of the speech recognition model and reducing the time and economic cost of training. The test results of the data simulation speed-up brought by the parallelization strategy and scene buffering are as follows:
Table 2: time comparison for data simulation with different methods
The results show that the two strategies together improve the simulation speed by approximately 20 times compared with the baseline, so that refined real-time simulation and rapid training can be combined.
It can be appreciated that in the embodiments of the present application, related data such as user information, training voice samples, voice data to be identified, etc. are related, when the embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of related data needs to comply with related laws and regulations and standards of related countries and regions.
The following continues the description of an exemplary structure in which the model training apparatus 455 provided in the embodiment of the present application is implemented as software modules. In some embodiments, as shown in fig. 2, the software modules stored in the model training apparatus 455 of the memory 450 may include:
a first obtaining module 4551, configured to read an initial training voice sample from an external memory, and load the initial training voice sample into an internal memory;
the second obtaining module 4552 is configured to obtain preset room impact response simulation parameters and reverberation noise simulation parameters;
the real-time simulation module 4553 is configured to perform real-time simulation in the internal memory based on the room impact response simulation parameter and the reverberation noise simulation parameter, to obtain first room impact response voice data corresponding to a speaking object position and second room impact response voice data corresponding to a noise source position;
a first convolution module 4554, configured to perform convolution processing on an initial training speech sample and the first room impulse response speech data in the internal memory, to obtain a sample reverberation result;
a second convolution module 4555, configured to obtain noise audio data, and perform convolution processing on the noise audio data and the second room impact response speech data in the internal memory to obtain a noise reverberation result;
The model training module 4556 is configured to generate an extended training speech sample for an initial training speech sample by using the sample reverberation result and the noise reverberation result in the internal memory, and train a preset speech recognition model by using the initial training speech sample and the extended training speech sample to obtain a trained speech recognition model.
In some embodiments, the room impulse response simulation parameters include size constraint information, reverberation time constraint information, signal-to-noise ratio constraint information, microphone constraint information, speaking object constraint information of the simulated acoustic environment, and correspondingly, the real-time simulation module 4553 is further configured to: determining target size information and wall absorption coefficients of the simulated acoustic environment in the internal memory based on the size constraint information and reverberation time constraint information of the simulated acoustic environment; determining a microphone reference position and a microphone target position according to the target size information of the simulated acoustic environment and the microphone constraint information; determining a speaking object position according to the microphone reference position and the speaking object constraint information; determining a noise source position according to the target size information of the simulated acoustic environment and the reverberation noise simulation parameters; generating first room impact response voice data corresponding to the speaking object position according to the target size information of the simulated acoustic environment, the microphone target position and the speaking object position; and generating second room impulse response voice data corresponding to the noise source position according to the size information of the simulated acoustic environment, the microphone target position and the noise source position.
In some embodiments, the real-time simulation module 4553 is further to: determining a target length of the simulated acoustic environment in the internal memory based on a maximum length and a minimum length in size constraint information of the simulated acoustic environment; determining a target width of the simulated acoustic environment based on the maximum width and the minimum width in the size constraint information of the simulated acoustic environment; determining a target height of the simulated acoustic environment based on the maximum height and the minimum height in the size constraint information of the simulated acoustic environment; determining a target reverberation time based on the reverberation time constraint information; a wall absorption coefficient is determined based on the target length, target width, target height, and target reverberation time of the simulated acoustic environment.
In some embodiments, the microphone constraint information includes a first wall critical distance, a microphone maximum height, a microphone minimum height, a microphone array type, a number of microphones, and a microphone pitch, the real-time simulation module 4553 further configured to: determining a microphone reference position according to the target size information of the simulated acoustic environment, the first wall critical distance, the maximum microphone height and the minimum microphone height; and determining a microphone target position according to the microphone reference position, the microphone array type, the microphone number and the microphone spacing.
In some embodiments, the speaking object constraint information includes a second wall critical distance, a speaking object far field rate, speaking object far field distance constraint information, speaking object near field distance constraint information, speaking object height constraint information, the real-time simulation module 4553 further configured to: generating a first random number, and determining the distance between the speaking object and the microphone reference position by using the far-field distance constraint information of the speaking object when the far-field rate of the speaking object is smaller than the first random number; when the far field rate of the speaking object is greater than or equal to the first random number, determining the distance between the speaking object and the microphone reference position by utilizing the near field distance constraint information of the speaking object; randomly generating an azimuth of the speaking object relative to the microphone; a speaking object location is determined based on the microphone reference location, a distance between the speaking object and the microphone reference location, the azimuth, and the speaking object height constraint information.
In some embodiments, the speaking object location includes an x-coordinate of the speaking object, a y-coordinate of the speaking object, and a z-coordinate of the speaking object, the real-time simulation module 4553 further configured to: determining a first offset distance in an x-axis direction of the speaking object and the microphone reference position and a second offset distance in a y-axis direction of the speaking object and the microphone reference position based on a distance and an azimuth between the speaking object and the microphone reference position; determining an initial x-coordinate of the speaking object based on the x-coordinate in the microphone reference location and the first offset distance; determining an initial y-coordinate of the speaking object based on the y-coordinate in the microphone reference location and the second offset distance; determining a z-coordinate of the speaking object based on the speaking object height constraint information; and when determining that at least one of the initial x coordinate and the initial y coordinate needs to be corrected based on the initial x coordinate, the initial y coordinate and the second wall critical distance, correcting the at least one of the initial x coordinate and the initial y coordinate to be corrected, and obtaining at least one of the corrected x coordinate and the corrected y coordinate.
In some embodiments, the reverberation noise simulation parameters include a third wall critical distance, noise source height constraint information, the noise source location including an x-coordinate of the noise source, a y-coordinate of the noise source, and a z-coordinate of the noise source, the real-time simulation module 4553 further configured to: determining an x coordinate of a noise source based on a target length and a third wall critical distance in the target size information of the simulated acoustic environment; determining a y coordinate of a noise source based on a target width and a third wall critical distance in the target size information of the simulated acoustic environment; based on the noise height constraint information, a z-coordinate of a noise source is determined.
In some embodiments, model training module 4556 is further to: acquiring preset signal-to-noise ratio parameters and normalization parameters, determining a target signal-to-noise ratio based on the signal-to-noise ratio parameters, and determining a target normalization value based on the normalization parameters; determining a weight coefficient of the noise reverberation result based on the target signal-to-noise ratio and the sample reverberation result; determining a weighted reverberation result based on the sample reverberation result, the noise reverberation result, and the weight coefficient; normalizing the weighted reverberation result based on the target normalization value to obtain an extended training voice sample for the initial training voice sample.
In some embodiments, model training module 4556 is further to: caching the first batch of extended training voice samples into a sample sharing queue; in the model training thread, training a preset voice recognition model by using the initial training voice sample and the extended training voice sample of the first batch; obtaining the subsequent batch of the expanded training voice samples, carrying out simulation processing on the initial training voice samples of the subsequent batch in parallel in a real-time simulation thread, correspondingly obtaining the subsequent batch of the expanded training voice samples, and caching the subsequent batch of the expanded training voice samples into the sample sharing queue; and in the model training thread, training the preset voice recognition model continuously by utilizing the initial training voice samples and the extended training voice samples of the subsequent batches until a trained voice recognition model is obtained.
In some embodiments, the apparatus further comprises: the data storage module is used for determining target size information of a simulated acoustic environment, the microphone target position, the speaking object position, the first room impact response voice data and the second room impact response voice data when the extended training voice sample is generated as acoustic scene data, and adding the acoustic scene data to a scene cache queue; the third acquisition module is used for acquiring a new scene simulation rate generated randomly when a new extended training voice sample is required to be generated for a new initial training voice sample; a fourth obtaining module, configured to randomly obtain target acoustic scene data from the scene cache queue when the new scene simulation rate is less than a preset simulation rate threshold; a first generation module for generating a new extended training speech sample for the new initial training speech sample using the new initial training speech sample, the first room impulse response speech data and the second room impulse response speech data in the target acoustic scene data.
The real-time simulation module is further used for performing real-time simulation in the internal memory based on the room impact response simulation parameters and the reverberation noise simulation parameters when the new scene simulation rate is greater than or equal to the preset simulation rate threshold value, so as to obtain new first room impact response voice data corresponding to the speaking object position and new second room impact response voice data corresponding to the noise source position; the first convolution module is further configured to convolve a new initial training speech sample with the new first room impact response speech data in the internal memory, so as to obtain a new sample reverberation result; the second convolution module is further used for obtaining new noise audio data, and carrying out convolution processing on the new noise audio data and the new second room impact response voice data in the internal memory to obtain a new noise reverberation result; a second generation module for generating new extended training speech samples for the new initial training speech samples using the new sample reverberation result and the new noise reverberation result in the internal memory.
Embodiments of the present application provide a computer program product comprising a computer program or computer-executable instructions stored in a computer-readable storage medium. The processor of the electronic device reads the computer-executable instructions from the computer-readable storage medium, and the processor executes the computer-executable instructions, so that the electronic device executes the model training method described in the embodiment of the application.
The present embodiments provide a computer-readable storage medium storing computer-executable instructions or a computer program, which, when executed by a processor, cause the processor to perform the model training method provided by the embodiments of the present application, for example, the model training method shown in fig. 3A and fig. 6.
In some embodiments, the computer readable storage medium may be RAM, ROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; but may be a variety of devices including one or any combination of the above memories.
In some embodiments, computer-executable instructions may be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, in the form of programs, software modules, scripts, or code, and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, computer-executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, such as in one or more scripts in a hypertext markup language (Hyper Text Markup Language, HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, computer-executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modifications, equivalent substitutions, improvements, etc. that are within the spirit and scope of the present application are intended to be included within the scope of the present application.

Claims (15)

1. A method of model training, the method comprising:
reading an initial training voice sample from an external memory, and loading the initial training voice sample into an internal memory;
acquiring preset room impact response simulation parameters and reverberation noise simulation parameters;
performing real-time simulation in the internal memory based on the room impact response simulation parameters and the reverberation noise simulation parameters to obtain first room impact response voice data corresponding to the speaking object position and second room impact response voice data corresponding to the noise source position;
performing convolution processing on the initial training voice sample and the first room impact response voice data in the internal memory to obtain a sample reverberation result;
Acquiring noise audio data, and carrying out convolution processing on the noise audio data and the second room impulse response voice data in the internal memory to obtain a noise reverberation result;
and generating an extended training voice sample for the initial training voice sample by utilizing the sample reverberation result and the noise reverberation result in the internal memory, and training a preset voice recognition model by utilizing the initial training voice sample and the extended training voice sample to obtain a trained voice recognition model.
2. The method of claim 1, wherein the room impact response simulation parameters include size constraint information, reverberation time constraint information, signal-to-noise ratio constraint information, microphone constraint information, and speaking object constraint information of the simulated acoustic environment, and wherein performing real-time simulation in the internal memory based on the room impact response simulation parameters and the reverberation noise simulation parameters to obtain first room impact response voice data corresponding to a speaking object position and second room impact response voice data corresponding to a noise source position includes:
determining target size information and wall absorption coefficients of the simulated acoustic environment in the internal memory based on the size constraint information and reverberation time constraint information of the simulated acoustic environment;
Determining a microphone reference position and a microphone target position according to the target size information of the simulated acoustic environment and the microphone constraint information;
determining a speaking object position according to the microphone reference position and the speaking object constraint information;
determining a noise source position according to the target size information of the simulated acoustic environment and the reverberation noise simulation parameters;
generating first room impact response voice data corresponding to the speaking object position according to the target size information of the simulated acoustic environment, the microphone target position and the speaking object position;
and generating second room impulse response voice data corresponding to the noise source position according to the target size information of the simulated acoustic environment, the microphone target position and the noise source position.
3. The method of claim 2, wherein determining target size information and wall absorption coefficients of the simulated acoustic environment in the internal memory based on the size constraint information, reverberation time constraint information of the simulated acoustic environment comprises:
determining a target length of the simulated acoustic environment in the internal memory based on a maximum length and a minimum length in size constraint information of the simulated acoustic environment;
Determining a target width of the simulated acoustic environment based on the maximum width and the minimum width in the size constraint information of the simulated acoustic environment;
determining a target height of the simulated acoustic environment based on the maximum height and the minimum height in the size constraint information of the simulated acoustic environment;
determining a target reverberation time based on the reverberation time constraint information;
a wall absorption coefficient is determined based on the target length, target width, target height, and target reverberation time of the simulated acoustic environment.
4. The method of claim 2, wherein the microphone constraint information includes a first wall critical distance, a microphone maximum height, a microphone minimum height, a microphone array type, a number of microphones, and a microphone pitch, and wherein determining the microphone reference location and the microphone target location based on the target size information of the simulated acoustic environment and the microphone constraint information comprises:
determining a microphone reference position according to the target size information of the simulated acoustic environment, the first wall critical distance, the maximum microphone height and the minimum microphone height;
and determining a microphone target position according to the microphone reference position, the microphone array type, the microphone number and the microphone spacing.
5. The method of claim 4, wherein the speaking object constraint information includes a second wall critical distance, a speaking object far field rate, speaking object far field distance constraint information, speaking object near field distance constraint information, speaking object height constraint information, and wherein determining the speaking object position based on the microphone reference position and the speaking object constraint information comprises:
generating a first random number, and determining the distance between the speaking object and the microphone reference position by using the far-field distance constraint information of the speaking object when the far-field rate of the speaking object is smaller than the first random number;
when the far field rate of the speaking object is greater than or equal to the first random number, determining the distance between the speaking object and the microphone reference position by utilizing the near field distance constraint information of the speaking object;
randomly generating an azimuth of the speaking object relative to the microphone;
a speaking object location is determined based on the microphone reference location, a distance between the speaking object and the microphone reference location, the azimuth, and the speaking object height constraint information.
6. The method of claim 5, wherein the speaking object location comprises an x-coordinate of the speaking object, a y-coordinate of the speaking object, and a z-coordinate of the speaking object, wherein the determining the speaking object location based on the microphone reference location, the distance between the speaking object and the microphone reference location, the azimuth, the speaking object altitude constraint information comprises:
Determining a first offset distance in an x-axis direction of the speaking object and the microphone reference position and a second offset distance in a y-axis direction of the speaking object and the microphone reference position based on a distance and an azimuth between the speaking object and the microphone reference position;
determining an initial x-coordinate of the speaking object based on the x-coordinate in the microphone reference location and the first offset distance;
determining an initial y-coordinate of the speaking object based on the y-coordinate in the microphone reference location and the second offset distance;
determining a z-coordinate of the speaking object based on the speaking object height constraint information;
and when determining that at least one of the initial x coordinate and the initial y coordinate needs to be corrected based on the initial x coordinate, the initial y coordinate and the second wall critical distance, correcting the at least one of the initial x coordinate and the initial y coordinate to be corrected, and obtaining at least one of the corrected x coordinate and the corrected y coordinate.
7. The method of claim 2, wherein the reverberation noise simulation parameters include a third wall critical distance, noise source height constraint information, the noise source location includes an x-coordinate of the noise source, a y-coordinate of the noise source, and a z-coordinate of the noise source, and wherein determining the noise source location based on the target size information of the simulated acoustic environment and the reverberation noise simulation parameters comprises:
Determining an x coordinate of a noise source based on a target length and a third wall critical distance in the target size information of the simulated acoustic environment;
determining a y coordinate of a noise source based on a target width and a third wall critical distance in the target size information of the simulated acoustic environment;
based on the noise height constraint information, a z-coordinate of a noise source is determined.
8. The method of any of claims 1 to 7, wherein the generating an augmented training speech sample using the sample reverberation result and the noise reverberation result comprises:
acquiring preset signal-to-noise ratio parameters and normalization parameters, determining a target signal-to-noise ratio based on the signal-to-noise ratio parameters, and determining a target normalization value based on the normalization parameters;
determining a weight coefficient of the noise reverberation result based on the target signal-to-noise ratio and the sample reverberation result;
determining a weighted reverberation result based on the sample reverberation result, the noise reverberation result, and the weight coefficient;
and normalizing the weighted reverberation result based on the target normalization value to obtain an extended training voice sample.
9. The method according to any one of claims 1 to 7, wherein the initial training speech samples include at least a first batch of initial training speech samples, and correspondingly the extended training speech samples include at least a first batch of extended training speech samples, and training a preset speech recognition model by using the initial training speech samples and the extended training speech samples to obtain a trained speech recognition model includes:
Caching the first batch of extended training voice samples into a sample sharing queue;
in the model training thread, training a preset voice recognition model by using the initial training voice sample and the extended training voice sample of the first batch;
acquiring initial training voice samples of a subsequent batch, carrying out simulation processing on the initial training voice samples of the subsequent batch in parallel in a real-time simulation thread, correspondingly acquiring extended training voice samples of the subsequent batch, and caching the extended training voice samples of the subsequent batch into the sample sharing queue;
and in the model training thread, training the voice recognition model continuously by utilizing the initial training voice samples and the extended training voice samples of the subsequent batches until a trained voice recognition model is obtained.
10. The method as recited in claim 9, wherein the method further comprises:
establishing a plurality of real-time simulation threads;
the simulation processing is performed on the initial training voice samples of the subsequent batch in parallel in the real-time simulation thread, and the expanding training voice samples of the subsequent batch are correspondingly obtained, which comprises the following steps:
distributing the initial training voice samples of the same batch to the plurality of real-time simulation threads;
And performing simulation processing on the initial training voice samples distributed to the multiple real-time simulation threads in parallel to obtain the expanded training voice samples of the initial training voice samples of the same batch.
11. The method according to any one of claims 2 to 7, further comprising:
determining, as acoustic scene data, target size information of the simulated acoustic environment, the microphone target position, the speaking object position, the first room impulse response voice data, and the second room impulse response voice data used when the extended training voice sample is generated, and adding the acoustic scene data to a scene cache queue;
when a new extended training voice sample needs to be generated for a new initial training voice sample, acquiring a randomly generated new scene simulation rate;
when the new scene simulation rate is smaller than a preset simulation rate threshold, randomly acquiring target acoustic scene data from the scene cache queue;
and generating, in the internal memory, a new extended training voice sample for the new initial training voice sample by using the new initial training voice sample and the first room impulse response voice data and the second room impulse response voice data in the target acoustic scene data.
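A sketch of the scene cache queue of claims 11 and 12, assuming the scene simulation rate is a uniform random draw and the rate threshold is a fixed constant; the class name, simulate_scene_fn, and the default values are illustrative, not the patent's.

```python
import random
from collections import deque

class SceneCache:
    """Reuse a cached acoustic scene (room size, positions, RIR pair) when a
    randomly drawn rate falls below the threshold; otherwise simulate and cache
    a fresh scene."""
    def __init__(self, simulate_scene_fn, rate_threshold=0.7, maxlen=1000):
        self.simulate_scene_fn = simulate_scene_fn  # returns new acoustic scene data
        self.rate_threshold = rate_threshold        # assumed preset simulation rate threshold
        self.cache = deque(maxlen=maxlen)           # the scene cache queue

    def get_scene(self):
        rate = random.random()                      # randomly generated scene simulation rate
        if self.cache and rate < self.rate_threshold:
            return random.choice(self.cache)        # reuse cached RIRs, skipping re-simulation
        scene = self.simulate_scene_fn()            # real-time simulation of a new scene
        self.cache.append(scene)
        return scene
```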
12. The method as recited in claim 11, wherein the method further comprises:
when the new scene simulation rate is greater than or equal to the preset simulation rate threshold, performing real-time simulation in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain new first room impulse response voice data corresponding to the speaking object position and new second room impulse response voice data corresponding to the noise source position;
performing convolution processing on the new initial training voice sample and the new first room impulse response voice data in the internal memory to obtain a new sample reverberation result;
acquiring new noise audio data, and performing convolution processing on the new noise audio data and the new second room impulse response voice data in the internal memory to obtain a new noise reverberation result;
and generating, in the internal memory, a new extended training voice sample for the new initial training voice sample by using the new sample reverberation result and the new noise reverberation result.
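For the convolution steps themselves, a SciPy-based sketch is shown below; generating the room impulse responses (for example with an image-source simulator) is assumed to happen elsewhere, and truncating the outputs to the input lengths is one convention among several.

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_rirs(speech, noise, speech_rir, noise_rir):
    """Convolve clean speech with the speaker-position RIR and noise audio with
    the noise-source-position RIR, entirely in memory, so the two reverberation
    results can be mixed afterwards."""
    speech_reverb = fftconvolve(speech, speech_rir, mode="full")[:len(speech)]
    noise_reverb = fftconvolve(noise, noise_rir, mode="full")[:len(noise)]
    return speech_reverb.astype(np.float32), noise_reverb.astype(np.float32)
```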
13. A model training apparatus, the apparatus comprising:
a first acquisition module, configured to read an initial training voice sample from an external memory and load the initial training voice sample into an internal memory;
a second acquisition module, configured to acquire preset room impulse response simulation parameters and reverberation noise simulation parameters;
a real-time simulation module, configured to perform real-time simulation in the internal memory based on the room impulse response simulation parameters and the reverberation noise simulation parameters to obtain first room impulse response voice data corresponding to a speaking object position and second room impulse response voice data corresponding to a noise source position;
a first convolution module, configured to perform convolution processing on the initial training voice sample and the first room impulse response voice data in the internal memory to obtain a sample reverberation result;
a second convolution module, configured to acquire noise audio data and perform convolution processing on the noise audio data and the second room impulse response voice data in the internal memory to obtain a noise reverberation result;
and a model training module, configured to generate an extended training voice sample for the initial training voice sample by using the sample reverberation result and the noise reverberation result in the internal memory, and to train a preset voice recognition model by using the initial training voice sample and the extended training voice sample to obtain a trained voice recognition model.
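The modules of claim 13 compose naturally into a small pipeline; the sketch below only shows that composition, with every callable injected so the earlier sketches (or any other implementation) can be plugged in. All names are illustrative.

```python
class ModelTrainingPipeline:
    """Composition sketch mirroring the modules of claim 13."""
    def __init__(self, load_sample, simulate_rirs, apply_rirs, mix_at_snr, train_step):
        self.load_sample = load_sample      # first acquisition module: read sample into memory
        self.simulate_rirs = simulate_rirs  # real-time simulation module: RIRs for speaker and noise source
        self.apply_rirs = apply_rirs        # first and second convolution modules
        self.mix_at_snr = mix_at_snr        # mixing step that yields the extended sample
        self.train_step = train_step        # model training module

    def step(self, sample_path, noise, snr_db):
        speech = self.load_sample(sample_path)
        speech_rir, noise_rir = self.simulate_rirs()
        speech_rev, noise_rev = self.apply_rirs(speech, noise, speech_rir, noise_rir)
        extended = self.mix_at_snr(speech_rev, noise_rev, snr_db)
        return self.train_step([speech, extended])
```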
14. An electronic device, the electronic device comprising:
a memory for storing computer-executable instructions;
a processor for implementing the method of any one of claims 1 to 12 when executing the computer-executable instructions or a computer program stored in the memory.
15. A computer-readable storage medium storing computer-executable instructions or a computer program, wherein the computer-executable instructions or the computer program, when executed by a processor, implement the method of any one of claims 1 to 12.
CN202410049459.4A 2024-01-12 2024-01-12 Model training method, device, equipment and computer readable storage medium Active CN117558266B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410049459.4A CN117558266B (en) 2024-01-12 2024-01-12 Model training method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN117558266A (en) 2024-02-13
CN117558266B (en) 2024-03-22

Family

ID=89817147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410049459.4A Active CN117558266B (en) 2024-01-12 2024-01-12 Model training method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117558266B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108242234A (en) * 2018-01-10 2018-07-03 腾讯科技(深圳)有限公司 Speech recognition modeling generation method and its equipment, storage medium, electronic equipment
CN111210802A (en) * 2020-01-08 2020-05-29 厦门亿联网络技术股份有限公司 Method and system for generating reverberation voice data
CN111414669A (en) * 2018-12-19 2020-07-14 北京猎户星空科技有限公司 Audio data processing method and device
CN115331689A (en) * 2022-08-11 2022-11-11 北京声智科技有限公司 Training method, device, equipment, storage medium and product of voice noise reduction model
KR20230044574A (en) * 2021-09-27 2023-04-04 브레인소프트주식회사 Data augmentation method using fundamental freuqency obtained by dj transform
CN116705071A (en) * 2023-06-12 2023-09-05 中国科学技术大学 Playback voice detection method based on data enhancement and pre-training model feature extraction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11043207B2 (en) * 2019-06-14 2021-06-22 Nuance Communications, Inc. System and method for array data simulation and customized acoustic modeling for ambient ASR

Also Published As

Publication number Publication date
CN117558266A (en) 2024-02-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant