CN113470628A - Voice recognition method and device - Google Patents
- Publication number: CN113470628A (application CN202110792834.0A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption by Google, not a legal conclusion; no legal analysis has been performed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
This application discloses a speech recognition method and apparatus for enhancing the robustness of a speech recognition model. The method comprises the following steps: masking predetermined room impulse response (RIR) data; convolving the masked RIR data with original speech data to obtain new speech data; and training a speech recognition model with the new speech data.
Description
Technical Field
The present application relates to the field of information technology, and in particular, to a method and an apparatus for speech recognition.
Background
Existing speech recognition technology relies mainly on deep-learning algorithms. To obtain a model with a high recognition rate, a large amount of speech data matched to real scenes is needed; room reverberation and the distance and angle between the speaker and the microphone are among the important factors affecting model performance. However, the reverberation of occluded or irregularly shaped rooms is difficult to simulate algorithmically. For example, the recognition rate drops significantly when the speaker is in a restaurant while the microphone is in a living room, when the speaker faces away from the microphone, or when there is an obstruction between the speaker and the microphone. Large amounts of such reverberation data are also difficult to collect, so massive data coverage of these conditions cannot be achieved.
Disclosure of Invention
The embodiment of the application provides a voice recognition method and a voice recognition device, which are used for enhancing the robustness of a voice recognition model.
The voice recognition method provided by the embodiment of the application comprises the following steps:
masking predetermined room impulse response, RIR, data;
convolving the masked RIR data with original voice data to obtain new voice data;
and training a voice recognition model by using the new voice data.
With this method, predetermined room impulse response (RIR) data is masked; the masked RIR data is convolved with original speech data to obtain new speech data; and the new speech data is used to train the speech recognition model. This enhances the robustness of the model and improves its recognition rate under conditions such as room occlusion and varied speaker angles. The method is simple, efficient, and widely applicable.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, an average value over a portion of the RIR data, or a random number.

Optionally, the RIR data for the time period comprises one or more segments of the RIR data.

Optionally, the starting position of the time period is a random value within a preset range starting from 0.

Optionally, the duration of the time period is a random value within a preset range starting from 0.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
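As a minimal illustrative sketch of these steps (the function names and the use of NumPy are assumptions for illustration, not part of the application):

```python
import numpy as np

def mask_rir(rir, mask_start, mask_len, fill=0.0):
    """Replace a time period of the RIR with a preset value (zero here;
    the text also allows a mean or a random number)."""
    masked = rir.copy()
    masked[mask_start:mask_start + mask_len] = fill
    return masked

def augment_speech(speech, rir, mask_start, mask_len):
    """Convolve the masked RIR with the original speech data to obtain
    new (reverberant) speech data."""
    masked_rir = mask_rir(rir, mask_start, mask_len)
    # Full convolution appends the RIR tail; trim back to the speech length
    return np.convolve(speech, masked_rir)[:len(speech)]
```

The resulting array would then be used for model training in place of, or alongside, the original speech data.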
An embodiment of the present application provides a speech recognition apparatus, including:
a first unit for masking predetermined room impulse response RIR data;
the second unit is used for convolving the masked RIR data with the original voice data to obtain new voice data;
a third unit for training a speech recognition model using the new speech data.
Another embodiment of the present application provides a computing device, which includes a memory and a processor, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions stored in the memory and executing any one of the above methods according to the obtained program.
Another embodiment of the present application provides a computer storage medium having stored thereon computer-executable instructions for causing a computer to perform any one of the methods described above.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described here cover only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic representation of RIR data provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of masked RIR data provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a voice recognition method and a voice recognition device, which are used for enhancing the robustness of a voice recognition model, so that the voice recognition rate of the voice recognition model under the conditions of room shielding, multi-angle and the like is improved, and the method is simple, efficient and high in applicability.
The method and the device are based on the same application concept, and because the principles of solving the problems of the method and the device are similar, the implementation of the device and the method can be mutually referred, and repeated parts are not repeated.
The embodiments of the application aim to improve the generalization capability of the speech recognition model. By randomly masking the room impulse response (RIR) data, the diversity of samples is increased, so that the model can recognize speech accurately without depending on specific direct or reflected sound, improving its robustness to occlusion, irregularly shaped (multi-angle) rooms, and similar conditions. Room reverberation consists of direct sound and reflected sound: direct sound is the sound signal that propagates from the source to the microphone without reflection, while reflected sound is the remainder of the signal, which reaches the microphone after being reflected and partially absorbed by obstacles.
The technical scheme provided by the embodiment of the application mainly comprises the following three aspects:
1. Generate RIR data by simulation, or directly use RIR data collected in a real scene. The RIR data may be collected in various ways, for example by playing an excitation signal in the room and recording the response it elicits.
2. Generate a set of random numbers: a random starting position and a random duration for masking the RIR data, so that a random portion of the direct sound or the room reflected sound is masked. That is, this step determines a time period for masking the RIR data, where the starting position of the time period is a random value within a preset range starting from 0, and its duration is a random value within a preset range starting from 0.
3. Replace the RIR data within that time period with zero, and convolve the resulting new RIR data with the original speech data to obtain new speech data. The convolution operation multiplies the convolution kernel with the corresponding elements of the audio signal and sums the products. For example, the one-dimensional discrete convolution is s(n) = Σ_{m=0}^{N−1} f(m)·g(n−m), where f(n) is the RIR, g(n) is the original audio signal, N is the length of f(n), and s(n) is the calculation result.
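As a sketch, the one-dimensional discrete convolution s(n) = Σ_m f(m)·g(n−m) can be computed directly (the explicit double loop below is for illustration only) and checked against NumPy's built-in convolution:

```python
import numpy as np

def convolve_direct(f, g):
    """One-dimensional discrete convolution: s(n) = sum_m f(m) * g(n - m)."""
    N, M = len(f), len(g)
    s = np.zeros(N + M - 1)
    for n in range(len(s)):
        for m in range(N):
            if 0 <= n - m < M:   # only terms where g's index is valid
                s[n] += f[m] * g[n - m]
    return s
```

In practice a library routine such as `numpy.convolve` (or an FFT-based convolution) would be used for efficiency.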
In the embodiments of the application, RIR data obtained by simulation (for example, with the image method) is randomly masked during the data augmentation stage, which enhances the generalization of the speech recognition model, in particular its robustness in irregularly shaped rooms with obstructions and when the speaker faces away from the microphone. Data augmentation refers to expanding the original data through methods such as adding noise or reverberation.
The technical scheme provided by the embodiment of the application comprises the following steps:
Step one: generate RIR data by simulation, or use RIR data collected in a real scene. The RIR data can be generated with existing simulation tools, which allow efficient generation of room impulse responses.
The RIR data depends on parameters such as the distance and angle between the microphone and the speaker, the wall reflectivity, the room dimensions (length, width, and height), and the reverberation time. The wall reflectivity depends on the wall material and is a manually set parameter. The reverberation time is the time required for the sound to decay by 60 dB after the source stops, and can be estimated with the Sabine formula (an empirical formula). This simulation approach (e.g., pyroomacoustics) can only generate data for an unobstructed rectangular room environment.
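A common form of the Sabine empirical formula estimates the reverberation time as RT60 = 0.161·V/(α·S), where V is the room volume, S its total surface area, and α the average absorption coefficient. A hedged sketch (the constant 0.161 s/m and the averaged-α form are assumptions of this common textbook version, not taken from the application):

```python
def sabine_rt60(length, width, height, absorption):
    """Sabine's empirical formula: RT60 = 0.161 * V / (alpha * S), in seconds.
    length/width/height in metres; absorption is the average wall absorption
    coefficient (a manually set parameter, per the description)."""
    volume = length * width * height
    surface = 2.0 * (length * width + length * height + width * height)
    return 0.161 * volume / (absorption * surface)
```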
The method for generating the RIR data is not unique; the technical solution of this application does not depend on how the RIR is generated, and RIR data generated in any manner, or collected in a real scene, can be used directly.
For example, referring to fig. 1, an RIR sample is generated by simulation; the abscissa is time and the ordinate is the room impulse response amplitude (Amp) at that time. Fig. 1 shows an RIR of 8000 sampling points at a sampling rate of 16000, for a total duration of half a second. In step two, the RIR sample is convolved with the original speech data to obtain new reverberant speech data. The original speech data is speech collected with a high-fidelity microphone in an anechoic room without reverberation; it is the raw material to which the RIR adds reverberation. The new speech data contains the direct sound and the room reflections of the original speech.
Step two: the embodiment randomly masks the room reflected sound (the masked content is random and may include direct sound or reflected sound). Masking here means replacing valid values with zero (values other than zero may also be used), with the aim of reducing the model's dependence on part of the room reflected sound, so that it can recognize accurately even in the presence of an obstruction. Direct sound is the signal that travels from the source straight to the microphone; early reflected sound refers, for example, to reflections within 100 ms after the direct sound (this value is not fixed and can be chosen according to actual needs).
Specifically, the method comprises the following steps:
First, determine the masking duration: for example, draw a uniformly distributed random value s from 0 to 200; if s equals 200, then 200 consecutive sampling points of the RIR data are masked. At the 16 kHz audio sampling rate shown in fig. 1, 200 sampling points correspond to 12.5 ms. Denote the masking duration mask_len. Then determine the masking start position: for example, draw a uniformly distributed random value from 0 to 1000 (if the random number is 100, the 100th to 300th sampling points of the RIR are masked), denoted mask_start. The sampling points from mask_start to mask_start + mask_len are then masked, i.e. their data is replaced with zero; the masked RIR data is shown in fig. 2.
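The two random draws described above can be sketched as follows. The numeric bounds (duration uniform on 0 to 200 samples, start uniform on 0 to 1000 samples) come from the example in the text, and 16 kHz is the sampling rate used in fig. 1; the helper names are illustrative:

```python
import numpy as np

def draw_mask_period(rng, max_len=200, max_start=1000):
    """Draw the mask duration and start position uniformly, per the example bounds."""
    mask_len = int(rng.integers(0, max_len + 1))      # 0..200 samples
    mask_start = int(rng.integers(0, max_start + 1))  # 0..1000 samples
    return mask_start, mask_len

def samples_to_ms(n_samples, sample_rate=16000):
    """Convert a sample count to milliseconds (200 samples at 16 kHz = 12.5 ms)."""
    return 1000.0 * n_samples / sample_rate
```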
Then convolve the new RIR data with the original speech data to obtain new speech data. For example, if mask_len equals 100 and mask_start is 500, the reverberant room simulated by the new RIR has the reflections from 31.25 ms to 37.5 ms after the direct sound masked: at a 16 kHz sampling rate, 500/16000 = 0.03125 s (31.25 ms), 100/16000 = 0.00625 s (6.25 ms), and 0.03125 + 0.00625 = 0.0375 s (37.5 ms).
Finally, the new speech data is used for training of the speech recognition model.
Fig. 3 is a schematic flow chart of a speech recognition method according to an embodiment of the present application.
It should be noted that the technical solutions above are only examples, and the implementation is not unique. For instance, in the random masking of the RIR data, the embodiments may mask a continuous stretch of early reflected sound, or randomly mask several segments of the room impulse response (i.e., the RIR data in fig. 1) with different lengths; all of these fall within the scope of the embodiments. Likewise, the masked positions are replaced with zero here, but may instead be replaced with the average of the whole RIR data or with other random numbers; these too fall within the scope of the embodiments.
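The multi-segment variant can be sketched as follows (a hedged illustration: the segment count and length bound are arbitrary choices, and the fill value could equally be the RIR mean or a random number, per the text):

```python
import numpy as np

def mask_random_segments(rir, n_segments, rng, max_len=200, fill=0.0):
    """Randomly mask several segments of the RIR with different lengths."""
    masked = rir.copy()
    for _ in range(n_segments):
        seg_len = int(rng.integers(1, max_len + 1))
        # choose a start so the segment stays within the RIR
        seg_start = int(rng.integers(0, max(1, len(rir) - seg_len)))
        masked[seg_start:seg_start + seg_len] = fill
    return masked
```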
In summary, referring to fig. 4, a speech recognition method provided in the embodiment of the present application includes:
s101, masking predetermined room impulse response RIR data;
s102, convolving the masked RIR data with original voice data to obtain new voice data;
and S103, training a voice recognition model by using the new voice data.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, or is an average value of a part of the RIR data in the RIR data, or is a random number.
Optionally, the RIR data for the time period comprises one or more periods of RIR data in the RIR data.
Optionally, the starting position of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 1000.
Optionally, the duration of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 200.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
Referring to fig. 5, a computing device provided in this embodiment of the present application may be any kind of terminal device, such as an intelligent appliance (or a network device), and the apparatus includes:
a memory 11 for storing program instructions;
a processor 12 for calling the program instructions stored in the memory and executing, according to the obtained program:
masking predetermined room impulse response, RIR, data;
convolving the masked RIR data with original voice data to obtain new voice data;
and training a voice recognition model by using the new voice data.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, or is an average value of a part of the RIR data in the RIR data, or is a random number.
Optionally, the RIR data for the time period comprises one or more periods of RIR data in the RIR data.
Optionally, the starting position of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 1000.
Optionally, the duration of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 200.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
Referring to fig. 6, a speech recognition apparatus provided in this embodiment of the present application may be any kind of terminal device, such as an intelligent appliance (or a network device), for example, and the apparatus includes:
a first unit 21 for masking predetermined room impulse response RIR data;
a second unit 22, configured to convolve the masked RIR data with original voice data to obtain new voice data;
a third unit 23 for training a speech recognition model with the new speech data.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, or is an average value of a part of the RIR data in the RIR data, or is a random number.
Optionally, the RIR data for the time period comprises one or more periods of RIR data in the RIR data.
Optionally, the starting position of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 1000.
Optionally, the duration of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 200.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application provides a computing device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), and the like. The computing device may include a Central Processing Unit (CPU), memory, input/output devices, etc., the input devices may include a keyboard, mouse, touch screen, etc., and the output devices may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), etc.
The memory may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with program instructions and data stored in the memory. In the embodiments of the present application, the memory may be used for storing a program of any one of the methods provided by the embodiments of the present application.
The processor is used for executing any one of the methods provided by the embodiment of the application according to the obtained program instructions by calling the program instructions stored in the memory.
Embodiments of the present application provide a computer storage medium for storing computer program instructions for an apparatus provided in the embodiments of the present application, which includes a program for executing any one of the methods provided in the embodiments of the present application.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The technical scheme provided by the embodiment of the application can be applied to terminal equipment and network equipment.
The terminal device may also be referred to as user equipment (UE), a mobile station (MS), or a mobile terminal. Optionally, the terminal can communicate with one or more core networks via a radio access network (RAN). For example, the terminal may be an intelligent appliance, a mobile phone (or "cellular" phone), or a computer with mobility; it may also be a portable, pocket-sized, hand-held, computer-built-in, or vehicle-mounted mobile device.
A network device may be a base station (e.g., access point) that refers to a device in an access network that communicates over the air-interface, through one or more sectors, with wireless terminals. The base station may be configured to interconvert received air frames and IP packets as a router between the wireless terminal and the rest of the access network, which may include an Internet Protocol (IP) network. The base station may also coordinate management of attributes for the air interface. For example, the Base Station may be a Base Transceiver Station (BTS) in GSM or CDMA, a Base Station (NodeB) in WCDMA, an evolved Node B (NodeB or eNB or e-NodeB) in LTE, or a gNB in 5G system. The embodiments of the present application are not limited.
The above method process flow may be implemented by a software program, which may be stored in a storage medium, and when the stored software program is called, the above method steps are performed.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (10)
1. A method of speech recognition, the method comprising:
masking predetermined room impulse response, RIR, data;
convolving the masked RIR data with original voice data to obtain new voice data;
and training a voice recognition model by using the new voice data.
2. The method of claim 1, wherein masking predetermined Room Impulse Response (RIR) data comprises:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
3. The method of claim 2, wherein the preset value is zero, or is an average value of a part of the RIR data, or is a random number.
4. The method of claim 2, wherein the period of RIR data comprises one or more periods of RIR data in the RIR data.
5. The method of claim 2, wherein the start position of the time period comprises a random value within a preset range from 0.
6. The method of claim 2, wherein the duration of the time period comprises a random value within a preset range from 0.
7. The method of claim 1, wherein the RIR data is generated by pre-simulation or acquired in a real scene.
8. A speech recognition apparatus, comprising:
a first unit for masking predetermined room impulse response RIR data;
the second unit is used for convolving the masked RIR data with the original voice data to obtain new voice data;
a third unit for training a speech recognition model using the new speech data.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to perform the method of any of claims 1 to 7 in accordance with the obtained program.
10. A computer storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 7.
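The augmentation pipeline of claims 1 to 6 (mask a random segment of the RIR, then convolve it with the original speech) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the function names, the synthetic exponentially decaying RIR, and the masking ranges (`max_start_frac`, `max_len_frac`) are assumptions made for demonstration only.

```python
import numpy as np

def mask_rir(rir, rng, max_start_frac=0.5, max_len_frac=0.2, fill="zero"):
    """Mask one random time period of an RIR (claims 2-6).

    The start position and duration are random values drawn from preset
    ranges beginning at 0, and the masked samples are replaced with a
    preset value: zero, the mean of the RIR, or random numbers (claim 3).
    The fraction limits here are illustrative assumptions.
    """
    rir = rir.copy()
    n = len(rir)
    start = int(rng.integers(0, max(1, int(n * max_start_frac))))   # claim 5
    length = int(rng.integers(1, max(2, int(n * max_len_frac))))    # claim 6
    end = min(n, start + length)
    if fill == "zero":
        rir[start:end] = 0.0
    elif fill == "mean":
        rir[start:end] = rir.mean()
    else:  # small random values, scaled to the RIR's amplitude
        rir[start:end] = rng.standard_normal(end - start) * np.abs(rir).max() * 0.01
    return rir

def augment(speech, rir, rng, fill="zero"):
    """Convolve speech with the masked RIR to obtain new training data (claim 1)."""
    masked = mask_rir(rir, rng, fill=fill)
    # truncate the full convolution back to the original utterance length
    return np.convolve(speech, masked)[: len(speech)]

# Toy example: a synthetic exponentially decaying RIR and a sine "utterance".
rng = np.random.default_rng(0)
sr = 16000
rir = np.exp(-np.arange(sr // 4) / 800.0) * rng.standard_normal(sr // 4)
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
new_speech = augment(speech, rir, rng)
```

In practice the masked-RIR speech would be added to the training set alongside the clean data, so the recognition model sees a wider variety of simulated room conditions than the unmasked RIRs alone would provide.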
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110792834.0A CN113470628B (en) | 2021-07-14 | 2021-07-14 | Voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113470628A (en) | 2021-10-01 |
CN113470628B (en) | 2024-05-31 |
Family
ID=77880297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110792834.0A Active CN113470628B (en) | 2021-07-14 | 2021-07-14 | Voice recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113470628B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5544249A (en) * | 1993-08-26 | 1996-08-06 | Akg Akustische U. Kino-Gerate Gesellschaft M.B.H. | Method of simulating a room and/or sound impression |
EP2028883A2 (en) * | 2007-08-22 | 2009-02-25 | Gwangju Institute of Science and Technology | Sound field generator and method of generating sound field using the same |
US20170316773A1 (en) * | 2015-01-20 | 2017-11-02 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Speech reproduction device configured for masking reproduced speech in a masked speech zone |
US20180253648A1 (en) * | 2017-03-01 | 2018-09-06 | Synaptics Inc | Connectionist temporal classification using segmented labeled sequence data |
CN108734138A (en) * | 2018-05-24 | 2018-11-02 | 浙江工业大学 | A kind of melanoma skin disease image classification method based on integrated study |
CN110379414A (en) * | 2019-07-22 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | Acoustic model enhances training method, device, readable storage medium storing program for executing and calculates equipment |
US10582299B1 (en) * | 2018-12-11 | 2020-03-03 | Amazon Technologies, Inc. | Modeling room acoustics using acoustic waves |
CN111159416A (en) * | 2020-04-02 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111210802A (en) * | 2020-01-08 | 2020-05-29 | 厦门亿联网络技术股份有限公司 | Method and system for generating reverberation voice data |
CN112257521A (en) * | 2020-09-30 | 2021-01-22 | 中国人民解放军军事科学院国防科技创新研究院 | CNN underwater acoustic signal target identification method based on data enhancement and time-frequency separation |
CN112633171A (en) * | 2020-12-23 | 2021-04-09 | 北京恒达时讯科技股份有限公司 | Sea ice identification method and system based on multi-source optical remote sensing image |
CN112767927A (en) * | 2020-12-29 | 2021-05-07 | 平安科技(深圳)有限公司 | Method, device, terminal and storage medium for extracting voice features |
Non-Patent Citations (3)
Title |
---|
TIEMIN MEI: "Room Impulse Response Reshaping/Shortening Based on Least Mean Squares Optimization with Infinity Norm Constraint", IEEE, 31 December 2009 (2009-12-31), pages 1 - 6 * |
ZHONG-QIU WANG: "Robust Speaker Localization Guided by Deep Learning Based Time-Frequency Masking", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 31 December 2018 (2018-12-31), pages 1 - 11 * |
JIA HAIRONG: "Speech Enhancement Algorithm Based on Time-Frequency Masking of a Dual-Channel Neural Network", Journal of Huazhong University of Science and Technology, vol. 49, no. 6, 30 June 2021 (2021-06-30), pages 43 - 49 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8908875B2 (en) | Electronic device with digital reverberator and method | |
US9940922B1 (en) | Methods, systems, and computer readable media for utilizing ray-parameterized reverberation filters to facilitate interactive sound rendering | |
CN110809214B (en) | Audio playing method, audio playing device and terminal equipment | |
CN109961797B (en) | Echo cancellation method and device and electronic equipment | |
EP4121957A1 (en) | Encoding reverberator parameters from virtual or physical scene geometry and desired reverberation characteristics and rendering using these | |
CN105225674B (en) | A kind of audio signal processing method, device and mobile terminal | |
CN108391199B (en) | virtual sound image synthesis method, medium and terminal based on personalized reflected sound threshold | |
CN107301028B (en) | Audio data processing method and device based on multi-person remote call | |
CN112565981B (en) | Howling suppression method, howling suppression device, hearing aid, and storage medium | |
CN110493703A (en) | Stereo audio processing method, system and the storage medium of virtual spectators | |
WO2019072180A1 (en) | Method and apparatus for allocating resources to application | |
CN113170268A (en) | Method and device for detecting probability silent fault | |
WO2024027295A1 (en) | Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product | |
CN108549486A (en) | The method and device of explanation is realized in virtual scene | |
CN111385688A (en) | Active noise reduction method, device and system based on deep learning | |
WO2015062109A1 (en) | Method and device for evaluating network key performance indicator | |
CN112770063B (en) | Image generation method and device | |
CN112333608B (en) | Voice data processing method and related product | |
CN113470628B (en) | Voice recognition method and device | |
CN111158907B (en) | Data processing method and device, electronic equipment and storage medium | |
CN113132136B (en) | Satisfaction degree prediction model establishment method, satisfaction degree prediction device and electronic equipment | |
CN115273795B (en) | Method and device for generating simulated impulse response and computer equipment | |
CN106604144A (en) | Video processing method and device | |
CN109362027B (en) | Positioning method, device, equipment and storage medium | |
CN113936676A (en) | Sound adjusting method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||