CN113470628A - Voice recognition method and device - Google Patents
- Publication number: CN113470628A (application CN202110792834.0A)
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption by Google, not a legal conclusion; no legal analysis has been performed)
Classifications
- G—PHYSICS; G10—MUSICAL INSTRUMENTS; ACOUSTICS; G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/063—Training (under G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
This application discloses a speech recognition method and apparatus for enhancing the robustness of a speech recognition model. The method comprises the following steps: masking predetermined room impulse response (RIR) data; convolving the masked RIR data with original speech data to obtain new speech data; and training a speech recognition model with the new speech data.
Description
Technical Field
The present application relates to the field of information technology, and in particular, to a method and an apparatus for speech recognition.
Background
Existing speech recognition technology relies mainly on deep-learning algorithms. To obtain a model with a high recognition rate, a large amount of speech data matched to real scenes is needed; room reverberation and the distance and angle between the speaker and the microphone are among the important factors affecting model performance. However, the reverberation of occluded or irregularly shaped rooms is difficult to simulate algorithmically. For example, the recognition rate drops significantly when the speaker is in a restaurant while the microphone is in a living room, when the speaker faces away from the microphone, or when there is an obstruction between the speaker and the microphone. Large amounts of such reverberation data are also difficult to collect, so massive data coverage of these conditions cannot be achieved.
Disclosure of Invention
The embodiment of the application provides a voice recognition method and a voice recognition device, which are used for enhancing the robustness of a voice recognition model.
The voice recognition method provided by the embodiment of the application comprises the following steps:
masking predetermined room impulse response, RIR, data;
convolving the masked RIR data with original voice data to obtain new voice data;
and training a voice recognition model by using the new voice data.
With this method, predetermined room impulse response (RIR) data is masked; the masked RIR data is convolved with original speech data to obtain new speech data; and the new speech data is used to train the speech recognition model. This enhances the robustness of the model and improves its recognition rate under conditions such as room occlusion and varied speaker angles. The method is simple, efficient, and widely applicable.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, an average value over a portion of the RIR data, or a random number.

Optionally, the RIR data for the time period comprises one or more segments of the RIR data.

Optionally, the starting position of the time period is a random value within a preset range starting from 0.

Optionally, the duration of the time period is a random value within a preset range starting from 0.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
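As a minimal illustrative sketch of these steps (the function names and the use of NumPy are assumptions for illustration, not part of the application):

```python
import numpy as np

def mask_rir(rir, mask_start, mask_len, fill=0.0):
    """Replace a time period of the RIR with a preset value (zero here;
    the text also allows a mean or a random number)."""
    masked = rir.copy()
    masked[mask_start:mask_start + mask_len] = fill
    return masked

def augment_speech(speech, rir, mask_start, mask_len):
    """Convolve the masked RIR with the original speech data to obtain
    new (reverberant) speech data."""
    masked_rir = mask_rir(rir, mask_start, mask_len)
    # Full convolution appends the RIR tail; trim back to the speech length
    return np.convolve(speech, masked_rir)[:len(speech)]
```

The resulting array would then be used for model training in place of, or alongside, the original speech data.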
An embodiment of the present application provides a speech recognition apparatus, including:
a first unit for masking predetermined room impulse response RIR data;
the second unit is used for convolving the masked RIR data with the original voice data to obtain new voice data;
a third unit for training a speech recognition model using the new speech data.
Another embodiment of the present application provides a computing device, which includes a memory and a processor, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions stored in the memory and executing any one of the above methods according to the obtained program.
Another embodiment of the present application provides a computer storage medium having stored thereon computer-executable instructions for causing a computer to perform any one of the methods described above.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for the description of the embodiments are briefly introduced below. The drawings described here cover only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic representation of RIR data provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of masked RIR data provided in an embodiment of the present application;
fig. 3 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 4 is a schematic flowchart of a speech recognition method according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides a voice recognition method and a voice recognition device, which are used for enhancing the robustness of a voice recognition model, so that the voice recognition rate of the voice recognition model under the conditions of room shielding, multi-angle and the like is improved, and the method is simple, efficient and high in applicability.
The method and the device are based on the same application concept, and because the principles of solving the problems of the method and the device are similar, the implementation of the device and the method can be mutually referred, and repeated parts are not repeated.
The embodiments of the application aim to improve the generalization capability of the speech recognition model. By randomly masking the room impulse response (RIR) data, the diversity of samples is increased, so that the model can recognize speech accurately without depending on specific direct or reflected sound, improving its robustness to occlusion, irregularly shaped (multi-angle) rooms, and similar conditions. Room reverberation consists of direct sound and reflected sound: direct sound is the sound signal that propagates from the source to the microphone without reflection, while reflected sound is the remainder of the signal, which reaches the microphone after being reflected and partially absorbed by obstacles.
The technical scheme provided by the embodiment of the application mainly comprises the following three aspects:
1. Generate RIR data by simulation, or directly use RIR data collected in a real scene. The RIR data may be collected in various ways, for example by playing an excitation signal in the room and recording the response it elicits.
2. Generate a set of random numbers: a random starting position and a random duration for masking the RIR data, so that a random portion of the direct sound or the room reflected sound is masked. That is, this step determines a time period for masking the RIR data, where the starting position of the time period is a random value within a preset range starting from 0, and its duration is a random value within a preset range starting from 0.
3. Replace the RIR data within that time period with zero, and convolve the resulting new RIR data with the original speech data to obtain new speech data. The convolution operation multiplies the convolution kernel with the corresponding elements of the audio signal and sums the products. For example, the one-dimensional discrete convolution is s(n) = Σ_{m=0}^{N−1} f(m)·g(n−m), where f(n) is the RIR, g(n) is the original audio signal, N is the length of f(n), and s(n) is the calculation result.
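As a sketch, the one-dimensional discrete convolution s(n) = Σ_m f(m)·g(n−m) can be computed directly (the explicit double loop below is for illustration only) and checked against NumPy's built-in convolution:

```python
import numpy as np

def convolve_direct(f, g):
    """One-dimensional discrete convolution: s(n) = sum_m f(m) * g(n - m)."""
    N, M = len(f), len(g)
    s = np.zeros(N + M - 1)
    for n in range(len(s)):
        for m in range(N):
            if 0 <= n - m < M:   # only terms where g's index is valid
                s[n] += f[m] * g[n - m]
    return s
```

In practice a library routine such as `numpy.convolve` (or an FFT-based convolution) would be used for efficiency.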
In the embodiments of the application, RIR data obtained by simulation (for example, with the image method) is randomly masked during the data augmentation stage, which enhances the generalization of the speech recognition model, in particular its robustness in irregularly shaped rooms with obstructions and when the speaker faces away from the microphone. Data augmentation refers to expanding the original data through methods such as adding noise or reverberation.
The technical scheme provided by the embodiment of the application comprises the following steps:
Step one: generate RIR data by simulation, or use RIR data collected in a real scene. The RIR data can be generated with existing simulation tools, which allow efficient generation of room impulse responses.
The RIR data depends on parameters such as the distance and angle between the microphone and the speaker, the wall reflectivity, the room dimensions (length, width, and height), and the reverberation time. The wall reflectivity depends on the wall material and is a manually set parameter. The reverberation time is the time required for the sound to decay by 60 dB after the source stops, and can be estimated with the Sabine formula (an empirical formula). This simulation approach (e.g., pyroomacoustics) can only generate data for an unobstructed rectangular room environment.
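A common form of the Sabine empirical formula estimates the reverberation time as RT60 = 0.161·V/(α·S), where V is the room volume, S its total surface area, and α the average absorption coefficient. A hedged sketch (the constant 0.161 s/m and the averaged-α form are assumptions of this common textbook version, not taken from the application):

```python
def sabine_rt60(length, width, height, absorption):
    """Sabine's empirical formula: RT60 = 0.161 * V / (alpha * S), in seconds.
    length/width/height in metres; absorption is the average wall absorption
    coefficient (a manually set parameter, per the description)."""
    volume = length * width * height
    surface = 2.0 * (length * width + length * height + width * height)
    return 0.161 * volume / (absorption * surface)
```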
The method for generating the RIR data is not unique; the technical solution of this application does not depend on how the RIR is generated, and RIR data generated in any manner, or collected in a real scene, can be used directly.
For example, referring to fig. 1, an RIR sample is generated by simulation; the abscissa is time and the ordinate is the room impulse response amplitude (Amp) at that time. Fig. 1 shows an RIR of 8000 sampling points at a sampling rate of 16000, for a total duration of half a second. In step two, the RIR sample is convolved with the original speech data to obtain new reverberant speech data. The original speech data is speech collected with a high-fidelity microphone in an anechoic room without reverberation; it is the raw material to which the RIR adds reverberation. The new speech data contains the direct sound and the room reflections of the original speech.
Step two: the embodiment randomly masks the room reflected sound (the masked content is random and may include direct sound or reflected sound). Masking here means replacing valid values with zero (values other than zero may also be used), with the aim of reducing the model's dependence on part of the room reflected sound, so that it can recognize accurately even in the presence of an obstruction. Direct sound is the signal that travels from the source straight to the microphone; early reflected sound refers, for example, to reflections within 100 ms after the direct sound (this value is not fixed and can be chosen according to actual needs).
Specifically, the method comprises the following steps:
First, determine the masking duration: for example, draw a uniformly distributed random value s from 0 to 200; if s equals 200, then 200 consecutive sampling points of the RIR data are masked. At the 16 kHz audio sampling rate shown in fig. 1, 200 sampling points correspond to 12.5 ms. Denote the masking duration mask_len. Then determine the masking start position: for example, draw a uniformly distributed random value from 0 to 1000 (if the random number is 100, the 100th to 300th sampling points of the RIR are masked), denoted mask_start. The sampling points from mask_start to mask_start + mask_len are then masked, i.e. their data is replaced with zero; the masked RIR data is shown in fig. 2.
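The two random draws described above can be sketched as follows. The numeric bounds (duration uniform on 0 to 200 samples, start uniform on 0 to 1000 samples) come from the example in the text, and 16 kHz is the sampling rate used in fig. 1; the helper names are illustrative:

```python
import numpy as np

def draw_mask_period(rng, max_len=200, max_start=1000):
    """Draw the mask duration and start position uniformly, per the example bounds."""
    mask_len = int(rng.integers(0, max_len + 1))      # 0..200 samples
    mask_start = int(rng.integers(0, max_start + 1))  # 0..1000 samples
    return mask_start, mask_len

def samples_to_ms(n_samples, sample_rate=16000):
    """Convert a sample count to milliseconds (200 samples at 16 kHz = 12.5 ms)."""
    return 1000.0 * n_samples / sample_rate
```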
Then convolve the new RIR data with the original speech data to obtain new speech data. For example, if mask_len equals 100 and mask_start is 500, the reverberant room simulated by the new RIR has the reflections from 31.25 ms to 37.5 ms after the direct sound masked: at a 16 kHz sampling rate, 500/16000 = 0.03125 s (31.25 ms), 100/16000 = 0.00625 s (6.25 ms), and 0.03125 + 0.00625 = 0.0375 s (37.5 ms).
Finally, the new speech data is used for training of the speech recognition model.
Fig. 3 is a schematic flow chart of a speech recognition method according to an embodiment of the present application.
It should be noted that the technical solutions above are only examples, and the implementation is not unique. For instance, in the random masking of the RIR data, the embodiments may mask a continuous stretch of early reflected sound, or randomly mask several segments of the room impulse response (i.e., the RIR data in fig. 1) with different lengths; all of these fall within the scope of the embodiments. Likewise, the masked positions are replaced with zero here, but may instead be replaced with the average of the whole RIR data or with other random numbers; these too fall within the scope of the embodiments.
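The multi-segment variant can be sketched as follows (a hedged illustration: the segment count and length bound are arbitrary choices, and the fill value could equally be the RIR mean or a random number, per the text):

```python
import numpy as np

def mask_random_segments(rir, n_segments, rng, max_len=200, fill=0.0):
    """Randomly mask several segments of the RIR with different lengths."""
    masked = rir.copy()
    for _ in range(n_segments):
        seg_len = int(rng.integers(1, max_len + 1))
        # choose a start so the segment stays within the RIR
        seg_start = int(rng.integers(0, max(1, len(rir) - seg_len)))
        masked[seg_start:seg_start + seg_len] = fill
    return masked
```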
In summary, referring to fig. 4, a speech recognition method provided in the embodiment of the present application includes:
s101, masking predetermined room impulse response RIR data;
s102, convolving the masked RIR data with original voice data to obtain new voice data;
and S103, training a voice recognition model by using the new voice data.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, or is an average value of a part of the RIR data in the RIR data, or is a random number.
Optionally, the RIR data for the time period comprises one or more periods of RIR data in the RIR data.
Optionally, the starting position of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 1000.
Optionally, the duration of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 200.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
Referring to fig. 5, a computing device provided in this embodiment of the present application may be any kind of terminal device, such as an intelligent appliance (or a network device), and the apparatus includes:
a memory 11 for storing program instructions;
a processor 12 for calling the program instructions stored in the memory and executing, according to the obtained program:
masking predetermined room impulse response, RIR, data;
convolving the masked RIR data with original voice data to obtain new voice data;
and training a voice recognition model by using the new voice data.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, or is an average value of a part of the RIR data in the RIR data, or is a random number.
Optionally, the RIR data for the time period comprises one or more periods of RIR data in the RIR data.
Optionally, the starting position of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 1000.
Optionally, the duration of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 200.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
Referring to fig. 6, a speech recognition apparatus provided in this embodiment of the present application may be any kind of terminal device, such as an intelligent appliance (or a network device), for example, and the apparatus includes:
a first unit 21 for masking predetermined room impulse response RIR data;
a second unit 22, configured to convolve the masked RIR data with original voice data to obtain new voice data;
a third unit 23 for training a speech recognition model with the new speech data.
Optionally, masking the predetermined room impulse response RIR data specifically includes:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
Optionally, the preset value is zero, or is an average value of a part of the RIR data in the RIR data, or is a random number.
Optionally, the RIR data for the time period comprises one or more periods of RIR data in the RIR data.
Optionally, the starting position of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 1000.
Optionally, the duration of the time period comprises a random value within a preset range from 0. The predetermined range is, for example, 0 to 200.
Optionally, the RIR data is generated by simulation in advance, or acquired in a real scene.
It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application provides a computing device, which may specifically be a desktop computer, a portable computer, a smart phone, a tablet computer, a Personal Digital Assistant (PDA), and the like. The computing device may include a Central Processing Unit (CPU), memory, input/output devices, etc., the input devices may include a keyboard, mouse, touch screen, etc., and the output devices may include a Display device, such as a Liquid Crystal Display (LCD), a Cathode Ray Tube (CRT), etc.
The memory may include Read Only Memory (ROM) and Random Access Memory (RAM), and provides the processor with program instructions and data stored in the memory. In the embodiments of the present application, the memory may be used for storing a program of any one of the methods provided by the embodiments of the present application.
The processor is used for executing any one of the methods provided by the embodiment of the application according to the obtained program instructions by calling the program instructions stored in the memory.
Embodiments of the present application provide a computer storage medium for storing computer program instructions for an apparatus provided in the embodiments of the present application, which includes a program for executing any one of the methods provided in the embodiments of the present application.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
The technical scheme provided by the embodiment of the application can be applied to terminal equipment and network equipment.
The terminal device may also be referred to as user equipment (UE), a mobile station (MS), or a mobile terminal. Optionally, the terminal can communicate with one or more core networks via a radio access network (RAN). For example, the terminal may be an intelligent appliance, a mobile phone (or "cellular" phone), or a computer with mobility; it may also be a portable, pocket-sized, hand-held, computer-built-in, or vehicle-mounted mobile device.
A network device may be a base station (e.g., access point) that refers to a device in an access network that communicates over the air-interface, through one or more sectors, with wireless terminals. The base station may be configured to interconvert received air frames and IP packets as a router between the wireless terminal and the rest of the access network, which may include an Internet Protocol (IP) network. The base station may also coordinate management of attributes for the air interface. For example, the Base Station may be a Base Transceiver Station (BTS) in GSM or CDMA, a Base Station (NodeB) in WCDMA, an evolved Node B (NodeB or eNB or e-NodeB) in LTE, or a gNB in 5G system. The embodiments of the present application are not limited.
The above method process flow may be implemented by a software program, which may be stored in a storage medium, and when the stored software program is called, the above method steps are performed.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (10)
1. A method of speech recognition, the method comprising:
masking predetermined room impulse response, RIR, data;
convolving the masked RIR data with original voice data to obtain new voice data;
and training a voice recognition model by using the new voice data.
2. The method of claim 1, wherein masking predetermined Room Impulse Response (RIR) data comprises:
determining a time period for masking the RIR data;
and replacing the RIR data of the time period with a preset value.
3. The method of claim 2, wherein the preset value is zero, or is an average value of a part of the RIR data, or is a random number.
4. The method of claim 2, wherein the period of RIR data comprises one or more periods of RIR data in the RIR data.
5. The method of claim 2, wherein the start position of the time period comprises a random value within a preset range from 0.
6. The method of claim 2, wherein the duration of the time period comprises a random value within a preset range from 0.
7. The method of claim 1, wherein the RIR data is generated by pre-simulation or acquired in a real scene.
8. A speech recognition apparatus, comprising:
a first unit for masking predetermined room impulse response RIR data;
the second unit is used for convolving the masked RIR data with the original voice data to obtain new voice data;
a third unit for training a speech recognition model using the new speech data.
9. A computing device, comprising:
a memory for storing program instructions;
a processor for calling program instructions stored in said memory to perform the method of any of claims 1 to 7 in accordance with the obtained program.
10. A computer storage medium having stored thereon computer-executable instructions for causing a computer to perform the method of any one of claims 1 to 7.
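The augmentation pipeline of claims 1 to 6 (mask a random segment of the RIR, then convolve it with the original speech) can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the function names, the synthetic exponentially decaying RIR, and the masking ranges (`max_start_frac`, `max_len_frac`) are assumptions made for demonstration only.

```python
import numpy as np

def mask_rir(rir, rng, max_start_frac=0.5, max_len_frac=0.2, fill="zero"):
    """Mask one random time period of an RIR (claims 2-6).

    The start position and duration are random values drawn from preset
    ranges beginning at 0, and the masked samples are replaced with a
    preset value: zero, the mean of the RIR, or random numbers (claim 3).
    The fraction limits here are illustrative assumptions.
    """
    rir = rir.copy()
    n = len(rir)
    start = int(rng.integers(0, max(1, int(n * max_start_frac))))   # claim 5
    length = int(rng.integers(1, max(2, int(n * max_len_frac))))    # claim 6
    end = min(n, start + length)
    if fill == "zero":
        rir[start:end] = 0.0
    elif fill == "mean":
        rir[start:end] = rir.mean()
    else:  # small random values, scaled to the RIR's amplitude
        rir[start:end] = rng.standard_normal(end - start) * np.abs(rir).max() * 0.01
    return rir

def augment(speech, rir, rng, fill="zero"):
    """Convolve speech with the masked RIR to obtain new training data (claim 1)."""
    masked = mask_rir(rir, rng, fill=fill)
    # truncate the full convolution back to the original utterance length
    return np.convolve(speech, masked)[: len(speech)]

# Toy example: a synthetic exponentially decaying RIR and a sine "utterance".
rng = np.random.default_rng(0)
sr = 16000
rir = np.exp(-np.arange(sr // 4) / 800.0) * rng.standard_normal(sr // 4)
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
new_speech = augment(speech, rir, rng)
```

In practice the masked-RIR speech would be added to the training set alongside the clean data, so the recognition model sees a wider variety of simulated room conditions than the unmasked RIRs alone would provide.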
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110792834.0A CN113470628B (en) | 2021-07-14 | 2021-07-14 | Voice recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113470628A (en) | 2021-10-01 |
CN113470628B (en) | 2024-05-31 |
Family
ID=77880297
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110792834.0A Active CN113470628B (en) | 2021-07-14 | 2021-07-14 | Voice recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113470628B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5544249A (en) * | 1993-08-26 | 1996-08-06 | Akg Akustische U. Kino-Gerate Gesellschaft M.B.H. | Method of simulating a room and/or sound impression |
EP2028883A2 (en) * | 2007-08-22 | 2009-02-25 | Gwangju Institute of Science and Technology | Sound field generator and method of generating sound field using the same |
US20170316773A1 (en) * | 2015-01-20 | 2017-11-02 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Speech reproduction device configured for masking reproduced speech in a masked speech zone |
US20180253648A1 (en) * | 2017-03-01 | 2018-09-06 | Synaptics Inc | Connectionist temporal classification using segmented labeled sequence data |
CN108734138A (en) * | 2018-05-24 | 2018-11-02 | 浙江工业大学 | A kind of melanoma skin disease image classification method based on integrated study |
CN110379414A (en) * | 2019-07-22 | 2019-10-25 | 出门问问(苏州)信息科技有限公司 | Acoustic model enhances training method, device, readable storage medium storing program for executing and calculates equipment |
US10582299B1 (en) * | 2018-12-11 | 2020-03-03 | Amazon Technologies, Inc. | Modeling room acoustics using acoustic waves |
CN111159416A (en) * | 2020-04-02 | 2020-05-15 | 腾讯科技(深圳)有限公司 | Language task model training method and device, electronic equipment and storage medium |
CN111210802A (en) * | 2020-01-08 | 2020-05-29 | 厦门亿联网络技术股份有限公司 | Method and system for generating reverberation voice data |
CN112257521A (en) * | 2020-09-30 | 2021-01-22 | 中国人民解放军军事科学院国防科技创新研究院 | CNN underwater acoustic signal target identification method based on data enhancement and time-frequency separation |
CN112633171A (en) * | 2020-12-23 | 2021-04-09 | 北京恒达时讯科技股份有限公司 | Sea ice identification method and system based on multi-source optical remote sensing image |
CN112767927A (en) * | 2020-12-29 | 2021-05-07 | 平安科技(深圳)有限公司 | Method, device, terminal and storage medium for extracting voice features |
Non-Patent Citations (3)
Title |
---|
TIEMIN MEI: "Room Impulse Response Reshaping/Shortening Based on Least Mean Squares Optimization with Infinity Norm Constraint", IEEE, 31 December 2009 (2009-12-31), pages 1 - 6 * |
ZHONG-QIU WANG: "Robust Speaker Localization Guided by Deep Learning Based Time-Frequency Masking", IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, 31 December 2018 (2018-12-31), pages 1 - 11 * |
JIA HAIRONG: "Speech Enhancement Algorithm Based on Time-Frequency Masking of a Dual-Channel Neural Network", Journal of Huazhong University of Science and Technology, vol. 49, no. 6, 30 June 2021 (2021-06-30), pages 43 - 49 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8908875B2 (en) | Electronic device with digital reverberator and method | |
US9940922B1 (en) | Methods, systems, and computer readable media for utilizing ray-parameterized reverberation filters to facilitate interactive sound rendering | |
CN110809214B (en) | Audio playing method, audio playing device and terminal equipment | |
CN109961797B (en) | Echo cancellation method and device and electronic equipment | |
EP4121957A1 (en) | Encoding reverberator parameters from virtual or physical scene geometry and desired reverberation characteristics and rendering using these | |
CN105225674B (en) | A kind of audio signal processing method, device and mobile terminal | |
CN108391199B (en) | virtual sound image synthesis method, medium and terminal based on personalized reflected sound threshold | |
CN107301028B (en) | Audio data processing method and device based on multi-person remote call | |
CN112565981B (en) | Howling suppression method, howling suppression device, hearing aid, and storage medium | |
CN110493703A (en) | Stereo audio processing method, system and the storage medium of virtual spectators | |
WO2019072180A1 (en) | Method and apparatus for allocating resources to application | |
CN113170268A (en) | Method and device for detecting probability silent fault | |
WO2024027295A1 (en) | Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product | |
CN108549486A (en) | The method and device of explanation is realized in virtual scene | |
CN111385688A (en) | Active noise reduction method, device and system based on deep learning | |
WO2015062109A1 (en) | Method and device for evaluating network key performance indicator | |
CN112770063B (en) | Image generation method and device | |
CN112333608B (en) | Voice data processing method and related product | |
CN113470628B (en) | Voice recognition method and device | |
CN111158907B (en) | Data processing method and device, electronic equipment and storage medium | |
CN113132136B (en) | Satisfaction degree prediction model establishment method, satisfaction degree prediction device and electronic equipment | |
CN115273795B (en) | Method and device for generating simulated impulse response and computer equipment | |
CN106604144A (en) | Video processing method and device | |
CN109362027B (en) | Positioning method, device, equipment and storage medium | |
CN113936676A (en) | Sound adjusting method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||