CN115132177A - Speech recognition method, apparatus, device, storage medium and program product - Google Patents

Info

Publication number
CN115132177A
Authority
CN
China
Prior art keywords
voice
environment
speech
environmental
recognition
Prior art date
Legal status
Pending
Application number
CN202210764904.6A
Other languages
Chinese (zh)
Inventor
张晓明
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210764904.6A
Publication of CN115132177A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a speech recognition method, apparatus, device, storage medium, and program product, applicable to scenarios such as cloud technology, artificial intelligence, and intelligent transportation. The method comprises the following steps: acquiring a first speech sample and reference environment information of the first speech sample; extracting environmental features from the first speech sample using a feature extraction network in an environment prediction model, to obtain the environmental features of the first speech sample; predicting, with an environment prediction network in the environment prediction model, the predicted environment information of the first speech sample from those environmental features; and performing model optimization on the environment prediction model according to the predicted environment information and the reference environment information. After a speech recognition request is received, the optimized feature extraction network is called to extract the environmental features of the speech to be recognized carried in the request, and the speech is then recognized based on the extracted target environmental features, which improves the accuracy of speech recognition.

Description

Speech recognition method, apparatus, device, storage medium and program product
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition method, apparatus, device, storage medium, and program product.
Background
With the rapid development of artificial intelligence, speech recognition technology has made stage-wise breakthroughs, and speech recognition products built on it now provide speech recognition services for many scenarios in people's daily lives. These products have changed how people interact with computers: a user can open or close a program or switch a working interface simply by speaking, which undoubtedly brings great convenience to daily life. However, when conventional speech recognition methods are applied in such products, various environmental factors (such as environmental noise and environmental reverberation) prevent the products from producing highly accurate recognition results.
Disclosure of Invention
The embodiments of the present application provide a speech recognition method, apparatus, device, storage medium, and program product that can improve the accuracy of speech recognition results.
In one aspect, an embodiment of the present application provides a speech recognition method, including:
acquiring a first speech sample and reference environment information of the first speech sample, the reference environment information describing the acoustic environment in which the first speech sample was collected;
extracting environmental features from the first speech sample using a feature extraction network in an environment prediction model, to obtain the environmental features of the first speech sample;
predicting, with an environment prediction network in the environment prediction model, predicted environment information of the first speech sample from the environmental features of the first speech sample;
performing model optimization on the environment prediction model according to the predicted environment information and the reference environment information, to obtain an optimized environment prediction model, wherein the difference between the predicted environment information that the optimized environment prediction model produces for the first speech sample and the reference environment information is smaller than a first preset threshold, and the optimized environment prediction model comprises an optimized feature extraction network; and
after a speech recognition request is received, calling the optimized feature extraction network to extract the environmental features of the speech to be recognized carried in the request, so as to perform speech recognition on the speech to be recognized based on the extracted target environmental features.
In another aspect, an embodiment of the present application provides a speech recognition apparatus, including:
an acquisition unit, configured to acquire a first speech sample and reference environment information of the first speech sample, the reference environment information describing the acoustic environment in which the first speech sample was collected;
a feature extraction unit, configured to extract environmental features from the first speech sample using a feature extraction network in an environment prediction model, to obtain the environmental features of the first speech sample;
an environment prediction unit, configured to predict, with an environment prediction network in the environment prediction model, predicted environment information of the first speech sample from the environmental features of the first speech sample;
a model optimization unit, configured to perform model optimization on the environment prediction model according to the predicted environment information and the reference environment information, to obtain an optimized environment prediction model, wherein the difference between the predicted environment information that the optimized model produces for the first speech sample and the reference environment information is smaller than a first preset threshold, and the optimized model comprises an optimized feature extraction network; and
a speech recognition unit, configured to, after a speech recognition request is received, call the optimized feature extraction network to extract the environmental features of the speech to be recognized carried in the request, so as to perform speech recognition on the speech to be recognized based on the extracted target environmental features.
In another aspect, an embodiment of the present application provides a speech recognition device, including:
a processor adapted to load and execute one or more computer programs; and
a computer storage medium storing one or more computer programs adapted to be loaded by the processor to perform the speech recognition method according to the first aspect.
In yet another aspect, embodiments of the present application further provide a computer storage medium storing one or more computer programs adapted to be loaded by a processor to perform the speech recognition method according to the first aspect.
In yet another aspect, embodiments of the present application provide a computer program product comprising a computer program adapted to be loaded by a processor to perform the speech recognition method according to the first aspect.
In the embodiments of the present application, when the speech recognition device recognizes speech, it first uses the optimized feature extraction network to obtain the target environmental features of the speech to be recognized, which indicate the acoustic environment in which that speech was collected. The device then performs speech recognition based on those target environmental features. Because the device thereby has more information about the speech available during recognition, it can recognize the speech more accurately and obtain a recognition result of higher accuracy.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a speech recognition process provided in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a speech recognition method provided by an embodiment of the present application;
FIG. 3a is a schematic diagram of a near-field speech recognition scenario provided by an embodiment of the present application;
FIG. 3b is a schematic diagram of a far-field speech recognition scenario provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart of another speech recognition method provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an environment prediction model provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a speech recognition apparatus provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a speech recognition device provided in an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the methods provided by the embodiments of the present application, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. It should be noted that the embodiments described here are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present application.
With the vigorous development of artificial intelligence, speech recognition technology has also made great progress. That progress has produced a succession of speech recognition products, which are now widely used in daily life. An accurate speech recognition result clearly helps the product's user complete related tasks more conveniently and quickly, and in practice a product that delivers accurate results generally has high practical value and can bring economic benefits to its developers. To obtain speech recognition results of higher accuracy, the embodiments of the present application combine artificial intelligence techniques into a speech recognition scheme and, on that basis, provide several concrete and feasible speech recognition methods.
To make the specific implementations of the embodiments easier to follow, artificial intelligence technology is first described below.
Artificial Intelligence (AI) is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence. It studies the design principles and implementation methods of such machines, so that they can perceive, reason, and make decisions. Concretely, AI techniques use digital computers, or machines controlled by digital computers, to simulate and extend human intelligence, so that the computer or machine can perceive its environment and acquire knowledge; theories, methods, techniques, and applications that use that knowledge to achieve the best results can then be built on this foundation. In practice, AI spans a wide range of fields, at both the hardware and software levels. AI hardware technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics; AI software technologies typically include computer vision, speech processing, natural language processing, and machine learning/deep learning.
With continuing research and progress, AI technology has been studied and applied in many fields, such as smart homes, intelligent wearable devices, virtual assistants, smart speakers, smart marketing, unmanned and autonomous driving, drones, robots, smart healthcare, and smart customer service; as the technology develops it will be applied in still more fields and play an increasingly important role. The embodiments of the present application mainly use the speech processing and machine learning parts of AI technology, which are briefly introduced below to make the implementations easier to follow.
The key technologies of speech processing (Speech Technology) are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and speech is expected to become one of its best modalities. Speech recognition is the most widely used speech processing technology: it lets a computer automatically recognize the phonemes, syllables, or words of a speech signal, and it is the basis of automatic speech control.
Machine Learning (ML) and deep learning are multi-field interdisciplinary subjects, involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. Machine learning studies how computers can simulate or implement human learning behavior: based on it, a computer can continuously acquire new knowledge or skills and reorganize its existing knowledge structure to keep improving its performance, achieving better intelligent processing effects (e.g., in image recognition, text translation, and speech generation). Machine learning is thus the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications pervade every area of AI. In practice, machine learning and deep learning typically include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from instruction, and the like.
Based on the above description of artificial intelligence, the embodiments of the present application provide a speech recognition scheme, executable by a speech recognition device and built on speech processing and deep learning, to obtain recognition results of higher accuracy. The scheme rests on the following observation: the acoustic environment in which a speech signal is produced strongly affects the signal's quality (such as its clarity and completeness), and the higher the quality of the signal, the easier it is for the speech recognition device to recognize and the more accurate the resulting recognition. Therefore, while recognizing a speech signal, the speech recognition device uses information about the acoustic environment in which the signal was produced, so that it has more accurate information about the signal and can finally obtain a highly accurate recognition result. The main question studied in the scheme is how to obtain the environmental features of a speech signal efficiently. The general principle of the scheme is described below together with the way those environmental features are obtained.
Specifically, the execution flow of the scheme is shown in FIG. 1, and its general principle is as follows. Before any speech signal is recognized, the environment prediction model is optimized using speech samples (e.g., the first speech sample mentioned later). The environment prediction model contains a feature extraction network, used mainly to extract environmental features from a speech sample; the optimized environment prediction model accordingly contains an optimized feature extraction network. Once a speech signal requiring recognition is obtained, the optimized feature extraction network extracts that signal's environmental features, which capture the acoustic environment in which the signal was produced; the extracted features and the speech signal together serve as input to the speech recognition device, so that recognition is performed based on the environmental features.
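The patent describes this two-part model only in prose; the following PyTorch sketch is one minimal way such a structure (a feature extraction network feeding an environment prediction network) could look. All layer sizes, names, and the choice of PyTorch are illustrative assumptions, not the patent's specification.

```python
# Minimal sketch of the environment prediction model described above:
# a feature extraction network producing environmental features, and an
# environment prediction (classification) network consuming them.
import torch
import torch.nn as nn

class FeatureExtractionNetwork(nn.Module):
    """Maps a speech sample representation to an environmental feature vector."""
    def __init__(self, input_dim=80, feat_dim=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim), nn.ReLU(),
        )

    def forward(self, x):                # x: (batch, input_dim)
        return self.layers(x)            # environmental features: (batch, feat_dim)

class EnvironmentPredictionModel(nn.Module):
    """Feature extraction network + environment prediction network."""
    def __init__(self, input_dim=80, feat_dim=128, num_room_types=6):
        super().__init__()
        self.feature_extractor = FeatureExtractionNetwork(input_dim, feat_dim)
        self.env_predictor = nn.Linear(feat_dim, num_room_types)  # one logit per room type

    def forward(self, x):
        env_features = self.feature_extractor(x)
        logits = self.env_predictor(env_features)
        return env_features, logits
```

After optimization, only `feature_extractor` would need to be kept for recognition, which matches the scheme's use of a partial model structure described later.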
In the embodiments of the present application, when the speech recognition device recognizes a speech signal, it also obtains the environmental features of the acoustic environment in which the signal was produced, so the device learns about the signal along a dimension other than the acoustic one (namely the production environment), can recognize the signal more accurately, and obtains a more accurate result. To verify the accuracy of the results obtained with this solution, the relevant technical personnel compared the speech recognition scheme of the embodiments against other existing schemes. In the tests, the test data were speech signals from different home scenarios (including smart televisions, smart speakers, and the like), and WER (Word Error Rate) was used as the quantitative metric. The experiments showed that the word error rate of the proposed scheme was 5-10% lower, relatively, than that of the other schemes. It should be noted that in practice the size of the reduction also depends on the experimental data used; the relative reduction may take other values, and the 5-10% figure is only one exemplary result.
In addition, it is worth mentioning that the speech recognition device in the embodiments of the present application may be a terminal device, a server, or an electronic device with a speech recognition function formed by a terminal device and a server together; the embodiments place no limit on this. Specifically, the terminal device may include, but is not limited to: a smartphone, tablet computer, notebook computer, desktop computer, vehicle-mounted terminal, intelligent voice interaction device, smart home appliance (such as a smart refrigerator, smart television, smart speaker, or smart lamp), aircraft, and the like. In an embodiment, various applications (APPs) and/or clients may also run on the terminal device, such as a multimedia playback client, social client, browser client, information-feed client, education client, or image processing client. Further, the server may include, but is not limited to: an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data platforms, and artificial intelligence platforms.
Based on the above description of the speech recognition scheme, the embodiments of the present application provide a concrete speech recognition method. FIG. 2 is a schematic flow chart of a speech recognition method according to an embodiment of the present application; in practice the method can be performed by the speech recognition device described above. As shown in FIG. 2, the method comprises steps S201 to S205:
s201, obtaining a first voice sample and reference environment information of the first voice sample.
In an embodiment of the present application, the first speech samples may be at least one, and each of the first speech samples may include a speech signal that can be recognized by the speech recognition device as a text sequence. In practical applications, the first speech sample may be obtained by the speech recognition device from a far-field speech recognition scenario, or may be obtained by the speech recognition device from a near-field speech recognition scenario. In practical applications, when speech recognition is performed on a speech signal (such as a subsequently mentioned speech command) generated in a far-field speech recognition scene, the speech recognition device is usually affected by factors such as environmental reverberation, environmental noise, and speech energy attenuation, so that the difficulty of performing speech recognition on the speech signal generated in the far-field speech recognition scene by the speech recognition device is greater than the difficulty of performing speech recognition on the speech signal generated in a near-field speech recognition scene by the speech recognition device. Then, in order to improve the accuracy of the speech recognition method provided in the embodiment of the present application in various speech recognition scenarios, especially in a far-field speech recognition scenario, the first speech sample in the embodiment of the present application may preferentially adopt the speech signal acquired in the far-field speech recognition scenario.
Near-field speech recognition means recognizing a speech signal collected when the sound source (the origin of the sound, such as the speaker's mouth) is close to the speech receiver. An example is the scenario of FIG. 3a, where the terminal device marked 301 is the speech receiver and the part marked 302 is the sound source: the target object speaks a voice command to the terminal device at close range, and the device recognizes it. A terminal device here is any device offering speech recognition interaction, such as a smart car, smart television, or smartphone equipped with a speech recognition service; the device's user can obtain corresponding services (navigation, video search, media playback control, and so on) by issuing a voice command (e.g., "open application XX"). Correspondingly, far-field speech recognition means recognizing a speech signal collected when the sound source is far from the speech receiver. An example is the scenario of FIG. 3b, where the robot marked 311 is the speech receiver and the speaker's mouth marked 312 is the sound source: the target object issues the voice command to the terminal device from a distance, and the device recognizes it.
In addition, in the embodiments of the present application, the first speech sample may include a speech signal sample together with the collection environment information of that sample, or it may include only the speech signal sample. When the first speech sample includes both, the reference environment information of the first speech sample can be obtained by the speech recognition device by parsing the sample; for example, the device may directly use the parsed collection environment information as the reference environment information. For ease of illustration, the case where the first speech sample includes a speech signal sample together with its collection environment information is used as the example in the detailed description below.
In practice, the reference environment information may describe the acoustic environment in which the speech signal sample in the first speech sample was collected, along any one or more of the following description dimensions: noise type, sound source distance, and environmental reverberation. Noise types may, for example, be divided by the physical nature of the source, in which case they include aerodynamic noise, mechanical noise, electromagnetic noise, and so on. Alternatively, they may be divided by the temporal behavior of the source, in which case they include stationary noise (whose sound pressure level varies little over time), non-stationary noise (whose intensity fluctuates over time, with sound pressure typically varying by more than 3 dB), impulse noise (single or repeated bursts each lasting less than 1 s), and so on. Of course, in other embodiments noise types may instead be divided by the frequency distribution of the noise or by other rules (such as loudness or the device producing the noise); the present application places no limit on this. The environmental reverberation may specifically be the reverberation time T60, i.e., the time it takes the sound pressure to decay by 60 dB after the source stops sounding.
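One way to picture the reference environment information is as a small record along the description dimensions named above. The field names and the dataclass form below are illustrative assumptions, not the patent's data format.

```python
# A possible record of reference environment information along the
# dimensions named above: noise type, sound source distance, and
# environmental reverberation (T60).
from dataclasses import dataclass

@dataclass
class ReferenceEnvironmentInfo:
    noise_type: str           # e.g. "stationary", "non-stationary", "impulse"
    source_distance_m: float  # distance from sound source to receiver, meters
    t60_s: float              # reverberation time T60, seconds

info = ReferenceEnvironmentInfo(noise_type="stationary",
                                source_distance_m=1.5,
                                t60_s=0.1)
```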
S202, extracting environmental features from the first speech sample using the feature extraction network in the environment prediction model, to obtain the environmental features of the first speech sample.
In the embodiments of the present application, the environment prediction model is mainly used to determine the environment information of the first speech sample at collection time (for example, the type of room in which the sample was collected), and that information may likewise be described along one or more of the description dimensions mentioned in step S201. In practice the environment prediction model may be any neural network classification model that contains a feature extraction network; the feature extraction network extracts the environmental features of the first speech sample, from which the model then predicts the sample's environment information (the predicted environment information mentioned below). It should be noted that this environment information is predicted by the model and is not necessarily identical to the acoustic environment indicated by the sample's reference environment information; whether the prediction is accurate depends mainly on whether the extracted environmental features are accurate. So that the model can extract accurate environmental features, the speech recognition device performs model optimization on it.
In addition, in the embodiments of the present application, and to ease the optimization of the relevant devices or models in practice, the environment information of the first speech sample (e.g., the reference and predicted environment information) can be described by a room type. A room type can be understood as the acoustic environment created by the room in which the relevant device or person was located when the first speech sample was collected; for example, an environment with stationary noise, a sound source distance of 1.5 meters, and a reverberation time T60 of 100 ms might be labeled as the first type. It is worth noting that the description dimensions used for the acoustic environments of all room types may be the same: if the first type is defined by noise type and sound source distance, the other room types (the second type, the third type, and so on) are also defined along those two dimensions. This helps the speech recognition device generate the corresponding environmental features quickly, improving the recognition speed to some extent.
The division of room types in the embodiments of the present application is illustrated below with a concrete example, in which room types are defined by reverberation time T60 and sound source distance; see the sketch after this paragraph. T60 is first subdivided into sub-categories: the T60_A sub-category for T60 of at most 0.5 seconds, the T60_B sub-category for T60 between 0.5 and 0.9 seconds, and the T60_C sub-category for T60 above 0.9 seconds. Sound source distance is likewise subdivided, e.g., into a far sub-category (more than 1 meter) and a near sub-category (less than 1 meter). Any room type is then a combination of one T60 sub-category and one distance sub-category, giving 6 room types in this example: T60_A_near, T60_A_far, T60_B_near, T60_B_far, T60_C_near, and T60_C_far.
S203, predicting, with the environment prediction network in the environment prediction model, the predicted environment information of the first speech sample from the environmental features of the first speech sample.
In practice, the environment prediction model further includes an environment prediction network, whose main job is to predict the environment information of the first speech sample from the environmental features extracted by the feature extraction network, yielding the predicted environment information. The environment prediction network is essentially a classification network; in the embodiments of the present application its role can be understood as determining the type of room in which the first speech sample was collected. Specifically, the predicted environment information may be determined as follows: the speech recognition device uses the environment prediction network to produce an environment prediction result from the sample's environmental features, where the result contains at least one prediction probability, one per room type. That is, in the embodiments of the present application, the number of room types equals the number of prediction probabilities in the environment prediction result, and each prediction probability indicates the probability that the first speech sample was collected in an acoustic environment of the corresponding room type. The predicted environment information may then be, for example, the environment information of the room type with the highest prediction probability in the environment prediction result.
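Turning the network's output into predicted environment information can then be a softmax followed by an argmax over room types. A sketch reusing the illustrative `EnvironmentPredictionModel` and room-type labels from the earlier code; everything but the probability-per-room-type idea is an assumption.

```python
# One prediction probability per room type; the predicted environment
# information is the room type with the highest probability.
import torch
import torch.nn.functional as F

ROOM_TYPES = ["T60_A_near", "T60_A_far", "T60_B_near",
              "T60_B_far", "T60_C_near", "T60_C_far"]

def predict_environment(model, speech_features):
    _, logits = model(speech_features)        # (batch, num_room_types)
    probs = F.softmax(logits, dim=-1)         # environment prediction result
    best = probs.argmax(dim=-1)               # highest prediction probability
    return [ROOM_TYPES[i] for i in best.tolist()], probs
```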
S204, performing model optimization on the environment prediction model according to the predicted environment information and the reference environment information, to obtain an optimized environment prediction model, where the optimized environment prediction model comprises an optimized feature extraction network.
In specific embodiments, the predicted and reference environment information are not necessarily the same, and the difference between them can be used to measure the performance of the environment prediction model, i.e., the accuracy of the environment information it predicts. The smaller the difference, the more similar the two, and the more accurate the model's predictions. Because the predicted environment information is produced by the environment prediction network from the features delivered by the feature extraction network, a more accurate prediction also means that the feature extraction network extracts more accurate environmental features; and highly accurate environmental features in turn help improve the accuracy of the final speech recognition result in the embodiments of the present application. The goal of model optimization can therefore be to make the predicted environment information agree with the reference environment information as far as possible (for example, until their similarity exceeds a similarity threshold). On this basis, the speech recognition device may optimize the environment prediction model in the direction that reduces the difference between predicted and reference environment information, until the difference between the predicted environment information that the optimized model produces for the first speech sample and the sample's reference environment information is smaller than a first preset threshold.
Because the environment prediction model includes the feature extraction network, the optimized model naturally includes an optimized feature extraction network, which can extract environmental features of the first speech sample with higher accuracy. Concretely, when optimizing the model, the speech recognition device may adjust the model parameters involved in predicting the environment information, so that the predictions made with the adjusted parameters move closer to the reference environment information. Model parameters can be understood as the internal configuration variables of the environment prediction model: they define what the model computes, can be estimated or learned from data (e.g., learned from the first speech sample and its corresponding reference environment information), and can be saved. In the embodiments of the present application the model parameters may include, for example, the weights of each network in the environment prediction model (such as the weights of the features extracted by each hidden layer in the feature extraction network) and the coefficients of a linear or logistic regression (such as coefficients in a loss function). For example, when optimizing the environment prediction model, a cross-entropy loss function can be used.
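A minimal optimization step matching this description might look as follows, reusing the illustrative model and labels from the earlier sketches. The cross-entropy loss is named by the text; the optimizer choice and learning rate are assumptions.

```python
# One optimization step: reduce the difference between predicted and
# reference environment information with a cross-entropy loss.
import torch
import torch.nn as nn

model = EnvironmentPredictionModel(num_room_types=len(ROOM_TYPES))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def optimization_step(batch_features, reference_room_ids):
    _, logits = model(batch_features)
    loss = criterion(logits, reference_room_ids)  # predicted vs. reference info
    optimizer.zero_grad()
    loss.backward()   # adjust model parameters toward a smaller difference
    optimizer.step()
    return loss.item()
```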
S205, after receiving a speech recognition request, calling the optimized feature extraction network to extract the environmental features of the speech to be recognized carried in the request, so as to perform speech recognition on the speech to be recognized based on the extracted target environmental features.
In practice, the speech recognition device triggers the execution of a speech recognition task after receiving a speech recognition request, where the request carries the speech signal requiring recognition (i.e., the speech to be recognized). The target environmental features are the environmental features of the speech to be recognized extracted by the device; they represent the acoustic environment in which that speech was collected, so that the device can recognize the speech according to them. The target environmental features may, for example, characterize environment information along one or more description dimensions, such as environmental reverberation and sound source distance.
It can be understood that the speech to be recognized is usually a signal collected in a real recognition scenario once the method is in use, and it therefore typically differs substantially from the first speech samples used during model optimization, in particular in the acoustic environment in which the signal is collected. If accurate predicted environment information were wanted at recognition time, both the feature extraction network and the environment prediction network would have to be optimized with a large number of first speech samples collected in different acoustic environments, which would undoubtedly raise the development cost of speech recognition products. In the embodiments of the present application, the predicted environment information describes the acoustic environment of a speech signal at collection and is itself predicted from the signal's environmental features, so the environmental features alone can already represent that acoustic environment effectively.
Accordingly, the embodiments of the present application represent the acoustic environment of the speech signal preferentially by its environmental features. This barely affects the accuracy of the recognition result, but it effectively reduces the number of first speech samples that must be collected, lowering the development cost of related speech recognition products and speeding up model optimization, and hence product development, to a certain extent. Moreover, to improve recognition speed while keeping accuracy high, in a specific implementation the speech recognition device may use only the optimized feature extraction network from the optimized environment prediction model to obtain the acoustic environment of the speech to be recognized at collection (equivalently, the device is built around the optimized feature extraction network), without using the model's other network structures. Obtaining the environment information with only part of the environment prediction model structure in this way simplifies the equipment used in the overall recognition process, reduces the overall coupling of the speech recognition device, and thereby improves its robustness.
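At recognition time the flow then reduces to the sketch below. Only the optimized feature extraction network from the earlier sketches is kept; `asr_model` is a hypothetical stand-in for whichever recognizer consumes the speech together with its target environmental features, since the patent does not specify one here.

```python
# Inference with only the optimized feature extraction network retained;
# the environment prediction network is not needed at this stage.
import torch

@torch.no_grad()
def recognize(speech_to_recognize, feature_extractor, asr_model):
    target_env_features = feature_extractor(speech_to_recognize)
    # The speech and its environmental features together form the
    # recognizer's input data.
    return asr_model(speech_to_recognize, target_env_features)
```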
In the embodiments of the present application, when recognizing the speech to be recognized, the speech recognition device uses the target environmental features of the acoustic environment in which that speech was collected, so the recognition is based on more complete information and its accuracy improves. The target environmental features are obtained with the optimized feature extraction network, which the device optimized with the first speech samples using deep learning, so the network has strong feature extraction and feature expression capabilities. Accurate environmental features can therefore be extracted with it, and on the basis of those highly accurate features the device can produce an accurate speech recognition result.
Based on the speech recognition scheme above and the method shown in FIG. 2, the embodiments of the present application provide another speech recognition method, whose flow is shown schematically in FIG. 4. In practice this method can also be performed by the speech recognition device mentioned above; specifically, the device may be configured to perform steps S401 to S407:
s401, a first voice sample and reference environment information of the first voice sample are obtained, wherein the first voice sample comprises a plurality of voice signal frames.
In a specific embodiment, the specific implementation manner of the voice recognition device acquiring the first voice sample and the corresponding reference environment information may refer to the related description of step S201, which is not described herein again in this embodiment of the present application. The following mainly describes the definition of the speech signal frames and the manner of acquiring the speech signal frames in the embodiment of the present application in detail, and it should be noted that, when there are a plurality of first speech samples, because the time lengths of the respective first speech samples are different, the number of speech signal frames acquired by the speech recognition device according to the first speech samples with different time lengths is also different.
Wherein the speech signal frame can be obtained by sampling the first speech sample. Illustratively, a speech signal frame may be understood as a segment of a speech signal of a predetermined duration (e.g., 25 ms). And, illustratively, the first speech sample comprises a plurality of speech signal frames, each of which may have the same frame length, i.e.: the preset durations referred to by the speech recognition device in acquiring each speech signal frame may be the same. In an implementation manner of the embodiment of the present application, in order to enable smooth transition between each speech signal frame and maintain continuity of each speech signal frame, the speech recognition device performs framing on the first speech sample by using an overlapping and segmenting method, so as to ensure that two adjacent speech signal frames can overlap with each other by a portion, thereby avoiding incompleteness of a speech signal. In the process of framing the first speech sample, the time difference between the start positions of two adjacent speech signal frames is called frame shift, and in the embodiment of the present application, the frame shift may be exemplarily 10 ms.
In another implementation manner of the embodiment of the present application, in order to reduce the data size of the input data of the speech recognition device, the workload of the speech recognition device is reduced. In the embodiment of the present application, when the frame division processing is performed on the first speech sample, the frame shift may also be set to 0ms, that is, there may be no overlapping portion between the speech signal frames of the first speech sample. In this case, for example, the speech recognition device may sample a plurality of speech signal frames according to a fixed sampling rate (e.g. 16K sampling rate, i.e. 16000 values are collected in the first speech sample in 1S), and in this case, the frame length of one speech signal frame may be 1S, that is: the speech recognition device generates a speech signal frame based on all values collected within 1S.
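A framing sketch for the overlapping scheme described above, using the 25 ms frame length, 10 ms frame shift, and 16 kHz sampling rate named in the text. The helper itself is an assumption; for non-overlapping frames one would simply set the shift equal to the frame length.

```python
# Overlapping framing: 25 ms frames with a 10 ms shift at 16 kHz.
import numpy as np

def frame_signal(samples: np.ndarray, sr=16000, frame_ms=25, shift_ms=10):
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sr * shift_ms / 1000)       # 160 samples -> adjacent frames overlap
    n_frames = 1 + max(0, (len(samples) - frame_len) // shift)
    return np.stack([samples[i * shift : i * shift + frame_len]
                     for i in range(n_frames)])

speech = np.random.randn(16000)   # 1 s of audio at 16 kHz
frames = frame_signal(speech)     # overlapping 25 ms frames
print(frames.shape)               # (98, 400)
```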
S402, sampling the plurality of speech signal frames according to a preset step size to obtain at least one speech segment.
In a specific implementation, when recognizing a speech signal the device could process each speech signal frame separately (for example, acoustic feature extraction or word-vector feature extraction) and determine the recognition result from the per-frame processing results. During the development of the embodiments of the present application, however, the relevant personnel found that recognition results generated from the processing results of individual frames are less accurate than results the device generates from the processing results of speech segments, where each segment contains several temporally consecutive frames. Further investigation showed the reason: when the device processes segments, it effectively exploits the context information between the frames, so results based on per-segment processing can be more accurate.
Based on this finding, in the embodiments of the present application, after obtaining the speech signal frames of the first speech sample, the device samples them according to a preset step size to obtain at least one speech segment, each containing N consecutive frames (N a positive integer, e.g., N = 3). The preset step size can be understood as the number of frames of context the device consults when processing a given frame: a step size of 2 means that when processing any frame, the device draws context from the 2 frames adjacent to it.
The device then processes each speech segment, thereby processing the first speech sample, so that it makes full use of the context between frames and improves the accuracy of the recognition result. In addition, optionally, two consecutive speech segments may share overlapping speech signal frames.
The way segments are obtained is illustrated below with a concrete example; see the sketch after this paragraph. Suppose the preset step size is 2 and the first speech signal contains 5 consecutive frames. As described above, a step size of 2 means each segment the device acquires contains 3 frames that are consecutive in time and can be combined into a continuous signal. The device may then form one segment from frames 1 to 3 (providing the context of frame 1), another from frames 2 to 4 (providing the context of frame 2), and another from frames 3 to 5.
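The example maps directly onto a small sliding-window helper. The reading that a step size of 2 yields 3-frame segments follows the worked example above; the function itself is an assumption.

```python
# Segment sampling: step 2 -> segments of 3 consecutive frames, with
# adjacent segments overlapping, exactly as in the 5-frame example.
def sample_segments(frames, step=2):
    seg_len = step + 1   # frames per segment
    return [frames[i : i + seg_len] for i in range(len(frames) - seg_len + 1)]

segments = sample_segments([1, 2, 3, 4, 5], step=2)
print(segments)          # [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
```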
S403, extracting the environmental features of each speech segment by using the feature extraction network to obtain the environmental features of each speech segment.
In the embodiment of the present application, environmental features may be extracted from each speech segment. The feature extraction network used to extract the environmental features of the speech segments may comprise at least one sub-network, and each sub-network may be used to extract environmental features in one dimension. By way of example, embodiments of the present application may include, but are not limited to, environmental features in one or more of the following dimensions: noise type, noise level, sound source distance, ambient reverberation, and the like. In this case, the speech recognition device may extract the environmental features of any speech segment as follows: the device uses each sub-network in turn to extract the environmental features of the speech segment in the corresponding dimension. On this basis, the device obtains the environmental features of the speech segment in at least one dimension, and may then perform feature fusion on these environmental features to obtain the environmental feature of the speech segment.
For example, assuming that the feature extraction network includes a first sub-network used to extract environmental features of a speech segment in the noise type dimension (i.e., features indicating the type of noise present in the segment), when the speech recognition device performs feature extraction on the segment with the first sub-network, the resulting features are the environmental features in the noise type dimension. On this basis, if the feature extraction network contains A sub-networks, the environmental features in at least one dimension mentioned in the embodiment of the present application are specifically environmental features in A dimensions. Furthermore, the speech recognition device may, for example, concatenate the environmental features across these dimensions to obtain the environmental feature of the speech segment, as in the sketch below.
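As a hedged illustration of the per-dimension sub-networks described above, the sketch below stands in for each sub-network with a single random linear map followed by tanh; the sub-network count, dimensions, and nonlinearity are illustrative assumptions, not the patented architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_subnetwork(in_dim: int, out_dim: int):
    """Stand-in for one sub-network: a random linear map plus tanh.
    In the embodiment each branch would be trained for one dimension
    (noise type, noise level, source distance, reverberation, ...)."""
    w = rng.standard_normal((in_dim, out_dim)).astype(np.float32)
    return lambda x: np.tanh(x @ w)

segment = rng.standard_normal((3, 16)).astype(np.float32)  # 3 frames, 16-dim each
flat = segment.reshape(-1)                                  # flatten to (48,)

subnets = [make_subnetwork(48, 8) for _ in range(4)]  # A = 4 sub-networks
per_dim = [net(flat) for net in subnets]              # one feature per dimension
segment_feature = np.concatenate(per_dim)             # fused by concatenation
print(segment_feature.shape)                          # (32,)
```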
S404, performing feature fusion on the environment features corresponding to the at least one voice segment to obtain the environment features of the first voice sample.
The voice recognition device can perform feature fusion on the environmental features corresponding to the at least one voice segment by using a feature extraction network to obtain the environmental features of the first voice sample.
In one implementation of the embodiment of the present application, the feature extraction network used for extracting the environmental features may include M hidden layers, where M may be a positive integer. Each hidden layer may be composed of a number of neurons, and the number of neurons may differ between layers. In practical applications, a hidden layer mainly abstracts its input data (for example, the input of the first hidden layer is a speech segment) into another dimensional space to expose more abstract features, which are more linearly separable and have deeper expressive capability. The (i+1)-th hidden layer may further abstract the data produced by the i-th hidden layer to obtain environmental features with stronger expressive capability than the output of the i-th layer (i is a non-negative integer, and i is smaller than M). For example, the environmental feature of each speech segment may be extracted by the 1st hidden layer; in this case, the speech recognition device may fuse the segment-level environmental features by applying the 2nd hidden layer to the output of the 1st. Specifically, one neuron in the 2nd hidden layer may extract features from the environmental features of several consecutive speech segments in the 1st hidden layer, one neuron in the 3rd hidden layer may further extract features from the outputs of several neurons in the 2nd hidden layer, and so on, so that the outputs and the context information between them in each preceding layer are continuously extracted and abstracted, gradually producing more accurate environmental features; a minimal sketch follows below. Illustratively, the M-th hidden layer may contain a single neuron, in which case the speech recognition device may directly use the output of the M-th hidden layer as the environmental feature of the first speech sample. It is worth mentioning that in practical applications the weight matrices of the hidden layers may be identical, i.e., weights may be shared among the hidden layers, which reduces the number of model parameters to be optimized in the environment prediction model and thereby accelerates model convergence.
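The layer-by-layer context aggregation described above can be sketched as follows; for simplicity the sketch gives every hidden layer the same width so that one weight matrix can be shared across depths, which is an assumption of the sketch rather than a requirement of the embodiment.

```python
import numpy as np

rng = np.random.default_rng(1)

def context_layer(h: np.ndarray, w: np.ndarray, ctx: int = 3) -> np.ndarray:
    """One hidden layer: each output neuron aggregates ctx consecutive
    feature vectors from the layer below, so context widens with depth."""
    t = len(h) - ctx + 1
    stacked = np.stack([h[i:i + ctx].reshape(-1) for i in range(t)])
    return np.tanh(stacked @ w)

d = 8
h = rng.standard_normal((9, d)).astype(np.float32)       # 9 segment-level features
w = rng.standard_normal((3 * d, d)).astype(np.float32)   # shared weight matrix

# Weight sharing: the same matrix w is reused at every depth, mirroring
# the parameter-sharing option mentioned above.
while len(h) > 1:
    h = context_layer(h, w)   # 9 -> 7 -> 5 -> 3 -> 1 feature vectors
print(h.shape)                # (1, 8): one vector for the whole sample
```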
In another implementation of the embodiment of the present application, after obtaining the environmental features of each speech segment, the speech recognition device may average the environmental features corresponding to the at least one speech segment to obtain a first sub-environment feature. In other words, the first sub-environment feature is the average of all environmental features obtained from the at least one speech segment, and the number of speech segments equals the number of segment-level environmental features; that is, when the at least one speech segment is specifically B speech segments, the environmental features corresponding to the at least one speech segment are the B environmental features of those B segments. For example, the speech recognition device may calculate the first sub-environment feature as shown in Equation 1, where T denotes the number of segment-level environmental features, h_t denotes the environmental feature of the t-th speech segment, and μ denotes the first sub-environment feature:

\mu = \frac{1}{T}\sum_{t=1}^{T} h_t    (Equation 1)
In addition, the speech recognition device may perform a standard deviation operation on the environmental features corresponding to the at least one speech segment to obtain a second sub-environment feature. After obtaining the first and second sub-environment features, the speech recognition device may perform feature splicing on them to obtain the environmental feature of the first speech sample. For example, the second sub-environment feature may be calculated as shown in Equation 2, where σ denotes the second sub-environment feature, ⊙ denotes the Hadamard product, and μ denotes the first sub-environment feature:

\sigma = \sqrt{\frac{1}{T}\sum_{t=1}^{T} h_t \odot h_t - \mu \odot \mu}    (Equation 2)
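A minimal sketch of Equations 1 and 2, assuming the segment-level environmental features are stacked in a NumPy array of shape (T, d):

```python
import numpy as np

def statistics_pooling(h: np.ndarray) -> np.ndarray:
    """Concatenate the mean (Equation 1) and the standard deviation
    (Equation 2) of the T segment-level features h of shape (T, d)."""
    mu = h.mean(axis=0)                       # Equation 1
    var = (h * h).mean(axis=0) - mu * mu      # E[h ⊙ h] - μ ⊙ μ
    sigma = np.sqrt(np.clip(var, 0.0, None))  # Equation 2 (clip guards float error)
    return np.concatenate([mu, sigma])        # fixed-length vector of size 2d

h = np.random.default_rng(2).standard_normal((6, 8)).astype(np.float32)
print(statistics_pooling(h).shape)  # (16,)
```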
In a particular embodiment, the structure of the environmental prediction model may be as shown in FIG. 5. The hidden layers and the environmental feature extraction layer marked in FIG. 5 may serve as the feature extraction network in the environmental prediction model. In practical applications, the environmental feature extraction layer may essentially be a fully-connected layer used to map the environmental features of the first speech sample to one-dimensional features, which can be used to predict the probability that the first speech sample belongs to each room type. Optionally, the network marked 51 in FIG. 5 may also be a fully-connected layer; in this case the feature extraction network mainly consists of several hidden layers and two fully-connected layers, where the first fully-connected layer mainly maps the environmental features of the first speech sample into a higher-dimensional vector space to enhance the characterization capability of the network, and the second fully-connected layer, while capable of the same, mainly maps the output of the first fully-connected layer to the corresponding one-dimensional features. In this case, the environmental feature of the first speech sample may be the output of either the first or the second fully-connected layer, which is not limited in this embodiment. It is worth mentioning that using two fully-connected layers can effectively enhance the feature characterization capability of the feature extraction network, thereby improving the accuracy of the environmental features obtained by the speech recognition device.
In this case, the speech recognition device may perform the feature fusion of the first and second sub-environment features with the Statistics Pooling layer (statistical pooling layer) in the model structure shown in FIG. 5. In other embodiments, the Statistics Pooling layer may be replaced with an Attentive Statistics Pooling layer (a statistical pooling layer based on the attention mechanism). The attention mechanism assigns different weights to different speech segments so as to produce a weighted average and a weighted standard deviation of the environmental features corresponding to the at least one speech segment; splicing the two yields the environmental feature of the first speech sample. This captures long-term variation of the environmental characteristics within the first speech sample more effectively, so the speech recognition device can determine a more accurate speech recognition result; a sketch of this variant follows below.
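The attention-based variant can be sketched as follows; the single attention parameter vector v is an illustrative simplification (in practice the attention weights would come from a trained sub-network):

```python
import numpy as np

rng = np.random.default_rng(3)

def attentive_statistics_pooling(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Score each of the T segments with v, softmax the scores into
    weights, then pool with a weighted mean and weighted standard
    deviation instead of the plain statistics."""
    scores = h @ v
    w = np.exp(scores - scores.max())
    w = w / w.sum()                                   # attention weights
    mu = (w[:, None] * h).sum(axis=0)                 # weighted mean
    var = (w[:, None] * h * h).sum(axis=0) - mu * mu
    sigma = np.sqrt(np.clip(var, 0.0, None))          # weighted std
    return np.concatenate([mu, sigma])

h = rng.standard_normal((6, 8)).astype(np.float32)
v = rng.standard_normal(8).astype(np.float32)  # would be trained in practice
print(attentive_statistics_pooling(h, v).shape)  # (16,)
```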
In the embodiment of the present application, a statistics pooling layer (plain or attention-based) is used to fuse the environmental features of each speech segment, so the speech recognition device can integrate a variable-length sequence of frame-level features (i.e., the environmental features of the speech segments) into a fixed-length utterance-level feature (i.e., the environmental feature of the first speech sample). The speech recognition device can therefore extract environmental features from speech to be recognized of any length and perform speech recognition based on the extracted features, which broadens the applicability of the speech recognition device to a certain extent and improves its robustness.
S405, predicting to obtain the predicted environment information of the first voice sample according to the environment characteristics of the first voice sample by adopting an environment prediction network in the environment prediction model.
S406, performing model optimization on the environment prediction model according to the prediction environment information and the reference environment information to obtain an optimized environment prediction model, wherein the optimized environment prediction model comprises an optimized feature extraction network.
In a specific embodiment, reference may be made to specific embodiments of steps S203 to S204 for implementation related to steps S405 to S406, and details of the embodiments of the present application are not repeated herein.
S407, after receiving the voice recognition request, calling the optimized feature extraction network to extract the environmental features of the voice to be recognized carried by the voice recognition request, so as to perform voice recognition on the voice to be recognized based on the extracted target environmental features.
In a specific implementation, the speech recognition device may invoke the optimized speech recognition model to perform speech recognition on the speech to be recognized. The optimized speech recognition model can be obtained as follows. The speech recognition device first obtains a second speech sample and a reference recognition result of the second speech sample. For example, the second speech sample may be the same as the first speech sample; the second speech sample may be collected in the acoustic environment indicated by a reference environmental feature, and the reference recognition result may be a text sequence. The speech recognition device then performs speech recognition on the second speech sample according to the reference environmental feature using the speech recognition model to obtain a predicted recognition result, which may also be a text sequence. Further, the speech recognition device may perform model optimization on the speech recognition model based on the predicted and reference recognition results (e.g., in the direction of reducing the difference between them) to obtain the optimized speech recognition model. It can be understood that the predicted recognition result produced by the optimized model for the second speech sample may have higher consistency with the reference recognition result, or that the difference between them may be smaller than a second preset threshold; the first and second preset thresholds may be the same or different. In other words, in the embodiment of the present application, the optimized speech recognition model can produce an accurate speech recognition result based on the environmental features of the second speech sample. Therefore, the speech recognition device may use the optimized speech recognition model to perform speech recognition processing on the speech to be recognized.
Based on the above description, the reference environmental features and the acoustic environments may be in one-to-one correspondence, that is, one reference environmental feature corresponds to one acoustic environment. Any reference environmental feature may then be obtained as follows: the speech recognition device collects several speech samples in the given acoustic environment, extracts the environmental feature of each sample with the optimized feature extraction network, and determines the reference environmental feature of that environment from the extracted features. For example, the device may perform a weighted summation or a weighted average over the extracted features and use the result as the reference environmental feature; a sketch follows below.
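A sketch of how a reference environmental feature might be derived, assuming the environmental features of the collected samples are stacked row-wise; the plain/weighted averaging choice mirrors the two options mentioned above:

```python
import numpy as np

def reference_feature(sample_features: np.ndarray, weights=None) -> np.ndarray:
    """Fuse the environmental features of several samples recorded in
    one acoustic environment into a single reference feature, by a
    plain or weighted average."""
    if weights is None:
        return sample_features.mean(axis=0)
    weights = np.asarray(weights, dtype=sample_features.dtype)
    weights = weights / weights.sum()
    return (weights[:, None] * sample_features).sum(axis=0)

feats = np.random.default_rng(4).standard_normal((10, 16)).astype(np.float32)
print(reference_feature(feats).shape)  # (16,): one feature per environment
```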
Based on the above description, in one implementation, the speech recognition device may perform speech recognition on the speech to be recognized according to its target environmental feature as follows: the device obtains the environmental information indicated by the target environmental feature, acquires a speech recognition strategy matching that environmental information, and performs speech recognition with the optimized speech recognition model according to the strategy. A speech recognition strategy here is the specific procedure the device follows to recognize the speech, and different strategies may be used for speech collected under different environmental information. For example, for speech collected in a relatively noisy acoustic environment (e.g., noise greater than a noise threshold), the device may first denoise the speech and then perform speech recognition on the denoised data to obtain the recognition result; correspondingly, for speech collected in a quiet acoustic environment, the device may perform speech recognition directly. Alternatively, different strategies may be different processing modes within the same processing flow. For example, experiments by the developers of the embodiment of the present application found that spectral subtraction denoises additive noise more effectively, while an adaptive filter denoises white noise more effectively. Therefore, when the noise in the speech to be recognized is additive noise, the device may denoise it by spectral subtraction; when the noise is white noise, the device may denoise it with an adaptive filter. A sketch of this dispatch follows below.
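The strategy dispatch described above might look like the sketch below; the denoiser bodies are hypothetical placeholders, and the field names in env_info are assumptions for illustration:

```python
import numpy as np

def spectral_subtraction(speech: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder: a real implementation would subtract an
    # estimated noise spectrum from the speech spectrum.
    return speech

def adaptive_filter(speech: np.ndarray) -> np.ndarray:
    # Hypothetical placeholder for an adaptive (e.g. LMS-style) filter.
    return speech

def denoise_for_recognition(speech: np.ndarray, env_info: dict) -> np.ndarray:
    """Pick a denoising strategy from the environmental information
    indicated by the target environmental feature."""
    if env_info.get("noise_level", 0.0) <= env_info.get("noise_threshold", 0.5):
        return speech  # quiet scene: recognise directly
    if env_info.get("noise_type") == "additive":
        return spectral_subtraction(speech)
    if env_info.get("noise_type") == "white":
        return adaptive_filter(speech)
    return speech

speech = np.zeros(16000, dtype=np.float32)  # 1 s of silence at 16 kHz
cleaned = denoise_for_recognition(speech, {"noise_level": 0.9,
                                           "noise_threshold": 0.5,
                                           "noise_type": "additive"})
```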
In another implementation, since the optimized speech recognition model is obtained by model optimization based on the second speech sample under a reference environmental feature, the optimized model may include speech recognition model parameters corresponding to that reference environmental feature. In a specific implementation, there may be at least one reference environmental feature, each corresponding to a set of speech recognition model parameters in the optimized model, used to recognize speech signals (such as the second speech sample or the speech to be recognized) collected under the corresponding reference environmental feature. In this case, the speech recognition device may also perform speech recognition based on the target environmental feature as follows: when the device determines that the feature similarity between the target environmental feature and some reference environmental feature exceeds a similarity threshold, it performs speech recognition on the speech to be recognized using the model parameters corresponding to that reference environmental feature. The feature similarity between the target environmental feature and any reference environmental feature can be obtained by computing the cosine distance between the two features, where a larger cosine distance indicates lower similarity; a sketch follows below.
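A sketch of the similarity-gated parameter selection, using cosine similarity (the complement of the cosine distance mentioned above); the threshold value and the fallback behaviour are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_parameters(target: np.ndarray, reference_feats, param_sets,
                      threshold: float = 0.8):
    """Return the parameter set whose reference environmental feature is
    most similar to the target feature, if the similarity clears the
    threshold; otherwise return None (caller falls back to a default)."""
    sims = [cosine_similarity(target, r) for r in reference_feats]
    best = int(np.argmax(sims))
    return param_sets[best] if sims[best] > threshold else None

rng = np.random.default_rng(5)
refs = [rng.standard_normal(16).astype(np.float32) for _ in range(3)]
params = [{"env": i} for i in range(3)]        # stand-ins for model weights
target = refs[1] + 0.01 * rng.standard_normal(16).astype(np.float32)
print(select_parameters(target, refs, params))  # {'env': 1}
```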
In the embodiment of the application, the speech recognition device first performs model optimization on the environment prediction model using the first speech sample, so that the feature extraction network in the optimized environment prediction model can be used to extract the target environmental feature of the speech to be recognized; in addition, the device performs model optimization on the speech recognition model based on the reference environmental features and the second speech sample, so that the optimized speech recognition model can recognize the speech to be recognized based on the target environmental feature. Because the feature extraction network and the speech recognition model are optimized separately, fewer model parameters need to be optimized in each optimization process, which can to some extent accelerate the development of a speech recognition product or task. In addition, the first speech sample used to optimize the feature extraction network and the second speech sample used to optimize the speech recognition model may be the same sample, so fewer speech samples need to be collected, which can effectively save economic resources in the development of speech recognition products.
Based on the above description of the speech recognition method, the embodiment of the present application also discloses a speech recognition apparatus, which can run one or more computer programs (including program code) in the above-mentioned computer device. In a particular embodiment, the speech recognition apparatus may be used to perform the speech recognition method shown in fig. 2 or fig. 4. Referring to fig. 6, the speech recognition apparatus may include: an acquisition unit 601, a feature extraction unit 602, an environment prediction unit 603, a model optimization unit 604, and a speech recognition unit 605. Wherein:
an obtaining unit 601, configured to obtain a first voice sample and reference environment information of the first voice sample, where the reference environment information is used to describe an acoustic environment when the first voice sample is collected;
a feature extraction unit 602, configured to perform environmental feature extraction on the first voice sample by using a feature extraction network in an environmental prediction model to obtain an environmental feature of the first voice sample;
an environment prediction unit 603, configured to predict, by using an environment prediction network in the environment prediction model, prediction environment information of the first speech sample according to an environment feature of the first speech sample;
a model optimization unit 604, configured to perform model optimization on the environment prediction model according to the prediction environment information and the reference environment information to obtain an optimized environment prediction model, where a difference between the prediction environment information of the first speech sample obtained by the optimized environment prediction model and the reference environment information is smaller than a first preset threshold, and the optimized environment prediction model includes an optimized feature extraction network;
and the voice recognition unit 605 is configured to, after receiving the voice recognition request, invoke the optimized feature extraction network to perform environmental feature extraction on the voice to be recognized carried by the voice recognition request, so as to perform voice recognition on the voice to be recognized based on the extracted target environmental feature.
In an embodiment, the first speech sample includes a plurality of speech signal frames, and the feature extraction unit 602 may be specifically configured to perform:
sampling the plurality of voice signal frames according to a preset step length to obtain at least one voice segment, wherein each voice segment comprises N voice signal frames, and N is a positive integer;
extracting the environmental characteristics of each voice segment by adopting the characteristic extraction network to obtain the environmental characteristics of each voice segment;
and performing feature fusion on the environmental features corresponding to the at least one voice segment to obtain the environmental features of the first voice sample.
In another embodiment, the feature extraction network includes at least one sub-network, each sub-network is configured to extract an environmental feature of one dimension, and the feature extraction unit 602 is specifically configured to:
respectively adopting each sub-network to extract the environmental characteristics of any voice segment to obtain the environmental characteristics of any voice segment under the corresponding dimensionality;
and performing feature fusion processing on the extracted environmental features of any voice segment under at least one dimension to obtain the environmental features of any voice segment.
In yet another embodiment, the feature extraction unit 602 may be further configured to perform:
carrying out average operation on the environment characteristics corresponding to the at least one voice fragment to obtain first sub-environment characteristics;
performing standard deviation operation on the environment characteristics corresponding to the at least one voice fragment to obtain second sub-environment characteristics;
and performing feature splicing processing on the first sub-environment feature and the second sub-environment feature to obtain the environment feature of the first voice sample.
In another embodiment, the model optimization unit 604 may be further specifically configured to perform:
acquiring a second voice sample and a reference recognition result of the second voice sample, wherein the second voice sample is acquired under an acoustic environment indicated by reference environment characteristics;
performing voice recognition on the second voice sample by adopting a voice recognition model according to the reference environment characteristics to obtain a prediction recognition result;
and performing model optimization processing on the voice recognition model according to the predicted recognition result and the reference recognition result to obtain an optimized voice recognition model, wherein the difference between the predicted recognition result of the second voice sample and the reference recognition result obtained by the optimized voice recognition model is smaller than a second preset threshold, and the optimized voice recognition model is used for performing voice recognition on the voice to be recognized based on the target environment characteristics.
In another embodiment, the speech recognition unit 605 may be further specifically configured to perform:
acquiring environmental information indicated by the target environmental characteristics;
and acquiring a voice recognition strategy matched with the environment information, and performing voice recognition on the voice to be recognized by adopting the optimized voice recognition model according to the voice recognition strategy.
In another embodiment, the optimized speech recognition model includes speech recognition model parameters corresponding to the reference environmental features, and the number of the reference environmental features is at least one; the speech recognition unit 605 may be further specifically configured to perform:
if the feature similarity between the target environment feature and any one of the at least one reference environment feature is greater than a similarity threshold, performing voice recognition on the voice to be recognized by adopting voice recognition model parameters corresponding to the any one reference environment feature;
wherein the speech recognition model parameters corresponding to any one of the reference environmental features include: and the model parameters are used for carrying out voice recognition on the voice samples acquired in the acoustic environment indicated by any reference environment characteristic.
According to an embodiment of the present application, the steps involved in the speech recognition methods shown in fig. 2 and 4 may be performed by the units in the speech recognition apparatus shown in fig. 6. For example, step S201 shown in fig. 2 may be performed by the acquisition unit 601 in the speech recognition apparatus shown in fig. 6; step S202 may be performed by the feature extraction unit 602 in the speech recognition apparatus shown in fig. 6; step S203 may be performed by the environment prediction unit 603 in the speech recognition apparatus shown in fig. 6; step S204 may be performed by the model optimization unit 604 in the speech recognition apparatus shown in fig. 6; step S205 may be performed by the speech recognition unit 605 in the speech recognition apparatus shown in fig. 6. As another example, step S401 shown in fig. 4 may be performed by the obtaining unit 601 in the speech recognition apparatus shown in fig. 6, steps S402 to S404 may be performed by the feature extracting unit 602 in the speech recognition apparatus shown in fig. 6, step S405 may be performed by the environment predicting unit 603 in the speech recognition apparatus shown in fig. 6, step S406 may be performed by the model optimizing unit 604 in the speech recognition apparatus shown in fig. 6, and step S407 may be performed by the speech recognition unit 605 in the speech recognition apparatus shown in fig. 6.
According to another embodiment of the present application, the units in the speech recognition apparatus shown in fig. 6 are divided based on logical functions. The above units may be individually or wholly combined into one or several other units, or one of the units may be further split into multiple functionally smaller units, all of which can implement the same operations without affecting the technical effects of the embodiments of the present application. In other embodiments of the present application, the speech recognition apparatus may also include other units, and in practical applications these functions may be implemented with the assistance of other units or jointly by multiple units.
According to another embodiment of the present application, the speech recognition apparatus shown in fig. 6 may be constructed, and the speech recognition method of the embodiment of the present application implemented, by running a computer program (including program code) capable of executing the steps of the methods shown in fig. 2 and 4 on a general-purpose computing device that includes processing elements such as a Central Processing Unit (CPU), a random access memory (RAM), and a read-only memory (ROM), as well as storage elements. The computer program may, for example, be recorded on a computer storage medium, loaded into the above-mentioned computing device via the storage medium, and executed therein.
In the embodiment of the application, when the speech recognition device performs speech recognition processing on the speech to be recognized, the target environmental features of the acoustic environment in which the speech was collected are utilized, so the device can recognize the speech based on more comprehensive information, thereby improving the accuracy of speech recognition. In the embodiment of the application, the target environmental features of the speech to be recognized are obtained using the optimized feature extraction network, which the speech recognition device obtains through model optimization based on deep learning using the first speech sample, so the optimized network can have stronger feature extraction and feature expression capabilities. Therefore, accurate environmental features can be extracted based on the feature extraction network, so that the speech recognition device can obtain accurate speech recognition results based on environmental features of higher accuracy.
Based on the above description of the method embodiment and the apparatus embodiment, the embodiment of the present application further provides a speech recognition device. Referring to fig. 7, the speech recognition apparatus at least includes a processor 701 and a computer storage medium 702, and the processor 701 and the computer storage medium 702 may be connected by a bus or other means.
The above-mentioned computer storage medium 702 is a memory device in the voice recognition apparatus, and stores programs and data. It will be appreciated that the computer storage media 702 herein may comprise both built-in storage media in the speech recognition device and, of course, extended storage media supported by the speech recognition device. The computer storage medium 702 provides a storage space that stores the operating system of the speech recognition device. Also stored in this memory space are one or more computer programs, which may be one or more program codes, adapted to be loaded and executed by the processor 701. The computer storage medium may be a high-speed RAM memory, or may be a non-volatile memory (non-volatile memory), such as at least one disk memory; and optionally at least one storage medium located remotely from the processor. The processor 701 (or CPU) is a computing core and a control core of the speech recognition device, and is adapted to implement one or more computer programs, and specifically to load and execute the one or more computer programs so as to implement corresponding method flows or corresponding functions.
In one embodiment, one or more computer programs stored in the computer storage medium 702 may be loaded and executed by the processor 701 to implement the corresponding method steps in the method embodiments described above with respect to fig. 2 and 4. In particular implementations, one or more computer programs in the computer storage medium 702 may be loaded and executed by the processor 701 to perform the steps of:
acquiring a first voice sample and reference environment information of the first voice sample, wherein the reference environment information is used for describing an acoustic environment when the first voice sample is acquired;
extracting environmental features of the first voice sample by adopting a feature extraction network in an environmental prediction model to obtain the environmental features of the first voice sample;
predicting to obtain predicted environment information of the first voice sample according to the environment characteristics of the first voice sample by adopting an environment prediction network in the environment prediction model;
performing model optimization on the environment prediction model according to the prediction environment information and the reference environment information to obtain an optimized environment prediction model, wherein the difference between the prediction environment information of the first voice sample obtained by the optimized environment prediction model and the reference environment information is smaller than a first preset threshold value, and the optimized environment prediction model comprises an optimized feature extraction network;
and after receiving a voice recognition request, calling the optimized feature extraction network to extract the environmental features of the voice to be recognized carried by the voice recognition request so as to perform voice recognition on the voice to be recognized based on the extracted target environmental features.
In one embodiment, the first speech sample includes a plurality of speech signal frames, and the processor 701 is specifically configured to load and execute:
sampling the plurality of voice signal frames according to a preset step length to obtain at least one voice segment, wherein each voice segment comprises N voice signal frames, and N is a positive integer;
extracting the environmental characteristics of each voice segment by adopting the characteristic extraction network to obtain the environmental characteristics of each voice segment;
and performing feature fusion on the environmental features corresponding to the at least one voice segment to obtain the environmental features of the first voice sample.
In another embodiment, the feature extraction network includes at least one sub-network, each sub-network is configured to extract an environmental feature in one dimension, and the processor 701 is further specifically configured to load and execute:
respectively adopting each sub-network to extract the environmental characteristics of any voice segment to obtain the environmental characteristics of any voice segment under the corresponding dimensionality;
and performing feature fusion processing on the extracted environmental features of any voice segment under at least one dimension to obtain the environmental features of any voice segment.
In another embodiment, the processor 701 may be further specifically configured to load and execute:
carrying out average operation on the environment characteristics corresponding to the at least one voice fragment to obtain first sub-environment characteristics;
performing standard deviation operation on the environment characteristic corresponding to the at least one voice fragment to obtain a second sub-environment characteristic;
and performing feature splicing processing on the first sub-environment feature and the second sub-environment feature to obtain the environment feature of the first voice sample.
In another embodiment, the processor 701 may be further specifically configured to load and execute:
acquiring a second voice sample and a reference recognition result of the second voice sample, wherein the second voice sample is acquired under an acoustic environment indicated by reference environment characteristics;
performing voice recognition on the second voice sample by adopting a voice recognition model according to the reference environment characteristics to obtain a prediction recognition result;
and performing model optimization processing on the voice recognition model according to the predicted recognition result and the reference recognition result to obtain an optimized voice recognition model, wherein the difference between the predicted recognition result of the second voice sample obtained by the optimized voice recognition model and the reference recognition result is smaller than a second preset threshold, and the optimized voice recognition model is used for performing voice recognition on the voice to be recognized based on the target environment characteristics.
In another embodiment, the processor 701 may be further specifically configured to load and execute:
acquiring environmental information indicated by the target environmental characteristics;
and acquiring a voice recognition strategy matched with the environment information, and performing voice recognition on the voice to be recognized by adopting the optimized voice recognition model according to the voice recognition strategy.
In another embodiment, the optimized speech recognition model includes speech recognition model parameters corresponding to the reference environmental features, and the number of the reference environmental features is at least one; the processor 701 may be further specifically configured to load and execute:
if the feature similarity between the target environment feature and any one of the at least one reference environment feature is greater than a similarity threshold, performing voice recognition on the voice to be recognized by adopting voice recognition model parameters corresponding to the any one reference environment feature;
wherein the speech recognition model parameters corresponding to any one of the reference environmental features include: and the model parameters are used for carrying out voice recognition on the voice samples acquired in the acoustic environment indicated by any reference environment characteristic.
In the embodiment of the application, when the speech recognition device performs speech recognition processing on the speech to be recognized, the target environmental features of the acoustic environment in which the speech was collected are utilized, so the device can recognize the speech based on more comprehensive information, thereby improving the accuracy of speech recognition. In the embodiment of the application, the target environmental features of the speech to be recognized are obtained using the optimized feature extraction network, which is obtained through model optimization based on deep learning using the first speech sample, so the optimized network can have stronger feature extraction and feature expression capabilities. Therefore, accurate environmental features can be extracted based on the feature extraction network, enabling the speech recognition device to obtain accurate speech recognition results based on environmental features of higher accuracy.
The present application further provides a computer storage medium that stores one or more computer programs corresponding to the foregoing speech recognition method. When one or more processors load and execute the one or more computer programs, the speech recognition method described in the embodiments can be implemented; the details and the description of its beneficial effects are not repeated herein. It will be appreciated that the computer program may be deployed to be executed on one or more devices that are capable of communicating with each other.
It should be noted that according to an aspect of the present application, a computer product or a computer program is also provided, and the computer product includes a computer program, and the computer program is stored in a computer storage medium. A processor in the computer device reads the computer program from the computer storage medium and executes the computer program, thereby enabling the computer device to perform the methods provided in the various alternatives in the aspect of the embodiment of the speech recognition method illustrated in fig. 2 and 4 described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer storage medium and executed by a computer, and the processes of the embodiments of the speech recognition method may be included. The computer storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
It should be understood that the above-described embodiments are only exemplary of the present disclosure, and should not be construed as limiting the scope of the present disclosure, and those skilled in the art will understand that all or part of the above-described embodiments may be implemented and equivalents thereof may be made to the claims of the present disclosure while remaining within the scope of the present disclosure.

Claims (11)

1. A speech recognition method, comprising:
acquiring a first voice sample and reference environment information of the first voice sample, wherein the reference environment information is used for describing an acoustic environment when the first voice sample is acquired;
extracting environmental features of the first voice sample by adopting a feature extraction network in an environmental prediction model to obtain the environmental features of the first voice sample;
predicting to obtain predicted environment information of the first voice sample according to the environment characteristics of the first voice sample by adopting an environment prediction network in the environment prediction model;
performing model optimization on the environment prediction model according to the prediction environment information and the reference environment information to obtain an optimized environment prediction model, wherein the difference between the prediction environment information of the first voice sample obtained by the optimized environment prediction model and the reference environment information is smaller than a first preset threshold value, and the optimized environment prediction model comprises an optimized feature extraction network;
and after receiving a voice recognition request, calling the optimized feature extraction network to extract the environmental features of the voice to be recognized carried by the voice recognition request so as to perform voice recognition on the voice to be recognized based on the extracted target environmental features.
2. The method of claim 1, wherein the first speech sample comprises a plurality of speech signal frames; the extracting the environmental characteristics of the first voice sample by adopting a characteristic extraction network in an environmental prediction model comprises the following steps:
sampling the plurality of voice signal frames according to a preset step length to obtain at least one voice segment, wherein each voice segment comprises N voice signal frames, and N is a positive integer;
extracting the environmental characteristics of each voice segment by adopting the characteristic extraction network to obtain the environmental characteristics of each voice segment;
and performing feature fusion on the environmental features corresponding to the at least one voice segment to obtain the environmental features of the first voice sample.
3. The method of claim 2, wherein the feature extraction network comprises at least one sub-network, each sub-network for extracting environmental features of one dimension; the method for extracting the environmental characteristics of any voice segment by adopting the characteristic extraction network comprises the following steps:
respectively adopting each sub-network to extract the environmental characteristics of any voice segment to obtain the environmental characteristics of any voice segment under the corresponding dimensionality;
and performing feature fusion processing on the extracted environmental features of any voice segment under at least one dimension to obtain the environmental features of any voice segment.
4. The method according to claim 2 or 3, wherein the performing feature fusion on the environmental feature corresponding to the at least one speech segment to obtain the environmental feature of the first speech sample comprises:
performing average operation on the environment characteristics corresponding to the at least one voice fragment to obtain a first sub-environment characteristic;
performing standard deviation operation on the environment characteristic corresponding to the at least one voice fragment to obtain a second sub-environment characteristic;
and performing feature splicing processing on the first sub-environment feature and the second sub-environment feature to obtain the environment feature of the first voice sample.
5. The method of claim 1, further comprising:
acquiring a second voice sample and a reference recognition result of the second voice sample, wherein the second voice sample is acquired under an acoustic environment indicated by reference environment characteristics;
performing voice recognition on the second voice sample by adopting a voice recognition model according to the reference environment characteristics to obtain a prediction recognition result;
and performing model optimization processing on the voice recognition model according to the predicted recognition result and the reference recognition result to obtain an optimized voice recognition model, wherein the difference between the predicted recognition result of the second voice sample and the reference recognition result obtained by the optimized voice recognition model is smaller than a second preset threshold, and the optimized voice recognition model is used for performing voice recognition on the voice to be recognized based on the target environment characteristics.
6. The method of claim 5, wherein the manner of performing speech recognition on the speech to be recognized based on the target environment features comprises:
acquiring environmental information indicated by the target environmental characteristics;
and acquiring a voice recognition strategy matched with the environment information, and performing voice recognition on the voice to be recognized by adopting the optimized voice recognition model according to the voice recognition strategy.
7. The method according to claim 5, wherein the optimized speech recognition model comprises speech recognition model parameters corresponding to the reference environmental features, and the number of the reference environmental features is at least one; the method for performing voice recognition on the voice to be recognized based on the target environment features comprises the following steps:
if the feature similarity between the target environment feature and any one of the at least one reference environment feature is greater than a similarity threshold, performing voice recognition on the voice to be recognized by adopting voice recognition model parameters corresponding to the any one reference environment feature;
wherein the speech recognition model parameters corresponding to any one of the reference environmental features include: and the model parameters are used for carrying out voice recognition on the voice samples acquired in the acoustic environment indicated by any reference environment characteristic.
8. A speech recognition apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first voice sample and reference environment information of the first voice sample, and the reference environment information is used for describing an acoustic environment when the first voice sample is acquired;
the feature extraction unit is used for extracting the environmental features of the first voice sample by adopting a feature extraction network in an environmental prediction model to obtain the environmental features of the first voice sample;
the environment prediction unit is used for predicting to obtain the predicted environment information of the first voice sample according to the environment characteristics of the first voice sample by adopting an environment prediction network in the environment prediction model;
the model optimization unit is used for carrying out model optimization on the environment prediction model according to the prediction environment information and the reference environment information to obtain an optimized environment prediction model, the difference between the prediction environment information of the first voice sample obtained by the optimized environment prediction model and the reference environment information is smaller than a first preset threshold value, and the optimized environment prediction model comprises an optimized feature extraction network;
and the voice recognition unit is used for calling the optimized feature extraction network to extract the environmental features of the voice to be recognized carried by the voice recognition request after receiving the voice recognition request so as to perform voice recognition on the voice to be recognized based on the extracted target environmental features.
9. A speech recognition device, comprising:
a processor for implementing one or more computer programs;
computer storage medium storing one or more computer programs adapted to be loaded by the processor and to perform the speech recognition method according to any of claims 1-7.
10. A computer storage medium, characterized in that it stores one or more computer programs adapted to be loaded by a processor and to perform the speech recognition method according to any of claims 1-7.
11. A computer product, characterized in that the computer product comprises a computer program adapted to be loaded by a processor and to perform the speech recognition method according to any of claims 1-7.