WO2022266825A1 - Speech processing method and apparatus, and system - Google Patents

Speech processing method and apparatus, and system

Info

Publication number
WO2022266825A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
information
confidence
speech
confidence levels
Prior art date
Application number
PCT/CN2021/101400
Other languages
French (fr)
Chinese (zh)
Inventor
王科涛
聂为然
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180001914.8A (published as CN113597641A)
Priority to PCT/CN2021/101400
Publication of WO2022266825A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a voice processing method, device and system.
  • the present application provides a speech processing method, device and system capable of improving speech recognition capability, so as to improve the accuracy of speech recognition.
  • the first aspect of the present application relates to a voice processing method, including the following content: acquiring a user's input voice information; determining a plurality of first confidence levels corresponding to the input voice information according to the input voice information, the multiple first confidence levels respectively corresponding to a plurality of languages; modifying the plurality of first confidence levels into a plurality of second confidence levels according to user characteristics; and determining the language of the input voice information according to the plurality of second confidence levels.
  • By modifying the multiple first confidence levels into multiple second confidence levels according to the user's characteristics and determining the language of the input voice information according to the multiple second confidence levels, that is, by determining the language of the voice information input by the user on the basis of the user characteristics, the language recognition accuracy can be improved and the speech recognition capability can be improved.
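As a rough sketch of the first-aspect flow, the fragment below is a hypothetical illustration only: the language set, the threshold value, and the history-based correction rule are assumptions for demonstration, not the patent's actual implementation.

```python
# Hypothetical sketch: compute second confidences from first confidences and
# user characteristics, then pick the language with the highest confidence.

LANGUAGES = ["Chinese", "English", "Korean", "German", "Japanese"]
FIRST_THRESHOLD = 0.7  # assumed value of the "first threshold"

def identify_language(first_confidences, user_history):
    """Modify first confidence levels by user characteristics, then pick a language."""
    # Only correct when no language is already confident enough.
    if all(c < FIRST_THRESHOLD for c in first_confidences.values()):
        second = correct_by_user_features(first_confidences, user_history)
    else:
        second = dict(first_confidences)
    # The language of the input voice is the one with the highest (second) confidence.
    return max(second, key=second.get), second

def correct_by_user_features(confidences, user_history):
    """Boost languages that dominate the user's historical language records."""
    total = sum(user_history.values()) or 1
    corrected = {
        lang: conf * (1.0 + user_history.get(lang, 0) / total)
        for lang, conf in confidences.items()
    }
    norm = sum(corrected.values()) or 1
    return {lang: conf / norm for lang, conf in corrected.items()}

first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0, "German": 0.0, "Japanese": 0.0}
history = {"Chinese": 8, "English": 2}  # past recognized languages for this user
language, second = identify_language(first, history)
```

With the toy history above, the Chinese confidence is boosted relative to English, so the ambiguous first confidences resolve to Chinese.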
  • modifying the multiple first confidence levels into multiple second confidence levels according to user characteristics may specifically include: when the multiple first confidence levels are all smaller than the first threshold, modifying the multiple first confidence levels into multiple second confidence levels according to the user characteristics.
  • When the plurality of first confidence levels are all smaller than the first threshold, it is difficult to determine the language of the input voice information from the first confidence levels alone. In this case, the language of the input voice information can be determined according to the second confidence levels, which improves language recognition accuracy and speech recognition capability.
  • User characteristics may include one or more of historical language records and user-specified languages.
  • the first recognition confidence level is corrected according to the user's historical language records and/or the user-designated language, and the language of the input voice is determined on this basis, thereby improving the language recognition ability.
  • the historical language record of the user refers to the record of the language to which the voice input by the user belongs before the above-mentioned input voice is input.
  • the user-specified language refers to the type of system language set by the user. There may be only one user-specified language, or there may be multiple user-specified languages (that is, there are multiple system languages set by the user).
  • the historical language records and the user-specified language are obtained by querying the voiceprint features of the input voice information.
  • Using the voiceprint of the input voice information to query the historical language records or the user-specified language can, compared with querying based on face information, iris features, etc., avoid misidentifying the user (that is, identifying a non-speaker as the speaker) and the language misidentification that would result. In addition, the voiceprint can be obtained directly from the input voice information, whereas querying based on face information, iris features, etc. additionally requires acquiring user images; the voiceprint-based query therefore needs less equipment and is faster to process.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights.
  • the voice processing method may further include the following content: updating multiple preset weights according to multiple second confidence levels.
  • updating the multiple preset weights according to the multiple second confidence levels specifically includes: when there is a second confidence level greater than the first threshold among the multiple second confidence levels, updating the multiple preset weights according to the multiple second confidence levels.
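One possible reading of the relationship between initial confidences, preset weights, and the update condition is sketched below; the multiplicative weighting and the specific update rule are assumptions for illustration, as the text does not specify them.

```python
# Illustrative sketch: first confidences as weighted initial confidences, with
# the preset weights updated only when a second confidence exceeds the first
# threshold (an ambiguous result leaves the weights unchanged).

FIRST_THRESHOLD = 0.7  # assumed

def first_confidences(initial, weights):
    """First confidence level per language = initial confidence * preset weight."""
    return {lang: initial[lang] * weights[lang] for lang in initial}

def maybe_update_weights(weights, second, lr=0.1):
    """Nudge weights toward the second confidences only when some second confidence is decisive."""
    if not any(c > FIRST_THRESHOLD for c in second.values()):
        return weights  # ambiguous result: keep weights unchanged
    return {lang: (1 - lr) * weights[lang] + lr * second[lang] for lang in weights}

initial = {"Chinese": 0.8, "English": 0.6}
weights = {"Chinese": 1.0, "English": 1.0}
first = first_confidences(initial, weights)   # {"Chinese": 0.8, "English": 0.6}
second = {"Chinese": 0.9, "English": 0.1}     # after user-feature correction
weights = maybe_update_weights(weights, second)
```

Because the corrected Chinese confidence exceeds the threshold, the weights are nudged toward Chinese, which can improve recognition in subsequent processing cycles.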
  • the method further includes: determining the semantics of the input voice information according to the input voice information and the language of the input voice information.
  • multiple languages are preset.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the voice processing method further includes: setting the multiple preset weights according to scene features before acquiring the user's input voice information.
  • Setting the preset weights according to the scene characteristics allows the method to adapt to different scenes; the language recognition result is obtained with the preset weights best suited to the scene, which improves the language recognition and speech recognition capabilities.
  • the scene feature includes an environment feature and/or an audio collector feature.
  • the environmental feature includes one or more of environmental signal-to-noise ratio, power supply DC and AC information, or environmental vibration amplitude.
  • the audio collector feature includes microphone arrangement information.
  • Environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence levels. Adjusting the preset weights according to this information and performing language recognition on that basis therefore improves the language recognition capability.
  • setting multiple preset weights according to scene characteristics specifically includes: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining second voice data according to the first voice data and the scene features; determining second language information of the second voice data according to the second voice data; and setting the multiple preset weights according to the first language information and the second language information.
  • determining the second language information of the second voice data according to the second voice data specifically includes: acquiring multiple test weight groups, any one of which includes multiple test weights; and determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups. Setting multiple preset weights according to the first language information and the second language information specifically includes: determining multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and setting the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
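The weight-setting procedure in the two paragraphs above amounts to a small search over candidate weight groups. The sketch below is an assumed illustration: `recognize` is a stand-in for the real recognizer, and the per-sample score dictionaries are toy data.

```python
# Sketch of setting preset weights from test weight groups: for each candidate
# weight group, recognize the language of each second-voice-data sample,
# compare with the recorded first language information, and keep the weight
# group with the highest accuracy.

def recognize(sample, weights):
    """Placeholder recognizer: weighted argmax over per-language scores."""
    scores = {lang: sample[lang] * weights[lang] for lang in sample}
    return max(scores, key=scores.get)

def set_preset_weights(second_voice_data, first_language_info, test_weight_groups):
    best_weights, best_accuracy = None, -1.0
    for weights in test_weight_groups:
        predictions = [recognize(s, weights) for s in second_voice_data]
        accuracy = sum(p == t for p, t in zip(predictions, first_language_info)) / len(predictions)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights

# Toy data: per-sample raw scores per language, and the recorded true languages.
samples = [{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.4, "English": 0.6}]
truth = ["Chinese", "Chinese"]
groups = [{"Chinese": 1.0, "English": 1.0}, {"Chinese": 1.6, "English": 1.0}]
best = set_preset_weights(samples, truth, groups)
```

On this toy data the second weight group recognizes both samples correctly, so it is selected as the preset weights.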
  • setting multiple preset weights specifically includes: setting multiple preset weights within a weight range.
  • updating the multiple preset weights specifically includes: updating the multiple preset weights within a weight range.
  • the weight range is determined as follows:
  • acquiring multiple test voice data groups, any one of which includes multiple test voice data; acquiring multiple test weight groups, any one of which includes multiple test weights; and determining the weight range according to the multiple test voice data groups, the first voice information and the multiple test weight groups.
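The weight-range determination could be sketched as follows. The text does not specify how the range is derived from the test data, so the accept-then-min/max rule and the acceptance threshold below are purely assumptions.

```python
# Hypothetical sketch of determining a weight range: evaluate each test weight
# group on the test voice data, keep the groups whose accuracy clears a bar,
# and take the per-language min/max over the surviving groups as the range.

ACCURACY_BAR = 0.8  # assumed acceptance threshold

def evaluate(weight_group, test_voice_groups, language_labels):
    """Placeholder accuracy: fraction of samples whose weighted top score matches the label."""
    correct = total = 0
    for group, labels in zip(test_voice_groups, language_labels):
        for sample, label in zip(group, labels):
            scores = {lang: sample[lang] * weight_group[lang] for lang in sample}
            correct += max(scores, key=scores.get) == label
            total += 1
    return correct / total

def weight_range(test_voice_groups, language_labels, test_weight_groups):
    # Assumes at least one candidate group clears the accuracy bar.
    good = [w for w in test_weight_groups
            if evaluate(w, test_voice_groups, language_labels) >= ACCURACY_BAR]
    langs = test_weight_groups[0].keys()
    return {lang: (min(w[lang] for w in good), max(w[lang] for w in good)) for lang in langs}

# Toy data: one test voice data group of two samples with per-language scores.
voice_groups = [[{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.4, "English": 0.6}]]
labels = [["Chinese", "Chinese"]]
candidates = [{"Chinese": 1.0, "English": 1.0},
              {"Chinese": 1.6, "English": 1.0},
              {"Chinese": 2.0, "English": 1.0}]
rng = weight_range(voice_groups, labels, candidates)
```

Here the first candidate misrecognizes one sample and is rejected, so the resulting range spans only the two accepted weight groups.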
  • the second aspect of the present application provides a voice processing method, including the following content: acquiring user input voice information; determining a plurality of third confidence levels corresponding to the input voice information according to the input voice information, the multiple third confidence levels respectively corresponding to multiple languages; correcting multiple third confidence levels into multiple fourth confidence levels according to scene features; determining the language of the input voice information according to the multiple fourth confidence levels.
  • In the speech processing method, the multiple third confidence levels are modified into multiple fourth confidence levels according to the scene features, and the language of the input voice information is determined according to the multiple fourth confidence levels; that is, the language of the user's input voice information is determined on the basis of the scene features, so that the voice processing method adapts to the actual scene as far as possible, improving language recognition accuracy and speech recognition capability.
  • scene features may include environment features and/or audio collector features.
  • the environmental feature includes one or more of environmental signal-to-noise ratio, power supply DC and AC information, or environmental vibration amplitude.
  • the audio collector feature includes microphone arrangement information.
  • modifying the multiple third confidence levels into multiple fourth confidence levels according to scene characteristics includes: setting multiple preset weights according to the scene features; and modifying the plurality of third confidence levels into a plurality of fourth confidence levels according to the multiple preset weights.
  • setting multiple preset weights according to scene characteristics specifically includes: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining second voice data according to the first voice data and the scene features; determining second language information of the second voice data according to the second voice data; and setting the multiple preset weights according to the first language information and the second language information.
  • determining the second language information of the second voice data according to the second voice data specifically includes: acquiring multiple test weight groups, each including multiple test weights; and determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information corresponding to the multiple test weight groups. Setting multiple preset weights according to the first language information and the second language information specifically includes: determining multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and setting multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • the third aspect of the present application provides a voice processing device, including a processing module and a transceiver module. The transceiver module is used to obtain the user's input voice information; the processing module is used to determine, according to the input voice information, multiple first confidence levels corresponding to the input voice information, the multiple first confidence levels respectively corresponding to multiple languages.
  • the processing module is further configured to modify the plurality of first confidence levels into a plurality of second confidence levels according to user characteristics of the user, and determine the language of the input voice information according to the plurality of second confidence levels.
  • the processing module is specifically configured to, when the multiple first confidence levels are smaller than the first threshold, modify the multiple first confidence levels to multiple second confidence levels according to user characteristics.
  • the user features include one or more of historical language records and user-specified languages.
  • the historical language records and the user-specified language are obtained by querying the voiceprint features of the input voice information.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the processing module is further configured to update the multiple preset weights according to the multiple second confidence levels.
  • the processing module is specifically configured to update the multiple preset weights when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
  • the processing module is further configured to determine the semantics of the input voice information according to the input voice information and the language of the input voice information.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the processing module is further configured to set multiple preset weights according to scene features before acquiring the user's input voice information.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the processing module is specifically configured to acquire the pre-collected first voice data and the pre-recorded first language information of the first voice data, determine the second voice data according to the first voice data and the scene features, determine the second language information of the second voice data according to the second voice data, and set a plurality of preset weights according to the first language information and the second language information.
  • the processing module is specifically configured to obtain multiple test weight groups, any one of which includes multiple test weights; determine a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups; determine multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and set a plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • the processing module is specifically configured to set multiple preset weights within a weight range.
  • the processing module is specifically configured to update the multiple preset weights within the weight range.
  • the weight range is determined as follows:
  • acquiring multiple test voice data groups, any one of which includes multiple test voice data; acquiring multiple test weight groups, any one of which includes multiple test weights; and determining the weight range according to the multiple test voice data groups, the first voice information and the multiple test weight groups.
  • the speech processing device of the third aspect can obtain the same technical effect as that of the speech processing method of the first aspect, and the description will not be repeated here.
  • the fourth aspect of the present application provides a voice processing device, including a processing module and a transceiver module. The transceiver module is used to obtain the user's input voice information; the processing module is used to determine, according to the input voice information, a plurality of third confidence levels corresponding to the input voice information, the plurality of third confidence levels corresponding to multiple languages. The processing module is also used to modify the plurality of third confidence levels into a plurality of fourth confidence levels according to the scene characteristics, and to determine the language of the input voice information according to the plurality of fourth confidence levels.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the processing module is specifically configured to set multiple preset weights according to scene characteristics, and correct multiple third confidence levels into multiple fourth confidence levels according to the multiple preset weights.
  • the processing module is specifically configured to acquire the pre-collected first voice data and the pre-recorded first language information of the first voice data, determine the second voice data according to the first voice data and the scene characteristics, determine the second language information of the second voice data according to the second voice data, and set a plurality of preset weights according to the first language information and the second language information.
  • the processing module is specifically configured to obtain multiple test weight groups, each including multiple test weights; determine a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups; determine multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and set a plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • a fifth aspect of the present application provides a computing device, which includes a processor and a memory. The memory stores computer program instructions, and when the computer program instructions are executed by the processor, the processor performs any method described in the first aspect or the second aspect.
  • the sixth aspect of the present application provides a computer-readable storage medium, which stores computer program instructions. When executed by a computer, the computer program instructions cause the computer to execute any method described in the first aspect or the second aspect.
  • a seventh aspect of the present application provides a computer program product, which includes computer program instructions. When executed by a computer, the computer program instructions cause the computer to execute any method described in the first aspect or the second aspect.
  • the eighth aspect of the present application provides a system, which includes the speech processing device provided in any aspect from the third aspect to the fourth aspect or any possible implementation manner.
  • FIG. 1 is a schematic illustration of an application scenario example of a speech processing solution provided by an embodiment of the present application;
  • FIG. 2 is a schematic illustration of a speech processing system to which the speech processing solution provided by an embodiment of the present application is applied;
  • FIG. 3 is a flowchart of a voice processing method provided by an embodiment of the present application;
  • FIG. 4 is a flowchart of a speech processing method provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural illustration of a speech processing device provided by an embodiment of the present application;
  • FIG. 6 is a flowchart schematically illustrating a language recognition method provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a language recognition device provided by an embodiment of the present application;
  • FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application;
  • FIG. 10 is a schematic illustration of a method for setting the weight range;
  • FIG. 11 is a schematic illustration of a voice interaction system involved in an embodiment of the present application;
  • FIG. 12 is a flowchart illustrating part of the process of a voice interaction method involved in an embodiment;
  • FIG. 13 is a schematic illustration of a method for initializing a preset weight set provided in an embodiment of the present application;
  • FIG. 14 is a schematic illustration of part of the flow of a voice interaction process provided in an embodiment of the present application;
  • FIG. 15 is a schematic illustration of a confidence correction method provided in an embodiment of the present application;
  • FIG. 16 is a schematic illustration of another confidence correction method provided in an embodiment of the present application;
  • FIG. 17 is a schematic illustration of an electronic control unit provided in an embodiment of the present application.
  • the voice processing solution provided in the embodiments of the present application includes a voice processing method, device, and system. Since these technical solutions solve problems on the same or similar principles, repeated content may not be described again in the following specific embodiments; it should be understood that the specific embodiments may refer to each other and be combined with each other.
  • FIG. 1 illustrates a scenario in which the solution is applied to a vehicle.
  • The microphone array receives voice commands from the driver 300 and other occupants; the system executes the corresponding controls according to the voice commands (such as playing music, opening the windows, turning on the air conditioner, or navigating) and at the same time responds (provides feedback) to the voice commands, for example by presenting display information on the central control display 210 or playing voice information through a speaker (not shown) on the central control display 210.
  • Since the vehicle 200 is taken by different occupants, they may issue voice commands in different languages, and even the same occupant may issue voice commands in different languages. However, limited by its language recognition capability, the car-machine system may sometimes obtain a wrong language recognition result, fail to recognize or misrecognize the semantics of a voice command, and thus fail to respond correctly.
  • machine learning models may learn some task-independent information, such as environmental signal-to-noise ratio or audio collector (sound sensor, microphone) characteristics, which leads to errors in the model's predictions when this information changes in actual applications.
  • For example, when the vehicle 200 is a convertible car and the ambient noise is relatively large (for example, a medium-noise environment), the car-machine system may get a wrong language recognition result, and therefore cannot correctly recognize the voice command or make a correct response.
  • Similarly, if the type of microphone array used to collect the training sample data of the machine learning model differs from the microphone 212 of the car-machine system, the system may also generate wrong language recognition results and fail to correctly recognize the driver's voice commands.
  • the embodiments of the present application provide a voice processing method, device, system, etc., which can improve the voice recognition capability of a multilingual voice processing solution.
  • FIG. 2 is a schematic diagram illustrating the architecture of a speech processing system to which the speech processing solution provided by the embodiment of the present application is applied.
  • the voice processing system 180 includes a voice processing device 182 , a sound sensor (microphone) 184 , a speaker 186 , a display device 188 and the like.
  • the voice processing system 180 can be applied to smart vehicles as a car-machine system. In addition, it can also be applied to scenarios such as smart home, smart office, smart robot, smart voice question and answer, smart voice analysis, and real-time voice monitoring and analysis.
  • the sound sensor 184 is used to acquire the user's input voice, and the voice processing device 182 obtains the user's input voice information according to the sensor data of the sound sensor 184, processes the input voice information, and obtains the semantics of the input voice information. And, the voice processing device 182 performs corresponding control according to the semantics, for example, controlling the output of the speaker 186 or the display device 188 .
  • the voice processing device 182 can also be connected with other devices and mechanisms, such as the windows and the air-conditioning system, so as to be able to control them.
  • Fig. 3 is a flowchart of a speech processing method provided by an embodiment of the present application.
  • the voice processing method may be executed by a vehicle, a vehicle-mounted device, or a vehicle-mounted computer, and may also be executed by components of the vehicle or the vehicle-mounted device, such as a chip or a processor.
  • the voice processing method can also be applied to other scenarios such as smart home or smart office.
  • the speech processing method may be executed by related devices involved in these scenarios, such as a control device, a processor, and the like.
  • the speech processing method includes the following contents:
  • the input voice information of the user may be obtained from the sensor data collected by the sound sensor; the sensor data may be used directly, or information obtained after processing the sensor data may be used.
  • the time length of the input voice information is not particularly limited, and may correspond to a paragraph or a sentence of the user.
  • the content spoken by the user may be segmented to form a plurality of input speech information, and the processing of S2-S4 described later is respectively performed on the plurality of input speech information.
  • the multiple first confidence levels correspond to multiple languages.
  • multiple languages may be preset.
  • the confidence level of a language refers to the probability that the input voice information belongs to that language. For example, when the multiple first confidence levels obtained are {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0}, it means that the probability that the language of the input voice information is Chinese is 0.6, the probability that it is English is 0.4, and the probability that it is Korean, German, or Japanese is 0.
  • different languages may belong to different language families (for example, Chinese and English are different languages), or may be different variants under the same family (for example, Mandarin and Cantonese within Chinese are also treated as different languages).
  • the user features here are, for example, historical language records or user-specified languages.
  • the historical language record is the recognized language of the user's input voice information that was recognized and recorded before the current processing cycle.
  • the recognized language here means the language of the input voice information determined by recognizing the input voice information.
  • the user-specified language refers to the type of system language set by the user, for example, according to his or her frequently used language.
  • the first confidence levels are modified according to the user characteristics, and the language of the input voice information is determined according to the modified second confidence levels, so that the language of the input voice information can be determined more accurately and the speech recognition capability can be improved.
  • as an example of the specific correction method, take correction based on historical language records: assuming that Chinese appears more often in the historical language records, the confidence of Chinese among the multiple first confidence levels obtained in this processing cycle is increased, so as to obtain the second confidence levels. For example, according to the historical language records, the above-mentioned multiple first confidence levels {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0} are corrected to {Chinese: 0.8; English: 0.2; Korean: 0; German: 0; Japanese: 0}.
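The history-based correction described above can be sketched as follows. This is an illustrative sketch only: the function names, the boost factor, and the renormalization step are assumptions, not taken from the embodiment.

```python
# Sketch: raise each language's first confidence by its share of the user's
# historical language records, then renormalize so the values still sum to 1.
def correct_confidences(first_confidences, history, boost=1.0):
    if not history:
        return dict(first_confidences)
    freq = {lang: history.count(lang) / len(history) for lang in first_confidences}
    boosted = {lang: c * (1.0 + boost * freq[lang])
               for lang, c in first_confidences.items()}
    total = sum(boosted.values())
    return {lang: v / total for lang, v in boosted.items()}

first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0, "German": 0.0, "Japanese": 0.0}
# A history dominated by Chinese pulls the corrected confidences toward Chinese.
second = correct_confidences(first, history=["Chinese"] * 8 + ["English"] * 2)
```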
  • when the multiple first confidence levels are all smaller than the first threshold, the multiple first confidence levels may be corrected to multiple second confidence levels according to user characteristics.
  • when the plurality of first confidence levels are smaller than the first threshold, it is difficult to determine the language of the input voice information from the first confidence levels alone.
  • in this case, the language of the input voice information can be determined according to the second confidence levels, which can improve language recognition accuracy and speech recognition capability.
  • historical language records and user-specified languages can be obtained by querying the voiceprint features of the input voice information.
  • historical language records and user-specified languages can be easily obtained.
  • the multiple first confidence levels may be determined by multiple initial confidence levels and multiple preset weights.
  • the multiple preset weights may be updated according to the multiple second confidence levels.
  • the preset weight is updated according to the processing result of the current processing cycle, so that the language recognition accuracy of the subsequent processing cycle can be improved.
  • updating may be performed when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
  • in this case, the language recognition result obtained from the plurality of second confidence levels has higher credibility, and updating the preset weights according to the plurality of second confidence levels at this time can more reliably improve the language recognition accuracy in subsequent processing cycles.
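The conditional weight update might look like the sketch below. The additive update rule and the step size are assumptions; the embodiment only requires that updating happen when some second confidence exceeds the first threshold.

```python
# Sketch: update preset weights only when a corrected (second) confidence
# exceeds the threshold, i.e. only when the recognition result is credible.
def update_preset_weights(weights, second_confidences, threshold=0.8, step=0.05):
    if not any(c > threshold for c in second_confidences.values()):
        return dict(weights)  # not credible enough; keep weights unchanged
    return {lang: w + step if second_confidences[lang] > threshold else w
            for lang, w in weights.items()}

weights = {"Chinese": 1.0, "English": 1.0}
updated = update_preset_weights(weights, {"Chinese": 0.85, "English": 0.15})
unchanged = update_preset_weights(weights, {"Chinese": 0.6, "English": 0.4})
```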
  • the semantics of the input voice information may be determined according to the input voice information and the language of the input voice information.
  • the above-mentioned multiple preset weights may be set according to scene characteristics.
  • the scene features here may include environment features and/or audio collector features, for example.
  • the environmental characteristics here may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the following method can be adopted: obtain pre-collected first voice data and pre-recorded first language information of the first voice data; determine second voice data according to the first voice data and scene characteristics; determine second language information of the second voice data according to the second voice data; set a plurality of preset weights according to the first language information and the second language information.
  • the specific manner of determining the second language information of the second voice data according to the second voice data may be: acquire multiple test weight groups, any one of which includes multiple test weights; determine multiple pieces of second language information according to the second voice data and the multiple test weight groups, the multiple pieces of second language information respectively corresponding to the multiple test weight groups; determine multiple accuracy rates of the multiple pieces of second language information according to the first language information and the multiple pieces of second language information; and set the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
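This selection step amounts to a small grid search over candidate weight groups. A minimal sketch, with data shapes and names assumed for illustration:

```python
# Sketch: pick the test weight group whose weighted predictions best match
# the pre-recorded language labels (the first language information).
def best_weight_group(conf_sets, true_languages, weight_groups):
    # conf_sets: per-utterance {language: confidence} from the second voice data
    def accuracy(weights):
        hits = sum(
            1 for confs, truth in zip(conf_sets, true_languages)
            if max(confs, key=lambda lang: confs[lang] * weights[lang]) == truth
        )
        return hits / len(true_languages)
    return max(weight_groups, key=accuracy)

conf_sets = [{"zh": 0.5, "en": 0.5}, {"zh": 0.6, "en": 0.4}]
truths = ["en", "zh"]
groups = [{"zh": 1.0, "en": 1.0}, {"zh": 0.8, "en": 1.2}]
best = best_weight_group(conf_sets, truths, groups)
```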
  • an adjustable range, that is, a weight range, can be set for the multiple preset weights, and the multiple preset weights can be set or updated within the weight range. If a preset weight exceeds the weight range, the recognition result will not be credible. Therefore, by setting an adjustable range, that is, a weight range, the accuracy of the language recognition result can be improved.
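The range check might be sketched as below; representing the weight range as per-language (low, high) bounds is an assumption about the data layout.

```python
# Sketch: a candidate weight set is usable only if every weight lies inside
# its language's adjustable (low, high) range.
def within_weight_range(weights, weight_range):
    return all(weight_range[lang][0] <= w <= weight_range[lang][1]
               for lang, w in weights.items())

weight_range = {"zh": (0.5, 1.5), "en": (0.5, 1.5)}
ok = within_weight_range({"zh": 1.2, "en": 0.9}, weight_range)
bad = within_weight_range({"zh": 1.8, "en": 0.9}, weight_range)
```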
  • the weight range can be determined in the following manner: acquire a plurality of pre-collected test voice data groups and pre-recorded first language information of the plurality of test voice data groups, any one of the plurality of test voice data groups including a plurality of test voice data; acquire multiple test weight groups, any one of which includes multiple test weights; determine the weight range according to the multiple test voice data groups, the first language information, and the multiple test weight groups.
  • Fig. 4 is a flowchart of a speech processing method provided by an embodiment of the present application. Similar to the above-mentioned embodiments, the voice processing method of this embodiment can be executed by the vehicle, vehicle-mounted device, vehicle machine, vehicle-mounted computer, etc., and can also be executed by components in the vehicle or vehicle-mounted device, such as chips or processors. In addition, part of the content in this embodiment is the same as that in the above embodiment, so the description of these content will not be repeated.
  • the speech processing method includes the following contents:
  • the third confidence levels are corrected according to the user characteristics, and the language of the input voice information is determined according to the corrected fourth confidence levels, so that the language of the input voice information can be determined more accurately and the speech recognition capability can be improved.
  • the third confidence degree may be obtained in the same manner as the first confidence degree, or may be different
  • the fourth confidence degree may be obtained in the same manner as the second confidence degree, or may be different.
  • the specific manner of performing correction according to scene characteristics may be the same as the specific manner of performing correction according to user characteristics in the foregoing embodiments, or may be different.
  • correction processing in this embodiment and the correction processing described in the above embodiments can be used in combination, that is, the language confidence is corrected according to both user characteristics and scene characteristics, so that the language of the input voice information can be determined more accurately.
  • a plurality of preset weights may be set according to scene characteristics; and the plurality of third confidence degrees are modified into a plurality of fourth confidence degrees according to the plurality of preset weights.
  • the method of setting multiple preset weights may specifically be: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining according to the first voice data and scene characteristics second voice data; determining second language information of the second voice data according to the second voice data; setting a plurality of preset weights according to the first language information and the second language information.
  • multiple test weight groups are obtained, each including multiple test weights; multiple pieces of second language information are determined according to the second voice data and the multiple test weight groups, the multiple pieces of second language information respectively corresponding to the multiple test weight groups; multiple accuracy rates of the multiple pieces of second language information are determined according to the first language information and the multiple pieces of second language information; and the multiple preset weights are set according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • FIG. 5 is an explanatory diagram of a schematic structure of a speech processing device provided by an embodiment of the present application.
  • the voice processing device 190 is used to execute the voice processing method in the embodiment described with reference to FIG. 3 or the voice processing method in the embodiment described with reference to FIG. 4 , and its structure can be known from the above description, so it is only briefly described here.
  • the voice processing device 190 includes a processing module 192 and a transceiver module 194 .
  • the processing module 192 may be used to execute the content in S2-S4 or S7-S9 above, and the transceiver module 194 may be used to execute the content in S1 or S6 above.
  • the speech processing device 190 may be composed of hardware, may also be composed of software, or may be composed of a combination of software and hardware. Using the speech processing apparatus 190 of this embodiment, the same technical effect as that of the speech processing method described above can be obtained, so repeated description of the technical effect is omitted here.
  • a language recognition method provided by an embodiment of the present application is described below with reference to FIG. 6 .
  • Fig. 6 is a flow chart for schematically illustrating a language recognition method provided by an embodiment of the present application.
  • the language recognition method can be executed by a vehicle, a vehicle-mounted device, a vehicle machine, a vehicle-mounted computer, a chip, a processor, and the like.
  • the input voice information of the user is acquired.
  • the user's input voice data received by the microphone is acquired as the input voice information, or the input voice data of the microphone is preprocessed to obtain the input voice information.
  • the input speech is recognized to obtain a multilingual first recognition confidence set, and multiple first recognition confidence levels in the first recognition confidence set correspond to multiple languages respectively.
  • the multilingual first recognition confidence set is ⁇ Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0 ⁇ . That is, the probability that the language of the input voice information is Chinese is 0.9, the probability that it is English is 0.1, and the probability that it is Korean, German, or Japanese is 0.
  • step S14 it is judged whether there is a first recognition confidence value greater than a threshold in the multilingual first recognition confidence set.
  • the threshold here can be set to 0.8, for example.
  • the recognition result is generated and output according to the first recognition confidence level set.
  • the recognition result here may be a result indicating the recognized language (for example, Chinese), or may be the first recognition confidence set itself.
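Steps S14 and S16 can be sketched as below. The threshold value 0.8 is taken from the example above; returning `None` to mean "proceed to the correction in S18" is an assumption of this sketch.

```python
# Sketch of S14/S16: output the recognized language if any first recognition
# confidence exceeds the threshold; otherwise fall through to correction (S18).
def threshold_result(conf_set, threshold=0.8):
    lang, conf = max(conf_set.items(), key=lambda kv: kv[1])
    if conf > threshold:
        return lang   # S16: generate and output the recognition result
    return None       # no confident language; continue to S18

r1 = threshold_result({"Chinese": 0.9, "English": 0.1})
r2 = threshold_result({"Chinese": 0.6, "English": 0.4})
```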
  • this S14 may be omitted, and step S18 described later may be directly performed.
  • in step S18, the first recognition confidence set is corrected and calculated according to the user's user characteristics to obtain a second recognition confidence set.
  • Examples of user characteristics include the user's historical language records, user-specified language, and the like.
  • the historical language record refers to the record of the recognition language of the speech input by the user before the above-mentioned input speech is input.
  • the user-specified language refers to the type of system language set by the user (such as the system language of the voice interaction system, the system language of the mobile phone operating system when applied to a mobile phone, etc.). In addition, the user may specify one or more languages (that is, the user has set multiple system languages).
  • Historical language records can be obtained by querying the voiceprint of the input voice in the database, and can also be obtained by querying the user's face information, iris information, etc. That is, the user's identity can be determined based on voiceprint, face information, iris information, etc., and thus the user's historical language records can be obtained from the database.
  • querying historical language records and user-specified languages according to the voiceprint of the input voice can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
  • the voiceprint can be obtained from the input voice information itself, while querying based on face information, iris information, etc. also requires obtaining an image of the user. Therefore, querying based on the voiceprint requires less equipment and is faster.
  • obtaining the user-specified language through the voiceprint is based on collecting the user's voiceprint when the user sets the system language, and associating the voiceprint (or user identity) with the system language set by the user and storing them in the user-specified language database.
  • the database mentioned here can be stored locally or on a credible platform.
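The voiceprint-keyed lookup can be sketched as follows. The dictionaries here are hypothetical stand-ins for the local or trusted-platform databases mentioned above.

```python
# Sketch: both databases are keyed by voiceprint (i.e. user identity), so the
# history and the user-specified language come from one query of the same key.
def query_user_features(voiceprint, history_db, specified_db):
    return {
        "historical_languages": history_db.get(voiceprint, []),
        "specified_language": specified_db.get(voiceprint),
    }

history_db = {"vp-001": ["Chinese", "Chinese", "English"]}
specified_db = {"vp-001": "Chinese"}
features = query_user_features("vp-001", history_db, specified_db)
unknown = query_user_features("vp-999", history_db, specified_db)
```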
  • a language recognition result is generated according to the second recognition confidence set.
  • the second recognition confidence set may be directly output as the recognition result, or, when there is a second recognition confidence greater than a threshold in the second recognition confidence set, output the second recognition confidence set or information indicating the recognized language, When there is no second recognition confidence greater than the threshold in the second recognition confidence set, the first recognition confidence set is output as the recognition result.
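The output selection just described might look like the following sketch, which returns the first set unchanged when no corrected confidence clears the threshold:

```python
# Sketch: prefer the corrected (second) set when it is credible; otherwise
# fall back to outputting the first recognition confidence set itself.
def choose_recognition_result(first_set, second_set, threshold=0.8):
    if any(c > threshold for c in second_set.values()):
        return max(second_set, key=second_set.get)  # recognized language
    return first_set

first = {"Chinese": 0.6, "English": 0.4}
result = choose_recognition_result(first, {"Chinese": 0.85, "English": 0.15})
fallback = choose_recognition_result(first, {"Chinese": 0.7, "English": 0.3})
```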
  • the first recognition confidence is calculated according to the user characteristics to obtain the second recognition confidence, and the language recognition result is determined according to the second recognition confidence. In this way, the language recognition ability can be improved.
  • the language recognition method in this embodiment further includes: when the multiple second recognition confidences in the second recognition confidence set are all smaller than the threshold, generating the language recognition result according to the automatic speech recognition confidence obtained by performing automatic speech recognition on the input speech, or according to the natural language understanding (NLU) confidence obtained by performing natural language understanding on the input speech.
  • the language of the input speech is determined according to the automatic speech recognition confidence or the natural language understanding confidence, thereby improving the language recognition ability.
  • the language whose automatic speech recognition confidence exceeds a threshold is used as the recognized language of the input speech.
  • the first recognition confidence set can be obtained in the following manner: the input speech is recognized to obtain an initial confidence set; the multiple initial confidences in the initial confidence set are respectively multiplied by the multiple preset weights in the preset weight set to obtain the first recognition confidence set.
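The element-wise product can be sketched directly; the sketch performs only the multiplication the text specifies, with no renormalization.

```python
# Sketch: multiply each initial confidence by the preset weight of its language
# to form the first recognition confidence set.
def first_recognition_confidences(initial_confidences, preset_weights):
    return {lang: c * preset_weights[lang]
            for lang, c in initial_confidences.items()}

initial = {"Chinese": 0.5, "English": 0.5}
weights = {"Chinese": 1.2, "English": 0.8}
first_set = first_recognition_confidences(initial, weights)
```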
  • the preset weight set can be updated so that the preset weights of the recognized languages whose second recognition confidence is greater than the threshold among the multiple languages are increased relative to the preset weights of the other languages.
  • the preset weight set is updated as above, so that when subsequent input speech is processed, the updated preset weight set is used, which can improve the accuracy of language recognition and the language recognition capability.
  • the specific method of updating the preset weight set can be: perform correction calculation on the preset weight set to obtain a corrected weight set; when the multiple correction weights in the corrected weight set are within the weight range, update the preset weight set with the values of the multiple correction weights.
  • when a preset weight exceeds the weight range, the reliability of the language recognition result obtained according to that preset weight is relatively low. Therefore, using the above method to correct the preset weights within the weight range can suppress the language misrecognition rate.
  • a preset weight set may be preset according to scene characteristics. Therefore, it is possible to adapt to different scenarios, and obtain the language recognition result with the preset weights that are most suitable for the scenario, thereby improving the language recognition ability.
  • Scene features may include environmental features and/or audio picker features.
  • the environmental characteristics may include environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude
  • the audio collector characteristics may include microphone arrangement information.
  • the microphone arrangement information refers to whether it is a single microphone or a microphone array, or if it is a microphone array, whether it is a linear array, a planar array, or a stereo array.
  • the environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
  • the specific method of setting the preset weight set can be: obtain multiple test weight sets; input the pseudo-environment data set into the language recognition model, the pseudo-environment data set being obtained according to the scene characteristics and a noise-free data set; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the pseudo-environment data set; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set.
  • when the set preset weights are within the weight range, the setting is enabled, and when they are not within the weight range, the setting is canceled. Alternatively, when the set preset weights are not within the weight range, the setting remains valid, but other methods are preferred for obtaining the language recognition result, such as determining the recognized language of the input voice according to the user-specified language, or comparing this input voice with the input voices in the historical language record to obtain a feature similarity; if the feature similarity is greater than a similarity threshold, the language of the input voice in the historical language record is determined as the recognized language of this input voice.
  • the weight range can be set in the following way: obtain multiple test data sets; obtain multiple test weight sets; input the test data sets into the language recognition model; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model and the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the test data sets; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; obtain the optimal test weight sets of the multiple test data sets; and obtain the weight ranges of the multiple languages according to the optimal test weight sets of the multiple test data sets.
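Deriving per-language weight ranges from the optimal test weight sets of multiple test data sets might look like the sketch below; taking the minimum and maximum across data sets is an assumption about how the range is formed.

```python
# Sketch: given one optimal {language: weight} set per test data set,
# the weight range per language spans the min and max across data sets.
def derive_weight_ranges(optimal_weight_sets):
    languages = optimal_weight_sets[0].keys()
    return {
        lang: (min(ws[lang] for ws in optimal_weight_sets),
               max(ws[lang] for ws in optimal_weight_sets))
        for lang in languages
    }

optima = [{"zh": 0.9, "en": 1.1}, {"zh": 1.2, "en": 0.8}, {"zh": 1.0, "en": 1.0}]
ranges = derive_weight_ranges(optima)
```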
  • test data set may be a pre-collected speech data set, whose language information is known.
  • FIG. 7 is a schematic structural diagram of a language recognition device provided by an embodiment of the present application.
  • an embodiment of the present application provides a language recognition device, which is used to implement the language recognition method shown in Figure 6; its structure can be known from the above description of the language recognition method of Figure 6, so only a relatively brief description of the language recognition device 10 is given here.
  • the language recognition device 10 includes: an input speech acquisition module 17, configured to acquire the user's input speech; a language recognition module 12, configured to recognize the input speech to obtain a first recognition confidence set, the multiple first recognition confidences in the first recognition confidence set respectively corresponding to multiple languages; a language confidence correction module 16, configured to correct and calculate the first recognition confidence set according to the user's user characteristics to obtain a second recognition confidence set; and a recognition result generation module 18, configured to generate a language recognition result according to the second recognition confidence set.
  • the first recognition confidence is calculated according to the user characteristics to obtain the second recognition confidence, and the language recognition result is determined according to the second recognition confidence. In this way, the language recognition ability can be improved.
  • the language confidence correction module 16 may correct and calculate the first set of recognition confidences according to user characteristics to obtain the second set of recognition confidences when the multiple first recognition confidences are less than the threshold.
  • the user characteristics include historical language records.
  • the historical language record of the user refers to the record of the language to which the voice input by the user belongs before inputting the above-mentioned input voice.
  • the historical language record is obtained by querying the voiceprint of the input voice.
  • querying the historical language record according to the voiceprint of the input voice, compared with querying based on face information, iris information, etc., can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
  • the user characteristics include a user-specified language.
  • the first recognition confidence is modified according to the language specified by the user, and the language of the input speech is determined on this basis, thereby improving the language recognition ability.
  • the language specified by the user is obtained through querying the voiceprint of the input voice.
  • the user-specified language can be queried according to the voiceprint of the input voice, which, compared with querying based on face information, iris information, etc., can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
  • the recognition result generation module is also used for: when the multiple second recognition confidences in the second recognition confidence set are all smaller than the threshold, generating the language recognition result according to the automatic speech recognition (Automatic Speech Recognition, ASR) confidence obtained by performing automatic speech recognition on the input speech.
  • the language of the input speech is determined according to the automatic speech recognition confidence, thereby improving the language recognition ability.
  • the language whose automatic speech recognition confidence exceeds a threshold is used as the recognized language of the input speech.
  • the recognition result generation module is also used for: when multiple second recognition confidences in the second recognition confidence set are less than the threshold, generate language recognition according to the natural language understanding confidence obtained by performing natural language understanding on the input speech result.
  • the language of the input speech is determined according to the natural language understanding confidence, thereby improving the language recognition capability.
  • the language whose natural language understanding confidence exceeds a threshold is used as the recognized language of the input speech.
  • the language identification module is also used to: recognize the input speech to obtain an initial confidence set; multiply the multiple initial confidences in the initial confidence set by the multiple preset weights in the preset weight set respectively to obtain the first recognition confidence set; the language confidence correction module is also used to update the preset weight set when there is a second recognition confidence greater than the threshold in the second recognition confidence set, so that the preset weights of the recognized languages whose second recognition confidence is greater than the threshold are increased relative to the preset weights of the other languages.
  • the preset weight set is updated in the above manner, so that when subsequent input speech is processed, the updated preset weight set is used, which can improve the accuracy of language recognition and the language recognition capability.
  • the language confidence correction module is also used to: perform correction calculation on the preset weight set to obtain a correction weight set; when the multiple correction weights in the correction weight set are within the weight range, update the preset weight set with the values of the multiple correction weights.
  • when a preset weight exceeds the weight range, the reliability of the language recognition result obtained according to that preset weight is relatively low. Therefore, using the above method to correct the preset weights within the weight range can suppress the language misrecognition rate.
  • the language identification module is also used to: recognize the input speech to obtain an initial confidence set; multiply the multiple initial confidences in the initial confidence set by the multiple preset weights in the preset weight set respectively to obtain the first recognition confidence set; the language confidence correction module is also used to set the preset weight set according to scene features.
  • the preset weight set is set according to the scene characteristics, so that different scenes can be adapted, and the language recognition result can be obtained with the preset weights that are most suitable for the scene, thereby improving the language recognition ability.
  • the scene features include environment features and/or audio collector features.
  • the environmental characteristics include environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude
  • the audio collector characteristics include microphone arrangement information.
  • the microphone arrangement information refers to whether it is a single microphone or a microphone array, or if it is a microphone array, whether it is a linear array, a planar array, or a stereo array.
  • the environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
  • the language confidence correction module is also used to: obtain multiple test weight sets; input the pseudo-environment data set into the language recognition model, the pseudo-environment data set being obtained according to the scene characteristics and a noise-free data set; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the pseudo-environment data set; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set. Therefore, it can be said that the language confidence correction module has a preset weight setting module.
  • the language confidence correction module is further configured to set multiple preset weights within the weight range.
  • the weight range is set as follows: obtain multiple test data sets; obtain multiple test weight sets; input the test data sets into the language recognition model; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model and the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the test data sets; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; obtain the optimal test weight sets of the multiple test data sets; and obtain the weight ranges of the multiple languages according to the optimal test weight sets of the multiple test data sets.
  • the function of setting the weight range can be realized by the language recognition device 10, in which case it can be said that the language recognition device 10 has a weight range setting module; alternatively, it can be realized by a test device for testing the language recognition device 10.
  • An embodiment of the present application provides a computing device, which includes a processor and a memory, the memory stores program instructions, and when the program instructions are executed by the processor, the processor executes the speech processing method and the language recognition method.
  • the computing device can be understood more from the following description in conjunction with FIG. 17 .
  • An embodiment of the present application provides a computer-readable storage medium, which stores program instructions, and is characterized in that, when the program instructions are executed by a computer, the computer executes the above speech processing method and language recognition method.
  • An embodiment of the present application provides a computer program. When the computer program is executed by a computer, the computer executes the above speech processing method and language recognition method.
  • FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of the present application. Part of the steps in the speech interaction method are the same as the above-mentioned language recognition method, and here, the same content is marked with the same reference numerals, and the description thereof is simplified.
  • In step S10, the input voice information of the user is acquired.
  • Specifically, the input voice information of the user received by the microphone is obtained.
  • On the one hand, in step S40, automatic speech recognition is performed on the input speech using a speech recognition model; on the other hand, in step S12, language recognition is performed on the input speech using a language recognition model.
  • Automatic speech recognition and language recognition may also be performed sequentially.
  • In step S40, in order to be able to recognize input speech in multiple languages, speech recognition models of a plurality of different languages (in this embodiment, five languages: Chinese, English, Korean, German and Japanese) are used to perform speech content recognition processing on the input speech, obtaining multiple texts Ti in different languages.
  • In step S42, the multiple texts Ti are input into the text translation model, which performs translation processing on these texts Ti and converts them into texts Ai of the target language (such as Chinese).
  • In step S44, the multiple texts Ai are sequentially input into the semantic understanding model, which performs semantic understanding processing on these texts Ai to obtain multiple corresponding candidate commands Oi.
  • "Candidate" means a command that has not yet been confirmed for execution.
  • In step S12, language recognition is performed on the input speech to obtain a multilingual first recognition confidence set, in which multiple first recognition confidence levels correspond to multiple languages respectively.
  • For example, the multilingual first recognition confidence set is {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}.
  • In step S14, it is judged whether there is a first recognition confidence level greater than a threshold in the multilingual first recognition confidence set.
  • The threshold here can be set to 0.8, for example.
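The threshold test of step S14 could be sketched as follows; the function name, dictionary representation, and example values are illustrative assumptions, not taken from the patent.

```python
def pick_language(confidences, threshold=0.8):
    """Return the language whose first recognition confidence exceeds
    the threshold, or None if no language qualifies."""
    best_lang = max(confidences, key=confidences.get)
    if confidences[best_lang] > threshold:
        return best_lang
    return None

conf_set = {"Chinese": 0.9, "English": 0.1, "Korean": 0.0,
            "German": 0.0, "Japanese": 0.0}
print(pick_language(conf_set))  # Chinese
```

When no confidence exceeds the threshold, the method falls through to the correction of step S18.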
  • In step S26, the candidate command corresponding to the recognized language (such as Chinese) is selected from the multiple candidate commands Oi obtained in step S44 as the target command to be executed, and the target command is then caused to be executed. For example, when the target command is "turn on the air conditioner", corresponding control is executed to turn on the air conditioner.
  • When the judgment result in step S14 is "No", that is, when there is no first recognition confidence level greater than the threshold (all first recognition confidence levels are smaller than the threshold), in step S18 the first recognition confidence levels are corrected according to the user characteristics.
  • The specific content of the correction has been described in detail above, so it will not be repeated here.
  • In step S22, it is judged whether there is a second recognition confidence level greater than the threshold. When the judgment result is "Yes", in step S24 the language whose second recognition confidence level is greater than the threshold is determined as the recognized language of the input speech; thereafter, the processing in step S26 is performed.
  • In step S28, it is judged whether there is an ASR confidence level greater than the threshold.
  • When the judgment result is "Yes", in step S30 the language whose ASR confidence level is greater than the threshold is determined as the recognized language, and then the processing in step S26 is performed.
  • When the judgment result in step S28 is "No", that is, when there is no ASR confidence level greater than the threshold (all ASR confidence levels are smaller than the threshold), in step S32 it is judged whether there is an NLU confidence level greater than the threshold.
  • When the judgment result is "Yes", in step S34 the language whose NLU confidence level is greater than the threshold is determined as the recognized language, and then the processing in step S26 is executed.
  • When the judgment result in step S32 is "No", information indicating that the voice content recognition has failed may be output, and the processing ends.
  • As described above, the first recognition confidence levels are corrected according to the user characteristics to obtain the second recognition confidence levels, and the language recognition result is determined according to the second recognition confidence levels, so that the language recognition capability can be improved, thereby improving the voice interaction capability.
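The overall decision flow of FIG. 8 (steps S14, S22, S28 and S32) amounts to a fallback cascade over several confidence sets. The sketch below is illustrative only; flat dictionaries stand in for the confidence sets, and the function name is an assumption.

```python
def resolve_language(confidence_sets, threshold=0.8):
    """Scan the confidence sets in fallback order (first recognition
    confidence, corrected second confidence, ASR confidence, NLU
    confidence); return the first language whose confidence exceeds
    the threshold, or None if recognition fails entirely."""
    for conf_set in confidence_sets:
        lang = max(conf_set, key=conf_set.get)
        if conf_set[lang] > threshold:
            return lang
    return None

first = {"Chinese": 0.6, "English": 0.3}    # no value above the threshold
second = {"Chinese": 0.85, "English": 0.1}  # corrected set succeeds
print(resolve_language([first, second]))    # Chinese
```

In the patent's method the later sets are only computed when the earlier judgments fail; evaluating them lazily would be a straightforward refinement of this sketch.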
  • FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application.
  • The voice interaction system (or voice interaction device) 20 has a voice recognition module 110, a language recognition module 12, a text translation module 130, a semantic understanding module 140, an input voice acquisition module 17, a language confidence correction module 16 and a control module 170.
  • the voice interaction device is used to execute the voice interaction method described with reference to FIG. 8 , therefore, the description of the specific processing flow is omitted here.
  • Like the above-mentioned language recognition device 10, the voice interaction system 20 has a language recognition module 12, an input speech acquisition module 17 and a language confidence correction module 16; these are marked with the same reference numerals and their description is omitted.
  • the voice interaction system may further include an execution device, such as a loudspeaker, a display device, and the like.
  • the speech recognition module 110 executes step S40 in FIG. 8 .
  • the language identification module 12 executes step S12 in FIG. 8 .
  • the text translation module 130 executes step S42 in FIG. 8 .
  • the semantic understanding module 140 executes step S44 in FIG. 8 .
  • the input speech acquisition module 17 executes step S10 in FIG. 8 .
  • the language confidence correction module 16 executes step S18 in FIG. 8 .
  • the control module 170 executes step S14 , step S16 , step S22 , step S24 , step S28 , step S30 , step S32 , and step S34 in FIG. 8 .
  • Step S14 , Step S16 , Step S22 , and Step S24 may also be executed by the language confidence correction module 16 .
  • The speech interaction method described with reference to FIG. 8 essentially includes a multilingual speech recognition method capable of recognizing input speech in multiple languages; correspondingly, the voice interaction system includes a speech recognition device for implementing the multilingual speech recognition method. Because there is much repeated content, no separate embodiments are given here to describe the speech recognition method and speech recognition device.
  • the voice interaction system 100 and the voice interaction method executed by it according to an embodiment of the present application will be described below with reference to FIGS. 11-17 .
  • Here, the voice interaction system 100 applied in a car to form a vehicle voice interaction system is taken as an example; the voice interaction system can also be applied to scenarios such as voice question and answer, intelligent voice analysis, and real-time voice monitoring and analysis.
  • the vehicle voice interaction system also constitutes a vehicle control device.
  • embodiments of the present application provide a voice processing method, device, and system.
  • the voice interaction system 100 of this embodiment can receive the input voice of the user (that is, the speaker), and perform corresponding processing in response to the content of the input voice, such as turning on the air conditioner, opening the car window, and other processing.
  • the voice interaction system 100 can respond to voices in multiple different languages. For example, in this embodiment, it can respond to voices in five languages: Chinese, English, Korean, German, and Japanese.
  • Voices of different languages include not only voices of different language families (for example, Chinese and English belong to different languages), but also voices of different minor languages under the same language family (for example, Mandarin and Cantonese of Chinese also belong to different languages).
  • FIG. 11 is a schematic illustration of a voice interaction system according to an embodiment of the present application.
  • the voice interaction system 100 has a voice recognition module 110, a language recognition module 120, a text translation module 130, a semantic understanding module 140, a command analysis and execution module 150, and a language confidence correction module 160.
  • the voice interaction system 100 can also have a microphone , speaker, camera or display etc.
  • Fig. 12 is a flowchart for illustrating one procedure of the voice interaction method involved in an embodiment. A processing flow of the voice interaction system 100 is described below with reference to FIG. 12 , so as to roughly illustrate the architecture of the voice interaction system 100 .
  • The voice interaction system 100 acquires the voice (called the input voice) through the microphone.
  • On the one hand, (1) the input voice is input into the voice recognition module 110, which calls the speech recognition sub-modules of a plurality of different languages (in the present embodiment, the five languages Chinese, English, Korean, German and Japanese; self-evidently, other numbers of languages are also possible) to perform speech content recognition on the input voice, so that multiple texts Ti in different languages are obtained.
  • (2) The multiple texts Ti are input into the text translation module 130, which performs translation processing on these texts Ti and converts them into texts Ai of the target language (e.g. Chinese).
  • (3) The multiple texts Ai are sequentially input into the semantic understanding module 140, which performs semantic understanding processing on these texts Ai to obtain multiple corresponding candidate commands Oi.
  • On the other hand, the language recognition module 120 performs language recognition processing on the input voice, generates initial confidence levels of multiple languages, and multiplies each initial confidence level by the corresponding preset weight to obtain the recognition confidence levels of the multiple languages.
  • The command analysis and execution module 150 determines, from the multiple candidate commands Oi, the candidate command Oi corresponding to the language whose recognition confidence is greater than the threshold α as the target command to be executed, and performs corresponding processing according to the content of the target command.
  • In addition, the language confidence correction module 160 corrects the recognition confidence levels of the multiple languages according to user characteristics; the specific content will be described in detail later.
  • The speech recognition module 110, the language recognition module 120, the text translation module 130 and the semantic understanding module 140 respectively include an algorithm model, namely a speech recognition model, a language recognition model, a text translation model and a semantic understanding model, which respectively perform speech recognition processing, language recognition processing, text translation processing and semantic understanding processing.
  • the speech recognition module 110 is used to convert human speech, that is, speech to be recognized, into text in a corresponding language, which can also be said to predict speech content or perform automatic speech recognition (Automatic Speech Recognition, ASR).
  • the speech recognition module 110 has a plurality of speech recognition sub-modules, and each speech recognition sub-module corresponds to a language respectively, and is used to convert the speech into the text Ti of the corresponding language.
  • these sub-modules output the text Ti as the prediction result and the confidence of the text Ti.
  • This confidence is called the ASR confidence.
  • The ASR confidence represents the prediction probability of the text predicted by the sub-module, that is, the predicted probability of the speech content.
  • the text translation module 130 is used to convert text in one natural language (source language) into text in another natural language (target language), for example, convert English text into Chinese text.
  • the text translation module 130 has a plurality of text translation sub-modules, and each text translation sub-module corresponds to a language respectively.
  • The text translation sub-modules of the four non-target languages are respectively used to translate English text, Korean text, German text and Japanese text into Chinese texts Ai.
  • the text translation module 130 may not process the input Chinese text and output the input Chinese text as it is.
  • the final text translation module 130 outputs 5 Chinese texts Ai.
  • The semantic understanding module 140 is used for performing natural language understanding (Natural Language Understanding, NLU) on the text of the target language; it can also be said to predict the intent of the text and generate commands that can be understood by the machine. For example, if the text is "Please play the song XX", after passing through the semantic understanding module 140 the machine obtains the intent "Please play the song XX". While generating a command, the semantic understanding module 140 also generates an NLU confidence, which represents the predicted probability of the meaning of the text by the semantic understanding module 140. In addition, since the speech recognition module 110 outputs texts in five languages, the semantic understanding module 140 eventually generates commands corresponding to the five languages and five NLU confidence levels. Furthermore, these commands output by the semantic understanding module 140 have not yet been determined to be executed, so they are called candidate commands.
  • the language identification (Language Identification, LID) module is used to identify the language of the user's input voice, that is, the voice to be recognized. It can also be said to predict which one of the multiple languages the user's input voice belongs to.
  • The language recognition module 120 recognizes which language the input speech belongs to among Chinese, English, Korean, German and Japanese, and outputs a set of recognition confidence levels of the multiple languages as the recognition result; each recognition confidence level represents the predicted probability for a language.
  • Specifically, algorithmic recognition is performed on the input speech to obtain the confidence levels of multiple languages (called the initial confidence levels), and the initial confidence levels of the multiple languages are respectively multiplied by the corresponding preset weight values to obtain the recognition confidence levels of the multiple languages; the language recognition module 120 outputs these recognition confidence levels as the prediction result.
  • the calculation of multiplying the initial confidence by the preset weight value may or may not be performed by the language recognition model.
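Whichever component performs it, the weighting calculation itself is a per-language multiplication. A minimal illustrative sketch follows; the function name, language keys, and values are assumptions, not taken from the patent.

```python
def recognition_confidences(initial_conf, preset_weights):
    """Multiply each language's initial confidence by its preset weight
    to obtain the recognition confidence set."""
    return {lang: initial_conf[lang] * preset_weights[lang]
            for lang in initial_conf}

initial = {"Chinese": 0.9, "English": 0.8}
weights = {"Chinese": 1.0, "English": 0.5}
print(recognition_confidences(initial, weights))  # {'Chinese': 0.9, 'English': 0.4}
```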
  • the command parsing and execution module 150 is used for selecting a target command to be executed from the candidate commands output by the semantic understanding module 140 according to the output of the language identification module 120 .
  • The command analysis and execution module 150 determines the language whose recognition confidence is greater than the threshold α as the language of the user's input voice, and determines the candidate command corresponding to that language as the target command to be executed.
  • For example, when the confidence levels of the multiple languages output by the language identification module 120 are {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}, Chinese is determined as the language of the user's input voice, and the candidate command corresponding to Chinese is determined as the target command to be executed.
  • The command parsing and execution module 150 executes the control for causing the target command to be executed; for example, when the determined target command to be executed is "please play the song XX" and the voice interaction system 100 includes a music player module, the command analysis and execution module 150 controls the music player module to play the song XX.
  • When the music player module does not belong to the voice interaction system 100 and is not controlled by the command analysis and execution module 150, the determined target command to be executed can be sent to the upper controller of the voice interaction system 100 and the music player module, and the upper controller sends commands to the controller of the music playing module to realize playing the song XX.
  • In addition, the command analysis and execution module 150 can respond to the user through a speaker or display; for example, when the determined target command to be executed is "please play the song XX", the command analysis and execution module 150 controls the speaker to emit the sound "OK, I will play it for you soon" in response to the user.
  • In some cases, the command analysis and execution module 150 uses other methods to determine the target command to be executed; these methods are illustrated below.
  • Mode 1: the language confidence correction module 160 performs correction calculation on the recognition confidence levels of the above-mentioned multiple languages (corresponding to the "first recognition confidence levels" in this application), and the command analysis and execution module 150 performs corresponding processing according to the output of the language confidence correction module 160. For example, the correction can be made according to user characteristics, where the user characteristics include the user's historical language records and user-specified languages. The user identity can be determined from audio features (i.e., the voiceprint), and the historical language records and user-specified languages can then be obtained by querying the historical language record database and user-specified language database of the voice interaction system 100 according to the user identity. The specific content of these corrections will be described in detail later.
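One simple form such a user-characteristic correction could take is boosting the confidence of any language found in the user's historical language records or explicitly specified by the user. The boost factor, the cap at 1.0, and all names below are illustrative assumptions of this sketch, not specified by the patent.

```python
def correct_confidences(first_conf, history_langs=(), specified_lang=None,
                        boost=1.2):
    """Return second recognition confidence levels: languages found in
    the user's historical language records or specified by the user get
    their confidence boosted (capped at 1.0); the rest are unchanged."""
    corrected = {}
    for lang, conf in first_conf.items():
        if lang == specified_lang or lang in history_langs:
            corrected[lang] = min(1.0, conf * boost)
        else:
            corrected[lang] = conf
    return corrected

first = {"Chinese": 0.7, "English": 0.2}
# Chinese appears in the user's history, so its confidence rises above 0.8
print(correct_confidences(first, history_langs=("Chinese",)))
```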
  • The command analysis and execution module 150 determines the language of the input speech according to the corrected recognition confidence levels, and performs corresponding processing. For example, when there is a recognition confidence greater than the threshold α among the corrected recognition confidence levels of the multiple languages, the language whose recognition confidence is greater than the threshold α is determined as the language of the user's input voice, and the candidate command corresponding to the determined language is determined as the target command to be executed.
  • When correcting, the value of the recognition confidence can be corrected directly, or the preset weights can be corrected and the recognition confidence levels then recalculated according to the initial confidence set and the corrected preset weight set.
  • the language confidence correction module 160 has an audio feature-based adjustment module 162 , a video feature-based adjustment module 163 and a comprehensive adjustment module 164 , and these adjustment modules are used to correct the language confidence in different ways.
  • Mode 2: the command parsing and execution module 150 determines the target command to be executed according to the ASR confidence levels output by the speech recognition module 110 or the NLU confidence levels output by the semantic understanding module 140. For example, when there is an ASR confidence greater than the ASR confidence threshold (which can be set to the same value as the above-mentioned threshold α, such as 0.8), the language corresponding to that ASR confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
  • Similarly, when there is an NLU confidence greater than the NLU confidence threshold, the language corresponding to that NLU confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
  • The execution timing of Mode 2 can be set freely.
  • Optionally, it can be executed before Mode 1, or between the multiple approaches listed in the description of Mode 1.
  • Mode 3: the command parsing and execution module 150 determines the language of the input speech by feature similarity. For example, the current input voice is compared with the audio data of historical input voices in the history record, and the feature similarity between the two is obtained through cosine similarity, linear regression or deep learning. When the feature similarity exceeds a threshold, the recognized language of the historical input voice can be determined as the language of the current input voice.
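A minimal sketch of the cosine-similarity option mentioned above, assuming the input voices have already been reduced to numeric feature vectors; the vectors and the 0.9 threshold are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two audio feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

current = [0.9, 0.1, 0.3]     # features of the current input voice
historical = [0.8, 0.2, 0.4]  # features of a historical input voice
if cosine_similarity(current, historical) > 0.9:  # hypothetical threshold
    print("reuse the recognized language of the historical input voice")
```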
  • The execution timing of Mode 3 can be set freely. Optionally, it can be executed after or before Mode 1 and Mode 2, between Mode 1 and Mode 2, or between the multiple approaches exemplified in the description of Mode 1.
  • the confidence correction module will be described below.
  • the language confidence correction module 160 includes a real-time scene adaptation module 161 , an audio feature-based adjustment module 162 , a video feature-based adjustment module 163 and a comprehensive adjustment module 164 .
  • the real-time scene adaptation module 161 is used to initialize the multilingual preset weight set according to the environmental characteristics and the characteristics of the audio collector (ie, the microphone) when the language recognition model initially contacts the scene.
  • the initial contact scene here is, for example, when the user has just purchased a voice interaction system or a vehicle. At this time, the user generally turns on the voice interaction system to perform some basic settings or tests.
  • the real-time scene adaptation module 161 can use this opportunity to initialize the preset weight set.
  • The initialization of the preset weight set is not limited to being performed when initially contacting the scene; it can also be performed at other appropriate times, for example when a new audio collector is installed, or at a time chosen by the user.
  • the video feature-based adjustment module 163 is configured to modify the recognition confidence sets of multiple languages according to the captured user images.
  • the comprehensive adjustment module 164 is mainly used to modify the recognition confidence sets of multiple languages according to the language specified by the user.
  • the user-specified language is obtained by querying the database of the voice interaction system 100 according to the voiceprint of the input voice.
  • When there is a recognition confidence greater than the threshold α among the corrected recognition confidence levels of the multiple languages, the confidence correction module performs correction calculation on the preset weight sets of the multiple languages so that the preset weight of the language whose recognition confidence is greater than the threshold α is increased relative to the preset weights of the other languages; in this way, a corrected weight set is obtained. Afterwards, the confidence correction module judges whether each corrected weight in the corrected weight set is within the weight range, and when the judgment result is within the weight range, the values in the corrected weight set are used to update the preset weight set for the language recognition module 120 to use in subsequent language recognition.
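The correction-and-range-check described above could be sketched as follows; the boost factor and the renormalization step are assumptions of this sketch, not specified by the patent.

```python
def update_preset_weights(weights, recognized_lang, weight_ranges,
                          factor=1.1):
    """Raise the recognized language's weight relative to the others,
    renormalize (an assumption of this sketch), and accept the update
    only if every corrected weight stays inside its weight range."""
    corrected = dict(weights)
    corrected[recognized_lang] *= factor
    total = sum(corrected.values())
    corrected = {lang: w / total for lang, w in corrected.items()}
    in_range = all(weight_ranges[lang][0] <= w <= weight_ranges[lang][1]
                   for lang, w in corrected.items())
    return corrected if in_range else weights

ranges = {"zh": (0.4, 0.6), "en": (0.4, 0.6)}
updated = update_preset_weights({"zh": 0.5, "en": 0.5}, "zh", ranges)
print(updated["zh"] > updated["en"])  # True
```

If any corrected weight falls outside its range, the update is rejected and the original preset weights are kept.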
  • The functions of the speech recognition module 110, the language recognition module 120, the text translation module 130, the semantic understanding module 140, the command analysis and execution module 150 and the confidence correction module can be implemented by a processor executing a program (software) stored in a memory, or by hardware such as an LSI (Large Scale Integration) circuit or an ASIC (Application Specific Integrated Circuit).
  • these modules can be formed by an electronic control unit (ECU).
  • one module can be formed by one ECU, or multiple ECUs, or one ECU can be used to form multiple modules.
  • ECU refers to a control device composed of integrated circuits used to implement a series of functions such as data analysis, processing and transmission.
  • An embodiment of the present application provides an electronic control unit (ECU); the ECU includes a microcomputer, an input circuit, an output circuit and an analog-to-digital (A/D) converter.
  • the main function of the input circuit is to preprocess the input signal (such as the signal from the sensor), and the processing method is different for different input signals.
  • the input circuit may include an input circuit that processes analog signals and an input circuit that processes digital signals.
  • the main function of the A/D converter is to convert the analog signal into a digital signal. After the analog signal is preprocessed by the corresponding input circuit, it is input to the A/D converter for processing and converted into a digital signal accepted by the microcomputer.
  • the output circuit is a device that establishes a connection between the microcomputer and the actuator. Its function is to convert the processing results sent by the microcomputer into control signals to drive the actuators to work.
  • the output circuit generally uses a power transistor, which controls the electronic circuit of the actuator by turning on or off according to the instructions of the microcomputer.
  • The microcomputer includes a central processing unit (CPU), a memory and an input/output (I/O) interface; the CPU is connected with the memory and the I/O interface through a bus, and they can exchange information with each other through the bus.
  • the memory may be a memory such as a read-only memory (ROM) or a random access memory (RAM).
  • The I/O interface is a connection circuit for exchanging information between the CPU and the input circuit, output circuit or A/D converter; specifically, the I/O interface can be divided into a bus interface and a communication interface.
  • the memory stores programs, and the CPU calls the programs in the memory to realize the functions of the above modules, or execute the methods described with reference to Fig. 3 , Fig. 4 , Fig. 6 , Fig. 8 , and Fig. 12 .
  • the voice interaction system 100 also has a microphone, a speaker, a camera or a display.
  • the microphone is used to acquire the user's input voice, which corresponds to the voice acquisition module in this application.
  • the speaker is used to play sounds, such as the response tone "OK" to the user's input voice.
  • the camera is used to collect the user's facial image, etc., and send the collected image to the command analysis and execution module 150.
  • the command analysis and execution module 150 can perform image recognition on the image, so as to authenticate the user's identity.
  • the display is used to respond according to the user's input voice, for example, when the input voice is "play the song "XX", the display will display the playing screen of the song.
  • the voice interaction system 100 will be described in more detail below in conjunction with the description of the actions and processing flow of the voice interaction system 100 .
  • The voice interaction method involved in this embodiment will be described at the same time; it can also be seen from the following description that the voice interaction method includes a language recognition method (corresponding to the language recognition module 120, part of the processing of the command analysis and execution module 150, the processing of the confidence correction module, etc.).
  • the language recognition module 120 uses the language recognition model to perform language recognition.
  • the multilingual preset weight set is initialized according to the environment feature and the audio collector feature. An example of an initialization method will be described below with reference to FIG. 13 .
  • the real-time scene adaptation module 161 generates a quasi-environment dataset according to environmental features, audio collector features and expert datasets.
  • the environmental characteristics include, for example, the environmental signal-to-noise ratio, microphone power source information (DC-AC information), or environmental vibration amplitude, and the like.
  • the information on the power source of the microphone can be obtained, for example, through a controller area network (Controller Area Network, CAN) signal of the vehicle.
  • the characteristics of the audio collector mainly include microphone arrangement information (single microphone or microphone array, wherein the microphone array includes linear array, planar array and stereo array).
  • the expert data set is a batch of multi-person, multilingual, and noise-free audio data sets collected in advance, and its content (the language of each piece of voice data) is pre-recorded and known.
  • N different multilingual confidence weight sets, confidence weight set 1 to confidence weight set N, are obtained; for example, confidence weight set 1 {Chinese: 0.80; English: 0.04; Korean: 0.06; Japanese: 0.05; German: 0.05}, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18} and confidence weight set N {Chinese: 0.31; English: 0.09; Korean: 0.12; Japanese: 0.25; German: 0.23}.
  • The quasi-environment data set is input into the language recognition model to obtain the multilingual initial confidence set ({Chinese: p1; English: p2; Korean: p3; Japanese: p4; German: p5} in FIG. 13), and the initial confidence set is multiplied by each of the N multilingual confidence weight sets 1 to N to obtain N recognition confidence sets. Since the content of the expert data set (the language of each piece of speech data) is known, the accuracy rate acc of each of the N recognition confidence sets can be calculated; the confidence weight set corresponding to the recognition confidence set with the highest accuracy is determined as the optimal confidence weight set, and the preset weight set is set with the values of the optimal confidence weight set to complete the initialization of the preset weight set.
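The selection of the optimal confidence weight set by accuracy acc could be sketched as follows; all names and data below are illustrative assumptions, not taken from the patent.

```python
def best_weight_set(initial_conf_sets, true_langs, candidate_weight_sets):
    """Score each candidate confidence weight set by its prediction
    accuracy on the expert data set and return the best-scoring one."""
    def accuracy(weights):
        correct = 0
        for conf, truth in zip(initial_conf_sets, true_langs):
            scored = {lang: conf[lang] * weights[lang] for lang in conf}
            if max(scored, key=scored.get) == truth:
                correct += 1
        return correct / len(true_langs)
    return max(candidate_weight_sets, key=accuracy)

# Two hypothetical utterances with known languages, and two candidates:
initial_conf_sets = [{"zh": 0.5, "en": 0.5}, {"zh": 0.4, "en": 0.6}]
true_langs = ["zh", "zh"]
candidates = [{"zh": 1.0, "en": 1.0}, {"zh": 2.0, "en": 1.0}]
print(best_weight_set(initial_conf_sets, true_langs, candidates))
```

The winning weight set is then written into the preset weight set, completing the initialization.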
  • In this way, the speech interaction system 100 can be adjusted according to different scenarios and perform language recognition with the best possible recognition accuracy, thereby improving the reliability of the recognition result. That is, adopting the above technical means can suppress the problem that "the trained language recognition model does not adapt well to the scene, resulting in low reliability of the recognition result".
  • the preset weight set is initialized according to both the environment feature and the audio collector feature.
  • the preset weight set may be initialized only according to one of the environment feature and the audio collector feature.
  • in step S212 of FIG. 14, it is determined whether the preset weight of each language in the set preset weight set is within the weight range.
  • the weight range is preset, and its specific value can be determined through testing, which will be described later.
  • when the preset weights are within the weight range, high reliability of the recognition result of the language recognition model in this environment (the above-mentioned environment characteristics and audio collector characteristics) can be guaranteed.
  • when the confidence weight set obtained from the simulated environment data set is not within the weight range ("No" in step S212), the reliability of the result of the language recognition model in this environment is low.
  • in step S214, it is judged whether there is a historical language record; if there is, the input voice in the historical record is compared with the user's current input voice to obtain a feature similarity, thereby determining the language of the current input voice.
  • in step S217, an inquiry is made based on the voiceprint to determine whether the user has specified a language; if there is a user-specified language, the recognition language of the user's input voice is determined according to the user-specified language.
  • when there is one user-specified language, it is determined as the recognition language of the input voice; when there are multiple user-specified languages, for example, the one that appears most frequently is determined as the recognition language of the input voice.
  • for example, if the user-specified languages inquired are {Chinese: 3 times; English: 1 time; German: 1 time}, Chinese is determined as the recognition language of the input voice.
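The "most frequently specified language" rule above can be sketched in one line (the helper name and the list-of-records input format are assumptions for illustration):

```python
from collections import Counter

def most_specified_language(records):
    """Pick the most frequently appearing language from the queried
    user-specified-language records (hypothetical helper)."""
    return Counter(records).most_common(1)[0][0]
```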
  • in step S219, the language of the input speech is determined according to the recognition result of the language recognition model.
  • step S214 is executed before step S217; however, there is no limitation on the execution order of the processing of determining the language through the historical language record and the processing of determining the language through the user-specified language.
  • in this way, the language of the user's input voice is predicted according to the historical language records or the user-specified language, thereby improving the reliability with which the voice interaction system 100 predicts the language of the input voice.
  • in step S212, when the weight value of each language in the preset weight set set according to the simulated environment data set is within the weight range ("Yes" in step S212), then, when the user's input voice is detected, language recognition is performed on the input speech with the language recognition model in step S200. In step S221, it is judged whether the multilingual recognition confidence set obtained from the language recognition model contains a recognition confidence greater than the threshold. If it does ("Yes" in step S221), the user identity is determined through the voiceprint in step S222. As another embodiment, the identity of the user may also be determined by face recognition or iris recognition.
  • in step S223, the user's historical language record and the current dialogue round language record are updated (that is, the current language is added to the records).
  • in step S225, the multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result.
  • the current dialogue round refers to one cycle of continuously listening to (receiving) the user's input voice, for example, the period from one turn-on to turn-off of the language recognition system or the voice interaction system.
  • the language confidence correction module 160 calls the audio feature-based adjustment module 162 or the video feature-based adjustment module 163 to correct the multilingual recognition confidence set.
  • first, the audio feature-based adjustment module 162 is called to perform correction; when the multilingual recognition confidence set corrected by the audio feature-based adjustment module 162 still contains no recognition confidence greater than the threshold, the video feature-based adjustment module 163 is then called to perform correction.
  • the video feature-based adjustment module 163 may also be called first.
  • in step S231, the user identity is determined through the voiceprint; in step S232, the user's historical language records are queried, and if historical language records exist, the distribution of each language in the historical language records is calculated.
  • for example, if the historical language records are {Chinese: 8; English: 1; Korean: 0; Japanese: 1; German: 0}, the distribution of each language (which can also be regarded as a normalization into weight form) is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}.
  • the time range of historical language records can be set freely, such as the current dialogue round, a few days, a few months or longer.
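The distribution calculated in step S232 can be sketched as a simple normalization of the raw counts from the historical language record (the function name is illustrative):

```python
def language_distribution(history_counts):
    """Normalize raw counts from the historical language record into
    a distribution (weight-style values summing to 1)."""
    total = sum(history_counts.values())
    return {lang: count / total for lang, count in history_counts.items()}
```

With the example counts above, this reproduces the distribution {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}.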
  • historical language records of all time nodes and historical language records of the current dialogue round (abbreviated as current dialogue round language records) can be stored in advance, and the following processing can be executed separately according to each of them.
  • in this case, for example, the result obtained according to the historical language record of the current dialogue round is given priority, considering that the credibility of the result obtained from the current dialogue round's language record is relatively higher.
  • different weight values may be assigned to the two to perform the calculation in step S236 described below.
  • in step S236, the distribution of each language in the historical language records is used to perform correction calculation on the multilingual recognition confidence set, obtaining the corrected multilingual recognition confidence set (corresponding to the second recognition confidence in this application).
  • for example, if the multilingual initial confidence set is {Chinese: 0.7; English: 0.1; Korean: 0.1; Japanese: 0.05; German: 0.05}, the preset confidence weights are {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25}, and the language distribution is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}, the calculated final recognition confidence set (after normalization) is {Chinese: 0.973; English: 0.017; Korean: 0.000; Japanese: 0.010; German: 0.000}.
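One plausible reading of this correction step, which reproduces the figures in the example above up to rounding, is an element-wise product of initial confidence, preset weight and historical language distribution, followed by normalization; the function name is an assumption:

```python
LANGS = ["Chinese", "English", "Korean", "Japanese", "German"]

def correct_with_history(initial, preset_weights, history_dist):
    # Element-wise product of the three per-language factors,
    # then normalize so the corrected confidences sum to 1.
    raw = {l: initial[l] * preset_weights[l] * history_dist[l] for l in LANGS}
    total = sum(raw.values())
    return {l: v / total for l, v in raw.items()}
```

With the example inputs this yields roughly {Chinese: 0.974; English: 0.017; Korean: 0.000; Japanese: 0.009; German: 0.000}, matching the text's figures to within rounding.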
  • in step S237, it is judged whether the corrected multilingual recognition confidence set contains a corrected confidence greater than the threshold. If it does ("Yes" in step S237), on the one hand, in step S239 the corrected multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result, and the historical language record and the current dialogue round language record are updated (see steps S222 and S223 in FIG. 14); on the other hand, in step S238, the confidence weights are adjusted. Specifically, based on the corrected recognition confidence set, the confidence weights of the languages whose corrected recognition confidence is greater than the threshold are increased relative to the confidence weights of the other languages.
  • for example, the old confidence weight set {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25} is amended to a new confidence weight set (also called a modified weight set) {Chinese: 0.29; English: 0.24; Korean: 0.24; Japanese: 0.24; German: 0.24}.
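One simple adjustment scheme that reproduces the example numbers (this specific redistribution rule is an assumption, not stated in the text) is to shift a small amount from each other language's confidence weight to the winning language's weight:

```python
def boost_winner_weight(weights, winner, delta=0.01):
    """Shift delta from each non-winning language's confidence weight to
    the winning language's weight (illustrative rule, delta assumed)."""
    adjusted = dict(weights)
    others = [l for l in adjusted if l != winner]
    for l in others:
        adjusted[l] -= delta
    adjusted[winner] += delta * len(others)
    return adjusted
```

With five languages at 0.25 each and Chinese as the winner, this gives {Chinese: 0.29; others: 0.24}, matching the example.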
  • in step S271, it is judged whether each modified weight in the modified weight set is within the weight range.
  • if each weight is within the range, the modified weight set is set to be used by the language identification module 120 for subsequent language identification, and the process ends.
  • otherwise, the preset weight set is not updated, and the process ends.
  • when the judgment result in step S235 is "No" or the judgment result in step S237 is "No", the language confidence correction module 160 calls the comprehensive adjustment module 164 for processing.
  • in step S251, according to the output of the speech recognition module 110, it is judged whether there is an ASR confidence greater than the threshold; if there is, in step S252 the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result.
  • in this case, the command parsing and execution module 150 can determine the language corresponding to the ASR confidence greater than the threshold as the language of the input speech (which can be called the recognition language), and determine the candidate command corresponding to that language as the target command to execute.
  • when the judgment result in step S251 is "No", according to the output of the semantic understanding module 140, it is judged whether there is an NLU confidence greater than the threshold.
  • if there is, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set as the language recognition result to the command parsing and execution module 150.
  • the command parsing and execution module 150 then determines the language corresponding to the NLU confidence greater than the threshold as the language of the input speech, and determines the candidate command corresponding to that language as the target command to be executed.
  • in step S256, it is judged according to the voiceprint (user identity) whether there is a user-specified language.
  • the user-specified language here is the type of system language of the voice interaction system 100 set by the user.
  • the multilingual recognition confidence set is corrected according to the user-specified language, so that the recognition confidence of each user-specified language is increased relative to the recognition confidences of the other languages, thereby obtaining the corrected multilingual recognition confidence set (corresponding to the second recognition confidence in this application).
  • there may be multiple user-specified languages (that is, there are multiple languages in the system language history records stored in the database). For example, referring to the right part in FIG., the old recognition confidence set {Chinese: 0.75; English: 0.12; Korean: 0.11; Japanese: 0.01; German: 0.01} is revised to the new recognition confidence set {Chinese: 0.95; English: 0.32; Korean: 0.11; Japanese: 0.01; German: 0.21}.
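A hedged sketch of this correction: if the user-specified languages are taken to be Chinese, English and German, and the increment to be 0.20 (both inferred from the example numbers, not stated explicitly in the text), the revision can be written as:

```python
def apply_user_specified(confidences, specified_langs, delta=0.20):
    """Raise the recognition confidence of every user-specified language
    by a fixed increment (delta and the specified set are illustrative)."""
    return {l: c + (delta if l in specified_langs else 0.0)
            for l, c in confidences.items()}
```

This reproduces the revision of the old set {0.75, 0.12, 0.11, 0.01, 0.01} to the new set {0.95, 0.32, 0.11, 0.01, 0.21} in the example.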
  • the language specified by the user is an example of a user operation record.
  • as another example of a user operation record, the language of the songs the user has historically played can also be cited.
  • the multilingual preset weight set may be updated to be used for language recognition of the subsequent input speech.
  • the update method is the same as that described with reference to FIG. 14 , and will not be repeated here.
  • in step S259, it is judged whether the multilingual recognition confidence set corrected in step S258 contains a recognition confidence greater than the threshold; if yes, in step S261 the corrected multilingual recognition confidence set is output to the command parsing and execution module as the language recognition result.
  • in step S262, the multilingual preset confidence weight set is adjusted.
  • the adjustment method is the same as the method explained above, and will not be repeated here.
  • the multilingual preset confidence weight set is updated.
  • in step S264, it is determined based on the user's identity whether there is a historical language record of the user.
  • if the judgment result is that there is a historical language record of the user, in step S256 the user's current input voice is compared with the input voice in the historical language record to obtain a feature similarity, and the language closest to the unknown input voice is found according to the feature similarity.
  • when the judgment result in step S264 is "No", that is, there is no historical language record of the user, the comprehensive adjustment module 164 directly outputs the multilingual recognition confidence set as the language recognition result to the command parsing and execution module 150.
  • in this case, the command parsing and execution module 150 may consider that the language of the input voice cannot be recognized, and may, for example, feed this back to the user by playing a voice prompt.
  • the multilingual recognition confidence set is adjusted according to user characteristics including historical language records or user-specified languages, so that the voice interaction system 100 can improve the accuracy of predicting the language of the input voice and increase the user's confidence in the intelligence of the voice interaction system 100.
  • as mentioned above, a weight range is used: when the real-time scene adaptation module 161 initializes the preset weight set, it is judged whether the initialized preset weights are within the weight range; likewise, when the video feature-based adjustment module 163 or the comprehensive adjustment module 164 intends to update the preset weights, it is determined whether the updated preset weights are within the weight range. It can be seen that the "weight range" reflects the robustness range of the model itself.
  • This embodiment also provides a method for setting the "weight range". This method is implemented, for example, in the testing phase before the voice interaction system 100 leaves the factory. In addition, it can also be implemented in the offline inspection phase after leaving the factory.
  • the method mainly includes the following steps:
  • n is the number of language datasets
  • m is the number of languages.
  • the language data sets data_1, data_2, ..., data_n correspond to the test data sets in this application.
  • this embodiment provides an implementation solution as shown in FIG. 10 , but it is not limited to this solution.
  • This scheme will be described below with reference to FIG. 10 .
  • by doing so, the confidence weight set of each single language can be obtained, and the confidence weight range of each single language can be obtained, such as the Chinese confidence weight range [c_a, c_b], the English confidence weight range [e_a, e_b], the Korean confidence weight range [h_a, h_b], the Japanese confidence weight range [r_a, r_b], and the German confidence weight range [d_a, d_b] shown in the figure.
  • in this way, the language recognition model is tested with a large number of language data sets to set the weight range of the multilingual preset weight set, that is, to specify the robustness range of the language recognition model, so that the language recognition model works within this range, thereby ensuring the reliability of the language recognition results.
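A sketch of deriving the per-language weight ranges from the n test data sets (the function name and the min/max rule are illustrative assumptions consistent with the ranges [c_a, c_b], [e_a, e_b], etc. above):

```python
def per_language_weight_range(optimal_weight_sets):
    """Given the optimal confidence weight set found for each of the n
    language data sets, take the per-language minimum and maximum as that
    language's confidence weight range (e.g. [c_a, c_b] for Chinese)."""
    langs = optimal_weight_sets[0].keys()
    return {l: (min(ws[l] for ws in optimal_weight_sets),
                max(ws[l] for ws in optimal_weight_sets))
            for l in langs}
```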

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of intelligent vehicles. Provided are a speech processing method and apparatus, and a system. The method comprises: acquiring input speech information of a user; according to the input speech information, determining a plurality of first confidence levels corresponding to the input speech information, wherein the plurality of first confidence levels respectively correspond to a plurality of languages; correcting the plurality of first confidence levels to a plurality of second confidence levels according to a user feature of the user; and determining the language of the input speech information according to the plurality of second confidence levels. By using the speech processing method, the language of input speech information of a user is determined on the basis of taking a user feature into consideration, and therefore the language recognition accuracy can be improved, and the speech recognition capability can also be improved.

Description

Speech processing method, device and system

Technical field

The present application relates to the technical field of artificial intelligence, and in particular to a speech processing method, device and system.

Background art

With the development of computer technology, speech recognition technology has been applied more and more widely. In addition, with the deepening of globalization, scenes in which people speaking different languages work and live together often arise. For example, on an international flight, the passengers often come from different countries or regions and do not speak the same language. Or, in a country such as Singapore, English is the first foreign language of most locals; at the same time, because there are many Chinese and ethnic Chinese residents, Chinese is usually their daily communication language. Thus, both English and Chinese may appear in communication. Therefore, to cope with this situation, a technology capable of recognizing speech in different languages (for example, capable of recognizing both Chinese speech and English speech) has emerged.

However, the recognition results of this technology sometimes fluctuate, that is, its speech recognition capability is low, and there is still room for improvement in this regard.
Summary of the invention

The present application provides a speech processing method, device and system capable of improving speech recognition capability, so as to improve the accuracy of speech recognition.

A first aspect of the present application relates to a speech processing method, including: acquiring input speech information of a user; determining, according to the input speech information, a plurality of first confidence levels corresponding to the input speech information, the plurality of first confidence levels respectively corresponding to a plurality of languages; correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user feature of the user; and determining the language of the input speech information according to the plurality of second confidence levels.

With the speech processing method described above, the plurality of first confidence levels are corrected into a plurality of second confidence levels according to the user feature of the user, and the language of the input speech information is determined according to the plurality of second confidence levels; that is, the language of the user's input speech information is determined on the basis of the user feature. Therefore, the language recognition accuracy can be improved and the speech recognition capability can be improved.
As a possible implementation of the first aspect of the present application, correcting the plurality of first confidence levels into the plurality of second confidence levels according to the user feature of the user may specifically include: when the plurality of first confidence levels are all smaller than a first threshold, correcting the plurality of first confidence levels into the plurality of second confidence levels according to the user feature.

When the plurality of first confidence levels are all smaller than the first threshold, it is difficult to determine the language of the input speech information according to the first confidence levels. If, in this case, the plurality of first confidence levels are corrected into the plurality of second confidence levels according to the user feature, and the language of the input speech information is determined according to the second confidence levels, the language recognition accuracy and the speech recognition capability can be improved.

The user feature may include one or more of a historical language record and a user-specified language.

In this manner, the first recognition confidence is corrected according to the user's historical language record and/or the user-specified language, and the language of the input speech is determined on this basis, so that the language recognition capability can be improved.

Here, the user's historical language record refers to a record of the languages of the speech input by the user before the above input speech. The user-specified language here refers to the type of system language set by the user; there may be only one user-specified language, or there may be multiple user-specified languages (that is, the user has set multiple system languages).

As a possible implementation of the first aspect of the present application, the historical language record and the user-specified language are obtained by querying according to the voiceprint feature of the input speech information.

In the above manner, the historical language record or the user-specified language is queried according to the voiceprint of the input speech information. Compared with querying according to, for example, face information or iris information, this can avoid language misidentification caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker. In addition, in the above manner, the voiceprint can be obtained from the input speech information itself, whereas querying according to face information or iris information also requires obtaining an image of the user; therefore, querying according to the voiceprint requires less equipment and is processed more quickly.
As a possible implementation of the first aspect of the present application, the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights. The speech processing method may further include: updating the plurality of preset weights according to the plurality of second confidence levels.

In this way, the language recognition accuracy in subsequent processing cycles can be improved.

As a possible implementation of the first aspect of the present application, updating the plurality of preset weights according to the plurality of second confidence levels specifically includes: when there is a second confidence level greater than the first threshold among the plurality of second confidence levels, updating the plurality of preset weights according to the plurality of second confidence levels.

In this way, since the result of the current processing cycle is more reliable when there is a second confidence level greater than the first threshold among the plurality of second confidence levels, updating the plurality of preset weights according to the plurality of second confidence levels at this time can more reliably improve the language recognition accuracy in subsequent processing cycles.

As a possible implementation of the first aspect of the present application, the method further includes: determining the semantics of the input speech information according to the input speech information and the language of the input speech information.

In the above manner, the language recognition accuracy can be improved, and the semantic understanding accuracy can be improved.
As a possible implementation of the first aspect of the present application, the plurality of languages are preset.

As a possible implementation of the first aspect of the present application, the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights; the speech processing method further includes: before acquiring the input speech information of the user, setting the plurality of preset weights according to a scene feature.

In the above manner, the preset weights are set according to the scene feature, so that different scenes can be accommodated and the language recognition result can be obtained with the preset weights best suited to the scene, which improves the language recognition capability and the speech recognition capability.

As a possible implementation of the first aspect of the present application, the scene feature includes an environment feature and/or an audio collector feature.

As a possible implementation of the first aspect of the present application, the environment feature includes one or more of an environmental signal-to-noise ratio, power supply DC/AC information, or an environmental vibration amplitude, and the audio collector feature includes microphone arrangement information.

The environmental signal-to-noise ratio, power supply DC/AC information, environmental vibration amplitude and microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to such information and performing language recognition on this basis can improve the language recognition capability.
As a possible implementation of the first aspect of the present application, setting the plurality of preset weights according to the scene feature specifically includes: acquiring pre-collected first speech data and pre-recorded first language information of the first speech data; determining second speech data according to the first speech data and the scene feature; determining second language information of the second speech data according to the second speech data; and setting the plurality of preset weights according to the first language information and the second language information.

As a possible implementation of the first aspect of the present application, determining the second language information of the second speech data according to the second speech data specifically includes: acquiring a plurality of test weight groups, any one of the plurality of test weight groups including a plurality of test weights; and determining a plurality of pieces of second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups. Setting the plurality of preset weights according to the first language information and the second language information specifically includes: determining a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and setting the plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.

As a possible implementation of the first aspect of the present application, setting the plurality of preset weights specifically includes: setting the plurality of preset weights within a weight range.

As a possible implementation of the first aspect of the present application, updating the plurality of preset weights specifically includes: updating the plurality of preset weights within the weight range.

If a preset weight exceeds the weight range, the recognition result will be unreliable. Therefore, setting or updating the preset weights within the weight range can ensure the accuracy of the recognition result as much as possible.
As a possible implementation of the first aspect of the present application, the weight range is determined as follows: acquiring a plurality of pre-collected test speech data groups and pre-recorded first language information of the plurality of test speech data groups, any one of the plurality of test speech data groups including a plurality of pieces of test speech data; acquiring a plurality of test weight groups, any one of the plurality of test weight groups including a plurality of test weights; and determining the weight range according to the plurality of test speech data groups, the first language information and the plurality of test weight groups.

In the above manner, a large number of speech data groups are used for testing to set the weight range of the multilingual preset weight set, that is, the robustness range of the language recognition model is specified, so that the language recognition model works within this range, thereby ensuring the reliability of the language recognition results.
The second aspect of this application provides a speech processing method, including: acquiring input speech information of a user; determining, according to the input speech information, multiple third confidence levels corresponding to the input speech information, where the multiple third confidence levels respectively correspond to multiple languages; correcting the multiple third confidence levels into multiple fourth confidence levels according to scene features; and determining the language of the input speech information according to the multiple fourth confidence levels.
With the above speech processing method, the multiple third confidence levels are corrected into multiple fourth confidence levels according to the scene features, and the language of the input speech information is determined according to the multiple fourth confidence levels; that is, the language of the user's input speech information is determined with the scene features taken into account. The speech processing method can thus adapt to the actual scene as far as possible, improving language identification accuracy and speech recognition capability.
Here, the scene features may include environment features and/or audio collector features.
As a possible implementation of the second aspect of this application, the environment features include one or more of an environmental signal-to-noise ratio, power-supply DC/AC information, or an environmental vibration amplitude, and the audio collector features include microphone arrangement information.
As a possible implementation of the second aspect of this application, correcting the multiple third confidence levels into multiple fourth confidence levels according to the scene features specifically includes: setting multiple preset weights according to the scene features; and correcting the multiple third confidence levels into the multiple fourth confidence levels according to the multiple preset weights.
As a possible implementation of the second aspect of this application, setting the multiple preset weights according to the scene features specifically includes: acquiring pre-collected first speech data and pre-recorded first-language information of the first speech data; determining second speech data according to the first speech data and the scene features; determining second-language information of the second speech data according to the second speech data; and setting the multiple preset weights according to the first-language information and the second-language information.
As a possible implementation of the second aspect of this application, determining the second-language information of the second speech data according to the second speech data specifically includes: obtaining multiple test weight groups, where a test weight group includes multiple test weights; and determining multiple pieces of second-language information according to the second speech data and the multiple test weight groups, where the multiple pieces of second-language information respectively correspond to the multiple test weight groups. Setting the multiple preset weights according to the first-language information and the second-language information specifically includes: determining multiple accuracy rates of the multiple pieces of second-language information according to the first-language information and the multiple pieces of second-language information; and setting the multiple preset weights according to the test weight group corresponding to the second-language information with the highest accuracy rate.
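The selection of the test weight group with the highest accuracy can be sketched as follows. This is an illustrative assumption about one way to realize the step; the stub recognizer `recognize` and the data layout are hypothetical.

```python
# Hypothetical sketch: pick the preset weights from the test weight group whose
# recognized languages best match the pre-recorded first-language information.

def recognize(sample_scores, weights):
    # Stub recognizer: weight the raw per-language scores, return the best language.
    weighted = {lang: sample_scores[lang] * w for lang, w in weights.items()}
    return max(weighted, key=weighted.get)

def select_preset_weights(second_speech_data, first_language_info, weight_groups):
    """second_speech_data: list of per-language score dicts (one per utterance).
    first_language_info: list of true language labels, in the same order.
    Returns the test weight group that yields the highest accuracy."""
    best_weights, best_accuracy = None, -1.0
    for weights in weight_groups:
        second_language_info = [recognize(s, weights) for s in second_speech_data]
        accuracy = sum(
            pred == truth
            for pred, truth in zip(second_language_info, first_language_info)
        ) / len(first_language_info)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights
```

Each candidate weight group produces one set of second-language information; the group whose predictions agree most often with the recorded first-language information becomes the preset weights.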
The specific features of the second aspect of this application may be the same as or similar to those of the first aspect, so their technical effects are essentially the same and are not described again here.
The third aspect of this application provides a speech processing apparatus, including a processing module and a transceiver module. The transceiver module is configured to acquire input speech information of a user. The processing module is configured to determine, according to the input speech information, multiple first confidence levels corresponding to the input speech information, where the multiple first confidence levels respectively correspond to multiple languages. The processing module is further configured to correct the multiple first confidence levels into multiple second confidence levels according to user features of the user, and to determine the language of the input speech information according to the multiple second confidence levels.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to, when the multiple first confidence levels are all smaller than a first threshold, correct the multiple first confidence levels into the multiple second confidence levels according to the user features.
As a possible implementation of the third aspect of this application, the user features include one or more of a historical language record and a user-specified language.
As a possible implementation of the third aspect of this application, the historical language record and the user-specified language are obtained by querying with a voiceprint feature of the input speech information.
As a possible implementation of the third aspect of this application, the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights, and the processing module is further configured to update the multiple preset weights according to the multiple second confidence levels.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to, when a second confidence level greater than the first threshold exists among the multiple second confidence levels, update the multiple preset weights according to the multiple second confidence levels.
As a possible implementation of the third aspect of this application, the processing module is further configured to determine the semantics of the input speech information according to the input speech information and the language of the input speech information.
The multiple languages may be preset.
As a possible implementation of the third aspect of this application, the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights, and the processing module is further configured to set the multiple preset weights according to scene features before the input speech information of the user is acquired.
The scene features may include environment features and/or audio collector features. The environment features may include one or more of an environmental signal-to-noise ratio, power-supply DC/AC information, or an environmental vibration amplitude, and the audio collector features may include microphone arrangement information.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to acquire pre-collected first speech data and pre-recorded first-language information of the first speech data, determine second speech data according to the first speech data and the scene features, determine second-language information of the second speech data according to the second speech data, and set the multiple preset weights according to the first-language information and the second-language information.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to obtain multiple test weight groups, where any one of the multiple test weight groups includes multiple test weights; determine multiple pieces of second-language information according to the second speech data and the multiple test weight groups, where the multiple pieces of second-language information respectively correspond to the multiple test weight groups; determine multiple accuracy rates of the multiple pieces of second-language information according to the first-language information and the multiple pieces of second-language information; and set the multiple preset weights according to the test weight group corresponding to the second-language information with the highest accuracy rate.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to set the multiple preset weights within a weight range.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to update the multiple preset weights within the weight range.
As a possible implementation of the third aspect of this application, the weight range is determined as follows:
obtaining multiple pre-collected test speech data groups and pre-recorded first-language information for the multiple test speech data groups, where any one of the multiple test speech data groups includes multiple pieces of test speech data; obtaining multiple test weight groups, where any one of the multiple test weight groups includes multiple test weights; and determining the weight range according to the multiple test speech data groups, the first-language information, and the multiple test weight groups.
The speech processing apparatus of the third aspect can achieve the same technical effects as the speech processing method of the first aspect, and the description is not repeated here.
The fourth aspect of this application provides a speech processing apparatus, including a processing module and a transceiver module. The transceiver module is configured to acquire input speech information of a user. The processing module is configured to determine, according to the input speech information, multiple third confidence levels corresponding to the input speech information, where the multiple third confidence levels respectively correspond to multiple languages. The processing module is further configured to correct the multiple third confidence levels into multiple fourth confidence levels according to scene features, and to determine the language of the input speech information according to the multiple fourth confidence levels.
The scene features may include environment features and/or audio collector features.
The environment features may include one or more of an environmental signal-to-noise ratio, power-supply DC/AC information, or an environmental vibration amplitude, and the audio collector features may include microphone arrangement information.
As a possible implementation of the fourth aspect, the processing module is specifically configured to set multiple preset weights according to the scene features, and correct the multiple third confidence levels into the multiple fourth confidence levels according to the multiple preset weights.
As a possible implementation of the fourth aspect, the processing module is specifically configured to acquire pre-collected first speech data and pre-recorded first-language information of the first speech data, determine second speech data according to the first speech data and the scene features, determine second-language information of the second speech data according to the second speech data, and set the multiple preset weights according to the first-language information and the second-language information.
As a possible implementation of the fourth aspect, the processing module is specifically configured to obtain multiple test weight groups, where a test weight group includes multiple test weights; determine multiple pieces of second-language information according to the second speech data and the multiple test weight groups, where the multiple pieces of second-language information respectively correspond to the multiple test weight groups; determine multiple accuracy rates of the multiple pieces of second-language information according to the first-language information and the multiple pieces of second-language information; and set the multiple preset weights according to the test weight group corresponding to the second-language information with the highest accuracy rate.
With the speech processing apparatus of the fourth aspect, the same technical effects as the speech processing method of the second aspect can be achieved, and the description is not repeated here.
The fifth aspect of this application provides a computing device, including a processor and a memory. The memory stores computer program instructions that, when executed by the processor, cause the processor to perform any method described in the first aspect or the second aspect.
The sixth aspect of this application provides a computer-readable storage medium storing computer program instructions that, when executed by a computer, cause the computer to perform any method described in the first aspect or the second aspect.
The seventh aspect of this application provides a computer program product, including computer program instructions that, when executed by a computer, cause the computer to perform any method described in the first aspect or the second aspect.
The eighth aspect of this application provides a system, including the speech processing apparatus provided in any one of the third to fourth aspects or any possible implementation thereof.
Description of the Drawings
FIG. 1 is a schematic illustration of an example application scenario of the speech processing solution provided by an embodiment of this application;
FIG. 2 is a schematic illustration of a speech processing system to which the speech processing solution provided by an embodiment of this application is applied;
FIG. 3 is a flowchart of a speech processing method provided by an embodiment of this application;
FIG. 4 is a flowchart of a speech processing method provided by an embodiment of this application;
FIG. 5 is a schematic structural illustration of a speech processing apparatus provided by an embodiment of this application;
FIG. 6 is a flowchart schematically illustrating a language identification method provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a language identification apparatus provided by an embodiment of this application;
FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of this application;
FIG. 10 is a schematic illustration of a method for setting the weight range;
FIG. 11 is a schematic illustration of a voice interaction system involved in an embodiment of this application;
FIG. 12 is a flowchart illustrating one of the procedures of the voice interaction method involved in an embodiment;
FIG. 13 is a schematic illustration of a method for initializing a preset weight set provided in an embodiment of this application;
FIG. 14 is a schematic illustration of part of the flow of a voice interaction process provided in an embodiment of this application;
FIG. 15 is a schematic illustration of a confidence correction manner provided in an embodiment of this application;
FIG. 16 is a schematic illustration of another confidence correction manner provided in an embodiment of this application;
FIG. 17 is a schematic illustration of an electronic control unit provided in an embodiment of this application.
It should be understood that, in the above structural diagrams, the sizes and shapes of the blocks are for reference only and should not be construed as an exclusive interpretation of the embodiments of the present invention. The relative positions and containment relationships among the blocks only schematically represent the structural associations among them, rather than limiting the physical connection manners of the embodiments of the present invention.
Detailed Description
The technical solutions provided by this application are further described below with reference to the accompanying drawings and embodiments. It should be understood that the system structures and business scenarios provided in the embodiments of this application are mainly intended to illustrate possible implementations of the technical solutions of this application, and should not be construed as the only limitation on them. A person of ordinary skill in the art will appreciate that, as system structures evolve and new business scenarios emerge, the technical solutions provided in this application remain applicable to similar technical problems.
It should be understood that the speech processing solutions provided in the embodiments of this application include a speech processing method, apparatus, and system. Since these technical solutions solve problems on the same or similar principles, some repetition may be omitted in the introduction of the following specific embodiments; these specific embodiments should be regarded as cross-referencing one another and may be combined with one another.
An example application scenario of the speech processing solution provided by an embodiment of this application is first described with reference to FIG. 1. FIG. 1 illustrates a scenario applied to a vehicle. Specifically, as shown in FIG. 1, the in-vehicle system of a vehicle 200 has a voice interaction function: it can receive voice commands from occupants such as a driver 300 through a microphone 212 (a microphone array in this example) on a central control display 210, and execute corresponding controls according to the voice commands (for example, playing music, opening a window, turning on the air conditioner, or navigating). It can also respond (give feedback) to a voice command, for example by presenting display information on the central control display 210 or by emitting voice information through a speaker (not shown) on the central control display 210.
For example, since the vehicle 200 is ridden by different occupants, they may issue voice commands in different languages, and even the same occupant may issue voice commands in different languages; the in-vehicle system therefore has a function for handling voice commands in different languages. However, limited by its language identification capability, the in-vehicle system may sometimes obtain a wrong language identification result, and consequently fail to recognize, or wrongly recognize, the semantics of a voice command and thus fail to respond correctly.
Specifically, as a language identification solution, there is a technique that uses a machine learning model for classification. However, a machine learning model may learn some task-irrelevant information, such as the environmental signal-to-noise ratio or the characteristics of the audio collector (sound sensor, microphone). As a result, when such information changes in a practical application, the prediction results of the machine learning model may become erroneous.
For example, in the scenario shown in FIG. 1, the vehicle 200 is a convertible, and the ambient noise is relatively large (for example, a medium-noise environment). For the English voice command "please play music" issued by the driver 300, the in-vehicle system may therefore obtain a wrong language identification result, fail to recognize the voice command correctly, and thus fail to respond correctly. In addition, if the microphone array type corresponding to the training sample data of the machine learning model differs from the microphone 212 of the in-vehicle system, the in-vehicle system may also produce a wrong language identification result and fail to correctly recognize the driver's voice command.
To this end, the embodiments of this application provide a speech processing method, apparatus, system, and the like, which can improve the speech recognition capability of a multilingual speech processing solution.
A system architecture to which the speech processing method, apparatus, and system provided in the embodiments of this application are applied is described below. FIG. 2 is a schematic illustration of the architecture of a speech processing system to which the speech processing solution provided by an embodiment of this application is applied. As shown in FIG. 2, the speech processing system 180 includes a speech processing apparatus 182, a sound sensor (microphone) 184, a speaker 186, a display apparatus 188, and the like.
The speech processing system 180 can be applied to an intelligent vehicle as an in-vehicle system, and can also be applied to scenarios such as smart home, smart office, intelligent robots, intelligent voice question answering, intelligent voice analysis, and real-time voice monitoring and analysis.
The sound sensor 184 is used to capture the user's input speech. The speech processing apparatus 182 obtains the user's input speech information from the sensor data of the sound sensor 184, processes the input speech information, and obtains its semantics. The speech processing apparatus 182 then performs corresponding control according to the semantics, for example controlling the output of the speaker 186 or the display apparatus 188. In addition to the speaker 186 and the display apparatus 188, the speech processing apparatus 182 may also be connected to other apparatuses and mechanisms; for example, when the speech processing system 180 is applied in an in-vehicle system, the speech processing apparatus 182 may also be connected to a window-lifting system, an air-conditioning system, and the like, so as to control the windows, the air-conditioning system, and so on.
The speech processing method provided by an embodiment of this application is described below with reference to FIG. 3.
FIG. 3 is a flowchart of a speech processing method provided by an embodiment of this application. The speech processing method may be executed by a vehicle, an in-vehicle apparatus, an in-vehicle unit, an in-vehicle computer, or the like, or by a component of the vehicle or in-vehicle apparatus such as a chip or a processor. Besides vehicles, the speech processing method can also be applied in other scenarios such as smart home or smart office, in which case it may be executed by the related devices involved in those scenarios, such as a control apparatus or a processor.
As shown in FIG. 3, the speech processing method includes the following:
S1: Acquire input speech information of a user. The user's input speech information may be obtained from sensor data collected by a sound sensor; the sensor data may be used directly, or information obtained by processing the sensor data may be used. The time length of the input speech information is not particularly limited and may correspond to a passage or a single sentence spoken by the user. Moreover, during speech processing, the content spoken by the user may be segmented into multiple pieces of input speech information, and the processing of S2-S4 described below may be performed on each piece separately.
S2: Determine, according to the input speech information, multiple first confidence levels corresponding to the input speech information, where the multiple first confidence levels respectively correspond to multiple languages. Here, the multiple languages may be preset. The confidence level of a language means the probability that the input speech information belongs to that language. For example, when the multiple first confidence levels obtained are {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0}, the probability that the language of the input speech information is Chinese is 0.6, the probability that it is English is 0.4, and the probabilities that it is Korean, German, or Japanese are 0.
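As a minimal sketch of this step, the following picks the language with the highest confidence level; the decision threshold is an assumption used for illustration.

```python
# Pick the language with the highest confidence level; if no confidence level
# reaches the threshold, report the result as inconclusive.

def pick_language(confidences, threshold=0.5):
    lang = max(confidences, key=confidences.get)
    return lang if confidences[lang] >= threshold else None

first_confidences = {"Chinese": 0.6, "English": 0.4,
                     "Korean": 0.0, "German": 0.0, "Japanese": 0.0}
print(pick_language(first_confidences))  # prints: Chinese
```

An inconclusive result (None) corresponds to the case addressed by S3, where the first confidence levels are corrected using user features.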
Different languages may belong to different language families (for example, Chinese and English are different languages), or may be different varieties within the same family (for example, Mandarin and Cantonese within Chinese are also treated as different languages).
S3: Correct the multiple first confidence levels into multiple second confidence levels according to user features of the user. The user features here are, for example, a historical language record or a user-specified language. The historical language record consists of the recognized languages of the user's input speech information identified and recorded before the current processing cycle; a recognized language means the language of the input speech information determined by recognizing that input speech information. The user-specified language refers to the type of system language set by the user, for example according to the language the user commonly speaks.
S4: Determine the language of the input speech information according to the multiple second confidence levels.
With the above speech processing method, the first confidence levels are corrected according to the user features, and the language of the input speech information is determined according to the corrected second confidence levels. The language of the input speech information can thus be determined more accurately, improving the speech recognition capability.
As to the specific correction method, take correction based on the historical language record as an example: if Chinese has a large number of entries in the historical language record, the confidence level of Chinese among the multiple first confidence levels obtained in this processing cycle is increased to obtain the second confidence levels. For example, based on the historical language record, the above multiple first confidence levels {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0} are corrected to {Chinese: 0.8; English: 0.2; Korean: 0; German: 0; Japanese: 0}.
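One way such a correction might be realized is sketched below; the boost factor and the renormalization are illustrative assumptions rather than the claimed correction rule.

```python
# Hypothetical correction: shift confidence toward languages that appear often
# in the user's historical language record, then renormalize to sum to 1.

from collections import Counter

def correct_by_history(first_confidences, history, boost=0.5):
    counts = Counter(history)
    total_records = sum(counts.values())
    corrected = {
        lang: conf * (1.0 + boost * counts[lang] / total_records)
        for lang, conf in first_confidences.items()
    }
    norm = sum(corrected.values())
    return {lang: conf / norm for lang, conf in corrected.items()}

first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0}
history = ["Chinese", "Chinese", "Chinese", "English"]
second = correct_by_history(first, history)
```

With the example history above, Chinese rises above 0.6 and English falls below 0.4; the exact numbers depend on the assumed boost factor.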
Optionally, when the multiple first confidence levels are all smaller than a first threshold, the multiple first confidence levels may be corrected into the multiple second confidence levels according to the user features. When all the first confidence levels are smaller than the first threshold, it is difficult to determine the language of the input speech information from them; correcting them into second confidence levels according to the user features at this point, and determining the language from the second confidence levels, improves language identification accuracy and speech recognition capability.
Optionally, the historical language record and the user-specified language may be obtained by querying with a voiceprint feature of the input speech information. In this way, the historical language record and the user-specified language can be obtained easily.
可选地,多个第一置信度可由多个初始置信度和多个预设权重确定,此时,可以 根据多个第二置信度,更新多个预设权重。Optionally, the multiple first confidence levels may be determined by multiple initial confidence levels and multiple preset weights. At this time, the multiple preset weights may be updated according to the multiple second confidence levels.
如此,根据本次处理周期的处理结果,更新预设权重,从而能够提高之后的处理周期的语种识别精度。In this way, the preset weight is updated according to the processing result of the current processing cycle, so that the language recognition accuracy of the subsequent processing cycle can be improved.
作为具体的更新方法,可以在多个第二置信度中存在大于第一阈值的第二置信度时,执行更新。As a specific updating method, updating may be performed when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
当多个第二置信度中存在大于第一阈值的第二置信度时,根据多个第二置信度得到的语种识别结果的可信度更高,在此时根据多个第二置信度来更新预设权重能够更加可靠地提高之后的处理周期的语种识别精度。When there is a second confidence degree greater than the first threshold among the plurality of second confidence degrees, the language recognition result obtained according to the plurality of second confidence degrees has higher credibility, and at this time according to the plurality of second confidence degrees Updating the preset weights can more reliably improve the language recognition accuracy in subsequent processing cycles.
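The conditional update described above can be sketched as follows; `update_weights`, the fixed `step`, and the way the weight is increased are illustrative assumptions rather than details from the embodiment.

```python
# Hypothetical sketch of the conditional weight update: the first
# confidence levels are assumed to be initial confidences multiplied by
# preset weights, and the weights are only updated when some second
# confidence level exceeds the first threshold.
FIRST_THRESHOLD = 0.8

def update_weights(weights, second_conf, threshold=FIRST_THRESHOLD, step=0.1):
    best_lang = max(second_conf, key=second_conf.get)
    if second_conf[best_lang] <= threshold:
        return weights  # result not trustworthy enough; keep weights
    # Increase the weight of the identified language relative to the others.
    return {
        lang: w + step if lang == best_lang else w
        for lang, w in weights.items()
    }

weights = {"Chinese": 1.0, "English": 1.0}
second = {"Chinese": 0.85, "English": 0.15}
weights = update_weights(weights, second)
```

Only the identified language's weight grows, matching the idea of increasing the preset weight of the identified language relative to the other languages.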
Optionally, after the language of the input speech information is determined, the semantics of the input speech information may be determined according to the input speech information and its language.
In this way, since the language of the input speech information can be determined more accurately, the accuracy of semantic recognition of the input speech information is also improved.
Optionally, before the user's input speech information is acquired, the above multiple preset weights may be set according to scene features.
Since the multiple preset weights are set according to scene features before the user's input speech information is acquired, language identification accuracy can be improved.
The scene features here may include, for example, environment features and/or audio collector features.
Setting the multiple preset weights according to the environment features and/or audio collector features improves language identification accuracy.
The environment features here may include one or more of the environmental signal-to-noise ratio, power supply DC/AC information, or environmental vibration amplitude, and the audio collector features may include microphone arrangement information.
As a specific way of setting the multiple preset weights, the following may be adopted: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene features; determine second language information of the second speech data according to the second speech data; and set the multiple preset weights according to the first language information and the second language information.
Further, the specific way of determining the second language information of the second speech data from the second speech data may be: acquire multiple test weight groups, each of which includes multiple test weights; determine multiple pieces of second language information according to the second speech data and the multiple test weight groups, the multiple pieces of second language information corresponding to the multiple test weight groups respectively; determine multiple accuracy rates of the multiple pieces of second language information according to the first language information and the multiple pieces of second language information; and set the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
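The search over test weight groups can be sketched as follows; the helper names and the toy data are hypothetical, and the per-utterance confidences standing in for the second speech data are assumed to come from an upstream language identification model.

```python
# Hypothetical sketch of the weight-group search described above: each
# candidate weight group is scored by how often the weighted confidences
# identify the pre-recorded (ground-truth) language, and the group with
# the highest accuracy rate becomes the preset weights.
def pick_preset_weights(utterances, truth_langs, weight_groups):
    def identify(conf, weights):
        scored = {lang: conf[lang] * weights[lang] for lang in conf}
        return max(scored, key=scored.get)

    def accuracy(weights):
        hits = sum(
            identify(conf, weights) == truth
            for conf, truth in zip(utterances, truth_langs)
        )
        return hits / len(utterances)

    return max(weight_groups, key=accuracy)

utterances = [{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.4, "English": 0.6}]
truth_langs = ["Chinese", "Chinese"]
groups = [{"Chinese": 1.0, "English": 1.0}, {"Chinese": 1.6, "English": 1.0}]
best = pick_preset_weights(utterances, truth_langs, groups)
```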
In addition, an adjustable range, that is, a weight range, may be set for the multiple preset weights, and the multiple preset weights are set or updated within the weight range. If a preset weight falls outside the weight range, the identification result is not trustworthy. Therefore, setting an adjustable range, that is, a weight range, improves the accuracy of the language identification result.
Here, the weight range may be determined as follows: acquire multiple pre-collected test speech data groups and pre-recorded first language information of the multiple test speech data groups, each of the multiple test speech data groups including multiple pieces of test speech data; acquire multiple test weight groups, each of which includes multiple test weights; and determine the weight range according to the multiple test speech data groups, the first language information, and the multiple test weight groups.
A speech processing method provided by another embodiment of the present application is described below with reference to FIG. 4. FIG. 4 is a flowchart of a speech processing method provided by an embodiment of the present application. Similar to the above embodiment, the speech processing method of this embodiment may be executed by a vehicle, an in-vehicle apparatus, a head unit, an in-vehicle computer, or the like, or by a component of the vehicle or in-vehicle apparatus, such as a chip or a processor. In addition, part of the content of this embodiment is the same as that of the above embodiment and is therefore not described again.
As shown in FIG. 4, the speech processing method includes the following:
S6: Acquire the user's input speech information.
S7: Determine, according to the input speech information, multiple third confidence levels corresponding to the input speech information, the multiple third confidence levels corresponding to multiple languages respectively.
S8: Correct the multiple third confidence levels to multiple fourth confidence levels according to scene features.
S9: Determine the language of the input speech information according to the multiple fourth confidence levels.
With the above speech processing method, the third confidence levels are corrected according to the scene features, and the language of the input speech information is determined according to the corrected fourth confidence levels, so that the language of the input speech information can be determined more accurately, improving speech recognition capability. Here, the third confidence levels may be obtained in the same way as the above first confidence levels or in a different way, and the fourth confidence levels may be obtained in the same way as the above second confidence levels or in a different way. The specific way of correcting according to scene features may be the same as or different from the specific way of correcting according to user characteristics in the above embodiment.
The correction processing in this embodiment and the correction processing described in the above embodiment may be used in combination, that is, the language confidence levels are corrected according to both the user characteristics and the scene features, so that the language of the input speech information can be determined even more accurately.
In this embodiment, optionally, the multiple preset weights may be set according to the scene features, and the multiple third confidence levels are corrected to multiple fourth confidence levels according to the multiple preset weights.
In addition, optionally, the multiple preset weights may be set specifically as follows: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene features; determine second language information of the second speech data according to the second speech data; and set the multiple preset weights according to the first language information and the second language information.
Optionally, as a specific implementation: acquire multiple test weight groups, each test weight group including multiple test weights; determine multiple pieces of second language information according to the second speech data and the multiple test weight groups, the multiple pieces of second language information corresponding to the multiple test weight groups respectively; determine multiple accuracy rates of the multiple pieces of second language information according to the first language information and the multiple pieces of second language information; and set the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
A speech processing apparatus provided by an embodiment of the present application is described below with reference to FIG. 5. FIG. 5 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present application. The speech processing apparatus 190 is configured to execute the speech processing method of the embodiment described with reference to FIG. 3 or the speech processing method of the embodiment described with reference to FIG. 4; its structure can be understood from the above description and is therefore only briefly described here. As shown in FIG. 5, the speech processing apparatus 190 includes a processing module 192 and a transceiver module 194. The processing module 192 may be configured to execute S2 to S4 or S7 to S9 above, and the transceiver module 194 may be configured to execute S1 or S6 above. In addition, the speech processing apparatus 190 may be implemented in hardware, in software, or in a combination of hardware and software. With the speech processing apparatus 190 of this embodiment, the same technical effects as those of the speech processing method described above can be obtained, so repeated description of the technical effects is omitted here.
A language identification method provided by an embodiment of the present application is described below with reference to FIG. 6.
FIG. 6 is a flowchart schematically illustrating a language identification method provided by an embodiment of the present application. The language identification method may be executed by a vehicle, an in-vehicle apparatus, a head unit, an in-vehicle computer, a chip, a processor, or the like. As shown in FIG. 6, in the language identification method of this embodiment, first, in step S10, the user's input speech information is acquired. For example, the user's input speech data received by a microphone is acquired as the input speech information, or the microphone's input speech data is preprocessed to obtain the input speech information. In step S12, the input speech is recognized to obtain a multilingual first identification confidence set, in which multiple first identification confidence levels correspond to multiple languages respectively. For example, the multilingual first identification confidence set is {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}; that is, the probability that the language of the input speech information is Chinese is 0.9, the probability that it is English is 0.1, and the probability that it is Korean, German, or Japanese is 0.
In step S14, it is determined whether a first identification confidence level greater than a threshold exists in the multilingual first identification confidence set. The threshold here may be set to 0.8, for example. When the result is "Yes", that is, when a first identification confidence level greater than the threshold exists (for example, the first identification confidence level of 0.9 for Chinese), an identification result is generated from the first identification confidence set and output. The identification result here may be a result indicating the identified language (for example, Chinese), or may be the first identification confidence set itself. In addition, as another embodiment, S14 may be omitted and step S18 described below performed directly.
When the result of step S14 is "No", that is, when no first identification confidence level greater than the threshold exists, in step S18 the first identification confidence set is corrected according to the user's user characteristics to obtain a second identification confidence set.
Examples of the user characteristics include the user's historical language records and the user-specified language.
The historical language records refer to records of the identified languages of speech that the user input before the above input speech. The user-specified language refers to the system language set by the user (for example, the system language of a voice interaction system, or the system language of a mobile phone operating system when the method is applied to a mobile phone). There may be one user-specified language or multiple user-specified languages (that is, the user has set more than one system language). The historical language records may be obtained by querying a database with the voiceprint of the input speech, or by querying with the user's face information, iris information, and so on. That is, the user's identity can be determined from the voiceprint, face information, iris information, etc., and the user's historical language records can then be obtained from the database. Querying the historical language records and the user-specified language with the voiceprint of the input speech, compared with querying with face information, iris information, etc., avoids misidentifying the user (the speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification. In addition, the voiceprint can be obtained from the input speech information itself, whereas querying with face information, iris information, etc. additionally requires obtaining an image of the user; querying with the voiceprint therefore requires less equipment and is faster.
In addition, it should be noted that obtaining the user-specified language through the voiceprint, for example, is premised on collecting the user's voiceprint when the user sets the system language and storing the voiceprint (or the user's identity) in a user-specified-language database in association with the system language set by the user. The database mentioned here may be stored locally or on a trusted platform.
After the first identification confidence set is corrected according to the user characteristics to obtain the second identification confidence set, in step S20 a language identification result is generated from the second identification confidence set. For example, the second identification confidence set may be output directly as the identification result; alternatively, when a second identification confidence level greater than the threshold exists in the second identification confidence set, the second identification confidence set or information indicating the identified language is output, and when no second identification confidence level greater than the threshold exists in the second identification confidence set, the first identification confidence set is output as the identification result.
With the above method, the second identification confidence levels are computed from the first identification confidence levels according to the user characteristics, and the language identification result is determined from the second identification confidence levels; in this way, language identification capability is improved.
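The flow of steps S12 to S20 can be sketched as follows, assuming a threshold of 0.8; the correction function is passed in as a stand-in for step S18, and the names are illustrative.

```python
# A minimal sketch of the flow in FIG. 6 (steps S14-S20), assuming the
# first identification confidence set has already been produced by a
# language identification model in step S12.
THRESHOLD = 0.8

def identify_language(first_set, correct_with_user_features):
    # S14: if some first confidence already exceeds the threshold,
    # generate the result directly from the first set.
    if max(first_set.values()) > THRESHOLD:
        return max(first_set, key=first_set.get), first_set
    # S18: otherwise correct the first set using user characteristics.
    second_set = correct_with_user_features(first_set)
    # S20: generate the language identification result from the second set.
    return max(second_set, key=second_set.get), second_set

first = {"Chinese": 0.6, "English": 0.4}
lang, conf = identify_language(first, lambda s: {"Chinese": 0.85, "English": 0.15})
```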
Optionally, the language identification method of this embodiment further includes: when the multiple second identification confidence levels in the second identification confidence set are all smaller than the threshold, generating the language identification result according to an automatic speech recognition confidence obtained by performing automatic speech recognition on the input speech, or a natural language understanding confidence obtained by performing natural language understanding (NLU) on the input speech.
In this way, when the language of the input speech is difficult to identify from the second identification confidence levels, the language of the input speech is determined from the automatic speech recognition confidence or the natural language understanding confidence, thereby improving language identification capability. As a specific implementation, for example, the language whose automatic speech recognition confidence exceeds the threshold is taken as the identified language of the input speech.
Optionally, in this embodiment, the first identification confidence set may be obtained as follows: recognize the input speech to obtain an initial confidence set; and multiply the multiple initial confidence levels in the initial confidence set by the multiple preset weights in a preset weight set, respectively, to obtain the first identification confidence set.
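The elementwise multiplication described above can be sketched as follows. Note that the renormalization step is an added assumption to keep the result a distribution; the embodiment only specifies the multiplication itself.

```python
# Hypothetical sketch of forming the first identification confidence set:
# each initial confidence level is multiplied by the preset weight of its
# language, then the products are normalized back into a distribution
# (the normalization is an assumption, not stated in the embodiment).
def first_confidence_set(initial_set, preset_weights):
    weighted = {
        lang: conf * preset_weights[lang] for lang, conf in initial_set.items()
    }
    total = sum(weighted.values())
    return {lang: v / total for lang, v in weighted.items()}

initial = {"Chinese": 0.5, "English": 0.5}
weights = {"Chinese": 1.5, "English": 0.5}
first = first_confidence_set(initial, weights)
```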
In this case, when a second identification confidence level greater than the threshold exists in the second identification confidence set, the preset weight set may be updated so that, among the multiple languages, the preset weight of the identified language whose second identification confidence level is greater than the threshold is increased relative to the preset weights of the other languages.
With the above technical means, when a second identification confidence level greater than the threshold exists in the second identification confidence set, that is, when the language of the input speech can be determined, the preset weight set is updated as above. The updated preset weight set is then used when subsequent input speech is processed, which improves the accuracy of language identification and thus the language identification capability.
The preset weight set may be updated specifically as follows: perform a correction calculation on the preset weight set to obtain a corrected weight set; and when the multiple corrected weights in the corrected weight set are within the weight range, update the preset weight set with the values of the multiple corrected weights.
If a preset weight falls outside the weight range, the language identification result obtained from the preset weights has relatively low credibility. Therefore, correcting the preset weights within the weight range as above suppresses the language misidentification rate.
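The range check described above can be sketched as follows; the function name and the per-language `(low, high)` ranges are illustrative assumptions.

```python
# Hypothetical sketch of the range-checked weight update: the corrected
# weights only replace the preset weights when every corrected weight
# stays within its language's weight range; otherwise the preset
# weights are kept unchanged.
def apply_corrected_weights(preset, corrected, weight_ranges):
    in_range = all(
        weight_ranges[lang][0] <= w <= weight_ranges[lang][1]
        for lang, w in corrected.items()
    )
    return dict(corrected) if in_range else dict(preset)

preset = {"Chinese": 1.0, "English": 1.0}
ranges = {"Chinese": (0.5, 2.0), "English": (0.5, 2.0)}
ok = apply_corrected_weights(preset, {"Chinese": 1.4, "English": 0.9}, ranges)
bad = apply_corrected_weights(preset, {"Chinese": 2.5, "English": 0.9}, ranges)
```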
In addition, in this embodiment, the preset weight set may be preset according to scene features. This makes it possible to adapt to different scenes and to obtain the language identification result with preset weights that suit the scene as well as possible, improving language identification capability.
The scene features may include environment features and/or audio collector features. The environment features may include the environmental signal-to-noise ratio, power supply DC/AC information, or environmental vibration amplitude, and the audio collector features may include microphone arrangement information. The microphone arrangement information indicates whether a single microphone or a microphone array is used and, in the case of a microphone array, whether it is a linear array, a planar array, or a stereo array.
The environmental signal-to-noise ratio, power supply DC/AC information, environmental vibration amplitude, and microphone arrangement information can all affect the language confidence levels. Therefore, adjusting the preset weights according to such information and performing language identification on that basis improves language identification capability.
The above preset weight set may be set specifically as follows: acquire multiple test weight sets; input a pseudo-environment data set into the language identification model, the pseudo-environment data set being obtained from the scene features and a noise-free data set; obtain, from the initial confidence sets output by the language identification model, multiple first identification confidence sets corresponding to the multiple test weight sets; calculate the prediction accuracy of each of the multiple first identification confidence sets according to the language information of the pseudo-environment data set; determine, among the multiple test weight sets, the test weight set corresponding to the first identification confidence set with the highest prediction accuracy as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set.
Optionally, when the set preset weights are within the weight range, the setting takes effect; when the set preset weights are not within the weight range, the setting is canceled. Alternatively, when the set preset weights are not within the weight range, the setting still takes effect, but other means are preferred for obtaining the language identification result, for example determining the identified language of the input speech according to the user-specified language, or comparing the current input speech with the input speech in the historical language records to obtain a feature similarity and, if the feature similarity is greater than a similarity threshold, determining the language of the input speech in the historical language records as the identified language of the current input speech.
The weight range may be set as follows: acquire multiple test data sets; acquire multiple test weight sets; input the test data sets into the language identification model; obtain, from the initial confidence sets output by the language identification model and the multiple test weight sets, multiple first identification confidence sets corresponding to the multiple test weight sets; calculate the prediction accuracy of each of the multiple first identification confidence sets according to the language information of the test data sets; determine, among the multiple test weight sets, the test weight set corresponding to the first identification confidence set with the highest prediction accuracy as the optimal test weight set; obtain the optimal test weight set of each of the multiple test data sets; and obtain the weight ranges of the multiple languages from the optimal test weight sets of the multiple test data sets.
The test data sets may be pre-collected speech data sets whose language information is known.
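One plausible reading of deriving the weight range from the optimal test weight sets is to take, for each language, the span between the smallest and largest optimal weight observed across the test data sets. This is an assumption; the embodiment does not fix the exact derivation.

```python
# Hypothetical sketch of deriving the per-language weight range: the
# optimal test weight set is found for each test data set, and the range
# for each language spans the minimum to maximum optimal weight observed.
def weight_ranges(optimal_sets):
    langs = optimal_sets[0].keys()
    return {
        lang: (min(s[lang] for s in optimal_sets),
               max(s[lang] for s in optimal_sets))
        for lang in langs
    }

optimal = [
    {"Chinese": 1.2, "English": 0.8},
    {"Chinese": 1.6, "English": 0.7},
    {"Chinese": 1.4, "English": 0.9},
]
ranges = weight_ranges(optimal)
```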
In the above manner, the language identification model is tested with a large number of language data sets to set the weight range of the multilingual preset weight set; that is, the robustness range of the language identification model is specified, and the model works within this range, thereby guaranteeing the reliability of the language identification results.
FIG. 7 is a schematic structural diagram of a language identification apparatus provided by an embodiment of the present application. As shown in FIG. 7, an embodiment of the present application provides a language identification apparatus, which is configured to execute the language identification method shown in FIG. 6; its structure can be understood from the above description of the language identification method in FIG. 6, so the language identification apparatus 10 is described only relatively briefly here.
As shown in FIG. 7, the language identification apparatus 10 includes: an input speech acquisition module 17, configured to acquire the user's input speech; a language identification module 12, configured to recognize the input speech to obtain a first identification confidence set, in which multiple first identification confidence levels correspond to multiple languages respectively; a language confidence correction module 16, configured to perform a correction calculation on the first identification confidence set according to the user's user characteristics to obtain a second identification confidence set; and an identification result generation module 18, configured to generate a language identification result from the second identification confidence set.
With the above apparatus, the second identification confidence levels are computed from the first identification confidence levels according to the user characteristics, and the language identification result is determined from the second identification confidence levels; in this way, language identification capability is improved.
Optionally, the language confidence correction module 16 may, when the multiple first identification confidence levels are all smaller than the threshold, perform the correction calculation on the first identification confidence set according to the user's user characteristics to obtain the second identification confidence set.
Optionally, the user characteristics include historical language records.
In this way, when the language of the input speech is difficult to identify from the first identification confidence levels, the first identification confidence levels are corrected according to the user's historical language records, and the language of the input speech is determined on that basis, thereby improving language identification capability. The user's historical language records here refer to records of the languages of speech that the user input before the above input speech.
Optionally, the historical language records are obtained by querying with the voiceprint of the input speech.
In this way, the historical language records are queried with the voiceprint of the input speech; compared with querying with face information, iris information, etc., this avoids misidentifying the user (the speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification.
Optionally, the user characteristics include a user-specified language.
In the above manner, when it is difficult to recognize the language of the input speech based on the first recognition confidence levels, the first recognition confidence levels are corrected according to the user-specified language, and the language of the input speech is determined on this basis, thereby improving the language recognition capability.
Optionally, the user-specified language is obtained by querying based on the voiceprint of the input speech.
In the above manner, the user-specified language is queried based on the voiceprint of the input speech. Compared with querying based on, for example, face information or iris information, this can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
Optionally, the recognition result generation module is further configured to: when a plurality of second recognition confidence levels in the second recognition confidence set are less than the threshold, generate the language recognition result according to an automatic speech recognition (Automatic Speech Recognition, ASR) confidence obtained by performing automatic speech recognition on the input speech.
In the above manner, when it is difficult to recognize the language of the input speech based on the second recognition confidence levels, the language of the input speech is determined according to the automatic speech recognition confidence, thereby improving the language recognition capability. As a specific implementation, for example, the language whose automatic speech recognition confidence exceeds the threshold is taken as the recognized language of the input speech.
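The ASR fallback described above can be sketched as follows; the function name, data layout, and threshold value are illustrative assumptions, not part of the embodiments.

```python
def pick_language_by_asr(asr_confidences, threshold=0.8):
    """Return the language whose ASR confidence exceeds the threshold, or None."""
    best = max(asr_confidences, key=asr_confidences.get)
    return best if asr_confidences[best] > threshold else None

asr = {"Chinese": 0.92, "English": 0.40, "Japanese": 0.05}
print(pick_language_by_asr(asr))  # -> Chinese
```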
Optionally, the recognition result generation module is further configured to: when a plurality of second recognition confidence levels in the second recognition confidence set are less than the threshold, generate the language recognition result according to a natural language understanding confidence obtained by performing natural language understanding on the input speech.
In the above manner, when it is difficult to recognize the language of the input speech based on the second recognition confidence levels, the language of the input speech is determined according to the natural language understanding confidence, thereby improving the language recognition capability. As a specific implementation, for example, the language whose natural language understanding confidence exceeds the threshold is taken as the recognized language of the input speech.
Optionally, the language identification module is further configured to: recognize the input speech to obtain an initial confidence set; and multiply a plurality of initial confidence levels in the initial confidence set by a plurality of preset weights in a preset weight set respectively to obtain the first recognition confidence set. The language confidence correction module is further configured to: when a second recognition confidence level greater than the threshold exists in the second recognition confidence set, update the preset weight set so that the preset weight of the recognized language whose second recognition confidence level is greater than the threshold is increased relative to the preset weights of the other languages.
In the above manner, when a second recognition confidence level greater than the threshold exists in the second recognition confidence set, that is, when the language of the input speech can be determined, the preset weight set is updated as described above. Thus, when subsequent input speech is processed, the updated preset weight set is used, so that the accuracy of language recognition can be improved and the language recognition capability enhanced.
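One way to realize the weight update described above is sketched below; the relative increase factor and the renormalization step are assumptions made for illustration.

```python
def update_preset_weights(weights, recognized_language, factor=1.1):
    """Increase the recognized language's weight relative to the others."""
    updated = dict(weights)
    updated[recognized_language] *= factor
    total = sum(updated.values())  # renormalize so the weights stay comparable
    return {lang: w / total for lang, w in updated.items()}

weights = {"Chinese": 0.2, "English": 0.2, "Korean": 0.2, "German": 0.2, "Japanese": 0.2}
weights = update_preset_weights(weights, "Chinese")
# Chinese's weight is now larger than every other language's
```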
Optionally, the language confidence correction module is further configured to: perform correction calculation on the preset weight set to obtain a corrected weight set; and when a plurality of corrected weights in the corrected weight set are within the weight range, update the preset weight set with the values of the plurality of corrected weights.
If a preset weight falls outside the weight range, the reliability of the language recognition result obtained from that preset weight is relatively low. Therefore, correcting the preset weights within the weight range in the above manner can suppress the language misrecognition rate.
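The in-range check can be sketched as follows. This is illustrative only: a single shared range is assumed here, whereas the embodiments may use per-language ranges.

```python
def apply_corrected_weights(preset, corrected, weight_range):
    """Accept the corrected weight set only if every weight lies in the range."""
    lo, hi = weight_range
    if all(lo <= w <= hi for w in corrected.values()):
        return dict(corrected)
    return dict(preset)  # out of range: keep the previous preset weights

preset = {"Chinese": 1.0, "English": 1.0}
ok = apply_corrected_weights(preset, {"Chinese": 1.5, "English": 0.8}, (0.5, 2.0))
rejected = apply_corrected_weights(preset, {"Chinese": 3.0, "English": 0.8}, (0.5, 2.0))
```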
Optionally, the language identification module is further configured to: recognize the input speech to obtain an initial confidence set; and multiply a plurality of initial confidence levels in the initial confidence set by a plurality of preset weights in the preset weight set respectively to obtain the first recognition confidence set. The language confidence correction module is further configured to set the preset weight set according to scene characteristics.
In the above manner, the preset weight set is set according to the scene characteristics, so that different scenes can be accommodated and the language recognition result can be obtained with preset weights that suit the scene as well as possible, improving the language recognition capability.
Optionally, the scene characteristics include environment characteristics and/or audio collector characteristics.
Optionally, the environment characteristics include an environmental signal-to-noise ratio, power supply direct/alternating current information, or an environmental vibration amplitude, and the audio collector characteristics include microphone arrangement information. The microphone arrangement information indicates whether a single microphone or a microphone array is used and, in the case of a microphone array, whether it is a linear array, a planar array, or a stereo array.
The environmental signal-to-noise ratio, the power supply direct/alternating current information, the environmental vibration amplitude, and the microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
Optionally, the language confidence correction module is further configured to: obtain a plurality of test weight sets; input a simulated-environment data set into the language recognition model, the simulated-environment data set being obtained from the scene characteristics and a noise-free data set; obtain, from the initial confidence sets output by the language recognition model, a plurality of first recognition confidence sets corresponding to the plurality of test weight sets; calculate the prediction accuracy of each of the plurality of first recognition confidence sets according to the language information of the simulated-environment data set; determine, among the plurality of test weight sets, the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy as the optimal test weight set; and set the preset weight set with the values of the plurality of test weights in the optimal test weight set. Accordingly, it can be said that the language confidence correction module has a preset weight setting module.
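The selection of the optimal test weight set can be sketched as follows, with the language recognition model's output replaced by precomputed initial confidence sets. All names and values are illustrative assumptions.

```python
def choose_optimal_weight_set(test_weight_sets, initial_confidence_sets, true_languages):
    """Return the test weight set whose weighted confidences predict most accurately."""
    def accuracy(weights):
        correct = 0
        for initial, truth in zip(initial_confidence_sets, true_languages):
            # First recognition confidence = initial confidence x test weight
            weighted = {lang: c * weights[lang] for lang, c in initial.items()}
            if max(weighted, key=weighted.get) == truth:
                correct += 1
        return correct / len(true_languages)
    return max(test_weight_sets, key=accuracy)

test_sets = [{"Chinese": 2.0, "English": 1.0}, {"Chinese": 1.0, "English": 1.0}]
initials = [{"Chinese": 0.4, "English": 0.6}]  # model output for one test utterance
best = choose_optimal_weight_set(test_sets, initials, ["Chinese"])
# -> the first set: only it weights Chinese above English for this utterance
```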
Optionally, the language confidence correction module is further configured to set the plurality of preset weights within the weight range.
Optionally, the weight range is set as follows: obtain a plurality of test data sets; obtain a plurality of test weight sets; input the test data sets into the language recognition model; obtain, from the initial confidence sets output by the language recognition model and the plurality of test weight sets, a plurality of first recognition confidence sets corresponding to the plurality of test weight sets; calculate the prediction accuracy of each of the plurality of first recognition confidence sets according to the language information of the test data sets; determine, among the plurality of test weight sets, the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy as the optimal test weight set; obtain the optimal test weight sets of the plurality of test data sets; and obtain the weight ranges of the plurality of languages from the optimal test weight sets of the plurality of test data sets. The function of setting the weight range may be implemented by the language recognition apparatus 10, in which case the language recognition apparatus 10 can be said to have a weight range setting module, or it may be implemented by a test apparatus that tests the language recognition apparatus 10.
In the above manner, the language recognition model is tested with a large number of language data sets to set the weight ranges of the multilingual preset weight set; that is, the robustness range of the language recognition model is specified so that the model operates within this range, thereby ensuring the reliability of the language recognition results.
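The embodiment does not specify how the weight ranges are derived from the optimal test weight sets; one plausible reading, taking the per-language minimum and maximum over all optimal sets, is sketched below purely as an assumption.

```python
def derive_weight_ranges(optimal_weight_sets):
    """Per language, the range spanned by its weight across all optimal sets."""
    languages = optimal_weight_sets[0].keys()
    return {lang: (min(ws[lang] for ws in optimal_weight_sets),
                   max(ws[lang] for ws in optimal_weight_sets))
            for lang in languages}

optimal_sets = [{"Chinese": 1.8, "English": 0.9},
                {"Chinese": 2.2, "English": 1.1}]
ranges = derive_weight_ranges(optimal_sets)  # e.g. Chinese -> (1.8, 2.2)
```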
An embodiment of the present application provides a computing device, which includes a processor and a memory. The memory stores program instructions that, when executed by the processor, cause the processor to perform the above speech processing method and language recognition method. The computing device can be further understood from the description given below with reference to FIG. 17.
An embodiment of the present application provides a computer-readable storage medium storing program instructions that, when executed by a computer, cause the computer to perform the above speech processing method and language recognition method.
An embodiment of the present application provides a computer program that, when executed by a computer, causes the computer to perform the above speech processing method and language recognition method.
FIG. 8 schematically illustrates a flowchart of a voice interaction method provided by an embodiment of the present application. Some steps of the voice interaction method are the same as those of the above language recognition method; here, the same content is denoted by the same reference numerals and its description is simplified.
As shown in FIG. 8, in the voice interaction method, first, in step S10, the user's input speech information is acquired, for example the user's input speech received by a microphone. Then, on the one hand, in step S40, automatic speech recognition is performed on the input speech using speech recognition models; on the other hand, in step S12, language recognition is performed on the input speech using a language recognition model. In other embodiments, the automatic speech recognition and the language recognition may also be performed sequentially.
In addition, in step S40, in order to be able to recognize input speech in multiple languages, speech content recognition processing is performed on the input speech using speech recognition models of a plurality of different languages (five languages in this embodiment: Chinese, English, Korean, German, and Japanese), obtaining a plurality of texts Ti in the different languages.
Then, in step S42, the plurality of texts Ti are input into a text translation model, which performs translation processing on these texts Ti and converts them into texts Ai in a target language (for example, Chinese).
Then, in step S44, the plurality of texts Ai are sequentially input into a semantic understanding model, which performs semantic understanding processing on these texts Ai to obtain a plurality of corresponding candidate commands Oi. A candidate command is a command that has not yet been confirmed for execution.
In addition, in step S12, the input speech is recognized to obtain a multilingual first recognition confidence set, in which a plurality of first recognition confidence levels correspond to the plurality of languages respectively. For example, the multilingual first recognition confidence set is {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}.
In step S14, it is judged whether a first recognition confidence level greater than a threshold exists in the multilingual first recognition confidence set. The threshold here may be set to 0.8, for example. When the judgment result is "Yes", that is, when a first recognition confidence level greater than the threshold exists (for example, the first recognition confidence level of Chinese is 0.9), in step S16, the language whose first recognition confidence level is greater than the threshold (for example, Chinese) is determined, as the recognition result, to be the recognized language of the input speech.
Then, in step S26, the candidate command corresponding to the recognized language (for example, Chinese) is selected from the plurality of candidate commands Oi obtained in step S44 as the target command to be executed, and processing is then performed so that the target command is executed. For example, when the target command is "turn on the air conditioner", corresponding control is executed to turn on the air conditioner.
In addition, when the judgment result in step S14 is "No", that is, when no first recognition confidence level greater than the threshold exists, or the plurality of first recognition confidence levels are less than the threshold, in step S18, the first recognition confidence levels are corrected according to the user characteristics. The details of this correction have been described above and are not repeated here.
Then, in step S22, it is judged whether a second recognition confidence level greater than the threshold exists. When the judgment result is "Yes", in step S24, the language whose second recognition confidence level is greater than the threshold is determined as the recognized language of the input speech, after which the processing in step S26 is performed.
In addition, when the judgment result in step S22 is "No", that is, when no second recognition confidence level greater than the threshold exists, or the plurality of second recognition confidence levels are less than the threshold, in step S28, it is judged whether an ASR confidence greater than the threshold exists. When the judgment result is "Yes", in step S30, the language whose ASR confidence is greater than the threshold is determined as the recognized language, after which the processing in step S26 is performed.
When the judgment result in step S28 is "No", that is, when no ASR confidence greater than the threshold exists, or the plurality of ASR confidence levels are less than the threshold, in step S32, it is judged whether an NLU confidence greater than the threshold exists. When the judgment result is "Yes", in step S34, the language whose NLU confidence is greater than the threshold is determined as the recognized language, after which the processing in step S26 is performed.
When the judgment result in step S32 is "No", information indicating that speech content recognition has failed may be output, and the processing ends.
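The decision cascade of steps S14, S22, S28, and S32 can be condensed into the following sketch. The confidence sets are taken as given inputs, and all names are illustrative assumptions.

```python
def decide_language(first_conf, second_conf, asr_conf, nlu_conf, threshold=0.8):
    """Try each confidence set in turn; return the first language to clear the threshold."""
    for confidences in (first_conf, second_conf, asr_conf, nlu_conf):
        best = max(confidences, key=confidences.get)
        if confidences[best] > threshold:
            return best
    return None  # step S32 answered "No": speech content recognition failed

lang = decide_language({"Chinese": 0.5, "English": 0.4},   # S14: no winner
                       {"Chinese": 0.9, "English": 0.1},   # S22: Chinese clears 0.8
                       {"Chinese": 0.95}, {"Chinese": 0.9})
print(lang)  # -> Chinese
```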
With the above voice interaction method, in the language recognition, the second recognition confidence levels are calculated from the first recognition confidence levels according to the user characteristics, and the language recognition result is determined according to the second recognition confidence levels. In this way, the language recognition capability, and in turn the voice interaction capability, can be improved.
FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application. As shown in FIG. 9, the voice interaction system (or voice interaction apparatus) 20 has a speech recognition module 110, a language identification module 12, a text translation module 130, a semantic understanding module 140, an input speech acquisition module 17, a language confidence correction module 16, and a control module 170. The voice interaction apparatus is configured to perform the voice interaction method described with reference to FIG. 8; the description of the specific processing flow is therefore omitted here. In addition, like the above language recognition apparatus 10, the voice interaction system 20 has the language identification module 12, the input speech acquisition module 17, and the language confidence correction module 16, which are denoted by the same reference numerals here and whose detailed description is omitted. The voice interaction system may further include execution devices, such as a loudspeaker and a display device.
The correspondence between the voice interaction system 20 and the steps of the above voice interaction method is briefly described below.
The speech recognition module 110 performs step S40 in FIG. 8. The language identification module 12 performs step S12 in FIG. 8. The text translation module 130 performs step S42 in FIG. 8. The semantic understanding module 140 performs step S44 in FIG. 8. The input speech acquisition module 17 performs step S10 in FIG. 8. The language confidence correction module 16 performs step S18 in FIG. 8. The control module 170 performs steps S14, S16, S22, S24, S28, S30, S32, and S34 in FIG. 8. In addition, steps S14, S16, S22, and S24 may also be performed by the language confidence correction module 16.
In addition, as can be seen from the above description, the voice interaction method described with reference to FIG. 8 essentially includes a multilingual speech recognition method capable of recognizing input speech in a plurality of languages, and the voice interaction apparatus described with reference to FIG. 9 also includes a speech recognition apparatus that performs this multilingual speech recognition method. Because much of the content would be repeated, separate embodiments of the speech recognition method and the speech recognition apparatus are not described here.
The voice interaction system 100 according to an embodiment of the present application and the voice interaction method performed by it are described below with reference to FIGS. 11-17.
In this embodiment, an example in which the voice interaction system 100 is applied to an automobile to constitute an in-vehicle voice interaction system is described. However, the present application is not limited thereto and may also be applied to other scenarios, such as smart homes, intelligent robots, intelligent voice question answering, intelligent voice analysis, and real-time voice monitoring and analysis. In addition, the in-vehicle voice interaction system also constitutes a vehicle control apparatus. It can be understood that, through the voice interaction method described above and the voice interaction system described in this embodiment, the embodiments of the present application provide a speech processing method, apparatus, and system.
<System architecture>
The voice interaction system 100 of this embodiment can receive input speech from a user (that is, a speaker) and perform corresponding processing in response to the content of the input speech, such as turning on the air conditioner or opening a car window. Moreover, the voice interaction system 100 can respond to speech in a plurality of different languages; for example, in this embodiment it can respond to speech in five languages: Chinese, English, Korean, German, and Japanese.
The speech of different languages referred to here includes both speech of different language families (for example, Chinese and English belong to different languages) and speech of different variants within the same family (for example, Mandarin and Cantonese within Chinese also belong to different languages).
FIG. 11 is a schematic illustration of a voice interaction system according to an embodiment of the present application. The voice interaction system 100 has a speech recognition module 110, a language identification module 120, a text translation module 130, a semantic understanding module 140, a command parsing and execution module 150, and a language confidence correction module 160. In addition, the voice interaction system 100 may further have a microphone, a loudspeaker, a camera, a display, and the like.
FIG. 12 is a flowchart illustrating one processing flow of the voice interaction method involved in an embodiment. A processing flow of the voice interaction system 100 is described below with reference to FIG. 12 to outline the architecture of the voice interaction system 100.
As shown in FIG. 12, when the user utters a segment of speech S, the voice interaction system 100 acquires the speech (referred to as the input speech) through the microphone. On the one hand, ① the input speech is input into the speech recognition module 110, which calls speech recognition sub-modules of a plurality of different languages (five languages in this embodiment: Chinese, English, Korean, German, and Japanese; it goes without saying that another number of languages is also possible) to perform speech content recognition processing on the input speech, obtaining a plurality of texts Ti in the different languages. ② The plurality of texts Ti are input into the text translation module 130, which performs translation processing on these texts Ti and converts them into texts Ai in a target language (for example, Chinese). ③ The plurality of texts Ai are sequentially input into the semantic understanding module 140, which performs semantic understanding processing on these texts Ai to obtain a plurality of corresponding candidate commands Oi.
On the other hand, ④ the user's input speech is also input into the language identification module 120, which performs language identification processing on the input speech, generates initial confidence levels for the plurality of languages, and multiplies each initial confidence level by the corresponding preset weight to obtain recognition confidence levels for the plurality of languages.
⑤ When a recognition confidence level greater than the threshold λ exists among the recognition confidence levels of the plurality of languages, the language of the input speech can be considered to be the language whose recognition confidence level is greater than the threshold λ. The command parsing and execution module 150 determines, among the plurality of candidate commands Oi, the candidate command Oi corresponding to that language as the target command to be executed, and performs corresponding processing according to the content of the target command.
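Step ⑤ can be sketched as follows; the function name, the data layout, and the example commands are illustrative assumptions only.

```python
def select_target_command(recognition_confidences, candidate_commands, lam=0.8):
    """Pick the candidate command of the language whose confidence exceeds the threshold λ."""
    best = max(recognition_confidences, key=recognition_confidences.get)
    if recognition_confidences[best] > lam:
        return candidate_commands[best]
    return None  # no language cleared λ: handled by the confidence correction module

commands = {"Chinese": "turn on the air conditioner", "English": "open the window"}
target = select_target_command({"Chinese": 0.9, "English": 0.1}, commands)
print(target)  # -> turn on the air conditioner
```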
In addition, when no recognition confidence level greater than the threshold λ exists among the recognition confidence levels of the plurality of languages, the language confidence correction module 160 corrects the recognition confidence levels of the plurality of languages according to the user characteristics and the like; the details are described later.
The structure of each component of the voice interaction system 100 is described below.
<Structure>
In this embodiment, the speech recognition module 110, the language identification module 120, the text translation module 130, and the semantic understanding module 140 include algorithm models, namely a speech recognition model, a language identification model, a text translation model, and a semantic understanding model, which perform the speech recognition processing, language identification processing, text translation processing, and semantic understanding processing, respectively.
The speech recognition module 110 is configured to convert human speech, that is, the speech to be recognized, into text of the corresponding language; in other words, it predicts the content of the speech, or performs automatic speech recognition (Automatic Speech Recognition, ASR). Here, the speech recognition module 110 has a plurality of speech recognition sub-modules, each corresponding to one language and configured to convert the speech into a text Ti of the corresponding language. For example, in this embodiment, there are speech recognition sub-modules for five languages, Chinese, English, Korean, German, and Japanese, which respectively convert the input speech into a Chinese text T1, an English text T2, a Korean text T3, a German text T4, and a Japanese text T5. After completing recognition, these sub-modules output the text Ti as the prediction result together with the confidence of that text Ti, called the ASR confidence, which represents the prediction probability of the text predicted by the sub-module, in other words, of the speech content.
The text translation module 130 is configured to convert text in one natural language (the source language) into text in another natural language (the target language), for example converting English text into Chinese text. Here, the text translation module 130 has a plurality of text translation sub-modules, each corresponding to one language. For example, in this embodiment, Chinese is the target language; therefore, there are text translation sub-modules for four languages, English, Korean, German, and Japanese, which respectively translate English text, Korean text, German text, and Japanese text into Chinese texts Ai. In addition, when a Chinese text is input into the text translation module 130, since Chinese is the target language of the translation, the text translation module 130 may output the input Chinese text as-is without processing it. The text translation module 130 ultimately outputs five Chinese texts Ai.
The semantic understanding module 140 is configured to perform natural language understanding (Natural Language Understanding, NLU) on the text of the target language; in other words, it predicts the intent of the text and generates a command that can be understood by the machine. For example, if the text is "please play the song 'XX'", the semantic understanding module 140 enables the machine to obtain the intent "please play the song 'XX'". While generating the command, the semantic understanding module 140 also generates an NLU confidence, which represents the module's prediction probability for the intent of the text. In addition, since the speech recognition module 110 outputs texts in five languages, the semantic understanding module 140 ultimately generates commands corresponding to the five languages and five NLU confidence levels. Furthermore, the commands output by the semantic understanding module 140 have not yet been determined for execution and are therefore called candidate commands.
The language identification (LID) module is used to identify the language of the user's input speech (the speech to be recognized), that is, to predict which of multiple languages the input speech belongs to. For example, in this embodiment, the language identification module 120 identifies which of Chinese, English, Korean, German, and Japanese the input speech belongs to, and outputs as the recognition result a set of recognition confidence levels for the multiple languages; each recognition confidence represents the predicted probability, as estimated by the language identification module 120, that the input speech belongs to the corresponding language. In addition, in the language identification model of this embodiment, algorithmic recognition is performed on the input speech to obtain confidence levels for the multiple languages (referred to as initial confidence levels), and the initial confidence levels of the multiple languages are multiplied by the corresponding preset weight values to obtain the recognition confidence levels of the multiple languages; the language identification module 120 outputs these recognition confidence levels as the prediction result. The multiplication of the initial confidence levels by the preset weight values may or may not be performed by the language identification model itself.
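The weighting step described here can be sketched as follows; the language codes, numeric values, and function name are illustrative assumptions, not part of the patent:

```python
# Minimal sketch of combining initial LID confidences with preset weights.
# All names and numbers below are hypothetical examples.
def weighted_confidences(initial, weights):
    """Multiply each language's initial confidence by its preset weight."""
    return {lang: initial[lang] * weights[lang] for lang in initial}

initial = {"zh": 0.6, "en": 0.3, "ko": 0.05, "de": 0.03, "ja": 0.02}
weights = {"zh": 1.5, "en": 0.9, "ko": 1.0, "de": 1.0, "ja": 1.0}
recognition = weighted_confidences(initial, weights)
```

Whether this multiplication happens inside or outside the language identification model is left open by the embodiment; the sketch treats it as a separate step.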
The command parsing and execution module 150 is used to select, according to the output of the language identification module 120, a target command to be executed from the candidate commands output by the semantic understanding module 140. In this embodiment, when a recognition confidence greater than a threshold λ (for example, set to 0.8 or above) exists among the recognition confidence levels of the multiple languages output by the language identification module 120, the command parsing and execution module 150 determines the language whose recognition confidence is greater than the threshold λ as the language of the user's input speech, and determines the candidate command corresponding to that language as the target command to be executed. For example, when the confidence levels of the multiple languages output by the language identification module 120 are {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}, Chinese is determined as the language of the user's input speech, and among the candidate commands in the five languages output by the semantic understanding module 140, the candidate command corresponding to Chinese is determined as the target command to be executed.
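The threshold test in this selection step can be illustrated with a short sketch; the threshold value, language codes, and command strings are assumed for illustration only:

```python
# Minimal sketch of selecting the target command when one language's
# recognition confidence exceeds the threshold λ (assumed here to be 0.8).
LAMBDA = 0.8

def select_target_command(confidences, candidates, threshold=LAMBDA):
    """Return (language, command) if some confidence exceeds threshold, else None."""
    lang = max(confidences, key=confidences.get)
    if confidences[lang] > threshold:
        return lang, candidates[lang]
    return None  # fall back to the correction strategies (methods one to three)

conf = {"zh": 0.9, "en": 0.1, "ko": 0.0, "de": 0.0, "ja": 0.0}
cmds = {lang: f"play-song[{lang}]" for lang in conf}
print(select_target_command(conf, cmds))  # ('zh', 'play-song[zh]')
```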
After determining the target command to be executed, the command parsing and execution module 150 performs control for causing the target command to be executed. For example, when the determined target command is "please play the song 'XX'" and the voice interaction system 100 has a music playing module, the command parsing and execution module 150 controls the music playing module to play the song 'XX'. If the music playing module does not belong to the voice interaction system 100 and is not controlled by the command parsing and execution module 150, the determined target command may instead be sent to a higher-level controller shared by the voice interaction system 100 and the music playing module, and that higher-level controller sends a command to the controller of the music playing module so that the song 'XX' is played.
In addition, after determining the target command to be executed, the command parsing and execution module 150 may respond to the user through a speaker or a display. For example, when the determined target command is "please play the song 'XX'", the command parsing and execution module 150 controls the speaker to play the sound "OK, playing it for you now" in response to the user.
On the other hand, in this embodiment, when no recognition confidence greater than the threshold λ (for example, 0.8) exists in the set of recognition confidence levels of the multiple languages output by the language identification module 120, the command parsing and execution module 150 determines the target command to be executed in other ways, which are illustrated below.
Method one
The command parsing and execution module 150 performs correction calculation on the recognition confidence levels of the above multiple languages (corresponding to the "first recognition confidence" in this application), and performs corresponding processing according to the output of the language confidence correction module 160. For example, the correction may be performed according to user characteristics, which here include the user's historical language records and user-specified languages. For the historical language records and user-specified languages, the user's identity may be determined from audio features (i.e., the voiceprint), and the records are then obtained by querying, according to the user's identity, the historical language record database and the user-specified language database of the voice interaction system 100. The specific content of these corrections will be described in detail later. After the corrected recognition confidence levels are obtained, the command parsing and execution module 150 determines the language of the input speech according to the corrected recognition confidence levels and performs the corresponding processing. For example, when a recognition confidence greater than the threshold λ exists among the corrected recognition confidence levels of the multiple languages, the language whose recognition confidence is greater than the threshold λ is determined as the language of the user's input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
As a specific implementation of the correction calculation on the recognition confidence in method one, the values of the recognition confidence levels may be corrected directly, or the preset weights may be corrected, after which the recognition confidence levels are recalculated from the set of initial confidence levels and the set of corrected preset weights.
In this embodiment, the language confidence correction module 160 has an audio-feature-based adjustment module 162, a video-feature-based adjustment module 163, and a comprehensive adjustment module 164, which are used to correct the language confidence in different ways.
Method two
The command parsing and execution module 150 determines the target command to be executed according to the ASR confidence output by the speech recognition module 110 or the NLU confidence output by the semantic understanding module 140. For example, when an ASR confidence greater than an ASR confidence threshold (which may be set to the same value as the above threshold λ, for example 0.8) exists, the language corresponding to that ASR confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed. Alternatively, when an NLU confidence greater than an NLU confidence threshold (which may likewise be set to the same value as the threshold λ, for example 0.8) exists, the language corresponding to that NLU confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
The execution timing of method two may be set freely. Optionally, it may be executed when the language of the input speech still cannot be determined after method one is executed (that is, when no recognition confidence greater than the threshold λ exists even among the recognition confidence levels corrected according to user characteristics and the like), or before method one is executed, or between the multiple approaches enumerated in the description of method one.
Method three
The command parsing and execution module 150 determines the language of the input speech by feature similarity. For example, the current input speech is compared with the audio data of historical input speech in the history records, and the feature similarity between the two is obtained by cosine similarity, linear regression, deep learning, or the like; when the feature similarity exceeds a threshold, the recognized language of the historical input speech may be determined as the language of the current input speech.
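As one concrete instance of the comparison named here (cosine similarity is one of the listed methods), a minimal sketch follows; the feature vectors and the similarity threshold are assumptions for illustration:

```python
# Minimal sketch of the cosine-similarity comparison between the current
# utterance and a historical utterance; vectors and threshold are invented.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

current = [0.2, 0.8, 0.1]        # hypothetical features of the current utterance
historical = [0.25, 0.75, 0.05]  # hypothetical features of a past utterance
SIM_THRESHOLD = 0.95             # assumed threshold
# If similar enough, reuse the language recorded for the historical utterance.
reuse_history = cosine_similarity(current, historical) > SIM_THRESHOLD
```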
The execution timing of method three may likewise be set freely. Optionally, it may be executed after or before methods one and two, between methods one and two, or among the multiple approaches enumerated in the description of method one.
The confidence correction module is described below.
The language confidence correction module 160 includes a real-time scene adaptation module 161, an audio-feature-based adjustment module 162, a video-feature-based adjustment module 163, and a comprehensive adjustment module 164.
The real-time scene adaptation module 161 is used to initialize the multilingual preset weight set according to environmental characteristics and the characteristics of the audio collector (i.e., the microphone) when the language identification model first comes into contact with the scene. First contact with the scene occurs, for example, when the user has just purchased the voice interaction system or the vehicle; at that time, the user generally turns on the voice interaction system to perform some basic settings or tests. In this embodiment, the real-time scene adaptation module 161 can use this opportunity to initialize the preset weight set. In addition, in other embodiments, the initialization of the preset weight set is not limited to being performed at first contact with the scene; it may also be performed at other appropriate times, for example when a new audio collector is installed, or at a time chosen by the user.
The video-feature-based adjustment module 163 is used to correct the set of recognition confidence levels of the multiple languages according to captured images of the user.
The audio-feature-based adjustment module 162 is used to correct the set of recognition confidence levels of the multiple languages according to the user's voice information. Specifically, the user's historical language records may be obtained by querying a database of the voice interaction system 100 according to the voice information (voiceprint), and the set of recognition confidence levels of the multiple languages is corrected according to the historical language records.
The comprehensive adjustment module 164 is mainly used to correct the set of recognition confidence levels of the multiple languages according to the user-specified language. The user-specified language is obtained by querying a database of the voice interaction system 100 according to the voiceprint of the input speech.
When a recognition confidence greater than the threshold λ exists among the corrected recognition confidence levels of the multiple languages, the confidence correction module performs correction calculation on the multilingual preset weight set so that the preset weight of the language whose recognition confidence is greater than the threshold λ is increased relative to the preset weights of the other languages, thereby obtaining a corrected weight set. Afterwards, the confidence correction module judges whether each corrected weight in the corrected weight set is within the weight range; when the judgment result is that they are within the weight range, the values in the corrected weight set are used to update the preset weight set for use by the language identification module 120 in subsequent language identification.
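One possible reading of this update rule, sketched with an assumed boost factor, an assumed renormalization, and an assumed weight range (the patent only says the actual range is determined by testing):

```python
# Hypothetical sketch of the weight update: boost the winning language's
# preset weight, then adopt the corrected set only if every weight stays
# within the allowed range. BOOST and WEIGHT_RANGE are invented values.
WEIGHT_RANGE = (0.05, 0.60)
BOOST = 1.2

def correct_weights(preset, winner, boost=BOOST):
    """Increase the winning language's weight relative to the others."""
    corrected = {l: (w * boost if l == winner else w) for l, w in preset.items()}
    total = sum(corrected.values())
    return {l: w / total for l, w in corrected.items()}

def maybe_update(preset, winner):
    """Adopt the corrected weights only if all of them are in range."""
    corrected = correct_weights(preset, winner)
    lo, hi = WEIGHT_RANGE
    if all(lo <= w <= hi for w in corrected.values()):
        return corrected
    return preset  # out of range: keep the existing preset weight set

preset = {"zh": 0.21, "en": 0.19, "ko": 0.22, "ja": 0.20, "de": 0.18}
updated = maybe_update(preset, "zh")
```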
The functions of the speech recognition module 110, the language identification module 120, the text translation module 130, the semantic understanding module 140, the command parsing and execution module 150, and the confidence correction module may be implemented by a processor executing a program (software) stored in a memory, or by hardware such as an LSI (Large Scale Integration) circuit or an ASIC (Application Specific Integrated Circuit).
Typically, these modules may be constituted by electronic control units (ECUs). Optionally, one module may be constituted by one ECU or by multiple ECUs, or one ECU may constitute multiple modules.
An ECU is a control device composed of integrated circuits that implements a series of functions such as analyzing, processing, and sending data. As shown in Figure 17, an embodiment of the present application provides an electronic control unit (ECU) that includes a microcomputer, an input circuit, an output circuit, and an analog-to-digital (A/D) converter.
The main function of the input circuit is to preprocess input signals (for example, signals from sensors); different input signals call for different processing methods. Specifically, since there are two types of input signals, analog and digital, the input circuit may include an input circuit that processes analog signals and an input circuit that processes digital signals.
The main function of the A/D converter is to convert analog signals into digital signals. After being preprocessed by the corresponding input circuit, an analog signal is input to the A/D converter, where it is converted into a digital signal accepted by the microcomputer.
The output circuit is a device that establishes the connection between the microcomputer and the actuators. Its function is to convert the processing results issued by the microcomputer into control signals to drive the actuators. The output circuit generally uses power transistors, which switch on or off according to the microcomputer's instructions to control the electronic circuit of the actuating element.
The microcomputer includes a central processing unit (CPU), a memory, and an input/output (I/O) interface; the CPU is connected to the memory and the I/O interface through a bus, over which they can exchange information with one another. The memory may be a read-only memory (ROM), a random access memory (RAM), or the like. The I/O interface is a connection circuit for exchanging information between the CPU and the input circuit, the output circuit, or the A/D converter; specifically, I/O interfaces can be divided into bus interfaces and communication interfaces. The memory stores programs; by calling the programs in the memory, the CPU can implement the functions of the above modules, or execute the methods described with reference to Figures 3, 4, 6, 8, 12, and so on.
In addition, as described above, the voice interaction system 100 also has a microphone, a speaker, a camera, or a display. The microphone is used to acquire the user's input speech, corresponding to the speech acquisition module in this application. The speaker is used to play sounds, for example playing the response tone "OK" to the user's input speech. The camera is used to capture the user's facial image and the like, and sends the captured image to the command parsing and execution module 150, which can perform image recognition on the image so as to authenticate the user's identity. The display is used to respond according to the user's input speech; for example, when the input speech is "play the song 'XX'", the display shows the playback screen of that song.
<Actions and processing flow>
The voice interaction system 100 is described in more detail below in conjunction with a description of its actions and processing flow. Together with that description, the voice interaction method involved in this embodiment is also described, and, as the following explanation makes clear, the voice interaction method includes a language identification method (corresponding to the processing of the language identification module 120, part of the processing of the command parsing and execution module 150, the processing of the confidence correction module, and so on).
As described above, the language identification module 120 uses the language identification model to perform language identification. In this embodiment, as shown in step S210 of Figure 14, when the trained language identification model first comes into contact with the scene, the real-time scene adaptation module 161 initializes the multilingual preset weight set according to environmental characteristics and audio collector characteristics. An example of the initialization method is described below with reference to Figure 13.
As shown in Figure 13, the real-time scene adaptation module 161 generates a quasi-environment data set according to the environmental characteristics, the audio collector characteristics, and an expert data set. Here, the environmental characteristics include, for example, the environmental signal-to-noise ratio, microphone power source information (DC/AC information), or the environmental vibration amplitude. The microphone power source information can be obtained, for example, through a Controller Area Network (CAN) signal of the vehicle. The audio collector characteristics mainly include microphone arrangement information (a single microphone or a microphone array, where microphone arrays include linear arrays, planar arrays, and stereo arrays). The expert data set is a pre-collected batch of multi-speaker, multilingual, noise-free audio data whose content (the language of each piece of speech data) is recorded in advance and therefore known.
In addition, N different multilingual confidence weight sets, confidence weight set 1 through confidence weight set N, are randomly initialized; see, for example, confidence weight set 1 {Chinese: 0.80; English: 0.04; Korean: 0.06; Japanese: 0.05; German: 0.05}, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18}, and confidence weight set N {Chinese: 0.31; English: 0.09; Korean: 0.12; Japanese: 0.25; German: 0.23} illustrated in Figure 13.
The quasi-environment data set is input into the language identification model to obtain a multilingual initial confidence set ({Chinese: p1; English: p2; Korean: p3; Japanese: p4; German: p5} in Figure 13), and the initial confidence levels are multiplied by the N multilingual confidence weight sets 1 through N to obtain N recognition confidence sets. Since the content of the expert data set (the language of each piece of speech data) is known, the accuracy acc of each of the N recognition confidence sets can be calculated; the confidence weight set corresponding to the recognition confidence set with the highest accuracy is determined as the optimal confidence weight set, and the preset weight set is set to the values of that optimal confidence weight set, completing the initialization of the preset weight set. For example, in Figure 13, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18} corresponds to the highest accuracy (0.98); therefore, the preset weight set is set to {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18}.
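The Figure 13 procedure (score each candidate weight set by its labeled accuracy on the quasi-environment data, keep the best) can be sketched as follows; the sample data, language codes, and function names are invented for illustration:

```python
# Minimal sketch of the initialization: among N candidate weight sets,
# keep the one whose weighted predictions best match the known labels.
LANGS = ["zh", "en", "ko", "ja", "de"]

def accuracy(weight_set, samples):
    """Fraction of labeled samples whose weighted argmax is the true language."""
    hits = 0
    for initial_conf, true_lang in samples:
        scores = {l: initial_conf[l] * weight_set[l] for l in LANGS}
        if max(scores, key=scores.get) == true_lang:
            hits += 1
    return hits / len(samples)

def init_preset_weights(candidate_sets, samples):
    """Pick the candidate weight set with the highest labeled accuracy."""
    return max(candidate_sets, key=lambda w: accuracy(w, samples))

candidates = [
    {"zh": 1.0, "en": 1.0, "ko": 1.0, "ja": 1.0, "de": 1.0},  # uniform
    {"zh": 0.5, "en": 2.0, "ko": 1.0, "ja": 1.0, "de": 1.0},  # biased to English
]
samples = [  # (initial confidence set from the LID model, known true language)
    ({"zh": 0.6, "en": 0.3, "ko": 0.05, "ja": 0.03, "de": 0.02}, "zh"),
    ({"zh": 0.4, "en": 0.5, "ko": 0.05, "ja": 0.03, "de": 0.02}, "en"),
]
preset_weights = init_preset_weights(candidates, samples)
```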
With the above technical means, when the language identification model first comes into contact with the scene, the preset weight set is initialized according to the environmental characteristics and audio collector characteristics, thereby adjusting the recognition confidence of subsequent user input speech. The voice interaction system 100 can thus adapt to different scenes and perform language identification with recognition accuracy as close to optimal as possible, improving the reliability of the recognition results. That is, the above technical means can suppress the problem that a trained language identification model does not adapt well to a scene, resulting in low reliability of the recognition results.
In this embodiment, the preset weight set is initialized according to both the environmental characteristics and the audio collector characteristics; however, in other embodiments, the preset weight set may also be initialized according to only one of the two.
After the initialization of the preset weight set is completed, in step S212 of Figure 14, it is judged whether the preset weight of each language in the set preset weight set is within a weight range. The weight range is set in advance, and its specific values can be determined through testing, as will be described later.
When the preset weight of each language in the preset weight set set according to the quasi-environment data set is within the weight range, the recognition results of the language identification model are highly credible in that environment (the above environmental characteristics and audio collector characteristics). When a confidence weight set according to the quasi-environment data set is not within the weight range ("No" in step S212), the results of the language identification model in that environment are less credible. In that case, in this embodiment, as shown in steps S214 and S217 of Figure 14, the language of the user's input speech can be determined through the historical language records and the user-specified language.
Specifically, in step S214, it is judged whether a historical language record exists. When one does, the input speech in the history record is compared with the user's current input speech to obtain a feature similarity, from which the language of the user's current input speech is determined.
When there is no historical language record, in step S217, a query is made based on the voiceprint to judge whether the user has specified a language. When a user-specified language exists, the recognized language of the user's input speech is determined according to the user-specified language. There may be one or more user-specified languages. When there is only one, that language is determined as the recognized language of the input speech; when there are multiple, for example the language that appears most frequently may be determined as the recognized language of the input speech. For example, when the queried user-specified languages are {Chinese: 3 times; English: 1 time; German: 1 time}, Chinese is determined as the recognized language of the input speech. When there is no user-specified language, in step S219, the language of the input speech is determined according to the recognition result of the language identification model.
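The most-frequent rule in step S217 amounts to a simple majority count; the records below mirror the {Chinese: 3; English: 1; German: 1} example, with language codes assumed:

```python
# Minimal sketch of picking the most frequently specified language.
from collections import Counter

def pick_specified_language(records):
    """Return the language that appears most often among the specified languages."""
    return Counter(records).most_common(1)[0][0]

chosen = pick_specified_language(["zh", "zh", "zh", "en", "de"])
print(chosen)  # zh
```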
In Figure 14, step S214 is executed before step S217; however, there is no restriction on the execution order of the processing of determining the language through historical language records and the processing of determining the language through the user-specified language.
With the above technical means, when the preset weight set is not within the weight range, the language of the user's input speech is predicted according to historical language records or the user-specified language, which improves the reliability of the voice interaction system 100's prediction of the language of the input speech.
In addition, when the weight value of each language in the preset weight set set according to the quasi-environment data set is within the weight range ("Yes" in step S212) and the user's input speech is detected, in step S200, the language identification model is used to perform language identification on the input speech. In step S221, it is judged whether a recognition confidence greater than the threshold λ exists in the multilingual recognition confidence set obtained from the language identification model. When one exists ("Yes" in step S221), in step S222, the user's identity is determined through the voiceprint. In other embodiments, the user's identity may also be determined by means such as face recognition or iris recognition. Afterwards, in step S223, the user's historical language record and the language record of the current dialogue round are updated (that is, the current language is added to the records). Then, in step S225, the multilingual recognition confidence set is output to the command parsing and execution module 150 as the language identification result. Here, the current dialogue round refers to one period of continuously listening to (receiving) the user's input speech, for example the period from one power-on to power-off of the language identification system or the voice interaction system.
When no recognition confidence greater than the threshold λ exists in the multilingual recognition confidence set ("No" in step S221), the language confidence correction module 160 calls the audio-feature-based adjustment module 162 or the video-feature-based adjustment module 163 to correct the multilingual recognition confidence set. In this embodiment, the audio-feature-based adjustment module 162 is called first; when no recognition confidence greater than the threshold λ exists in the multilingual recognition confidence set corrected by the audio-feature-based adjustment module 162, the video-feature-based adjustment module 163 is then called for correction. In other embodiments, the video-feature-based adjustment module 163 may also be called first.
The processing performed by the audio-feature-based adjustment module 162 in this embodiment will be described below with reference to FIG. 15.
As shown in FIG. 15, in step S231, the user's identity is determined by voiceprint; in step S232, the user's historical language record is queried, and when a historical language record exists, the distribution of each language in the record is computed. For example, referring to the right-hand part of FIG. 15, when the historical language record is {Chinese: 8; English: 1; Korean: 0; Japanese: 1; German: 0}, the language distribution computed from it (which may also be described as normalizing the counts into weights) is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}. The time range covered by the historical language record can be set freely, for example the current dialogue turn, several days, several months, or longer.
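The count-to-distribution normalization of step S232 can be sketched as follows (the function and variable names are illustrative only, not from the patent):

```python
def language_distribution(history_counts):
    """Normalize per-language record counts into a distribution (step S232)."""
    total = sum(history_counts.values())
    if total == 0:
        return {lang: 0.0 for lang in history_counts}
    return {lang: count / total for lang, count in history_counts.items()}

# The example from FIG. 15: 8 Chinese, 1 English, 1 Japanese record.
counts = {"Chinese": 8, "English": 1, "Korean": 0, "Japanese": 1, "German": 0}
dist = language_distribution(counts)
# dist is {"Chinese": 0.8, "English": 0.1, "Korean": 0.0, "Japanese": 0.1, "German": 0.0}
```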
Here, as one approach, both the all-time historical language record and the historical language record of the current dialogue turn (abbreviated as the current dialogue-turn language record) may be pre-stored, and the processing in step S236 and the subsequent steps may be performed separately on each. When the results obtained from the two records conflict, the result obtained from the current dialogue-turn language record takes priority, on the consideration that it is relatively more credible. Alternatively, different weight values may be assigned to the two records for the computation in step S236 described below.
Then, in step S236, the multilingual recognition confidence set is corrected using the distribution of each language in the historical language record, yielding the corrected multilingual recognition confidence set (corresponding to the second recognition confidence in this application). For example, referring to the right-hand part of FIG. 15, when the initial multilingual confidence set is {Chinese: 0.7; English: 0.1; Korean: 0.1; Japanese: 0.05; German: 0.05}, the preset confidence weights are {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25}, and the language distribution is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}, the corrected recognition confidence set (after normalization) is computed as {Chinese: 0.973; English: 0.017; Korean: 0.000; Japanese: 0.010; German: 0.000}.
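The correction of step S236 can be sketched as follows. The patent does not state the exact combination rule; an elementwise product of initial confidence, preset weight, and history distribution followed by renormalization is assumed here, and it reproduces the FIG. 15 figures approximately (roughly 0.974 / 0.017 / 0.000 / 0.009 / 0.000):

```python
def correct_confidences(initial, preset_weights, distribution):
    """Step S236 sketch: combine initial confidences, preset weights, and the
    history-based language distribution, then renormalize. The elementwise
    product used here is an assumption, not stated in the patent."""
    raw = {lang: initial[lang] * preset_weights[lang] * distribution[lang]
           for lang in initial}
    total = sum(raw.values())
    return {lang: v / total for lang, v in raw.items()} if total > 0 else raw

initial = {"Chinese": 0.7, "English": 0.1, "Korean": 0.1, "Japanese": 0.05, "German": 0.05}
weights = {lang: 0.25 for lang in initial}
dist = {"Chinese": 0.8, "English": 0.1, "Korean": 0.0, "Japanese": 0.1, "German": 0.0}
corrected = correct_confidences(initial, weights, dist)
# Chinese is sharply boosted (≈ 0.974) and becomes the only candidate above λ.
```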
Then, in step S237, it is determined whether the corrected multilingual recognition confidence set contains a corrected confidence greater than the threshold λ. If so ("Yes" in step S237), then on the one hand, in step S239, the corrected multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result, and in addition the historical language record and the current dialogue-turn language record are updated (for the specific processing, see steps S222 and S223 in FIG. 14); on the other hand, in step S238, processing to adjust the confidence weights is performed. Specifically, the weights are revised according to the corrected recognition confidence set such that the confidence weight of the language whose corrected recognition confidence exceeds the threshold λ is increased relative to the confidence weights of the other languages. For example, referring to FIG. 15, the old confidence weight set {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25} is revised into a new confidence weight set (also called the corrected weight set) {Chinese: 0.29; English: 0.24; Korean: 0.24; Japanese: 0.24; German: 0.24}.
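The weight adjustment of step S238 can be sketched as follows; the rule of transferring a fixed step of 0.01 from each other language to the winning language is inferred from the single example in FIG. 15 and is an assumption:

```python
def adjust_weights(weights, winner, delta=0.01):
    """Step S238 sketch: increase the confidence weight of the language whose
    corrected confidence exceeded λ, at the expense of every other language.
    The per-language step size delta is inferred from FIG. 15 (an assumption)."""
    return {lang: w + delta * (len(weights) - 1) if lang == winner else w - delta
            for lang, w in weights.items()}

old = {"Chinese": 0.25, "English": 0.25, "Korean": 0.25, "Japanese": 0.25, "German": 0.25}
new = adjust_weights(old, "Chinese")
# Chinese ≈ 0.29, each other language ≈ 0.24; the total weight mass is unchanged.
```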
In step S271, it is determined whether each corrected weight in the corrected weight set is within the weight range. When the result of the determination is that the weights are within the weight range, in step S272, the preset weight set is updated with the values of the corrected weight set for use by the language recognition module 120 in subsequent language recognition, and the processing then ends. When the result of the determination is that the weights are not within the weight range, the preset weight set is not updated, and the processing ends.
In addition, when the result of the determination in step S235 is "No" or the result of the determination in step S237 is "No", the language confidence correction module 160 invokes the comprehensive adjustment module 164 for processing.
The processing of the comprehensive adjustment module 164 will now be described with reference to FIG. 16.
As shown in FIG. 16, in step S251, it is determined from the output of the speech recognition module 110 whether there is an ASR confidence greater than the threshold λ. If so, in step S252, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result. In this case, the command parsing and execution module 150 may determine the language corresponding to the ASR confidence that exceeds the threshold λ as the language of the input speech (which may be called the recognized language), and determine the candidate command corresponding to that language as the target command to be executed.
When the result of the determination in step S251 is "No", it is determined from the output of the semantic understanding module 140 whether there is an NLU confidence greater than the threshold λ. When an NLU confidence greater than the threshold λ exists, in step S254, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result. In this case, the command parsing and execution module 150 determines the language corresponding to the NLU confidence that exceeds the threshold λ as the language of the input speech, and determines the candidate command corresponding to that language as the target command to be executed.
When the result of the determination in step S253 is "No", it is determined in step S256, according to the voiceprint (user identity), whether a user-specified language exists. The user-specified language here is the system language of the voice interaction system 100 as set by the user. When a user-specified language exists, the multilingual recognition confidence set is corrected according to the user-specified language such that the recognition confidence of the user-specified language is increased relative to the recognition confidences of the other languages, thereby obtaining the corrected multilingual recognition confidences (corresponding to the second recognition confidence in this application). In addition, there may be multiple user-specified languages (the system-language history stored in the database may contain multiple languages). For example, referring to the right-hand part of FIG. 16, when the user-specified languages, i.e. the languages the user has previously set, include Chinese, English, and German, the old recognition confidence set {Chinese: 0.75; English: 0.12; Korean: 0.11; Japanese: 0.01; German: 0.01} is revised into the new recognition confidence set {Chinese: 0.95; English: 0.32; Korean: 0.11; Japanese: 0.01; German: 0.21}.
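The user-specified-language correction can be sketched as follows; the additive boost of 0.20 per specified language is inferred from the example in FIG. 16 and is an assumption, and note that the boosted set is not renormalized in that example:

```python
def boost_user_languages(confidences, user_languages, boost=0.20):
    """Sketch of the user-specified-language correction: raise the confidence
    of every language the user has previously set as the system language.
    The additive boost of 0.20 is inferred from FIG. 16 (an assumption)."""
    return {lang: c + boost if lang in user_languages else c
            for lang, c in confidences.items()}

old = {"Chinese": 0.75, "English": 0.12, "Korean": 0.11, "Japanese": 0.01, "German": 0.01}
new = boost_user_languages(old, {"Chinese", "English", "German"})
# Chinese ≈ 0.95, English ≈ 0.32, German ≈ 0.21; Korean and Japanese are unchanged.
```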
Here, the user-specified language is one example of a user operation record. Other examples include the languages of songs the user has played in the past.
In addition, optionally, after the recognized language of the input speech has been determined according to the ASR confidence or the NLU confidence, the multilingual preset weight set may be updated for use in language recognition of subsequent input speech. The update is performed in the same manner as described with reference to FIG. 14 and is not repeated here.
Then, in step S259, it is determined whether the multilingual recognition confidence set corrected in step S258 contains a recognition confidence greater than the threshold λ. If so, in step S261, the corrected multilingual recognition confidence set is output to the command parsing and execution module as the language recognition result. Then, in step S262, the multilingual preset confidence weight set is adjusted. The adjustment method is the same as that described above and is not repeated here. In addition, as above, when each corrected weight in the corrected weight set is within the weight range, the multilingual preset confidence weight set is updated.
In addition, when the result of the determination in step S256 is "No", i.e. no user-specified language exists, or when the result of the determination in step S259 is "No", i.e. no recognition confidence greater than the threshold λ exists, it is determined in step S264, according to the user identity, whether a historical language record of the user exists. When the result of the determination is that a historical language record of the user exists, the user's current input speech is compared with the input speech in the historical language record to obtain feature similarities, and the language closest to the unknown input speech is found according to the feature similarities.
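The feature-similarity matching in this step might look as follows; the patent only says that feature similarities between the current and historical input speech are computed, so the choice of cosine similarity over generic feature vectors is entirely our assumption:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def closest_language(query_features, history):
    """Sketch: find the stored historical utterance most similar to the
    current input and return its language. `history` maps a language to a
    list of stored feature vectors; the representation is hypothetical."""
    best_lang, best_sim = None, -1.0
    for language, vectors in history.items():
        for vec in vectors:
            sim = cosine(query_features, vec)
            if sim > best_sim:
                best_lang, best_sim = language, sim
    return best_lang, best_sim

# Toy feature vectors standing in for real acoustic features.
history = {"Chinese": [[1.0, 0.0, 0.0]], "English": [[0.0, 1.0, 0.0]]}
lang, sim = closest_language([0.9, 0.1, 0.0], history)
# The query is far closer to the stored Chinese utterance.
```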
In addition, when the result of the determination in step S264 is "No", i.e. no historical language record of the user exists, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set directly to the command parsing and execution module 150 as the language recognition result. In this case, the command parsing and execution module 150 may conclude that the language of the input speech cannot be recognized, and may feed this back to the user, for example by playing a voice prompt.
With this embodiment as described above, the multilingual recognition confidence set is adjusted according to user characteristics, including the historical language record, the user-specified language, and so on. This improves the accuracy with which the voice interaction system 100 predicts the input speech, and increases the user's trust in the intelligence of the voice interaction system 100.
The "weight range" has been mentioned in the description above: when the real-time scene adaptation module 161 initializes the preset weight set, it is determined whether the initialized preset weights are within the weight range; likewise, when the audio-feature-based adjustment module 162, the video-feature-based adjustment module 163, or the comprehensive adjustment module 164 intends to update the preset weights, it is also determined whether the preset weights are within the weight range. The "weight range" thus reflects the range of robustness of the model itself.
This embodiment also provides a method for setting the "weight range". The method is carried out, for example, in the testing phase before the voice interaction system 100 leaves the factory; it may also be carried out in an offline verification phase after the system leaves the factory.
The method mainly includes the following steps:

① Collect language data sets data_1, data_2, …, data_n for different scenarios; the content of each language data set is known in advance;

② Input the language data set data_i (i ∈ [1, n]) for one scenario into the language recognition model to test the model, and obtain the optimal confidence weights t_i for each language in that scenario (that is, the confidence weights corresponding to the highest recognition accuracy);

③ Input every language data set into the model and perform step ② for each, obtaining the optimal confidence weight set for each language, T_k = {t_1k, t_2k, …, t_nk} (k ∈ [1, m]);

④ Obtain the optimal confidence weight range for each language, F_k = [a_k, b_k], where a_k = min(T_k) and b_k = max(T_k).

Note: n is the number of language data sets, and m is the number of languages.

Here, the language data sets data_1, data_2, …, data_n correspond to the test data sets in this application.
For the above method, this embodiment provides an implementation as shown in FIG. 10, though the method is not limited to this implementation. The implementation will be described below with reference to FIG. 10. First, k different confidence weight sets are randomly initialized; these correspond to the multiple test weight sets in this application. Next, the language data set data_i for one scenario is input into the language recognition model to obtain its output, which is multiplied elementwise by each of the k confidence weight sets (the sets here can be understood as matrices) and then normalized, yielding the corrected multilingual language confidence sets. Then, from the language confidence sets and the known content of the language data set data_i, the accuracy of each confidence weight set is obtained (acc in the figure); the confidence weight set with the highest accuracy (for example, {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18} in the figure) is the optimal confidence weight set for the language data set data_i (i.e., scenario i).
Finally, by repeating the above processing for the language data set of each scenario, the confidence weight set of each individual language can be obtained, along with the confidence weight range of each individual language, for example the ranges shown in the figure: Chinese [c_a, c_b], English [e_a, e_b], Korean [h_a, h_b], Japanese [r_a, r_b], and German [d_a, d_b].
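The procedure of FIG. 10 and steps ① to ④ can be sketched as follows; the scoring rule inside `best_weight_set` (multiply the model's confidences by the candidate weights, normalize, take the arg max) follows the description of FIG. 10, while the function names and data layout are ours:

```python
def best_weight_set(model_outputs, labels, candidate_weight_sets):
    """Steps ①-③ sketch: score each candidate confidence weight set by the
    recognition accuracy it achieves on one scenario's labelled data set,
    and keep the best-scoring set for that scenario."""
    def accuracy(weights):
        correct = 0
        for confidences, true_lang in zip(model_outputs, labels):
            scored = {lang: confidences[lang] * weights[lang] for lang in weights}
            total = sum(scored.values())
            normed = {lang: v / total for lang, v in scored.items()}
            if max(normed, key=normed.get) == true_lang:
                correct += 1
        return correct / len(labels)
    return max(candidate_weight_sets, key=accuracy)

def weight_ranges(per_scenario_best):
    """Step ④: per-language weight range F_k = [min(T_k), max(T_k)] over
    the best weight sets found for all scenarios."""
    langs = per_scenario_best[0].keys()
    return {lang: (min(w[lang] for w in per_scenario_best),
                   max(w[lang] for w in per_scenario_best))
            for lang in langs}

# Toy illustration with one labelled utterance and two candidate weight sets.
outputs = [{"Chinese": 0.6, "English": 0.4}]
labels = ["English"]
candidates = [{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.2, "English": 0.8}]
best = best_weight_set(outputs, labels, candidates)
ranges = weight_ranges([{"Chinese": 0.21, "English": 0.19},
                        {"Chinese": 0.25, "English": 0.18}])
```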
With the above technical means, before the language recognition model is used for language recognition, the model is tested on a large number of language data sets to set the weight range of the multilingual preset weight set. This specifies the range of robustness of the language recognition model and keeps the model operating within that range, thereby ensuring the reliability of the language recognition results.
Note that the above are only preferred embodiments of this application and the technical principles applied. Those skilled in the art will understand that this application is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of this application. Therefore, although this application has been described in some detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the concept of this application, all of which fall within the protection scope of this application.

Claims (47)

  1. A speech processing method, characterized by comprising:
    acquiring input speech information of a user;
    determining, according to the input speech information, a plurality of first confidence levels corresponding to the input speech information, the plurality of first confidence levels respectively corresponding to a plurality of languages;
    correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user characteristic of the user; and
    determining the language of the input speech information according to the plurality of second confidence levels.
  2. The speech processing method according to claim 1, wherein the correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user characteristic of the user specifically comprises:
    when the plurality of first confidence levels are less than a first threshold, correcting the plurality of first confidence levels into the plurality of second confidence levels according to the user characteristic.
  3. The speech processing method according to claim 1 or 2, wherein the user characteristic comprises one or more of a historical language record and a user-specified language.
  4. The speech processing method according to claim 3, wherein the historical language record and the user-specified language are obtained by querying according to a voiceprint feature of the input speech information.
  5. The speech processing method according to any one of claims 1-4, wherein the plurality of first confidence levels are determined from a plurality of initial confidence levels and a plurality of preset weights; and
    the speech processing method further comprises: updating the plurality of preset weights according to the plurality of second confidence levels.
  6. The speech processing method according to claim 5, wherein the updating the plurality of preset weights according to the plurality of second confidence levels specifically comprises:
    when a second confidence level greater than a first threshold exists among the plurality of second confidence levels, updating the plurality of preset weights according to the plurality of second confidence levels.
  7. The speech processing method according to any one of claims 1-6, further comprising: determining semantics of the input speech information according to the input speech information and the language of the input speech information.
  8. The speech processing method according to any one of claims 1-7, wherein the plurality of languages are preset.
  9. The speech processing method according to any one of claims 1-8, wherein the plurality of first confidence levels are determined from a plurality of initial confidence levels and a plurality of preset weights; and
    the speech processing method further comprises: before acquiring the input speech information of the user, setting the plurality of preset weights according to a scene characteristic.
  10. The speech processing method according to claim 9, wherein the scene characteristic comprises an environment characteristic and/or an audio collector characteristic.
  11. The speech processing method according to claim 10, wherein the environment characteristic comprises one or more of an environmental signal-to-noise ratio, power supply DC/AC information, or an environmental vibration amplitude, and the audio collector characteristic comprises microphone arrangement information.
  12. The speech processing method according to any one of claims 9-11, wherein the setting the plurality of preset weights according to a scene characteristic specifically comprises:
    acquiring pre-collected first speech data and pre-recorded first language information of the first speech data;
    determining second speech data according to the first speech data and the scene characteristic;
    determining second language information of the second speech data according to the second speech data; and
    setting the plurality of preset weights according to the first language information and the second language information.
  13. The speech processing method according to claim 12, wherein the determining second language information of the second speech data according to the second speech data specifically comprises:
    acquiring a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; and
    determining a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; and
    the setting the plurality of preset weights according to the first language information and the second language information specifically comprises:
    determining a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and
    setting the plurality of preset weights according to the test weight group corresponding to the piece of second language information with the highest accuracy rate.
  14. The speech processing method according to any one of claims 9-13, wherein the setting the plurality of preset weights specifically comprises: setting the plurality of preset weights within a weight range.
  15. The speech processing method according to claim 5 or 6, wherein the updating the plurality of preset weights specifically comprises: updating the plurality of preset weights within a weight range.
  16. The speech processing method according to claim 14 or 15, wherein the weight range is determined as follows:
    acquiring a plurality of pre-collected test speech data groups and pre-recorded first language information of the plurality of test speech data groups, any one of the plurality of test speech data groups comprising a plurality of pieces of test speech data;
    acquiring a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; and
    determining the weight range according to the plurality of test speech data groups, the first language information, and the plurality of test weight groups.
  17. A speech processing method, characterized by comprising:
    acquiring input speech information of a user;
    determining, according to the input speech information, a plurality of third confidence levels corresponding to the input speech information, the plurality of third confidence levels respectively corresponding to a plurality of languages;
    correcting the plurality of third confidence levels into a plurality of fourth confidence levels according to a scene characteristic; and
    determining the language of the input speech information according to the plurality of fourth confidence levels.
  18. The speech processing method according to claim 17, wherein the scene characteristic comprises an environment characteristic and/or an audio collector characteristic.
  19. The speech processing method according to claim 17 or 18, wherein the environment characteristic comprises one or more of an environmental signal-to-noise ratio, power supply DC/AC information, or an environmental vibration amplitude, and the audio collector characteristic comprises microphone arrangement information.
  20. The speech processing method according to any one of claims 17-19, wherein the correcting the plurality of third confidence levels into a plurality of fourth confidence levels according to a scene characteristic specifically comprises:
    setting a plurality of preset weights according to the scene characteristic; and
    correcting the plurality of third confidence levels into the plurality of fourth confidence levels according to the plurality of preset weights.
  21. The speech processing method according to claim 20, wherein the setting the plurality of preset weights according to the scene characteristic specifically comprises:
    acquiring pre-collected first speech data and pre-recorded first language information of the first speech data;
    determining second speech data according to the first speech data and the scene characteristic;
    determining second language information of the second speech data according to the second speech data; and
    setting the plurality of preset weights according to the first language information and the second language information.
  22. The speech processing method according to claim 21, wherein the determining second language information of the second speech data according to the second speech data specifically comprises:
    acquiring a plurality of test weight groups, each of the test weight groups comprising a plurality of test weights; and
    determining a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; and
    the setting the plurality of preset weights according to the first language information and the second language information specifically comprises:
    determining a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and
    setting the plurality of preset weights according to the test weight group corresponding to the piece of second language information with the highest accuracy rate.
  23. 一种语音处理装置,其特征在于,包括处理模块与收发模块,A voice processing device, characterized in that it includes a processing module and a transceiver module,
    所述收发模块用于获取用户的输入语音信息;The transceiver module is used to obtain the input voice information of the user;
    所述处理模块用于,根据所述输入语音信息,确定所述输入语音信息对应的多个第一置信度,所述多个第一置信度分别对应于多个语种,The processing module is configured to, according to the input voice information, determine a plurality of first confidence levels corresponding to the input voice information, the plurality of first confidence levels respectively corresponding to multiple languages,
    所述处理模块还用于,根据所述用户的用户特征修正所述多个第一置信度为多个第二置信度,根据所述多个第二置信度,确定所述输入语音信息的语种。The processing module is further configured to correct the plurality of first confidence levels into a plurality of second confidence levels according to the user characteristics of the user, and determine the language of the input voice information according to the plurality of second confidence levels .
  24. The speech processing apparatus according to claim 23, wherein the processing module is specifically configured to correct the plurality of first confidence levels into the plurality of second confidence levels according to the user characteristics when the plurality of first confidence levels are less than a first threshold.
  25. The speech processing apparatus according to claim 23 or 24, wherein the user characteristics comprise one or more of a historical language record and a user-specified language.
  26. The speech processing apparatus according to claim 25, wherein the historical language record and the user-specified language are obtained by querying according to a voiceprint feature of the input speech information.
  27. The speech processing apparatus according to any one of claims 23-26, wherein the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights; and
    the processing module is further configured to update the plurality of preset weights according to the plurality of second confidence levels.
  28. The speech processing apparatus according to claim 27, wherein the processing module is specifically configured to update the plurality of preset weights according to the plurality of second confidence levels when a second confidence level greater than a first threshold exists among the plurality of second confidence levels.
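The threshold-gated update of claims 27-28 can be sketched as below. The threshold value, the learning rate, and the rule "reinforce only the winning language" are illustrative assumptions; the claims specify only that the preset weights are updated from the second confidence levels once one of them exceeds the first threshold.

```python
# Illustrative sketch of claims 27-28: only when some corrected (second)
# confidence clears the threshold is the recognition trusted enough to
# nudge the preset weights toward the recognized language.

FIRST_THRESHOLD = 0.8  # assumed value for the "first threshold"

def update_preset_weights(preset_weights, second_confidences, lr=0.1):
    """preset_weights / second_confidences: {language: value}.
    Returns updated weights; originals if no confidence passes the bar."""
    if max(second_confidences.values()) <= FIRST_THRESHOLD:
        return dict(preset_weights)   # claim 28: no confident result, no update
    winner = max(second_confidences, key=second_confidences.get)
    updated = dict(preset_weights)
    updated[winner] += lr             # reinforce the recognized language
    return updated
```

A confident Chinese result raises the Chinese weight; an ambiguous result leaves all weights untouched.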
  29. The speech processing apparatus according to any one of claims 23-28, wherein the processing module is further configured to determine the semantics of the input speech information according to the input speech information and the language of the input speech information.
  30. The speech processing apparatus according to any one of claims 23-29, wherein the plurality of languages are preset.
  31. The speech processing apparatus according to any one of claims 23-30, wherein the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights; and
    the processing module is further configured to set the plurality of preset weights according to scene characteristics before the input speech information of the user is acquired.
  32. The speech processing apparatus according to claim 31, wherein the scene characteristics comprise environment characteristics and/or audio collector characteristics.
  33. The speech processing apparatus according to claim 32, wherein the environment characteristics comprise one or more of an environmental signal-to-noise ratio, power supply direct-current/alternating-current information, and an environmental vibration amplitude, and the audio collector characteristics comprise microphone arrangement information.
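One way the scene-dependent weight setting of claims 31-33 could look is sketched below. The specific rule "flatten the weights toward 1.0 in low-SNR scenes" is purely an assumption for illustration; the claims state only that the preset weights depend on scene characteristics such as SNR and microphone arrangement.

```python
# Hypothetical sketch of claims 31-33: derive the preset per-language
# weights from scene characteristics before any speech is acquired.

def set_preset_weights(base_weights, scene):
    """base_weights: {language: weight};
    scene: e.g. {"snr_db": 12.0, "mic_count": 2} (illustrative keys)."""
    weights = dict(base_weights)
    if scene.get("snr_db", 30.0) < 15.0:
        # Low SNR (assumed rule): pull all weights halfway toward 1.0 so no
        # language dominates on the basis of noise-sensitive acoustic cues.
        weights = {lang: 1.0 + (w - 1.0) * 0.5 for lang, w in weights.items()}
    return weights
```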
  34. The speech processing apparatus according to any one of claims 31-33, wherein the processing module is specifically configured to: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene characteristics; determine second language information of the second speech data according to the second speech data; and set the plurality of preset weights according to the first language information and the second language information.
  35. The speech processing apparatus according to claim 34, wherein the processing module is specifically configured to: acquire a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; determine a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; determine a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and set the plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  36. The speech processing apparatus according to any one of claims 31-35, wherein the processing module is specifically configured to set the plurality of preset weights within a weight range.
  37. The speech processing apparatus according to claim 27 or 28, wherein the processing module is specifically configured to update the plurality of preset weights within a weight range.
  38. The speech processing apparatus according to claim 36 or 37, wherein the weight range is determined in the following manner:
    acquiring a plurality of pre-collected test speech data groups and pre-recorded first language information of the plurality of test speech data groups, any one of the plurality of test speech data groups comprising a plurality of pieces of test speech data;
    acquiring a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; and
    determining the weight range according to the plurality of test speech data groups, the first language information, and the plurality of test weight groups.
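The weight-range determination of claim 38 can be sketched as follows. The accuracy bar (0.9) and the decision to take the min/max over all weights of the qualifying groups are illustrative assumptions; the claim only requires that the range be derived from the test speech data groups, the recorded language information, and the test weight groups.

```python
# Sketch of claim 38 under assumptions: score every test weight group on
# the test speech data groups (e.g. as in claim 22), keep the groups whose
# accuracy clears a bar, and take the extremes of their weights as the
# permitted weight range.

def determine_weight_range(accuracies_per_group, test_weight_groups, bar=0.9):
    """accuracies_per_group[i]: accuracy achieved by test_weight_groups[i]
    against the pre-recorded first language information."""
    good = [g for g, acc in zip(test_weight_groups, accuracies_per_group)
            if acc >= bar]
    flat = [w for group in good for w in group]
    return (min(flat), max(flat)) if flat else None
```

Later weight setting (claim 36) and weight updates (claim 37) would then be clamped to this range.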
  39. A speech processing apparatus, characterized in that it comprises a processing module and a transceiver module, wherein
    the transceiver module is configured to acquire input speech information of a user;
    the processing module is configured to determine, according to the input speech information, a plurality of third confidence levels corresponding to the input speech information, the plurality of third confidence levels respectively corresponding to a plurality of languages; and
    the processing module is further configured to correct the plurality of third confidence levels into a plurality of fourth confidence levels according to scene characteristics, and to determine the language of the input speech information according to the plurality of fourth confidence levels.
  40. The speech processing apparatus according to claim 39, wherein the scene characteristics comprise environment characteristics and/or audio collector characteristics.
  41. The speech processing apparatus according to claim 39 or 40, wherein the environment characteristics comprise one or more of an environmental signal-to-noise ratio, power supply direct-current/alternating-current information, and an environmental vibration amplitude, and the audio collector characteristics comprise microphone arrangement information.
  42. The speech processing apparatus according to any one of claims 39-41, wherein the processing module is specifically configured to set a plurality of preset weights according to the scene characteristics, and to correct the plurality of third confidence levels into the plurality of fourth confidence levels according to the plurality of preset weights.
  43. The speech processing apparatus according to claim 42, wherein the processing module is specifically configured to: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene characteristics; determine second language information of the second speech data according to the second speech data; and set the plurality of preset weights according to the first language information and the second language information.
  44. The speech processing apparatus according to claim 43, wherein the processing module is specifically configured to: acquire a plurality of test weight groups, each of the test weight groups comprising a plurality of test weights; determine a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; determine a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and set the plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  45. A computing device, characterized in that it comprises a processor and a memory, the memory storing computer program instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1-22.
  46. A computer-readable storage medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1-22.
  47. A computer program product, characterized in that it comprises computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1-22.
PCT/CN2021/101400 2021-06-22 2021-06-22 Speech processing method and apparatus, and system WO2022266825A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180001914.8A CN113597641A (en) 2021-06-22 2021-06-22 Voice processing method, device and system
PCT/CN2021/101400 WO2022266825A1 (en) 2021-06-22 2021-06-22 Speech processing method and apparatus, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/101400 WO2022266825A1 (en) 2021-06-22 2021-06-22 Speech processing method and apparatus, and system

Publications (1)

Publication Number Publication Date
WO2022266825A1 true WO2022266825A1 (en) 2022-12-29

Family

ID=78242898

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101400 WO2022266825A1 (en) 2021-06-22 2021-06-22 Speech processing method and apparatus, and system

Country Status (2)

Country Link
CN (1) CN113597641A (en)
WO (1) WO2022266825A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004004953A (en) * 2003-07-30 2004-01-08 Matsushita Electric Ind Co Ltd Voice synthesizer and voice synthetic method
CN107832286A (en) * 2017-09-11 2018-03-23 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109522564A (en) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Voice translation method and device
CN110085210A (en) * 2019-03-15 2019-08-02 平安科技(深圳)有限公司 Interactive information test method, device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104681023A (en) * 2015-02-15 2015-06-03 联想(北京)有限公司 Information processing method and electronic equipment
CN108172212B (en) * 2017-12-25 2020-09-11 横琴国际知识产权交易中心有限公司 Confidence-based speech language identification method and system
US20210365641A1 (en) * 2018-06-12 2021-11-25 Langogo Technology Co., Ltd Speech recognition and translation method and translation apparatus
CN112185348B (en) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN113597641A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
US11676575B2 (en) On-device learning in a hybrid speech processing system
US9953648B2 (en) Electronic device and method for controlling the same
US10332513B1 (en) Voice enablement and disablement of speech processing functionality
US9305569B2 (en) Dialogue system and method for responding to multimodal input using calculated situation adaptability
US20230186912A1 (en) Speech recognition method, apparatus and device, and storage medium
JP2022549238A (en) Semantic understanding model training method, apparatus, electronic device and computer program
US8543399B2 (en) Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms
CN111028827A (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
US10685664B1 (en) Analyzing noise levels to determine usability of microphones
US11574637B1 (en) Spoken language understanding models
KR20160132748A (en) Electronic apparatus and the controlling method thereof
US11756551B2 (en) System and method for producing metadata of an audio signal
US20200162911A1 (en) ELECTRONIC APPARATUS AND WiFi CONNECTING METHOD THEREOF
KR20210095431A (en) Electronic device and control method thereof
CN114925163A (en) Intelligent equipment and intention recognition model training method
WO2022266825A1 (en) Speech processing method and apparatus, and system
CN115083412B (en) Voice interaction method and related device, electronic equipment and storage medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
US11664018B2 (en) Dialogue system, dialogue processing method
KR20140035164A (en) Method operating of speech recognition system
CN117708305B (en) Dialogue processing method and system for response robot
US20100292988A1 (en) System and method for speech recognition
US20240212681A1 (en) Voice recognition device having barge-in function and method thereof
US11527247B2 (en) Computing device and method of operating the same
US11893996B1 (en) Supplemental content output

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946327

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21946327

Country of ref document: EP

Kind code of ref document: A1