WO2022266825A1 - Speech processing method and apparatus, and system

Speech processing method and apparatus, and system

Info

Publication number
WO2022266825A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
information
confidence
speech
confidence levels
Prior art date
Application number
PCT/CN2021/101400
Other languages
English (en)
Chinese (zh)
Inventor
王科涛
聂为然
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180001914.8A (published as CN113597641A)
Priority to PCT/CN2021/101400 (published as WO2022266825A1)
Publication of WO2022266825A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a voice processing method, device and system.
  • the present application provides a speech processing method, device and system capable of improving speech recognition capability, so as to improve the accuracy of speech recognition.
  • the first aspect of the present application relates to a voice processing method, including the following content: acquiring the user's input voice information; determining, according to the input voice information, a plurality of first confidence levels corresponding to the input voice information, the multiple first confidence levels respectively corresponding to a plurality of languages; modifying the plurality of first confidence levels into a plurality of second confidence levels according to user characteristics; and determining the language of the input voice information according to the plurality of second confidence levels.
  • That is, the multiple first confidence levels are modified into multiple second confidence levels according to the user characteristics of the user, and the language of the input voice information is determined according to the multiple second confidence levels, so that the language of the voice information input by the user is determined on the basis of the user characteristics. In this way, the language recognition accuracy and the speech recognition capability can be improved.
  • Modifying the multiple first confidence levels into multiple second confidence levels according to user characteristics may specifically include: when the multiple first confidence levels are all smaller than the first threshold, modifying the multiple first confidence levels into multiple second confidence levels according to user characteristics.
  • When the plurality of first confidence levels are all smaller than the first threshold, it is difficult to determine the language of the input voice information from the first confidence levels alone.
  • In that case, the language of the input voice information can be determined according to the second confidence levels, which can improve language recognition accuracy and speech recognition capability.
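  • As a minimal sketch of this threshold-gated flow (an illustration, not the patent's implementation), the following Python fragment decides directly when a first confidence level clears the first threshold, and otherwise corrects the first confidence levels with user characteristics first; the function names, the 0.8 threshold, and the 1.5 boost factor are assumptions:

      FIRST_THRESHOLD = 0.8  # assumed value of the "first threshold"

      def correct_with_user_profile(conf, profile):
          # Boost languages found in the user's historical language records
          # or user-specified system languages, then renormalize to sum to 1.
          boosted = {}
          for lang, c in conf.items():
              favored = (lang in profile.get("history", [])
                         or lang in profile.get("specified", []))
              boosted[lang] = c * (1.5 if favored else 1.0)  # 1.5 is illustrative
          total = sum(boosted.values()) or 1.0
          return {lang: c / total for lang, c in boosted.items()}

      def identify_language(first_conf, profile):
          # A first confidence already clears the threshold: use it directly.
          if max(first_conf.values()) >= FIRST_THRESHOLD:
              return max(first_conf, key=first_conf.get)
          # All first confidences are below the threshold: correct them into
          # second confidences according to user characteristics.
          second_conf = correct_with_user_profile(first_conf, profile)
          return max(second_conf, key=second_conf.get)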
  • User characteristics may include one or more of historical language records and user-specified languages.
  • the first recognition confidence level is corrected according to the user's historical language records and/or the user-designated language, and the language of the input voice is determined on this basis, thereby improving the language recognition ability.
  • the historical language record of the user refers to the record of the languages to which the user's voice inputs belonged before the above-mentioned input voice was entered.
  • the user-specified language refers to the type of system language set by the user. There may be only one user-specified language, or there may be multiple user-specified languages (that is, there are multiple system languages set by the user).
  • the historical language records and the user-specified language are obtained by querying the voiceprint features of the input voice information.
  • Using the voiceprint of the input voice information to query the historical language records or the user-specified language can, compared with querying based on face information, iris, etc., avoid misidentifying the user (speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification.
  • Moreover, the voiceprint can be obtained from the input voice information itself, whereas querying based on face information, iris, etc. also needs to obtain user images. Therefore, the voiceprint-based query requires less equipment and processes faster.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights.
  • the voice processing method may further include the following content: updating multiple preset weights according to multiple second confidence levels.
  • updating multiple preset weights according to multiple second confidence levels specifically includes: when there is a second confidence level greater than the first threshold among the multiple second confidence levels, updating multiple preset weights according to multiple second confidence levels.
  • the method further includes: determining the semantics of the input voice information according to the input voice information and the language of the input voice information.
  • multiple languages are preset.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the voice processing method further includes: before acquiring the user's input voice information, setting the multiple preset weights according to scene features.
  • In this way, the preset weights are set according to the scene characteristics, so that different scenes can be adapted to and the language recognition result is obtained with the preset weights best suited to the scene, which improves the language recognition and speech recognition capability.
  • the scene feature includes an environment feature and/or an audio collector feature.
  • the environmental feature includes one or more of environmental signal-to-noise ratio, power supply DC and AC information, or environmental vibration amplitude.
  • the audio collector feature includes microphone arrangement information.
  • Environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence levels. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
  • setting multiple preset weights according to scene characteristics specifically includes: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining second voice data according to the first voice data and the scene features; determining second language information of the second voice data according to the second voice data; and setting a plurality of preset weights according to the first language information and the second language information.
  • determining the second language information of the second voice data according to the second voice data specifically includes: acquiring multiple test weight groups, any one of which includes a plurality of test weights; and determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups. Setting multiple preset weights according to the first language information and the second language information specifically includes: determining multiple accuracy rates of the multiple second language information according to the first language information and the multiple second language information; and setting the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • setting multiple preset weights specifically includes: setting multiple preset weights within a weight range.
  • updating the multiple preset weights specifically includes: updating the multiple preset weights within a weight range.
  • the weight range is determined as follows: acquiring a plurality of pre-collected test voice data groups and the pre-recorded first language information of the plurality of test voice data groups, any one of the plurality of test voice data groups including a plurality of test voice data; acquiring a plurality of test weight groups, any one of the multiple test weight groups including multiple test weights; and determining the weight range according to the multiple test voice data groups, the first language information, and the multiple test weight groups.
  • the second aspect of the present application provides a voice processing method, including the following content: acquiring user input voice information; determining a plurality of third confidence levels corresponding to the input voice information according to the input voice information, the multiple third confidence levels respectively corresponding to multiple languages; correcting multiple third confidence levels into multiple fourth confidence levels according to scene features; determining the language of the input voice information according to the multiple fourth confidence levels.
  • In the speech processing method, the multiple third confidence levels are modified into multiple fourth confidence levels according to the scene features, and the language of the input voice information is determined according to the multiple fourth confidence levels, that is, the language of the user's input voice information is determined on the basis of the scene features. In this way, the voice processing method can be adapted to the actual scene as much as possible, and the language recognition accuracy and speech recognition capability can be improved.
  • scene features may include environment features and/or audio collector features.
  • the environmental feature includes one or more of environmental signal-to-noise ratio, power supply DC and AC information, or environmental vibration amplitude.
  • the audio collector feature includes microphone arrangement information.
  • modifying multiple third confidence levels into multiple fourth confidence levels according to scene characteristics includes: setting multiple preset weights according to the scene features; and modifying the plurality of third confidence levels into a plurality of fourth confidence levels according to the multiple preset weights.
  • setting multiple preset weights according to scene characteristics specifically includes: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining second voice data according to the first voice data and the scene features; determining second language information of the second voice data according to the second voice data; and setting a plurality of preset weights according to the first language information and the second language information.
  • determining the second language information of the second voice data according to the second voice data specifically includes: acquiring multiple test weight groups, where each test weight group includes multiple test weights; and determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information corresponding to the multiple test weight groups respectively. Setting multiple preset weights according to the first language information and the second language information specifically includes: determining multiple accuracy rates of the multiple second language information according to the first language information and the multiple second language information; and setting multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • the third aspect of the present application provides a voice processing device, including a processing module and a transceiver module. The transceiver module is used to obtain the user's input voice information; the processing module is used to determine, according to the input voice information, multiple first confidence levels corresponding to the input voice information, the multiple first confidence levels respectively corresponding to multiple languages.
  • the processing module is further configured to modify the plurality of first confidence levels into a plurality of second confidence levels according to user characteristics of the user, and determine the language of the input voice information according to the plurality of second confidence levels.
  • the processing module is specifically configured to, when the multiple first confidence levels are smaller than the first threshold, modify the multiple first confidence levels to multiple second confidence levels according to user characteristics.
  • the user features include one or more of historical language records and user-specified languages.
  • the historical language records and the user-specified language are obtained by querying the voiceprint features of the input voice information.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the processing module is further configured to update the multiple preset weights according to the multiple second confidence levels.
  • the processing module is specifically configured to update the multiple preset weights when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
  • the processing module is further configured to determine the semantics of the input voice information according to the input voice information and the language of the input voice information.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the processing module is also configured to set multiple preset weights according to scene features before acquiring the user's input voice information.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the processing module is specifically configured to acquire the pre-collected first voice data and the pre-recorded first language information of the first voice data, determine the second voice data according to the first voice data and the scene features, determine the second language information of the second voice data according to the second voice data, and set a plurality of preset weights according to the first language information and the second language information.
  • the processing module is specifically configured to obtain multiple test weight groups, any one of the multiple test weight groups including multiple test weights; determine a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information corresponding to the plurality of test weight groups respectively; determine multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and set a plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • the processing module is specifically configured to set multiple preset weights within a weight range.
  • the processing module is specifically configured to update the multiple preset weights within the weight range.
  • the weight range is determined as follows: acquiring a plurality of pre-collected test voice data groups and the pre-recorded first language information of the plurality of test voice data groups, any one of the plurality of test voice data groups including a plurality of test voice data; acquiring a plurality of test weight groups, any one of the multiple test weight groups including multiple test weights; and determining the weight range according to the multiple test voice data groups, the first language information, and the multiple test weight groups.
  • the speech processing device of the third aspect can obtain the same technical effect as that of the speech processing method of the first aspect, and the description will not be repeated here.
  • the fourth aspect of the present application provides a voice processing device, including a processing module and a transceiver module. The transceiver module is used to obtain the user's input voice information; the processing module is used to determine, according to the input voice information, a plurality of third confidence levels corresponding to the input voice information, the plurality of third confidence levels corresponding to multiple languages. The processing module is also used to modify the plurality of third confidence levels into a plurality of fourth confidence levels according to the scene characteristics, and determine the language of the input voice information according to the plurality of fourth confidence levels.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the processing module is specifically configured to set multiple preset weights according to scene characteristics, and correct multiple third confidence levels into multiple fourth confidence levels according to the multiple preset weights.
  • the processing module is specifically configured to acquire the pre-collected first voice data and the pre-recorded first language information of the first voice data, determine the second voice data according to the first voice data and the scene characteristics, determine the second language information of the second voice data according to the second voice data, and set a plurality of preset weights according to the first language information and the second language information.
  • the processing module is specifically configured to obtain multiple test weight groups, each including multiple test weights; determine a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information corresponding to the plurality of test weight groups respectively; determine multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and set a plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • a fifth aspect of the present application provides a computing device, which includes a processor and a memory. The memory stores computer program instructions, and when the computer program instructions are executed by the processor, the processor executes any method described in the first aspect or the second aspect.
  • the sixth aspect of the present application provides a computer-readable storage medium, which stores computer program instructions. When executed by a computer, the computer program instructions cause the computer to execute any method described in the first aspect or the second aspect.
  • a seventh aspect of the present application provides a computer program product, which includes computer program instructions. When executed by a computer, the computer program instructions cause the computer to execute any method described in the first aspect or the second aspect.
  • the eighth aspect of the present application provides a system, which includes the speech processing device provided in the third aspect or the fourth aspect, or in any possible implementation manner thereof.
  • FIG. 1 is a schematic illustration of an application scenario example of the speech processing solution provided by an embodiment of the present application;
  • FIG. 2 is a schematic illustration of a speech processing system to which the speech processing solution provided by an embodiment of the present application is applied;
  • FIG. 3 is a flowchart of a voice processing method provided by an embodiment of the present application;
  • FIG. 4 is a flowchart of a speech processing method provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural illustration of a speech processing device provided by an embodiment of the present application;
  • FIG. 6 is a flowchart schematically illustrating a language recognition method provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a language recognition device provided by an embodiment of the present application;
  • FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application;
  • FIG. 10 is a schematic illustration of a method for setting the weight range;
  • FIG. 11 is a schematic illustration of a voice interaction system involved in an embodiment of the present application;
  • FIG. 12 is a flowchart illustrating part of the process of the voice interaction method involved in an embodiment;
  • FIG. 13 is a schematic illustration of a method for initializing a preset weight set provided in an embodiment of the present application;
  • FIG. 14 is a schematic illustration of part of the flow of a voice interaction process provided in an embodiment of the present application;
  • FIG. 15 is a schematic illustration of a confidence correction method provided in an embodiment of the present application;
  • FIG. 16 is a schematic illustration of another confidence correction method provided in an embodiment of the present application;
  • FIG. 17 is a schematic illustration of an electronic control unit provided in an embodiment of the present application.
  • the voice processing solution provided in the embodiments of the present application includes a voice processing method, device, and system. Since the principles by which these technical solutions solve the problem are the same or similar, some repeated content may not be described again in the following specific embodiments, but these specific embodiments should be considered as referencing one another and as combinable with one another.
  • FIG. 1 illustrates a scenario in which the solution is applied to a vehicle.
  • The microphone array receives the voice commands of the driver 300 and other passengers; the system executes corresponding controls according to the voice commands (such as playing music, opening the windows, turning on the air conditioner, or navigating) and at the same time responds (feeds back) to the voice commands, for example, by presenting display information on the central control display 210 or playing voice information through a speaker (not shown) of the central control display 210.
  • Since the vehicle 200 is used by different occupants, they may issue voice commands in different languages, and even the same occupant may issue voice commands in different languages. However, limited by its language recognition capability, the car-machine system may sometimes obtain a wrong language recognition result, fail to recognize or wrongly recognize the semantics of a voice command, and thus fail to respond correctly.
  • machine learning models may learn some task-independent information, such as the environmental signal-to-noise ratio or features of the audio collector (sound sensor, microphone), which leads to errors in the model's prediction results when this information changes in actual applications.
  • For example, if the vehicle 200 is a convertible car and the ambient noise is relatively large (for example, a medium-noise environment), the car-machine system may obtain a wrong language recognition result, fail to correctly recognize the voice command, and thus fail to make a correct response.
  • If the type of microphone array used to collect the training sample data of the machine learning model differs from the microphone 212 of the vehicle-machine system, the system may also generate wrong language recognition results and fail to correctly recognize the driver's voice commands.
  • the embodiments of the present application provide a voice processing method, device, system, etc., which can improve the voice recognition capability of a multilingual voice processing solution.
  • FIG. 2 is a schematic diagram illustrating the architecture of a speech processing system to which the speech processing solution provided by the embodiment of the present application is applied.
  • the voice processing system 180 includes a voice processing device 182 , a sound sensor (microphone) 184 , a speaker 186 , a display device 188 and the like.
  • the voice processing system 180 can be applied to smart vehicles as a car-machine system. In addition, it can also be applied to scenarios such as smart home, smart office, smart robot, smart voice question and answer, smart voice analysis, and real-time voice monitoring and analysis.
  • the sound sensor 184 is used to acquire the user's input voice, and the voice processing device 182 obtains the user's input voice information according to the sensor data of the sound sensor 184, processes the input voice information, and obtains the semantics of the input voice information. And, the voice processing device 182 performs corresponding control according to the semantics, for example, controlling the output of the speaker 186 or the display device 188 .
  • the voice processing device 182 can also be connected with other devices and mechanisms, such as the windows and the air-conditioning system, so as to be able to control them.
  • Fig. 3 is a flowchart of a speech processing method provided by an embodiment of the present application.
  • the voice processing method may be executed by a vehicle, a vehicle-mounted device, a vehicle machine, or a vehicle-mounted computer, and may also be executed by components in the vehicle or the vehicle-mounted device, such as a chip or a processor.
  • the voice processing method can also be applied to other scenarios such as smart home or smart office.
  • the speech processing method may be executed by related devices involved in these scenarios, such as a control device, a processor, and the like.
  • the speech processing method includes the following contents:
  • the input voice information of the user may be obtained according to the sensor data collected by the sound sensor; the sensor data may be used directly, or information obtained after processing the sensor data may be used.
  • the time length of the input voice information is not particularly limited, and may correspond to a paragraph or a sentence of the user.
  • the content spoken by the user may be segmented to form a plurality of input speech information, and the processing of S2-S4 described later is respectively performed on the plurality of input speech information.
  • the multiple first confidence levels correspond to multiple languages.
  • multiple languages may be preset.
  • the confidence level of a language refers to the probability that the input voice information belongs to that language. For example, when the multiple first confidence levels obtained are {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0}, the probability that the language of the input voice information is Chinese is 0.6, the probability that it is English is 0.4, and the probability that it is Korean, German, or Japanese is 0.
  • different languages may refer to different language families (for example, Chinese and English belong to different language families), or to different sub-languages under the same language family (for example, Mandarin and Cantonese within Chinese are also treated as different languages).
  • the user features here are, for example, historical language records or user-specified languages.
  • the historical language record is the recognized language of the user's input voice information that was recognized and recorded before the current processing cycle.
  • the recognized language here means the language of the input voice information determined by recognizing the input voice information.
  • the user-specified language refers to the type of system language set by the user, for example, according to his or her frequently used language.
  • the first confidence levels are modified according to the user characteristics, and the language of the input voice information is determined according to the modified second confidence levels, so that the language of the input voice information can be determined more accurately and the speech recognition capability can be improved.
  • As for the specific correction method, take correction based on historical language records as an example: assuming the historical language records contain more Chinese entries, the confidence level of Chinese among the multiple first confidence levels obtained in this processing cycle is increased to obtain the second confidence levels. For example, according to the historical language records, the above-mentioned multiple first confidence levels {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0} are amended to {Chinese: 0.8; English: 0.2; Korean: 0; German: 0; Japanese: 0}.
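  • A minimal sketch of one multiplicative correction that reproduces the example above; the boost factor 8/3 is reverse-engineered from the example numbers and is purely illustrative:

      # Boost the language favored by the historical language records,
      # then renormalize so the second confidence levels sum to 1.
      first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0,
               "German": 0.0, "Japanese": 0.0}
      boost = {"Chinese": 8 / 3}  # hypothetical history-derived boost

      raw = {lang: c * boost.get(lang, 1.0) for lang, c in first.items()}
      total = sum(raw.values())
      second = {lang: round(c / total, 2) for lang, c in raw.items()}
      print(second)  # {'Chinese': 0.8, 'English': 0.2, 'Korean': 0.0, ...}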
  • When the multiple first confidence levels are all smaller than the first threshold, the multiple first confidence levels may be corrected into multiple second confidence levels according to user characteristics.
  • When the plurality of first confidence levels are all smaller than the first threshold, it is difficult to determine the language of the input voice information from the first confidence levels alone. In that case, the language of the input voice information can be determined according to the second confidence levels, which can improve language recognition accuracy and speech recognition capability.
  • historical language records and user-specified languages can be obtained by querying the voiceprint features of the input voice information.
  • historical language records and user-specified languages can be easily obtained.
  • the multiple first confidence levels may be determined by multiple initial confidence levels and multiple preset weights.
  • the multiple preset weights may be updated according to the multiple second confidence levels.
  • the preset weight is updated according to the processing result of the current processing cycle, so that the language recognition accuracy of the subsequent processing cycle can be improved.
  • updating may be performed when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
  • In this case, the language recognition result obtained from the plurality of second confidence levels has higher credibility, so updating the preset weights according to the plurality of second confidence levels at this time can more reliably improve the language recognition accuracy of subsequent processing cycles.
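  • Under assumed names and an assumed update rule, the two mechanisms described above can be sketched as follows: each first confidence level is the product of an initial confidence level and a preset weight, and the weights are updated only when a second confidence level clears the first threshold:

      FIRST_THRESHOLD = 0.8  # assumed
      LEARNING_RATE = 0.05   # assumed step size for the weight update

      def first_confidences(initial_conf, weights):
          # First confidence = initial confidence x preset weight, per language.
          return {lang: initial_conf[lang] * weights[lang]
                  for lang in initial_conf}

      def maybe_update_weights(second_conf, weights):
          best = max(second_conf, key=second_conf.get)
          if second_conf[best] > FIRST_THRESHOLD:
              # The result is credible: raise the winning language's preset
              # weight slightly relative to the other languages.
              weights = dict(weights)
              weights[best] += LEARNING_RATE
          return weights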
  • the semantics of the input voice information may be determined according to the input voice information and the language of the input voice information.
  • the above-mentioned multiple preset weights may be set according to scene characteristics.
  • the scene features here may include environment features and/or audio collector features, for example.
  • the environmental characteristics here may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the following method can be adopted: obtain the pre-collected first voice data and the pre-recorded first language information of the first voice data; determine the second voice data according to the first voice data and the scene characteristics; determine the second language information of the second voice data according to the second voice data; and set a plurality of preset weights according to the first language information and the second language information.
  • the specific manner of determining the second language information of the second voice data according to the second voice data may be: acquiring multiple test weight groups, any one of which includes multiple test weights; determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information corresponding to the multiple test weight groups respectively; determining multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and setting multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
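  • This selection can be sketched as a simple grid search; here recognize() stands in for running the language recognition with a given weight group and is an assumption of the illustration:

      def pick_preset_weights(second_voice_data, first_language_info,
                              test_weight_groups, recognize):
          # Evaluate every candidate weight group on the scene-adapted data
          # and keep the one whose predictions best match the recorded labels.
          best_weights, best_acc = None, -1.0
          for weights in test_weight_groups:
              preds = [recognize(clip, weights) for clip in second_voice_data]
              acc = sum(p == y for p, y in zip(preds, first_language_info))
              acc /= len(first_language_info)
              if acc > best_acc:
                  best_weights, best_acc = weights, acc
          return best_weights  # weight group with the highest accuracy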
  • An adjustable range, that is, a weight range, can be set for the multiple preset weights, and the multiple preset weights can then be set or updated within the weight range. If a preset weight exceeds the weight range, the recognition result will not be credible. Therefore, setting a weight range can improve the accuracy of the language recognition result.
  • the weight range can be determined in the following manner: a plurality of pre-collected test voice data groups and the pre-recorded first language information of the plurality of test voice data groups are acquired, any one of the plurality of test voice data groups including a plurality of test voice data; multiple test weight groups are acquired, any one of the multiple test weight groups including multiple test weights; and the weight range is determined according to the multiple test voice data groups, the first language information, and the multiple test weight groups.
  • Fig. 4 is a flowchart of a speech processing method provided by an embodiment of the present application. Similar to the above-mentioned embodiments, the voice processing method of this embodiment can be executed by the vehicle, vehicle-mounted device, vehicle machine, vehicle-mounted computer, etc., and can also be executed by components in the vehicle or vehicle-mounted device, such as chips or processors. In addition, part of the content in this embodiment is the same as that in the above embodiment, so the description of these content will not be repeated.
  • the speech processing method includes the following contents:
  • the third confidence levels are modified according to the scene features, and the language of the input voice information is determined according to the modified fourth confidence levels, so that the language of the input voice information can be determined more accurately and the speech recognition capability can be improved.
  • the third confidence degree may be obtained in the same manner as the first confidence degree, or may be different
  • the fourth confidence degree may be obtained in the same manner as the second confidence degree, or may be different.
  • the specific manner of performing correction according to scene characteristics may be the same as the specific manner of performing correction according to user characteristics in the foregoing embodiments, or may be different.
  • correction processing in this embodiment and the correction processing described in the above embodiments can be used in combination, that is, the language confidence is corrected according to both user characteristics and scene characteristics, so that the language of the input voice information can be determined more accurately.
  • a plurality of preset weights may be set according to scene characteristics; and the plurality of third confidence degrees are modified into a plurality of fourth confidence degrees according to the plurality of preset weights.
  • the method of setting multiple preset weights may specifically be: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining according to the first voice data and scene characteristics second voice data; determining second language information of the second voice data according to the second voice data; setting a plurality of preset weights according to the first language information and the second language information.
  • multiple test weight groups are obtained, each including multiple test weights; multiple second language information is determined according to the second voice data and the multiple test weight groups, the multiple second language information corresponding to the multiple test weight groups respectively; multiple accuracy rates of the multiple second language information are determined according to the first language information and the multiple second language information; and a plurality of preset weights are set according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • FIG. 5 is an explanatory diagram of a schematic structure of a speech processing device provided by an embodiment of the present application.
  • the voice processing device 190 is used to execute the voice processing method in the embodiment described with reference to FIG. 3 or the voice processing method in the embodiment described with reference to FIG. 4 , and its structure can be known from the above description, so it is only briefly described here.
  • the voice processing device 190 includes a processing module 192 and a transceiver module 194 .
  • the processing module 192 may be used to execute the content in S2-S4 or S7-S9 above, and the transceiver module 194 may be used to execute the content in S1 or S6 above.
  • the speech processing device 190 may be composed of hardware, may also be composed of software, or may be composed of a combination of software and hardware. Using the speech processing apparatus 190 of this embodiment, the same technical effect as that of the speech processing method described above can be obtained, so repeated description of the technical effect is omitted here.
  • a language recognition method provided by an embodiment of the present application is described below with reference to FIG. 6 .
  • Fig. 6 is a flow chart for schematically illustrating a language recognition method provided by an embodiment of the present application.
  • the language recognition method can be executed by a vehicle, a vehicle-mounted device, a vehicle machine, a vehicle-mounted computer, a chip, a processor, and the like.
  • the input voice information of the user is acquired.
  • the user's input voice data received by the microphone is acquired as the input voice information, or the input voice data of the microphone is preprocessed to obtain the input voice information.
  • the input speech is recognized to obtain a multilingual first recognition confidence set, and multiple first recognition confidence levels in the first recognition confidence set correspond to multiple languages respectively.
  • the multilingual first recognition confidence set is ⁇ Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0 ⁇ . That is, the probability that the language of the input voice information is Chinese is 0.9, the probability that it is English is 0.1, and the probability that it is Korean, German, or Japanese is 0.
  • In step S14, it is judged whether there is a first recognition confidence value greater than a threshold in the multilingual first recognition confidence set.
  • the threshold here can be set to 0.8, for example.
  • If such a value exists, the recognition result is generated and output according to the first recognition confidence set.
  • the recognition result here may be a result indicating the recognized language (for example, Chinese), or may be the first recognition confidence set itself.
  • this S14 may be omitted, and step S18 described later may be directly performed.
  • In step S18, the first recognition confidence set is corrected and calculated according to the user characteristics of the user to obtain the second recognition confidence set.
  • Examples of user characteristics include the user's historical language records, user-specified language, and the like.
  • the historical language record refers to the record of the recognition language of the speech input by the user before the above-mentioned input speech is input.
  • the user-specified language refers to the type of system language set by the user (such as the system language of the voice interaction system, the system language of the mobile phone operating system when applied to a mobile phone, etc.). In addition, the user may specify one or more languages (that is, the user has set multiple system languages).
  • Historical language records can be obtained by querying the voiceprint of the input voice in the database, and can also be obtained by querying the user's face information, iris information, etc. That is, the user's identity can be determined based on voiceprint, face information, iris information, etc., and thus the user's historical language records can be obtained from the database.
  • Querying historical language records and user-specified languages according to the voiceprint of the input voice can avoid misidentifying the user (speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification.
  • the voiceprint can be obtained from the input voice information itself, while querying based on face information, iris, etc. also needs to obtain the user image. Therefore, the voiceprint-based query requires less equipment and processes faster.
  • The basis for obtaining the user-specified language through the voiceprint is to collect the user's voiceprint when the user sets the system language, associate the voiceprint (or user identity) with the system language set by the user, and store the association in the user-specified language database.
  • the database mentioned here can be stored locally or on a credible platform.
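  • A minimal sketch of such a voiceprint-keyed lookup, assuming the input voice has already been turned into an embedding by a speaker-embedding model (not shown) and that each database record stores a voiceprint plus the user's history and specified languages:

      import math

      def cosine(a, b):
          # Cosine similarity between two embedding vectors.
          dot = sum(x * y for x, y in zip(a, b))
          na = math.sqrt(sum(x * x for x in a))
          nb = math.sqrt(sum(x * x for x in b))
          return dot / ((na * nb) or 1.0)

      def lookup_profile(voice_embedding, database, min_similarity=0.7):
          # Return the profile of the closest enrolled voiceprint, or None
          # if no stored voiceprint is similar enough to the speaker.
          if not database:
              return None
          best = max(database,
                     key=lambda rec: cosine(rec["voiceprint"], voice_embedding))
          if cosine(best["voiceprint"], voice_embedding) < min_similarity:
              return None
          return {"history": best["history"], "specified": best["specified"]}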
  • a language recognition result is generated according to the second recognition confidence set.
  • the second recognition confidence set may be directly output as the recognition result. Alternatively, when there is a second recognition confidence greater than a threshold in the second recognition confidence set, the second recognition confidence set or information indicating the recognized language is output, and when there is no second recognition confidence greater than the threshold in the second recognition confidence set, the first recognition confidence set is output as the recognition result.
  • the first recognition confidences are corrected according to the user characteristics to obtain the second recognition confidences, and the language recognition result is determined according to the second recognition confidences. In this way, the language recognition capability can be improved.
  • the language recognition method in this embodiment further includes: when the multiple second recognition confidences in the second recognition confidence set are all smaller than the threshold, generating the language recognition result according to the automatic speech recognition (ASR) confidence obtained by performing automatic speech recognition on the input speech, or according to the natural language understanding (NLU) confidence obtained by performing natural language understanding on the input speech.
  • the language of the input speech is determined according to the automatic speech recognition confidence or the natural language understanding confidence, thereby improving the language recognition ability.
  • the language whose automatic speech recognition confidence exceeds a threshold is used as the recognized language of the input speech.
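  • The fallback order described above can be sketched as follows; the 0.8 threshold and the convention that every confidence source maps language to score are assumptions:

      THRESHOLD = 0.8  # assumed

      def recognition_result(second_conf, asr_conf, nlu_conf):
          # Prefer the corrected (second) confidences; if none clears the
          # threshold, fall back to the ASR and then the NLU confidences.
          for conf in (second_conf, asr_conf, nlu_conf):
              best = max(conf, key=conf.get)
              if conf[best] > THRESHOLD:
                  return best  # first source with a credible top score
          return None  # no credible language found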
  • the first recognition confidence set can be obtained in the following manner: the input speech is recognized to obtain an initial confidence set, and the multiple initial confidences in the initial confidence set are respectively multiplied by the multiple preset weights in the preset weight set to obtain the first recognition confidence set.
  • When there is a second recognition confidence greater than the threshold in the second recognition confidence set, the preset weight set can be updated so that the preset weight of the recognized language whose second recognition confidence is greater than the threshold is increased relative to the preset weights of the other languages.
  • The preset weight set is updated as above, so that the updated preset weight set is used when subsequent input speech is processed. In this way, the accuracy of language recognition can be improved, and the language recognition capability can be improved.
  • the specific method of updating the preset weight set can be: performing a correction calculation on the preset weight set to obtain a corrected weight set, and, when the multiple corrected weights in the corrected weight set are within the weight range, updating the preset weight set with the values of the multiple corrected weights.
  • If a preset weight is outside the weight range, the reliability of the language recognition result obtained with it is relatively low. Therefore, using the above method to correct the preset weights within the weight range can suppress the language misrecognition rate.
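  • A sketch of such a range-constrained update, assuming weight_range maps each language to a (low, high) pair:

      def apply_correction(preset_weights, corrected_weights, weight_range):
          # Adopt the corrected weights only if every one of them stays
          # inside its per-language weight range; otherwise keep the old set.
          within = all(weight_range[lang][0] <= w <= weight_range[lang][1]
                       for lang, w in corrected_weights.items())
          return dict(corrected_weights) if within else dict(preset_weights)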
  • a preset weight set may be preset according to scene characteristics. Therefore, it is possible to adapt to different scenarios, and obtain the language recognition result with the preset weights that are most suitable for the scenario, thereby improving the language recognition ability.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude.
  • the audio collector characteristics may include microphone arrangement information.
  • the microphone arrangement information refers to whether the audio collector is a single microphone or a microphone array, and, if it is a microphone array, whether it is a linear array, a planar array, or a stereo array.
  • Environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence levels. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
  • the specific method of setting the preset weight set can be: obtaining multiple test weight sets; inputting a pseudo-environment data set into the language recognition model, the pseudo-environment data set being obtained according to the scene characteristics and a noise-free data set; obtaining, according to the initial confidence sets output by the language recognition model, multiple first recognition confidence sets under the conditions of the multiple test weight sets; calculating the prediction accuracy of the multiple first recognition confidence sets according to the language information of the pseudo-environment data set; determining the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; and setting the preset weight set with the values of the multiple test weights in the optimal test weight set.
  • When the set preset weights are within the weight range, the setting takes effect; when the set preset weights are not within the weight range, the setting is canceled. Alternatively, when the set preset weights are not within the weight range, the setting may remain valid, but other methods are preferred for obtaining the language recognition result, such as determining the recognized language of the input voice according to the user-specified language, or comparing the current input voice with the input voice in the historical language record to obtain a feature similarity: if the feature similarity is greater than a similarity threshold, the language of the input voice in the historical language record is determined as the recognized language of the current input voice.
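  • A sketch of this similarity-based fallback; similarity() is an assumed feature-comparison function returning a value in [0, 1]:

      def language_from_history(current_features, history, similarity,
                                sim_threshold=0.8):
          # Reuse the recorded language of the most similar historical input
          # if the similarity clears the threshold; otherwise give up.
          if not history:
              return None
          best = max(history, key=lambda rec: similarity(rec["features"],
                                                         current_features))
          if similarity(best["features"], current_features) > sim_threshold:
              return best["language"]
          return None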
  • the weight range can be set in the following way: obtain multiple test data sets; obtain multiple test weight sets; input the test data sets into the language recognition model; according to the initial confidence sets output by the language recognition model and the multiple test weight sets, obtain multiple first recognition confidence sets under the conditions of the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the test data sets; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy as the optimal test weight set; obtain the optimal test weight sets of the multiple test data sets; and obtain the weight range of each language according to the optimal test weight sets of the multiple test data sets.
  • A test data set may be a pre-collected speech data set whose language information is known.
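  • Sketching the last step: given the optimal test weight set found for each test data group (for example, with a grid search like pick_preset_weights above), the weight range of each language can be taken as the minimum and maximum of that language's optimal weights across the groups:

      def derive_weight_range(optimal_weight_sets):
          # optimal_weight_sets: one {language: weight} dict per test data group.
          languages = optimal_weight_sets[0].keys()
          return {lang: (min(w[lang] for w in optimal_weight_sets),
                         max(w[lang] for w in optimal_weight_sets))
                  for lang in languages}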
  • FIG. 7 is a schematic structural diagram of a language recognition device provided by an embodiment of the present application.
  • an embodiment of the present application provides a language recognition device, which is used to implement the language recognition method shown in FIG. 6. Its structure can be known from the above description of the language recognition method of FIG. 6, so only a relatively brief description of the language recognition device 10 is given here.
  • the language recognition device 10 includes: an input speech acquisition module 17, configured to acquire the user's input speech; a language recognition module 12, configured to recognize the input speech to obtain a first recognition confidence set, the multiple first recognition confidences in the first recognition confidence set corresponding to multiple languages respectively; a language confidence correction module 16, configured to correct and calculate the first recognition confidence set according to the user characteristics of the user to obtain a second recognition confidence set; and a recognition result generation module 18, configured to generate a language recognition result according to the second recognition confidence set.
  • the first recognition confidences are corrected according to the user characteristics to obtain the second recognition confidences, and the language recognition result is determined according to the second recognition confidences. In this way, the language recognition capability can be improved.
  • the language confidence correction module 16 may correct and calculate the first recognition confidence set according to user characteristics to obtain the second recognition confidence set when the multiple first recognition confidences are all less than the threshold.
  • the user characteristics include historical language records.
  • the historical language record of the user refers to the record of the language to which the voice input by the user belongs before inputting the above-mentioned input voice.
  • the historical language record is obtained by querying the voiceprint of the input voice.
  • Using the voiceprint of the input voice to query the historical language record can, compared with querying based on face information, iris, etc., avoid misidentifying the user (speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification.
  • the user characteristics include a user-specified language.
  • the first recognition confidence is modified according to the language specified by the user, and the language of the input speech is determined on this basis, thereby improving the language recognition ability.
  • the language specified by the user is obtained through querying the voiceprint of the input voice.
  • Querying the user-specified language according to the voiceprint of the input voice can, compared with querying based on face information, iris, etc., avoid misidentifying the user (speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification.
  • the recognition result generation module is also used to: when the multiple second recognition confidences in the second recognition confidence set are all less than the threshold, generate the language recognition result according to the automatic speech recognition (Automatic Speech Recognition, ASR) confidence obtained by performing automatic speech recognition on the input speech.
  • the language of the input speech is determined according to the automatic speech recognition confidence, thereby improving the language recognition ability.
  • the language whose automatic speech recognition confidence exceeds a threshold is used as the recognized language of the input speech.
  • the recognition result generation module is also used to: when the multiple second recognition confidences in the second recognition confidence set are all less than the threshold, generate the language recognition result according to the natural language understanding confidence obtained by performing natural language understanding on the input speech.
  • The language of the input speech is determined according to the natural language understanding confidence, thereby improving the language recognition capability. For example, the language whose natural language understanding confidence exceeds a threshold is used as the recognized language of the input speech.
  • The language recognition module is also used to: recognize the input speech to obtain an initial confidence set, and multiply the multiple initial confidences in the initial confidence set by the multiple preset weights in the preset weight set to obtain the first recognition confidence set. The language confidence correction module is also used to update the preset weight set when there is a second recognition confidence greater than the threshold in the second recognition confidence set, so that the preset weight of the recognized language whose second recognition confidence is greater than the threshold is increased relative to the preset weights of the other languages.
  • The preset weight set is updated in the above manner, so that subsequent input speech is recognized with the updated preset weight set, which improves the accuracy of language recognition and thus the language recognition capability.
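  • As an illustration of the multiplication step described above, the following minimal Python sketch computes the first recognition confidences from the initial confidences and the preset weights; the dictionary layout and the function name are assumptions for illustration, not the patent's implementation.

        # First recognition confidence = initial confidence x preset weight,
        # computed per language; the data layout is illustrative.
        LANGS = ("Chinese", "English", "Korean", "German", "Japanese")

        def first_confidences(initial: dict, preset_weights: dict) -> dict:
            return {lang: initial[lang] * preset_weights[lang] for lang in LANGS}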
  • The language confidence correction module is also used to: perform correction calculation on the preset weight set to obtain a correction weight set, and update the preset weight set with the values of the multiple correction weights when the multiple correction weights in the correction weight set are all within the weight range.
  • When a preset weight falls outside the weight range, the reliability of the language recognition result obtained according to that preset weight is relatively low. Therefore, with the above method, restricting the corrected preset weights to the weight range can suppress the language misrecognition rate.
  • The language recognition module is also used to: recognize the input speech to obtain an initial confidence set, and multiply the multiple initial confidences in the initial confidence set by the multiple preset weights in the preset weight set to obtain the first recognition confidence set. The language confidence correction module is also used to set the preset weight set according to scene features.
  • The preset weight set is set according to the scene features, so that different scenes can be adapted to and the language recognition result is obtained with the preset weights best suited to the scene, thereby improving the language recognition capability.
  • the scene features include environment features and/or audio collector features.
  • The environmental features include the environmental signal-to-noise ratio, power supply direct-current/alternating-current information, or the environmental vibration amplitude.
  • the audio collector characteristics include microphone arrangement information.
  • The microphone arrangement information indicates whether a single microphone or a microphone array is used and, in the case of a microphone array, whether it is a linear array, a planar array, or a stereo array.
  • The environmental signal-to-noise ratio, the power supply DC/AC information, the environmental vibration amplitude, and the microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis improves the language recognition capability.
  • The language confidence correction module is also used to: obtain multiple test weight sets; input a quasi-environment data set, obtained from the scene features and a noise-free data set, into the language recognition model; obtain, from the initial confidence set output by the language recognition model, the multiple first recognition confidence sets under the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the quasi-environment data set; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set. It can therefore be said that the language confidence correction module has a preset weight setting module.
  • the language confidence correction module is further configured to set multiple preset weights within the weight range.
  • The weight range is set as follows: obtain multiple test data sets and multiple test weight sets; input each test data set into the language recognition model; obtain, from the initial confidence set output by the language recognition model and the multiple test weight sets, the multiple first recognition confidence sets under the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the test data set; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; obtain the optimal test weight sets of the multiple test data sets; and derive the weight ranges of the multiple languages from the optimal test weight sets of the multiple test data sets.
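  • The following hypothetical Python sketch condenses this procedure; evaluate_accuracy() is an assumed helper that scores one candidate weight set on one test data set, and the per-language range is taken as the minimum and maximum of that language's weight over all optimal sets.

        def optimal_weight_set(test_set, candidate_sets, evaluate_accuracy):
            # The candidate weight set with the highest prediction accuracy wins.
            return max(candidate_sets, key=lambda w: evaluate_accuracy(test_set, w))

        def weight_ranges(test_sets, candidate_sets, evaluate_accuracy):
            optima = [optimal_weight_set(ts, candidate_sets, evaluate_accuracy)
                      for ts in test_sets]
            # Per-language range over the optimal weight sets of all test data sets.
            return {lang: (min(w[lang] for w in optima), max(w[lang] for w in optima))
                    for lang in optima[0]}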
  • The function of setting the weight range can be realized by the language recognition device 10, in which case the language recognition device 10 can be said to have a weight range setting module, or it can be realized by a test device for testing the language recognition device 10.
  • An embodiment of the present application provides a computing device, which includes a processor and a memory, the memory stores program instructions, and when the program instructions are executed by the processor, the processor executes the speech processing method and the language recognition method.
  • More about the computing device can be understood from the following description in conjunction with FIG. 17.
  • An embodiment of the present application provides a computer-readable storage medium, which stores program instructions, and is characterized in that, when the program instructions are executed by a computer, the computer executes the above speech processing method and language recognition method.
  • An embodiment of the present application provides a computer program. When the computer program is executed by a computer, the computer executes the above speech processing method and language recognition method.
  • FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of the present application. Part of the steps in the speech interaction method are the same as the above-mentioned language recognition method, and here, the same content is marked with the same reference numerals, and the description thereof is simplified.
  • step S10 input voice information of the user is acquired.
  • the input voice information of the user received by the microphone is obtained.
  • step S40 automatic speech recognition is performed on the input speech using a speech recognition model; on the other hand, in step S12, language recognition is performed on the input speech using a language recognition model.
  • automatic speech recognition and language recognition may also be performed sequentially.
  • In step S40, in order to be able to recognize input speech in multiple languages, speech recognition models of a plurality of different languages (in this embodiment, the five languages Chinese, English, Korean, German, and Japanese) are used to perform speech content recognition processing on the input speech, obtaining multiple texts Ti in different languages.
  • In step S42, the multiple texts Ti are input into the text translation model, which performs translation processing on these texts Ti and converts them into texts Ai of the target language (for example, Chinese).
  • step S44 multiple texts Ai are sequentially input into the semantic understanding model, and the semantic understanding model performs semantic understanding processing on these texts Ai, so as to obtain multiple corresponding candidate commands Oi.
  • "Candidate" means a command that has not yet been confirmed for execution.
  • step S12 the input speech is recognized to obtain a multilingual first recognition confidence set, and multiple first recognition confidence levels in the first recognition confidence set correspond to multiple languages respectively.
  • the multilingual first recognition confidence set is ⁇ Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0 ⁇ .
  • step S14 it is judged whether there is a first recognition confidence value greater than a threshold in the multilingual first recognition confidence set.
  • the threshold here can be set to 0.8, for example.
  • In step S26, the candidate command corresponding to the recognized language (for example, Chinese) is selected from the multiple candidate commands Oi obtained in step S44 as the target command to be executed, and the target command is then caused to be executed. For example, when the target command is "turn on the air conditioner", corresponding control is executed to turn on the air conditioner.
  • When the judgment result in step S14 is "No", that is, when there is no first recognition confidence greater than the threshold (the multiple first recognition confidences are all less than the threshold), in step S18 the first recognition confidences are corrected according to the user characteristics.
  • The specific content of the correction has been described in detail above and is not repeated here.
  • In step S22, it is judged whether there is a second recognition confidence greater than the threshold. When the judgment result is "Yes", in step S24 the language whose second recognition confidence is greater than the threshold is determined as the recognized language of the input speech, after which the processing in step S26 is performed.
  • When the judgment result in step S22 is "No", in step S28 it is judged whether there is an ASR confidence greater than the threshold.
  • step S30 the language whose ASR confidence is greater than the threshold is determined as the recognized language, and then the processing in step S26 is performed.
  • When the judgment result in step S28 is "No", that is, when there is no ASR confidence greater than the threshold (the multiple ASR confidences are all less than the threshold), in step S32 it is judged whether there is an NLU confidence greater than the threshold.
  • When the judgment result is "Yes", in step S34 the language whose NLU confidence is greater than the threshold is determined as the recognized language, and then the processing in step S26 is executed.
  • When the judgment result in step S32 is "No", information indicating that the voice content recognition has failed may be output, and the processing ends.
  • As above, the first recognition confidences are corrected according to the user characteristics to obtain the second recognition confidences, and the language recognition result is determined according to the second recognition confidences, so that the language recognition capability, and thus the voice interaction capability, can be improved.
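  • The decision cascade of FIG. 8 can be sketched as follows in Python; the stage order and the example threshold 0.8 come from the description above, while the function names and per-language dictionaries are assumptions.

        THRESHOLD = 0.8  # example value from the description

        def pick_language(confidences: dict, threshold: float = THRESHOLD):
            lang = max(confidences, key=confidences.get)
            return lang if confidences[lang] > threshold else None

        def recognize(first_conf, corrected_conf, asr_conf, nlu_conf):
            # S14 -> S22 -> S28 -> S32: fall through until a stage clears the threshold.
            for stage in (first_conf, corrected_conf, asr_conf, nlu_conf):
                lang = pick_language(stage)
                if lang is not None:
                    return lang
            return None  # voice content recognition fails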
  • FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application.
  • The voice interaction system (or voice interaction device) 20 has a voice recognition module 110, a language recognition module 12, a text translation module 130, a semantic understanding module 140, an input voice acquisition module 17, a language confidence correction module 16, and a control module 170.
  • the voice interaction device is used to execute the voice interaction method described with reference to FIG. 8 , therefore, the description of the specific processing flow is omitted here.
  • Like the above-mentioned language recognition device 10, the voice interaction system 20 has a language recognition module 12, an input speech acquisition module 17, and a language confidence correction module 16; these are marked with the same reference numerals and their description is omitted.
  • the voice interaction system may further include an execution device, such as a loudspeaker, a display device, and the like.
  • the speech recognition module 110 executes step S40 in FIG. 8 .
  • the language identification module 12 executes step S12 in FIG. 8 .
  • the text translation module 130 executes step S42 in FIG. 8 .
  • the semantic understanding module 140 executes step S44 in FIG. 8 .
  • the input speech acquisition module 17 executes step S10 in FIG. 8 .
  • the language confidence correction module 16 executes step S18 in FIG. 8 .
  • the control module 170 executes step S14 , step S16 , step S22 , step S24 , step S28 , step S30 , step S32 , and step S34 in FIG. 8 .
  • Step S14 , Step S16 , Step S22 , and Step S24 may also be executed by the language confidence correction module 16 .
  • The speech interaction method described with reference to FIG. 8 essentially includes a multilingual speech recognition method that can recognize input speech in multiple languages, and the voice interaction system correspondingly includes a speech recognition device for implementing the multilingual speech recognition method. Because the content would be largely repetitive, no separate embodiments are given here to describe the speech recognition method and speech recognition device.
  • the voice interaction system 100 and the voice interaction method executed by it according to an embodiment of the present application will be described below with reference to FIGS. 11-17 .
  • Here, the voice interaction system 100 applied in a car to form a vehicle voice interaction system is taken as an example; such a system can also be used for voice question answering, intelligent voice analysis, real-time voice monitoring and analysis, and the like.
  • the vehicle voice interaction system also constitutes a vehicle control device.
  • embodiments of the present application provide a voice processing method, device, and system.
  • the voice interaction system 100 of this embodiment can receive the input voice of the user (that is, the speaker), and perform corresponding processing in response to the content of the input voice, such as turning on the air conditioner, opening the car window, and other processing.
  • the voice interaction system 100 can respond to voices in multiple different languages. For example, in this embodiment, it can respond to voices in five languages: Chinese, English, Korean, German, and Japanese.
  • Voices in different languages include not only voices of different language families (for example, Chinese and English belong to different languages) but also voices of different sub-languages within the same language family (for example, Mandarin and Cantonese, both Chinese, also belong to different languages here).
  • FIG. 11 is a schematic illustration of a voice interaction system according to an embodiment of the present application.
  • the voice interaction system 100 has a voice recognition module 110, a language recognition module 120, a text translation module 130, a semantic understanding module 140, a command analysis and execution module 150, and a language confidence correction module 160.
  • The voice interaction system 100 can also have a microphone, a speaker, a camera, a display, and the like.
  • Fig. 12 is a flowchart for illustrating one procedure of the voice interaction method involved in an embodiment. A processing flow of the voice interaction system 100 is described below with reference to FIG. 12 , so as to roughly illustrate the architecture of the voice interaction system 100 .
  • After the voice interaction system 100 acquires the user's voice (called the input voice) through the microphone, on the one hand: (1) the input voice is input into the voice recognition module 110, which calls the speech recognition sub-modules of a plurality of different languages (in the present embodiment, the five languages Chinese, English, Korean, German, and Japanese; other numbers of languages are self-evidently possible) to perform speech content recognition on the input voice, obtaining multiple texts Ti in different languages; (2) the multiple texts Ti are input into the text translation module 130, which performs translation processing on these texts Ti and converts them into texts Ai of the target language (for example, Chinese); (3) the multiple texts Ai are sequentially input into the semantic understanding module 140, which performs semantic understanding processing on these texts Ai to obtain multiple corresponding candidate commands Oi.
  • On the other hand, the language recognition module 120 performs language recognition processing on the input voice, generates initial confidences for the multiple languages, and multiplies each initial confidence by the corresponding preset weight to obtain the recognition confidences of the multiple languages.
  • The command analysis and execution module 150 determines, among the multiple candidate commands Oi, the candidate command corresponding to the language whose recognition confidence is greater than the threshold α as the target command to be executed, and performs corresponding processing according to the content of the target command.
  • In addition, the language confidence correction module 160 corrects the recognition confidences of the multiple languages according to the user characteristics; the specific content will be described in detail later.
  • The speech recognition module 110, the language recognition module 120, the text translation module 130, and the semantic understanding module 140 respectively include an algorithm model, namely a speech recognition model, a language recognition model, a text translation model, and a semantic understanding model, which respectively perform speech recognition processing, language recognition processing, text translation processing, and semantic understanding processing.
  • the speech recognition module 110 is used to convert human speech, that is, speech to be recognized, into text in a corresponding language, which can also be said to predict speech content or perform automatic speech recognition (Automatic Speech Recognition, ASR).
  • the speech recognition module 110 has a plurality of speech recognition sub-modules, and each speech recognition sub-module corresponds to a language respectively, and is used to convert the speech into the text Ti of the corresponding language.
  • these sub-modules output the text Ti as the prediction result and the confidence of the text Ti.
  • This confidence is called the ASR confidence; it represents the sub-module's predicted probability for the predicted text, that is, the predicted probability of the speech content.
  • the text translation module 130 is used to convert text in one natural language (source language) into text in another natural language (target language), for example, convert English text into Chinese text.
  • the text translation module 130 has a plurality of text translation sub-modules, and each text translation sub-module corresponds to a language respectively.
  • Taking Chinese as the target language as an example, the text translation sub-modules of the other four languages are respectively used to translate English text, Korean text, German text, and Japanese text into Chinese texts Ai.
  • the text translation module 130 may not process the input Chinese text and output the input Chinese text as it is.
  • The text translation module 130 finally outputs 5 Chinese texts Ai.
  • The semantic understanding module 140 is used to perform natural language understanding (Natural Language Understanding, NLU) on the text of the target language, which can also be said to predict the intent of the text and generate commands that the machine can understand. For example, if the text is "Please play the song 'XX'", the machine can obtain the intent "Please play the song 'XX'" after the semantic understanding module 140. While generating a command, the semantic understanding module 140 also generates an NLU confidence, which represents the predicted probability of the meaning of the text by the semantic understanding module 140. In addition, since the speech recognition module 110 outputs texts in 5 languages, the semantic understanding module 140 eventually generates commands corresponding to the 5 languages and 5 NLU confidences. Furthermore, these commands output by the semantic understanding module 140 have not yet been determined to be executed, so they are called candidate commands.
  • The language identification (Language Identification, LID) module is used to identify the language of the user's input voice, that is, the voice to be recognized; it can also be said to predict to which of the multiple languages the user's input voice belongs. In this embodiment, the language recognition module 120 recognizes to which of Chinese, English, Korean, German, and Japanese the input speech belongs, and outputs a set of recognition confidences of the multiple languages as the recognition result, each recognition confidence representing the predicted probability for a language.
  • Specifically, algorithmic recognition is performed on the input speech to obtain the confidences of the multiple languages (these are called the initial confidences), and the initial confidences of the multiple languages are respectively multiplied by the corresponding preset weight values to obtain the recognition confidences of the multiple languages; the language recognition module 120 outputs these recognition confidences as the prediction result.
  • the calculation of multiplying the initial confidence by the preset weight value may or may not be performed by the language recognition model.
  • the command parsing and execution module 150 is used for selecting a target command to be executed from the candidate commands output by the semantic understanding module 140 according to the output of the language identification module 120 .
  • Specifically, the command analysis and execution module 150 determines the language whose recognition confidence is greater than the threshold α as the language of the user's input voice, and determines the candidate command corresponding to that language as the target command to be executed. For example, when the confidences of the multiple languages output by the language identification module 120 are {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}, Chinese is determined as the language of the user's input voice, and the candidate command corresponding to Chinese is determined as the target command to be executed.
  • The command parsing and execution module 150 also executes the control for enabling the target command to be executed. For example, when the determined target command is "please play the song 'XX'" and the voice interaction system 100 includes a music player module, the command analysis and execution module 150 controls the music player module to play the song "XX". When the music player module does not belong to the voice interaction system 100 and is not controlled by the command analysis and execution module 150, the determined target command can be sent to an upper-level controller of both the voice interaction system 100 and the music player module, and the upper-level controller sends a command to the controller of the music playing module to play the song "XX".
  • In addition, the command analysis and execution module 150 can respond to the user through the speaker or the display; for example, when the determined target command is "please play the song 'XX'", the command analysis and execution module 150 controls the speaker to emit the sound "OK, I will play it for you soon" in response to the user.
  • The command analysis and execution module 150 can also use other methods to determine the target command to be executed; these methods are illustrated below.
  • Mode 1: the language confidence correction module 160 performs correction calculation on the recognition confidences of the above-mentioned multiple languages (corresponding to the "first recognition confidences" in this application), and the command analysis and execution module 150 performs corresponding processing according to the output of the language confidence correction module 160. For example, the correction can be made according to user characteristics, where the user characteristics include the user's historical language records and the user-specified language. The historical language records and the user-specified language can be obtained by determining the user identity from the audio features (that is, the voiceprint) and then querying the historical language record database and the user-specified language database of the voice interaction system 100 according to the user identity. The specific content of these corrections will be described in detail later.
  • The command analysis and execution module 150 then determines the language of the input speech according to the corrected recognition confidences and performs corresponding processing. For example, when there is a recognition confidence greater than the threshold α among the corrected recognition confidences of the multiple languages, the language whose recognition confidence is greater than the threshold α is determined as the language of the user's input voice, and the candidate command corresponding to the determined language is determined as the target command to be executed.
  • In the correction calculation, the value of the recognition confidence can be corrected directly, or the preset weights can be corrected and the recognition confidences recalculated from the initial confidence set and the corrected preset weight set.
  • The language confidence correction module 160 has an audio feature-based adjustment module 162, a video feature-based adjustment module 163, and a comprehensive adjustment module 164; these adjustment modules are used to correct the language confidence in different ways.
  • Mode 2: the command parsing and execution module 150 determines the target command to be executed according to the ASR confidence output by the speech recognition module 110 or the NLU confidence output by the semantic understanding module 140. For example, when there is an ASR confidence greater than the ASR confidence threshold (which can be set to the same value as the above-mentioned threshold α, for example 0.8), the language corresponding to that ASR confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed. Similarly, when there is an NLU confidence greater than the threshold, the language corresponding to that NLU confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
  • The execution timing of mode 2 can be set freely: it can be executed before mode 1, or between the multiple methods listed in the description of mode 1.
  • Mode 3: the command parsing and execution module 150 determines the language of the input speech by feature similarity. For example, the current input voice is compared with the audio data of historical input voices in the historical record, and the feature similarity between the two is obtained through cosine similarity, linear regression, or deep learning. When the feature similarity exceeds the threshold, the recognized language of the historical input voice can be determined as the language of the current input voice.
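  • A minimal sketch of the cosine-similarity variant of mode 3, assuming fixed-length voice feature vectors; the helper names and the separate similarity threshold are illustrative assumptions.

        import math

        def cosine_similarity(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb) if na and nb else 0.0

        def language_from_history(current_features, history, sim_threshold):
            # history: iterable of (feature_vector, recognized_language) pairs.
            best_lang, best_sim = None, 0.0
            for features, lang in history:
                sim = cosine_similarity(current_features, features)
                if sim > best_sim:
                    best_lang, best_sim = lang, sim
            # Only trust the match when the similarity exceeds the threshold.
            return best_lang if best_sim > sim_threshold else None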
  • The execution timing of mode 3 can also be set freely. Optionally, it can be executed after or before modes 1 and 2, between modes 1 and 2, or between the multiple methods exemplified in the description of mode 1.
  • the confidence correction module will be described below.
  • the language confidence correction module 160 includes a real-time scene adaptation module 161 , an audio feature-based adjustment module 162 , a video feature-based adjustment module 163 and a comprehensive adjustment module 164 .
  • The real-time scene adaptation module 161 is used to initialize the multilingual preset weight set according to the environmental characteristics and the characteristics of the audio collector (i.e., the microphone) when the language recognition model initially contacts the scene.
  • the initial contact scene here is, for example, when the user has just purchased a voice interaction system or a vehicle. At this time, the user generally turns on the voice interaction system to perform some basic settings or tests.
  • the real-time scene adaptation module 161 can use this opportunity to initialize the preset weight set.
  • the initialization of the preset weight set is not limited to be performed when initially contacting the scene, and can also be performed at other appropriate times, such as when replacing a new audio collector, or the user chooses the execution time.
  • the video feature-based adjustment module 163 is configured to modify the recognition confidence sets of multiple languages according to the captured user images.
  • the comprehensive adjustment module 164 is mainly used to modify the recognition confidence sets of multiple languages according to the language specified by the user.
  • the user-specified language is obtained by querying the database of the voice interaction system 100 according to the voiceprint of the input voice.
  • When there is a recognition confidence greater than the threshold α among the corrected recognition confidences of the multiple languages, the confidence correction module performs correction calculation on the preset weight sets of the multiple languages, so that the preset weights of the languages whose recognition confidence is greater than the threshold α are increased relative to the preset weights of the other languages; in this way, a correction weight set is obtained. Afterwards, the confidence correction module judges whether each correction weight in the correction weight set is within the weight range, and when they are, the values in the correction weight set are used to update the preset weight set, which the language recognition module 120 uses for subsequent language recognition.
  • The functions of the speech recognition module 110, the language recognition module 120, the text translation module 130, the semantic understanding module 140, the command analysis and execution module 150, and the confidence correction module can be implemented by a processor executing programs (software) stored in a memory, or by hardware such as an LSI (Large Scale Integration) circuit or an ASIC (Application Specific Integrated Circuit).
  • These modules can also be formed by an electronic control unit (ECU).
  • one module can be formed by one ECU, or multiple ECUs, or one ECU can be used to form multiple modules.
  • ECU refers to a control device composed of integrated circuits used to implement a series of functions such as data analysis, processing and transmission.
  • An embodiment of the present application provides an electronic control unit (ECU); the ECU includes a microcomputer, an input circuit, an output circuit, and an analog-to-digital (A/D) converter.
  • the main function of the input circuit is to preprocess the input signal (such as the signal from the sensor), and the processing method is different for different input signals.
  • the input circuit may include an input circuit that processes analog signals and an input circuit that processes digital signals.
  • the main function of the A/D converter is to convert the analog signal into a digital signal. After the analog signal is preprocessed by the corresponding input circuit, it is input to the A/D converter for processing and converted into a digital signal accepted by the microcomputer.
  • the output circuit is a device that establishes a connection between the microcomputer and the actuator. Its function is to convert the processing results sent by the microcomputer into control signals to drive the actuators to work.
  • the output circuit generally uses a power transistor, which controls the electronic circuit of the actuator by turning on or off according to the instructions of the microcomputer.
  • The microcomputer includes a central processing unit (CPU), a memory, and an input/output (I/O) interface; the CPU is connected with the memory and the I/O interface through a bus, and they can exchange information with each other through the bus.
  • the memory may be a memory such as a read-only memory (ROM) or a random access memory (RAM).
  • The I/O interface is a connection circuit for exchanging information between the CPU and the input circuit, the output circuit, or the A/D converter. Specifically, the I/O interface can be divided into a bus interface and a communication interface.
  • The memory stores programs, and the CPU calls the programs in the memory to realize the functions of the above modules or to execute the methods described with reference to FIG. 3, FIG. 4, FIG. 6, FIG. 8, and FIG. 12.
  • the voice interaction system 100 also has a microphone, a speaker, a camera or a display.
  • the microphone is used to acquire the user's input voice, which corresponds to the voice acquisition module in this application.
  • the speaker is used to play sounds, such as the response tone "OK" to the user's input voice.
  • the camera is used to collect the user's facial image, etc., and send the collected image to the command analysis and execution module 150.
  • the command analysis and execution module 150 can perform image recognition on the image, so as to authenticate the user's identity.
  • The display is used to respond to the user's input voice; for example, when the input voice is "play the song 'XX'", the display shows the playing screen of the song.
  • The voice interaction system 100 will be described in more detail below in conjunction with its actions and processing flow.
  • The voice interaction method involved in this embodiment is described at the same time; as the following description also shows, the voice interaction method includes a language recognition method (corresponding to the language recognition module 120, part of the processing of the command analysis and execution module 150, the processing of the confidence correction module, and the like).
  • the language recognition module 120 uses the language recognition model to perform language recognition.
  • the multilingual preset weight set is initialized according to the environment feature and the audio collector feature. An example of an initialization method will be described below with reference to FIG. 13 .
  • the real-time scene adaptation module 161 generates a quasi-environment dataset according to environmental features, audio collector features and expert datasets.
  • the environmental characteristics include, for example, the environmental signal-to-noise ratio, microphone power source information (DC-AC information), or environmental vibration amplitude, and the like.
  • the information on the power source of the microphone can be obtained, for example, through a controller area network (Controller Area Network, CAN) signal of the vehicle.
  • the characteristics of the audio collector mainly include microphone arrangement information (single microphone or microphone array, wherein the microphone array includes linear array, planar array and stereo array).
  • the expert data set is a batch of multi-person, multilingual, and noise-free audio data sets collected in advance, and its content (the language of each piece of voice data) is pre-recorded and known.
  • N different multilingual confidence weight sets (confidence weight set 1 to confidence weight set N) are obtained; for example, confidence weight set 1 {Chinese: 0.80; English: 0.04; Korean: 0.06; Japanese: 0.05; German: 0.05}, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18}, and confidence weight set N {Chinese: 0.31; English: 0.09; Korean: 0.12; Japanese: 0.25; German: 0.23}.
  • The quasi-environment data set is input into the language recognition model to obtain the multilingual initial confidence set ({Chinese: p1; English: p2; Korean: p3; Japanese: p4; German: p5} in FIG. 13), and the initial confidences are multiplied by the N multilingual confidence weight sets 1 to N to obtain N recognition confidence sets. Since the content of the expert data set (the language of each piece of speech data) is known, the accuracy rate acc of each of the N recognition confidence sets can be calculated; the confidence weight set corresponding to the recognition confidence set with the highest accuracy is determined as the optimal confidence weight set, and the preset weight set is set with the values of the optimal confidence weight set to complete the initialization of the preset weight set.
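  • The selection of the optimal confidence weight set can be sketched as follows; run_lid_model() is an assumed helper that returns the per-language initial confidences for one utterance, and the data layout of the quasi-environment data set is illustrative.

        def initialize_preset_weights(quasi_env_set, candidate_weight_sets, run_lid_model):
            def accuracy(weights):
                correct = 0
                for utterance, true_lang in quasi_env_set:
                    initial = run_lid_model(utterance)
                    conf = {l: initial[l] * weights[l] for l in weights}
                    correct += (max(conf, key=conf.get) == true_lang)
                return correct / len(quasi_env_set)
            # Keep the candidate weight set with the highest accuracy acc.
            return max(candidate_weight_sets, key=accuracy)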
  • In this way, the speech interaction system 100 can be adjusted for different scenarios and perform language recognition with the best possible recognition accuracy, thereby improving the reliability of the recognition results. That is, the above technical means suppresses the problem that "the trained language recognition model does not adapt well to the scene, resulting in low reliability of the recognition results".
  • the preset weight set is initialized according to both the environment feature and the audio collector feature.
  • the preset weight set may be initialized only according to one of the environment feature and the audio collector feature.
  • step S212 of FIG. 14 it is determined whether the preset weights of each language in the set preset weight set are within the weight range.
  • the weight range is preset, and its specific value can be determined through testing, which will be described later.
  • When each preset weight is within the weight range ("Yes" in step S212), the recognition results of the language recognition model in this environment (the above-mentioned environmental characteristics and audio collector characteristics) can be guaranteed to have high reliability.
  • When the confidence weights set according to the quasi-environment data set are not within the weight range ("No" in step S212), the reliability of the results of the language recognition model in this environment is low.
  • In this case, in step S214 it is judged whether there is a historical language record; if there is, the input voice in the historical record is compared with the user's current input voice to obtain the feature similarity, and the language of the current user's input voice is thereby determined.
  • step S217 an inquiry is made based on the voiceprint to determine whether the user has specified a language, and if there is a user specified language, the recognition language of the user's input voice is determined according to the user specified language.
  • When there is one user-specified language, it is determined as the recognized language of the input voice; when there are multiple user-specified languages, for example, the most frequently occurring one is determined as the recognized language of the input voice. For example, when the user-specified languages queried are {Chinese: 3 times; English: 1 time; German: 1 time}, Chinese is determined as the recognized language of the input voice.
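  • The "most frequently specified language wins" rule in this example can be sketched with collections.Counter:

        from collections import Counter

        def pick_specified_language(specified):
            # e.g. ["Chinese"] * 3 + ["English"] + ["German"] -> "Chinese"
            return Counter(specified).most_common(1)[0][0] if specified else None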
  • step S219 the language of the input speech is determined according to the recognition result of the language recognition model.
  • In FIG. 14, step S214 is executed before step S217; however, there is no limitation on the execution order of determining the language through the historical language record and determining the language through the user-specified language.
  • In this way, the language of the user's input voice is predicted according to the historical language records or the user-specified language, thereby improving the reliability with which the voice interaction system 100 predicts the language of the input voice.
  • When the weight value of each language in the preset weight set set according to the quasi-environment data set is within the weight range ("Yes" in step S212) and the user's input voice is detected, in step S200 the language recognition model is used to perform language recognition on the input speech. In step S221, it is judged whether there is a recognition confidence greater than the threshold α in the multilingual recognition confidence set obtained from the language recognition model. If there is ("Yes" in step S221), in step S222 the user identity is determined through the voiceprint. As another embodiment, the identity of the user may also be determined by means of face recognition or iris recognition.
  • In step S223, the user's historical language record and the current dialogue-round language record are updated (that is, the current language is added to the records).
  • step S225 the multilingual recognition confidence set is output to the command analysis and execution module 150 as the language recognition result.
  • The current dialogue round refers to one cycle of continuously listening to (receiving) the user's input voice, for example, the period from one turn-on to the next turn-off of the language recognition system or voice interaction system.
  • When there is no recognition confidence greater than the threshold α ("No" in step S221), the language confidence correction module 160 calls the audio feature-based adjustment module 162 or the video feature-based adjustment module 163 to correct the multilingual recognition confidence set.
  • In this embodiment, the audio feature-based adjustment module 162 is called first to perform correction processing, and when there is no recognition confidence greater than the threshold α in the multilingual recognition confidence set corrected by the audio feature-based adjustment module 162, the video feature-based adjustment module 163 is then called to perform correction processing.
  • the video feature-based adjustment module 163 may also be called first.
  • step S231 the user identity is determined through the voiceprint, and in step S232, the user's historical language records are queried, and if there are historical language records, the distribution of each language in the historical language records is calculated.
  • For example, when the historical language records are {Chinese: 8; English: 1; Korean: 0; Japanese: 1; German: 0}, the distribution of each language (which can also be regarded as the records normalized into weight form) is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}.
  • the time range of historical language records can be set freely, such as the current dialogue round, a few days, a few months or longer.
  • In addition, both the all-time historical language records and the historical language records of the current dialogue round (abbreviated as the current dialogue-round language records) can be pre-stored, and the following processing can be executed separately according to each of them. In this case, the result obtained from the current dialogue-round language records can be given priority, considering that its credibility is relatively higher.
  • different weight values may be assigned to the two to perform the calculation in step S236 described below.
  • In step S236, the distribution of each language in the historical language records is used to perform correction calculation on the multilingual recognition confidence set, obtaining the corrected multilingual recognition confidence set (corresponding to the second recognition confidences in this application).
  • For example, when the multilingual initial confidence set is {Chinese: 0.7; English: 0.1; Korean: 0.1; Japanese: 0.05; German: 0.05}, the preset confidence weights are {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25}, and the language distribution is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}, the calculated corrected recognition confidence set (after normalization) is {Chinese: 0.973; English: 0.017; Korean: 0.000; Japanese: 0.010; German: 0.000}.
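  • The figures in this example are consistent, up to rounding of the last digit, with an element-wise product of initial confidence, preset weight, and language distribution followed by normalization; this reading is an assumption, sketched below.

        def correct_confidences(initial, weights, distribution):
            raw = {l: initial[l] * weights[l] * distribution[l] for l in initial}
            total = sum(raw.values())
            return {l: (v / total if total else 0.0) for l, v in raw.items()}

        initial = {"Chinese": 0.7, "English": 0.1, "Korean": 0.1, "Japanese": 0.05, "German": 0.05}
        weights = {l: 0.25 for l in initial}
        dist = {"Chinese": 0.8, "English": 0.1, "Korean": 0.0, "Japanese": 0.1, "German": 0.0}
        print({l: round(v, 3) for l, v in correct_confidences(initial, weights, dist).items()})
        # -> {'Chinese': 0.974, 'English': 0.017, 'Korean': 0.0, 'Japanese': 0.009, 'German': 0.0}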
  • In step S237, it is judged whether there is a corrected confidence greater than the threshold α in the corrected multilingual recognition confidence set. If there is ("Yes" in step S237), on the one hand, in step S239 the corrected multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result, and the historical language record and the current dialogue-round language record are updated (see steps S222 and S223); on the other hand, in step S238, the confidence weights are adjusted. Specifically, the adjustment is performed according to the corrected recognition confidence set, and the confidence weights of the languages whose corrected recognition confidence is greater than the threshold α are increased relative to the confidence weights of the other languages.
  • For example, the old confidence weight set {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25} is revised to the new confidence weight set (also called the correction weight set) {Chinese: 0.29; English: 0.24; Korean: 0.24; Japanese: 0.24; German: 0.24}.
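  • This example update is consistent with shifting a fixed delta away from each non-winning language onto the winning one; the delta value 0.01 is an assumption read off the example figures.

        def adjust_weights(weights, winner, delta=0.01):
            adjusted = {lang: (w + delta * (len(weights) - 1)) if lang == winner else (w - delta)
                        for lang, w in weights.items()}
            return {lang: round(w, 6) for lang, w in adjusted.items()}  # tame float noise

        old = {l: 0.25 for l in ("Chinese", "English", "Korean", "Japanese", "German")}
        print(adjust_weights(old, "Chinese"))
        # -> {'Chinese': 0.29, 'English': 0.24, 'Korean': 0.24, 'Japanese': 0.24, 'German': 0.24}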
  • In step S271, it is judged whether each correction weight in the correction weight set is within the weight range. If all are, the preset weight set is updated with the correction weights and used by the language identification module 120 for subsequent language identification, and the processing ends; otherwise, the preset weight set is not updated, and the processing ends.
  • When the judgment result in step S235 is "No" or the judgment result in step S237 is "No", the language confidence correction module 160 calls the comprehensive adjustment module 164 for processing.
  • In step S251, according to the output of the speech recognition module 110, it is judged whether there is an ASR confidence greater than the threshold α; if there is, in step S252 the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result.
  • At this time, the command parsing and execution module 150 can determine the language corresponding to the ASR confidence greater than the threshold α as the language of the input speech (which can be called the recognized language), and determine the candidate command corresponding to that language as the target command to be executed.
  • When the judgment result in step S251 is "No", it is judged, according to the output of the semantic understanding module 140, whether there is an NLU confidence greater than the threshold α.
  • the comprehensive adjustment module 164 outputs the multilingual recognition confidence set as the language recognition result to the command parsing and execution module 150 .
  • the command analysis and execution module 150 determines the language corresponding to the NLU confidence greater than the threshold ⁇ as the language of the input speech, and determines the candidate command corresponding to the language as the target command to be executed.
  • step S256 it is judged according to the voiceprint (user identity) whether there is a language specified by the user.
  • the user-specified language here is the type of system language of the voice interaction system 100 set by the user.
  • If there is, the multilingual recognition confidence set is corrected according to the user-specified language, so that the recognition confidence of the user-specified language is increased relative to the recognition confidences of the other languages, thereby obtaining the corrected multilingual recognition confidences (corresponding to the second recognition confidences in this application).
  • There may be multiple user-specified languages (for example, when multiple languages exist in the system language history records stored in the database). For example, referring to the right part of the figure, the old recognition confidence set {Chinese: 0.75; English: 0.12; Korean: 0.11; Japanese: 0.01; German: 0.01} is revised to the new recognition confidence set {Chinese: 0.95; English: 0.32; Korean: 0.11; Japanese: 0.01; German: 0.21}.
  • The user-specified language is one example of a user operation record; the languages of the songs the user has historically played can also be used.
  • the multilingual preset weight set may be updated to be used for language recognition of the subsequent input speech.
  • the update method is the same as that described with reference to FIG. 14 , and will not be repeated here.
  • In step S259, it is judged whether there is a recognition confidence greater than the threshold α in the multilingual recognition confidence set corrected in step S258; if there is, in step S261 the corrected multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result.
  • In step S262, the multilingual preset confidence weight sets are adjusted. The adjustment method is the same as the method explained above and is not repeated here. In this way, the multilingual preset weight set is updated.
  • In step S264, it is determined based on the user's identity whether there is a historical language record of the user. When there is, in step S256 the user's current input voice is compared with the input voices in the historical language record to obtain the feature similarity, and the language of the historical input voice closest to the current input voice is found according to the feature similarity.
  • When the judgment result in step S264 is "No", that is, when there is no historical language record of the user, the comprehensive adjustment module 164 directly outputs the multilingual recognition confidences as the language recognition result to the command analysis and execution module 150.
  • At this time, the command parsing and execution module 150 may consider that the language of the input voice cannot be recognized and may, for example, feed this back to the user by playing a voice prompt.
  • As above, the multilingual recognition confidence set is adjusted according to user characteristics including the historical language records or the user-specified language, so that the voice interaction system 100 can improve the prediction accuracy for the input voice and increase the user's confidence in the intelligence of the voice interaction system 100.
  • The weight range has been mentioned above: when the real-time scene adaptation module 161 initializes the preset weight set, it judges whether the initialized preset weights are within the weight range, and when the audio feature-based adjustment module 162, the video feature-based adjustment module 163, or the comprehensive adjustment module 164 intends to update the preset weights, it also judges whether the preset weights are within the weight range. It can be seen that the "weight range" reflects the robustness range of the model itself.
  • This embodiment also provides a method for setting the "weight range". This method is implemented, for example, in the testing phase before the voice interaction system 100 leaves the factory. In addition, it can also be implemented in the offline inspection phase after leaving the factory.
  • The method mainly includes the steps already described above for setting the weight range, applied to language data sets data 1, data 2, ..., data n, where n is the number of language data sets and m is the number of languages; the language data sets data 1, data 2, ..., data n correspond to the test data sets in this application.
  • this embodiment provides an implementation solution as shown in FIG. 10 , but it is not limited to this solution.
  • This scheme will be described below with reference to FIG. 10 .
  • In this way, the confidence weight set of each single language can be obtained, and thus the confidence weight range of each single language, such as the Chinese confidence weight range [c_a, c_b], the English confidence weight range [e_a, e_b], the Korean confidence weight range [h_a, h_b], the Japanese confidence weight range [r_a, r_b], and the German confidence weight range [d_a, d_b] shown in the figure.
  • In this way, the language recognition model is tested with a large number of language data sets to set the weight range of the multilingual preset weight set, that is, to specify the robustness range of the language recognition model, so that the language recognition model works within this range, thereby ensuring the reliability of the language recognition results.

Abstract

The present application relates to the technical field of intelligent vehicles, and provides a speech processing method and apparatus, and a system. The method comprises: acquiring input voice information of a user; determining, according to the input voice information, a plurality of first confidence levels corresponding to the input voice information, the plurality of first confidence levels corresponding respectively to a plurality of languages; correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user characteristic of the user; and determining the language of the input voice information according to the plurality of second confidence levels. With this speech processing method, the language of the user's input voice information is determined taking a user characteristic into consideration, so that the accuracy of language recognition can be improved and the speech recognition capability can also be improved.
PCT/CN2021/101400 2021-06-22 2021-06-22 Speech processing method, apparatus and system WO2022266825A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180001914.8A CN113597641A (zh) Speech processing method, apparatus and system
PCT/CN2021/101400 WO2022266825A1 (fr) 2021-06-22 2021-06-22 Speech processing method, apparatus and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/101400 WO2022266825A1 (fr) 2021-06-22 2021-06-22 Speech processing method, apparatus and system

Publications (1)

Publication Number Publication Date
WO2022266825A1 true WO2022266825A1 (fr) 2022-12-29

Family

ID=78242898

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101400 WO2022266825A1 (fr) 2021-06-22 2021-06-22 Speech processing method, apparatus and system

Country Status (2)

Country Link
CN (1) CN113597641A (fr)
WO (1) WO2022266825A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004004953A (ja) * 2003-07-30 2004-01-08 Matsushita Electric Ind Co Ltd Speech synthesis apparatus and speech synthesis method
CN107832286A (zh) * 2017-09-11 2018-03-23 远光软件股份有限公司 Intelligent interaction method, device and storage medium
CN109522564A (zh) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Speech translation method and apparatus
CN110085210A (zh) * 2019-03-15 2019-08-02 平安科技(深圳)有限公司 Interaction information testing method and apparatus, computer device and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104681023A (zh) * 2015-02-15 2015-06-03 联想(北京)有限公司 Information processing method and electronic device
CN108172212B (zh) * 2017-12-25 2020-09-11 横琴国际知识产权交易中心有限公司 Confidence-based speech language identification method and system
US20210365641A1 (en) * 2018-06-12 2021-11-25 Langogo Technology Co., Ltd Speech recognition and translation method and translation apparatus
CN112185348B (zh) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 Multilingual speech recognition method and apparatus, and electronic device

Also Published As

Publication number Publication date
CN113597641A (zh) 2021-11-02

Similar Documents

Publication Publication Date Title
US11676575B2 (en) On-device learning in a hybrid speech processing system
US9953648B2 (en) Electronic device and method for controlling the same
US10332513B1 (en) Voice enablement and disablement of speech processing functionality
US9305569B2 (en) Dialogue system and method for responding to multimodal input using calculated situation adaptability
US20230186912A1 (en) Speech recognition method, apparatus and device, and storage medium
JP2022549238A (ja) Training method and apparatus for a semantic understanding model, electronic device, and computer program
US8543399B2 (en) Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms
CN111028827A (zh) Interaction processing method, apparatus, device and storage medium based on emotion recognition
US10685664B1 (en) Analyzing noise levels to determine usability of microphones
US11574637B1 (en) Spoken language understanding models
KR20160132748A (ko) Electronic device and method for controlling the same
US11756551B2 (en) System and method for producing metadata of an audio signal
US20200162911A1 (en) ELECTRONIC APPARATUS AND WiFi CONNECTING METHOD THEREOF
KR20210095431A (ko) Electronic apparatus and control method therefor
CN114925163A (zh) Intelligent device and model training method for intent recognition
WO2022266825A1 (fr) Speech processing method, apparatus and system
CN115083412B (zh) Voice interaction method and related apparatus, electronic device, and storage medium
CN115132195B (zh) Voice wake-up method, apparatus, device, storage medium and program product
US11664018B2 (en) Dialogue system, dialogue processing method
KR20140035164A (ko) Method of operating a speech recognition system
CN117708305B (zh) Dialogue processing method and system for a response robot
US20100292988A1 (en) System and method for speech recognition
US20240212681A1 (en) Voice recognition device having barge-in function and method thereof
US11527247B2 (en) Computing device and method of operating the same
US11893996B1 (en) Supplemental content output

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946327

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21946327

Country of ref document: EP

Kind code of ref document: A1