WO2022266825A1 - Speech processing method and apparatus, and system - Google Patents

Speech processing method and apparatus, and system

Info

Publication number
WO2022266825A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
information
confidence
speech
confidence levels
Prior art date
Application number
PCT/CN2021/101400
Other languages
French (fr)
Chinese (zh)
Inventor
王科涛
聂为然
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202180001914.8A (published as CN113597641A)
Priority to PCT/CN2021/101400
Publication of WO2022266825A1


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/005 Language recognition
    • G10L15/08 Speech classification or search
    • G10L15/10 Speech classification or search using distance or distortion measures between unknown speech and reference templates
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/24 Speech recognition using non-acoustical features
    • G10L15/25 Speech recognition using non-acoustical features using position of the lips, movement of the lips or face analysis
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of artificial intelligence, in particular to a voice processing method, device and system.
  • the present application provides a speech processing method, device and system capable of improving speech recognition capability, so as to improve the accuracy of speech recognition.
  • the first aspect of the present application relates to a voice processing method, including the following content: acquiring a user's input voice information; determining a plurality of first confidence levels corresponding to the input voice information according to the input voice information, the multiple first confidence levels respectively corresponding to a plurality of languages; modifying the plurality of first confidence levels into a plurality of second confidence levels according to user characteristics; and determining the language of the input voice information according to the plurality of second confidence levels.
  • By modifying the multiple first confidence levels into multiple second confidence levels according to the user's characteristics and determining the language of the input voice information according to the multiple second confidence levels, that is, by determining the language of the voice information input by the user on the basis of the user characteristics, the language recognition accuracy can be improved and the speech recognition capability can be improved.
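As a rough sketch of the first-aspect flow, the fragment below is a hypothetical illustration only: the language set, the threshold value, and the history-based correction rule are assumptions for demonstration, not the patent's actual implementation.

```python
# Hypothetical sketch: compute second confidences from first confidences and
# user characteristics, then pick the language with the highest confidence.

LANGUAGES = ["Chinese", "English", "Korean", "German", "Japanese"]
FIRST_THRESHOLD = 0.7  # assumed value of the "first threshold"

def identify_language(first_confidences, user_history):
    """Modify first confidence levels by user characteristics, then pick a language."""
    # Only correct when no language is already confident enough.
    if all(c < FIRST_THRESHOLD for c in first_confidences.values()):
        second = correct_by_user_features(first_confidences, user_history)
    else:
        second = dict(first_confidences)
    # The language of the input voice is the one with the highest (second) confidence.
    return max(second, key=second.get), second

def correct_by_user_features(confidences, user_history):
    """Boost languages that dominate the user's historical language records."""
    total = sum(user_history.values()) or 1
    corrected = {
        lang: conf * (1.0 + user_history.get(lang, 0) / total)
        for lang, conf in confidences.items()
    }
    norm = sum(corrected.values()) or 1
    return {lang: conf / norm for lang, conf in corrected.items()}

first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0, "German": 0.0, "Japanese": 0.0}
history = {"Chinese": 8, "English": 2}  # past recognized languages for this user
language, second = identify_language(first, history)
```

With the toy history above, the Chinese confidence is boosted relative to English, so the ambiguous first confidences resolve to Chinese.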
  • modifying the multiple first confidence levels into multiple second confidence levels according to user characteristics may specifically include: when the multiple first confidence levels are all smaller than the first threshold, modifying the multiple first confidence levels into multiple second confidence levels according to the user characteristics.
  • When the plurality of first confidence levels are all smaller than the first threshold, it is difficult to determine the language of the input voice information from the first confidence levels alone. In this case, the language of the input voice information can be determined according to the second confidence levels, which improves language recognition accuracy and speech recognition capability.
  • User characteristics may include one or more of historical language records and user-specified languages.
  • the first recognition confidence level is corrected according to the user's historical language records and/or the user-designated language, and the language of the input voice is determined on this basis, thereby improving the language recognition ability.
  • the historical language record of the user refers to the record of the language to which the voice input by the user belongs before the above-mentioned input voice is input.
  • the user-specified language refers to the type of system language set by the user. There may be only one user-specified language, or there may be multiple user-specified languages (that is, there are multiple system languages set by the user).
  • the historical language records and the user-specified language are obtained by querying the voiceprint features of the input voice information.
  • Using the voiceprint of the input voice information to query the historical language records or the user-specified language can, compared with querying based on face information, iris features, etc., avoid misidentifying the user (that is, identifying a non-speaker as the speaker) and the language misidentification that would result. In addition, the voiceprint can be obtained directly from the input voice information, whereas querying based on face information, iris features, etc. additionally requires acquiring user images; the voiceprint-based query therefore needs less equipment and is faster to process.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights.
  • the voice processing method may further include the following content: updating multiple preset weights according to multiple second confidence levels.
  • updating the multiple preset weights according to the multiple second confidence levels specifically includes: when there is a second confidence level greater than the first threshold among the multiple second confidence levels, updating the multiple preset weights according to the multiple second confidence levels.
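One possible reading of the relationship between initial confidences, preset weights, and the update condition is sketched below; the multiplicative weighting and the specific update rule are assumptions for illustration, as the text does not specify them.

```python
# Illustrative sketch: first confidences as weighted initial confidences, with
# the preset weights updated only when a second confidence exceeds the first
# threshold (an ambiguous result leaves the weights unchanged).

FIRST_THRESHOLD = 0.7  # assumed

def first_confidences(initial, weights):
    """First confidence level per language = initial confidence * preset weight."""
    return {lang: initial[lang] * weights[lang] for lang in initial}

def maybe_update_weights(weights, second, lr=0.1):
    """Nudge weights toward the second confidences only when some second confidence is decisive."""
    if not any(c > FIRST_THRESHOLD for c in second.values()):
        return weights  # ambiguous result: keep weights unchanged
    return {lang: (1 - lr) * weights[lang] + lr * second[lang] for lang in weights}

initial = {"Chinese": 0.8, "English": 0.6}
weights = {"Chinese": 1.0, "English": 1.0}
first = first_confidences(initial, weights)   # {"Chinese": 0.8, "English": 0.6}
second = {"Chinese": 0.9, "English": 0.1}     # after user-feature correction
weights = maybe_update_weights(weights, second)
```

Because the corrected Chinese confidence exceeds the threshold, the weights are nudged toward Chinese, which can improve recognition in subsequent processing cycles.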
  • the method further includes: determining the semantics of the input voice information according to the input voice information and the language of the input voice information.
  • multiple languages are preset.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the voice processing method further includes: setting the multiple preset weights according to scene features before acquiring the user's input voice information.
  • Setting the preset weights according to the scene characteristics allows the method to adapt to different scenes; the language recognition result is obtained with the preset weights best suited to the scene, which improves the language recognition and speech recognition capabilities.
  • the scene feature includes an environment feature and/or an audio collector feature.
  • the environmental feature includes one or more of environmental signal-to-noise ratio, power supply DC and AC information, or environmental vibration amplitude.
  • the audio collector feature includes microphone arrangement information.
  • Environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence levels. Adjusting the preset weights according to this information and performing language recognition on that basis therefore improves the language recognition capability.
  • setting multiple preset weights according to scene characteristics specifically includes: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining second voice data according to the first voice data and the scene features; determining second language information of the second voice data according to the second voice data; and setting the multiple preset weights according to the first language information and the second language information.
  • determining the second language information of the second voice data according to the second voice data specifically includes: acquiring multiple test weight groups, any one of which includes multiple test weights; and determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups. Setting multiple preset weights according to the first language information and the second language information specifically includes: determining multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and setting the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
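The weight-setting procedure in the two paragraphs above amounts to a small search over candidate weight groups. The sketch below is an assumed illustration: `recognize` is a stand-in for the real recognizer, and the per-sample score dictionaries are toy data.

```python
# Sketch of setting preset weights from test weight groups: for each candidate
# weight group, recognize the language of each second-voice-data sample,
# compare with the recorded first language information, and keep the weight
# group with the highest accuracy.

def recognize(sample, weights):
    """Placeholder recognizer: weighted argmax over per-language scores."""
    scores = {lang: sample[lang] * weights[lang] for lang in sample}
    return max(scores, key=scores.get)

def set_preset_weights(second_voice_data, first_language_info, test_weight_groups):
    best_weights, best_accuracy = None, -1.0
    for weights in test_weight_groups:
        predictions = [recognize(s, weights) for s in second_voice_data]
        accuracy = sum(p == t for p, t in zip(predictions, first_language_info)) / len(predictions)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights

# Toy data: per-sample raw scores per language, and the recorded true languages.
samples = [{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.4, "English": 0.6}]
truth = ["Chinese", "Chinese"]
groups = [{"Chinese": 1.0, "English": 1.0}, {"Chinese": 1.6, "English": 1.0}]
best = set_preset_weights(samples, truth, groups)
```

On this toy data the second weight group recognizes both samples correctly, so it is selected as the preset weights.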
  • setting multiple preset weights specifically includes: setting multiple preset weights within a weight range.
  • updating the multiple preset weights specifically includes: updating the multiple preset weights within a weight range.
  • the weight range is determined as follows:
  • acquiring multiple test voice data groups, any one of which includes multiple test voice data; acquiring multiple test weight groups, any one of which includes multiple test weights; and determining the weight range according to the multiple test voice data groups, the first voice information and the multiple test weight groups.
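The weight-range determination could be sketched as follows. The text does not specify how the range is derived from the test data, so the accept-then-min/max rule and the acceptance threshold below are purely assumptions.

```python
# Hypothetical sketch of determining a weight range: evaluate each test weight
# group on the test voice data, keep the groups whose accuracy clears a bar,
# and take the per-language min/max over the surviving groups as the range.

ACCURACY_BAR = 0.8  # assumed acceptance threshold

def evaluate(weight_group, test_voice_groups, language_labels):
    """Placeholder accuracy: fraction of samples whose weighted top score matches the label."""
    correct = total = 0
    for group, labels in zip(test_voice_groups, language_labels):
        for sample, label in zip(group, labels):
            scores = {lang: sample[lang] * weight_group[lang] for lang in sample}
            correct += max(scores, key=scores.get) == label
            total += 1
    return correct / total

def weight_range(test_voice_groups, language_labels, test_weight_groups):
    # Assumes at least one candidate group clears the accuracy bar.
    good = [w for w in test_weight_groups
            if evaluate(w, test_voice_groups, language_labels) >= ACCURACY_BAR]
    langs = test_weight_groups[0].keys()
    return {lang: (min(w[lang] for w in good), max(w[lang] for w in good)) for lang in langs}

# Toy data: one test voice data group of two samples with per-language scores.
voice_groups = [[{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.4, "English": 0.6}]]
labels = [["Chinese", "Chinese"]]
candidates = [{"Chinese": 1.0, "English": 1.0},
              {"Chinese": 1.6, "English": 1.0},
              {"Chinese": 2.0, "English": 1.0}]
rng = weight_range(voice_groups, labels, candidates)
```

Here the first candidate misrecognizes one sample and is rejected, so the resulting range spans only the two accepted weight groups.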
  • the second aspect of the present application provides a voice processing method, including the following content: acquiring user input voice information; determining a plurality of third confidence levels corresponding to the input voice information according to the input voice information, the multiple third confidence levels respectively corresponding to multiple languages; correcting multiple third confidence levels into multiple fourth confidence levels according to scene features; determining the language of the input voice information according to the multiple fourth confidence levels.
  • In the speech processing method, the multiple third confidence levels are modified into multiple fourth confidence levels according to the scene features, and the language of the input voice information is determined according to the multiple fourth confidence levels; that is, the language of the user's input voice information is determined on the basis of the scene features, so that the voice processing method adapts to the actual scene as far as possible, improving language recognition accuracy and speech recognition capability.
  • scene features may include environment features and/or audio collector features.
  • the environmental feature includes one or more of environmental signal-to-noise ratio, power supply DC and AC information, or environmental vibration amplitude.
  • the audio collector feature includes microphone arrangement information.
  • modifying the multiple third confidence levels into multiple fourth confidence levels according to scene characteristics includes: setting multiple preset weights according to the scene features; and modifying the plurality of third confidence levels into a plurality of fourth confidence levels according to the multiple preset weights.
  • setting multiple preset weights according to scene characteristics specifically includes: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining second voice data according to the first voice data and the scene features; determining second language information of the second voice data according to the second voice data; and setting the multiple preset weights according to the first language information and the second language information.
  • determining the second language information of the second voice data according to the second voice data specifically includes: acquiring multiple test weight groups, each including multiple test weights; and determining a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information corresponding to the multiple test weight groups. Setting multiple preset weights according to the first language information and the second language information specifically includes: determining multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and setting multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • the third aspect of the present application provides a voice processing device, including a processing module and a transceiver module. The transceiver module is used to obtain the user's input voice information; the processing module is used to determine, according to the input voice information, multiple first confidence levels corresponding to the input voice information, the multiple first confidence levels respectively corresponding to multiple languages.
  • the processing module is further configured to modify the plurality of first confidence levels into a plurality of second confidence levels according to user characteristics of the user, and determine the language of the input voice information according to the plurality of second confidence levels.
  • the processing module is specifically configured to, when the multiple first confidence levels are smaller than the first threshold, modify the multiple first confidence levels to multiple second confidence levels according to user characteristics.
  • the user features include one or more of historical language records and user-specified languages.
  • the historical language records and the user-specified language are obtained by querying the voiceprint features of the input voice information.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the processing module is further configured to update the multiple preset weights according to the multiple second confidence levels.
  • the processing module is specifically configured to update the multiple preset weights when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
  • the processing module is further configured to determine the semantics of the input voice information according to the input voice information and the language of the input voice information.
  • the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights; the processing module is further configured to set multiple preset weights according to scene features before acquiring the user's input voice information.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the processing module is specifically configured to acquire the pre-collected first voice data and the pre-recorded first language information of the first voice data, determine the second voice data according to the first voice data and the scene features, determine the second language information of the second voice data according to the second voice data, and set a plurality of preset weights according to the first language information and the second language information.
  • the processing module is specifically configured to obtain multiple test weight groups, any one of which includes multiple test weights; determine a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups; determine multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and set a plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • the processing module is specifically configured to set multiple preset weights within a weight range.
  • the processing module is specifically configured to update the multiple preset weights within the weight range.
  • the weight range is determined as follows:
  • acquiring multiple test voice data groups, any one of which includes multiple test voice data; acquiring multiple test weight groups, any one of which includes multiple test weights; and determining the weight range according to the multiple test voice data groups, the first voice information and the multiple test weight groups.
  • the speech processing device of the third aspect can obtain the same technical effect as that of the speech processing method of the first aspect, and the description will not be repeated here.
  • the fourth aspect of the present application provides a voice processing device, including a processing module and a transceiver module. The transceiver module is used to obtain the user's input voice information; the processing module is used to determine, according to the input voice information, a plurality of third confidence levels corresponding to the input voice information, the plurality of third confidence levels corresponding to multiple languages. The processing module is also used to modify the plurality of third confidence levels into a plurality of fourth confidence levels according to the scene characteristics, and to determine the language of the input voice information according to the plurality of fourth confidence levels.
  • Scene features may include environmental features and/or audio collector features.
  • the environmental characteristics may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the processing module is specifically configured to set multiple preset weights according to scene characteristics, and correct multiple third confidence levels into multiple fourth confidence levels according to the multiple preset weights.
  • the processing module is specifically configured to acquire the pre-collected first voice data and the pre-recorded first language information of the first voice data, determine the second voice data according to the first voice data and the scene characteristics, determine the second language information of the second voice data according to the second voice data, and set a plurality of preset weights according to the first language information and the second language information.
  • the processing module is specifically configured to obtain multiple test weight groups, each including multiple test weights; determine a plurality of second language information according to the second voice data and the multiple test weight groups, the plurality of second language information respectively corresponding to the multiple test weight groups; determine multiple accuracy rates of the plurality of second language information according to the first language information and the plurality of second language information; and set a plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • a fifth aspect of the present application provides a computing device, which includes a processor and a memory. The memory stores computer program instructions, and when the computer program instructions are executed by the processor, the processor performs any method described in the first aspect or the second aspect.
  • the sixth aspect of the present application provides a computer-readable storage medium, which stores computer program instructions. When executed by a computer, the computer program instructions cause the computer to execute any method described in the first aspect or the second aspect.
  • a seventh aspect of the present application provides a computer program product, which includes computer program instructions. When executed by a computer, the computer program instructions cause the computer to execute any method described in the first aspect or the second aspect.
  • the eighth aspect of the present application provides a system, which includes the speech processing device provided in any aspect from the third aspect to the fourth aspect or any possible implementation manner.
  • FIG. 1 is a schematic illustration of an application scenario example of a speech processing solution provided by an embodiment of the present application;
  • FIG. 2 is a schematic illustration of a speech processing system to which the speech processing solution provided by an embodiment of the present application is applied;
  • FIG. 3 is a flowchart of a voice processing method provided by an embodiment of the present application;
  • FIG. 4 is a flowchart of a speech processing method provided by an embodiment of the present application;
  • FIG. 5 is a schematic structural illustration of a speech processing device provided by an embodiment of the present application;
  • FIG. 6 is a flowchart schematically illustrating a language recognition method provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of a language recognition device provided by an embodiment of the present application;
  • FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of the present application;
  • FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application;
  • FIG. 10 is a schematic illustration of a method for setting the weight range;
  • FIG. 11 is a schematic illustration of a voice interaction system involved in an embodiment of the present application;
  • FIG. 12 is a flowchart illustrating part of the process of a voice interaction method involved in an embodiment;
  • FIG. 13 is a schematic illustration of a method for initializing a preset weight set provided in an embodiment of the present application;
  • FIG. 14 is a schematic illustration of part of the flow of a voice interaction process provided in an embodiment of the present application;
  • FIG. 15 is a schematic illustration of a confidence correction method provided in an embodiment of the present application;
  • FIG. 16 is a schematic illustration of another confidence correction method provided in an embodiment of the present application;
  • FIG. 17 is a schematic illustration of an electronic control unit provided in an embodiment of the present application.
  • the voice processing solution provided in the embodiments of the present application includes a voice processing method, device, and system. Since these technical solutions solve problems on the same or similar principles, repeated content may not be described again in the following specific embodiments; it should be understood that the specific embodiments may refer to each other and be combined with each other.
  • FIG. 1 illustrates a scenario in which the solution is applied to a vehicle.
  • The microphone array receives voice commands from the driver 300 and other occupants; the system executes the corresponding controls according to the voice commands (such as playing music, opening the windows, turning on the air conditioner, or navigating) and at the same time responds (provides feedback) to the voice commands, for example by presenting display information on the central control display 210 or playing voice information through a speaker (not shown) on the central control display 210.
  • Since the vehicle 200 is taken by different occupants, they may issue voice commands in different languages, and even the same occupant may issue voice commands in different languages. However, limited by its language recognition capability, the car-machine system may sometimes obtain a wrong language recognition result, fail to recognize or misrecognize the semantics of a voice command, and thus fail to respond correctly.
  • machine learning models may learn some task-independent information, such as environmental signal-to-noise ratio or audio collector (sound sensor, microphone) characteristics, which leads to errors in the model's predictions when this information changes in actual applications.
  • For example, when the vehicle 200 is a convertible car and the ambient noise is relatively large (for example, a medium-noise environment), the car-machine system may get a wrong language recognition result, and therefore cannot correctly recognize the voice command or make a correct response.
  • Similarly, if the type of microphone array used to collect the training sample data of the machine learning model differs from the microphone 212 of the car-machine system, the system may also generate wrong language recognition results and fail to correctly recognize the driver's voice commands.
  • the embodiments of the present application provide a voice processing method, device, system, etc., which can improve the voice recognition capability of a multilingual voice processing solution.
  • FIG. 2 is a schematic diagram illustrating the architecture of a speech processing system to which the speech processing solution provided by the embodiment of the present application is applied.
  • the voice processing system 180 includes a voice processing device 182 , a sound sensor (microphone) 184 , a speaker 186 , a display device 188 and the like.
  • the voice processing system 180 can be applied to smart vehicles as a car-machine system. In addition, it can also be applied to scenarios such as smart home, smart office, smart robot, smart voice question and answer, smart voice analysis, and real-time voice monitoring and analysis.
  • the sound sensor 184 is used to acquire the user's input voice, and the voice processing device 182 obtains the user's input voice information according to the sensor data of the sound sensor 184, processes the input voice information, and obtains the semantics of the input voice information. And, the voice processing device 182 performs corresponding control according to the semantics, for example, controlling the output of the speaker 186 or the display device 188 .
  • the voice processing device 182 can also be connected with other devices and mechanisms, such as the windows and the air-conditioning system, so as to be able to control them.
  • Fig. 3 is a flowchart of a speech processing method provided by an embodiment of the present application.
  • the voice processing method may be executed by a vehicle, a vehicle-mounted device, or a vehicle-mounted computer, and may also be executed by components of the vehicle or the vehicle-mounted device, such as a chip or a processor.
  • the voice processing method can also be applied to other scenarios such as smart home or smart office.
  • the speech processing method may be executed by related devices involved in these scenarios, such as a control device, a processor, and the like.
  • the speech processing method includes the following contents:
  • the input voice information of the user may be obtained from the sensor data collected by the sound sensor; the sensor data may be used directly, or information obtained after processing the sensor data may be used.
  • the time length of the input voice information is not particularly limited, and may correspond to a paragraph or a sentence of the user.
  • the content spoken by the user may be segmented to form a plurality of input speech information, and the processing of S2-S4 described later is respectively performed on the plurality of input speech information.
  • the multiple first confidence levels correspond to multiple languages.
  • multiple languages may be preset.
  • the confidence level of a language refers to the probability that the input voice information belongs to that language. For example, when the multiple first confidence levels obtained are {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0}, it means that the probability that the language of the input voice information is Chinese is 0.6, the probability that it is English is 0.4, and the probability that it is Korean, German, or Japanese is 0.
  • different languages may belong to different language families (for example, Chinese and English are different languages), or may be different variants under the same family (for example, Mandarin and Cantonese within Chinese are also treated as different languages).
  • the user features here are, for example, historical language records or user-specified languages.
  • the historical language record is the recognized language of the user's input voice information that was recognized and recorded before the current processing cycle.
  • the recognized language here means the language of the input voice information determined by recognizing the input voice information.
  • the user-specified language refers to the type of system language set by the user, for example, according to his or her frequently used language.
  • the first confidence levels are modified according to the user characteristics, and the language of the input voice information is determined according to the modified second confidence levels, so that the language of the input voice information can be determined more accurately and the speech recognition capability can be improved.
  • as an example of the specific correction method, take correction based on historical language records: assuming that Chinese appears more often in the historical language records, the confidence of Chinese among the multiple first confidence levels obtained in this processing cycle is increased, so as to obtain the second confidence levels. For example, according to the historical language records, the above-mentioned multiple first confidence levels {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0} are corrected to {Chinese: 0.8; English: 0.2; Korean: 0; German: 0; Japanese: 0}.
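The history-based correction described above can be sketched as follows. This is an illustrative sketch only: the function names, the boost factor, and the renormalization step are assumptions, not taken from the embodiment.

```python
# Sketch: raise each language's first confidence by its share of the user's
# historical language records, then renormalize so the values still sum to 1.
def correct_confidences(first_confidences, history, boost=1.0):
    if not history:
        return dict(first_confidences)
    freq = {lang: history.count(lang) / len(history) for lang in first_confidences}
    boosted = {lang: c * (1.0 + boost * freq[lang])
               for lang, c in first_confidences.items()}
    total = sum(boosted.values())
    return {lang: v / total for lang, v in boosted.items()}

first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0, "German": 0.0, "Japanese": 0.0}
# A history dominated by Chinese pulls the corrected confidences toward Chinese.
second = correct_confidences(first, history=["Chinese"] * 8 + ["English"] * 2)
```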
  • when the multiple first confidence levels are all smaller than the first threshold, the multiple first confidence levels may be corrected to multiple second confidence levels according to user characteristics.
  • when the plurality of first confidence levels are smaller than the first threshold, it is difficult to determine the language of the input voice information from the first confidence levels alone.
  • in this case, the language of the input voice information can be determined according to the second confidence levels, which can improve language recognition accuracy and speech recognition capability.
  • historical language records and user-specified languages can be obtained by querying the voiceprint features of the input voice information.
  • historical language records and user-specified languages can be easily obtained.
  • the multiple first confidence levels may be determined by multiple initial confidence levels and multiple preset weights.
  • the multiple preset weights may be updated according to the multiple second confidence levels.
  • the preset weight is updated according to the processing result of the current processing cycle, so that the language recognition accuracy of the subsequent processing cycle can be improved.
  • updating may be performed when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
  • in this case, the language recognition result obtained from the plurality of second confidence levels has higher credibility, and updating the preset weights according to the plurality of second confidence levels at this time can more reliably improve the language recognition accuracy in subsequent processing cycles.
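The conditional weight update might look like the sketch below. The additive update rule and the step size are assumptions; the embodiment only requires that updating happen when some second confidence exceeds the first threshold.

```python
# Sketch: update preset weights only when a corrected (second) confidence
# exceeds the threshold, i.e. only when the recognition result is credible.
def update_preset_weights(weights, second_confidences, threshold=0.8, step=0.05):
    if not any(c > threshold for c in second_confidences.values()):
        return dict(weights)  # not credible enough; keep weights unchanged
    return {lang: w + step if second_confidences[lang] > threshold else w
            for lang, w in weights.items()}

weights = {"Chinese": 1.0, "English": 1.0}
updated = update_preset_weights(weights, {"Chinese": 0.85, "English": 0.15})
unchanged = update_preset_weights(weights, {"Chinese": 0.6, "English": 0.4})
```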
  • the semantics of the input voice information may be determined according to the input voice information and the language of the input voice information.
  • the above-mentioned multiple preset weights may be set according to scene characteristics.
  • the scene features here may include environment features and/or audio collector features, for example.
  • the environmental characteristics here may include one or more of environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude, and the audio collector characteristics may include microphone arrangement information.
  • the following method can be adopted: obtain pre-collected first voice data and pre-recorded first language information of the first voice data; determine second voice data according to the first voice data and scene characteristics; determine second language information of the second voice data according to the second voice data; set a plurality of preset weights according to the first language information and the second language information.
  • the specific manner of determining the second language information of the second voice data according to the second voice data may be: acquire multiple test weight groups, any one of which includes multiple test weights; determine multiple pieces of second language information according to the second voice data and the multiple test weight groups, the multiple pieces of second language information respectively corresponding to the multiple test weight groups; determine multiple accuracy rates of the multiple pieces of second language information according to the first language information and the multiple pieces of second language information; and set the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
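This selection step amounts to a small grid search over candidate weight groups. A minimal sketch, with data shapes and names assumed for illustration:

```python
# Sketch: pick the test weight group whose weighted predictions best match
# the pre-recorded language labels (the first language information).
def best_weight_group(conf_sets, true_languages, weight_groups):
    # conf_sets: per-utterance {language: confidence} from the second voice data
    def accuracy(weights):
        hits = sum(
            1 for confs, truth in zip(conf_sets, true_languages)
            if max(confs, key=lambda lang: confs[lang] * weights[lang]) == truth
        )
        return hits / len(true_languages)
    return max(weight_groups, key=accuracy)

conf_sets = [{"zh": 0.5, "en": 0.5}, {"zh": 0.6, "en": 0.4}]
truths = ["en", "zh"]
groups = [{"zh": 1.0, "en": 1.0}, {"zh": 0.8, "en": 1.2}]
best = best_weight_group(conf_sets, truths, groups)
```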
  • an adjustable range, that is, a weight range, can be set for the multiple preset weights, and the multiple preset weights can be set or updated within the weight range. If a preset weight exceeds the weight range, the recognition result will not be credible. Therefore, by setting an adjustable range, that is, a weight range, the accuracy of the language recognition result can be improved.
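The range check might be sketched as below; representing the weight range as per-language (low, high) bounds is an assumption about the data layout.

```python
# Sketch: a candidate weight set is usable only if every weight lies inside
# its language's adjustable (low, high) range.
def within_weight_range(weights, weight_range):
    return all(weight_range[lang][0] <= w <= weight_range[lang][1]
               for lang, w in weights.items())

weight_range = {"zh": (0.5, 1.5), "en": (0.5, 1.5)}
ok = within_weight_range({"zh": 1.2, "en": 0.9}, weight_range)
bad = within_weight_range({"zh": 1.8, "en": 0.9}, weight_range)
```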
  • the weight range can be determined in the following manner: acquire a plurality of pre-collected test voice data groups and pre-recorded first language information of the plurality of test voice data groups, any one of the plurality of test voice data groups including a plurality of test voice data; acquire multiple test weight groups, any one of which includes multiple test weights; determine the weight range according to the multiple test voice data groups, the first language information, and the multiple test weight groups.
  • Fig. 4 is a flowchart of a speech processing method provided by an embodiment of the present application. Similar to the above-mentioned embodiments, the voice processing method of this embodiment can be executed by the vehicle, vehicle-mounted device, vehicle machine, vehicle-mounted computer, etc., and can also be executed by components in the vehicle or vehicle-mounted device, such as chips or processors. In addition, part of the content in this embodiment is the same as that in the above embodiment, so the description of these content will not be repeated.
  • the speech processing method includes the following contents:
  • the third confidence levels are corrected according to the user characteristics, and the language of the input voice information is determined according to the corrected fourth confidence levels, so that the language of the input voice information can be determined more accurately and the speech recognition capability can be improved.
  • the third confidence degree may be obtained in the same manner as the first confidence degree, or may be different
  • the fourth confidence degree may be obtained in the same manner as the second confidence degree, or may be different.
  • the specific manner of performing correction according to scene characteristics may be the same as the specific manner of performing correction according to user characteristics in the foregoing embodiments, or may be different.
  • correction processing in this embodiment and the correction processing described in the above embodiments can be used in combination, that is, the language confidence is corrected according to both user characteristics and scene characteristics, so that the language of the input voice information can be determined more accurately.
  • a plurality of preset weights may be set according to scene characteristics; and the plurality of third confidence degrees are modified into a plurality of fourth confidence degrees according to the plurality of preset weights.
  • the method of setting multiple preset weights may specifically be: acquiring pre-collected first voice data and pre-recorded first language information of the first voice data; determining according to the first voice data and scene characteristics second voice data; determining second language information of the second voice data according to the second voice data; setting a plurality of preset weights according to the first language information and the second language information.
  • multiple test weight groups are obtained, each including multiple test weights; multiple pieces of second language information are determined according to the second voice data and the multiple test weight groups, the multiple pieces of second language information respectively corresponding to the multiple test weight groups; multiple accuracy rates of the multiple pieces of second language information are determined according to the first language information and the multiple pieces of second language information; and the multiple preset weights are set according to the test weight group corresponding to the second language information with the highest accuracy rate.
  • FIG. 5 is an explanatory diagram of a schematic structure of a speech processing device provided by an embodiment of the present application.
  • the voice processing device 190 is used to execute the voice processing method in the embodiment described with reference to FIG. 3 or the voice processing method in the embodiment described with reference to FIG. 4 , and its structure can be known from the above description, so it is only briefly described here.
  • the voice processing device 190 includes a processing module 192 and a transceiver module 194 .
  • the processing module 192 may be used to execute the content in S2-S4 or S7-S9 above, and the transceiver module 194 may be used to execute the content in S1 or S6 above.
  • the speech processing device 190 may be composed of hardware, may also be composed of software, or may be composed of a combination of software and hardware. Using the speech processing apparatus 190 of this embodiment, the same technical effect as that of the speech processing method described above can be obtained, so repeated description of the technical effect is omitted here.
  • a language recognition method provided by an embodiment of the present application is described below with reference to FIG. 6 .
  • Fig. 6 is a flow chart for schematically illustrating a language recognition method provided by an embodiment of the present application.
  • the language recognition method can be executed by a vehicle, a vehicle-mounted device, a vehicle machine, a vehicle-mounted computer, a chip, a processor, and the like.
  • the input voice information of the user is acquired.
  • the user's input voice data received by the microphone is acquired as the input voice information, or the input voice data of the microphone is preprocessed to obtain the input voice information.
  • the input speech is recognized to obtain a multilingual first recognition confidence set, and multiple first recognition confidence levels in the first recognition confidence set correspond to multiple languages respectively.
  • the multilingual first recognition confidence set is ⁇ Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0 ⁇ . That is, the probability that the language of the input voice information is Chinese is 0.9, the probability that it is English is 0.1, and the probability that it is Korean, German, or Japanese is 0.
  • step S14 it is judged whether there is a first recognition confidence value greater than a threshold in the multilingual first recognition confidence set.
  • the threshold here can be set to 0.8, for example.
  • the recognition result is generated and output according to the first recognition confidence level set.
  • the recognition result here may be a result indicating the recognized language (for example, Chinese), or may be the first recognition confidence set itself.
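Steps S14 and S16 can be sketched as below. The threshold value 0.8 is taken from the example above; returning `None` to mean "proceed to the correction in S18" is an assumption of this sketch.

```python
# Sketch of S14/S16: output the recognized language if any first recognition
# confidence exceeds the threshold; otherwise fall through to correction (S18).
def threshold_result(conf_set, threshold=0.8):
    lang, conf = max(conf_set.items(), key=lambda kv: kv[1])
    if conf > threshold:
        return lang   # S16: generate and output the recognition result
    return None       # no confident language; continue to S18

r1 = threshold_result({"Chinese": 0.9, "English": 0.1})
r2 = threshold_result({"Chinese": 0.6, "English": 0.4})
```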
  • this S14 may be omitted, and step S18 described later may be directly performed.
  • in step S18, the first recognition confidence set is corrected and calculated according to the user's user characteristics to obtain a second recognition confidence set.
  • Examples of user characteristics include the user's historical language records, user-specified language, and the like.
  • the historical language record refers to the record of the recognition language of the speech input by the user before the above-mentioned input speech is input.
  • the user-specified language refers to the type of system language set by the user (such as the system language of the voice interaction system, the system language of the mobile phone operating system when applied to a mobile phone, etc.). In addition, the user may specify one or more languages (that is, the user has set multiple system languages).
  • Historical language records can be obtained by querying the voiceprint of the input voice in the database, and can also be obtained by querying the user's face information, iris information, etc. That is, the user's identity can be determined based on voiceprint, face information, iris information, etc., and thus the user's historical language records can be obtained from the database.
  • querying historical language records and user-specified languages according to the voiceprint of the input voice can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
  • the voiceprint can be obtained from the input voice information itself, while querying based on face information, iris information, etc. also requires obtaining an image of the user. Therefore, querying based on the voiceprint requires less equipment and is faster.
  • obtaining the user-specified language through the voiceprint is based on collecting the user's voiceprint when the user sets the system language, and associating the voiceprint (or user identity) with the system language set by the user and storing them in the user-specified language database.
  • the database mentioned here can be stored locally or on a credible platform.
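The voiceprint-keyed lookup can be sketched as follows. The dictionaries here are hypothetical stand-ins for the local or trusted-platform databases mentioned above.

```python
# Sketch: both databases are keyed by voiceprint (i.e. user identity), so the
# history and the user-specified language come from one query of the same key.
def query_user_features(voiceprint, history_db, specified_db):
    return {
        "historical_languages": history_db.get(voiceprint, []),
        "specified_language": specified_db.get(voiceprint),
    }

history_db = {"vp-001": ["Chinese", "Chinese", "English"]}
specified_db = {"vp-001": "Chinese"}
features = query_user_features("vp-001", history_db, specified_db)
unknown = query_user_features("vp-999", history_db, specified_db)
```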
  • a language recognition result is generated according to the second recognition confidence set.
  • the second recognition confidence set may be directly output as the recognition result, or, when there is a second recognition confidence greater than a threshold in the second recognition confidence set, output the second recognition confidence set or information indicating the recognized language, When there is no second recognition confidence greater than the threshold in the second recognition confidence set, the first recognition confidence set is output as the recognition result.
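The output selection just described might look like the following sketch, which returns the first set unchanged when no corrected confidence clears the threshold:

```python
# Sketch: prefer the corrected (second) set when it is credible; otherwise
# fall back to outputting the first recognition confidence set itself.
def choose_recognition_result(first_set, second_set, threshold=0.8):
    if any(c > threshold for c in second_set.values()):
        return max(second_set, key=second_set.get)  # recognized language
    return first_set

first = {"Chinese": 0.6, "English": 0.4}
result = choose_recognition_result(first, {"Chinese": 0.85, "English": 0.15})
fallback = choose_recognition_result(first, {"Chinese": 0.7, "English": 0.3})
```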
  • the first recognition confidence is calculated according to the user characteristics to obtain the second recognition confidence, and the language recognition result is determined according to the second recognition confidence. In this way, the language recognition ability can be improved.
  • the language recognition method in this embodiment further includes: when the multiple second recognition confidences in the second recognition confidence set are all smaller than the threshold, generating the language recognition result according to the automatic speech recognition confidence obtained by performing automatic speech recognition on the input speech, or according to the natural language understanding (NLU) confidence obtained by performing natural language understanding on the input speech.
  • the language of the input speech is determined according to the automatic speech recognition confidence or the natural language understanding confidence, thereby improving the language recognition ability.
  • the language whose automatic speech recognition confidence exceeds a threshold is used as the recognized language of the input speech.
  • the first recognition confidence set can be obtained in the following manner: the input speech is recognized to obtain an initial confidence set; the multiple initial confidences in the initial confidence set are respectively multiplied by the multiple preset weights in the preset weight set to obtain the first recognition confidence set.
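The element-wise product can be sketched directly; the sketch performs only the multiplication the text specifies, with no renormalization.

```python
# Sketch: multiply each initial confidence by the preset weight of its language
# to form the first recognition confidence set.
def first_recognition_confidences(initial_confidences, preset_weights):
    return {lang: c * preset_weights[lang]
            for lang, c in initial_confidences.items()}

initial = {"Chinese": 0.5, "English": 0.5}
weights = {"Chinese": 1.2, "English": 0.8}
first_set = first_recognition_confidences(initial, weights)
```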
  • the preset weight set can be updated so that the preset weights of the recognized languages whose second recognition confidence is greater than the threshold among the multiple languages are increased relative to the preset weights of the other languages.
  • the preset weight set is updated as above, so that when subsequent input speech is processed, the updated preset weight set is used, which can improve the accuracy of language recognition and the language recognition capability.
  • the specific method of updating the preset weight set can be: perform correction calculation on the preset weight set to obtain a corrected weight set; when the multiple correction weights in the corrected weight set are within the weight range, update the preset weight set with the values of the multiple correction weights.
  • when a preset weight exceeds the weight range, the reliability of the language recognition result obtained according to that preset weight is relatively low. Therefore, using the above method to correct the preset weights within the weight range can suppress the language misrecognition rate.
  • a preset weight set may be preset according to scene characteristics. Therefore, it is possible to adapt to different scenarios, and obtain the language recognition result with the preset weights that are most suitable for the scenario, thereby improving the language recognition ability.
  • Scene features may include environmental features and/or audio picker features.
  • the environmental characteristics may include environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude
  • the audio collector characteristics may include microphone arrangement information.
  • the microphone arrangement information refers to whether it is a single microphone or a microphone array, or if it is a microphone array, whether it is a linear array, a planar array, or a stereo array.
  • the environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
  • the specific method of setting the preset weight set can be: obtain multiple test weight sets; input the pseudo-environment data set into the language recognition model, the pseudo-environment data set being obtained according to the scene characteristics and a noise-free data set; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the pseudo-environment data set; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set.
  • when the set preset weights are within the weight range, the setting is enabled, and when they are not within the weight range, the setting is canceled. Alternatively, when the set preset weights are not within the weight range, the setting remains valid, but other methods are preferred for obtaining the language recognition result, such as determining the recognized language of the input voice according to the user-specified language, or comparing this input voice with the input voices in the historical language record to obtain a feature similarity; if the feature similarity is greater than a similarity threshold, the language of the input voice in the historical language record is determined as the recognized language of this input voice.
  • the weight range can be set in the following way: obtain multiple test data sets; obtain multiple test weight sets; input the test data sets into the language recognition model; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model and the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the test data sets; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; obtain the optimal test weight sets of the multiple test data sets; and obtain the weight ranges of the multiple languages according to the optimal test weight sets of the multiple test data sets.
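Deriving per-language weight ranges from the optimal test weight sets of multiple test data sets might look like the sketch below; taking the minimum and maximum across data sets is an assumption about how the range is formed.

```python
# Sketch: given one optimal {language: weight} set per test data set,
# the weight range per language spans the min and max across data sets.
def derive_weight_ranges(optimal_weight_sets):
    languages = optimal_weight_sets[0].keys()
    return {
        lang: (min(ws[lang] for ws in optimal_weight_sets),
               max(ws[lang] for ws in optimal_weight_sets))
        for lang in languages
    }

optima = [{"zh": 0.9, "en": 1.1}, {"zh": 1.2, "en": 0.8}, {"zh": 1.0, "en": 1.0}]
ranges = derive_weight_ranges(optima)
```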
  • test data set may be a pre-collected speech data set, whose language information is known.
  • FIG. 7 is a schematic structural diagram of a language recognition device provided by an embodiment of the present application.
  • an embodiment of the present application provides a language recognition device, which is used to implement the language recognition method shown in Figure 6; its structure can be known from the above description of the language recognition method of Figure 6, so only a relatively brief description of the language recognition device 10 is given here.
  • the language recognition device 10 includes: an input speech acquisition module 17, configured to acquire the user's input speech; a language recognition module 12, configured to recognize the input speech to obtain a first recognition confidence set, the multiple first recognition confidences in the first recognition confidence set respectively corresponding to multiple languages; a language confidence correction module 16, configured to correct and calculate the first recognition confidence set according to the user's user characteristics to obtain a second recognition confidence set; and a recognition result generation module 18, configured to generate a language recognition result according to the second recognition confidence set.
  • the first recognition confidence is calculated according to the user characteristics to obtain the second recognition confidence, and the language recognition result is determined according to the second recognition confidence. In this way, the language recognition ability can be improved.
  • the language confidence correction module 16 may correct and calculate the first set of recognition confidences according to user characteristics to obtain the second set of recognition confidences when the multiple first recognition confidences are less than the threshold.
  • the user characteristics include historical language records.
  • the historical language record of the user refers to the record of the language to which the voice input by the user belongs before inputting the above-mentioned input voice.
  • the historical language record is obtained by querying the voiceprint of the input voice.
  • querying the historical language record according to the voiceprint of the input voice, compared with querying based on face information, iris information, etc., can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
  • the user characteristics include a user-specified language.
  • the first recognition confidence is modified according to the language specified by the user, and the language of the input speech is determined on this basis, thereby improving the language recognition ability.
  • the language specified by the user is obtained through querying the voiceprint of the input voice.
  • the user-specified language can be queried according to the voiceprint of the input voice, which, compared with querying based on face information, iris information, etc., can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
  • the recognition result generation module is also used for: when the multiple second recognition confidences in the second recognition confidence set are all smaller than the threshold, generating the language recognition result according to the automatic speech recognition (Automatic Speech Recognition, ASR) confidence obtained by performing automatic speech recognition on the input speech.
  • the language of the input speech is determined according to the automatic speech recognition confidence, thereby improving the language recognition ability.
  • the language whose automatic speech recognition confidence exceeds a threshold is used as the recognized language of the input speech.
  • the recognition result generation module is also used for: when multiple second recognition confidences in the second recognition confidence set are less than the threshold, generate language recognition according to the natural language understanding confidence obtained by performing natural language understanding on the input speech result.
  • the language of the input speech is determined according to the natural language understanding confidence, thereby improving the language recognition capability.
  • the language whose natural language understanding confidence exceeds a threshold is used as the recognized language of the input speech.
  • the language identification module is also used to: recognize the input speech to obtain an initial confidence set; multiply the multiple initial confidences in the initial confidence set by the multiple preset weights in the preset weight set respectively to obtain the first recognition confidence set; the language confidence correction module is also used to update the preset weight set when there is a second recognition confidence greater than the threshold in the second recognition confidence set, so that the preset weights of the recognized languages whose second recognition confidence is greater than the threshold are increased relative to the preset weights of the other languages.
  • the preset weight set is updated in the above manner, so that when subsequent input speech is processed, the updated preset weight set is used, which can improve the accuracy of language recognition and the language recognition capability.
  • the language confidence correction module is also used to: perform correction calculation on the preset weight set to obtain a correction weight set; when the multiple correction weights in the correction weight set are within the weight range, update the preset weight set with the values of the multiple correction weights.
  • when a preset weight exceeds the weight range, the reliability of the language recognition result obtained according to that preset weight is relatively low. Therefore, using the above method to correct the preset weights within the weight range can suppress the language misrecognition rate.
  • the language identification module is also used to: recognize the input speech to obtain an initial confidence set; multiply the multiple initial confidences in the initial confidence set by the multiple preset weights in the preset weight set respectively to obtain the first recognition confidence set; the language confidence correction module is also used to set the preset weight set according to scene features.
  • the preset weight set is set according to the scene characteristics, so that different scenes can be adapted, and the language recognition result can be obtained with the preset weights that are most suitable for the scene, thereby improving the language recognition ability.
  • the scene features include environment features and/or audio collector features.
  • the environmental characteristics include environmental signal-to-noise ratio, power supply direct current and alternating current information, or environmental vibration amplitude
  • the audio collector characteristics include microphone arrangement information.
  • the microphone arrangement information refers to whether it is a single microphone or a microphone array, or if it is a microphone array, whether it is a linear array, a planar array, or a stereo array.
  • the environmental signal-to-noise ratio, power supply DC and AC information, environmental vibration amplitude, and microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
  • the language confidence correction module is also used to: obtain multiple test weight sets; input the pseudo-environment data set into the language recognition model, the pseudo-environment data set being obtained according to the scene characteristics and a noise-free data set; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the pseudo-environment data set; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set. Therefore, it can be said that the language confidence correction module has a preset weight setting module.
  • the language confidence correction module is further configured to set multiple preset weights within the weight range.
  • the weight range is set as follows: obtain multiple test data sets; obtain multiple test weight sets; input the test data sets into the language recognition model; obtain multiple first recognition confidence sets under the multiple test weight sets according to the initial confidence set output by the language recognition model and the multiple test weight sets; calculate the prediction accuracy of the multiple first recognition confidence sets according to the language information of the test data sets; determine the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy among the multiple test weight sets as the optimal test weight set; obtain the optimal test weight sets of the multiple test data sets; and obtain the weight ranges of the multiple languages according to the optimal test weight sets of the multiple test data sets.
  • the function of setting the weight range can be realized by the language recognition device 10, in which case it can be said that the language recognition device 10 has a weight range setting module; alternatively, it can be realized by a test device for testing the language recognition device 10.
  • An embodiment of the present application provides a computing device, which includes a processor and a memory, the memory stores program instructions, and when the program instructions are executed by the processor, the processor executes the speech processing method and the language recognition method.
  • the computing device can be understood more from the following description in conjunction with FIG. 17 .
  • An embodiment of the present application provides a computer-readable storage medium, which stores program instructions, and is characterized in that, when the program instructions are executed by a computer, the computer executes the above speech processing method and language recognition method.
  • An embodiment of the present application provides a computer program. When the computer program is executed by a computer, the computer executes the above speech processing method and language recognition method.
  • FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of the present application. Part of the steps in the speech interaction method are the same as the above-mentioned language recognition method, and here, the same content is marked with the same reference numerals, and the description thereof is simplified.
  • In step S10, the input voice information of the user is acquired.
  • Specifically, the input voice information of the user received by the microphone is obtained.
  • On the one hand, in step S40, automatic speech recognition is performed on the input speech using a speech recognition model; on the other hand, in step S12, language recognition is performed on the input speech using a language recognition model.
  • Automatic speech recognition and language recognition may also be performed sequentially.
  • In step S40, in order to be able to recognize input speech in multiple languages, speech recognition models of a plurality of different languages (in this embodiment, five languages: Chinese, English, Korean, German and Japanese) are used to perform speech content recognition processing on the input speech, obtaining multiple texts Ti in different languages.
  • In step S42, the multiple texts Ti are input into the text translation model, which performs translation processing on these texts Ti and converts them into texts Ai of the target language (such as Chinese).
  • In step S44, the multiple texts Ai are sequentially input into the semantic understanding model, which performs semantic understanding processing on these texts Ai to obtain multiple corresponding candidate commands Oi.
  • "Candidate" means a command that has not yet been confirmed for execution.
  • In step S12, language recognition is performed on the input speech to obtain a multilingual first recognition confidence set, in which multiple first recognition confidence levels correspond to multiple languages respectively.
  • For example, the multilingual first recognition confidence set is {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}.
  • In step S14, it is judged whether there is a first recognition confidence level greater than a threshold in the multilingual first recognition confidence set.
  • The threshold here can be set to 0.8, for example.
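The threshold test of step S14 could be sketched as follows; the function name, dictionary representation, and example values are illustrative assumptions, not taken from the patent.

```python
def pick_language(confidences, threshold=0.8):
    """Return the language whose first recognition confidence exceeds
    the threshold, or None if no language qualifies."""
    best_lang = max(confidences, key=confidences.get)
    if confidences[best_lang] > threshold:
        return best_lang
    return None

conf_set = {"Chinese": 0.9, "English": 0.1, "Korean": 0.0,
            "German": 0.0, "Japanese": 0.0}
print(pick_language(conf_set))  # Chinese
```

When no confidence exceeds the threshold, the method falls through to the correction of step S18.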
  • In step S26, the candidate command corresponding to the recognized language (such as Chinese) is selected from the multiple candidate commands Oi obtained in step S44 as the target command to be executed, and the target command is then caused to be executed. For example, when the target command is "turn on the air conditioner", corresponding control is executed to turn on the air conditioner.
  • When the judgment result in step S14 is "No", that is, when there is no first recognition confidence level greater than the threshold (all first recognition confidence levels are smaller than the threshold), in step S18 the first recognition confidence levels are corrected according to the user characteristics.
  • The specific content of the correction has been described in detail above, so it will not be repeated here.
  • In step S22, it is judged whether there is a second recognition confidence level greater than the threshold. When the judgment result is "Yes", in step S24 the language whose second recognition confidence level is greater than the threshold is determined as the recognized language of the input speech; thereafter, the processing in step S26 is performed.
  • In step S28, it is judged whether there is an ASR confidence level greater than the threshold.
  • When the judgment result is "Yes", in step S30 the language whose ASR confidence level is greater than the threshold is determined as the recognized language, and then the processing in step S26 is performed.
  • When the judgment result in step S28 is "No", that is, when there is no ASR confidence level greater than the threshold (all ASR confidence levels are smaller than the threshold), in step S32 it is judged whether there is an NLU confidence level greater than the threshold.
  • When the judgment result is "Yes", in step S34 the language whose NLU confidence level is greater than the threshold is determined as the recognized language, and then the processing in step S26 is executed.
  • When the judgment result in step S32 is "No", information indicating that the voice content recognition has failed may be output, and the processing ends.
  • As described above, the first recognition confidence levels are corrected according to the user characteristics to obtain the second recognition confidence levels, and the language recognition result is determined according to the second recognition confidence levels, so that the language recognition capability can be improved, thereby improving the voice interaction capability.
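The overall decision flow of FIG. 8 (steps S14, S22, S28 and S32) amounts to a fallback cascade over several confidence sets. The sketch below is illustrative only; flat dictionaries stand in for the confidence sets, and the function name is an assumption.

```python
def resolve_language(confidence_sets, threshold=0.8):
    """Scan the confidence sets in fallback order (first recognition
    confidence, corrected second confidence, ASR confidence, NLU
    confidence); return the first language whose confidence exceeds
    the threshold, or None if recognition fails entirely."""
    for conf_set in confidence_sets:
        lang = max(conf_set, key=conf_set.get)
        if conf_set[lang] > threshold:
            return lang
    return None

first = {"Chinese": 0.6, "English": 0.3}    # no value above the threshold
second = {"Chinese": 0.85, "English": 0.1}  # corrected set succeeds
print(resolve_language([first, second]))    # Chinese
```

In the patent's method the later sets are only computed when the earlier judgments fail; evaluating them lazily would be a straightforward refinement of this sketch.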
  • FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application.
  • The voice interaction system (or voice interaction device) 20 has a voice recognition module 110, a language recognition module 12, a text translation module 130, a semantic understanding module 140, an input voice acquisition module 17, a language confidence correction module 16 and a control module 170.
  • the voice interaction device is used to execute the voice interaction method described with reference to FIG. 8 , therefore, the description of the specific processing flow is omitted here.
  • Like the above-mentioned language recognition device 10, the voice interaction system 20 has a language recognition module 12, an input speech acquisition module 17 and a language confidence correction module 16; these are marked with the same reference numerals and their description is omitted.
  • the voice interaction system may further include an execution device, such as a loudspeaker, a display device, and the like.
  • the speech recognition module 110 executes step S40 in FIG. 8 .
  • the language identification module 12 executes step S12 in FIG. 8 .
  • the text translation module 130 executes step S42 in FIG. 8 .
  • the semantic understanding module 140 executes step S44 in FIG. 8 .
  • the input speech acquisition module 17 executes step S10 in FIG. 8 .
  • the language confidence correction module 16 executes step S18 in FIG. 8 .
  • the control module 170 executes step S14 , step S16 , step S22 , step S24 , step S28 , step S30 , step S32 , and step S34 in FIG. 8 .
  • Step S14 , Step S16 , Step S22 , and Step S24 may also be executed by the language confidence correction module 16 .
  • The speech interaction method described with reference to FIG. 8 essentially includes a multilingual speech recognition method capable of recognizing input speech in multiple languages; correspondingly, the voice interaction system includes a speech recognition device for implementing the multilingual speech recognition method. Because there is much repeated content, no separate embodiments are given here to describe the speech recognition method and speech recognition device.
  • the voice interaction system 100 and the voice interaction method executed by it according to an embodiment of the present application will be described below with reference to FIGS. 11-17 .
  • Here, the voice interaction system 100 applied in a car to form a vehicle voice interaction system is taken as an example; the voice interaction system can also be applied to scenarios such as voice question and answer, intelligent voice analysis, and real-time voice monitoring and analysis.
  • the vehicle voice interaction system also constitutes a vehicle control device.
  • embodiments of the present application provide a voice processing method, device, and system.
  • the voice interaction system 100 of this embodiment can receive the input voice of the user (that is, the speaker), and perform corresponding processing in response to the content of the input voice, such as turning on the air conditioner, opening the car window, and other processing.
  • the voice interaction system 100 can respond to voices in multiple different languages. For example, in this embodiment, it can respond to voices in five languages: Chinese, English, Korean, German, and Japanese.
  • Voices of different languages include not only voices of different language families (for example, Chinese and English belong to different languages), but also voices of different minor languages under the same language family (for example, Mandarin and Cantonese of Chinese also belong to different languages).
  • FIG. 11 is a schematic illustration of a voice interaction system according to an embodiment of the present application.
  • the voice interaction system 100 has a voice recognition module 110, a language recognition module 120, a text translation module 130, a semantic understanding module 140, a command analysis and execution module 150, and a language confidence correction module 160.
  • the voice interaction system 100 can also have a microphone , speaker, camera or display etc.
  • Fig. 12 is a flowchart for illustrating one procedure of the voice interaction method involved in an embodiment. A processing flow of the voice interaction system 100 is described below with reference to FIG. 12 , so as to roughly illustrate the architecture of the voice interaction system 100 .
  • The voice interaction system 100 acquires the voice (called the input voice) through the microphone.
  • On the one hand, (1) the input voice is input into the voice recognition module 110, which calls the speech recognition sub-modules of a plurality of different languages (in the present embodiment, the five languages Chinese, English, Korean, German and Japanese; self-evidently, other numbers of languages are also possible) to perform speech content recognition on the input voice, so that multiple texts Ti in different languages are obtained.
  • (2) The multiple texts Ti are input into the text translation module 130, which performs translation processing on these texts Ti and converts them into texts Ai of the target language (e.g. Chinese).
  • (3) The multiple texts Ai are sequentially input into the semantic understanding module 140, which performs semantic understanding processing on these texts Ai to obtain multiple corresponding candidate commands Oi.
  • On the other hand, the language recognition module 120 performs language recognition processing on the input voice, generates initial confidence levels of multiple languages, and multiplies each initial confidence level by the corresponding preset weight to obtain the recognition confidence levels of the multiple languages.
  • The command analysis and execution module 150 determines, from the multiple candidate commands Oi, the candidate command Oi corresponding to the language whose recognition confidence is greater than the threshold α as the target command to be executed, and performs corresponding processing according to the content of the target command.
  • In addition, the language confidence correction module 160 corrects the recognition confidence levels of the multiple languages according to user characteristics; the specific content will be described in detail later.
  • The speech recognition module 110, the language recognition module 120, the text translation module 130 and the semantic understanding module 140 respectively include an algorithm model, namely a speech recognition model, a language recognition model, a text translation model and a semantic understanding model, which respectively perform speech recognition processing, language recognition processing, text translation processing and semantic understanding processing.
  • the speech recognition module 110 is used to convert human speech, that is, speech to be recognized, into text in a corresponding language, which can also be said to predict speech content or perform automatic speech recognition (Automatic Speech Recognition, ASR).
  • the speech recognition module 110 has a plurality of speech recognition sub-modules, and each speech recognition sub-module corresponds to a language respectively, and is used to convert the speech into the text Ti of the corresponding language.
  • these sub-modules output the text Ti as the prediction result and the confidence of the text Ti.
  • This confidence is called the ASR confidence.
  • The ASR confidence represents the prediction probability of the text predicted by the sub-module, that is, the predicted probability of the speech content.
  • the text translation module 130 is used to convert text in one natural language (source language) into text in another natural language (target language), for example, convert English text into Chinese text.
  • the text translation module 130 has a plurality of text translation sub-modules, and each text translation sub-module corresponds to a language respectively.
  • The text translation sub-modules of the four non-target languages are respectively used to translate English text, Korean text, German text and Japanese text into Chinese texts Ai.
  • the text translation module 130 may not process the input Chinese text and output the input Chinese text as it is.
  • the final text translation module 130 outputs 5 Chinese texts Ai.
  • The semantic understanding module 140 is used for performing natural language understanding (Natural Language Understanding, NLU) on the text of the target language; it can also be said to predict the intent of the text and generate commands that can be understood by the machine. For example, if the text is "Please play the song XX", after passing through the semantic understanding module 140 the machine obtains the intent "Please play the song XX". While generating a command, the semantic understanding module 140 also generates an NLU confidence, which represents the predicted probability of the meaning of the text by the semantic understanding module 140. In addition, since the speech recognition module 110 outputs texts in five languages, the semantic understanding module 140 eventually generates commands corresponding to the five languages and five NLU confidence levels. Furthermore, these commands output by the semantic understanding module 140 have not yet been determined to be executed, so they are called candidate commands.
  • the language identification (Language Identification, LID) module is used to identify the language of the user's input voice, that is, the voice to be recognized. It can also be said to predict which one of the multiple languages the user's input voice belongs to.
  • The language recognition module 120 recognizes which language the input speech belongs to among Chinese, English, Korean, German and Japanese, and outputs a set of recognition confidence levels of the multiple languages as the recognition result; each recognition confidence level represents the predicted probability for a language.
  • Specifically, algorithmic recognition is performed on the input speech to obtain the confidence levels of multiple languages (called the initial confidence levels), and the initial confidence levels of the multiple languages are respectively multiplied by the corresponding preset weight values to obtain the recognition confidence levels of the multiple languages; the language recognition module 120 outputs these recognition confidence levels as the prediction result.
  • the calculation of multiplying the initial confidence by the preset weight value may or may not be performed by the language recognition model.
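Whichever component performs it, the weighting calculation itself is a per-language multiplication. A minimal illustrative sketch follows; the function name, language keys, and values are assumptions, not taken from the patent.

```python
def recognition_confidences(initial_conf, preset_weights):
    """Multiply each language's initial confidence by its preset weight
    to obtain the recognition confidence set."""
    return {lang: initial_conf[lang] * preset_weights[lang]
            for lang in initial_conf}

initial = {"Chinese": 0.9, "English": 0.8}
weights = {"Chinese": 1.0, "English": 0.5}
print(recognition_confidences(initial, weights))  # {'Chinese': 0.9, 'English': 0.4}
```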
  • the command parsing and execution module 150 is used for selecting a target command to be executed from the candidate commands output by the semantic understanding module 140 according to the output of the language identification module 120 .
  • The command analysis and execution module 150 determines the language whose recognition confidence is greater than the threshold α as the language of the user's input voice, and determines the candidate command corresponding to that language as the target command to be executed.
  • For example, when the confidence levels of the multiple languages output by the language identification module 120 are {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}, Chinese is determined as the language of the user's input voice, and the candidate command corresponding to Chinese is determined as the target command to be executed.
  • The command parsing and execution module 150 executes the control for causing the target command to be executed; for example, when the determined target command to be executed is "please play the song XX" and the voice interaction system 100 includes a music player module, the command analysis and execution module 150 controls the music player module to play the song XX.
  • When the music player module does not belong to the voice interaction system 100 and is not controlled by the command analysis and execution module 150, the determined target command to be executed can be sent to the upper controller of the voice interaction system 100 and the music player module, and the upper controller sends commands to the controller of the music playing module to realize playing the song XX.
  • In addition, the command analysis and execution module 150 can respond to the user through a speaker or display; for example, when the determined target command to be executed is "please play the song XX", the command analysis and execution module 150 controls the speaker to emit the sound "OK, I will play it for you soon" in response to the user.
  • In some cases, the command analysis and execution module 150 uses other methods to determine the target command to be executed; these methods are illustrated below.
  • Mode 1: the language confidence correction module 160 performs correction calculation on the recognition confidence levels of the above-mentioned multiple languages (corresponding to the "first recognition confidence levels" in this application), and the command analysis and execution module 150 performs corresponding processing according to the output of the language confidence correction module 160. For example, the correction can be made according to user characteristics, where the user characteristics include the user's historical language records and user-specified languages. The user identity can be determined from audio features (i.e., the voiceprint), and the historical language records and user-specified languages can then be obtained by querying the historical language record database and user-specified language database of the voice interaction system 100 according to the user identity. The specific content of these corrections will be described in detail later.
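One simple form such a user-characteristic correction could take is boosting the confidence of any language found in the user's historical language records or explicitly specified by the user. The boost factor, the cap at 1.0, and all names below are illustrative assumptions of this sketch, not specified by the patent.

```python
def correct_confidences(first_conf, history_langs=(), specified_lang=None,
                        boost=1.2):
    """Return second recognition confidence levels: languages found in
    the user's historical language records or specified by the user get
    their confidence boosted (capped at 1.0); the rest are unchanged."""
    corrected = {}
    for lang, conf in first_conf.items():
        if lang == specified_lang or lang in history_langs:
            corrected[lang] = min(1.0, conf * boost)
        else:
            corrected[lang] = conf
    return corrected

first = {"Chinese": 0.7, "English": 0.2}
# Chinese appears in the user's history, so its confidence rises above 0.8
print(correct_confidences(first, history_langs=("Chinese",)))
```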
  • The command analysis and execution module 150 determines the language of the input speech according to the corrected recognition confidence levels, and performs corresponding processing. For example, when there is a recognition confidence greater than the threshold α among the corrected recognition confidence levels of the multiple languages, the language whose recognition confidence is greater than the threshold α is determined as the language of the user's input voice, and the candidate command corresponding to the determined language is determined as the target command to be executed.
  • When correcting, the value of the recognition confidence can be corrected directly, or the preset weights can be corrected and the recognition confidence levels then recalculated according to the initial confidence set and the corrected preset weight set.
  • the language confidence correction module 160 has an audio feature-based adjustment module 162 , a video feature-based adjustment module 163 and a comprehensive adjustment module 164 , and these adjustment modules are used to correct the language confidence in different ways.
  • Mode 2: the command parsing and execution module 150 determines the target command to be executed according to the ASR confidence levels output by the speech recognition module 110 or the NLU confidence levels output by the semantic understanding module 140. For example, when there is an ASR confidence greater than the ASR confidence threshold (which can be set to the same value as the above-mentioned threshold α, such as 0.8), the language corresponding to that ASR confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
  • Similarly, when there is an NLU confidence greater than the NLU confidence threshold, the language corresponding to that NLU confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
  • The execution timing of Mode 2 can be set freely.
  • Optionally, it can be executed before Mode 1, or between the multiple approaches listed in the description of Mode 1.
  • Mode 3: the command parsing and execution module 150 determines the language of the input speech by feature similarity. For example, the current input voice is compared with the audio data of historical input voices in the history record, and the feature similarity between the two is obtained through cosine similarity, linear regression or deep learning. When the feature similarity exceeds a threshold, the recognized language of the historical input voice can be determined as the language of the current input voice.
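A minimal sketch of the cosine-similarity option mentioned above, assuming the input voices have already been reduced to numeric feature vectors; the vectors and the 0.9 threshold are illustrative assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two audio feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

current = [0.9, 0.1, 0.3]     # features of the current input voice
historical = [0.8, 0.2, 0.4]  # features of a historical input voice
if cosine_similarity(current, historical) > 0.9:  # hypothetical threshold
    print("reuse the recognized language of the historical input voice")
```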
  • The execution timing of Mode 3 can be set freely. Optionally, it can be executed after or before Mode 1 and Mode 2, between Mode 1 and Mode 2, or between the multiple approaches exemplified in the description of Mode 1.
  • the confidence correction module will be described below.
  • the language confidence correction module 160 includes a real-time scene adaptation module 161 , an audio feature-based adjustment module 162 , a video feature-based adjustment module 163 and a comprehensive adjustment module 164 .
  • the real-time scene adaptation module 161 is used to initialize the multilingual preset weight set according to the environmental characteristics and the characteristics of the audio collector (ie, the microphone) when the language recognition model initially contacts the scene.
  • the initial contact scene here is, for example, when the user has just purchased a voice interaction system or a vehicle. At this time, the user generally turns on the voice interaction system to perform some basic settings or tests.
  • the real-time scene adaptation module 161 can use this opportunity to initialize the preset weight set.
  • The initialization of the preset weight set is not limited to being performed when initially contacting the scene; it can also be performed at other appropriate times, for example when a new audio collector is installed, or at a time chosen by the user.
  • the video feature-based adjustment module 163 is configured to modify the recognition confidence sets of multiple languages according to the captured user images.
  • the comprehensive adjustment module 164 is mainly used to modify the recognition confidence sets of multiple languages according to the language specified by the user.
  • the user-specified language is obtained by querying the database of the voice interaction system 100 according to the voiceprint of the input voice.
  • When there is a recognition confidence greater than the threshold α among the corrected recognition confidence levels of the multiple languages, the confidence correction module performs correction calculation on the preset weight sets of the multiple languages so that the preset weight of the language whose recognition confidence is greater than the threshold α is increased relative to the preset weights of the other languages; in this way, a corrected weight set is obtained. Afterwards, the confidence correction module judges whether each corrected weight in the corrected weight set is within the weight range, and when the judgment result is within the weight range, the values in the corrected weight set are used to update the preset weight set for the language recognition module 120 to use in subsequent language recognition.
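The correction-and-range-check described above could be sketched as follows; the boost factor and the renormalization step are assumptions of this sketch, not specified by the patent.

```python
def update_preset_weights(weights, recognized_lang, weight_ranges,
                          factor=1.1):
    """Raise the recognized language's weight relative to the others,
    renormalize (an assumption of this sketch), and accept the update
    only if every corrected weight stays inside its weight range."""
    corrected = dict(weights)
    corrected[recognized_lang] *= factor
    total = sum(corrected.values())
    corrected = {lang: w / total for lang, w in corrected.items()}
    in_range = all(weight_ranges[lang][0] <= w <= weight_ranges[lang][1]
                   for lang, w in corrected.items())
    return corrected if in_range else weights

ranges = {"zh": (0.4, 0.6), "en": (0.4, 0.6)}
updated = update_preset_weights({"zh": 0.5, "en": 0.5}, "zh", ranges)
print(updated["zh"] > updated["en"])  # True
```

If any corrected weight falls outside its range, the update is rejected and the original preset weights are kept.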
  • The functions of the speech recognition module 110, the language recognition module 120, the text translation module 130, the semantic understanding module 140, the command analysis and execution module 150 and the confidence correction module can be implemented by a processor executing a program (software) stored in a memory, or by hardware such as an LSI (Large Scale Integration) circuit or an ASIC (Application Specific Integrated Circuit).
  • these modules can be formed by an electronic control unit (ECU).
  • one module can be formed by one ECU, or multiple ECUs, or one ECU can be used to form multiple modules.
  • ECU refers to a control device composed of integrated circuits used to implement a series of functions such as data analysis, processing and transmission.
  • An embodiment of the present application provides an electronic control unit (ECU); the ECU includes a microcomputer, an input circuit, an output circuit and an analog-to-digital (A/D) converter.
  • the main function of the input circuit is to preprocess the input signal (such as the signal from the sensor), and the processing method is different for different input signals.
  • the input circuit may include an input circuit that processes analog signals and an input circuit that processes digital signals.
  • the main function of the A/D converter is to convert the analog signal into a digital signal. After the analog signal is preprocessed by the corresponding input circuit, it is input to the A/D converter for processing and converted into a digital signal accepted by the microcomputer.
  • the output circuit is a device that establishes a connection between the microcomputer and the actuator. Its function is to convert the processing results sent by the microcomputer into control signals to drive the actuators to work.
  • the output circuit generally uses a power transistor, which controls the electronic circuit of the actuator by turning on or off according to the instructions of the microcomputer.
  • The microcomputer includes a central processing unit (CPU), a memory and an input/output (I/O) interface; the CPU is connected with the memory and the I/O interface through a bus, and they can exchange information with each other through the bus.
  • the memory may be a memory such as a read-only memory (ROM) or a random access memory (RAM).
  • The I/O interface is a connection circuit for exchanging information between the CPU and the input circuit, output circuit or A/D converter; specifically, the I/O interface can be divided into a bus interface and a communication interface.
  • the memory stores programs, and the CPU calls the programs in the memory to realize the functions of the above modules, or execute the methods described with reference to Fig. 3 , Fig. 4 , Fig. 6 , Fig. 8 , and Fig. 12 .
  • the voice interaction system 100 also has a microphone, a speaker, a camera or a display.
  • the microphone is used to acquire the user's input voice, which corresponds to the voice acquisition module in this application.
  • the speaker is used to play sounds, such as the response tone "OK" to the user's input voice.
  • the camera is used to collect the user's facial image, etc., and send the collected image to the command analysis and execution module 150.
  • the command analysis and execution module 150 can perform image recognition on the image, so as to authenticate the user's identity.
  • the display is used to respond according to the user's input voice, for example, when the input voice is "play the song "XX", the display will display the playing screen of the song.
  • the voice interaction system 100 will be described in more detail below in conjunction with the description of the actions and processing flow of the voice interaction system 100 .
  • The voice interaction method involved in this embodiment will be described at the same time; it can also be seen from the following description that the voice interaction method includes a language recognition method (corresponding to the language recognition module 120, part of the processing of the command analysis and execution module 150, the processing of the confidence correction module, etc.).
  • the language recognition module 120 uses the language recognition model to perform language recognition.
  • the multilingual preset weight set is initialized according to the environment feature and the audio collector feature. An example of an initialization method will be described below with reference to FIG. 13 .
  • the real-time scene adaptation module 161 generates a quasi-environment dataset according to environmental features, audio collector features and expert datasets.
  • the environmental characteristics include, for example, the environmental signal-to-noise ratio, microphone power source information (DC-AC information), or environmental vibration amplitude, and the like.
  • the information on the power source of the microphone can be obtained, for example, through a controller area network (Controller Area Network, CAN) signal of the vehicle.
  • the characteristics of the audio collector mainly include microphone arrangement information (single microphone or microphone array, wherein the microphone array includes linear array, planar array and stereo array).
  • the expert data set is a batch of multi-person, multilingual, and noise-free audio data sets collected in advance, and its content (the language of each piece of voice data) is pre-recorded and known.
  • N different multilingual confidence weight sets, confidence weight set 1 to confidence weight set N, are obtained; for example, confidence weight set 1 {Chinese: 0.80; English: 0.04; Korean: 0.06; Japanese: 0.05; German: 0.05}, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18} and confidence weight set N {Chinese: 0.31; English: 0.09; Korean: 0.12; Japanese: 0.25; German: 0.23}.
  • The quasi-environment data set is input into the language recognition model to obtain the multilingual initial confidence set ({Chinese: p1; English: p2; Korean: p3; Japanese: p4; German: p5} in FIG. 13), and the initial confidence set is multiplied by each of the N multilingual confidence weight sets 1 to N to obtain N recognition confidence sets. Since the content of the expert data set (the language of each piece of speech data) is known, the accuracy rate acc of each of the N recognition confidence sets can be calculated; the confidence weight set corresponding to the recognition confidence set with the highest accuracy is determined as the optimal confidence weight set, and the preset weight set is set with the values of the optimal confidence weight set to complete the initialization of the preset weight set.
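The selection of the optimal confidence weight set by accuracy acc could be sketched as follows; all names and data below are illustrative assumptions, not taken from the patent.

```python
def best_weight_set(initial_conf_sets, true_langs, candidate_weight_sets):
    """Score each candidate confidence weight set by its prediction
    accuracy on the expert data set and return the best-scoring one."""
    def accuracy(weights):
        correct = 0
        for conf, truth in zip(initial_conf_sets, true_langs):
            scored = {lang: conf[lang] * weights[lang] for lang in conf}
            if max(scored, key=scored.get) == truth:
                correct += 1
        return correct / len(true_langs)
    return max(candidate_weight_sets, key=accuracy)

# Two hypothetical utterances with known languages, and two candidates:
initial_conf_sets = [{"zh": 0.5, "en": 0.5}, {"zh": 0.4, "en": 0.6}]
true_langs = ["zh", "zh"]
candidates = [{"zh": 1.0, "en": 1.0}, {"zh": 2.0, "en": 1.0}]
print(best_weight_set(initial_conf_sets, true_langs, candidates))
```

The winning weight set is then written into the preset weight set, completing the initialization.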
  • In this way, the speech interaction system 100 can be adjusted according to different scenarios and perform language recognition with the best possible recognition accuracy, thereby improving the reliability of the recognition result. That is, adopting the above technical means can suppress the problem that "the trained language recognition model does not adapt well to the scene, resulting in low reliability of the recognition result".
  • the preset weight set is initialized according to both the environment feature and the audio collector feature.
  • the preset weight set may be initialized only according to one of the environment feature and the audio collector feature.
  • in step S212 of FIG. 14, it is determined whether the preset weight of each language in the set preset weight set is within the weight range.
  • the weight range is preset, and its specific value can be determined through testing, which will be described later.
  • when the preset weights are within the weight range, high reliability of the recognition result of the language recognition model in this environment (the above-mentioned environment characteristics and audio collector characteristics) can be guaranteed.
  • when the confidence weight set obtained from the simulated environment data set is not within the weight range ("No" in step S212), the reliability of the result of the language recognition model in this environment is low.
  • in step S214, it is judged whether there is a historical language record; if there is, the input voice in the historical record is compared with the user's current input voice to obtain a feature similarity, thereby determining the language of the current input voice.
  • in step S217, an inquiry is made based on the voiceprint to determine whether the user has specified a language; if there is a user-specified language, the recognition language of the user's input voice is determined according to the user-specified language.
  • when there is one user-specified language, it is determined as the recognition language of the input voice; when there are multiple user-specified languages, for example, the one that appears most frequently is determined as the recognition language of the input voice.
  • for example, if the user-specified languages inquired are {Chinese: 3 times; English: 1 time; German: 1 time}, Chinese is determined as the recognition language of the input voice.
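The "most frequently specified language" rule above can be sketched in one line (the helper name and the list-of-records input format are assumptions for illustration):

```python
from collections import Counter

def most_specified_language(records):
    """Pick the most frequently appearing language from the queried
    user-specified-language records (hypothetical helper)."""
    return Counter(records).most_common(1)[0][0]
```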
  • in step S219, the language of the input speech is determined according to the recognition result of the language recognition model.
  • step S214 is executed before step S217; however, there is no limitation on the execution order of the processing of determining the language through the historical language record and the processing of determining the language through the user-specified language.
  • in this way, the language of the user's input voice is predicted according to the historical language records or the user-specified language, thereby improving the reliability with which the voice interaction system 100 predicts the language of the input voice.
  • in step S212, when the weight value of each language in the preset weight set set according to the simulated environment data set is within the weight range ("Yes" in step S212), then, when the user's input voice is detected, language recognition is performed on the input speech with the language recognition model in step S200. In step S221, it is judged whether the multilingual recognition confidence set obtained from the language recognition model contains a recognition confidence greater than the threshold. If it does ("Yes" in step S221), the user identity is determined through the voiceprint in step S222. As another embodiment, the identity of the user may also be determined by face recognition or iris recognition.
  • in step S223, the user's historical language record and the current dialogue round language record are updated (that is, the current language is added to the records).
  • in step S225, the multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result.
  • the current dialogue round refers to one cycle of continuously listening to (receiving) the user's input voice, for example, the period from one turn-on to turn-off of the language recognition system or the voice interaction system.
  • the language confidence correction module 160 calls the audio feature-based adjustment module 162 or the video feature-based adjustment module 163 to correct the multilingual recognition confidence set.
  • first, the audio feature-based adjustment module 162 is called to perform correction; when the multilingual recognition confidence set corrected by the audio feature-based adjustment module 162 still contains no recognition confidence greater than the threshold, the video feature-based adjustment module 163 is then called to perform correction.
  • the video feature-based adjustment module 163 may also be called first.
  • in step S231, the user identity is determined through the voiceprint; in step S232, the user's historical language records are queried, and if historical language records exist, the distribution of each language in the historical language records is calculated.
  • for example, if the historical language records are {Chinese: 8; English: 1; Korean: 0; Japanese: 1; German: 0}, the distribution of each language (which can also be regarded as a normalization into weight form) is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}.
  • the time range of historical language records can be set freely, such as the current dialogue round, a few days, a few months or longer.
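The distribution calculated in step S232 can be sketched as a simple normalization of the raw counts from the historical language record (the function name is illustrative):

```python
def language_distribution(history_counts):
    """Normalize raw counts from the historical language record into
    a distribution (weight-style values summing to 1)."""
    total = sum(history_counts.values())
    return {lang: count / total for lang, count in history_counts.items()}
```

With the example counts above, this reproduces the distribution {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}.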
  • historical language records of all time nodes and historical language records of the current dialogue round (abbreviated as current dialogue round language records) can be stored in advance, and the following processing can be executed separately according to each of them.
  • in this case, for example, the result obtained according to the historical language record of the current dialogue round is given priority, considering that the credibility of the result obtained from the current dialogue round's language record is relatively higher.
  • different weight values may be assigned to the two to perform the calculation in step S236 described below.
  • in step S236, the distribution of each language in the historical language records is used to perform correction calculation on the multilingual recognition confidence set, obtaining the corrected multilingual recognition confidence set (corresponding to the second recognition confidence in this application).
  • for example, if the multilingual initial confidence set is {Chinese: 0.7; English: 0.1; Korean: 0.1; Japanese: 0.05; German: 0.05}, the preset confidence weights are {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25}, and the language distribution is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}, the calculated final recognition confidence set (after normalization) is {Chinese: 0.973; English: 0.017; Korean: 0.000; Japanese: 0.010; German: 0.000}.
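One plausible reading of this correction step, which reproduces the figures in the example above up to rounding, is an element-wise product of initial confidence, preset weight and historical language distribution, followed by normalization; the function name is an assumption:

```python
LANGS = ["Chinese", "English", "Korean", "Japanese", "German"]

def correct_with_history(initial, preset_weights, history_dist):
    # Element-wise product of the three per-language factors,
    # then normalize so the corrected confidences sum to 1.
    raw = {l: initial[l] * preset_weights[l] * history_dist[l] for l in LANGS}
    total = sum(raw.values())
    return {l: v / total for l, v in raw.items()}
```

With the example inputs this yields roughly {Chinese: 0.974; English: 0.017; Korean: 0.000; Japanese: 0.009; German: 0.000}, matching the text's figures to within rounding.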
  • in step S237, it is judged whether the corrected multilingual recognition confidence set contains a corrected confidence greater than the threshold. If it does ("Yes" in step S237), on the one hand, in step S239 the corrected multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result, and the historical language record and the current dialogue round language record are updated (see steps S222 and S223 in FIG. 14); on the other hand, in step S238, the confidence weights are adjusted. Specifically, based on the corrected recognition confidence set, the confidence weights of the languages whose corrected recognition confidence is greater than the threshold are increased relative to the confidence weights of the other languages.
  • for example, the old confidence weight set {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25} is amended to a new confidence weight set (also called a modified weight set) {Chinese: 0.29; English: 0.24; Korean: 0.24; Japanese: 0.24; German: 0.24}.
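One simple adjustment scheme that reproduces the example numbers (this specific redistribution rule is an assumption, not stated in the text) is to shift a small amount from each other language's confidence weight to the winning language's weight:

```python
def boost_winner_weight(weights, winner, delta=0.01):
    """Shift delta from each non-winning language's confidence weight to
    the winning language's weight (illustrative rule, delta assumed)."""
    adjusted = dict(weights)
    others = [l for l in adjusted if l != winner]
    for l in others:
        adjusted[l] -= delta
    adjusted[winner] += delta * len(others)
    return adjusted
```

With five languages at 0.25 each and Chinese as the winner, this gives {Chinese: 0.29; others: 0.24}, matching the example.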
  • in step S271, it is judged whether each modified weight in the modified weight set is within the weight range.
  • if each weight is within the range, the modified weight set is set to be used by the language identification module 120 for subsequent language identification, and the process ends.
  • otherwise, the preset weight set is not updated, and the process ends.
  • when the judgment result in step S235 is "No" or the judgment result in step S237 is "No", the language confidence correction module 160 calls the comprehensive adjustment module 164 for processing.
  • in step S251, according to the output of the speech recognition module 110, it is judged whether there is an ASR confidence greater than the threshold; if there is, in step S252 the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result.
  • in this case, the command parsing and execution module 150 can determine the language corresponding to the ASR confidence greater than the threshold as the language of the input speech (which can be called the recognition language), and determine the candidate command corresponding to that language as the target command to execute.
  • when the judgment result in step S251 is "No", according to the output of the semantic understanding module 140, it is judged whether there is an NLU confidence greater than the threshold.
  • if there is, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set as the language recognition result to the command parsing and execution module 150.
  • the command parsing and execution module 150 then determines the language corresponding to the NLU confidence greater than the threshold as the language of the input speech, and determines the candidate command corresponding to that language as the target command to be executed.
  • in step S256, it is judged according to the voiceprint (user identity) whether there is a user-specified language.
  • the user-specified language here is the type of system language of the voice interaction system 100 set by the user.
  • the multilingual recognition confidence set is corrected according to the user-specified language, so that the recognition confidence of each user-specified language is increased relative to the recognition confidences of the other languages, thereby obtaining the corrected multilingual recognition confidence set (corresponding to the second recognition confidence in this application).
  • there may be multiple user-specified languages (that is, there are multiple languages in the system language history records stored in the database). For example, referring to the right part in FIG., the old recognition confidence set {Chinese: 0.75; English: 0.12; Korean: 0.11; Japanese: 0.01; German: 0.01} is revised to the new recognition confidence set {Chinese: 0.95; English: 0.32; Korean: 0.11; Japanese: 0.01; German: 0.21}.
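A hedged sketch of this correction: if the user-specified languages are taken to be Chinese, English and German, and the increment to be 0.20 (both inferred from the example numbers, not stated explicitly in the text), the revision can be written as:

```python
def apply_user_specified(confidences, specified_langs, delta=0.20):
    """Raise the recognition confidence of every user-specified language
    by a fixed increment (delta and the specified set are illustrative)."""
    return {l: c + (delta if l in specified_langs else 0.0)
            for l, c in confidences.items()}
```

This reproduces the revision of the old set {0.75, 0.12, 0.11, 0.01, 0.01} to the new set {0.95, 0.32, 0.11, 0.01, 0.21} in the example.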
  • the language specified by the user is an example of a user operation record.
  • as another example of a user operation record, the language of the songs the user has historically played can also be cited.
  • the multilingual preset weight set may be updated to be used for language recognition of the subsequent input speech.
  • the update method is the same as that described with reference to FIG. 14 , and will not be repeated here.
  • in step S259, it is judged whether the multilingual recognition confidence set corrected in step S258 contains a recognition confidence greater than the threshold; if yes, in step S261 the corrected multilingual recognition confidence set is output to the command parsing and execution module as the language recognition result.
  • in step S262, the multilingual preset confidence weight set is adjusted.
  • the adjustment method is the same as the method explained above, and will not be repeated here.
  • the multilingual preset confidence weight set is updated.
  • in step S264, it is determined based on the user's identity whether there is a historical language record of the user.
  • if the judgment result is that there is a historical language record of the user, in step S256 the user's current input voice is compared with the input voice in the historical language record to obtain a feature similarity, and the language closest to the unknown input voice is found according to the feature similarity.
  • when the judgment result in step S264 is "No", that is, there is no historical language record of the user, the comprehensive adjustment module 164 directly outputs the multilingual recognition confidence set as the language recognition result to the command parsing and execution module 150.
  • in this case, the command parsing and execution module 150 may consider that the language of the input voice cannot be recognized, and may, for example, feed this back to the user by playing a voice prompt.
  • the multilingual recognition confidence set is adjusted according to user characteristics including historical language records or user-specified languages, so that the voice interaction system 100 can improve the accuracy of predicting the language of the input voice and increase the user's confidence in the intelligence of the voice interaction system 100.
  • as mentioned above, a weight range is used: when the real-time scene adaptation module 161 initializes the preset weight set, it is judged whether the initialized preset weights are within the weight range; likewise, when the video feature-based adjustment module 163 or the comprehensive adjustment module 164 intends to update the preset weights, it is determined whether the updated preset weights are within the weight range. It can be seen that the "weight range" reflects the robustness range of the model itself.
  • This embodiment also provides a method for setting the "weight range". This method is implemented, for example, in the testing phase before the voice interaction system 100 leaves the factory. In addition, it can also be implemented in the offline inspection phase after leaving the factory.
  • the method mainly includes the following steps:
  • n is the number of language datasets
  • m is the number of languages.
  • the language data sets data_1, data_2, ..., data_n correspond to the test data sets in this application.
  • this embodiment provides an implementation solution as shown in FIG. 10 , but it is not limited to this solution.
  • This scheme will be described below with reference to FIG. 10 .
  • by doing so, the confidence weight set of each single language can be obtained, and the confidence weight range of each single language can be obtained, such as the Chinese confidence weight range [c_a, c_b], the English confidence weight range [e_a, e_b], the Korean confidence weight range [h_a, h_b], the Japanese confidence weight range [r_a, r_b], and the German confidence weight range [d_a, d_b] shown in the figure.
  • in this way, the language recognition model is tested with a large number of language data sets to set the weight range of the multilingual preset weight set, that is, to specify the robustness range of the language recognition model, so that the language recognition model works within this range, thereby ensuring the reliability of the language recognition results.
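A sketch of deriving the per-language weight ranges from the n test data sets (the function name and the min/max rule are illustrative assumptions consistent with the ranges [c_a, c_b], [e_a, e_b], etc. above):

```python
def per_language_weight_range(optimal_weight_sets):
    """Given the optimal confidence weight set found for each of the n
    language data sets, take the per-language minimum and maximum as that
    language's confidence weight range (e.g. [c_a, c_b] for Chinese)."""
    langs = optimal_weight_sets[0].keys()
    return {l: (min(ws[l] for ws in optimal_weight_sets),
                max(ws[l] for ws in optimal_weight_sets))
            for l in langs}
```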

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the technical field of intelligent vehicles. Provided are a speech processing method and apparatus, and a system. The method comprises: acquiring input speech information of a user; according to the input speech information, determining a plurality of first confidence levels corresponding to the input speech information, wherein the plurality of first confidence levels respectively correspond to a plurality of languages; correcting the plurality of first confidence levels to a plurality of second confidence levels according to a user feature of the user; and determining the language of the input speech information according to the plurality of second confidence levels. By using the speech processing method, the language of input speech information of a user is determined on the basis of taking a user feature into consideration, and therefore the language recognition accuracy can be improved, and the speech recognition capability can also be improved.

Description

Speech processing method, device and system

Technical field

The present application relates to the technical field of artificial intelligence, and in particular to a speech processing method, device and system.

Background art

With the development of computer technology, speech recognition technology has been applied more and more widely. In addition, with the deepening of globalization, scenes in which people speaking different languages work and live together often arise. For example, on an international flight, the passengers often come from different countries or regions and do not speak the same language. Or, in a country such as Singapore, English is the first foreign language of most locals; at the same time, because there are many Chinese and ethnic Chinese residents, Chinese is usually their daily communication language. Thus, both English and Chinese may appear in communication. Therefore, to cope with this situation, a technology capable of recognizing speech in different languages (for example, capable of recognizing both Chinese speech and English speech) has emerged.

However, the recognition results of this technology sometimes fluctuate, that is, its speech recognition capability is low, and there is still room for improvement in this regard.
Summary of the invention

The present application provides a speech processing method, device and system capable of improving speech recognition capability, so as to improve the accuracy of speech recognition.

A first aspect of the present application relates to a speech processing method, including: acquiring input speech information of a user; determining, according to the input speech information, a plurality of first confidence levels corresponding to the input speech information, the plurality of first confidence levels respectively corresponding to a plurality of languages; correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user feature of the user; and determining the language of the input speech information according to the plurality of second confidence levels.

With the speech processing method described above, the plurality of first confidence levels are corrected into a plurality of second confidence levels according to the user feature of the user, and the language of the input speech information is determined according to the plurality of second confidence levels; that is, the language of the user's input speech information is determined on the basis of the user feature. Therefore, the language recognition accuracy can be improved and the speech recognition capability can be improved.
As a possible implementation of the first aspect of the present application, correcting the plurality of first confidence levels into the plurality of second confidence levels according to the user feature of the user may specifically include: when the plurality of first confidence levels are all smaller than a first threshold, correcting the plurality of first confidence levels into the plurality of second confidence levels according to the user feature.

When the plurality of first confidence levels are all smaller than the first threshold, it is difficult to determine the language of the input speech information according to the first confidence levels. If, in this case, the plurality of first confidence levels are corrected into the plurality of second confidence levels according to the user feature, and the language of the input speech information is determined according to the second confidence levels, the language recognition accuracy and the speech recognition capability can be improved.

The user feature may include one or more of a historical language record and a user-specified language.

In this manner, the first recognition confidence is corrected according to the user's historical language record and/or the user-specified language, and the language of the input speech is determined on this basis, so that the language recognition capability can be improved.

Here, the user's historical language record refers to a record of the languages of the speech input by the user before the above input speech. The user-specified language here refers to the type of system language set by the user; there may be only one user-specified language, or there may be multiple user-specified languages (that is, the user has set multiple system languages).

As a possible implementation of the first aspect of the present application, the historical language record and the user-specified language are obtained by querying according to the voiceprint feature of the input speech information.

In the above manner, the historical language record or the user-specified language is queried according to the voiceprint of the input speech information. Compared with querying according to, for example, face information or iris information, this can avoid language misidentification caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker. In addition, in the above manner, the voiceprint can be obtained from the input speech information itself, whereas querying according to face information or iris information also requires obtaining an image of the user; therefore, querying according to the voiceprint requires less equipment and is processed more quickly.
As a possible implementation of the first aspect of the present application, the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights. The speech processing method may further include: updating the plurality of preset weights according to the plurality of second confidence levels.

In this way, the language recognition accuracy in subsequent processing cycles can be improved.

As a possible implementation of the first aspect of the present application, updating the plurality of preset weights according to the plurality of second confidence levels specifically includes: when there is a second confidence level greater than the first threshold among the plurality of second confidence levels, updating the plurality of preset weights according to the plurality of second confidence levels.

In this way, since the result of the current processing cycle is more reliable when there is a second confidence level greater than the first threshold among the plurality of second confidence levels, updating the plurality of preset weights according to the plurality of second confidence levels at this time can more reliably improve the language recognition accuracy in subsequent processing cycles.

As a possible implementation of the first aspect of the present application, the method further includes: determining the semantics of the input speech information according to the input speech information and the language of the input speech information.

In the above manner, the language recognition accuracy can be improved, and the semantic understanding accuracy can be improved.
As a possible implementation of the first aspect of the present application, the plurality of languages are preset.

As a possible implementation of the first aspect of the present application, the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights; the speech processing method further includes: before acquiring the input speech information of the user, setting the plurality of preset weights according to a scene feature.

In the above manner, the preset weights are set according to the scene feature, so that different scenes can be accommodated and the language recognition result can be obtained with the preset weights best suited to the scene, which improves the language recognition capability and the speech recognition capability.

As a possible implementation of the first aspect of the present application, the scene feature includes an environment feature and/or an audio collector feature.

As a possible implementation of the first aspect of the present application, the environment feature includes one or more of an environmental signal-to-noise ratio, power supply DC/AC information, or an environmental vibration amplitude, and the audio collector feature includes microphone arrangement information.

The environmental signal-to-noise ratio, power supply DC/AC information, environmental vibration amplitude and microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to such information and performing language recognition on this basis can improve the language recognition capability.
As a possible implementation of the first aspect of the present application, setting the plurality of preset weights according to the scene feature specifically includes: acquiring pre-collected first speech data and pre-recorded first language information of the first speech data; determining second speech data according to the first speech data and the scene feature; determining second language information of the second speech data according to the second speech data; and setting the plurality of preset weights according to the first language information and the second language information.

As a possible implementation of the first aspect of the present application, determining the second language information of the second speech data according to the second speech data specifically includes: acquiring a plurality of test weight groups, any one of the plurality of test weight groups including a plurality of test weights; and determining a plurality of pieces of second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups. Setting the plurality of preset weights according to the first language information and the second language information specifically includes: determining a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and setting the plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.

As a possible implementation of the first aspect of the present application, setting the plurality of preset weights specifically includes: setting the plurality of preset weights within a weight range.

As a possible implementation of the first aspect of the present application, updating the plurality of preset weights specifically includes: updating the plurality of preset weights within the weight range.

If a preset weight exceeds the weight range, the recognition result will be unreliable. Therefore, setting or updating the preset weights within the weight range can ensure the accuracy of the recognition result as much as possible.
As a possible implementation of the first aspect of the present application, the weight range is determined as follows: acquiring a plurality of pre-collected test speech data groups and pre-recorded first language information of the plurality of test speech data groups, any one of the plurality of test speech data groups including a plurality of pieces of test speech data; acquiring a plurality of test weight groups, any one of the plurality of test weight groups including a plurality of test weights; and determining the weight range according to the plurality of test speech data groups, the first language information and the plurality of test weight groups.

In the above manner, a large number of speech data groups are used for testing to set the weight range of the multilingual preset weight set, that is, the robustness range of the language recognition model is specified, so that the language recognition model works within this range, thereby ensuring the reliability of the language recognition results.
The second aspect of this application provides a speech processing method, including: acquiring input speech information of a user; determining, according to the input speech information, multiple third confidence levels corresponding to the input speech information, where the multiple third confidence levels respectively correspond to multiple languages; correcting the multiple third confidence levels into multiple fourth confidence levels according to scene features; and determining the language of the input speech information according to the multiple fourth confidence levels.
With the above speech processing method, the multiple third confidence levels are corrected into multiple fourth confidence levels according to the scene features, and the language of the input speech information is determined according to the multiple fourth confidence levels; that is, the language of the user's input speech information is determined with the scene features taken into account. The speech processing method can thus adapt to the actual scene as far as possible, improving language identification accuracy and speech recognition capability.
Here, the scene features may include environment features and/or audio collector features.
As a possible implementation of the second aspect of this application, the environment features include one or more of an environmental signal-to-noise ratio, power-supply DC/AC information, or an environmental vibration amplitude, and the audio collector features include microphone arrangement information.
As a possible implementation of the second aspect of this application, correcting the multiple third confidence levels into multiple fourth confidence levels according to the scene features specifically includes: setting multiple preset weights according to the scene features; and correcting the multiple third confidence levels into the multiple fourth confidence levels according to the multiple preset weights.
As a possible implementation of the second aspect of this application, setting the multiple preset weights according to the scene features specifically includes: acquiring pre-collected first speech data and pre-recorded first-language information of the first speech data; determining second speech data according to the first speech data and the scene features; determining second-language information of the second speech data according to the second speech data; and setting the multiple preset weights according to the first-language information and the second-language information.
As a possible implementation of the second aspect of this application, determining the second-language information of the second speech data according to the second speech data specifically includes: obtaining multiple test weight groups, where a test weight group includes multiple test weights; and determining multiple pieces of second-language information according to the second speech data and the multiple test weight groups, where the multiple pieces of second-language information respectively correspond to the multiple test weight groups. Setting the multiple preset weights according to the first-language information and the second-language information specifically includes: determining multiple accuracy rates of the multiple pieces of second-language information according to the first-language information and the multiple pieces of second-language information; and setting the multiple preset weights according to the test weight group corresponding to the second-language information with the highest accuracy rate.
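The selection of the test weight group with the highest accuracy can be sketched as follows. This is an illustrative assumption about one way to realize the step; the stub recognizer `recognize` and the data layout are hypothetical.

```python
# Hypothetical sketch: pick the preset weights from the test weight group whose
# recognized languages best match the pre-recorded first-language information.

def recognize(sample_scores, weights):
    # Stub recognizer: weight the raw per-language scores, return the best language.
    weighted = {lang: sample_scores[lang] * w for lang, w in weights.items()}
    return max(weighted, key=weighted.get)

def select_preset_weights(second_speech_data, first_language_info, weight_groups):
    """second_speech_data: list of per-language score dicts (one per utterance).
    first_language_info: list of true language labels, in the same order.
    Returns the test weight group that yields the highest accuracy."""
    best_weights, best_accuracy = None, -1.0
    for weights in weight_groups:
        second_language_info = [recognize(s, weights) for s in second_speech_data]
        accuracy = sum(
            pred == truth
            for pred, truth in zip(second_language_info, first_language_info)
        ) / len(first_language_info)
        if accuracy > best_accuracy:
            best_weights, best_accuracy = weights, accuracy
    return best_weights
```

Each candidate weight group produces one set of second-language information; the group whose predictions agree most often with the recorded first-language information becomes the preset weights.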
The specific features of the second aspect of this application may be the same as or similar to those of the first aspect, so their technical effects are essentially the same and are not described again here.
The third aspect of this application provides a speech processing apparatus, including a processing module and a transceiver module. The transceiver module is configured to acquire input speech information of a user. The processing module is configured to determine, according to the input speech information, multiple first confidence levels corresponding to the input speech information, where the multiple first confidence levels respectively correspond to multiple languages. The processing module is further configured to correct the multiple first confidence levels into multiple second confidence levels according to user features of the user, and to determine the language of the input speech information according to the multiple second confidence levels.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to, when the multiple first confidence levels are all smaller than a first threshold, correct the multiple first confidence levels into the multiple second confidence levels according to the user features.
As a possible implementation of the third aspect of this application, the user features include one or more of a historical language record and a user-specified language.
As a possible implementation of the third aspect of this application, the historical language record and the user-specified language are obtained by querying with a voiceprint feature of the input speech information.
As a possible implementation of the third aspect of this application, the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights, and the processing module is further configured to update the multiple preset weights according to the multiple second confidence levels.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to, when a second confidence level greater than the first threshold exists among the multiple second confidence levels, update the multiple preset weights according to the multiple second confidence levels.
As a possible implementation of the third aspect of this application, the processing module is further configured to determine the semantics of the input speech information according to the input speech information and the language of the input speech information.
The multiple languages may be preset.
As a possible implementation of the third aspect of this application, the multiple first confidence levels are determined by multiple initial confidence levels and multiple preset weights, and the processing module is further configured to set the multiple preset weights according to scene features before the input speech information of the user is acquired.
The scene features may include environment features and/or audio collector features. The environment features may include one or more of an environmental signal-to-noise ratio, power-supply DC/AC information, or an environmental vibration amplitude, and the audio collector features may include microphone arrangement information.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to acquire pre-collected first speech data and pre-recorded first-language information of the first speech data, determine second speech data according to the first speech data and the scene features, determine second-language information of the second speech data according to the second speech data, and set the multiple preset weights according to the first-language information and the second-language information.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to obtain multiple test weight groups, where any one of the multiple test weight groups includes multiple test weights; determine multiple pieces of second-language information according to the second speech data and the multiple test weight groups, where the multiple pieces of second-language information respectively correspond to the multiple test weight groups; determine multiple accuracy rates of the multiple pieces of second-language information according to the first-language information and the multiple pieces of second-language information; and set the multiple preset weights according to the test weight group corresponding to the second-language information with the highest accuracy rate.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to set the multiple preset weights within a weight range.
As a possible implementation of the third aspect of this application, the processing module is specifically configured to update the multiple preset weights within the weight range.
As a possible implementation of the third aspect of this application, the weight range is determined as follows:
obtaining multiple pre-collected test speech data groups and pre-recorded first-language information for the multiple test speech data groups, where any one of the multiple test speech data groups includes multiple pieces of test speech data; obtaining multiple test weight groups, where any one of the multiple test weight groups includes multiple test weights; and determining the weight range according to the multiple test speech data groups, the first-language information, and the multiple test weight groups.
The speech processing apparatus of the third aspect can achieve the same technical effects as the speech processing method of the first aspect, and the description is not repeated here.
The fourth aspect of this application provides a speech processing apparatus, including a processing module and a transceiver module. The transceiver module is configured to acquire input speech information of a user. The processing module is configured to determine, according to the input speech information, multiple third confidence levels corresponding to the input speech information, where the multiple third confidence levels respectively correspond to multiple languages. The processing module is further configured to correct the multiple third confidence levels into multiple fourth confidence levels according to scene features, and to determine the language of the input speech information according to the multiple fourth confidence levels.
The scene features may include environment features and/or audio collector features.
The environment features may include one or more of an environmental signal-to-noise ratio, power-supply DC/AC information, or an environmental vibration amplitude, and the audio collector features may include microphone arrangement information.
As a possible implementation of the fourth aspect, the processing module is specifically configured to set multiple preset weights according to the scene features, and correct the multiple third confidence levels into the multiple fourth confidence levels according to the multiple preset weights.
As a possible implementation of the fourth aspect, the processing module is specifically configured to acquire pre-collected first speech data and pre-recorded first-language information of the first speech data, determine second speech data according to the first speech data and the scene features, determine second-language information of the second speech data according to the second speech data, and set the multiple preset weights according to the first-language information and the second-language information.
As a possible implementation of the fourth aspect, the processing module is specifically configured to obtain multiple test weight groups, where a test weight group includes multiple test weights; determine multiple pieces of second-language information according to the second speech data and the multiple test weight groups, where the multiple pieces of second-language information respectively correspond to the multiple test weight groups; determine multiple accuracy rates of the multiple pieces of second-language information according to the first-language information and the multiple pieces of second-language information; and set the multiple preset weights according to the test weight group corresponding to the second-language information with the highest accuracy rate.
With the speech processing apparatus of the fourth aspect, the same technical effects as the speech processing method of the second aspect can be achieved, and the description is not repeated here.
The fifth aspect of this application provides a computing device, including a processor and a memory. The memory stores computer program instructions that, when executed by the processor, cause the processor to perform any method described in the first aspect or the second aspect.
The sixth aspect of this application provides a computer-readable storage medium storing computer program instructions that, when executed by a computer, cause the computer to perform any method described in the first aspect or the second aspect.
The seventh aspect of this application provides a computer program product, including computer program instructions that, when executed by a computer, cause the computer to perform any method described in the first aspect or the second aspect.
The eighth aspect of this application provides a system, including the speech processing apparatus provided in any one of the third to fourth aspects or any possible implementation thereof.
Description of the Drawings
FIG. 1 is a schematic illustration of an example application scenario of the speech processing solution provided by an embodiment of this application;
FIG. 2 is a schematic illustration of a speech processing system to which the speech processing solution provided by an embodiment of this application is applied;
FIG. 3 is a flowchart of a speech processing method provided by an embodiment of this application;
FIG. 4 is a flowchart of a speech processing method provided by an embodiment of this application;
FIG. 5 is a schematic structural illustration of a speech processing apparatus provided by an embodiment of this application;
FIG. 6 is a flowchart schematically illustrating a language identification method provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of a language identification apparatus provided by an embodiment of this application;
FIG. 8 is a flowchart schematically illustrating a voice interaction method provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of this application;
FIG. 10 is a schematic illustration of a method for setting the weight range;
FIG. 11 is a schematic illustration of a voice interaction system involved in an embodiment of this application;
FIG. 12 is a flowchart illustrating one of the procedures of the voice interaction method involved in an embodiment;
FIG. 13 is a schematic illustration of a method for initializing a preset weight set provided in an embodiment of this application;
FIG. 14 is a schematic illustration of part of the flow of a voice interaction process provided in an embodiment of this application;
FIG. 15 is a schematic illustration of a confidence correction manner provided in an embodiment of this application;
FIG. 16 is a schematic illustration of another confidence correction manner provided in an embodiment of this application;
FIG. 17 is a schematic illustration of an electronic control unit provided in an embodiment of this application.
It should be understood that, in the above structural diagrams, the sizes and shapes of the blocks are for reference only and should not be construed as an exclusive interpretation of the embodiments of the present invention. The relative positions and containment relationships among the blocks only schematically represent the structural associations among them, rather than limiting the physical connection manners of the embodiments of the present invention.
Detailed Description
The technical solutions provided by this application are further described below with reference to the accompanying drawings and embodiments. It should be understood that the system structures and business scenarios provided in the embodiments of this application are mainly intended to illustrate possible implementations of the technical solutions of this application, and should not be construed as the only limitation on them. A person of ordinary skill in the art will appreciate that, as system structures evolve and new business scenarios emerge, the technical solutions provided in this application remain applicable to similar technical problems.
It should be understood that the speech processing solutions provided in the embodiments of this application include a speech processing method, apparatus, and system. Since these technical solutions solve problems on the same or similar principles, some repetition may be omitted in the introduction of the following specific embodiments; these specific embodiments should be regarded as cross-referencing one another and may be combined with one another.
An example application scenario of the speech processing solution provided by an embodiment of this application is first described with reference to FIG. 1. FIG. 1 illustrates a scenario applied to a vehicle. Specifically, as shown in FIG. 1, the in-vehicle system of a vehicle 200 has a voice interaction function: it can receive voice commands from occupants such as a driver 300 through a microphone 212 (a microphone array in this example) on a central control display 210, and execute corresponding controls according to the voice commands (for example, playing music, opening a window, turning on the air conditioner, or navigating). It can also respond (give feedback) to a voice command, for example by presenting display information on the central control display 210 or by emitting voice information through a speaker (not shown) on the central control display 210.
For example, since the vehicle 200 is ridden by different occupants, they may issue voice commands in different languages, and even the same occupant may issue voice commands in different languages; the in-vehicle system therefore has a function for handling voice commands in different languages. However, limited by its language identification capability, the in-vehicle system may sometimes obtain a wrong language identification result, and consequently fail to recognize, or wrongly recognize, the semantics of a voice command and thus fail to respond correctly.
Specifically, as a language identification solution, there is a technique that uses a machine learning model for classification. However, a machine learning model may learn some task-irrelevant information, such as the environmental signal-to-noise ratio or the characteristics of the audio collector (sound sensor, microphone). As a result, when such information changes in a practical application, the prediction results of the machine learning model may become erroneous.
For example, in the scenario shown in FIG. 1, the vehicle 200 is a convertible, and the ambient noise is relatively large (for example, a medium-noise environment). For the English voice command "please play music" issued by the driver 300, the in-vehicle system may therefore obtain a wrong language identification result, fail to recognize the voice command correctly, and thus fail to respond correctly. In addition, if the microphone array type corresponding to the training sample data of the machine learning model differs from the microphone 212 of the in-vehicle system, the in-vehicle system may also produce a wrong language identification result and fail to correctly recognize the driver's voice command.
To this end, the embodiments of this application provide a speech processing method, apparatus, system, and the like, which can improve the speech recognition capability of a multilingual speech processing solution.
A system architecture to which the speech processing method, apparatus, and system provided in the embodiments of this application are applied is described below. FIG. 2 is a schematic illustration of the architecture of a speech processing system to which the speech processing solution provided by an embodiment of this application is applied. As shown in FIG. 2, the speech processing system 180 includes a speech processing apparatus 182, a sound sensor (microphone) 184, a speaker 186, a display apparatus 188, and the like.
The speech processing system 180 can be applied to an intelligent vehicle as an in-vehicle system, and can also be applied to scenarios such as smart home, smart office, intelligent robots, intelligent voice question answering, intelligent voice analysis, and real-time voice monitoring and analysis.
The sound sensor 184 is used to capture the user's input speech. The speech processing apparatus 182 obtains the user's input speech information from the sensor data of the sound sensor 184, processes the input speech information, and obtains its semantics. The speech processing apparatus 182 then performs corresponding control according to the semantics, for example controlling the output of the speaker 186 or the display apparatus 188. In addition to the speaker 186 and the display apparatus 188, the speech processing apparatus 182 may also be connected to other apparatuses and mechanisms; for example, when the speech processing system 180 is applied in an in-vehicle system, the speech processing apparatus 182 may also be connected to a window-lifting system, an air-conditioning system, and the like, so as to control the windows, the air-conditioning system, and so on.
The speech processing method provided by an embodiment of this application is described below with reference to FIG. 3.
FIG. 3 is a flowchart of a speech processing method provided by an embodiment of this application. The speech processing method may be executed by a vehicle, an in-vehicle apparatus, an in-vehicle unit, an in-vehicle computer, or the like, or by a component of the vehicle or in-vehicle apparatus such as a chip or a processor. Besides vehicles, the speech processing method can also be applied in other scenarios such as smart home or smart office, in which case it may be executed by the related devices involved in those scenarios, such as a control apparatus or a processor.
As shown in FIG. 3, the speech processing method includes the following:
S1: Acquire input speech information of a user. The user's input speech information may be obtained from sensor data collected by a sound sensor; the sensor data may be used directly, or information obtained by processing the sensor data may be used. The time length of the input speech information is not particularly limited and may correspond to a passage or a single sentence spoken by the user. Moreover, during speech processing, the content spoken by the user may be segmented into multiple pieces of input speech information, and the processing of S2-S4 described below may be performed on each piece separately.
S2: Determine, according to the input speech information, multiple first confidence levels corresponding to the input speech information, where the multiple first confidence levels respectively correspond to multiple languages. Here, the multiple languages may be preset. The confidence level of a language means the probability that the input speech information belongs to that language. For example, when the multiple first confidence levels obtained are {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0}, the probability that the language of the input speech information is Chinese is 0.6, the probability that it is English is 0.4, and the probabilities that it is Korean, German, or Japanese are 0.
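As a minimal sketch of this step, the following picks the language with the highest confidence level; the decision threshold is an assumption used for illustration.

```python
# Pick the language with the highest confidence level; if no confidence level
# reaches the threshold, report the result as inconclusive.

def pick_language(confidences, threshold=0.5):
    lang = max(confidences, key=confidences.get)
    return lang if confidences[lang] >= threshold else None

first_confidences = {"Chinese": 0.6, "English": 0.4,
                     "Korean": 0.0, "German": 0.0, "Japanese": 0.0}
print(pick_language(first_confidences))  # prints: Chinese
```

An inconclusive result (None) corresponds to the case addressed by S3, where the first confidence levels are corrected using user features.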
Different languages may belong to different language families (for example, Chinese and English are different languages), or may be different varieties within the same family (for example, Mandarin and Cantonese within Chinese are also treated as different languages).
S3: Correct the multiple first confidence levels into multiple second confidence levels according to user features of the user. The user features here are, for example, a historical language record or a user-specified language. The historical language record consists of the recognized languages of the user's input speech information identified and recorded before the current processing cycle; a recognized language means the language of the input speech information determined by recognizing that input speech information. The user-specified language refers to the type of system language set by the user, for example according to the language the user commonly speaks.
S4: Determine the language of the input speech information according to the multiple second confidence levels.
With the above speech processing method, the first confidence levels are corrected according to the user features, and the language of the input speech information is determined according to the corrected second confidence levels. The language of the input speech information can thus be determined more accurately, improving the speech recognition capability.
As to the specific correction method, take correction based on the historical language record as an example: if Chinese has a large number of entries in the historical language record, the confidence level of Chinese among the multiple first confidence levels obtained in this processing cycle is increased to obtain the second confidence levels. For example, based on the historical language record, the above multiple first confidence levels {Chinese: 0.6; English: 0.4; Korean: 0; German: 0; Japanese: 0} are corrected to {Chinese: 0.8; English: 0.2; Korean: 0; German: 0; Japanese: 0}.
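One way such a correction might be realized is sketched below; the boost factor and the renormalization are illustrative assumptions rather than the claimed correction rule.

```python
# Hypothetical correction: shift confidence toward languages that appear often
# in the user's historical language record, then renormalize to sum to 1.

from collections import Counter

def correct_by_history(first_confidences, history, boost=0.5):
    counts = Counter(history)
    total_records = sum(counts.values())
    corrected = {
        lang: conf * (1.0 + boost * counts[lang] / total_records)
        for lang, conf in first_confidences.items()
    }
    norm = sum(corrected.values())
    return {lang: conf / norm for lang, conf in corrected.items()}

first = {"Chinese": 0.6, "English": 0.4, "Korean": 0.0}
history = ["Chinese", "Chinese", "Chinese", "English"]
second = correct_by_history(first, history)
```

With the example history above, Chinese rises above 0.6 and English falls below 0.4; the exact numbers depend on the assumed boost factor.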
Optionally, when the multiple first confidence levels are all smaller than a first threshold, the multiple first confidence levels may be corrected into the multiple second confidence levels according to the user features. When all the first confidence levels are smaller than the first threshold, it is difficult to determine the language of the input speech information from them; correcting them into second confidence levels according to the user features at this point, and determining the language from the second confidence levels, improves language identification accuracy and speech recognition capability.
Optionally, the historical language record and the user-specified language may be obtained by querying with a voiceprint feature of the input speech information. In this way, the historical language record and the user-specified language can be obtained easily.
可选地,多个第一置信度可由多个初始置信度和多个预设权重确定,此时,可以 根据多个第二置信度,更新多个预设权重。Optionally, the multiple first confidence levels may be determined by multiple initial confidence levels and multiple preset weights. At this time, the multiple preset weights may be updated according to the multiple second confidence levels.
如此,根据本次处理周期的处理结果,更新预设权重,从而能够提高之后的处理周期的语种识别精度。In this way, the preset weight is updated according to the processing result of the current processing cycle, so that the language recognition accuracy of the subsequent processing cycle can be improved.
作为具体的更新方法,可以在多个第二置信度中存在大于第一阈值的第二置信度时,执行更新。As a specific updating method, updating may be performed when there is a second confidence level greater than the first threshold among the multiple second confidence levels.
当多个第二置信度中存在大于第一阈值的第二置信度时,根据多个第二置信度得到的语种识别结果的可信度更高,在此时根据多个第二置信度来更新预设权重能够更加可靠地提高之后的处理周期的语种识别精度。When there is a second confidence degree greater than the first threshold among the plurality of second confidence degrees, the language recognition result obtained according to the plurality of second confidence degrees has higher credibility, and at this time according to the plurality of second confidence degrees Updating the preset weights can more reliably improve the language recognition accuracy in subsequent processing cycles.
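The conditional update described above can be sketched as follows; `update_weights`, the fixed `step`, and the way the weight is increased are illustrative assumptions rather than details from the embodiment.

```python
# Hypothetical sketch of the conditional weight update: the first
# confidence levels are assumed to be initial confidences multiplied by
# preset weights, and the weights are only updated when some second
# confidence level exceeds the first threshold.
FIRST_THRESHOLD = 0.8

def update_weights(weights, second_conf, threshold=FIRST_THRESHOLD, step=0.1):
    best_lang = max(second_conf, key=second_conf.get)
    if second_conf[best_lang] <= threshold:
        return weights  # result not trustworthy enough; keep weights
    # Increase the weight of the identified language relative to the others.
    return {
        lang: w + step if lang == best_lang else w
        for lang, w in weights.items()
    }

weights = {"Chinese": 1.0, "English": 1.0}
second = {"Chinese": 0.85, "English": 0.15}
weights = update_weights(weights, second)
```

Only the identified language's weight grows, matching the idea of increasing the preset weight of the identified language relative to the other languages.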
Optionally, after the language of the input speech information is determined, the semantics of the input speech information may be determined according to the input speech information and its language.
In this way, since the language of the input speech information can be determined more accurately, the accuracy of semantic recognition of the input speech information is also improved.
Optionally, before the user's input speech information is acquired, the above multiple preset weights may be set according to scene features.
Since the multiple preset weights are set according to scene features before the user's input speech information is acquired, language identification accuracy can be improved.
The scene features here may include, for example, environment features and/or audio collector features.
Setting the multiple preset weights according to the environment features and/or audio collector features improves language identification accuracy.
The environment features here may include one or more of the environmental signal-to-noise ratio, power supply DC/AC information, or environmental vibration amplitude, and the audio collector features may include microphone arrangement information.
As a specific way of setting the multiple preset weights, the following may be adopted: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene features; determine second language information of the second speech data according to the second speech data; and set the multiple preset weights according to the first language information and the second language information.
Further, the specific way of determining the second language information of the second speech data from the second speech data may be: acquire multiple test weight groups, each of which includes multiple test weights; determine multiple pieces of second language information according to the second speech data and the multiple test weight groups, the multiple pieces of second language information corresponding to the multiple test weight groups respectively; determine multiple accuracy rates of the multiple pieces of second language information according to the first language information and the multiple pieces of second language information; and set the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
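The search over test weight groups can be sketched as follows; the helper names and the toy data are hypothetical, and the per-utterance confidences standing in for the second speech data are assumed to come from an upstream language identification model.

```python
# Hypothetical sketch of the weight-group search described above: each
# candidate weight group is scored by how often the weighted confidences
# identify the pre-recorded (ground-truth) language, and the group with
# the highest accuracy rate becomes the preset weights.
def pick_preset_weights(utterances, truth_langs, weight_groups):
    def identify(conf, weights):
        scored = {lang: conf[lang] * weights[lang] for lang in conf}
        return max(scored, key=scored.get)

    def accuracy(weights):
        hits = sum(
            identify(conf, weights) == truth
            for conf, truth in zip(utterances, truth_langs)
        )
        return hits / len(utterances)

    return max(weight_groups, key=accuracy)

utterances = [{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.4, "English": 0.6}]
truth_langs = ["Chinese", "Chinese"]
groups = [{"Chinese": 1.0, "English": 1.0}, {"Chinese": 1.6, "English": 1.0}]
best = pick_preset_weights(utterances, truth_langs, groups)
```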
In addition, an adjustable range, that is, a weight range, may be set for the multiple preset weights, and the multiple preset weights are set or updated within the weight range. If a preset weight falls outside the weight range, the identification result is not trustworthy. Therefore, setting an adjustable range, that is, a weight range, improves the accuracy of the language identification result.
Here, the weight range may be determined as follows: acquire multiple pre-collected test speech data groups and pre-recorded first language information of the multiple test speech data groups, each of the multiple test speech data groups including multiple pieces of test speech data; acquire multiple test weight groups, each of which includes multiple test weights; and determine the weight range according to the multiple test speech data groups, the first language information, and the multiple test weight groups.
A speech processing method provided by another embodiment of the present application is described below with reference to FIG. 4. FIG. 4 is a flowchart of a speech processing method provided by an embodiment of the present application. Similar to the above embodiment, the speech processing method of this embodiment may be executed by a vehicle, an in-vehicle apparatus, a head unit, an in-vehicle computer, or the like, or by a component of the vehicle or in-vehicle apparatus, such as a chip or a processor. In addition, part of the content of this embodiment is the same as that of the above embodiment and is therefore not described again.
As shown in FIG. 4, the speech processing method includes the following:
S6: Acquire the user's input speech information.
S7: Determine, according to the input speech information, multiple third confidence levels corresponding to the input speech information, the multiple third confidence levels corresponding to multiple languages respectively.
S8: Correct the multiple third confidence levels to multiple fourth confidence levels according to scene features.
S9: Determine the language of the input speech information according to the multiple fourth confidence levels.
With the above speech processing method, the third confidence levels are corrected according to the scene features, and the language of the input speech information is determined according to the corrected fourth confidence levels, so that the language of the input speech information can be determined more accurately, improving speech recognition capability. Here, the third confidence levels may be obtained in the same way as the above first confidence levels or in a different way, and the fourth confidence levels may be obtained in the same way as the above second confidence levels or in a different way. The specific way of correcting according to scene features may be the same as or different from the specific way of correcting according to user characteristics in the above embodiment.
The correction processing in this embodiment and the correction processing described in the above embodiment may be used in combination, that is, the language confidence levels are corrected according to both the user characteristics and the scene features, so that the language of the input speech information can be determined even more accurately.
In this embodiment, optionally, the multiple preset weights may be set according to the scene features, and the multiple third confidence levels are corrected to multiple fourth confidence levels according to the multiple preset weights.
In addition, optionally, the multiple preset weights may be set specifically as follows: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene features; determine second language information of the second speech data according to the second speech data; and set the multiple preset weights according to the first language information and the second language information.
Optionally, as a specific implementation: acquire multiple test weight groups, each test weight group including multiple test weights; determine multiple pieces of second language information according to the second speech data and the multiple test weight groups, the multiple pieces of second language information corresponding to the multiple test weight groups respectively; determine multiple accuracy rates of the multiple pieces of second language information according to the first language information and the multiple pieces of second language information; and set the multiple preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
A speech processing apparatus provided by an embodiment of the present application is described below with reference to FIG. 5. FIG. 5 is a schematic structural diagram of a speech processing apparatus provided by an embodiment of the present application. The speech processing apparatus 190 is configured to execute the speech processing method of the embodiment described with reference to FIG. 3 or the speech processing method of the embodiment described with reference to FIG. 4; its structure can be understood from the above description and is therefore only briefly described here. As shown in FIG. 5, the speech processing apparatus 190 includes a processing module 192 and a transceiver module 194. The processing module 192 may be configured to execute S2 to S4 or S7 to S9 above, and the transceiver module 194 may be configured to execute S1 or S6 above. In addition, the speech processing apparatus 190 may be implemented in hardware, in software, or in a combination of hardware and software. With the speech processing apparatus 190 of this embodiment, the same technical effects as those of the speech processing method described above can be obtained, so repeated description of the technical effects is omitted here.
A language identification method provided by an embodiment of the present application is described below with reference to FIG. 6.
FIG. 6 is a flowchart schematically illustrating a language identification method provided by an embodiment of the present application. The language identification method may be executed by a vehicle, an in-vehicle apparatus, a head unit, an in-vehicle computer, a chip, a processor, or the like. As shown in FIG. 6, in the language identification method of this embodiment, first, in step S10, the user's input speech information is acquired. For example, the user's input speech data received by a microphone is acquired as the input speech information, or the microphone's input speech data is preprocessed to obtain the input speech information. In step S12, the input speech is recognized to obtain a multilingual first identification confidence set, in which multiple first identification confidence levels correspond to multiple languages respectively. For example, the multilingual first identification confidence set is {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}; that is, the probability that the language of the input speech information is Chinese is 0.9, the probability that it is English is 0.1, and the probability that it is Korean, German, or Japanese is 0.
In step S14, it is determined whether a first identification confidence level greater than a threshold exists in the multilingual first identification confidence set. The threshold here may be set to 0.8, for example. When the result is "Yes", that is, when a first identification confidence level greater than the threshold exists (for example, the first identification confidence level of 0.9 for Chinese), an identification result is generated from the first identification confidence set and output. The identification result here may be a result indicating the identified language (for example, Chinese), or may be the first identification confidence set itself. In addition, as another embodiment, S14 may be omitted and step S18 described below performed directly.
When the result of step S14 is "No", that is, when no first identification confidence level greater than the threshold exists, in step S18 the first identification confidence set is corrected according to the user's user characteristics to obtain a second identification confidence set.
Examples of the user characteristics include the user's historical language records and the user-specified language.
The historical language records refer to records of the identified languages of speech that the user input before the above input speech. The user-specified language refers to the system language set by the user (for example, the system language of a voice interaction system, or the system language of a mobile phone operating system when the method is applied to a mobile phone). There may be one user-specified language or multiple user-specified languages (that is, the user has set more than one system language). The historical language records may be obtained by querying a database with the voiceprint of the input speech, or by querying with the user's face information, iris information, and so on. That is, the user's identity can be determined from the voiceprint, face information, iris information, etc., and the user's historical language records can then be obtained from the database. Querying the historical language records and the user-specified language with the voiceprint of the input speech, compared with querying with face information, iris information, etc., avoids misidentifying the user (the speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification. In addition, the voiceprint can be obtained from the input speech information itself, whereas querying with face information, iris information, etc. additionally requires obtaining an image of the user; querying with the voiceprint therefore requires less equipment and is faster.
In addition, it should be noted that obtaining the user-specified language through the voiceprint, for example, is premised on collecting the user's voiceprint when the user sets the system language and storing the voiceprint (or the user's identity) in a user-specified-language database in association with the system language set by the user. The database mentioned here may be stored locally or on a trusted platform.
After the first identification confidence set is corrected according to the user characteristics to obtain the second identification confidence set, in step S20 a language identification result is generated from the second identification confidence set. For example, the second identification confidence set may be output directly as the identification result; alternatively, when a second identification confidence level greater than the threshold exists in the second identification confidence set, the second identification confidence set or information indicating the identified language is output, and when no second identification confidence level greater than the threshold exists in the second identification confidence set, the first identification confidence set is output as the identification result.
With the above method, the second identification confidence levels are computed from the first identification confidence levels according to the user characteristics, and the language identification result is determined from the second identification confidence levels; in this way, language identification capability is improved.
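The flow of steps S12 to S20 can be sketched as follows, assuming a threshold of 0.8; the correction function is passed in as a stand-in for step S18, and the names are illustrative.

```python
# A minimal sketch of the flow in FIG. 6 (steps S14-S20), assuming the
# first identification confidence set has already been produced by a
# language identification model in step S12.
THRESHOLD = 0.8

def identify_language(first_set, correct_with_user_features):
    # S14: if some first confidence already exceeds the threshold,
    # generate the result directly from the first set.
    if max(first_set.values()) > THRESHOLD:
        return max(first_set, key=first_set.get), first_set
    # S18: otherwise correct the first set using user characteristics.
    second_set = correct_with_user_features(first_set)
    # S20: generate the language identification result from the second set.
    return max(second_set, key=second_set.get), second_set

first = {"Chinese": 0.6, "English": 0.4}
lang, conf = identify_language(first, lambda s: {"Chinese": 0.85, "English": 0.15})
```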
Optionally, the language identification method of this embodiment further includes: when the multiple second identification confidence levels in the second identification confidence set are all smaller than the threshold, generating the language identification result according to an automatic speech recognition confidence obtained by performing automatic speech recognition on the input speech, or a natural language understanding confidence obtained by performing natural language understanding (NLU) on the input speech.
In this way, when the language of the input speech is difficult to identify from the second identification confidence levels, the language of the input speech is determined from the automatic speech recognition confidence or the natural language understanding confidence, thereby improving language identification capability. As a specific implementation, for example, the language whose automatic speech recognition confidence exceeds the threshold is taken as the identified language of the input speech.
Optionally, in this embodiment, the first identification confidence set may be obtained as follows: recognize the input speech to obtain an initial confidence set; and multiply the multiple initial confidence levels in the initial confidence set by the multiple preset weights in a preset weight set, respectively, to obtain the first identification confidence set.
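The elementwise multiplication described above can be sketched as follows. Note that the renormalization step is an added assumption to keep the result a distribution; the embodiment only specifies the multiplication itself.

```python
# Hypothetical sketch of forming the first identification confidence set:
# each initial confidence level is multiplied by the preset weight of its
# language, then the products are normalized back into a distribution
# (the normalization is an assumption, not stated in the embodiment).
def first_confidence_set(initial_set, preset_weights):
    weighted = {
        lang: conf * preset_weights[lang] for lang, conf in initial_set.items()
    }
    total = sum(weighted.values())
    return {lang: v / total for lang, v in weighted.items()}

initial = {"Chinese": 0.5, "English": 0.5}
weights = {"Chinese": 1.5, "English": 0.5}
first = first_confidence_set(initial, weights)
```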
In this case, when a second identification confidence level greater than the threshold exists in the second identification confidence set, the preset weight set may be updated so that, among the multiple languages, the preset weight of the identified language whose second identification confidence level is greater than the threshold is increased relative to the preset weights of the other languages.
With the above technical means, when a second identification confidence level greater than the threshold exists in the second identification confidence set, that is, when the language of the input speech can be determined, the preset weight set is updated as above. The updated preset weight set is then used when subsequent input speech is processed, which improves the accuracy of language identification and thus the language identification capability.
The preset weight set may be updated specifically as follows: perform a correction calculation on the preset weight set to obtain a corrected weight set; and when the multiple corrected weights in the corrected weight set are within the weight range, update the preset weight set with the values of the multiple corrected weights.
If a preset weight falls outside the weight range, the language identification result obtained from the preset weights has relatively low credibility. Therefore, correcting the preset weights within the weight range as above suppresses the language misidentification rate.
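The range check described above can be sketched as follows; the function name and the per-language `(low, high)` ranges are illustrative assumptions.

```python
# Hypothetical sketch of the range-checked weight update: the corrected
# weights only replace the preset weights when every corrected weight
# stays within its language's weight range; otherwise the preset
# weights are kept unchanged.
def apply_corrected_weights(preset, corrected, weight_ranges):
    in_range = all(
        weight_ranges[lang][0] <= w <= weight_ranges[lang][1]
        for lang, w in corrected.items()
    )
    return dict(corrected) if in_range else dict(preset)

preset = {"Chinese": 1.0, "English": 1.0}
ranges = {"Chinese": (0.5, 2.0), "English": (0.5, 2.0)}
ok = apply_corrected_weights(preset, {"Chinese": 1.4, "English": 0.9}, ranges)
bad = apply_corrected_weights(preset, {"Chinese": 2.5, "English": 0.9}, ranges)
```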
In addition, in this embodiment, the preset weight set may be preset according to scene features. This makes it possible to adapt to different scenes and to obtain the language identification result with preset weights that suit the scene as well as possible, improving language identification capability.
The scene features may include environment features and/or audio collector features. The environment features may include the environmental signal-to-noise ratio, power supply DC/AC information, or environmental vibration amplitude, and the audio collector features may include microphone arrangement information. The microphone arrangement information indicates whether a single microphone or a microphone array is used and, in the case of a microphone array, whether it is a linear array, a planar array, or a stereo array.
The environmental signal-to-noise ratio, power supply DC/AC information, environmental vibration amplitude, and microphone arrangement information can all affect the language confidence levels. Therefore, adjusting the preset weights according to such information and performing language identification on that basis improves language identification capability.
The above preset weight set may be set specifically as follows: acquire multiple test weight sets; input a pseudo-environment data set into the language identification model, the pseudo-environment data set being obtained from the scene features and a noise-free data set; obtain, from the initial confidence sets output by the language identification model, multiple first identification confidence sets corresponding to the multiple test weight sets; calculate the prediction accuracy of each of the multiple first identification confidence sets according to the language information of the pseudo-environment data set; determine, among the multiple test weight sets, the test weight set corresponding to the first identification confidence set with the highest prediction accuracy as the optimal test weight set; and set the preset weight set with the values of the multiple test weights in the optimal test weight set.
Optionally, when the set preset weights are within the weight range, the setting takes effect; when the set preset weights are not within the weight range, the setting is canceled. Alternatively, when the set preset weights are not within the weight range, the setting still takes effect, but other means are preferred for obtaining the language identification result, for example determining the identified language of the input speech according to the user-specified language, or comparing the current input speech with the input speech in the historical language records to obtain a feature similarity and, if the feature similarity is greater than a similarity threshold, determining the language of the input speech in the historical language records as the identified language of the current input speech.
The weight range may be set as follows: acquire multiple test data sets; acquire multiple test weight sets; input the test data sets into the language identification model; obtain, from the initial confidence sets output by the language identification model and the multiple test weight sets, multiple first identification confidence sets corresponding to the multiple test weight sets; calculate the prediction accuracy of each of the multiple first identification confidence sets according to the language information of the test data sets; determine, among the multiple test weight sets, the test weight set corresponding to the first identification confidence set with the highest prediction accuracy as the optimal test weight set; obtain the optimal test weight set of each of the multiple test data sets; and obtain the weight ranges of the multiple languages from the optimal test weight sets of the multiple test data sets.
The test data sets may be pre-collected speech data sets whose language information is known.
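One plausible reading of deriving the weight range from the optimal test weight sets is to take, for each language, the span between the smallest and largest optimal weight observed across the test data sets. This is an assumption; the embodiment does not fix the exact derivation.

```python
# Hypothetical sketch of deriving the per-language weight range: the
# optimal test weight set is found for each test data set, and the range
# for each language spans the minimum to maximum optimal weight observed.
def weight_ranges(optimal_sets):
    langs = optimal_sets[0].keys()
    return {
        lang: (min(s[lang] for s in optimal_sets),
               max(s[lang] for s in optimal_sets))
        for lang in langs
    }

optimal = [
    {"Chinese": 1.2, "English": 0.8},
    {"Chinese": 1.6, "English": 0.7},
    {"Chinese": 1.4, "English": 0.9},
]
ranges = weight_ranges(optimal)
```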
In the above manner, the language identification model is tested with a large number of language data sets to set the weight range of the multilingual preset weight set; that is, the robustness range of the language identification model is specified, and the model works within this range, thereby guaranteeing the reliability of the language identification results.
FIG. 7 is a schematic structural diagram of a language identification apparatus provided by an embodiment of the present application. As shown in FIG. 7, an embodiment of the present application provides a language identification apparatus, which is configured to execute the language identification method shown in FIG. 6; its structure can be understood from the above description of the language identification method in FIG. 6, so the language identification apparatus 10 is described only relatively briefly here.
As shown in FIG. 7, the language identification apparatus 10 includes: an input speech acquisition module 17, configured to acquire the user's input speech; a language identification module 12, configured to recognize the input speech to obtain a first identification confidence set, in which multiple first identification confidence levels correspond to multiple languages respectively; a language confidence correction module 16, configured to perform a correction calculation on the first identification confidence set according to the user's user characteristics to obtain a second identification confidence set; and an identification result generation module 18, configured to generate a language identification result from the second identification confidence set.
With the above apparatus, the second identification confidence levels are computed from the first identification confidence levels according to the user characteristics, and the language identification result is determined from the second identification confidence levels; in this way, language identification capability is improved.
Optionally, the language confidence correction module 16 may, when the multiple first identification confidence levels are all smaller than the threshold, perform the correction calculation on the first identification confidence set according to the user's user characteristics to obtain the second identification confidence set.
Optionally, the user characteristics include historical language records.
In this way, when the language of the input speech is difficult to identify from the first identification confidence levels, the first identification confidence levels are corrected according to the user's historical language records, and the language of the input speech is determined on that basis, thereby improving language identification capability. The user's historical language records here refer to records of the languages of speech that the user input before the above input speech.
Optionally, the historical language records are obtained by querying with the voiceprint of the input speech.
In this way, the historical language records are queried with the voiceprint of the input speech; compared with querying with face information, iris information, etc., this avoids misidentifying the user (the speaker), that is, identifying a non-speaker as the speaker, which would cause language misidentification.
Optionally, the user characteristics include a user-specified language.
In the above manner, when it is difficult to recognize the language of the input speech based on the first recognition confidence levels, the first recognition confidence levels are corrected according to the user-specified language, and the language of the input speech is determined on this basis, thereby improving the language recognition capability.
Optionally, the user-specified language is obtained by querying based on the voiceprint of the input speech.
In the above manner, the user-specified language is queried based on the voiceprint of the input speech. Compared with querying based on, for example, face information or iris information, this can avoid language misrecognition caused by misidentifying the user (speaker), that is, identifying a non-speaker as the speaker.
Optionally, the recognition result generation module is further configured to: when a plurality of second recognition confidence levels in the second recognition confidence set are less than the threshold, generate the language recognition result according to an automatic speech recognition (Automatic Speech Recognition, ASR) confidence obtained by performing automatic speech recognition on the input speech.
In the above manner, when it is difficult to recognize the language of the input speech based on the second recognition confidence levels, the language of the input speech is determined according to the automatic speech recognition confidence, thereby improving the language recognition capability. As a specific implementation, for example, the language whose automatic speech recognition confidence exceeds the threshold is taken as the recognized language of the input speech.
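The ASR fallback described above can be sketched as follows; the function name, data layout, and threshold value are illustrative assumptions, not part of the embodiments.

```python
def pick_language_by_asr(asr_confidences, threshold=0.8):
    """Return the language whose ASR confidence exceeds the threshold, or None."""
    best = max(asr_confidences, key=asr_confidences.get)
    return best if asr_confidences[best] > threshold else None

asr = {"Chinese": 0.92, "English": 0.40, "Japanese": 0.05}
print(pick_language_by_asr(asr))  # -> Chinese
```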
Optionally, the recognition result generation module is further configured to: when a plurality of second recognition confidence levels in the second recognition confidence set are less than the threshold, generate the language recognition result according to a natural language understanding confidence obtained by performing natural language understanding on the input speech.
In the above manner, when it is difficult to recognize the language of the input speech based on the second recognition confidence levels, the language of the input speech is determined according to the natural language understanding confidence, thereby improving the language recognition capability. As a specific implementation, for example, the language whose natural language understanding confidence exceeds the threshold is taken as the recognized language of the input speech.
Optionally, the language identification module is further configured to: recognize the input speech to obtain an initial confidence set; and multiply a plurality of initial confidence levels in the initial confidence set by a plurality of preset weights in a preset weight set respectively to obtain the first recognition confidence set. The language confidence correction module is further configured to: when a second recognition confidence level greater than the threshold exists in the second recognition confidence set, update the preset weight set so that the preset weight of the recognized language whose second recognition confidence level is greater than the threshold is increased relative to the preset weights of the other languages.
In the above manner, when a second recognition confidence level greater than the threshold exists in the second recognition confidence set, that is, when the language of the input speech can be determined, the preset weight set is updated as described above. Thus, when subsequent input speech is processed, the updated preset weight set is used, so that the accuracy of language recognition can be improved and the language recognition capability enhanced.
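One way to realize the weight update described above is sketched below; the relative increase factor and the renormalization step are assumptions made for illustration.

```python
def update_preset_weights(weights, recognized_language, factor=1.1):
    """Increase the recognized language's weight relative to the others."""
    updated = dict(weights)
    updated[recognized_language] *= factor
    total = sum(updated.values())  # renormalize so the weights stay comparable
    return {lang: w / total for lang, w in updated.items()}

weights = {"Chinese": 0.2, "English": 0.2, "Korean": 0.2, "German": 0.2, "Japanese": 0.2}
weights = update_preset_weights(weights, "Chinese")
# Chinese's weight is now larger than every other language's
```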
Optionally, the language confidence correction module is further configured to: perform correction calculation on the preset weight set to obtain a corrected weight set; and when a plurality of corrected weights in the corrected weight set are within the weight range, update the preset weight set with the values of the plurality of corrected weights.
If a preset weight falls outside the weight range, the reliability of the language recognition result obtained from that preset weight is relatively low. Therefore, correcting the preset weights within the weight range in the above manner can suppress the language misrecognition rate.
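The in-range check can be sketched as follows. This is illustrative only: a single shared range is assumed here, whereas the embodiments may use per-language ranges.

```python
def apply_corrected_weights(preset, corrected, weight_range):
    """Accept the corrected weight set only if every weight lies in the range."""
    lo, hi = weight_range
    if all(lo <= w <= hi for w in corrected.values()):
        return dict(corrected)
    return dict(preset)  # out of range: keep the previous preset weights

preset = {"Chinese": 1.0, "English": 1.0}
ok = apply_corrected_weights(preset, {"Chinese": 1.5, "English": 0.8}, (0.5, 2.0))
rejected = apply_corrected_weights(preset, {"Chinese": 3.0, "English": 0.8}, (0.5, 2.0))
```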
Optionally, the language identification module is further configured to: recognize the input speech to obtain an initial confidence set; and multiply a plurality of initial confidence levels in the initial confidence set by a plurality of preset weights in the preset weight set respectively to obtain the first recognition confidence set. The language confidence correction module is further configured to set the preset weight set according to scene characteristics.
In the above manner, the preset weight set is set according to the scene characteristics, so that different scenes can be accommodated and the language recognition result can be obtained with preset weights that suit the scene as well as possible, improving the language recognition capability.
Optionally, the scene characteristics include environment characteristics and/or audio collector characteristics.
Optionally, the environment characteristics include an environmental signal-to-noise ratio, power supply direct/alternating current information, or an environmental vibration amplitude, and the audio collector characteristics include microphone arrangement information. The microphone arrangement information indicates whether a single microphone or a microphone array is used and, in the case of a microphone array, whether it is a linear array, a planar array, or a stereo array.
The environmental signal-to-noise ratio, the power supply direct/alternating current information, the environmental vibration amplitude, and the microphone arrangement information may all affect the language confidence. Therefore, adjusting the preset weights according to this information and performing language recognition on that basis can improve the language recognition capability.
Optionally, the language confidence correction module is further configured to: obtain a plurality of test weight sets; input a simulated-environment data set into the language recognition model, the simulated-environment data set being obtained from the scene characteristics and a noise-free data set; obtain, from the initial confidence sets output by the language recognition model, a plurality of first recognition confidence sets corresponding to the plurality of test weight sets; calculate the prediction accuracy of each of the plurality of first recognition confidence sets according to the language information of the simulated-environment data set; determine, among the plurality of test weight sets, the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy as the optimal test weight set; and set the preset weight set with the values of the plurality of test weights in the optimal test weight set. Accordingly, it can be said that the language confidence correction module has a preset weight setting module.
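The selection of the optimal test weight set can be sketched as follows, with the language recognition model's output replaced by precomputed initial confidence sets. All names and values are illustrative assumptions.

```python
def choose_optimal_weight_set(test_weight_sets, initial_confidence_sets, true_languages):
    """Return the test weight set whose weighted confidences predict most accurately."""
    def accuracy(weights):
        correct = 0
        for initial, truth in zip(initial_confidence_sets, true_languages):
            # First recognition confidence = initial confidence x test weight
            weighted = {lang: c * weights[lang] for lang, c in initial.items()}
            if max(weighted, key=weighted.get) == truth:
                correct += 1
        return correct / len(true_languages)
    return max(test_weight_sets, key=accuracy)

test_sets = [{"Chinese": 2.0, "English": 1.0}, {"Chinese": 1.0, "English": 1.0}]
initials = [{"Chinese": 0.4, "English": 0.6}]  # model output for one test utterance
best = choose_optimal_weight_set(test_sets, initials, ["Chinese"])
# -> the first set: only it weights Chinese above English for this utterance
```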
Optionally, the language confidence correction module is further configured to set the plurality of preset weights within the weight range.
Optionally, the weight range is set as follows: obtain a plurality of test data sets; obtain a plurality of test weight sets; input the test data sets into the language recognition model; obtain, from the initial confidence sets output by the language recognition model and the plurality of test weight sets, a plurality of first recognition confidence sets corresponding to the plurality of test weight sets; calculate the prediction accuracy of each of the plurality of first recognition confidence sets according to the language information of the test data sets; determine, among the plurality of test weight sets, the test weight set corresponding to the first recognition confidence set with the highest prediction accuracy as the optimal test weight set; obtain the optimal test weight sets of the plurality of test data sets; and obtain the weight ranges of the plurality of languages from the optimal test weight sets of the plurality of test data sets. The function of setting the weight range may be implemented by the language recognition apparatus 10, in which case the language recognition apparatus 10 can be said to have a weight range setting module, or it may be implemented by a test apparatus that tests the language recognition apparatus 10.
In the above manner, the language recognition model is tested with a large number of language data sets to set the weight ranges of the multilingual preset weight set; that is, the robustness range of the language recognition model is specified so that the model operates within this range, thereby ensuring the reliability of the language recognition results.
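The embodiment does not specify how the weight ranges are derived from the optimal test weight sets; one plausible reading, taking the per-language minimum and maximum over all optimal sets, is sketched below purely as an assumption.

```python
def derive_weight_ranges(optimal_weight_sets):
    """Per language, the range spanned by its weight across all optimal sets."""
    languages = optimal_weight_sets[0].keys()
    return {lang: (min(ws[lang] for ws in optimal_weight_sets),
                   max(ws[lang] for ws in optimal_weight_sets))
            for lang in languages}

optimal_sets = [{"Chinese": 1.8, "English": 0.9},
                {"Chinese": 2.2, "English": 1.1}]
ranges = derive_weight_ranges(optimal_sets)  # e.g. Chinese -> (1.8, 2.2)
```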
An embodiment of the present application provides a computing device, which includes a processor and a memory. The memory stores program instructions that, when executed by the processor, cause the processor to perform the above speech processing method and language recognition method. The computing device can be further understood from the description given below with reference to FIG. 17.
An embodiment of the present application provides a computer-readable storage medium storing program instructions that, when executed by a computer, cause the computer to perform the above speech processing method and language recognition method.
An embodiment of the present application provides a computer program that, when executed by a computer, causes the computer to perform the above speech processing method and language recognition method.
FIG. 8 schematically illustrates a flowchart of a voice interaction method provided by an embodiment of the present application. Some steps of the voice interaction method are the same as those of the above language recognition method; here, the same content is denoted by the same reference numerals and its description is simplified.
As shown in FIG. 8, in the voice interaction method, first, in step S10, the user's input speech information is acquired, for example the user's input speech received by a microphone. Then, on the one hand, in step S40, automatic speech recognition is performed on the input speech using speech recognition models; on the other hand, in step S12, language recognition is performed on the input speech using a language recognition model. In other embodiments, the automatic speech recognition and the language recognition may also be performed sequentially.
In addition, in step S40, in order to be able to recognize input speech in multiple languages, speech content recognition processing is performed on the input speech using speech recognition models of a plurality of different languages (five languages in this embodiment: Chinese, English, Korean, German, and Japanese), obtaining a plurality of texts Ti in the different languages.
Then, in step S42, the plurality of texts Ti are input into a text translation model, which performs translation processing on these texts Ti and converts them into texts Ai in a target language (for example, Chinese).
Then, in step S44, the plurality of texts Ai are sequentially input into a semantic understanding model, which performs semantic understanding processing on these texts Ai to obtain a plurality of corresponding candidate commands Oi. A candidate command is a command that has not yet been confirmed for execution.
In addition, in step S12, the input speech is recognized to obtain a multilingual first recognition confidence set, in which a plurality of first recognition confidence levels correspond to the plurality of languages respectively. For example, the multilingual first recognition confidence set is {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}.
In step S14, it is judged whether a first recognition confidence level greater than a threshold exists in the multilingual first recognition confidence set. The threshold here may be set to 0.8, for example. When the judgment result is "Yes", that is, when a first recognition confidence level greater than the threshold exists (for example, the first recognition confidence level of Chinese is 0.9), in step S16, the language whose first recognition confidence level is greater than the threshold (for example, Chinese) is determined, as the recognition result, to be the recognized language of the input speech.
Then, in step S26, the candidate command corresponding to the recognized language (for example, Chinese) is selected from the plurality of candidate commands Oi obtained in step S44 as the target command to be executed, and processing is then performed so that the target command is executed. For example, when the target command is "turn on the air conditioner", corresponding control is executed to turn on the air conditioner.
In addition, when the judgment result in step S14 is "No", that is, when no first recognition confidence level greater than the threshold exists, or the plurality of first recognition confidence levels are less than the threshold, in step S18, the first recognition confidence levels are corrected according to the user characteristics. The details of this correction have been described above and are not repeated here.
Then, in step S22, it is judged whether a second recognition confidence level greater than the threshold exists. When the judgment result is "Yes", in step S24, the language whose second recognition confidence level is greater than the threshold is determined as the recognized language of the input speech, after which the processing in step S26 is performed.
In addition, when the judgment result in step S22 is "No", that is, when no second recognition confidence level greater than the threshold exists, or the plurality of second recognition confidence levels are less than the threshold, in step S28, it is judged whether an ASR confidence greater than the threshold exists. When the judgment result is "Yes", in step S30, the language whose ASR confidence is greater than the threshold is determined as the recognized language, after which the processing in step S26 is performed.
When the judgment result in step S28 is "No", that is, when no ASR confidence greater than the threshold exists, or the plurality of ASR confidence levels are less than the threshold, in step S32, it is judged whether an NLU confidence greater than the threshold exists. When the judgment result is "Yes", in step S34, the language whose NLU confidence is greater than the threshold is determined as the recognized language, after which the processing in step S26 is performed.
When the judgment result in step S32 is "No", information indicating that speech content recognition has failed may be output, and the processing ends.
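The decision cascade of steps S14, S22, S28, and S32 can be condensed into the following sketch. The confidence sets are taken as given inputs, and all names are illustrative assumptions.

```python
def decide_language(first_conf, second_conf, asr_conf, nlu_conf, threshold=0.8):
    """Try each confidence set in turn; return the first language to clear the threshold."""
    for confidences in (first_conf, second_conf, asr_conf, nlu_conf):
        best = max(confidences, key=confidences.get)
        if confidences[best] > threshold:
            return best
    return None  # step S32 answered "No": speech content recognition failed

lang = decide_language({"Chinese": 0.5, "English": 0.4},   # S14: no winner
                       {"Chinese": 0.9, "English": 0.1},   # S22: Chinese clears 0.8
                       {"Chinese": 0.95}, {"Chinese": 0.9})
print(lang)  # -> Chinese
```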
With the above voice interaction method, in the language recognition, the second recognition confidence levels are calculated from the first recognition confidence levels according to the user characteristics, and the language recognition result is determined according to the second recognition confidence levels. In this way, the language recognition capability, and in turn the voice interaction capability, can be improved.
FIG. 9 is a schematic structural diagram of a voice interaction system provided by an embodiment of the present application. As shown in FIG. 9, the voice interaction system (or voice interaction apparatus) 20 has a speech recognition module 110, a language identification module 12, a text translation module 130, a semantic understanding module 140, an input speech acquisition module 17, a language confidence correction module 16, and a control module 170. The voice interaction apparatus is configured to perform the voice interaction method described with reference to FIG. 8; the description of the specific processing flow is therefore omitted here. In addition, like the above language recognition apparatus 10, the voice interaction system 20 has the language identification module 12, the input speech acquisition module 17, and the language confidence correction module 16, which are denoted by the same reference numerals here and whose detailed description is omitted. The voice interaction system may further include execution devices, such as a loudspeaker and a display device.
The correspondence between the voice interaction system 20 and the steps of the above voice interaction method is briefly described below.
The speech recognition module 110 performs step S40 in FIG. 8. The language identification module 12 performs step S12 in FIG. 8. The text translation module 130 performs step S42 in FIG. 8. The semantic understanding module 140 performs step S44 in FIG. 8. The input speech acquisition module 17 performs step S10 in FIG. 8. The language confidence correction module 16 performs step S18 in FIG. 8. The control module 170 performs steps S14, S16, S22, S24, S28, S30, S32, and S34 in FIG. 8. In addition, steps S14, S16, S22, and S24 may also be performed by the language confidence correction module 16.
In addition, as can be seen from the above description, the voice interaction method described with reference to FIG. 8 essentially includes a multilingual speech recognition method capable of recognizing input speech in a plurality of languages, and the voice interaction apparatus described with reference to FIG. 9 also includes a speech recognition apparatus that performs this multilingual speech recognition method. Because much of the content would be repeated, separate embodiments of the speech recognition method and the speech recognition apparatus are not described here.
The voice interaction system 100 according to an embodiment of the present application and the voice interaction method performed by it are described below with reference to FIGS. 11-17.
In this embodiment, an example in which the voice interaction system 100 is applied to an automobile to constitute an in-vehicle voice interaction system is described. However, the present application is not limited thereto and may also be applied to other scenarios, such as smart homes, intelligent robots, intelligent voice question answering, intelligent voice analysis, and real-time voice monitoring and analysis. In addition, the in-vehicle voice interaction system also constitutes a vehicle control apparatus. It can be understood that, through the voice interaction method described above and the voice interaction system described in this embodiment, the embodiments of the present application provide a speech processing method, apparatus, and system.
<System architecture>
The voice interaction system 100 of this embodiment can receive input speech from a user (that is, a speaker) and perform corresponding processing in response to the content of the input speech, such as turning on the air conditioner or opening a car window. Moreover, the voice interaction system 100 can respond to speech in a plurality of different languages; for example, in this embodiment it can respond to speech in five languages: Chinese, English, Korean, German, and Japanese.
The speech of different languages referred to here includes both speech of different language families (for example, Chinese and English belong to different languages) and speech of different variants within the same family (for example, Mandarin and Cantonese within Chinese also belong to different languages).
FIG. 11 is a schematic illustration of a voice interaction system according to an embodiment of the present application. The voice interaction system 100 has a speech recognition module 110, a language identification module 120, a text translation module 130, a semantic understanding module 140, a command parsing and execution module 150, and a language confidence correction module 160. In addition, the voice interaction system 100 may further have a microphone, a loudspeaker, a camera, a display, and the like.
FIG. 12 is a flowchart illustrating one processing flow of the voice interaction method involved in an embodiment. A processing flow of the voice interaction system 100 is described below with reference to FIG. 12 to outline the architecture of the voice interaction system 100.
As shown in FIG. 12, when the user utters a segment of speech S, the voice interaction system 100 acquires the speech (referred to as the input speech) through the microphone. On the one hand, ① the input speech is input into the speech recognition module 110, which calls speech recognition sub-modules of a plurality of different languages (five languages in this embodiment: Chinese, English, Korean, German, and Japanese; it goes without saying that another number of languages is also possible) to perform speech content recognition processing on the input speech, obtaining a plurality of texts Ti in the different languages. ② The plurality of texts Ti are input into the text translation module 130, which performs translation processing on these texts Ti and converts them into texts Ai in a target language (for example, Chinese). ③ The plurality of texts Ai are sequentially input into the semantic understanding module 140, which performs semantic understanding processing on these texts Ai to obtain a plurality of corresponding candidate commands Oi.
On the other hand, ④ the user's input speech is also input into the language identification module 120, which performs language identification processing on the input speech, generates initial confidence levels for the plurality of languages, and multiplies each initial confidence level by the corresponding preset weight to obtain recognition confidence levels for the plurality of languages.
⑤ When a recognition confidence level greater than the threshold λ exists among the recognition confidence levels of the plurality of languages, the language of the input speech can be considered to be the language whose recognition confidence level is greater than the threshold λ. The command parsing and execution module 150 determines, among the plurality of candidate commands Oi, the candidate command Oi corresponding to that language as the target command to be executed, and performs corresponding processing according to the content of the target command.
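Step ⑤ can be sketched as follows; the function name, the data layout, and the example commands are illustrative assumptions only.

```python
def select_target_command(recognition_confidences, candidate_commands, lam=0.8):
    """Pick the candidate command of the language whose confidence exceeds the threshold λ."""
    best = max(recognition_confidences, key=recognition_confidences.get)
    if recognition_confidences[best] > lam:
        return candidate_commands[best]
    return None  # no language cleared λ: handled by the confidence correction module

commands = {"Chinese": "turn on the air conditioner", "English": "open the window"}
target = select_target_command({"Chinese": 0.9, "English": 0.1}, commands)
print(target)  # -> turn on the air conditioner
```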
In addition, when no recognition confidence level greater than the threshold λ exists among the recognition confidence levels of the plurality of languages, the language confidence correction module 160 corrects the recognition confidence levels of the plurality of languages according to the user characteristics and the like; the details are described later.
The structure of each component of the voice interaction system 100 is described below.
<Structure>
In this embodiment, the speech recognition module 110, the language identification module 120, the text translation module 130, and the semantic understanding module 140 include algorithm models, namely a speech recognition model, a language identification model, a text translation model, and a semantic understanding model, which perform the speech recognition processing, language identification processing, text translation processing, and semantic understanding processing, respectively.
The speech recognition module 110 is configured to convert human speech, that is, the speech to be recognized, into text of the corresponding language; in other words, it predicts the content of the speech, or performs automatic speech recognition (Automatic Speech Recognition, ASR). Here, the speech recognition module 110 has a plurality of speech recognition sub-modules, each corresponding to one language and configured to convert the speech into a text Ti of the corresponding language. For example, in this embodiment, there are speech recognition sub-modules for five languages, Chinese, English, Korean, German, and Japanese, which respectively convert the input speech into a Chinese text T1, an English text T2, a Korean text T3, a German text T4, and a Japanese text T5. After completing recognition, these sub-modules output the text Ti as the prediction result together with the confidence of that text Ti, called the ASR confidence, which represents the prediction probability of the text predicted by the sub-module, in other words, of the speech content.
The text translation module 130 is configured to convert text in one natural language (the source language) into text in another natural language (the target language), for example converting English text into Chinese text. Here, the text translation module 130 has a plurality of text translation sub-modules, each corresponding to one language. For example, in this embodiment, Chinese is the target language; therefore, there are text translation sub-modules for four languages, English, Korean, German, and Japanese, which respectively translate English text, Korean text, German text, and Japanese text into Chinese texts Ai. In addition, when a Chinese text is input into the text translation module 130, since Chinese is the target language of the translation, the text translation module 130 may output the input Chinese text as-is without processing it. The text translation module 130 ultimately outputs five Chinese texts Ai.
The semantic understanding module 140 is configured to perform natural language understanding (Natural Language Understanding, NLU) on the text of the target language; in other words, it predicts the intent of the text and generates a command that can be understood by the machine. For example, if the text is "please play the song 'XX'", the semantic understanding module 140 enables the machine to obtain the intent "please play the song 'XX'". While generating the command, the semantic understanding module 140 also generates an NLU confidence, which represents the module's prediction probability for the intent of the text. In addition, since the speech recognition module 110 outputs texts in five languages, the semantic understanding module 140 ultimately generates commands corresponding to the five languages and five NLU confidence levels. Furthermore, the commands output by the semantic understanding module 140 have not yet been determined for execution and are therefore called candidate commands.
The language identification (LID) module is used to identify the language of the user's input speech (the speech to be recognized), that is, to predict which of multiple languages the input speech belongs to. For example, in this embodiment, the language identification module 120 identifies which of Chinese, English, Korean, German, and Japanese the input speech belongs to, and outputs as the recognition result a set of recognition confidence levels for the multiple languages; each recognition confidence represents the predicted probability, as estimated by the language identification module 120, that the input speech belongs to the corresponding language. In addition, in the language identification model of this embodiment, algorithmic recognition is performed on the input speech to obtain confidence levels for the multiple languages (referred to as initial confidence levels), and the initial confidence levels of the multiple languages are multiplied by the corresponding preset weight values to obtain the recognition confidence levels of the multiple languages; the language identification module 120 outputs these recognition confidence levels as the prediction result. The multiplication of the initial confidence levels by the preset weight values may or may not be performed by the language identification model itself.
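The weighting step described here can be sketched as follows; the language codes, numeric values, and function name are illustrative assumptions, not part of the patent:

```python
# Minimal sketch of combining initial LID confidences with preset weights.
# All names and numbers below are hypothetical examples.
def weighted_confidences(initial, weights):
    """Multiply each language's initial confidence by its preset weight."""
    return {lang: initial[lang] * weights[lang] for lang in initial}

initial = {"zh": 0.6, "en": 0.3, "ko": 0.05, "de": 0.03, "ja": 0.02}
weights = {"zh": 1.5, "en": 0.9, "ko": 1.0, "de": 1.0, "ja": 1.0}
recognition = weighted_confidences(initial, weights)
```

Whether this multiplication happens inside or outside the language identification model is left open by the embodiment; the sketch treats it as a separate step.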
The command parsing and execution module 150 is used to select, according to the output of the language identification module 120, a target command to be executed from the candidate commands output by the semantic understanding module 140. In this embodiment, when a recognition confidence greater than a threshold λ (for example, set to 0.8 or above) exists among the recognition confidence levels of the multiple languages output by the language identification module 120, the command parsing and execution module 150 determines the language whose recognition confidence is greater than the threshold λ as the language of the user's input speech, and determines the candidate command corresponding to that language as the target command to be executed. For example, when the confidence levels of the multiple languages output by the language identification module 120 are {Chinese: 0.9; English: 0.1; Korean: 0; German: 0; Japanese: 0}, Chinese is determined as the language of the user's input speech, and among the candidate commands in the five languages output by the semantic understanding module 140, the candidate command corresponding to Chinese is determined as the target command to be executed.
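The threshold test in this selection step can be illustrated with a short sketch; the threshold value, language codes, and command strings are assumed for illustration only:

```python
# Minimal sketch of selecting the target command when one language's
# recognition confidence exceeds the threshold λ (assumed here to be 0.8).
LAMBDA = 0.8

def select_target_command(confidences, candidates, threshold=LAMBDA):
    """Return (language, command) if some confidence exceeds threshold, else None."""
    lang = max(confidences, key=confidences.get)
    if confidences[lang] > threshold:
        return lang, candidates[lang]
    return None  # fall back to the correction strategies (methods one to three)

conf = {"zh": 0.9, "en": 0.1, "ko": 0.0, "de": 0.0, "ja": 0.0}
cmds = {lang: f"play-song[{lang}]" for lang in conf}
print(select_target_command(conf, cmds))  # ('zh', 'play-song[zh]')
```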
After determining the target command to be executed, the command parsing and execution module 150 performs control for causing the target command to be executed. For example, when the determined target command is "please play the song 'XX'" and the voice interaction system 100 has a music playing module, the command parsing and execution module 150 controls the music playing module to play the song 'XX'. If the music playing module does not belong to the voice interaction system 100 and is not controlled by the command parsing and execution module 150, the determined target command may instead be sent to a higher-level controller shared by the voice interaction system 100 and the music playing module, and that higher-level controller sends a command to the controller of the music playing module so that the song 'XX' is played.
In addition, after determining the target command to be executed, the command parsing and execution module 150 may respond to the user through a speaker or a display. For example, when the determined target command is "please play the song 'XX'", the command parsing and execution module 150 controls the speaker to play the sound "OK, playing it for you now" in response to the user.
On the other hand, in this embodiment, when no recognition confidence greater than the threshold λ (for example, 0.8) exists in the set of recognition confidence levels of the multiple languages output by the language identification module 120, the command parsing and execution module 150 determines the target command to be executed in other ways, which are illustrated below.
Method one
The command parsing and execution module 150 performs correction calculation on the recognition confidence levels of the above multiple languages (corresponding to the "first recognition confidence" in this application), and performs corresponding processing according to the output of the language confidence correction module 160. For example, the correction may be performed according to user characteristics, which here include the user's historical language records and user-specified languages. For the historical language records and user-specified languages, the user's identity may be determined from audio features (i.e., the voiceprint), and the records are then obtained by querying, according to the user's identity, the historical language record database and the user-specified language database of the voice interaction system 100. The specific content of these corrections will be described in detail later. After the corrected recognition confidence levels are obtained, the command parsing and execution module 150 determines the language of the input speech according to the corrected recognition confidence levels and performs the corresponding processing. For example, when a recognition confidence greater than the threshold λ exists among the corrected recognition confidence levels of the multiple languages, the language whose recognition confidence is greater than the threshold λ is determined as the language of the user's input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
As a specific implementation of the correction calculation on the recognition confidence in method one, the values of the recognition confidence levels may be corrected directly, or the preset weights may be corrected, after which the recognition confidence levels are recalculated from the set of initial confidence levels and the set of corrected preset weights.
In this embodiment, the language confidence correction module 160 has an audio-feature-based adjustment module 162, a video-feature-based adjustment module 163, and a comprehensive adjustment module 164, which are used to correct the language confidence in different ways.
Method two
The command parsing and execution module 150 determines the target command to be executed according to the ASR confidence output by the speech recognition module 110 or the NLU confidence output by the semantic understanding module 140. For example, when an ASR confidence greater than an ASR confidence threshold (which may be set to the same value as the above threshold λ, for example 0.8) exists, the language corresponding to that ASR confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed. Alternatively, when an NLU confidence greater than an NLU confidence threshold (which may likewise be set to the same value as the threshold λ, for example 0.8) exists, the language corresponding to that NLU confidence is determined as the language of the input speech, and the candidate command corresponding to that language is determined as the target command to be executed.
The execution timing of method two may be set freely. Optionally, it may be executed when the language of the input speech still cannot be determined after method one is executed (that is, when no recognition confidence greater than the threshold λ exists even among the recognition confidence levels corrected according to user characteristics and the like), or before method one is executed, or between the multiple approaches enumerated in the description of method one.
Method three
The command parsing and execution module 150 determines the language of the input speech by feature similarity. For example, the current input speech is compared with the audio data of historical input speech in the history records, and the feature similarity between the two is obtained by cosine similarity, linear regression, deep learning, or the like; when the feature similarity exceeds a threshold, the recognized language of the historical input speech may be determined as the language of the current input speech.
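As one concrete instance of the comparison named here (cosine similarity is one of the listed methods), a minimal sketch follows; the feature vectors and the similarity threshold are assumptions for illustration:

```python
# Minimal sketch of the cosine-similarity comparison between the current
# utterance and a historical utterance; vectors and threshold are invented.
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

current = [0.2, 0.8, 0.1]        # hypothetical features of the current utterance
historical = [0.25, 0.75, 0.05]  # hypothetical features of a past utterance
SIM_THRESHOLD = 0.95             # assumed threshold
# If similar enough, reuse the language recorded for the historical utterance.
reuse_history = cosine_similarity(current, historical) > SIM_THRESHOLD
```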
The execution timing of method three may likewise be set freely. Optionally, it may be executed after or before methods one and two, between methods one and two, or among the multiple approaches enumerated in the description of method one.
The confidence correction module is described below.
The language confidence correction module 160 includes a real-time scene adaptation module 161, an audio-feature-based adjustment module 162, a video-feature-based adjustment module 163, and a comprehensive adjustment module 164.
The real-time scene adaptation module 161 is used to initialize the multilingual preset weight set according to environmental characteristics and the characteristics of the audio collector (i.e., the microphone) when the language identification model first comes into contact with the scene. First contact with the scene occurs, for example, when the user has just purchased the voice interaction system or the vehicle; at that time, the user generally turns on the voice interaction system to perform some basic settings or tests. In this embodiment, the real-time scene adaptation module 161 can use this opportunity to initialize the preset weight set. In addition, in other embodiments, the initialization of the preset weight set is not limited to being performed at first contact with the scene; it may also be performed at other appropriate times, for example when a new audio collector is installed, or at a time chosen by the user.
The video-feature-based adjustment module 163 is used to correct the set of recognition confidence levels of the multiple languages according to captured images of the user.
The audio-feature-based adjustment module 162 is used to correct the set of recognition confidence levels of the multiple languages according to the user's voice information. Specifically, the user's historical language records may be obtained by querying a database of the voice interaction system 100 according to the voice information (voiceprint), and the set of recognition confidence levels of the multiple languages is corrected according to the historical language records.
The comprehensive adjustment module 164 is mainly used to correct the set of recognition confidence levels of the multiple languages according to the user-specified language. The user-specified language is obtained by querying a database of the voice interaction system 100 according to the voiceprint of the input speech.
When a recognition confidence greater than the threshold λ exists among the corrected recognition confidence levels of the multiple languages, the confidence correction module performs correction calculation on the multilingual preset weight set so that the preset weight of the language whose recognition confidence is greater than the threshold λ is increased relative to the preset weights of the other languages, thereby obtaining a corrected weight set. Afterwards, the confidence correction module judges whether each corrected weight in the corrected weight set is within the weight range; when the judgment result is that they are within the weight range, the values in the corrected weight set are used to update the preset weight set for use by the language identification module 120 in subsequent language identification.
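One possible reading of this update rule, sketched with an assumed boost factor, an assumed renormalization, and an assumed weight range (the patent only says the actual range is determined by testing):

```python
# Hypothetical sketch of the weight update: boost the winning language's
# preset weight, then adopt the corrected set only if every weight stays
# within the allowed range. BOOST and WEIGHT_RANGE are invented values.
WEIGHT_RANGE = (0.05, 0.60)
BOOST = 1.2

def correct_weights(preset, winner, boost=BOOST):
    """Increase the winning language's weight relative to the others."""
    corrected = {l: (w * boost if l == winner else w) for l, w in preset.items()}
    total = sum(corrected.values())
    return {l: w / total for l, w in corrected.items()}

def maybe_update(preset, winner):
    """Adopt the corrected weights only if all of them are in range."""
    corrected = correct_weights(preset, winner)
    lo, hi = WEIGHT_RANGE
    if all(lo <= w <= hi for w in corrected.values()):
        return corrected
    return preset  # out of range: keep the existing preset weight set

preset = {"zh": 0.21, "en": 0.19, "ko": 0.22, "ja": 0.20, "de": 0.18}
updated = maybe_update(preset, "zh")
```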
The functions of the speech recognition module 110, the language identification module 120, the text translation module 130, the semantic understanding module 140, the command parsing and execution module 150, and the confidence correction module may be implemented by a processor executing a program (software) stored in a memory, or by hardware such as an LSI (Large Scale Integration) circuit or an ASIC (Application Specific Integrated Circuit).
Typically, these modules may be constituted by electronic control units (ECUs). Optionally, one module may be constituted by one ECU or by multiple ECUs, or one ECU may constitute multiple modules.
An ECU is a control device composed of integrated circuits that implements a series of functions such as analyzing, processing, and sending data. As shown in Figure 17, an embodiment of the present application provides an electronic control unit (ECU) that includes a microcomputer, an input circuit, an output circuit, and an analog-to-digital (A/D) converter.
The main function of the input circuit is to preprocess input signals (for example, signals from sensors); different input signals call for different processing methods. Specifically, since there are two types of input signals, analog and digital, the input circuit may include an input circuit that processes analog signals and an input circuit that processes digital signals.
The main function of the A/D converter is to convert analog signals into digital signals. After being preprocessed by the corresponding input circuit, an analog signal is input to the A/D converter, where it is converted into a digital signal accepted by the microcomputer.
The output circuit is a device that establishes the connection between the microcomputer and the actuators. Its function is to convert the processing results issued by the microcomputer into control signals to drive the actuators. The output circuit generally uses power transistors, which switch on or off according to the microcomputer's instructions to control the electronic circuit of the actuating element.
The microcomputer includes a central processing unit (CPU), a memory, and an input/output (I/O) interface; the CPU is connected to the memory and the I/O interface through a bus, over which they can exchange information with one another. The memory may be a read-only memory (ROM), a random access memory (RAM), or the like. The I/O interface is a connection circuit for exchanging information between the CPU and the input circuit, the output circuit, or the A/D converter; specifically, I/O interfaces can be divided into bus interfaces and communication interfaces. The memory stores programs; by calling the programs in the memory, the CPU can implement the functions of the above modules, or execute the methods described with reference to Figures 3, 4, 6, 8, 12, and so on.
In addition, as described above, the voice interaction system 100 also has a microphone, a speaker, a camera, or a display. The microphone is used to acquire the user's input speech, corresponding to the speech acquisition module in this application. The speaker is used to play sounds, for example playing the response tone "OK" to the user's input speech. The camera is used to capture the user's facial image and the like, and sends the captured image to the command parsing and execution module 150, which can perform image recognition on the image so as to authenticate the user's identity. The display is used to respond according to the user's input speech; for example, when the input speech is "play the song 'XX'", the display shows the playback screen of that song.
<Actions and processing flow>
The voice interaction system 100 is described in more detail below in conjunction with a description of its actions and processing flow. Together with that description, the voice interaction method involved in this embodiment is also described, and, as the following explanation makes clear, the voice interaction method includes a language identification method (corresponding to the processing of the language identification module 120, part of the processing of the command parsing and execution module 150, the processing of the confidence correction module, and so on).
As described above, the language identification module 120 uses the language identification model to perform language identification. In this embodiment, as shown in step S210 of Figure 14, when the trained language identification model first comes into contact with the scene, the real-time scene adaptation module 161 initializes the multilingual preset weight set according to environmental characteristics and audio collector characteristics. An example of the initialization method is described below with reference to Figure 13.
As shown in Figure 13, the real-time scene adaptation module 161 generates a quasi-environment data set according to the environmental characteristics, the audio collector characteristics, and an expert data set. Here, the environmental characteristics include, for example, the environmental signal-to-noise ratio, microphone power source information (DC/AC information), or the environmental vibration amplitude. The microphone power source information can be obtained, for example, through a Controller Area Network (CAN) signal of the vehicle. The audio collector characteristics mainly include microphone arrangement information (a single microphone or a microphone array, where microphone arrays include linear arrays, planar arrays, and stereo arrays). The expert data set is a pre-collected batch of multi-speaker, multilingual, noise-free audio data whose content (the language of each piece of speech data) is recorded in advance and therefore known.
In addition, N different multilingual confidence weight sets, confidence weight set 1 through confidence weight set N, are randomly initialized; see, for example, confidence weight set 1 {Chinese: 0.80; English: 0.04; Korean: 0.06; Japanese: 0.05; German: 0.05}, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18}, and confidence weight set N {Chinese: 0.31; English: 0.09; Korean: 0.12; Japanese: 0.25; German: 0.23} illustrated in Figure 13.
The quasi-environment data set is input into the language identification model to obtain a multilingual initial confidence set ({Chinese: p1; English: p2; Korean: p3; Japanese: p4; German: p5} in Figure 13), and the initial confidence levels are multiplied by the N multilingual confidence weight sets 1 through N to obtain N recognition confidence sets. Since the content of the expert data set (the language of each piece of speech data) is known, the accuracy acc of each of the N recognition confidence sets can be calculated; the confidence weight set corresponding to the recognition confidence set with the highest accuracy is determined as the optimal confidence weight set, and the preset weight set is set to the values of that optimal confidence weight set, completing the initialization of the preset weight set. For example, in Figure 13, confidence weight set 2 {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18} corresponds to the highest accuracy (0.98); therefore, the preset weight set is set to {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18}.
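The Figure 13 procedure (score each candidate weight set by its labeled accuracy on the quasi-environment data, keep the best) can be sketched as follows; the sample data, language codes, and function names are invented for illustration:

```python
# Minimal sketch of the initialization: among N candidate weight sets,
# keep the one whose weighted predictions best match the known labels.
LANGS = ["zh", "en", "ko", "ja", "de"]

def accuracy(weight_set, samples):
    """Fraction of labeled samples whose weighted argmax is the true language."""
    hits = 0
    for initial_conf, true_lang in samples:
        scores = {l: initial_conf[l] * weight_set[l] for l in LANGS}
        if max(scores, key=scores.get) == true_lang:
            hits += 1
    return hits / len(samples)

def init_preset_weights(candidate_sets, samples):
    """Pick the candidate weight set with the highest labeled accuracy."""
    return max(candidate_sets, key=lambda w: accuracy(w, samples))

candidates = [
    {"zh": 1.0, "en": 1.0, "ko": 1.0, "ja": 1.0, "de": 1.0},  # uniform
    {"zh": 0.5, "en": 2.0, "ko": 1.0, "ja": 1.0, "de": 1.0},  # biased to English
]
samples = [  # (initial confidence set from the LID model, known true language)
    ({"zh": 0.6, "en": 0.3, "ko": 0.05, "ja": 0.03, "de": 0.02}, "zh"),
    ({"zh": 0.4, "en": 0.5, "ko": 0.05, "ja": 0.03, "de": 0.02}, "en"),
]
preset_weights = init_preset_weights(candidates, samples)
```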
With the above technical means, when the language identification model first comes into contact with the scene, the preset weight set is initialized according to the environmental characteristics and audio collector characteristics, thereby adjusting the recognition confidence of subsequent user input speech. The voice interaction system 100 can thus adapt to different scenes and perform language identification with recognition accuracy as close to optimal as possible, improving the reliability of the recognition results. That is, the above technical means can suppress the problem that a trained language identification model does not adapt well to a scene, resulting in low reliability of the recognition results.
In this embodiment, the preset weight set is initialized according to both the environmental characteristics and the audio collector characteristics; however, in other embodiments, the preset weight set may also be initialized according to only one of the two.
After the initialization of the preset weight set is completed, in step S212 of Figure 14, it is judged whether the preset weight of each language in the set preset weight set is within a weight range. The weight range is set in advance, and its specific values can be determined through testing, as will be described later.
When the preset weight of each language in the preset weight set set according to the quasi-environment data set is within the weight range, the recognition results of the language identification model are highly credible in that environment (the above environmental characteristics and audio collector characteristics). When a confidence weight set according to the quasi-environment data set is not within the weight range ("No" in step S212), the results of the language identification model in that environment are less credible. In that case, in this embodiment, as shown in steps S214 and S217 of Figure 14, the language of the user's input speech can be determined through the historical language records and the user-specified language.
Specifically, in step S214, it is judged whether a historical language record exists. When one does, the input speech in the history record is compared with the user's current input speech to obtain a feature similarity, from which the language of the user's current input speech is determined.
When there is no historical language record, in step S217, a query is made based on the voiceprint to judge whether the user has specified a language. When a user-specified language exists, the recognized language of the user's input speech is determined according to the user-specified language. There may be one or more user-specified languages. When there is only one, that language is determined as the recognized language of the input speech; when there are multiple, for example the language that appears most frequently may be determined as the recognized language of the input speech. For example, when the queried user-specified languages are {Chinese: 3 times; English: 1 time; German: 1 time}, Chinese is determined as the recognized language of the input speech. When there is no user-specified language, in step S219, the language of the input speech is determined according to the recognition result of the language identification model.
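The most-frequent rule in step S217 amounts to a simple majority count; the records below mirror the {Chinese: 3; English: 1; German: 1} example, with language codes assumed:

```python
# Minimal sketch of picking the most frequently specified language.
from collections import Counter

def pick_specified_language(records):
    """Return the language that appears most often among the specified languages."""
    return Counter(records).most_common(1)[0][0]

chosen = pick_specified_language(["zh", "zh", "zh", "en", "de"])
print(chosen)  # zh
```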
In Figure 14, step S214 is executed before step S217; however, there is no restriction on the execution order of the processing of determining the language through historical language records and the processing of determining the language through the user-specified language.
With the above technical means, when the preset weight set is not within the weight range, the language of the user's input speech is predicted according to historical language records or the user-specified language, which improves the reliability of the voice interaction system 100's prediction of the language of the input speech.
In addition, when the weight value of each language in the preset weight set set according to the quasi-environment data set is within the weight range ("Yes" in step S212) and the user's input speech is detected, in step S200, the language identification model is used to perform language identification on the input speech. In step S221, it is judged whether a recognition confidence greater than the threshold λ exists in the multilingual recognition confidence set obtained from the language identification model. When one exists ("Yes" in step S221), in step S222, the user's identity is determined through the voiceprint. In other embodiments, the user's identity may also be determined by means such as face recognition or iris recognition. Afterwards, in step S223, the user's historical language record and the language record of the current dialogue round are updated (that is, the current language is added to the records). Then, in step S225, the multilingual recognition confidence set is output to the command parsing and execution module 150 as the language identification result. Here, the current dialogue round refers to one period of continuously listening to (receiving) the user's input speech, for example the period from one power-on to power-off of the language identification system or the voice interaction system.
When no recognition confidence greater than the threshold λ exists in the multilingual recognition confidence set ("No" in step S221), the language confidence correction module 160 calls the audio-feature-based adjustment module 162 or the video-feature-based adjustment module 163 to correct the multilingual recognition confidence set. In this embodiment, the audio-feature-based adjustment module 162 is called first; when no recognition confidence greater than the threshold λ exists in the multilingual recognition confidence set corrected by the audio-feature-based adjustment module 162, the video-feature-based adjustment module 163 is then called for correction. In other embodiments, the video-feature-based adjustment module 163 may also be called first.
The processing performed by the audio-feature-based adjustment module 162 in this embodiment will be described below with reference to FIG. 15.
As shown in FIG. 15, in step S231, the user's identity is determined by voiceprint; in step S232, the user's historical language record is queried, and when a historical language record exists, the distribution of each language in the record is computed. For example, referring to the right-hand part of FIG. 15, when the historical language record is {Chinese: 8; English: 1; Korean: 0; Japanese: 1; German: 0}, the language distribution computed from it (which may also be described as normalizing the counts into weights) is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}. The time range covered by the historical language record can be set freely, for example the current dialogue turn, several days, several months, or longer.
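The count-to-distribution normalization of step S232 can be sketched as follows (the function and variable names are illustrative only, not from the patent):

```python
def language_distribution(history_counts):
    """Normalize per-language record counts into a distribution (step S232)."""
    total = sum(history_counts.values())
    if total == 0:
        return {lang: 0.0 for lang in history_counts}
    return {lang: count / total for lang, count in history_counts.items()}

# The example from FIG. 15: 8 Chinese, 1 English, 1 Japanese record.
counts = {"Chinese": 8, "English": 1, "Korean": 0, "Japanese": 1, "German": 0}
dist = language_distribution(counts)
# dist is {"Chinese": 0.8, "English": 0.1, "Korean": 0.0, "Japanese": 0.1, "German": 0.0}
```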
Here, as one approach, both the all-time historical language record and the historical language record of the current dialogue turn (abbreviated as the current dialogue-turn language record) may be pre-stored, and the processing in step S236 and the subsequent steps may be performed separately on each. When the results obtained from the two records conflict, the result obtained from the current dialogue-turn language record takes priority, on the consideration that it is relatively more credible. Alternatively, different weight values may be assigned to the two records for the computation in step S236 described below.
Then, in step S236, the multilingual recognition confidence set is corrected using the distribution of each language in the historical language record, yielding the corrected multilingual recognition confidence set (corresponding to the second recognition confidence in this application). For example, referring to the right-hand part of FIG. 15, when the initial multilingual confidence set is {Chinese: 0.7; English: 0.1; Korean: 0.1; Japanese: 0.05; German: 0.05}, the preset confidence weights are {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25}, and the language distribution is {Chinese: 0.8; English: 0.1; Korean: 0.0; Japanese: 0.1; German: 0.0}, the corrected recognition confidence set (after normalization) is computed as {Chinese: 0.973; English: 0.017; Korean: 0.000; Japanese: 0.010; German: 0.000}.
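The correction of step S236 can be sketched as follows. The patent does not state the exact combination rule; an elementwise product of initial confidence, preset weight, and history distribution followed by renormalization is assumed here, and it reproduces the FIG. 15 figures approximately (roughly 0.974 / 0.017 / 0.000 / 0.009 / 0.000):

```python
def correct_confidences(initial, preset_weights, distribution):
    """Step S236 sketch: combine initial confidences, preset weights, and the
    history-based language distribution, then renormalize. The elementwise
    product used here is an assumption, not stated in the patent."""
    raw = {lang: initial[lang] * preset_weights[lang] * distribution[lang]
           for lang in initial}
    total = sum(raw.values())
    return {lang: v / total for lang, v in raw.items()} if total > 0 else raw

initial = {"Chinese": 0.7, "English": 0.1, "Korean": 0.1, "Japanese": 0.05, "German": 0.05}
weights = {lang: 0.25 for lang in initial}
dist = {"Chinese": 0.8, "English": 0.1, "Korean": 0.0, "Japanese": 0.1, "German": 0.0}
corrected = correct_confidences(initial, weights, dist)
# Chinese is sharply boosted (≈ 0.974) and becomes the only candidate above λ.
```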
Then, in step S237, it is determined whether the corrected multilingual recognition confidence set contains a corrected confidence greater than the threshold λ. If so ("Yes" in step S237), then on the one hand, in step S239, the corrected multilingual recognition confidence set is output to the command parsing and execution module 150 as the language recognition result, and in addition the historical language record and the current dialogue-turn language record are updated (for the specific processing, see steps S222 and S223 in FIG. 14); on the other hand, in step S238, processing to adjust the confidence weights is performed. Specifically, the weights are revised according to the corrected recognition confidence set such that the confidence weight of the language whose corrected recognition confidence exceeds the threshold λ is increased relative to the confidence weights of the other languages. For example, referring to FIG. 15, the old confidence weight set {Chinese: 0.25; English: 0.25; Korean: 0.25; Japanese: 0.25; German: 0.25} is revised into a new confidence weight set (also called the corrected weight set) {Chinese: 0.29; English: 0.24; Korean: 0.24; Japanese: 0.24; German: 0.24}.
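The weight adjustment of step S238 can be sketched as follows; the rule of transferring a fixed step of 0.01 from each other language to the winning language is inferred from the single example in FIG. 15 and is an assumption:

```python
def adjust_weights(weights, winner, delta=0.01):
    """Step S238 sketch: increase the confidence weight of the language whose
    corrected confidence exceeded λ, at the expense of every other language.
    The per-language step size delta is inferred from FIG. 15 (an assumption)."""
    return {lang: w + delta * (len(weights) - 1) if lang == winner else w - delta
            for lang, w in weights.items()}

old = {"Chinese": 0.25, "English": 0.25, "Korean": 0.25, "Japanese": 0.25, "German": 0.25}
new = adjust_weights(old, "Chinese")
# Chinese ≈ 0.29, each other language ≈ 0.24; the total weight mass is unchanged.
```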
In step S271, it is determined whether each corrected weight in the corrected weight set is within the weight range. When the result of the determination is that the weights are within the weight range, in step S272, the preset weight set is updated with the values of the corrected weight set for use by the language recognition module 120 in subsequent language recognition, and the processing then ends. When the result of the determination is that the weights are not within the weight range, the preset weight set is not updated, and the processing ends.
In addition, when the result of the determination in step S235 is "No" or the result of the determination in step S237 is "No", the language confidence correction module 160 invokes the comprehensive adjustment module 164 for processing.
The processing of the comprehensive adjustment module 164 will now be described with reference to FIG. 16.
As shown in FIG. 16, in step S251, it is determined from the output of the speech recognition module 110 whether there is an ASR confidence greater than the threshold λ. If so, in step S252, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result. In this case, the command parsing and execution module 150 may determine the language corresponding to the ASR confidence that exceeds the threshold λ as the language of the input speech (which may be called the recognized language), and determine the candidate command corresponding to that language as the target command to be executed.
When the result of the determination in step S251 is "No", it is determined from the output of the semantic understanding module 140 whether there is an NLU confidence greater than the threshold λ. When an NLU confidence greater than the threshold λ exists, in step S254, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set to the command parsing and execution module 150 as the language recognition result. In this case, the command parsing and execution module 150 determines the language corresponding to the NLU confidence that exceeds the threshold λ as the language of the input speech, and determines the candidate command corresponding to that language as the target command to be executed.
When the result of the determination in step S253 is "No", it is determined in step S256, according to the voiceprint (user identity), whether a user-specified language exists. The user-specified language here is the system language of the voice interaction system 100 as set by the user. When a user-specified language exists, the multilingual recognition confidence set is corrected according to the user-specified language such that the recognition confidence of the user-specified language is increased relative to the recognition confidences of the other languages, thereby obtaining the corrected multilingual recognition confidences (corresponding to the second recognition confidence in this application). In addition, there may be multiple user-specified languages (the system-language history stored in the database may contain multiple languages). For example, referring to the right-hand part of FIG. 16, when the user-specified languages, i.e. the languages the user has previously set, include Chinese, English, and German, the old recognition confidence set {Chinese: 0.75; English: 0.12; Korean: 0.11; Japanese: 0.01; German: 0.01} is revised into the new recognition confidence set {Chinese: 0.95; English: 0.32; Korean: 0.11; Japanese: 0.01; German: 0.21}.
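The user-specified-language correction can be sketched as follows; the additive boost of 0.20 per specified language is inferred from the example in FIG. 16 and is an assumption, and note that the boosted set is not renormalized in that example:

```python
def boost_user_languages(confidences, user_languages, boost=0.20):
    """Sketch of the user-specified-language correction: raise the confidence
    of every language the user has previously set as the system language.
    The additive boost of 0.20 is inferred from FIG. 16 (an assumption)."""
    return {lang: c + boost if lang in user_languages else c
            for lang, c in confidences.items()}

old = {"Chinese": 0.75, "English": 0.12, "Korean": 0.11, "Japanese": 0.01, "German": 0.01}
new = boost_user_languages(old, {"Chinese", "English", "German"})
# Chinese ≈ 0.95, English ≈ 0.32, German ≈ 0.21; Korean and Japanese are unchanged.
```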
Here, the user-specified language is one example of a user operation record. Other examples include the languages of songs the user has played in the past.
In addition, optionally, after the recognized language of the input speech has been determined according to the ASR confidence or the NLU confidence, the multilingual preset weight set may be updated for use in language recognition of subsequent input speech. The update is performed in the same manner as described with reference to FIG. 14 and is not repeated here.
Then, in step S259, it is determined whether the multilingual recognition confidence set corrected in step S258 contains a recognition confidence greater than the threshold λ. If so, in step S261, the corrected multilingual recognition confidence set is output to the command parsing and execution module as the language recognition result. Then, in step S262, the multilingual preset confidence weight set is adjusted. The adjustment method is the same as that described above and is not repeated here. In addition, as above, when each corrected weight in the corrected weight set is within the weight range, the multilingual preset confidence weight set is updated.
In addition, when the result of the determination in step S256 is "No", i.e. no user-specified language exists, or when the result of the determination in step S259 is "No", i.e. no recognition confidence greater than the threshold λ exists, it is determined in step S264, according to the user identity, whether a historical language record of the user exists. When the result of the determination is that a historical language record of the user exists, the user's current input speech is compared with the input speech in the historical language record to obtain feature similarities, and the language closest to the unknown input speech is found according to the feature similarities.
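The feature-similarity matching in this step might look as follows; the patent only says that feature similarities between the current and historical input speech are computed, so the choice of cosine similarity over generic feature vectors is entirely our assumption:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def closest_language(query_features, history):
    """Sketch: find the stored historical utterance most similar to the
    current input and return its language. `history` maps a language to a
    list of stored feature vectors; the representation is hypothetical."""
    best_lang, best_sim = None, -1.0
    for language, vectors in history.items():
        for vec in vectors:
            sim = cosine(query_features, vec)
            if sim > best_sim:
                best_lang, best_sim = language, sim
    return best_lang, best_sim

# Toy feature vectors standing in for real acoustic features.
history = {"Chinese": [[1.0, 0.0, 0.0]], "English": [[0.0, 1.0, 0.0]]}
lang, sim = closest_language([0.9, 0.1, 0.0], history)
# The query is far closer to the stored Chinese utterance.
```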
In addition, when the result of the determination in step S264 is "No", i.e. no historical language record of the user exists, the comprehensive adjustment module 164 outputs the multilingual recognition confidence set directly to the command parsing and execution module 150 as the language recognition result. In this case, the command parsing and execution module 150 may conclude that the language of the input speech cannot be recognized, and may feed this back to the user, for example by playing a voice prompt.
With this embodiment as described above, the multilingual recognition confidence set is adjusted according to user characteristics, including the historical language record, the user-specified language, and so on. This improves the accuracy with which the voice interaction system 100 predicts the input speech, and increases the user's trust in the intelligence of the voice interaction system 100.
The "weight range" has been mentioned in the description above: when the real-time scene adaptation module 161 initializes the preset weight set, it is determined whether the initialized preset weights are within the weight range; likewise, when the audio-feature-based adjustment module 162, the video-feature-based adjustment module 163, or the comprehensive adjustment module 164 intends to update the preset weights, it is also determined whether the preset weights are within the weight range. The "weight range" thus reflects the range of robustness of the model itself.
This embodiment also provides a method for setting the "weight range". The method is carried out, for example, in the testing phase before the voice interaction system 100 leaves the factory; it may also be carried out in an offline verification phase after the system leaves the factory.
The method mainly includes the following steps:

① Collect language data sets data_1, data_2, …, data_n for different scenarios; the content of each language data set is known in advance;

② Input the language data set data_i (i ∈ [1, n]) for one scenario into the language recognition model to test the model, and obtain the optimal confidence weights t_i for each language in that scenario (that is, the confidence weights corresponding to the highest recognition accuracy);

③ Input every language data set into the model and perform step ② for each, obtaining the optimal confidence weight set for each language, T_k = {t_1k, t_2k, …, t_nk} (k ∈ [1, m]);

④ Obtain the optimal confidence weight range for each language, F_k = [a_k, b_k], where a_k = min(T_k) and b_k = max(T_k).

Note: n is the number of language data sets, and m is the number of languages.

Here, the language data sets data_1, data_2, …, data_n correspond to the test data sets in this application.
For the above method, this embodiment provides an implementation as shown in FIG. 10, though the method is not limited to this implementation. The implementation will be described below with reference to FIG. 10. First, k different confidence weight sets are randomly initialized; these correspond to the multiple test weight sets in this application. Next, the language data set data_i for one scenario is input into the language recognition model to obtain its output, which is multiplied elementwise by each of the k confidence weight sets (the sets here can be understood as matrices) and then normalized, yielding the corrected multilingual language confidence sets. Then, from the language confidence sets and the known content of the language data set data_i, the accuracy of each confidence weight set is obtained (acc in the figure); the confidence weight set with the highest accuracy (for example, {Chinese: 0.21; English: 0.19; Korean: 0.22; Japanese: 0.20; German: 0.18} in the figure) is the optimal confidence weight set for the language data set data_i (i.e., scenario i).
Finally, by repeating the above processing for the language data set of each scenario, the confidence weight set of each individual language can be obtained, along with the confidence weight range of each individual language, for example the ranges shown in the figure: Chinese [c_a, c_b], English [e_a, e_b], Korean [h_a, h_b], Japanese [r_a, r_b], and German [d_a, d_b].
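The procedure of FIG. 10 and steps ① to ④ can be sketched as follows; the scoring rule inside `best_weight_set` (multiply the model's confidences by the candidate weights, normalize, take the arg max) follows the description of FIG. 10, while the function names and data layout are ours:

```python
def best_weight_set(model_outputs, labels, candidate_weight_sets):
    """Steps ①-③ sketch: score each candidate confidence weight set by the
    recognition accuracy it achieves on one scenario's labelled data set,
    and keep the best-scoring set for that scenario."""
    def accuracy(weights):
        correct = 0
        for confidences, true_lang in zip(model_outputs, labels):
            scored = {lang: confidences[lang] * weights[lang] for lang in weights}
            total = sum(scored.values())
            normed = {lang: v / total for lang, v in scored.items()}
            if max(normed, key=normed.get) == true_lang:
                correct += 1
        return correct / len(labels)
    return max(candidate_weight_sets, key=accuracy)

def weight_ranges(per_scenario_best):
    """Step ④: per-language weight range F_k = [min(T_k), max(T_k)] over
    the best weight sets found for all scenarios."""
    langs = per_scenario_best[0].keys()
    return {lang: (min(w[lang] for w in per_scenario_best),
                   max(w[lang] for w in per_scenario_best))
            for lang in langs}

# Toy illustration with one labelled utterance and two candidate weight sets.
outputs = [{"Chinese": 0.6, "English": 0.4}]
labels = ["English"]
candidates = [{"Chinese": 0.5, "English": 0.5}, {"Chinese": 0.2, "English": 0.8}]
best = best_weight_set(outputs, labels, candidates)
ranges = weight_ranges([{"Chinese": 0.21, "English": 0.19},
                        {"Chinese": 0.25, "English": 0.18}])
```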
With the above technical means, before the language recognition model is used for language recognition, the model is tested on a large number of language data sets to set the weight range of the multilingual preset weight set. This specifies the range of robustness of the language recognition model and keeps the model operating within that range, thereby ensuring the reliability of the language recognition results.
Note that the above are only preferred embodiments of this application and the technical principles applied. Those skilled in the art will understand that this application is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of this application. Therefore, although this application has been described in some detail through the above embodiments, it is not limited to them and may include further equivalent embodiments without departing from the concept of this application, all of which fall within the protection scope of this application.

Claims (47)

  1. A speech processing method, characterized by comprising:
    acquiring input speech information of a user;
    determining, according to the input speech information, a plurality of first confidence levels corresponding to the input speech information, the plurality of first confidence levels respectively corresponding to a plurality of languages;
    correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user characteristic of the user; and
    determining the language of the input speech information according to the plurality of second confidence levels.
  2. The speech processing method according to claim 1, wherein the correcting the plurality of first confidence levels into a plurality of second confidence levels according to a user characteristic of the user specifically comprises:
    when the plurality of first confidence levels are less than a first threshold, correcting the plurality of first confidence levels into the plurality of second confidence levels according to the user characteristic.
  3. The speech processing method according to claim 1 or 2, wherein the user characteristic comprises one or more of a historical language record and a user-specified language.
  4. The speech processing method according to claim 3, wherein the historical language record and the user-specified language are obtained by querying according to a voiceprint feature of the input speech information.
  5. The speech processing method according to any one of claims 1-4, wherein the plurality of first confidence levels are determined from a plurality of initial confidence levels and a plurality of preset weights; and
    the speech processing method further comprises: updating the plurality of preset weights according to the plurality of second confidence levels.
  6. The speech processing method according to claim 5, wherein the updating the plurality of preset weights according to the plurality of second confidence levels specifically comprises:
    when a second confidence level greater than a first threshold exists among the plurality of second confidence levels, updating the plurality of preset weights according to the plurality of second confidence levels.
  7. The speech processing method according to any one of claims 1-6, further comprising: determining semantics of the input speech information according to the input speech information and the language of the input speech information.
  8. The speech processing method according to any one of claims 1-7, wherein the plurality of languages are preset.
  9. The speech processing method according to any one of claims 1-8, wherein the plurality of first confidence levels are determined from a plurality of initial confidence levels and a plurality of preset weights; and
    the speech processing method further comprises: before acquiring the input speech information of the user, setting the plurality of preset weights according to a scene characteristic.
  10. The speech processing method according to claim 9, wherein the scene characteristic comprises an environment characteristic and/or an audio collector characteristic.
  11. The speech processing method according to claim 10, wherein the environment characteristic comprises one or more of an environmental signal-to-noise ratio, power supply DC/AC information, or an environmental vibration amplitude, and the audio collector characteristic comprises microphone arrangement information.
  12. The speech processing method according to any one of claims 9-11, wherein the setting the plurality of preset weights according to a scene characteristic specifically comprises:
    acquiring pre-collected first speech data and pre-recorded first language information of the first speech data;
    determining second speech data according to the first speech data and the scene characteristic;
    determining second language information of the second speech data according to the second speech data; and
    setting the plurality of preset weights according to the first language information and the second language information.
  13. The speech processing method according to claim 12, wherein the determining second language information of the second speech data according to the second speech data specifically comprises:
    acquiring a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; and
    determining a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; and
    the setting the plurality of preset weights according to the first language information and the second language information specifically comprises:
    determining a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and
    setting the plurality of preset weights according to the test weight group corresponding to the piece of second language information with the highest accuracy rate.
  14. The speech processing method according to any one of claims 9-13, wherein the setting the plurality of preset weights specifically comprises: setting the plurality of preset weights within a weight range.
  15. The speech processing method according to claim 5 or 6, wherein the updating the plurality of preset weights specifically comprises: updating the plurality of preset weights within a weight range.
  16. The speech processing method according to claim 14 or 15, wherein the weight range is determined as follows:
    acquiring a plurality of pre-collected test speech data groups and pre-recorded first language information of the plurality of test speech data groups, any one of the plurality of test speech data groups comprising a plurality of pieces of test speech data;
    acquiring a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; and
    determining the weight range according to the plurality of test speech data groups, the first language information, and the plurality of test weight groups.
  17. A speech processing method, characterized by comprising:
    acquiring input speech information of a user;
    determining, according to the input speech information, a plurality of third confidence levels corresponding to the input speech information, the plurality of third confidence levels respectively corresponding to a plurality of languages;
    correcting the plurality of third confidence levels into a plurality of fourth confidence levels according to a scene characteristic; and
    determining the language of the input speech information according to the plurality of fourth confidence levels.
  18. The speech processing method according to claim 17, wherein the scene characteristic comprises an environment characteristic and/or an audio collector characteristic.
  19. The speech processing method according to claim 17 or 18, wherein the environment characteristic comprises one or more of an environmental signal-to-noise ratio, power supply DC/AC information, or an environmental vibration amplitude, and the audio collector characteristic comprises microphone arrangement information.
  20. The speech processing method according to any one of claims 17-19, wherein the correcting the plurality of third confidence levels into a plurality of fourth confidence levels according to a scene characteristic specifically comprises:
    setting a plurality of preset weights according to the scene characteristic; and
    correcting the plurality of third confidence levels into the plurality of fourth confidence levels according to the plurality of preset weights.
  21. The speech processing method according to claim 20, wherein the setting the plurality of preset weights according to the scene characteristic specifically comprises:
    acquiring pre-collected first speech data and pre-recorded first language information of the first speech data;
    determining second speech data according to the first speech data and the scene characteristic;
    determining second language information of the second speech data according to the second speech data; and
    setting the plurality of preset weights according to the first language information and the second language information.
  22. The speech processing method according to claim 21, wherein the determining second language information of the second speech data according to the second speech data specifically comprises:
    acquiring a plurality of test weight groups, each of the test weight groups comprising a plurality of test weights; and
    determining a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; and
    the setting the plurality of preset weights according to the first language information and the second language information specifically comprises:
    determining a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and
    setting the plurality of preset weights according to the test weight group corresponding to the piece of second language information with the highest accuracy rate.
  23. 一种语音处理装置,其特征在于,包括处理模块与收发模块,A voice processing device, characterized in that it includes a processing module and a transceiver module,
    所述收发模块用于获取用户的输入语音信息;The transceiver module is used to obtain the input voice information of the user;
    所述处理模块用于,根据所述输入语音信息,确定所述输入语音信息对应的多个第一置信度,所述多个第一置信度分别对应于多个语种,The processing module is configured to, according to the input voice information, determine a plurality of first confidence levels corresponding to the input voice information, the plurality of first confidence levels respectively corresponding to multiple languages,
    所述处理模块还用于,根据所述用户的用户特征修正所述多个第一置信度为多个第二置信度,根据所述多个第二置信度,确定所述输入语音信息的语种。The processing module is further configured to correct the plurality of first confidence levels into a plurality of second confidence levels according to the user characteristics of the user, and determine the language of the input voice information according to the plurality of second confidence levels .
  24. The speech processing apparatus according to claim 23, wherein the processing module is specifically configured to correct the plurality of first confidence levels into the plurality of second confidence levels according to the user characteristics when the plurality of first confidence levels are less than a first threshold.
  25. The speech processing apparatus according to claim 23 or 24, wherein the user characteristics comprise one or more of a historical language record and a user-specified language.
  26. The speech processing apparatus according to claim 25, wherein the historical language record and the user-specified language are obtained by querying according to a voiceprint feature of the input speech information.
  27. The speech processing apparatus according to any one of claims 23-26, wherein the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights; and
    the processing module is further configured to update the plurality of preset weights according to the plurality of second confidence levels.
  28. The speech processing apparatus according to claim 27, wherein the processing module is specifically configured to update the plurality of preset weights according to the plurality of second confidence levels when a second confidence level greater than a first threshold exists among the plurality of second confidence levels.
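The threshold-gated update of claims 27-28 can be sketched as below. The threshold value, the learning rate, and the rule "reinforce only the winning language" are illustrative assumptions; the claims specify only that the preset weights are updated from the second confidence levels once one of them exceeds the first threshold.

```python
# Illustrative sketch of claims 27-28: only when some corrected (second)
# confidence clears the threshold is the recognition trusted enough to
# nudge the preset weights toward the recognized language.

FIRST_THRESHOLD = 0.8  # assumed value for the "first threshold"

def update_preset_weights(preset_weights, second_confidences, lr=0.1):
    """preset_weights / second_confidences: {language: value}.
    Returns updated weights; originals if no confidence passes the bar."""
    if max(second_confidences.values()) <= FIRST_THRESHOLD:
        return dict(preset_weights)   # claim 28: no confident result, no update
    winner = max(second_confidences, key=second_confidences.get)
    updated = dict(preset_weights)
    updated[winner] += lr             # reinforce the recognized language
    return updated
```

A confident Chinese result raises the Chinese weight; an ambiguous result leaves all weights untouched.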
  29. The speech processing apparatus according to any one of claims 23-28, wherein the processing module is further configured to determine the semantics of the input speech information according to the input speech information and the language of the input speech information.
  30. The speech processing apparatus according to any one of claims 23-29, wherein the plurality of languages are preset.
  31. The speech processing apparatus according to any one of claims 23-30, wherein the plurality of first confidence levels are determined by a plurality of initial confidence levels and a plurality of preset weights; and
    the processing module is further configured to set the plurality of preset weights according to scene characteristics before the input speech information of the user is acquired.
  32. The speech processing apparatus according to claim 31, wherein the scene characteristics comprise environment characteristics and/or audio collector characteristics.
  33. The speech processing apparatus according to claim 32, wherein the environment characteristics comprise one or more of an environmental signal-to-noise ratio, power supply direct-current/alternating-current information, and an environmental vibration amplitude, and the audio collector characteristics comprise microphone arrangement information.
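One way the scene-dependent weight setting of claims 31-33 could look is sketched below. The specific rule "flatten the weights toward 1.0 in low-SNR scenes" is purely an assumption for illustration; the claims state only that the preset weights depend on scene characteristics such as SNR and microphone arrangement.

```python
# Hypothetical sketch of claims 31-33: derive the preset per-language
# weights from scene characteristics before any speech is acquired.

def set_preset_weights(base_weights, scene):
    """base_weights: {language: weight};
    scene: e.g. {"snr_db": 12.0, "mic_count": 2} (illustrative keys)."""
    weights = dict(base_weights)
    if scene.get("snr_db", 30.0) < 15.0:
        # Low SNR (assumed rule): pull all weights halfway toward 1.0 so no
        # language dominates on the basis of noise-sensitive acoustic cues.
        weights = {lang: 1.0 + (w - 1.0) * 0.5 for lang, w in weights.items()}
    return weights
```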
  34. The speech processing apparatus according to any one of claims 31-33, wherein the processing module is specifically configured to: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene characteristics; determine second language information of the second speech data according to the second speech data; and set the plurality of preset weights according to the first language information and the second language information.
  35. The speech processing apparatus according to claim 34, wherein the processing module is specifically configured to: acquire a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; determine a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; determine a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and set the plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  36. The speech processing apparatus according to any one of claims 31-35, wherein the processing module is specifically configured to set the plurality of preset weights within a weight range.
  37. The speech processing apparatus according to claim 27 or 28, wherein the processing module is specifically configured to update the plurality of preset weights within a weight range.
  38. The speech processing apparatus according to claim 36 or 37, wherein the weight range is determined in the following manner:
    acquiring a plurality of pre-collected test speech data groups and pre-recorded first language information of the plurality of test speech data groups, any one of the plurality of test speech data groups comprising a plurality of pieces of test speech data;
    acquiring a plurality of test weight groups, any one of the plurality of test weight groups comprising a plurality of test weights; and
    determining the weight range according to the plurality of test speech data groups, the first language information, and the plurality of test weight groups.
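The weight-range determination of claim 38 can be sketched as follows. The accuracy bar (0.9) and the decision to take the min/max over all weights of the qualifying groups are illustrative assumptions; the claim only requires that the range be derived from the test speech data groups, the recorded language information, and the test weight groups.

```python
# Sketch of claim 38 under assumptions: score every test weight group on
# the test speech data groups (e.g. as in claim 22), keep the groups whose
# accuracy clears a bar, and take the extremes of their weights as the
# permitted weight range.

def determine_weight_range(accuracies_per_group, test_weight_groups, bar=0.9):
    """accuracies_per_group[i]: accuracy achieved by test_weight_groups[i]
    against the pre-recorded first language information."""
    good = [g for g, acc in zip(test_weight_groups, accuracies_per_group)
            if acc >= bar]
    flat = [w for group in good for w in group]
    return (min(flat), max(flat)) if flat else None
```

Later weight setting (claim 36) and weight updates (claim 37) would then be clamped to this range.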
  39. A speech processing apparatus, characterized in that it comprises a processing module and a transceiver module, wherein
    the transceiver module is configured to acquire input speech information of a user;
    the processing module is configured to determine, according to the input speech information, a plurality of third confidence levels corresponding to the input speech information, the plurality of third confidence levels respectively corresponding to a plurality of languages; and
    the processing module is further configured to correct the plurality of third confidence levels into a plurality of fourth confidence levels according to scene characteristics, and to determine the language of the input speech information according to the plurality of fourth confidence levels.
  40. The speech processing apparatus according to claim 39, wherein the scene characteristics comprise environment characteristics and/or audio collector characteristics.
  41. The speech processing apparatus according to claim 39 or 40, wherein the environment characteristics comprise one or more of an environmental signal-to-noise ratio, power supply direct-current/alternating-current information, and an environmental vibration amplitude, and the audio collector characteristics comprise microphone arrangement information.
  42. The speech processing apparatus according to any one of claims 39-41, wherein the processing module is specifically configured to set a plurality of preset weights according to the scene characteristics, and to correct the plurality of third confidence levels into the plurality of fourth confidence levels according to the plurality of preset weights.
  43. The speech processing apparatus according to claim 42, wherein the processing module is specifically configured to: acquire pre-collected first speech data and pre-recorded first language information of the first speech data; determine second speech data according to the first speech data and the scene characteristics; determine second language information of the second speech data according to the second speech data; and set the plurality of preset weights according to the first language information and the second language information.
  44. The speech processing apparatus according to claim 43, wherein the processing module is specifically configured to: acquire a plurality of test weight groups, each of the test weight groups comprising a plurality of test weights; determine a plurality of pieces of the second language information according to the second speech data and the plurality of test weight groups, the plurality of pieces of second language information respectively corresponding to the plurality of test weight groups; determine a plurality of accuracy rates of the plurality of pieces of second language information according to the first language information and the plurality of pieces of second language information; and set the plurality of preset weights according to the test weight group corresponding to the second language information with the highest accuracy rate.
  45. A computing device, characterized in that it comprises a processor and a memory, the memory storing computer program instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1-22.
  46. A computer-readable storage medium, characterized in that it stores computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1-22.
  47. A computer program product, characterized in that it comprises computer program instructions which, when executed by a computer, cause the computer to perform the method according to any one of claims 1-22.
PCT/CN2021/101400 2021-06-22 2021-06-22 Speech processing method and apparatus, and system WO2022266825A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180001914.8A CN113597641A (en) 2021-06-22 2021-06-22 Voice processing method, device and system
PCT/CN2021/101400 WO2022266825A1 (en) 2021-06-22 2021-06-22 Speech processing method and apparatus, and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/101400 WO2022266825A1 (en) 2021-06-22 2021-06-22 Speech processing method and apparatus, and system

Publications (1)

Publication Number Publication Date
WO2022266825A1 true WO2022266825A1 (en) 2022-12-29

Family

ID=78242898

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/101400 WO2022266825A1 (en) 2021-06-22 2021-06-22 Speech processing method and apparatus, and system

Country Status (2)

Country Link
CN (1) CN113597641A (en)
WO (1) WO2022266825A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004004953A (en) * 2003-07-30 2004-01-08 Matsushita Electric Ind Co Ltd Voice synthesizer and voice synthetic method
CN107832286A (en) * 2017-09-11 2018-03-23 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN109522564A (en) * 2018-12-17 2019-03-26 北京百度网讯科技有限公司 Voice translation method and device
CN110085210A (en) * 2019-03-15 2019-08-02 平安科技(深圳)有限公司 Interactive information test method, device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104681023A (en) * 2015-02-15 2015-06-03 联想(北京)有限公司 Information processing method and electronic equipment
CN108172212B (en) * 2017-12-25 2020-09-11 横琴国际知识产权交易中心有限公司 Confidence-based speech language identification method and system
US20210365641A1 (en) * 2018-06-12 2021-11-25 Langogo Technology Co., Ltd Speech recognition and translation method and translation apparatus
CN112185348B (en) * 2020-10-19 2024-05-03 平安科技(深圳)有限公司 Multilingual voice recognition method and device and electronic equipment

Also Published As

Publication number Publication date
CN113597641A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
US11676575B2 (en) On-device learning in a hybrid speech processing system
US9953648B2 (en) Electronic device and method for controlling the same
US10332513B1 (en) Voice enablement and disablement of speech processing functionality
US9305569B2 (en) Dialogue system and method for responding to multimodal input using calculated situation adaptability
US20230186912A1 (en) Speech recognition method, apparatus and device, and storage medium
JP2022549238A (en) Semantic understanding model training method, apparatus, electronic device and computer program
US8543399B2 (en) Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms
CN111028827A (en) Interaction processing method, device, equipment and storage medium based on emotion recognition
US10685664B1 (en) Analyzing noise levels to determine usability of microphones
US11574637B1 (en) Spoken language understanding models
KR20160132748A (en) Electronic apparatus and the controlling method thereof
US11756551B2 (en) System and method for producing metadata of an audio signal
US20200162911A1 (en) ELECTRONIC APPARATUS AND WiFi CONNECTING METHOD THEREOF
KR20210095431A (en) Electronic device and control method thereof
CN114925163A (en) Intelligent equipment and intention recognition model training method
WO2022266825A1 (en) Speech processing method and apparatus, and system
CN115083412B (en) Voice interaction method and related device, electronic equipment and storage medium
CN115132195B (en) Voice wakeup method, device, equipment, storage medium and program product
US11664018B2 (en) Dialogue system, dialogue processing method
KR20140035164A (en) Method operating of speech recognition system
CN117708305B (en) Dialogue processing method and system for response robot
US20100292988A1 (en) System and method for speech recognition
US20240212681A1 (en) Voice recognition device having barge-in function and method thereof
US11527247B2 (en) Computing device and method of operating the same
US11893996B1 (en) Supplemental content output

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21946327

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21946327

Country of ref document: EP

Kind code of ref document: A1