WO2020043040A1 - Speech recognition method and device - Google Patents

Speech recognition method and device

Info

Publication number
WO2020043040A1
Authority
WO
WIPO (PCT)
Prior art keywords
dialect
speech
data
confidence
speech recognition
Prior art date
Application number
PCT/CN2019/102485
Other languages
English (en)
French (fr)
Inventor
薛少飞
Original Assignee
Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Limited
Publication of WO2020043040A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/183 Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L15/187 Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/26 Speech to text systems
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing

Definitions

  • The present application belongs to the technical field of speech recognition, and particularly relates to a speech recognition method and device.
  • Mode 1) Prompt the user to select the language type to use, and then switch to the speech recognition model for the selected language type;
  • Mode 2) The machine first analyzes the speech data to determine the language type used by the user, and then switches to the speech recognition model corresponding to the language type the machine determined.
  • Mode 1) requires the user to actively select the language type, so the interaction cannot be imperceptible to the user and the user experience is poor;
  • In Mode 2), the choice of speech model depends on a one-time machine decision on the speech data; if the language type determined by the machine is wrong, the accuracy of all subsequent speech recognition is directly affected.
  • the purpose of this application is to provide a speech recognition method and device to improve the accuracy of speech recognition.
  • the present application provides a speech recognition method and device that are implemented as follows:
  • A speech recognition method includes: acquiring voice data; recognizing the voice data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple language dialects; determining, through a scene correlation discrimination model, relevance values indicating how well the recognition results for the multiple dialects fit the target scene; and
  • performing a fusion judgment on the confidence and relevance values to determine the dialect recognition result of the voice data.
  • a speech recognition device includes a sound collector and a processor, the sound collector is coupled with the processor, wherein:
  • the sound collector is used to collect voice data
  • the processor is configured to recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple language dialects; to determine, through a scene correlation discrimination model, relevance values indicating how well the recognition results for the multiple dialects fit the target scene; and to perform a fusion judgment on the confidence and relevance values to determine the dialect recognition result of the voice data.
  • a ticket vending machine for a subway station includes a sound collector and a processor, and the sound collector is coupled with the processor, wherein:
  • the sound collector is used to collect voice data
  • the processor is configured to recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple language dialects; to determine, through a scene correlation discrimination model, relevance values indicating how well the recognition results for the multiple dialects fit the target scene; and to perform a fusion judgment on the confidence and relevance values to determine the dialect recognition result of the voice data.
  • Another speech recognition method includes: acquiring voice data; recognizing the voice data through multiple speech recognition models to obtain multiple speech recognition results; determining the credibility of each speech recognition result; and
  • using the most credible speech recognition result as the recognition result of the voice data.
  • a computer-readable storage medium stores computer instructions thereon, the steps of the above method being implemented when the instructions are executed.
  • The speech recognition method and device provided by the present application recognize speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple language dialects, use a scene correlation discrimination model to determine how relevant each dialect's recognition result is to the target scene, and then
  • perform a fusion judgment on the confidence and relevance values to determine the dialect recognition result of the speech data. Because both dialect discrimination and scene discrimination are employed, the accuracy of the dialect determination is improved. This solves the existing problem that an incorrectly selected recognition model leads to erroneous subsequent speech recognition results, achieving the technical effect of effectively improving the accuracy of speech recognition results.
  • FIG. 1 is a schematic structural diagram of a voice recognition device provided by the present application.
  • FIG. 2 is a schematic diagram of a self-service ticket purchasing machine using the voice recognition device provided by the present application.
  • FIG. 3 is a schematic diagram of the connection between a sound collector and a processor in a self-service ticket purchasing machine using the voice recognition device provided by the present application.
  • FIG. 4 is a schematic diagram of a discrimination process of a voice recognition device provided by the present application.
  • FIG. 5 is a schematic diagram of a discrimination process of a voice recognition device provided by the present application.
  • FIG. 6 is a schematic diagram of interception of inspection data by a voice recognition device provided by the present application.
  • FIG. 7 is a schematic flowchart of steps of a speech recognition method provided by the present application.
  • FIG. 8 is a schematic flowchart of another step of the speech recognition method provided by the present application.
  • FIG. 9 is a schematic structural diagram of a voice recognition device provided by the present application.
  • Existing speech recognition methods often first determine the language type of the user's speech data, and then select the speech recognition model for that language type based on the discrimination result to obtain the final speech recognition result.
  • Because speech recognition is performed with the model corresponding to the determined language type, if the determined language type is wrong, the accuracy of the subsequent speech recognition results will be very low.
  • For example, if the system, when judging the language type of the user's speech data, misjudges the user's Shanghai dialect as Suzhou dialect, the subsequent speech recognition results obtained through the Suzhou-dialect speech recognition model usually have a low accuracy rate and a large error.
  • an embodiment of the present application provides a voice recognition device, and the device may include a sound collector and a processor.
  • the sound collector and the processor can be integrated together; they can also be independent of each other and coupled by wired or wireless means for data transmission.
  • the above voice recognition device can be specifically set and applied in various interactive application scenarios such as subway self-service ticket purchase, smart navigation, smart shopping, smart home, elderly care and so on.
  • The system may be a device built into a physical machine for a given application scenario, such as a self-service ticket machine, a care robot, or a navigator. It may also be a program or module that calls the relevant functional units of an existing device, such as an app installed on a mobile phone.
  • In the corresponding application scenario, the user's voice data can be collected and, together with the speech recognition results, discriminated to accurately determine the instruction the voice data corresponds to, and the instruction is then executed, for example completing a passenger's ticket purchase.
  • this application is not limited.
  • the following takes a speech recognition device applied in a subway self-service ticket purchase scenario as an example for detailed description.
  • The sound collector of the system may be a microphone or a similar sound-pickup device.
  • the system's sound collector can be set in a self-service ticket machine at a subway station to collect passenger voice data.
  • the above-mentioned sound collector can usually be in a standby state.
  • The user can select the voice input icon or symbol in the display interface of the self-service ticket machine shown in FIG. 2 to trigger the sound collector into the working state to collect the user's voice data.
  • It is also possible to detect keywords automatically and start collecting voice data when a keyword (such as "buy a ticket") is detected.
  • it can also be combined with intelligent identification technology to determine whether passengers have the willingness to buy tickets and whether they have triggered a voice ticket purchase process.
  • A passenger can switch to the voice input mode by clicking the voice input symbol in the display interface of the self-service ticket machine, thereby triggering the sound collector in the machine to enter the working state, collect the passenger's voice data, and send the collected voice data to the processor for further analysis and processing.
  • the system may further include a noise reduction device, such as a noise filter.
  • One end of the noise reduction device can be coupled with a sound collector, and the other end can be coupled with a processor, so that the voice data collected by the sound collector can be processed for noise reduction before being sent to the processor.
  • the processor may be a single server, a server cluster, a cloud processor, or the like.
  • the specific mode may be selected according to actual needs.
  • The above-mentioned processor may be built into the self-service ticket machine and receive the voice data collected by the sound collector through its connection with the collector.
  • The processor can also be a shared server, for example a cloud server: the sound collectors of different self-service ticket machines are coupled to the server by wired or wireless means. Specifically, as shown in FIG. 3, the sound collectors installed in different self-service ticket machines can be connected to the processor over TCP/IP to transmit the collected voice data to the processor.
  • The processor may be provided with dialect speech recognition model components for multiple language types, for example a Shanghai dialect recognition model component, a Suzhou dialect recognition model component, a Tianjin dialect recognition model component, a Mandarin recognition model component, a Cantonese recognition model component, and so on.
  • The speech data can then be recognized by the Shanghai dialect, Suzhou dialect, Tianjin dialect, Mandarin, and Cantonese recognition model components respectively, to obtain each dialect model's speech recognition result and confidence value.
  • The speech recognition result of each dialect model can then be judged against the target scene to determine which recognition result is more relevant to the scene. Based on the confidence and the relevance, the dialect recognition result of the speech data is determined. Specifically, the likelihood of each dialect can be scored from the confidence and relevance values, and the recognition result with the highest score taken as the final speech recognition result.
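  • The parallel recognition and highest-score selection described above can be sketched as follows. This is a minimal illustration: the recognizer factory, the dialect names, and the fixed confidence values are hypothetical stand-ins for the patent's trained model components.

```python
from concurrent.futures import ThreadPoolExecutor

def make_recognizer(dialect, transcript, confidence):
    # Hypothetical stand-in for a dialect speech recognition model component.
    def recognize(audio):
        return {"dialect": dialect, "text": transcript, "confidence": confidence}
    return recognize

recognizers = [
    make_recognizer("Mandarin", "I want to buy a subway ticket", 0.82),
    make_recognizer("Shanghainese", "I want to buy a subway ticket", 0.64),
    make_recognizer("Cantonese", "I want two tickets", 0.31),
]

def recognize_in_parallel(audio, recognizers):
    # Run every dialect model component on the same audio concurrently.
    with ThreadPoolExecutor(max_workers=len(recognizers)) as pool:
        return list(pool.map(lambda r: r(audio), recognizers))

results = recognize_in_parallel(b"<pcm bytes>", recognizers)
best = max(results, key=lambda r: r["confidence"])
```

In a full implementation the scene-relevance score would be fused with the confidence before taking the maximum, as described below in the specification.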
  • The processor may further determine the dialect type of the voice data using a pre-trained voice classification model, obtaining a discrimination score for each of the dialect types to which the voice data may belong (that is, a dialect-type discrimination result based on the speech data itself). For example, take the self-service ticket machine of a Shanghai metro station. Considering that the widely used local language types in Shanghai are Mandarin and Shanghai dialect, after receiving the voice data the processor can use the above voice classification model to obtain one discrimination score that the voice data belongs to Mandarin and another that it belongs to Shanghai dialect.
  • In existing schemes, the language type of the speech data is determined directly from these discrimination scores, and only the speech recognition model for the determined language type is then used to obtain the final speech recognition result.
  • However, determining the language type solely from the discrimination scores of the voice data often produces large errors, and once a discrimination error occurs it has an obvious impact on subsequent speech recognition, so the resulting speech recognition results are often inaccurate.
  • In contrast, in the present application the speech data itself is discriminated by a language classification model, speech recognition is performed on the speech data with the speech recognition model of each possible language type to obtain multiple recognition results, and those recognition results are then further judged to obtain a credibility-based judgment of each speech recognition result.
  • The processor can feed the passenger's voice data along three paths.
  • The first path inputs the voice data to a language classification model to discriminate the language type, obtaining a discrimination score that the voice data belongs to Mandarin (recorded as score 1) and a discrimination score that it belongs to Shanghai dialect (recorded as score 2).
  • The second path inputs the voice data to a Mandarin speech recognition model, that is, performs speech recognition using a model trained on Mandarin, to obtain the Mandarin recognition result (recorded as result 1).
  • The third path inputs the voice data to a Shanghai dialect speech recognition model, that is, performs speech recognition using a model trained on Shanghai dialect, to obtain the Shanghai dialect recognition result (recorded as result 2). Then, by discriminating the recognition results (for example by their scene correlation or confidence), the credibility of result 1 and result 2 is further determined, yielding a discrimination score for result 1 (recorded as score 3) and a discrimination score for result 2 (recorded as score 4). Combining the two kinds of parameters, the discrimination scores for the speech data and the discrimination scores for the recognition results, a comprehensive discrimination is performed to select the more accurate speech recognition result from the recognition results of the two language types.
  • For example, a comprehensive evaluation score characterizing the accuracy of the Mandarin recognition result (recorded as score 5) can be obtained by weighting score 1 and score 3.
  • Similarly, a comprehensive evaluation score characterizing the accuracy of the Shanghai dialect recognition result (recorded as score 6) can be obtained by weighting score 2 and score 4; the final result is then determined from the magnitude relationship between score 5 and score 6:
  • the recognition result of the language type with the relatively higher comprehensive score is taken as the final speech recognition result.
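  • The weighted combination of score 1 with score 3 (and score 2 with score 4) can be illustrated as below. The 0.4/0.6 weights and the example score values are assumptions chosen for illustration; the patent does not fix particular weights.

```python
def fused_score(lang_score, result_score, w_lang=0.4, w_result=0.6):
    # Weighted combination of the language-classification score for the raw
    # audio and the discrimination score for the recognition result.
    return w_lang * lang_score + w_result * result_score

# Hypothetical scores following the specification's numbering:
score1, score3 = 0.55, 0.80   # Mandarin: audio score, result score
score2, score4 = 0.45, 0.30   # Shanghai dialect: audio score, result score

score5 = fused_score(score1, score3)  # Mandarin composite
score6 = fused_score(score2, score4)  # Shanghai dialect composite
final = "Mandarin" if score5 >= score6 else "Shanghai dialect"
```

Here score 5 (0.70) exceeds score 6 (0.36), so the Mandarin recognition result would be taken as the final speech recognition result.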
  • The comprehensive discrimination methods listed above are only schematic illustrations. In specific implementations, other methods can be selected for comprehensive discrimination according to the application scenario and implementation conditions; this application is not limited in this respect.
  • the credibility of the recognized speech content may be determined based on the scene in which the speech is located, the syntactic structure of the recognized sentence, and the like.
  • Speech recognition devices are mostly used in specific application scenarios. If a recognition result deviates greatly from the scene, its credibility can be considered low; if it is consistent with the scene, its credibility can be considered higher. Take the subway ticket machine as an example: if the result recognized by model A is "I want to buy a subway ticket" and the result recognized by model B is "I want to buy a high-speed train ticket", then, because the device is a subway ticket machine, the result recognized by model A is obviously more credible.
  • In specific implementation, a pre-trained scene correlation discrimination model can be used to determine the correlation between the recognition results of the multiple language types and the application scenario of the system, yielding scene-correlation evaluation scores for the recognition results of the multiple language types, which serve as the discrimination scores for the recognition results.
  • Alternatively, several scene keywords or key sentences related to the target scene can be set in advance according to the specific application scenario, and the speech recognition results can then be searched for them. When one or more scene keywords or key sentences are detected in a speech recognition result, that result can be judged to be highly correlated with the application scene, and its scene-correlation evaluation score, that is, its discrimination score, is accordingly higher.
  • The above-mentioned scene keywords may include, but are not limited to, at least one of the following: a destination station, a starting station, a ticket, and the like.
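  • A minimal keyword-overlap relevance score for a subway-ticketing scene might look like the sketch below. The keyword set and the whitespace tokenizer are simplifying assumptions; a real system would use proper text normalization or the trained scene correlation model.

```python
SCENE_KEYWORDS = {"subway", "ticket", "station", "fare", "destination"}  # assumed set

def scene_relevance(recognized_text, keywords=SCENE_KEYWORDS):
    # Fraction of scene keywords present in the recognition result;
    # a naive whitespace tokenizer stands in for real text normalization.
    tokens = set(recognized_text.lower().split())
    return len(tokens & keywords) / len(keywords)

result_a = "I want to buy a subway ticket to Yushan station"
result_b = "I want to buy a high-speed train ticket"
rel_a = scene_relevance(result_a)   # matches: subway, ticket, station
rel_b = scene_relevance(result_b)   # matches: ticket
```

Result A scores higher (0.6 vs. 0.2), matching the intuition that it fits the subway ticket machine scene better.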
  • The above-mentioned ways of judging scene correlation, using a scene correlation discrimination model or scene keywords, are only schematic descriptions. In specific implementation, other appropriate methods may be used to determine scene correlation according to the application scenario and implementation conditions; this application is not limited in this respect.
  • In specific implementation, the scene relevance of the recognition results of the multiple language types can be judged to obtain a scene-relevance evaluation score for each; at the same time, the confidence of each recognition result can be determined to obtain a confidence evaluation score for each. The relevance evaluation score and the confidence evaluation score of a given language type's recognition result together serve as the discrimination score of that recognition result.
  • The above-mentioned confidence of each language type's recognition result can be understood as the accuracy rate achieved when that language type's speech recognition model recognizes speech data of the corresponding language type.
  • For example, the confidence of a Suzhou dialect recognition result can be understood as the accuracy of using the Suzhou dialect speech recognition model to recognize Suzhou dialect speech data.
  • The above-mentioned ways of judging the credibility of the recognition results of multiple language types are merely intended to better explain the implementation of the present application.
  • the recognition result may be subjected to syntax structure discrimination, and the credibility of the recognition result may be judged based on the syntax structure discrimination result of the recognition result.
  • the recognition result conforming to the syntax structure may be determined as a recognition result with high credibility according to the discrimination result of the syntax structure.
  • For example, the recognition result 1 obtained by the language-A speech recognition model is "a subway ticket to Yushan Station",
  • while the recognition result 2 obtained by the language-B speech recognition model is the syntactically ill-formed "Uncle Dao said to Yushan Station"; result 1 conforms to the expected syntax structure and is therefore judged more credible.
  • In specific implementation, the foregoing preliminary language judgment can proceed as follows: compare the discrimination scores obtained by the language classification model for the speech data across the language types, and select a preset number (for example, 2) of language types with relatively high discrimination scores as the language types to be determined. The processor then performs speech recognition on the speech data using only the speech recognition models of those language types, obtaining the preset number of recognition results (that is, a relatively small number of recognition results), and discriminates only those recognition results. Finally, combining the discrimination scores of the speech data for the language types to be determined with the discrimination scores of their recognition results, from among the preset number of language types
  • the recognition result of the language type with the highest accuracy is determined as the final speech recognition result.
  • The implementation of the above-mentioned preliminary language judgment is only a schematic description;
  • other suitable methods may be used to perform the preliminary language judgment on the voice data according to the specific situation, so as to reduce the number of language-type speech recognition models that need further discrimination.
  • In specific implementation, a part of the voice data can be intercepted as test data for determining the language type of the voice data.
  • Since the middle part of the voice data is usually relatively coherent and its accent features are more prominent, the voice data between a first preset time point (for example, 5 seconds after the start of the voice data)
  • and a second preset time point (for example, 5 seconds before the end of the voice data) can be intercepted as the test data, and the language type determined only on that part to obtain the discrimination scores of the voice data for each language type.
  • Some of the voice data input by the user may be relatively disturbed by external noise.
  • In that case, a relatively clear part of the data can be extracted from the voice data as the test data.
  • Specifically, the stress position in the voice data can first be detected, and the voice data within a preset range of that position (for example, from 20 seconds before to 20 seconds after the stress position) intercepted as the test data, on which the language-type discrimination for the voice data is then performed.
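  • Both interception strategies can be sketched on a raw sample array as follows. The sample rates, window lengths, and the loudest-sample-as-stress heuristic are illustrative assumptions, not the patent's stress-detection method.

```python
def clip_between(samples, rate, head_s=5.0, tail_s=5.0):
    # Keep the middle of the utterance: drop head_s seconds from the start
    # and tail_s seconds from the end, where speech is less coherent.
    start = int(head_s * rate)
    end = max(start, len(samples) - int(tail_s * rate))
    return samples[start:end]

def clip_around_loudest(samples, rate, window_s=20.0):
    # Keep samples within window_s seconds on either side of the loudest
    # sample, treating the peak as a stand-in for the stress position.
    peak = max(range(len(samples)), key=lambda i: abs(samples[i]))
    half = int(window_s * rate)
    return samples[max(0, peak - half): peak + half]

samples = list(range(200))                  # stand-in for 20 s of audio at 10 Hz
middle = clip_between(samples, rate=10)     # drops the first and last 5 s
noisy = [0] * 50 + [9] + [0] * 149
focus = clip_around_loudest(noisy, rate=1)  # 20 s on either side of the peak
```

Only the intercepted segment is then fed to the language classification model, which reduces the data volume to be processed.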
  • the processor may send the final speech recognition result to a corresponding execution server to execute a corresponding user instruction according to the speech recognition result.
  • the processor may send the passenger's voice recognition results to a server that processes the ticket sales service in the self-service ticket purchase machine, and the server may sell the subway ticket required by the passenger to the passenger according to the voice recognition result to complete self-service ticket sales.
  • The multilingual speech recognition method provided in this application judges not only the speech data itself but also the speech recognition results obtained with the different language types, and then synthesizes the discrimination results of these multiple data to select the speech recognition result of the more accurately matching language type as the final result, thereby effectively improving the accuracy of speech recognition.
  • FIG. 7 is a schematic flowchart of a method of a speech recognition method according to an embodiment of the present application.
  • Although this application provides the method operation steps or device structures shown in the following embodiments or drawings, the method or device may, based on routine work without creative labor, include more or fewer operation steps or module units.
  • The execution order of the steps and the module structure of the device are not limited to the execution order or module structure shown in the embodiments of this application and in the accompanying drawings.
  • The method or module structure shown in an embodiment or drawing may be executed sequentially or in parallel (for example, in a parallel-processor or multi-threaded environment, or even a distributed processing environment).
  • a voice recognition method provided by an embodiment of the present application may include the following steps:
  • S701: Acquire voice data.
  • S702: Recognize the voice data through multiple dialect speech recognition model components to obtain multiple speech recognition results.
  • S703: Determine the credibility of each speech recognition result among the plurality of speech recognition results.
  • the credibility in the embodiment of the present application may be specifically understood as a parameter for evaluating the closeness of the speech recognition result to the true semantics.
  • different speech recognition models are used to identify different types of languages.
  • In specific implementation, the method may further include: using the language type corresponding to the speech recognition model that produced the most credible speech recognition result as the language type of the speech data.
  • Determining the credibility of each of the plurality of speech recognition results may specifically include determining it according to at least one of the following: the correlation between the speech recognition result and the scene, and the syntax structure of the speech recognition result.
  • Before the speech data is recognized through the multiple speech recognition models to obtain the multiple speech recognition results, the method may further include: identifying, through a language classification model, the confidence that the speech data belongs to each language type.
  • Correspondingly, determining the credibility of each of the plurality of speech recognition results may include: combining the confidence, identified through the language classification model, that the speech data belongs to each language type to determine the credibility of each speech recognition result.
  • Identifying, through the language classification model, the confidence that the speech data belongs to each language type may specifically include: intercepting the data between a first preset time point and a second preset time point in the voice data as test data; or intercepting the data within a preset range of the accent position in the voice data as test data; and identifying, through the language classification model, the confidence that the test data belongs to each language type.
  • a voice recognition method is also provided in this example. As shown in FIG. 8, the method may include:
  • Step 801 Acquire voice data.
  • Step 802: Recognize the voice data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple language dialects;
  • Step 803: Determine, through the scene relevance discrimination model, the relevance value of each recognition result to the target scene;
  • Step 804: Perform a fusion judgment on the confidence and the relevance to determine the dialect recognition result of the voice data.
  • In the above step 804, performing the fusion judgment on the confidence and the relevance to determine the dialect recognition result of the voice data may include:
  • S1: acquiring the confidence values of the voice data for the multiple language dialects and the relevance values to the target scene;
  • S3: determining the dialect recognition result of the voice data according to the confidence weight value, the relevance weight value, the confidence values for the multiple language dialects, and the relevance values to the target scene.
  • That is, the likelihood that the speech data belongs to each dialect can be scored according to the confidence weight value, the relevance weight value, the confidence values for the multiple dialects, and the relevance values to the target scene.
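  • The fusion judgment of S1/S3 can be sketched as a weighted score per dialect. The 0.6/0.4 weights and the candidate confidence/relevance values are illustrative assumptions; the patent leaves the weight values to the implementation.

```python
def dialect_score(confidence, relevance, w_conf=0.6, w_rel=0.4):
    # Fusion judgment: weight the model's confidence against the
    # result's relevance to the target scene (weights are assumed).
    return w_conf * confidence + w_rel * relevance

candidates = {
    "Mandarin":       {"confidence": 0.82, "relevance": 0.60},
    "Shanghainese":   {"confidence": 0.64, "relevance": 0.55},
    "Cantonese":      {"confidence": 0.31, "relevance": 0.10},
}
best = max(candidates, key=lambda d: dialect_score(**candidates[d]))
```

The dialect with the highest fused score is taken as the dialect recognition result of the voice data.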
  • for the numerical judgment of confidence, a scoring method may also be adopted; that is, recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain confidence values for multiple dialects may include: scoring the speech data in parallel through the multiple dialect speech recognition model components, and taking the scoring results as the confidence values for the multiple dialects.
  • considering that the speech data may contain no valid speech for a period of time at the beginning, a segment of speech after the starting portion may be intercepted as the basis for recognition.
  • the intercepted speech contains relatively more valid content, which effectively reduces the amount of data to be processed while yielding more accurate results.
  • that is, recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain confidence values for multiple dialects may include: intercepting, from the speech data, the data starting a predetermined number of seconds after the start of speech as sample data; and recognizing the sample data in parallel through the multiple dialect speech recognition model components to obtain the confidence values for the multiple dialects.
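A minimal sketch of the sample-interception step, assuming mono PCM audio held in a Python list; the 16 kHz rate and the two-second skip are arbitrary illustrative choices, not values specified by the patent.

```python
def take_sample(pcm, sample_rate, skip_seconds=2.0):
    """Return the audio starting `skip_seconds` after the beginning,
    discarding the leading portion that may contain no valid speech."""
    start = int(skip_seconds * sample_rate)
    return pcm[start:]

sample_rate = 16000
pcm = [0] * (5 * sample_rate)           # five seconds of dummy audio
sample = take_sample(pcm, sample_rate)  # keeps the final three seconds
print(len(sample))  # 48000
```

The resulting `sample` is what would be fed in parallel to the dialect speech recognition model components.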
  • a language (dialect) classification model may be established through learning and training in advance to identify the language type of the voice data.
  • the above-mentioned language classification model can be established in the following way: obtain sample data; extract the I-vector of the speech of each piece of data in the sample data for different languages (dialects); and, according to which language (dialect) type each speech I-vector belongs to, learn and train a multi-class classification model, such as a neural network, to obtain a language classification model that can be used to discriminate the language type of speech data.
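The patent trains a multi-class model (e.g. a neural network) on I-vectors. As a dependency-free stand-in for that training step, the sketch below uses a nearest-centroid classifier over toy 3-dimensional "I-vectors"; the vectors and dialect labels are fabricated for illustration and real I-vectors are far higher-dimensional.

```python
import math

def train_centroids(ivectors, labels):
    """Average the I-vectors of each language (dialect) into one centroid."""
    sums, counts = {}, {}
    for vec, lab in zip(ivectors, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def classify(vec, centroids):
    """Return the label of the nearest centroid (Euclidean distance)."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids, key=lambda lab: dist(vec, centroids[lab]))

# Toy "I-vectors": Mandarin clusters near (1,0,0), Shanghainese near (0,1,0).
ivectors = [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1], [0.1, 1.0, 0.0], [0.0, 0.9, 0.1]]
labels = ["Mandarin", "Mandarin", "Shanghainese", "Shanghainese"]
centroids = train_centroids(ivectors, labels)
print(classify([0.95, 0.05, 0.05], centroids))  # Mandarin
```

A neural network as named in the text would learn a nonlinear decision boundary instead, but the train-then-classify contract is the same.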
  • a credibility discrimination model may be established through learning and training in advance to determine the credibility of each speech recognition result among the plurality of speech recognition results.
  • taking the establishment of a scene relevance discrimination model as an example: obtain sample data, and vectorize the positive examples (such as recognition results belonging to the target scene) and negative examples (such as recognition results not belonging to the target scene) in the sample data.
  • the positive and negative examples described above can be vectorized in a one-hot or wordvec manner.
  • the vectorized data is then trained to obtain a binary classification model.
  • this binary classification model can be used to determine whether a speech recognition result belongs to the corresponding target scene.
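A minimal sketch of such a binary classifier, assuming one-hot (bag-of-words) vectorization and a perceptron in place of whatever training procedure an actual implementation would use; the ticketing phrases are fabricated examples.

```python
def vectorize(text, vocab):
    """One-hot bag-of-words: 1.0 if the vocabulary word occurs in the text."""
    words = set(text.split())
    return [1.0 if w in words else 0.0 for w in vocab]

def train_perceptron(samples, labels, vocab, epochs=10):
    """labels: +1 = belongs to the target scene, -1 = does not."""
    w, b = [0.0] * len(vocab), 0.0
    for _ in range(epochs):
        for text, y in zip(samples, labels):
            x = vectorize(text, vocab)
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
            if pred != y:  # mistake-driven weight update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def in_scene(text, w, b, vocab):
    x = vectorize(text, vocab)
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

positives = ["buy one subway ticket", "two tickets to yushan station"]
negatives = ["play some light music", "what is the weather today"]
samples = positives + negatives
labels = [1, 1, -1, -1]
vocab = sorted({w for s in samples for w in s.split()})
w, b = train_perceptron(samples, labels, vocab)
print(in_scene("one ticket to yushan", w, b, vocab))  # True
```

On this tiny, linearly separable toy set the perceptron converges in a few epochs; a production system would use far more data and a stronger model, exactly as the text leaves open.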
  • in specific implementation, a language type to be determined may first be selected from the multiple language types.
  • a language type to be determined may be determined from the multiple language types according to a discrimination result of the language types.
  • before the language type discrimination is performed on the speech data, the method may further include: acquiring speech data. Specifically, in order to reduce the workload and improve recognition efficiency, after acquiring the speech data, the method may further include: intercepting inspection data from the speech data. The inspection data may be used to identify the language type corresponding to the speech data. In this way, analyzing and processing the complete speech data can be avoided; only the intercepted part of the speech data is used for language type recognition, which reduces the workload and improves recognition efficiency.
  • specifically, the data between a first preset time point and a second preset time point in the speech data may be intercepted as the inspection data; or the data within a preset range of the accent position in the speech data may be intercepted as the inspection data.
  • of course, other suitable interception methods can be selected according to the specific application scenario and accuracy requirements to intercept the inspection data. This is not limited in this application.
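The two interception strategies above can be sketched as follows. Note one assumption: the patent only says accent detection is performed, so the "accent position" is approximated here as the loudest sample, and the toy sample rate and window sizes are fabricated.

```python
def clip_by_time(samples, rate, t1, t2):
    """Keep the audio between the first and second preset time points."""
    return samples[int(t1 * rate):int(t2 * rate)]

def clip_around_accent(samples, rate, radius_seconds):
    """Keep a window around the accent position, approximated here as the
    sample with the largest absolute amplitude (illustrative assumption)."""
    peak = max(range(len(samples)), key=lambda i: abs(samples[i]))
    r = int(radius_seconds * rate)
    return samples[max(0, peak - r):peak + r]

rate = 100                 # toy sample rate: 100 samples per second
samples = [0] * 1000       # ten seconds of near-silence...
samples[550] = 9           # ...with a loud "accent" at 5.5 s
print(len(clip_by_time(samples, rate, 2, 8)))       # 600
print(len(clip_around_accent(samples, rate, 1.0)))  # 200
```

Either clip then serves as the inspection data passed to the language classification model.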
  • the above-mentioned screening of speech recognition results that meet requirements from the multiple speech recognition results according to the language type determination result and the credibility determination result may include:
  • S1: Performing multi-modal fusion judgment according to the language type discrimination result and the credibility discrimination result to obtain a multi-modal fusion judgment result;
  • S2: Screening out, according to the multi-modal fusion judgment result, a speech recognition result that meets the requirements from the multiple speech recognition results.
  • in specific implementation, weighted scoring may be performed according to the language type discrimination result and the credibility discrimination result to obtain the multi-modal fusion judgment result. A binary classification model may also be trained in advance on the features of the different discrimination results as the multi-modal fusion judgment model, which is used to perform multi-modal fusion judgment and obtain the above multi-modal fusion judgment result.
  • FIG. 9 is a hardware block diagram of a voice recognition device according to an embodiment of the present application.
  • the system may specifically include a sound collector 111 and a processor 112 (the processor 112 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), where the sound collector 111 may be coupled to the processor 112 through an internal cable.
  • the sound collector 111 may be a device such as a microphone or other sound-pickup device, and is configured to collect speech data.
  • the processor may be specifically configured to recognize the speech data through multiple speech recognition models to obtain multiple speech recognition results; determine the credibility of each of the multiple speech recognition results; and take the speech recognition result with the highest credibility as the recognition result of the speech data.
  • the structure shown in FIG. 9 is only for illustration, and it does not limit the structure of the electronic device.
  • the above-mentioned system may further include a structure such as a memory 113.
  • the memory 113 may be used to store software programs and modules of application software, such as program instructions / modules of a voice recognition device in the embodiment of the present invention.
  • the processor 112 executes various functional applications and data processing by running the software programs and modules stored in the memory 113, that is, implements the speech recognition method of the above application program.
  • the memory 113 may include a high-speed random access memory, and may further include a non-volatile memory, such as one or more magnetic storage devices, a flash memory, or other non-volatile solid-state memory.
  • the memory 113 may further include memory remotely disposed with respect to the processor 112, and these remote memories may be connected to a computer terminal through a network. Examples of the above network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the above-mentioned voice recognition device may be specifically applied to various interactive application scenarios such as subway self-service ticket purchase, smart navigation, smart shopping, smart home, elderly care and so on.
  • the processor 112 may be further configured to take the language type corresponding to the speech recognition model of the most credible speech recognition result as the language type of the speech data.
  • the multilingual speech recognition device provided by this application recognizes the target speech data through multiple speech recognition models and then selects the most credible speech recognition result among the recognition results as the final recognition result. This avoids the problem that, when a recognition model is selected in advance, a wrong choice of model causes all subsequent speech recognition results to be wrong, and achieves the technical effect of effectively improving the accuracy of speech recognition results.
  • the above voice recognition device may further include: an acquisition module, a recognition module, and a determination module, of which:
  • An acquisition module which can be used to acquire voice data
  • a recognition module which can be used to recognize the voice data through multiple voice recognition models to obtain multiple voice recognition results
  • the determining module may be configured to determine the credibility of each of the plurality of speech recognition results, and use the speech recognition result with the highest credibility as the recognition result of the speech data.
  • different speech recognition models are used to identify different types of languages.
  • the determination module may further take the language type corresponding to the speech recognition model of the most credible speech recognition result as the language type of the speech data.
  • the determination module may determine the credibility of each of the multiple speech recognition results according to at least one of the following: the relevance between the speech recognition result and the scene, the syntactic structure of the speech recognition result, and so on.
  • the device may further include a language type discrimination module, which can be used, before the recognition module recognizes the speech data through multiple speech recognition models to obtain multiple speech recognition results, to identify through a language classification model the confidence that the speech data belongs to each language type.
  • the credibility of each of the multiple speech recognition results may be determined in combination with the confidence, identified through the language classification model, that the speech data belongs to each language type.
  • the inspection data may first be obtained in one of the following ways: intercepting the data between a first preset time point and a second preset time point in the speech data as the inspection data; or intercepting the data within a preset range of the accent position in the speech data as the inspection data; and then the confidence that the inspection data belongs to each language type is identified through a language classification model.
  • in order to reduce the workload and improve efficiency, before credibility discrimination is performed on the multiple speech recognition results to obtain the credibility discrimination results, the system may further include a language type preliminary selection module for determining a language type to be determined from the multiple language types.
  • in specific implementation, the language type preliminary selection module may determine the language type to be determined from the multiple language types according to the discrimination result of the language types.
  • taking speech recognition involving two dialects as an example, the following steps illustrate how to perform speech recognition accurately using the above-mentioned speech recognition method:
  • S1: Input the speech data into a language (dialect) classification model to obtain a discrimination score for which language (dialect) the speech data belongs to;
  • S2: Input the speech data into the speech recognition models of the two language (dialect) types, obtain the recognition results of the speech data under the two speech recognition models, and perform confidence discrimination on the recognition results to obtain discrimination scores for the confidence of the two recognition results;
  • S3: Input the speech recognition results obtained under the two speech recognition models into the scene relevance discrimination model to obtain discrimination scores for the relevance of the two recognition results to the target scene;
  • S4: Input the discrimination score for which language (dialect) the speech data belongs to, the confidence discrimination scores of the two recognition results, and the discrimination scores for the relevance of the two recognition results to the target scene into the multi-modal fusion discrimination model to determine which language type's speech recognition result meets the requirements;
  • S5: Show the user the speech recognition result that meets the requirements, or perform subsequent semantic understanding based on that speech recognition result.
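The two-dialect flow above can be sketched end to end with a simple weighted multi-modal fusion (the text also allows a trained binary classifier at this step instead). Every score and weight below is fabricated for illustration.

```python
def multimodal_fusion(lang_scores, conf_scores, rel_scores,
                      weights=(0.3, 0.4, 0.3)):
    """Combine, per dialect, the language-classification score (S1), the
    recognition-confidence score (S2) and the scene-relevance score (S3)
    into one fused score, then pick the winning dialect (S4)."""
    wl, wc, wr = weights
    fused = {
        d: wl * lang_scores[d] + wc * conf_scores[d] + wr * rel_scores[d]
        for d in lang_scores
    }
    return max(fused, key=fused.get), fused

# Hypothetical discrimination scores for the two dialects.
lang_scores = {"Mandarin": 0.55, "Shanghainese": 0.45}  # from the classifier
conf_scores = {"Mandarin": 0.70, "Shanghainese": 0.80}  # from the recognizers
rel_scores = {"Mandarin": 0.90, "Shanghainese": 0.40}   # from the scene model

winner, fused = multimodal_fusion(lang_scores, conf_scores, rel_scores)
print(winner)  # Mandarin
```

Here the Shanghainese recognizer is slightly more confident, but the language classifier and the scene relevance model both favor Mandarin, so the fused judgment still selects the Mandarin result — the point of combining the three signals rather than trusting any single one.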
  • An embodiment of the present application further provides a computer storage medium storing computer program instructions which, when executed, implement: acquiring speech data; recognizing the speech data through multiple speech recognition models to obtain multiple speech recognition results; determining the credibility of each of the multiple speech recognition results; and taking the speech recognition result with the highest credibility as the recognition result of the speech data.
  • the devices or modules described in the foregoing embodiments may be specifically implemented by a computer chip or entity, or may be implemented by a product having a certain function.
  • for convenience of description, the functions are divided into various modules and described separately.
  • the functions of each module may be implemented in one or more pieces of software and/or hardware.
  • a module that implements a certain function may also be implemented by combining multiple submodules or subunits.
  • the method, device, or module described in this application may be implemented by a controller executing computer-readable program code in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller.
  • examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicone Labs C8051F320.
  • the memory controller can also be implemented as part of the control logic of the memory.
  • in addition to implementing the controller purely as computer-readable program code, it is entirely possible, by logically programming the method steps, to enable the controller to achieve the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Therefore, such a controller can be regarded as a hardware component, and the devices included in it for implementing various functions can also be regarded as structures within the hardware component. Or even, the devices for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
  • program modules include routines, programs, objects, components, data structures, classes, etc. that perform specific tasks or implement specific abstract data types.
  • program modules may be located in local and remote computer storage media, including storage devices.
  • the present application can be implemented by means of software plus the necessary hardware. Based on such an understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, may be embodied in the form of a software product, or may be reflected in the implementation process of data migration.
  • the computer software product may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to enable a computer device (which may be a personal computer, a mobile terminal, a server, or a network device, etc.) to execute the methods described in the embodiments of this application or in some parts of the embodiments.


Abstract

A speech recognition method and device, the method comprising: acquiring speech data (S801); recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects (S802); determining, through a scene relevance discrimination model, relevance values indicating that the recognition results for the multiple dialects belong to a target scene (S803); and performing fusion judgment on the confidence and the relevance to determine a dialect recognition result of the speech data (S804). By fusing dialect determination with scene determination, the accuracy of dialect determination is improved, the problem that an incorrect choice of recognition model made in advance causes all subsequent speech recognition results to be wrong is solved, and the technical effect of effectively improving the accuracy of speech recognition results is achieved.

Description

Speech Recognition Method and Device
This application claims priority to Chinese Patent Application No. 201811000407.9, entitled "Speech Recognition Method and Device" and filed on August 30, 2018, the entire contents of which are incorporated herein by reference.
Technical Field
This application belongs to the technical field of speech recognition, and in particular relates to a speech recognition method and device.
Background
The popularization of human-computer interaction technology has brought more and more convenience to people's lives and work. For example, based on speech recognition technology, people can buy tickets by themselves through ticketing equipment. Taking the purchase of subway tickets at a subway station as an example, a passenger only needs to tell the ticket machine the destination or the number of tickets to buy, and the ticket machine can automatically sell the passenger the required subway tickets according to the passenger's speech data.
However, the languages (for example, dialects) used by people from different places often differ considerably in pronunciation, which affects the accuracy of machine speech recognition results. For this problem, the processing methods currently adopted are: Method 1) prompting the user to select the language type used, and then switching to the speech recognition model of the corresponding language type for speech recognition based on the user's selection; Method 2) the machine first discriminates the language type of the user's speech data to determine the language type used by the user, and then switches to the speech recognition model corresponding to the language type determined by the machine for speech recognition.
However, both of the above methods have problems to varying degrees. Method 1) requires the user to actively select the language type used first, so imperceptible operation by the user cannot be achieved and the user experience is poor; Method 2) the choice of speech model depends on the result of a one-time machine judgment on the speech data, and if the language type judged by the machine is wrong, the accuracy of subsequent speech recognition is directly affected.
No effective solution has yet been proposed for the above problems.
Summary of the Invention
The purpose of this application is to provide a speech recognition method and device to improve the accuracy of speech recognition.
The speech recognition method and device provided by this application are implemented as follows:
A speech recognition method, comprising:
acquiring speech data;
recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects;
determining, through a scene relevance discrimination model, relevance values indicating that the recognition results for the multiple dialects belong to a target scene;
performing fusion judgment on the confidence and the relevance to determine a dialect recognition result of the speech data.
A speech recognition device, comprising: a sound collector and a processor, the sound collector being coupled to the processor, wherein:
the sound collector is configured to collect speech data;
the processor is configured to recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects; determine, through a scene relevance discrimination model, relevance values indicating that the recognition results for the multiple dialects belong to a target scene; and perform fusion judgment on the confidence and the relevance to determine a dialect recognition result of the speech data.
A subway station ticket machine, comprising: a sound collector and a processor, the sound collector being coupled to the processor, wherein:
the sound collector is configured to collect speech data;
the processor is configured to recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects; determine, through a scene relevance discrimination model, relevance values indicating that the recognition results for the multiple dialects belong to a target scene; and perform fusion judgment on the confidence and the relevance to determine a dialect recognition result of the speech data.
A speech recognition method, comprising:
acquiring speech data;
recognizing the speech data through multiple dialect speech recognition model components to obtain multiple speech recognition results;
determining the credibility of each of the multiple speech recognition results;
taking the speech recognition result with the highest credibility as the recognition result of the speech data.
A computer-readable storage medium having computer instructions stored thereon which, when executed, implement the steps of the above method.
In the speech recognition method and device provided by this application, the speech data is recognized in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects; relevance values indicating that the recognition results for the multiple dialects belong to a target scene are determined through a scene relevance discrimination model; and fusion judgment is then performed on the confidence and the relevance to determine the dialect recognition result of the speech data. Because dialect determination is fused with scene determination, the accuracy of dialect determination is improved, the problem that an incorrect choice of recognition model made in advance causes all subsequent speech recognition results to be wrong is solved, and the technical effect of effectively improving the accuracy of speech recognition results is achieved.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application, and those of ordinary skill in the art can obtain other drawings from them without creative labor.
FIG. 1 is a schematic architectural diagram of the speech recognition device provided by this application;
FIG. 2 is a schematic scene diagram of a self-service ticket machine applying the speech recognition device provided by this application;
FIG. 3 is a schematic diagram of a connection between the sound collector and the processor in a self-service ticket machine applying the speech recognition device provided by this application;
FIG. 4 is a schematic diagram of a discrimination flow of the speech recognition device provided by this application;
FIG. 5 is a schematic diagram of another discrimination flow of the speech recognition device provided by this application;
FIG. 6 is a schematic diagram of the speech recognition device provided by this application intercepting inspection data;
FIG. 7 is a schematic flowchart of the steps of the speech recognition method provided by this application;
FIG. 8 is a schematic flowchart of other steps of the speech recognition method provided by this application;
FIG. 9 is a schematic structural diagram of the speech recognition device provided by this application.
Detailed Description of the Embodiments
In order to enable those skilled in the art to better understand the technical solutions in this application, the technical solutions in the embodiments of this application will be described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only a part of the embodiments of this application, not all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative labor shall fall within the protection scope of this application.
It is considered that existing speech recognition methods usually first discriminate the language type of the user's speech data and then, according to the discrimination result, select the corresponding language type for speech recognition to obtain the final recognition result. Since these methods perform speech recognition with the speech recognition model corresponding to a language type determined in advance, if the determined language type is wrong, the accuracy of subsequent speech recognition results will be very low. For example, when the system discriminates the language type of the user's speech data and mistakes the Shanghainese used by the user for Suzhounese, the speech recognition results subsequently obtained through the Suzhounese-based speech recognition model tend to have relatively low accuracy and relatively large errors.
In view of the above problem, it is considered that if the acquired speech data is first fed into multiple language recognition models of multiple possible language types for recognition, and, after multiple recognition results are obtained, the credibility of the recognition results is judged so as to determine which recognition result is more accurate and select that one, then the process will not suffer from the problem of low speech recognition accuracy caused by selecting the wrong speech model.
Referring to FIG. 1, an embodiment of this application provides a speech recognition device, which may include: a sound collector and a processor. The sound collector and the processor may be integrated together, or may be independent of each other and coupled in a wired or wireless manner for data transmission.
The above speech recognition device may be deployed in various interactive application scenarios such as subway self-service ticketing, smart navigation, smart shopping, smart home, and elderly care. Specifically, the system may be a component set in the physical equipment corresponding to certain application scenarios, for example, a self-service ticket machine, a care robot, or a navigator; it may also be a program or module capable of calling relevant functional units of existing equipment, for example, an APP on a mobile phone. Through the above system, the speech data of users in the corresponding application scenario can be collected, and the user's speech data and the speech recognition results can be discriminated separately, so as to accurately determine the instruction corresponding to the user's speech data and then execute the corresponding instruction, for example, completing a passenger's ticket purchase. The specific application of the above speech recognition device is not limited in this application.
The following takes a speech recognition device applied in a subway self-service ticketing scenario as an example for specific description.
Specifically, the sound collector of the system may be a device such as a microphone or other sound-pickup device. It may be set in a self-service ticket machine in a subway station to collect passengers' speech data.
In order to reduce false triggering, the sound collector may usually be in a standby state. When buying a ticket, the user may select the voice input identifier or symbol in the display interface of the self-service ticket machine shown in FIG. 2 to trigger the sound collector to enter the working state and collect the user's speech data. Keywords may also be detected automatically, and the collection of speech data starts automatically when a keyword (for example, "buy a ticket") is detected. In implementation, intelligent recognition technology may also be combined to determine whether a passenger intends to buy a ticket and whether the voice ticketing process has been triggered.
For example, referring to FIG. 2, a passenger can switch to the voice input mode by clicking the voice input symbol in the display interface of the self-service ticket machine, thereby triggering the sound collector in the machine to enter the working state, collect the passenger's speech data, and send the collected speech data to the processor for further analysis and processing.
In the above process of collecting speech data, in order to reduce the interference of ambient noise and improve the purity of the collected speech data, the system may further include a noise reduction device, such as a noise filter. One end of the noise reduction device may be coupled to the sound collector and the other end to the processor, so that the speech data collected by the sound collector is first denoised and then sent to the processor.
In one embodiment, the processor may be a single server, a server cluster, or a cloud processor, etc.; which mode is adopted may be selected according to actual needs.
The processor may be built into the self-service ticket machine and receive, through its connection with the sound collector, the speech data collected by the sound collector.
Considering the need to reduce the cost of self-service ticket machines, the processor may also be a common server; that is, the sound collectors of different self-service ticket machines are all coupled to this server in a wired or wireless manner, for example, a cloud server. Specifically, referring to FIG. 3, the sound collectors set in different self-service ticket machines may be connected to the processor via TCP or IP, etc., to transmit the collected speech data to the processor.
In one embodiment, the processor may be provided with dialect speech recognition model components for multiple language types, for example, a Shanghainese recognition model component, a Suzhounese recognition model component, a Tianjin-dialect recognition model component, a Mandarin recognition model component, a Cantonese recognition model component, and so on. After the speech data is acquired, it can be recognized separately by these components to obtain the speech recognition result and confidence value of each dialect model.
Further, as shown in FIG. 4, relevance judgment based on the target scene may be performed on the speech recognition results of the dialect models to determine which recognition result is more relevant to the scene. Then, based on the confidence and the relevance, the dialect recognition result of the speech data is determined. Specifically, the possibility of belonging to each dialect may be scored according to the confidence and the relevance, and the recognition result with the highest score is determined as the final speech recognition result.
In implementation, as shown in FIG. 5, after receiving the speech data, the processor may also discriminate the dialect type of the speech data through a speech classification model trained in advance, obtaining a discrimination score of the speech data for each of multiple dialect types (that is, a discrimination result of the dialect type based on the speech data). For example, take a self-service ticket machine in a Shanghai subway station. Considering that the widely used language types in Shanghai are Mandarin and Shanghainese, after receiving the speech data the processor can obtain, through the above speech classification model, a discrimination score for the speech data belonging to Mandarin and a discrimination score for it belonging to Shanghainese. After obtaining these two discrimination scores, existing methods directly determine the language type of the speech data from them, and then perform speech recognition using only the speech recognition model of the determined language type to obtain the final speech recognition result. However, different types of languages often share similar pronunciation features; for example, Suzhounese and Shanghainese are very similar in the pronunciation of certain words and sentences. Therefore, determining the language type of the speech data only from the language-type discrimination scores often brings relatively large errors, and once a discrimination error occurs here it has an obvious impact on subsequent speech recognition, so the obtained speech recognition results are often inaccurate. In order to improve the accuracy of speech recognition, in this embodiment, while the speech data itself is discriminated through the language classification model, the speech recognition models of the possible language types are also used to recognize the speech data separately to obtain recognition results of multiple possible language types, and these recognition results are judged again to obtain judgment results based on the credibility of the speech recognition results.
Specifically, the processor may divide the passenger's speech data into three groups. The first group is input into the language classification model to discriminate the language type of the speech data, obtaining a discrimination score for the passenger's speech data belonging to Mandarin (denoted as score 1) and a discrimination score for it belonging to Shanghainese (denoted as score 2). The second group is input into the Mandarin speech recognition model to perform speech recognition with a model trained on Mandarin, obtaining a Mandarin recognition result (denoted as result 1). The third group is input into the Shanghainese speech recognition model to perform speech recognition with a model trained on Shanghainese, obtaining a Shanghainese recognition result (denoted as result 2). Then, through discrimination of the recognition results (for example, scene relevance discrimination or confidence discrimination), the credibility of result 1 and result 2 is further discriminated, obtaining a discrimination score for result 1 (denoted as score 3) and a discrimination score for result 2 (denoted as score 4). Finally, the discrimination scores for the speech data and the discrimination scores for the recognition results, two different kinds of parameters, are combined for comprehensive discrimination, so as to select the more accurate speech recognition result from the recognition results of the two language types.
For example, according to score 1 and score 3, a comprehensive evaluation score representing the accuracy of the Mandarin recognition result (denoted as score 5) can be obtained by weighting. Similarly, according to score 2 and score 4, a comprehensive evaluation score representing the accuracy of the Shanghainese recognition result (denoted as score 6) can be obtained by weighting. Then, according to the relative magnitudes of score 5 and score 6, the recognition result of the language type with relatively higher accuracy is determined as the final speech recognition result. Of course, the comprehensive discrimination method listed above is only a schematic illustration; in specific implementation, other methods may be selected for comprehensive discrimination according to the specific application scenario and implementation conditions. This is not limited in this application.
In order to accurately determine the credibility of the speech content recognized by each language recognition model, the credibility may be determined based on the scene in which the speech occurs, the syntactic structure of the recognized sentence, and so on.
For example, speech recognition devices are mostly applied in specific application scenarios. If the recognized speech recognition result deviates greatly from the scene, the credibility of that result can be considered low; if it fits the scene well, the credibility can be considered high. Taking a subway ticket machine as an example, if the speech result recognized by model A is "I want to buy a subway ticket" and the one recognized by model B is "I want to buy a high-speed rail ticket", then, since the device is a subway ticket machine, the result recognized by model A obviously has higher credibility.
In specific implementation, a scene relevance discrimination model trained in advance may be used to discriminate the degree of relevance between the recognition results of the multiple language types and the application scenario of the system, so as to obtain scene relevance evaluation scores of the recognition results, i.e., discrimination scores for the recognition results. Of course, multiple scene keywords or key sentences related to the target scene may also be preset according to the specific application scenario, and the speech recognition results may then be checked for these keywords or key sentences; when one or more scene keywords or key sentences are detected in a speech recognition result, it can be judged that the result has a high degree of relevance to the application scenario.
For example, when discriminating the recognition results obtained by a speech recognition device applied in a subway self-service ticket machine, if a recognition result contains multiple preset scene keywords related to the subway scene, the scene relevance evaluation score of that result can be considered high, i.e., its discrimination score is high. The scene keywords may specifically include, but are not limited to, at least one of the following: destination station, starting station, ticket, and so on. Of course, the above-listed discrimination methods using a scene relevance discrimination model or scene keywords are only schematic illustrations; in specific implementation, other suitable methods of scene relevance discrimination may be selected according to the specific application scenario and implementation conditions. This is not limited in this application.
In one embodiment, in order to further optimize the discrimination of the credibility of the recognition results, in addition to performing scene relevance discrimination on the recognition results of the multiple language types to obtain their scene relevance evaluation scores, confidence discrimination may also be performed on them at the same time to obtain their confidence evaluation scores. The relevance evaluation score and the confidence evaluation score of the recognition result of the same language type are then taken together as the discrimination score of the recognition result of that language type. The confidence of the recognition result of each language type can be understood as the accuracy with which the speech recognition model of that language type recognizes speech data of the corresponding language type. For example, the confidence of a Suzhounese recognition result can be understood as the accuracy with which the Suzhounese speech recognition model recognizes Suzhounese speech data.
Of course, the above-listed methods for discriminating the credibility of recognition results of multiple language types are only intended to better illustrate the embodiments of this application. In specific implementation, other suitable methods may be selected in combination with the specific application scenario to discriminate the credibility of the recognition results. For example, syntactic structure discrimination may be performed on the recognition results, and their credibility may be discriminated according to the syntactic structure discrimination results; for instance, recognition results conforming to syntactic structure may be discriminated as results with higher credibility. As an example, for the same speech data, recognition result 1 obtained through the speech recognition model of language type A is "one subway ticket to Yushan station", while recognition result 2 obtained through the speech recognition model of language type B is a nonsensical string of near-homophones of that sentence. By performing syntactic structure discrimination on the two results respectively, it can be seen that result 1 conforms to syntactic structure better than result 2, so result 1 can be judged to have higher credibility than result 2.
It is worth noting, however, that using the scene and the syntactic structure as the basis for judging the credibility of recognition results is only an exemplary description and does not constitute a limitation on this application; other determining factors may also be adopted when actually determining the credibility.
In view of the actual situation, if there are many language types to be discriminated, for example 20, each speech recognition would require converting the speech data into recognition results of 20 language types and then discriminating each of those recognition results, which would inevitably increase the processor's operating burden and reduce recognition efficiency. In this situation, when the number of language types to be discriminated is relatively large, a preliminary language judgment may first be performed on the speech data, so that a few speech recognition models of the more likely language types to be determined can be screened out from the speech recognition models of the multiple language types; subsequent analysis then only performs further recognition and judgment with the speech recognition models of the language types to be determined, which can effectively reduce the processor's workload.
Specifically, the above preliminary language judgment may, for example, first compare the discrimination scores, obtained through the language classification model, of the speech data belonging to each language type, and select a preset number (for example, 2) of language types with relatively high discrimination scores as the language types to be determined. The processor then performs speech recognition on the speech data only with the speech recognition models of the language types to be determined, obtaining the recognition results of the preset number of language types to be determined (i.e., a relatively small number of recognition results); then only these recognition results are discriminated. Finally, combining the discrimination scores of the speech data for the language types to be determined and the discrimination scores of their recognition results, the recognition result of the language type with the highest accuracy is determined from the recognition results of the preset number of language types to be determined as the final speech recognition result. Of course, the above-listed implementation of the preliminary language judgment is only a schematic illustration; in specific implementation, other suitable implementations may also be adopted according to the specific situation to narrow the number of speech recognition models of language types that need further determination.
In order to further improve recognition efficiency and reduce the processor's workload, it is considered that discriminating the language type of speech data usually does not require processing all of its content. Therefore, referring to FIG. 6, a part of the speech data may be intercepted as inspection data for language type discrimination. Specifically, considering that when a user inputs speech data, the middle part is usually relatively coherent and its accent features are relatively prominent, the speech data between a first preset time point (for example, the 5th second after the speech data starts) and a second preset time point (for example, the 5th second before it ends) may be intercepted as the inspection data, and language type discrimination is performed only on this part to obtain the discrimination scores of the speech data belonging to each language type. Of course, in connection with the specific application scenario, some parts of the speech data input by the user may be relatively heavily disturbed by external noise; in order to improve the accuracy of discrimination, a relatively clear part of the speech data may be extracted as the inspection data. For example, accent detection may first be performed on the speech data, and the speech data within a preset range of the accent position (for example, from 20 seconds before to 20 seconds after the accent position) may be intercepted as the inspection data before language type discrimination is performed.
After the final speech recognition result is obtained through the above speech recognition device, the processor may send the final speech recognition result to the corresponding execution server to execute the corresponding user instruction according to the speech recognition result. For example, the processor may send the passenger's speech recognition result to the server handling ticketing business in the self-service ticket machine, and the server can sell the passenger the requested subway tickets according to the speech recognition result, completing self-service ticketing.
In the multilingual speech recognition method provided by this application, not only is the language type of the speech data itself discriminated, but the speech recognition results obtained based on different language types are also discriminated accordingly; the discrimination results for the multiple kinds of data are then combined to select the speech recognition result corresponding to the language type with higher accuracy as the final speech recognition result, thereby effectively improving the accuracy of speech recognition.
FIG. 7 is a schematic flowchart of an embodiment of the speech recognition method described in this application. Although this application provides the method operation steps or device structures shown in the following embodiments or drawings, more or fewer operation steps or module units may be included in the method or device based on routine practice or without creative labor. For steps or structures with no logically necessary causal relationship, the execution order of these steps or the module structure of the device is not limited to the execution order or module structure described in the embodiments of this application and shown in the drawings. When the described method or module structure is applied in an actual device or terminal product, it may be executed sequentially or in parallel according to the method or module structure shown in the embodiments or drawings (for example, in an environment of parallel processors or multi-threaded processing, or even a distributed processing environment).
Specifically, as shown in FIG. 7, a speech recognition method provided by an embodiment of this application may include the following steps:
S701: Acquire speech data;
S702: Recognize the speech data through multiple dialect speech recognition model components to obtain multiple speech recognition results;
S703: Determine the credibility of each of the multiple speech recognition results;
S704: Take the speech recognition result with the highest credibility as the recognition result of the speech data.
The credibility in the embodiments of this application can be understood as a parameter for evaluating how close a speech recognition result is to the true semantics.
In one embodiment, different speech recognition models are used to recognize different types of languages. After taking the speech recognition result with the highest credibility as the recognition result of the speech data, the method may further include: taking the language type corresponding to the speech recognition model of the most credible speech recognition result as the language type of the speech data.
In one embodiment, determining the credibility of each of the multiple speech recognition results may include: determining the credibility of each of the multiple speech recognition results according to at least one of the following: the relevance between the speech recognition result and the scene, and the syntactic structure of the speech recognition result. Of course, it should be noted that the multiple credibility determination methods listed above are only intended to better illustrate the embodiments of this application; in specific implementation, other suitable methods may also be selected according to the specific situation to discriminate the credibility of speech recognition results. This is not limited in this application.
In one embodiment, before recognizing the speech data through multiple speech recognition models to obtain multiple speech recognition results, the method further includes: identifying, through a language classification model, the confidence that the speech data belongs to each language type.
In one embodiment, determining the credibility of each of the multiple speech recognition results may include: determining the credibility of each of the multiple speech recognition results in combination with the confidence, identified through the language classification model, that the speech data belongs to each language type.
In one embodiment, identifying through a language classification model the confidence that the speech data belongs to each language type may include: intercepting the data between a first preset time point and a second preset time point in the speech data as the inspection data, or intercepting the data within a preset range of the accent position in the speech data as the inspection data; and identifying, through the language classification model, the confidence that the inspection data belongs to each language type.
This example also provides a speech recognition method which, as shown in FIG. 8, may include:
Step 801: Acquire speech data;
Step 802: Recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects;
Step 803: Determine, through a scene relevance discrimination model, relevance values indicating that the recognition results for the multiple dialects belong to a target scene;
Step 804: Perform fusion judgment on the confidence and the relevance to determine a dialect recognition result of the speech data.
In the above step 804, performing fusion judgment on the confidence and the relevance to determine the dialect recognition result of the speech data may include:
S1: Acquiring the confidence values of the speech data for the multiple dialects, and the relevance values of belonging to the target scene;
S2: Acquiring a preset confidence weight value and a preset relevance weight value;
S3: Determining the dialect recognition result of the speech data according to the confidence weight value, the relevance weight value, the confidence values for the multiple dialects, and the relevance values of belonging to the target scene.
That is, different weight values may be assigned to the relevance and the confidence, each dialect recognition result may be scored according to the weight values, and which recognition result serves as the final one is then determined by the scores. That is, the possibility that the speech data belongs to each dialect may be scored according to the confidence weight value, the relevance weight value, the confidence values for the multiple dialects, and the relevance values of belonging to the target scene; the dialect with the highest score is taken as the dialect corresponding to the speech data; and the recognition result of the dialect speech recognition model component corresponding to the highest-scoring dialect is taken as the speech recognition result of the speech data.
For the numerical judgment of confidence, relevance, and the like, a scoring method may also be adopted; that is, recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain confidence values for multiple dialects may include: scoring the speech data in parallel through the multiple dialect speech recognition model components, and taking the scoring results as the confidence values for the multiple dialects.
Considering that the speech data may contain no valid speech for a period of time at the beginning, a segment of speech after the starting portion may be intercepted as the basis for recognition; the intercepted speech contains relatively more valid content, which effectively reduces the amount of data to be processed while yielding more accurate results. That is, recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain confidence values for multiple dialects may include: intercepting, from the speech data, the data starting a predetermined number of seconds after the start of speech as sample data; and recognizing the sample data in parallel through the multiple dialect speech recognition model components to obtain the confidence values for the multiple dialects.
In the embodiments of this application, a language (dialect) classification model may be established in advance through learning and training to identify the language type of speech data. In specific implementation, the language classification model may be established as follows: obtain sample data; extract the I-vector of the speech of each piece of data in the sample data for different languages (dialects); and, according to which language (dialect) type each speech I-vector belongs to, learn and train a multi-class classification model, such as a neural network, thereby obtaining a language classification model that can be used to discriminate the language type of speech data.
In the embodiments of this application, a credibility discrimination model may be established in advance through learning and training to determine the credibility of each of the multiple speech recognition results. Specifically, taking the establishment of a scene relevance discrimination model as an example: obtain sample data, and vectorize the positive examples (for example, recognition results belonging to the target scene) and negative examples (for example, recognition results not belonging to the target scene) in the sample data. Specifically, the positive and negative examples may be vectorized in a one-hot or wordvec manner. The vectorized data is then trained to obtain a binary classification model, which can be used to discriminate whether a speech recognition result belongs to the corresponding target scene.
In the embodiments of this application, in order to improve recognition efficiency and reduce the workload, before credibility discrimination is performed on the multiple speech recognition results to obtain credibility discrimination results, a language type to be determined may first be selected from the multiple language types in specific implementation.
In one embodiment, in specific implementation, the language type to be determined may be selected from the multiple language types according to the discrimination result of the language types. Of course, the above-listed way of determining the language type to be determined from the multiple language types is only a schematic illustration and should not constitute an undue limitation on this application.
In the embodiments of this application, before language type discrimination is performed on the speech data, the method may further include: acquiring speech data. Specifically, in order to reduce the workload and improve recognition efficiency, after the speech data is acquired, the method may further include: intercepting inspection data from the speech data. The inspection data may be used to identify the language type corresponding to the speech data. In this way, analyzing and processing the complete speech data can be avoided; only the intercepted part of the speech data is used for language type recognition, which reduces the workload and improves recognition efficiency.
Specifically, the data between a first preset time point and a second preset time point in the speech data may be intercepted as the inspection data; or the data within a preset range of the accent position in the speech data may be intercepted as the inspection data. Of course, other suitable interception methods may also be selected according to the specific application scenario and accuracy requirements. This is not limited in this application.
In the embodiments of this application, screening out a speech recognition result that meets the requirements from the multiple speech recognition results according to the language type discrimination result and the credibility discrimination result may, in specific implementation, include:
S1: Performing multi-modal fusion judgment according to the language type discrimination result and the credibility discrimination result to obtain a multi-modal fusion judgment result;
S2: Screening out, according to the multi-modal fusion judgment result, a speech recognition result that meets the requirements from the multiple speech recognition results.
In specific implementation of the embodiments of this application, weighted scoring may be performed according to the language type discrimination result and the credibility discrimination result to obtain the multi-modal fusion judgment result. A binary classification model may also be trained in advance on the features of the different discrimination results as the multi-modal fusion judgment model, which is used to perform multi-modal fusion judgment and obtain the above multi-modal fusion judgment result.
The embodiments of the speech recognition device provided by the embodiments of this application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a speech recognition device running on a server side as an example, FIG. 9 is a hardware structure block diagram of a speech recognition device according to an embodiment of this application. As shown in FIG. 9, the system may specifically include a sound collector 111 and a processor 112 (the processor 112 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), where the sound collector 111 may be coupled to the processor 112 through an internal cable, or wirelessly, for example via WIFI or Bluetooth. Specifically, the sound collector 111 may be a device such as a microphone or other sound-pickup device, and is used to collect speech data. The processor may be specifically configured to recognize the speech data through multiple speech recognition models to obtain multiple speech recognition results; determine the credibility of each of the multiple speech recognition results; and take the speech recognition result with the highest credibility as the recognition result of the speech data.
Those of ordinary skill in the art can understand that the structure shown in FIG. 9 is only schematic and does not limit the structure of the above electronic device. For example, the system may further include structures such as a memory 113. The memory 113 may be used to store software programs and modules of application software, such as the program instructions/modules of the speech recognition device in the embodiments of the present invention; the processor 112 executes various functional applications and data processing by running the software programs and modules stored in the memory 113, that is, implements the speech recognition method of the above application program. The memory 113 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 113 may further include memory remotely located with respect to the processor 112, and such remote memory may be connected to the computer terminal through a network. Examples of the network include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
In this embodiment, the above speech recognition device may specifically be applied in various interactive application scenarios such as subway self-service ticketing, smart navigation, smart shopping, smart home, and elderly care.
In one embodiment, after taking the speech recognition result with the highest credibility as the recognition result of the speech data, the processor 112 may be further configured to take the language type corresponding to the speech recognition model of the most credible speech recognition result as the language type of the speech data.
The multilingual speech recognition device provided by this application recognizes the target speech data through multiple speech recognition models respectively, and then selects the most credible speech recognition result among the recognition results as the final recognition result. This avoids the problem that, when a recognition model is selected first, a wrong choice of model causes all subsequent speech recognition results to be wrong, and achieves the technical effect of effectively improving the accuracy of speech recognition results.
At the software level, the above speech recognition device may further include: an acquisition module, a recognition module, and a determination module, wherein:
the acquisition module may be used to acquire speech data;
the recognition module may be used to recognize the speech data through multiple speech recognition models to obtain multiple speech recognition results;
the determination module may be used to determine the credibility of each of the multiple speech recognition results and take the speech recognition result with the highest credibility as the recognition result of the speech data.
In one embodiment, different speech recognition models are used to recognize different types of languages. After taking the speech recognition result with the highest credibility as the recognition result of the speech data, the determination module may further take the language type corresponding to the speech recognition model of the most credible speech recognition result as the language type of the speech data.
In one embodiment, when determining the credibility of each of the multiple speech recognition results, the determination module may determine it according to at least one of the following: the relevance between the speech recognition result and the scene, the syntactic structure of the speech recognition result, and so on. Of course, the multiple credibility determination methods listed above are only intended to better illustrate the embodiments of this application; in specific implementation, other suitable methods may also be selected according to the specific situation and accuracy requirements to determine the credibility.
In one embodiment, the device further includes a language type discrimination module, which may specifically be used, before the recognition module recognizes the speech data through multiple speech recognition models to obtain multiple speech recognition results, to identify through a language classification model the confidence that the speech data belongs to each language type.
In one embodiment, in specific implementation, the determination module may determine the credibility of each of the multiple speech recognition results in combination with the confidence, identified through the language classification model, that the speech data belongs to each language type.
In one embodiment, in specific implementation, the language type discrimination module may first obtain the inspection data in one of the following ways: intercepting the data between a first preset time point and a second preset time point in the speech data as the inspection data; or intercepting the data within a preset range of the accent position in the speech data as the inspection data; and then identify, through the language classification model, the confidence that the inspection data belongs to each language type.
In one embodiment, in order to reduce the workload and improve work efficiency, before credibility discrimination is performed on the multiple speech recognition results to obtain the credibility discrimination results, the system may further include a language type preliminary selection module for determining a language type to be determined from the multiple language types.
In one embodiment, in specific implementation, the language type preliminary selection module may determine the language type to be determined from the multiple language types according to the discrimination result of the language types.
In one embodiment, speech recognition involving two dialects is taken as an example of how to perform speech recognition accurately using the above speech recognition method. Specifically, it includes the following steps:
S1: Input the speech data into a language (dialect) classification model to obtain a discrimination score for which language (dialect) the speech data belongs to;
S2: Input the speech data into the speech recognition models of the two language (dialect) types, obtain the recognition results of the speech data under the two speech recognition models, and perform confidence discrimination on the recognition results to obtain discrimination scores for the confidence of the two recognition results;
S3: Input the speech recognition results obtained under the two speech recognition models into the scene relevance discrimination model to obtain discrimination scores for the relevance of the two recognition results to the target scene;
S4: Input the discrimination score for which language (dialect) the speech data belongs to, the confidence discrimination scores of the two recognition results, and the discrimination scores for the relevance of the two recognition results to the target scene into the multi-modal fusion discrimination model to determine which language type's speech recognition result meets the requirements;
S5: Show the user the speech recognition result that meets the requirements, or perform subsequent semantic understanding based on that speech recognition result.
The embodiments of this application further provide a computer storage medium storing computer program instructions which, when executed, implement: acquiring speech data; recognizing the speech data through multiple speech recognition models to obtain multiple speech recognition results; determining the credibility of each of the multiple speech recognition results; and taking the speech recognition result with the highest credibility as the recognition result of the speech data.
虽然本申请提供了如实施例或流程图所述的方法操作步骤,但基于常规或者无创造性的劳动可以包括更多或者更少的操作步骤。实施例中列举的步骤顺序仅仅为众多步骤执行顺序中的一种方式,不代表唯一的执行顺序。在实际中的装置或客户端产品执行时,可以按照实施例或者附图所示的方法顺序执行或者并行执行(例如并行处理器或者多线程处理的环境)。
上述实施例阐明的装置或模块,具体可以由计算机芯片或实体实现,或者由具有某种功能的产品来实现。为了描述的方便,描述以上装置时以功能分为各种模块分别描述。在实施本申请时可以把各模块的功能在同一个或多个软件和/或硬件中实现。当然,也可以将实现某功能的模块由多个子模块或子单元组合实现。
本申请中所述的方法、装置或模块可以以计算机可读程序代码方式实现控制器按任何适当的方式实现,例如,控制器可以采取例如微处理器或处理器以及存储可由该(微)处理器执行的计算机可读程序代码(例如软件或固件)的计算机可读介质、逻辑门、开关、专用集成电路(Application Specific Integrated Circuit,ASIC)、可编程逻辑控制器和嵌入微控制器的形式,控制器的例子包括但不限于以下微控制器:ARC 625D、Atmel AT91SAM、Microchip PIC18F26K20以及Silicone Labs C8051F320,存储器控制器还可以被实现为存储器的控制逻辑的一部分。本领域技术人员也知道,除了以纯计算机可读程序代码方式实现控制器以外,完全可以通过将方法步骤进行逻辑编程来使得控制器以逻辑门、开关、专用集成电路、可编程逻辑控制器和嵌入微控制器等的形式来实现相同功能。因此这种控制器可以被认为是一种硬件部件,而对其内部包括的用于实现各种功 能的装置也可以视为硬件部件内的结构。或者甚至,可以将用于实现各种功能的装置视为既可以是实现方法的软件模块又可以是硬件部件内的结构。
Some of the modules of the apparatus described in the present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, classes, and the like that perform particular tasks or implement particular abstract data types. The present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary hardware. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, or embodied in the implementation process of data migration. The computer software product may be stored in a storage medium, such as a ROM/RAM, magnetic disk, or optical disc, and includes several instructions for causing a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to perform the methods described in the embodiments of the present application or in certain parts thereof.
The embodiments in this specification are described in a progressive manner; for identical or similar parts between embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. All or part of the present application can be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, mobile communication terminals, multiprocessor systems, microprocessor-based systems, programmable electronic devices, network PCs, minicomputers, mainframe computers, distributed computing environments including any of the above systems or devices, and so on.
Although the present application has been depicted through embodiments, those of ordinary skill in the art will appreciate that many variations and changes of the present application are possible without departing from its spirit, and it is intended that the appended claims cover such variations and changes without departing from the spirit of the present application.

Claims (11)

  1. A speech recognition method, comprising:
    acquiring speech data;
    recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects;
    determining, through a scene-relevance discrimination model, relevance values indicating how well the recognition results for the multiple dialects match a target scene;
    performing fusion judgment on the confidence values and the relevance values to determine a dialect recognition result of the speech data.
  2. The method according to claim 1, wherein performing fusion judgment on the confidence values and the relevance values to determine the dialect recognition result of the speech data comprises:
    acquiring the confidence values of the speech data for the multiple dialects and the relevance values with respect to the target scene;
    acquiring preset confidence weight values and relevance weight values;
    determining the dialect recognition result of the speech data according to the confidence weight values, the relevance weight values, the confidence values for the multiple dialects, and the relevance values with respect to the target scene.
  3. The method according to claim 2, wherein determining the dialect recognition result of the speech data according to the confidence weight values, the relevance weight values, the confidence values for the multiple dialects, and the relevance values with respect to the target scene comprises:
    scoring the likelihood that the speech data belongs to each dialect according to the confidence weight values, the relevance weight values, the confidence values for the multiple dialects, and the relevance values with respect to the target scene;
    taking the dialect with the highest score as the dialect corresponding to the speech data;
    taking the recognition result of the dialect speech recognition model component corresponding to the highest-scoring dialect as the speech recognition result of the speech data.
  4. The method according to claim 1, wherein recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain confidence values for multiple dialects comprises:
    scoring the speech data in parallel through the multiple dialect speech recognition model components;
    taking the scoring results as the confidence values for the multiple dialects.
  5. The method according to claim 1, wherein recognizing the speech data in parallel through multiple dialect speech recognition model components to obtain confidence values for multiple dialects comprises:
    intercepting, from the speech data, the data following a predetermined number of seconds after the start of speech as sample data;
    recognizing the sample data in parallel through the multiple dialect speech recognition model components to obtain confidence values for the multiple dialects.
  6. A speech recognition device, comprising a sound collector and a processor, the sound collector being coupled to the processor, wherein:
    the sound collector is configured to collect speech data;
    the processor is configured to recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects; determine, through a scene-relevance discrimination model, relevance values indicating how well the recognition results for the multiple dialects match a target scene; and perform fusion judgment on the confidence values and the relevance values to determine a dialect recognition result of the speech data.
  7. A subway-station ticket vending machine, comprising a sound collector and a processor, the sound collector being coupled to the processor, wherein:
    the sound collector is configured to collect speech data;
    the processor is configured to recognize the speech data in parallel through multiple dialect speech recognition model components to obtain recognition results and confidence values for multiple dialects; determine, through a scene-relevance discrimination model, relevance values indicating how well the recognition results for the multiple dialects match a target scene; and perform fusion judgment on the confidence values and the relevance values to determine a dialect recognition result of the speech data.
  8. The ticket vending machine according to claim 7, wherein performing fusion judgment on the confidence values and the relevance values to determine the dialect recognition result of the speech data comprises:
    acquiring the confidence values of the speech data for the multiple dialects and the relevance values with respect to the target scene;
    acquiring preset confidence weight values and relevance weight values;
    determining the dialect recognition result of the speech data according to the confidence weight values, the relevance weight values, the confidence values for the multiple dialects, and the relevance values with respect to the target scene.
  9. The ticket vending machine according to claim 8, wherein determining the dialect recognition result of the speech data according to the confidence weight values, the relevance weight values, the confidence values for the multiple dialects, and the relevance values with respect to the target scene comprises:
    scoring the likelihood that the speech data belongs to each dialect according to the confidence weight values, the relevance weight values, the confidence values for the multiple dialects, and the relevance values with respect to the target scene;
    taking the dialect with the highest score as the dialect corresponding to the speech data;
    taking the recognition result of the dialect speech recognition model component corresponding to the highest-scoring dialect as the speech recognition result of the speech data.
  10. A speech recognition method, comprising:
    acquiring speech data;
    recognizing the speech data through multiple dialect speech recognition model components to obtain multiple speech recognition results;
    determining the credibility of each of the multiple speech recognition results;
    taking the speech recognition result with the highest credibility as the recognition result of the speech data.
  11. A computer-readable storage medium having computer instructions stored thereon, wherein the instructions, when executed, implement the steps of the method according to any one of claims 1 to 5.
PCT/CN2019/102485 2018-08-30 2019-08-26 Speech recognition method and device WO2020043040A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811000407.9A CN110875039B (zh) 2018-08-30 2018-08-30 Speech recognition method and device
CN201811000407.9 2018-08-30

Publications (1)

Publication Number Publication Date
WO2020043040A1 true WO2020043040A1 (zh) 2020-03-05

Family

ID=69643927

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/102485 WO2020043040A1 (zh) 2018-08-30 2019-08-26 语音识别方法和设备

Country Status (2)

Country Link
CN (1) CN110875039B (zh)
WO (1) WO2020043040A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114464182A (zh) * 2022-03-03 2022-05-10 慧言科技(天津)有限公司 一种音频场景分类辅助的语音识别快速自适应方法

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN112750462A (zh) * 2020-08-07 2021-05-04 腾讯科技(深圳)有限公司 一种音频处理方法、装置及设备
CN112164400A (zh) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 语音交互方法、服务器和计算机可读存储介质
CN112466280B (zh) * 2020-12-01 2021-12-24 北京百度网讯科技有限公司 语音交互方法、装置、电子设备和可读存储介质
CN113077793B (zh) * 2021-03-24 2023-06-13 北京如布科技有限公司 一种语音识别方法、装置、设备及存储介质

Citations (5)

Publication number Priority date Publication date Assignee Title
US20040098259A1 (en) * 2000-03-15 2004-05-20 Gerhard Niedermair Method for recognition verbal utterances by a non-mother tongue speaker in a speech processing system
CN104036774A (zh) * 2014-06-20 2014-09-10 国家计算机网络与信息安全管理中心 藏语方言识别方法及系统
CN106128462A (zh) * 2016-06-21 2016-11-16 东莞酷派软件技术有限公司 语音识别方法及系统
CN107135247A (zh) * 2017-02-16 2017-09-05 江苏南大电子信息技术股份有限公司 一种人与人工智能协同工作的服务系统及方法
CN107564513A (zh) * 2016-06-30 2018-01-09 阿里巴巴集团控股有限公司 语音识别方法及装置

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN104217717B (zh) * 2013-05-29 2016-11-23 腾讯科技(深圳)有限公司 构建语言模型的方法及装置
CN105448292B (zh) * 2014-08-19 2019-03-12 北京羽扇智信息科技有限公司 一种基于场景的实时语音识别系统和方法

Also Published As

Publication number Publication date
CN110875039B (zh) 2023-12-01
CN110875039A (zh) 2020-03-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19854960; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19854960; Country of ref document: EP; Kind code of ref document: A1)