WO2023273776A1 - Speech data processing method and apparatus, storage medium, and electronic apparatus - Google Patents


Info

Publication number
WO2023273776A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech, preset, recognition, models, sample
Prior art date
Application number
PCT/CN2022/096411
Other languages
English (en)
French (fr)
Inventor
朱文博
Original Assignee
青岛海尔科技有限公司
海尔智家股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202110744802.3A (external priority; granted as CN113593535B)
Application filed by 青岛海尔科技有限公司 and 海尔智家股份有限公司
Publication of WO2023273776A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 2015/0635: Training: updating or merging of old and new templates; Mean values; Weighting
    • G10L 15/08: Speech classification or search
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Definitions

  • the present disclosure relates to the communication field, and in particular, to a voice data processing method and device, a storage medium, and an electronic device.
  • Natural speech audio data from the user is captured by an input device through the speech interaction system, and the audio is fed into one or more speech recognition engines to recognize the user's speech and obtain a recognition result.
  • A multi-engine approach feeds the user's voice data into several engines, collects the recognition results of all the engines, and performs a calculation over them to obtain the final result.
  • Different speech recognition engines have different interactive response times. If every utterance passes through all engines, the system must wait for the slowest engine's result before making any subsequent judgment, so obtaining the best recognition result forces the user to wait too long, which seriously degrades the interaction experience.
  • Embodiments of the present disclosure provide a voice data processing method and device, a storage medium, and an electronic device, so as to at least solve the problems in the related art that, when multiple voice recognition engines (i.e., voice models) are used for voice recognition, the recognition time is long and the accuracy of the recognition results cannot be determined.
  • According to one embodiment, a method for processing voice data includes: acquiring voice data to be processed; determining at least one target speech model from a plurality of preset speech models according to the weight corresponding to each preset speech model, where the weight of each preset speech model represents the confidence of that model's recognition result; and processing the voice data to be processed through the at least one target speech model.
  • According to another embodiment, a voice data processing device includes: an acquisition module configured to acquire voice data to be processed; a configuration module configured to recognize the voice data according to a preset recognition model, where the preset recognition model is composed of a plurality of preset speech models and includes a weight for each preset speech model, the weight indicating the weighting coefficient and confidence of that model's recognition result; and a determination module configured to determine, from the plurality of preset speech models, at least one target speech model to perform recognition on the voice data to be processed.
  • According to another embodiment, a computer-readable storage medium is provided, in which a computer program is stored, wherein the computer program is configured to execute the steps of any one of the above method embodiments when run.
  • According to another embodiment, an electronic device is provided, including a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps of any one of the above method embodiments.
  • In the embodiments of the present disclosure, the speech data to be processed is acquired; at least one target speech model is determined from a plurality of preset speech models according to the weight corresponding to each preset speech model, where the weight represents the confidence of that model's recognition result; and the speech data is processed by the at least one target speech model. In other words, by determining the weight of each preset speech model, at least one target speech model suited to processing the speech data is selected, so that more accurate speech results are fed back to the target object.
  • This solves the related-art problems that, when multiple speech models are used for speech recognition, the recognition time is long and the accuracy of the recognition result cannot be determined; it ensures the flexibility of speech data recognition and shortens the time needed to determine recognition accuracy.
  • FIG. 1 is a block diagram of the hardware structure of a computer terminal for a voice data processing method according to an embodiment of the present disclosure;
  • FIG. 2 is a flowchart of a method for processing voice data according to an embodiment of the disclosure;
  • FIG. 3 is a structural block diagram (1) of a device for processing voice data according to an embodiment of the disclosure;
  • FIG. 4 is a structural block diagram (2) of an apparatus for processing voice data according to an embodiment of the disclosure.
  • FIG. 1 is a hardware structural block diagram of a computer terminal implementing a voice data processing method according to an embodiment of the present disclosure.
  • The computer terminal may include one or more processors 102 (only one is shown in FIG. 1), which may include, but are not limited to, processing devices such as a microprocessor (MCU) or a programmable logic device (FPGA), and a memory 104 for storing data.
  • the above-mentioned computer terminal may further include a transmission device 106 and an input and output device 108 for communication functions.
  • The structure shown in FIG. 1 is only illustrative and does not limit the structure of the computer terminal; the terminal may include more or fewer components than shown in FIG. 1, or have a different configuration with functions equivalent to or beyond those shown in FIG. 1.
  • The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the voice data processing method in the embodiments of the present disclosure. The processor 102 runs the computer program stored in the memory 104, thereby executing various functional applications and data processing, that is, realizing the above-mentioned method.
  • the memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory.
  • the memory 104 may further include a memory that is remotely located relative to the processor 102, and these remote memories may be connected to a computer terminal through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the transmission device 106 is configured to receive or transmit data via a network.
  • the specific example of the above-mentioned network may include a wireless network provided by the communication provider of the computer terminal.
  • the transmission device 106 includes a network interface controller (NIC for short), which can be connected to other network devices through a base station so as to communicate with the Internet.
  • the transmission device 106 may be a radio frequency (Radio Frequency, referred to as RF) module, which is configured to communicate with the Internet in a wireless manner.
  • FIG. 2 is a flow chart of a method for processing voice data according to an embodiment of the disclosure. The process includes the following steps:
  • Step S202: acquire the voice data to be processed;
  • Step S204: determine at least one target speech model from the plurality of preset speech models according to the weight corresponding to each preset speech model, where the weight of each preset speech model represents the confidence of that model's recognition result;
  • Step S206: process the speech data to be processed by using the at least one target speech model.
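  • The selection logic of steps S202 to S206 can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the function names (`select_target_models`, `process_speech`) and the toy engine stubs are assumptions:

```python
def select_target_models(weights, top_k=1):
    """Step S204 sketch: rank preset speech models by their weight (the
    confidence of each model's recognition result) and return the
    indices of the top_k models."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return ranked[:top_k]

def process_speech(audio, models, weights, top_k=1):
    """Steps S202/S206 sketch: route the voice data only to the
    selected target model(s) instead of all engines."""
    return [models[i](audio) for i in select_target_models(weights, top_k)]

# Toy stand-ins for preset speech recognition engines.
models = [lambda a: "engine-0:" + a,
          lambda a: "engine-1:" + a,
          lambda a: "engine-2:" + a]
weights = [0.2, 0.7, 0.1]
result = process_speech("turn on the light", models, weights)
```

Because only the highest-weight model is invoked, the response time is that of a single engine call rather than the slowest of all engines.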
  • In this way, the speech data to be processed is acquired; at least one target speech model is determined from the plurality of preset speech models according to the weight corresponding to each preset speech model, where the weight represents the confidence of that model's recognition result; and the speech data is processed by the at least one target speech model. That is, by determining the weight of each preset speech model, at least one target speech model suited to processing the speech data is selected, so that more accurate speech results are fed back to the target object.
  • This solves the related-art problems that, when multiple speech models are used for speech recognition, the recognition time is long and the accuracy of the recognition result cannot be determined; it ensures the flexibility of speech data recognition and shortens the time needed to determine recognition accuracy.
  • The recognition types of the above-mentioned preset speech models vary: some preset speech models perform speech recognition, some perform semantic understanding, and some perform voiceprint recognition. The present disclosure does not limit the model type; similar models can all serve as preset speech models in the embodiments of the present disclosure.
  • In an exemplary embodiment, before acquiring the speech data to be processed, the method further includes: acquiring sample speech for training the plurality of preset speech models; processing the sample speech through the plurality of preset speech models respectively to obtain the recognition result and confidence corresponding to each preset speech model; and determining the weights corresponding to the plurality of preset speech models according to those recognition results and confidences.
  • The sample voice has the same parameter information as the voice data to be processed; specifically, the parameter information may include a user ID, voiceprint features, and the targeted voice processing equipment (home appliances, robots, speakers, etc.).
  • Processing the sample speech through the plurality of preset speech models to obtain the recognition result of each model includes: acquiring standard recognition data of the sample speech, where the standard recognition data indicates the text content into which the sample speech is correctly parsed; determining the difference between the standard recognition data and the recognition data obtained by each preset speech model for the sample speech; and determining each model's recognition result for the sample speech according to that difference.
  • Processing the sample speech through the plurality of preset speech models to obtain the confidence of each model includes: acquiring a confidence interval corresponding to the sample speech; determining the probability that the recognition value obtained by each preset speech model for the sample speech falls within the confidence interval, where the recognition value indicates the number of repeated word sequences between the model's recognition data for the sample speech and the standard recognition data; and determining each model's confidence according to that probability.
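  • A minimal sketch of this confidence computation, under the assumption that the "recognition value" is a per-utterance count of word positions shared with the standard recognition data; the helper names are hypothetical:

```python
def recognition_value(hypothesis, reference):
    """Count word positions where the model's recognition data repeats
    the standard recognition data's word sequence (a simple overlap)."""
    return sum(1 for h, r in zip(hypothesis.split(), reference.split()) if h == r)

def confidence_from_interval(values, interval):
    """Confidence of one preset speech model: the fraction (probability)
    of its recognition values that fall inside the confidence interval."""
    lo, hi = interval
    return sum(lo <= v <= hi for v in values) / len(values)

# One engine's recognition values over four sample utterances.
values = [recognition_value("turn on light", "turn on the light"),
          recognition_value("turn on the light", "turn on the light"),
          recognition_value("turn off light", "turn on the light"),
          recognition_value("turn on the light", "turn on the light")]
confidence = confidence_from_interval(values, (3, 4))
```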
  • The historical word error rate of each preset recognition model is screened against a preset word error rate threshold, which ensures that the word error rate of the preset recognition model used to recognize voice data stays within the range allowed by the target object.
  • Determining the weights corresponding to the plurality of preset speech models includes: obtaining a plurality of recognition results of the sample speech from the plurality of preset speech models and determining a first feature vector of the sample speech from those results; obtaining a plurality of confidence levels of the sample speech from the plurality of preset speech models and determining a second feature vector from those confidences; and inputting the first feature vector and the second feature vector into a preset neural network model to obtain the weights corresponding to the plurality of preset speech models.
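  • The weight-determination step can be sketched as follows, assuming (purely for illustration) a one-hidden-layer network with random weights; the patent only specifies that the two feature vectors are fed to a preset neural network whose outputs are the per-model weights:

```python
import math
import random

random.seed(0)

def softmax(xs):
    """Normalize scores into weights that are positive and sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def model_weights(recognition_vec, confidence_vec, hidden=8):
    # Concatenate the first feature vector (per-model recognition results)
    # and the second feature vector (per-model confidences) into one input.
    x = list(recognition_vec) + list(confidence_vec)
    # Randomly initialised one-hidden-layer network; an assumption for
    # illustration only, since a real system would train these weights.
    w1 = [[random.gauss(0, 1) for _ in x] for _ in range(hidden)]
    w2 = [[random.gauss(0, 1) for _ in range(hidden)] for _ in recognition_vec]
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    # Softmax output: one weight per preset speech model.
    return softmax([sum(w * hi for w, hi in zip(row, h)) for row in w2])

w = model_weights([0.9, 0.7, 0.8], [0.95, 0.6, 0.85])
```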
  • Before determining the at least one target speech model according to the weights, the method further includes: determining the identity information of the target object corresponding to the speech data to be processed; and determining the calling authority of the target object according to the identity information, where the calling authority indicates the list of models, among the plurality of preset speech models, that can process the speech data corresponding to the target object, and different preset recognition models are used to recognize speech data of different structures.
  • The preset recognition models that can be selected also differ per target object, because the target object can register its identity on the server in advance and be assigned calling authority for the corresponding preset recognition models according to the registration result. That is, once the target object's registration on the server is complete and its identity verification passes, one or more preset recognition models corresponding to the calling authority can be selected from the multiple preset recognition models on the server to process the voice data.
  • An optional embodiment of the present disclosure provides a shunt (traffic-splitting) strategy that redistributes calls among multiple general speech recognition engines to achieve the best user interaction experience. Existing multi-engine schemes usually send the same user voice data to multiple engines simultaneously; because the engines' response times differ and the system waits for all results every time, the slowest engine's response time becomes the final response time, which seriously degrades the user's interaction experience.
  • At the same time, the advantages of multiple engines are obvious: they can complement each other to achieve the best recognition results.
  • an optional embodiment of the present disclosure mainly provides a method for implementing a splitting strategy based on multiple speech recognition engines.
  • In this strategy, each utterance is recognized by only one engine, but that engine is the one best suited, among all engines, to recognizing that user's voice; the engine assigned to each user is redistributed regularly so that the match between the user's data and the engine stays highest, achieving the best recognition results and interactive experience.
  • The engine dynamic shunting strategy thus calls different engines dynamically, achieving the technical effect of feeding back more accurate recognition results to the user within the response time of a single engine call, without affecting the interactive experience.
  • The multi-engine speech recognition result output solution is as follows, including the following steps:
  • Step 1: Based on the existing recognition system, use human-machine dialogue to feed part of the users' voice into multi-engine recognition simultaneously, and screen and label the user data to obtain the users' correct instruction requirements.
  • Step 2: Collect statistics on the confidence (also called credibility) values of the data obtained by each engine in the above step, and determine the proportion of the overall data that reaches each engine's threshold.
  • For the calculation of the confidence value: since the engines are common models in the cloud, confidence statistics are computed according to the different structures and results of each model.
  • The traditional model structure uses the posterior probability: the language model and the acoustic model score the best path, and speech recognition obtains the optimal word sequence W* as follows:
  • W* = argmax_W P(W) · P(O|W)
  • where P(W) is the score of the language model and P(O|W) is the score of the acoustic model for the observed audio O.
  • For the calculation of the confidence ratio, the confidence results of all the data are obtained from all engine calculations and normalized by softmax.
  • Let c_m(conf_{1..n} > thres_m) indicate whether the confidence value of each of the n data items recognized by engine m exceeds that engine's preset threshold, and let c(total) be the total confidence value. The average confidence C_m of engine m is then the proportion of its n data items that pass, giving a confidence vector C_M = [C_1, ..., C_m] over the M engines.
  • The vector is normalized by the softmax function: softmax(C)_i = exp(C_i) / Σ_j exp(C_j).
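  • The proportion statistic and its softmax normalization can be sketched as follows; the per-engine confidence values and thresholds here are illustrative numbers, not data from the patent:

```python
import math

def softmax(xs):
    """Normalize a score vector so it is positive and sums to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def confidence_proportions(conf_per_engine, thresholds):
    """For each engine m, the fraction of its n confidence values that
    exceed that engine's threshold thres_m."""
    return [sum(c > t for c in confs) / len(confs)
            for confs, t in zip(conf_per_engine, thresholds)]

# Illustrative numbers: three engines, three utterances each.
confs = [[0.9, 0.8, 0.4], [0.6, 0.7, 0.9], [0.3, 0.5, 0.2]]
thresholds = [0.5, 0.5, 0.5]
C = confidence_proportions(confs, thresholds)  # proportion per engine
S = softmax(C)                                 # normalized confidence vector
```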
  • The recognition results of each engine are evaluated with the standard word error rate (WER) metric, and the accuracy vector is:
  • W_M = [(1 - WER_1), ..., (1 - WER_m)];
  • where W_M is a vector of recognition accuracies, also normalized by the softmax function.
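  • A sketch of the accuracy vector W_M built from per-engine word error rates, with WER computed as word-level Levenshtein distance over the reference length (the standard ASR definition; the patent does not spell the computation out):

```python
def wer(hypothesis, reference):
    """Word error rate: word-level Levenshtein distance divided by the
    number of reference words."""
    h, r = hypothesis.split(), reference.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),  # substitution
                          d[i - 1][j] + 1,                           # deletion
                          d[i][j - 1] + 1)                           # insertion
    return d[len(r)][len(h)] / len(r)

def accuracy_vector(hypotheses, reference):
    """W_M = [(1 - WER_1), ..., (1 - WER_m)] over m engines' outputs."""
    return [1 - wer(h, reference) for h in hypotheses]
```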
  • ω_1, ω_2 ∈ R^m, where R^m is the set of weight coefficients corresponding to each engine; S_1 and S_2 are the two resulting m-dimensional feature vectors.
  • k-fold cross-validation is used for DNN model training.
  • Step 3: Sort S and select the three engines with the top three accuracy rates, where by default the difference between their word error rates is within 10%.
  • This yields the final weight distribution scheme: the cloud configures for each user the engine modes that can be called, and choosing the best engine among several to call improves the recognition rate to the greatest extent.
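  • Step 3's engine selection can be sketched as follows; the exact tie and gap rules are assumptions, since the text only requires the top three accuracy rates with word error rate differences within 10%:

```python
def pick_engines(scores, wers, top_n=3, max_wer_gap=0.10):
    """Rank engine indices by combined score S (highest first), take the
    top_n, and keep only engines whose WER is within max_wer_gap of the
    best-ranked engine's WER."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_n]
    best_wer = wers[ranked[0]]
    return [i for i in ranked if abs(wers[i] - best_wer) <= max_wer_gap]

scores = [0.35, 0.30, 0.20, 0.15]  # combined S scores for four engines
wers = [0.08, 0.10, 0.25, 0.12]    # historical word error rates
chosen = pick_engines(scores, wers)
```

The engine mode configured for a user in the cloud would then be drawn from `chosen`.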
  • Step 4: Repeat steps 1-3 regularly and automate the entire process, so that engine calls are dynamically redistributed according to the weights.
  • the weight coefficient models of different engines are trained and tuned to obtain the best weight results.
  • Engines are dynamically allocated according to the weight results, so that different users call different engines and optimal recognition accuracy is achieved; the weight results are retrained regularly and the engines are reallocated dynamically.
  • Using the mixed multi-speech-recognition-engine calling method improves recognition accuracy: a user command enters a single engine yet obtains the best recognition result among all engines, reducing response time. Further, because the weight of each engine can be generated automatically, different engines can be called automatically to implement the dynamic allocation strategy.
  • The method according to the above embodiments can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation.
  • The technical solution of the present disclosure, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as a ROM/RAM, magnetic disk, or optical disc) and contains several instructions to enable a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to execute the methods described in the various embodiments of the present disclosure.
  • This embodiment also provides a device for processing voice data, which is used to implement the above embodiments and preferred implementations; what has already been described will not be repeated.
  • the term "module” may be a combination of software and/or hardware that realizes a predetermined function.
  • Although the devices described in the following embodiments are preferably implemented in software, implementations in hardware, or in a combination of software and hardware, are also possible and contemplated.
  • FIG. 3 is a structural block diagram of a device for processing voice data according to an embodiment of the present disclosure. As shown in FIG. 3, the device includes:
  • an obtaining module 34, configured to obtain the voice data to be processed;
  • a configuration module 36, configured to determine at least one target speech model from the plurality of preset speech models according to the weight corresponding to each preset speech model, where the weight of each preset speech model represents the confidence of that model's recognition result;
  • a determining module 38, configured to process the speech data to be processed through the at least one target speech model.
  • In this way, the speech data to be processed is acquired; at least one target speech model is determined from the plurality of preset speech models according to the weight corresponding to each preset speech model, where the weight represents the confidence of that model's recognition result; and the speech data is processed by the at least one target speech model. That is, by determining the weight of each preset speech model, at least one target speech model suited to processing the speech data is selected, so that more accurate speech results are fed back to the target object.
  • This solves the related-art problems that, when multiple speech models are used for speech recognition, the recognition time is long and the accuracy of the recognition result cannot be determined; it ensures the flexibility of speech data recognition and shortens the time needed to determine recognition accuracy.
  • The recognition types of the above-mentioned preset speech models vary: some preset speech models perform speech recognition, some perform semantic understanding, and some perform voiceprint recognition. The present disclosure does not limit the model type; similar models can all serve as preset speech models in the embodiments of the present disclosure.
  • FIG. 4 is a structural block diagram of another voice data processing device according to an embodiment of the present disclosure. As shown in FIG. 4, the device further includes a sample module 30 and a permission module 32.
  • The above-mentioned device further includes a sample module configured to: acquire sample speech for training the plurality of preset speech models; process the sample speech through the models to obtain the recognition result and confidence corresponding to each preset speech model; and determine the weights corresponding to the multiple preset speech models according to those recognition results and confidences.
  • The sample voice has the same parameter information as the voice data to be processed; specifically, the parameter information may include a user ID, voiceprint features, and the targeted voice processing equipment (home appliances, robots, speakers, etc.).
  • The above-mentioned sample module is further configured to: obtain standard recognition data of the sample speech, where the standard recognition data indicates the text content into which the sample speech is correctly parsed; determine the difference between the standard recognition data and the recognition data obtained by each preset speech model for the sample speech; and determine each model's recognition result for the sample speech according to that difference.
  • The above-mentioned sample module is further configured to: obtain a confidence interval corresponding to the sample speech; determine the probability that the recognition value obtained by each preset speech model for the sample speech falls within the confidence interval, where the recognition value indicates the number of repeated word sequences between the model's recognition data and the standard recognition data; and determine each model's confidence according to that probability.
  • The historical word error rate of each preset recognition model is screened against a preset word error rate threshold, which ensures that the word error rate of the preset recognition model used to recognize voice data stays within the range allowed by the target object.
  • The above-mentioned sample module is further configured to: obtain multiple recognition results of the sample voice from the multiple preset voice models and determine a first feature vector of the sample voice according to those results; obtain multiple confidence levels of the sample voice from the multiple preset voice models and determine a second feature vector according to those confidences; and input the first and second feature vectors into a preset neural network model to obtain the weights corresponding to the multiple preset speech models.
  • The above device further includes a permission module configured to: determine the identity information of the target object corresponding to the voice data to be processed; and determine the calling authority of the target object according to the identity information, where the calling authority indicates the list of models, among the plurality of preset speech models, that can process the speech data corresponding to the target object, and different preset recognition models are used to recognize speech data of different structures.
  • The preset recognition models that can be selected also differ per target object, because the target object can register its identity on the server in advance and be assigned calling authority for the corresponding preset recognition models according to the registration result. That is, once the target object's registration on the server is complete and its identity verification passes, one or more preset recognition models corresponding to the calling authority can be selected from the multiple preset recognition models on the server to process the voice data.
  • Orientations or positional relationships indicated by the terms "center", "upper", "lower", "front", "rear", "left", "right", etc. are based on the orientations or positional relationships shown in the drawings; they are used only for convenience in describing the present disclosure and simplifying the description, and do not indicate or imply that the referenced devices or components must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limitations on the present disclosure.
  • The terms "first" and "second" are used for descriptive purposes only, and should not be understood as indicating or implying relative importance.
  • The term "connection" should be understood in a broad sense: for example, it can be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical or an electrical connection; and it may be a direct connection, an indirect connection through an intermediary, or internal communication between two components.
  • when an element is referred to as being "fixed on" or "disposed on" another element, it can be directly on the other element, or intervening elements may also be present.
  • when a component is said to be "connected" to another element, it may be directly connected to the other element, or intervening elements may also be present.
  • the above-mentioned modules can be realized by software or hardware; for the latter, this can be achieved in the following ways, but is not limited to them: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
  • Embodiments of the present disclosure also provide a storage medium, in which a computer program is stored, wherein the computer program is configured to execute the steps in any one of the above method embodiments when running.
  • the above-mentioned storage medium may be configured to store a computer program for performing the following steps:
  • the above-mentioned storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disk.
  • Embodiments of the present disclosure also provide an electronic device, including a memory and a processor, where a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
  • the electronic device may further include a transmission device and an input and output device, wherein the transmission device is connected to the processor, and the input and output device is connected to the processor.
  • the above-mentioned processor may be configured to execute the following steps through a computer program:
  • each module or step of the above disclosure can be realized by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. In an exemplary embodiment, they may be implemented in program code executable by a computing device and thus stored in a storage device to be executed by a computing device; in some cases, the steps shown or described here may be performed in a different order, or they may be fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The present disclosure provides a speech data processing method and apparatus, a storage medium, and an electronic apparatus. The method includes: acquiring speech data to be processed; determining at least one target speech model from a plurality of preset speech models according to the weight of each preset speech model, where the weight of each preset speech model represents the confidence of that model's recognition results; and processing the speech data to be processed with the at least one target speech model. This solves the prior-art problems that, when multiple speech recognition engines (i.e., speech models) are used for recognition, recognition takes a long time and the accuracy of the recognition result cannot be determined, ensuring flexible recognition of speech data and shortening the time needed to determine recognition accuracy.

Description

Speech data processing method and apparatus, storage medium, and electronic apparatus
The present disclosure claims priority to the Chinese patent application No. 202110744802.3, filed with the China Patent Office on June 30, 2021 and entitled "Speech data processing method and apparatus, storage medium, and electronic apparatus", the entire contents of which are incorporated into the present disclosure by reference.
Technical Field
The present disclosure relates to the field of communications, and in particular to a speech data processing method and apparatus, a storage medium, and an electronic apparatus.
Background
In existing voice dialogue systems, a voice interaction system acquires natural speech audio data from a user via an input device and feeds the audio data into one or more speech recognition engines to recognize the user's speech and obtain a recognition result.
Recognition with a single engine typically has its own problems, especially for large cloud models; each engine has its own strengths and weaknesses.
Typically, multi-engine use means feeding the user's speech data into multiple engines, collecting all of their recognition results, and then performing some computation to obtain the final result. However, different speech recognition engines have different interaction response times; if every engine is used, the system must wait for the last recognition result to arrive before making the subsequent decision. This way of trading time for a better recognition result makes the user wait too long in real interactions and severely harms the interaction experience.
For the problems in the related art that, when multiple speech recognition engines (i.e., speech models) are used for recognition, recognition takes a long time and the accuracy of the recognition result cannot be determined, no effective technical solution has yet been proposed.
Summary
Embodiments of the present disclosure provide a speech data processing method and apparatus, a storage medium, and an electronic apparatus, to at least solve the problems in the related art that, when multiple speech recognition engines (i.e., speech models) are used for recognition, recognition takes a long time and the accuracy of the recognition result cannot be determined.
According to one embodiment of the present disclosure, a speech data processing method is provided, including: acquiring speech data to be processed; determining at least one target speech model from a plurality of preset speech models according to the weight of each preset speech model among the plurality of preset speech models, the weight of each preset speech model representing the confidence of that model's recognition results; and processing the speech data to be processed with the at least one target speech model.
According to another embodiment of the present disclosure, a speech data processing apparatus is provided, including: an acquisition module, configured to acquire speech data to be processed; a configuration module, configured to perform recognition configuration on the speech data according to a preset recognition model, where the preset recognition model is a model for recognizing speech composed of a plurality of preset speech models and contains the weight corresponding to each preset speech model, the weight indicating the weighting coefficients of the recognition results and confidences of the different preset speech models; and a determination module, configured to, when the content corresponding to the recognition configuration is determined, determine at least one target speech model from the plurality of preset speech models to perform recognition processing on the speech data to be processed.
According to yet another embodiment of the present disclosure, a computer-readable storage medium is further provided, in which a computer program is stored, where the computer program is configured to perform the steps of any one of the above method embodiments when run.
According to yet another embodiment of the present disclosure, an electronic apparatus is further provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps of any one of the above method embodiments.
Through the present disclosure, speech data to be processed is acquired; at least one target speech model is determined from a plurality of preset speech models according to the weight of each preset speech model, the weight of each preset speech model representing the confidence of that model's recognition results; and the speech data to be processed is processed with the at least one target speech model. That is, by determining the weight of each preset speech model among the plurality of preset speech models and selecting from them at least one target speech model suited to processing the speech data to be processed, a more accurate speech result is fed back to the target object. This solves the prior-art problems that, when multiple speech recognition engines (i.e., speech models) are used for recognition, recognition takes a long time and the accuracy of the result cannot be determined, ensuring flexible recognition of speech data and shortening the time needed to determine recognition accuracy.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present disclosure and form part of this application. The exemplary embodiments of the present disclosure and their descriptions are used to explain the present disclosure and do not unduly limit it. In the drawings:
Fig. 1 is a block diagram of the hardware structure of a computer terminal running a speech data processing method according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of a speech data processing method according to an embodiment of the present disclosure;
Fig. 3 is a structural block diagram (I) of a speech data processing apparatus according to an embodiment of the present disclosure;
Fig. 4 is a structural block diagram (II) of a speech data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The present disclosure is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another.
It should be noted that the terms "first", "second", etc. in the specification, claims, and drawings of the present disclosure are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence.
The method embodiments provided in this application may be executed on a computer terminal or a similar computing device. Taking execution on a computer terminal as an example, Fig. 1 is a block diagram of the hardware structure of a computer terminal running a speech data processing method according to an embodiment of the present disclosure. As shown in Fig. 1, the computer terminal may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data. In an exemplary embodiment, the computer terminal may further include a transmission device 106 for communication functions and an input/output device 108. A person of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the computer terminal. For example, the computer terminal may include more or fewer components than shown in Fig. 1, or have a configuration with functions equivalent to or different from those shown in Fig. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the speech data processing method in the embodiments of the present disclosure. By running the computer programs stored in the memory 104, the processor 102 executes various functional applications and data processing, i.e., implements the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the computer terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is configured to receive or send data via a network. Specific examples of the network may include a wireless network provided by the communication provider of the computer terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In one example, the transmission device 106 may be a radio frequency (RF) module configured to communicate with the Internet wirelessly.
This embodiment provides a speech data processing method. Fig. 2 is a flowchart of a speech data processing method according to an embodiment of the present disclosure; the flow includes the following steps:
Step S202: acquire speech data to be processed;
Step S204: determine at least one target speech model from a plurality of preset speech models according to the weight of each preset speech model among the plurality of preset speech models, the weight of each preset speech model representing the confidence of that model's recognition results;
Step S206: process the speech data to be processed with the at least one target speech model.
Through the above steps, speech data to be processed is acquired; at least one target speech model is determined from a plurality of preset speech models according to the weight of each preset speech model, the weight of each preset speech model representing the confidence of that model's recognition results; and the speech data to be processed is processed with the at least one target speech model. That is, by determining the weight of each preset speech model among the plurality of preset speech models and selecting from them at least one target speech model suited to processing the speech data to be processed, a more accurate speech result is fed back to the target object. This solves the prior-art problems that, when multiple speech recognition engines (i.e., speech models) are used for recognition, recognition takes a long time and the accuracy of the result cannot be determined, ensuring flexible recognition of speech data and shortening the time needed to determine recognition accuracy.
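The selection step described above can be sketched in a few lines: rank the preset models by their weights (each weight representing the confidence of that model's recognition results) and pick the best one(s) to process the utterance. This is a minimal sketch; the model names and weight values are illustrative assumptions, not values from the disclosure.

```python
# Sketch: choose target model(s) by precomputed per-model weights.

def select_target_models(weights, top_k=1):
    """Return the names of the top_k models ranked by weight, descending."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical engines and weights for illustration only.
preset_weights = {"engine_a": 0.52, "engine_b": 0.31, "engine_c": 0.17}
targets = select_target_models(preset_weights, top_k=1)  # ['engine_a']
```

In a deployment, `top_k` would correspond to how many preset models the caller's permission allows, and the chosen model(s) would then receive the audio.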
It should be noted that the preset speech models can perform many kinds of recognition: there may be preset speech models for speech recognition, preset speech models for semantic understanding, and preset speech models for voiceprint recognition. The present disclosure does not unduly limit this; any similar model may serve as a preset speech model in the embodiments of the present disclosure.
In an exemplary embodiment, before acquiring the speech data to be processed, the method further includes: acquiring sample speech for training the plurality of preset speech models; processing the sample speech with each of the plurality of preset speech models to obtain the recognition result and confidence corresponding to each preset speech model; and determining the weights of the plurality of preset speech models according to the recognition result and confidence of each preset speech model.
It should be noted that the sample speech has the same parameter information as the speech data to be processed; specifically, the parameter information may include the user ID, voiceprint features, the targeted speech processing device (home appliance, robot, speaker, etc.), and so on.
It can be understood that, to ensure speech data can be recognized more quickly later on, after the processing accuracy of the speech data is determined, the accuracy of different recognition models for the same semantic category is determined according to the semantic category of the content of the speech data, yielding a speech data recognition list for the speech data. When speech data containing the same semantics is later encountered, the preset recognition model with higher recognition accuracy is selected from the speech data recognition list to perform the recognition operation.
In an exemplary embodiment, processing the sample speech with each of the plurality of preset speech models to obtain the recognition result corresponding to each preset speech model includes: acquiring standard recognition data of the sample speech, where the standard recognition data indicates the text content corresponding to a correct parse of the sample speech; determining the difference between the standard recognition data and the recognition data each preset speech model obtains from processing the sample speech; and determining each preset speech model's recognition result for the sample speech according to the difference.
In an exemplary embodiment, processing the sample speech with each of the plurality of preset speech models to obtain the confidence corresponding to each preset speech model includes: acquiring the confidence interval corresponding to the sample speech; determining the probability that the recognition value each preset speech model obtains from processing the sample speech falls within the confidence interval, where the recognition value indicates the number of word sequences that each preset speech model's recognition data for the sample speech has in common with the standard recognition data; and determining the confidence corresponding to each preset speech model according to the probability.
In other words, to keep the accuracy of speech data recognition within a safe range, the historical word error rates of the preset recognition models are filtered with a preset word-error-rate threshold, thereby keeping the word error rate of the preset recognition models that recognize the speech data within the range the target object allows.
In an exemplary embodiment, determining the weights of the plurality of preset speech models according to the recognition result and confidence of each preset speech model includes: acquiring multiple recognition results of the sample speech across the plurality of preset speech models, and determining a first feature vector of the sample speech according to the multiple recognition results; acquiring multiple confidences of the sample speech across the plurality of preset speech models, and determining a second feature vector of the sample speech according to the multiple confidences; and inputting the first feature vector and the second feature vector into a preset neural network model to obtain the weights of the plurality of preset speech models.
In an exemplary embodiment, before determining at least one target speech model from the plurality of preset speech models according to the weight of each preset speech model, the method further includes: determining identity information of the target object corresponding to the speech data to be processed; and determining the calling permission of the target object according to the identity information, where the calling permission indicates a list of models, among the plurality of preset speech models, that can process the speech data to be processed corresponding to the target object, and different preset recognition models are used to recognize speech data of different structures.
In short, because different target objects have different identity information, the preset recognition models available when calling a preset recognition model also differ. A target object can register its identity on the server in advance and, based on the registration result, be assigned calling permission for the corresponding preset recognition models. That is, once the target object has completed registration on the server and passed identity verification, one or more preset recognition models corresponding to the calling permission can be selected from the multiple preset recognition models configured on the server to process the speech data.
For a better understanding of the above speech data processing method, its flow is described below in combination with two optional embodiments.
In an intelligent voice dialogue system, to avoid affecting interaction response time, a traffic-reallocation (splitting) strategy is used that calls multiple general-purpose speech recognition engines to achieve the best user interaction experience. Existing multi-engine invocation usually recognizes the same user speech on multiple engines simultaneously, so the engines' response times are inconsistent and every request is bounded by the time at which all results have arrived. The final response time is therefore always the longest interaction time, which severely harms the user's interaction experience. Yet the advantage of multiple engines is clear: they can compensate for one another to reach the best recognition result.
To solve this problem, an optional embodiment of the present disclosure mainly provides an implementation of a splitting strategy based on multiple speech recognition engines. Using a strategy of periodically reallocating traffic, each utterance is recognized by only one engine, namely the engine that recognizes that speech best among all engines, and the engine used by each user is reallocated periodically so that the user's data best matches the engine, achieving the best recognition result and interaction experience. By dynamically calling different engines with this multi-engine dynamic splitting strategy, a more accurate recognition result is fed back to the user within the response time of a single-engine call, without harming the interaction experience.
As an optional implementation, the multi-engine recognition-result output solution includes the following steps:
Step 1. Based on the existing recognition system, part of the users' speech is fed into multiple engines simultaneously during human-machine dialogue, and the user data is screened and annotated to obtain the users' correct command requirements.
Step 2. The confidence values each engine obtains on the data from the above step are collected, and the proportion of the overall data reaching the threshold is determined according to each engine's threshold analysis.
Optionally, computing the confidence value: since these are general-purpose cloud models, confidence is computed according to each model's structure and results.
As an optional embodiment, the traditional model structure uses the posterior probability: the language model and acoustic model scores determine the best path, yielding the posterior-probability result. The formula for the best word sequence in speech recognition is as follows:
W* = argmax_W P(W)·P(X|W)
where P(W) is the language model score and P(X|W) is the acoustic model score.
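The decoding formula above can be illustrated with a tiny sketch: each candidate word sequence carries a language-model score P(W) and an acoustic-model score P(X|W), and the best sequence maximizes their product (equivalently, the sum of log-probabilities). The candidate sequences and score values below are illustrative assumptions only.

```python
import math

# Sketch: argmax over candidate sequences combining LM and AM log-scores.

def best_sequence(candidates):
    """candidates: (words, log_p_lm, log_p_am) triples; return the argmax."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]

cands = [
    ("turn on the light", math.log(0.4), math.log(0.5)),  # P(W)*P(X|W) = 0.20
    ("turn on the lied",  math.log(0.1), math.log(0.6)),  # P(W)*P(X|W) = 0.06
]
result = best_sequence(cands)  # "turn on the light"
```

Real decoders search a lattice of hypotheses rather than a flat list, but the scoring rule is the same.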
As another optional implementation, the confidence proportion can be computed: the confidence results of all data are computed from all engines and normalized with softmax.
For example, suppose there are m engines and n data items:
C_M = [ c_1(conf_{1..n} > thres_1) / c(total), ..., c_m(conf_{1..n} > thres_m) / c(total) ]
where c(total) is the total confidence value; c_m(conf_{1..n} > thres_m) indicates whether the confidence values obtained after engine m recognizes the n data items are greater than the preset average confidence of the M engines; and C_M is the vector formed by the proportions of the n data items that are credible on each of the M engines. The vector is normalized with the softmax function, as follows:
S_1 = softmax(C_M);
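The confidence-ratio step can be sketched as follows: for each of the m engines, count the fraction of the n utterances whose confidence exceeds that engine's threshold (giving C_M), then softmax-normalize to obtain S_1. All numeric values here are illustrative assumptions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    mx = max(xs)
    exps = [math.exp(x - mx) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def confidence_ratios(conf_per_engine, thresholds):
    """conf_per_engine[i]: the n confidences from engine i; returns C_M."""
    return [sum(c > t for c in confs) / len(confs)
            for confs, t in zip(conf_per_engine, thresholds)]

C_M = confidence_ratios(
    [[0.9, 0.7, 0.8],   # engine 1, three utterances
     [0.6, 0.5, 0.9]],  # engine 2
    [0.75, 0.55],       # per-engine thresholds
)
S1 = softmax(C_M)       # both engines pass 2/3 of utterances -> S1 = [0.5, 0.5]
```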
Optionally, computing the recognition-result proportion: each engine's recognition results are tallied using the word error rate (WER) of the recognition evaluation standard, with the formula:
W_M = [(1 − WER_1), ..., (1 − WER_m)];
where W_M is the vector of recognition accuracies, likewise normalized with the softmax function:
S_2 = softmax(W_M);
Combining the normalized results S_1 and S_2, a weighted average re-measures each engine's performance:
S = λ_1·S_1 + λ_2·S_2;
where λ_1, λ_2 ∈ R^m, and R^m is the set of weight coefficients corresponding to each engine. Taking S_1 and S_2 as two groups of m-dimensional feature vectors, a DNN model is trained with k-fold cross-validation to obtain the optimal λ_1 and λ_2, and thus the final allocation result S.
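The fusion S = λ_1·S_1 + λ_2·S_2 itself is a simple element-wise weighted sum, sketched below. In the disclosure the λ values are learned by DNN training with k-fold cross-validation; the fixed numbers here are illustrative assumptions only.

```python
# Sketch: fuse the two softmax-normalized per-engine score vectors.

def fuse(s1, s2, lam1, lam2):
    """Element-wise weighted sum of two per-engine score vectors."""
    return [l1 * a + l2 * b for a, b, l1, l2 in zip(s1, s2, lam1, lam2)]

S1 = [0.5, 0.3, 0.2]   # from confidence ratios (illustrative)
S2 = [0.4, 0.4, 0.2]   # from (1 - WER) accuracies (illustrative)
S = fuse(S1, S2, lam1=[0.6, 0.6, 0.6], lam2=[0.4, 0.4, 0.4])
```

Here each engine gets its own λ pair, matching λ_1, λ_2 ∈ R^m; a trained model would produce per-engine values rather than a shared constant.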
Step 3. S is sorted and the three engines with the top-three accuracies are selected (by default, their word error rates differ by no more than 10%). After renormalization, the final weight allocation scheme is obtained; that is, by configuring in the cloud which engines a user can call, the recognition rate is maximized while only the single best engine among the candidates is called.
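Step 3 can be sketched as follows: sort the fused scores S, keep the top three engines, and renormalize their scores into the final traffic-allocation weights. The engine names and score values are illustrative assumptions; the disclosure's 10% WER-gap check is omitted for brevity.

```python
# Sketch: select the top-k engines by fused score and renormalize.

def allocate(scores, top_k=3):
    """scores: {engine: fused score}; return renormalized top_k weights."""
    top = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    total = sum(s for _, s in top)
    return {name: s / total for name, s in top}

weights = allocate({"e1": 0.40, "e2": 0.30, "e3": 0.20, "e4": 0.10})
# e4 is dropped; the remaining weights sum to 1 (e1 -> 4/9, e2 -> 3/9, e3 -> 2/9).
```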
Step 4. Steps 1-3 are repeated periodically, automating the whole flow into dynamically reallocating engine calls according to the weights.
Optionally, according to the actual test results (WER) in Table 1 below, the dual-engine configuration performs best:
Table 1
[Table 1: per-configuration WER test results; the table image is not recoverable from this extraction.]
In summary, in the optional embodiments of the present disclosure, the confidences and recognition results of multiple engines are taken as feature vectors to train and tune the weight-coefficient model of the different engines, obtaining the best weight result. Engines are dynamically allocated according to the weight result, so that different users can call different engines and optimal recognition accuracy is achieved; the weight result is retrained periodically and engines are reallocated dynamically. In addition, the mixed invocation of multiple speech recognition engines improves recognition accuracy: a user command enters a single engine yet obtains the best recognition result of all engines, lowering response time. Further, since each engine's weight can be generated automatically, different engines can be called automatically, realizing a dynamic allocation strategy.
From the description of the above implementations, a person skilled in the art can clearly understand that the method of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, though in many cases the former is the better implementation. Based on this understanding, the technical solution of the present disclosure, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present disclosure.
This embodiment also provides a speech data processing apparatus, which is used to implement the above embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that realizes a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 3 is a structural block diagram of a speech data processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 3, the apparatus includes:
(1) an acquisition module 34, configured to acquire speech data to be processed;
(2) a configuration module 36, configured to determine at least one target speech model from a plurality of preset speech models according to the weight of each preset speech model, the weight of each preset speech model representing the confidence of that model's recognition results;
(3) a determination module 38, configured to process the speech data to be processed with the at least one target speech model.
Through the above apparatus, speech data to be processed is acquired; at least one target speech model is determined from a plurality of preset speech models according to the weight of each preset speech model, the weight of each preset speech model representing the confidence of that model's recognition results; and the speech data to be processed is processed with the at least one target speech model. That is, by determining the weight of each preset speech model among the plurality of preset speech models and selecting from them at least one target speech model suited to processing the speech data to be processed, a more accurate speech result is fed back to the target object. This solves the prior-art problems that, when multiple speech recognition engines (i.e., speech models) are used for recognition, recognition takes a long time and the accuracy of the result cannot be determined, ensuring flexible recognition of speech data and shortening the time needed to determine recognition accuracy.
It should be noted that the preset speech models can perform many kinds of recognition: there may be preset speech models for speech recognition, preset speech models for semantic understanding, and preset speech models for voiceprint recognition. The present disclosure does not unduly limit this; any similar model may serve as a preset speech model in the embodiments of the present disclosure.
Fig. 4 is a structural block diagram of another speech data processing apparatus according to an embodiment of the present disclosure. As shown in Fig. 4, the apparatus further includes: a sample module 30 and a permission module 32.
In an exemplary embodiment, the apparatus further includes: a sample module, configured to acquire sample speech for training the plurality of preset speech models; process the sample speech with each of the plurality of preset speech models to obtain the recognition result and confidence corresponding to each preset speech model; and determine the weights of the plurality of preset speech models according to the recognition result and confidence of each preset speech model.
It should be noted that the sample speech has the same parameter information as the speech data to be processed; specifically, the parameter information may include the user ID, voiceprint features, the targeted speech processing device (home appliance, robot, speaker, etc.), and so on.
It can be understood that, to ensure speech data can be recognized more quickly later on, after the processing accuracy of the speech data is determined, the accuracy of different recognition models for the same semantic category is determined according to the semantic category of the content of the speech data, yielding a speech data recognition list for the speech data. When speech data containing the same semantics is later encountered, the preset recognition model with higher recognition accuracy is selected from the speech data recognition list to perform the recognition operation.
In an exemplary embodiment, the sample module is further configured to acquire standard recognition data of the sample speech, where the standard recognition data indicates the text content corresponding to a correct parse of the sample speech; determine the difference between the standard recognition data and the recognition data each preset speech model obtains from processing the sample speech; and determine each preset speech model's recognition result for the sample speech according to the difference.
In an exemplary embodiment, the sample module is further configured to acquire the confidence interval corresponding to the sample speech; determine the probability that the recognition value each preset speech model obtains from processing the sample speech falls within the confidence interval, where the recognition value indicates the number of word sequences that each preset speech model's recognition data for the sample speech has in common with the standard recognition data; and determine the confidence corresponding to each preset speech model according to the probability.
In other words, to keep the accuracy of speech data recognition within a safe range, the historical word error rates of the preset recognition models are filtered with a preset word-error-rate threshold, thereby keeping the word error rate of the preset recognition models that recognize the speech data within the range the target object allows.
In an exemplary embodiment, the sample module is further configured to acquire multiple recognition results of the sample speech across the plurality of preset speech models and determine a first feature vector of the sample speech according to the multiple recognition results; acquire multiple confidences of the sample speech across the plurality of preset speech models and determine a second feature vector of the sample speech according to the multiple confidences; and input the first feature vector and the second feature vector into a preset neural network model to obtain the weights of the plurality of preset speech models.
In an exemplary embodiment, the apparatus further includes: a permission module, configured to determine identity information of the target object corresponding to the speech data to be processed, and determine the calling permission of the target object according to the identity information, where the calling permission indicates a list of models, among the plurality of preset speech models, that can process the speech data to be processed corresponding to the target object, and different preset recognition models are used to recognize speech data of different structures.
In short, because different target objects have different identity information, the preset recognition models available when calling a preset recognition model also differ. A target object can register its identity on the server in advance and, based on the registration result, be assigned calling permission for the corresponding preset recognition models. That is, once the target object has completed registration on the server and passed identity verification, one or more preset recognition models corresponding to the calling permission can be selected from the multiple preset recognition models configured on the server to process the speech data.
In the description of the present disclosure, it should be understood that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "front", "rear", "left", and "right" are based on the orientations or positional relationships shown in the drawings; they are only for convenience in describing the present disclosure and simplifying the description, and do not indicate or imply that the referenced devices or components must have a specific orientation or be constructed and operated in a specific orientation, and therefore cannot be understood as limiting the present disclosure. In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance.
In the description of the present disclosure, it should be noted that, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "joined" should be understood in a broad sense: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct, or indirect through an intermediary, or internal communication between two components. When a component is referred to as being "fixed on" or "disposed on" another element, it can be directly on the other component, or intervening components may also be present. When a component is considered to be "connected" to another element, it may be directly connected to the other element, or intervening elements may be present. For a person of ordinary skill in the art, the specific meanings of the above terms in the present disclosure can be understood according to the specific situation.
It should be noted that each of the above modules can be realized by software or hardware; for the latter, this can be achieved in the following ways, but is not limited to them: the above modules are all located in the same processor, or the above modules are located in different processors in any combination.
Embodiments of the present disclosure also provide a storage medium in which a computer program is stored, where the computer program is configured to perform the steps of any one of the above method embodiments when run.
In an exemplary embodiment, the storage medium may be configured to store a computer program for performing the following steps:
S1: acquire speech data to be processed;
S2: determine at least one target speech model from a plurality of preset speech models according to the weight of each preset speech model among the plurality of preset speech models, the weight of each preset speech model representing the confidence of that model's recognition results;
S3: process the speech data to be processed with the at least one target speech model.
In an exemplary embodiment, the storage medium may include, but is not limited to, various media that can store a computer program, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), removable hard disk, magnetic disk, or optical disc.
Embodiments of the present disclosure also provide an electronic apparatus, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps of any one of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to perform the following steps through a computer program:
S1: acquire speech data to be processed;
S2: determine at least one target speech model from a plurality of preset speech models according to the weight of each preset speech model among the plurality of preset speech models, the weight of each preset speech model representing the confidence of that model's recognition results;
S3: process the speech data to be processed with the at least one target speech model.
In an exemplary embodiment, for specific examples in this embodiment, reference may be made to the examples described in the above embodiments and optional implementations, which are not repeated here.
Obviously, those skilled in the art should understand that each module or step of the present disclosure described above can be realized by a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. In an exemplary embodiment, they may be implemented in program code executable by a computing device and thus stored in a storage device to be executed by a computing device; in some cases, the steps shown or described may be performed in a different order than here, or they may be fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. As such, the present disclosure is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present disclosure and are not intended to limit it; for those skilled in the art, the present disclosure may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the principles of the present disclosure shall be included within its scope of protection.

Claims (14)

  1. A speech data processing method, comprising:
    acquiring speech data to be processed;
    determining at least one target speech model from a plurality of preset speech models according to a weight of each preset speech model among the plurality of preset speech models, the weight of each preset speech model representing a confidence of recognition results of that preset speech model; and
    processing the speech data to be processed with the at least one target speech model.
  2. The method according to claim 1, wherein before acquiring the speech data to be processed, the method further comprises:
    acquiring sample speech for training the plurality of preset speech models;
    processing the sample speech with each of the plurality of preset speech models to obtain a recognition result and a confidence corresponding to each preset speech model; and
    determining the weights of the plurality of preset speech models according to the recognition result and the confidence corresponding to each preset speech model.
  3. The method according to claim 2, wherein processing the sample speech with each of the plurality of preset speech models to obtain the recognition result corresponding to each preset speech model comprises:
    acquiring standard recognition data of the sample speech, wherein the standard recognition data indicates text content corresponding to a correct parse of the sample speech;
    determining a difference between the standard recognition data and recognition data obtained by each preset speech model from processing the sample speech; and
    determining each preset speech model's recognition result for the sample speech according to the difference.
  4. The method according to claim 2, wherein processing the sample speech with each of the plurality of preset speech models to obtain the confidence corresponding to each preset speech model comprises:
    acquiring a confidence interval corresponding to the sample speech;
    determining a probability that a recognition value obtained by each preset speech model from processing the sample speech falls within the confidence interval, wherein the recognition value indicates the number of word sequences that the recognition data obtained by each preset speech model for the sample speech has in common with the standard recognition data; and
    determining the confidence corresponding to each preset speech model according to the probability.
  5. The method according to claim 2, wherein determining the weights of the plurality of preset speech models according to the recognition result and the confidence corresponding to each preset speech model comprises:
    acquiring multiple recognition results of the sample speech across the plurality of preset speech models, and determining a first feature vector of the sample speech according to the multiple recognition results;
    acquiring multiple confidences of the sample speech across the plurality of preset speech models, and determining a second feature vector of the sample speech according to the multiple confidences; and
    inputting the first feature vector and the second feature vector into a preset neural network model to obtain the weights of the plurality of preset speech models.
  6. The method according to claim 1, wherein before determining the at least one target speech model from the plurality of preset speech models according to the weight of each preset speech model, the method further comprises:
    determining identity information of a target object corresponding to the speech data to be processed; and
    determining a calling permission of the target object according to the identity information, wherein the calling permission indicates a list of models, among the plurality of preset speech models, capable of processing the speech data to be processed corresponding to the target object, and different preset recognition models are used to recognize speech data of different structures.
  7. A speech data processing apparatus, comprising:
    an acquisition module, configured to acquire speech data to be processed;
    a configuration module, configured to determine at least one target speech model from a plurality of preset speech models according to a weight of each preset speech model among the plurality of preset speech models, the weight of each preset speech model representing a confidence of recognition results of that preset speech model; and
    a determination module, configured to process the speech data to be processed with the at least one target speech model.
  8. The apparatus according to claim 7, wherein the apparatus further comprises:
    a sample module, configured to acquire sample speech for training the plurality of preset speech models; process the sample speech with each of the plurality of preset speech models to obtain a recognition result and a confidence corresponding to each preset speech model; and determine the weights of the plurality of preset speech models according to the recognition result and the confidence corresponding to each preset speech model.
  9. The apparatus according to claim 8, wherein the sample module is further configured to acquire standard recognition data of the sample speech, wherein the standard recognition data indicates text content corresponding to a correct parse of the sample speech; determine a difference between the standard recognition data and recognition data obtained by each preset speech model from processing the sample speech; and determine each preset speech model's recognition result for the sample speech according to the difference.
  10. The apparatus according to claim 8, wherein the sample module is further configured to acquire a confidence interval corresponding to the sample speech; determine a probability that a recognition value obtained by each preset speech model from processing the sample speech falls within the confidence interval, wherein the recognition value indicates the number of word sequences that the recognition data obtained by each preset speech model for the sample speech has in common with the standard recognition data; and determine the confidence corresponding to each preset speech model according to the probability.
  11. The apparatus according to claim 8, wherein the sample module is further configured to acquire multiple recognition results of the sample speech across the plurality of preset speech models and determine a first feature vector of the sample speech according to the multiple recognition results; acquire multiple confidences of the sample speech across the plurality of preset speech models and determine a second feature vector of the sample speech according to the multiple confidences; and input the first feature vector and the second feature vector into a preset neural network model to obtain the weights of the plurality of preset speech models.
  12. The apparatus according to claim 7, wherein the apparatus further comprises: a permission module, configured to determine identity information of a target object corresponding to the speech data to be processed, and determine a calling permission of the target object according to the identity information, wherein the calling permission indicates a list of models, among the plurality of preset speech models, capable of processing the speech data to be processed corresponding to the target object, and different preset recognition models are used to recognize speech data of different structures.
  13. A computer-readable storage medium, wherein a computer program is stored in the storage medium, and the computer program is configured to perform the method of any one of claims 1 to 6 when run.
  14. An electronic apparatus, comprising a memory and a processor, wherein a computer program is stored in the memory and the processor is configured to run the computer program to perform the method of any one of claims 1 to 6.
PCT/CN2022/096411 2021-06-30 2022-05-31 Speech data processing method and apparatus, storage medium, and electronic apparatus WO2023273776A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110744802.3 2021-06-30
CN202110744802.3A CN113593535B (zh) 2021-06-30 Speech data processing method and apparatus, storage medium, and electronic apparatus

Publications (1)

Publication Number Publication Date
WO2023273776A1 true WO2023273776A1 (zh) 2023-01-05

Family

ID=78245663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/096411 WO2023273776A1 (zh) 2022-05-31 Speech data processing method and apparatus, storage medium, and electronic apparatus

Country Status (1)

Country Link
WO (1) WO2023273776A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104795069A (zh) * 2014-01-21 2015-07-22 腾讯科技(深圳)有限公司 Speech recognition method and server
CN110148416A (zh) * 2019-04-23 2019-08-20 腾讯科技(深圳)有限公司 Speech recognition method, apparatus, device, and storage medium
CN111179934A (zh) * 2018-11-12 2020-05-19 奇酷互联网络科技(深圳)有限公司 Method for selecting a speech engine, mobile terminal, and computer-readable storage medium
CN111883122A (zh) * 2020-07-22 2020-11-03 海尔优家智能科技(北京)有限公司 Speech recognition method and apparatus, storage medium, and electronic device
CN111933117A (zh) * 2020-07-30 2020-11-13 腾讯科技(深圳)有限公司 Speech verification method and apparatus, storage medium, and electronic apparatus
CN112116910A (zh) * 2020-10-30 2020-12-22 珠海格力电器股份有限公司 Voice command recognition method and apparatus, storage medium, and electronic apparatus
CN113593535A (zh) * 2021-06-30 2021-11-02 青岛海尔科技有限公司 Speech data processing method and apparatus, storage medium, and electronic apparatus

Also Published As

Publication number Publication date
CN113593535A (zh) 2021-11-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831597

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE