WO2021253779A1 - Speech recognition method and system - Google Patents

Speech recognition method and system

Info

Publication number
WO2021253779A1
WO2021253779A1 (PCT/CN2020/138443, CN2020138443W)
Authority
WO
WIPO (PCT)
Prior art keywords
information
voice
text
speech
personalized
Prior art date
Application number
PCT/CN2020/138443
Other languages
English (en)
French (fr)
Inventor
温馨
党伟珍
Original Assignee
Shenzhen TCL New Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL New Technology Co., Ltd. filed Critical Shenzhen TCL New Technology Co., Ltd.
Publication of WO2021253779A1 publication Critical patent/WO2021253779A1/zh

Links

Images

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training
    • G10L15/065: Adaptation
    • G10L15/07: Adaptation to the speaker
    • G10L15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L15/26: Speech to text systems
    • G10L2015/0631: Creating reference templates; Clustering

Definitions

  • The present disclosure relates to the field of speech recognition technology, and in particular to a speech recognition method and system.
  • Natural language dialogue systems rely on artificial-intelligence technology to simulate the natural way people converse with one another, and have become a new form of human-computer interaction widely applied in terminal devices such as smart TVs, smart phones, smart speakers, and smart robots.
  • The realization of a natural-speech dialogue system relies mainly on manually trained machine learning models, such as a speech-to-text model (ASR), a natural-language-understanding model (NLP), and a text-to-speech model (TTS).
  • However, existing terminal devices all use a default machine learning model pre-trained by developers; when facing individual users, speech frequently cannot be understood or is misunderstood, so the natural language dialogue system cannot execute correctly and fails to satisfy the user's true intention.
  • The technical problem to be solved by the present disclosure is, in view of the deficiencies of the prior art, to provide a speech recognition method and system.
  • A first aspect of the present disclosure provides a speech recognition method, the method including: acquiring voiceprint information of the speech information to be recognized; determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determining, based on the personalized speech-to-text model, the text information corresponding to the speech information.
  • In one embodiment, the voiceprint information is the sound-wave spectrum corresponding to the speech information, and the voiceprint information of each user differs from that of every other user.
  • In one embodiment, before the acquiring of the voiceprint information of the speech information to be recognized, the method includes: acquiring a personalized training sample corresponding to the voiceprint information, wherein the personalized training sample includes several personalized training voice groups, each of which includes training voice data, the real text information corresponding to that training voice data, and the real confidence corresponding to that real text information, and the voiceprint information corresponding to each item of training voice data in every training voice group is the same; and training a preset network model on the personalized training sample to obtain the personalized speech-to-text model.
  • In one embodiment, the acquiring of the personalized training sample corresponding to the voiceprint information specifically includes: receiving input first speech information, acquiring its corresponding voiceprint information, and associating the voiceprint information with a preset data set so that it serves as the data identifier of that set; receiving the real text information and real confidence corresponding to the first speech information, to form one personalized voice group; and continuing to receive input second speech information together with its corresponding real text information and real confidence, until a preset number of personalized voice groups have been obtained, wherein the voiceprint information of the second speech information is the same as that of the first.
  • In one embodiment, the acquiring of the voiceprint information of the speech information to be recognized is specifically: acquiring it through a trained deep learning network model, wherein the deep learning network model is trained on a preset training sample set that includes several speech information groups, each group including speech information and the real voiceprint information corresponding to it.
  • In one embodiment, the determining of the personalized speech-to-text model corresponding to the speech information based on the voiceprint information specifically includes: detecting a personalized speech-to-text model corresponding to the voiceprint information; if one is detected, executing the step of determining, based on the personalized speech-to-text model, the text information corresponding to the speech information; and if none is detected, using a default speech-to-text model as the speech-to-text model corresponding to the speech information.
  • In one embodiment, the determining of the text information corresponding to the speech information based on the personalized speech-to-text model specifically includes: inputting the speech information into the personalized speech-to-text model and outputting, through it, the reference text information corresponding to the speech information; when the confidence of the reference text information is less than a preset confidence threshold, inputting the speech information into a preset default speech-to-text model and determining, through the default model, the target text information corresponding to the speech information; and using the target text information as the text information corresponding to the speech information.
  • In one embodiment, the outputting, through the personalized speech-to-text model, of the text information corresponding to the speech information includes: when the confidence of the reference text information is greater than or equal to the preset confidence threshold, using the reference text information as the text information corresponding to the speech information.
  • In one embodiment, the default speech-to-text model is deployed on the server and is used to convert speech information of any voiceprint into text information.
  • In one embodiment, the default speech-to-text model is trained on a speech-data sample set, wherein the set includes several speech data samples and the text information corresponding to each sample, the samples being speech data samples in Mandarin.
  • In one embodiment, the personalized speech-to-text model is deployed on a terminal device, namely the terminal device that acquires the voiceprint information of the speech information to be recognized.
  • In one embodiment, after the text information corresponding to the speech information is determined, the method further includes: determining, based on the text information, a response voice corresponding to the speech information, and playing the response voice so that the user corresponding to the speech information receives the response.
  • A second aspect of the present disclosure provides a speech recognition system. The speech recognition system includes a terminal device and a server; a personalized speech-to-text model is deployed on the terminal device, and a default speech-to-text model is deployed on the server.
  • The terminal device is configured to acquire the voiceprint information of the speech information to be recognized; determine, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determine, based on the personalized speech-to-text model, the text information corresponding to the speech information.
  • The server is configured to determine, through the default speech-to-text model, the target text information corresponding to the speech information when the confidence of the text information is less than a preset confidence threshold, and to use the target text information as the text information corresponding to the speech information.
  • A third aspect of the present disclosure provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of any of the speech recognition methods described above.
  • A fourth aspect of the present disclosure provides a terminal device, which includes a processor, a memory, and a communication bus; the memory stores a computer-readable program executable by the processor; the communication bus realizes connection and communication between the processor and the memory; and the processor, when executing the computer-readable program, implements the steps of any of the speech recognition methods described above.
  • The present disclosure provides a speech recognition method, the method including: acquiring voiceprint information of the speech information to be recognized; determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determining, based on the personalized speech-to-text model, the text information corresponding to the speech information.
  • Because the voiceprint of the speech data used to train the personalized speech-to-text model is the same as the voiceprint of the acquired speech information, the manner of speech expression of the acquired speech information matches that of the speech data the model was trained on. The personalized model therefore improves the accuracy of recognizing the speech information, better captures the user's intention, and is more convenient for the user.
  • Fig. 1 is a flowchart of the speech recognition method provided by the present disclosure.
  • Fig. 2 is a structural schematic diagram of the speech recognition system provided by the present disclosure.
  • Fig. 3 is a structural schematic diagram of the terminal device provided by the present disclosure.
  • The inventors found through research that current terminal devices (smart TVs, smart phones, smart speakers, smart robots, etc.) generally use natural-speech dialogue systems for voice interaction, and the speech-to-text models those systems use are all the default speech-to-text model configured on the server.
  • However, the users facing each terminal device differ from one another, and each user's manner of speech also differs (for example, some users speak dialects and some users' pronunciation is inaccurate), so the speech-to-text model frequently fails to understand speech or misunderstands it, causing the natural language dialogue system to fail to execute correctly and to miss the user's true intention.
  • To solve this, in the embodiments of the present disclosure, when the speech information to be recognized is acquired, the voiceprint information of the speech information is acquired, and the personalized speech-to-text model corresponding to that voiceprint information is used to determine the text information corresponding to the speech information.
  • The personalized speech-to-text model is trained on speech data corresponding to the voiceprint information, so the manner of speech expression of the acquired speech information is the same as that of the speech data used by the model. The personalized model therefore improves the accuracy of recognizing the speech information, better captures the user's intention, and is more convenient for the user.
  • For example, the embodiments of the present disclosure can be applied to a terminal device.
  • The terminal device can receive speech information, obtain the voiceprint information of the speech information, determine the personalized speech-to-text model corresponding to the speech information based on the voiceprint information, and determine the text information corresponding to the speech information based on the personalized speech-to-text model.
  • Although the actions of the embodiments of the present disclosure are described as being performed by terminal devices, these actions may also be performed partly by the terminal device and partly by a server connected to the terminal device. For example, the terminal device receives speech information and sends it to the server; the server responds to the speech information sent by the terminal device, obtains its voiceprint information, determines the personalized speech-to-text model corresponding to the speech information, and determines, based on that model, the text information corresponding to the speech information.
  • The present disclosure is not limited with respect to the executing entity, as long as the actions disclosed in the embodiments of the present disclosure are performed.
  • This embodiment provides a speech recognition method, as shown in Fig. 1. The method includes:
  • S10: Acquire voiceprint information of the speech information to be recognized.
  • The speech information to be recognized may be speech input by the user (for example, speech picked up by a microphone), speech sent by an external device, or speech downloaded from the network (for example, from Baidu).
  • The voiceprint information is the sound-wave spectrum corresponding to the speech information, and each user's voiceprint information is different. Therefore, after the speech information is acquired, the voiceprint information serves as identification information of the speech information to be recognized, so that the personalized speech-to-text model corresponding to the speech information can subsequently be determined based on this identification.
  • After the speech information to be recognized is acquired, its voiceprint information can be obtained with traditional algorithms, for example the hidden Markov model (HMM) approach, which usually uses a single-state HMM or a Gaussian mixture model (GMM); a deep learning network model can also be used to obtain the voiceprint information corresponding to the speech information.
  • The deep learning network model can be trained on a preset training sample set that includes several speech information groups, each group including speech information and the real voiceprint information corresponding to it.
  • In practice, recognizing the voiceprint information of speech information can be packaged as a functional module: once the speech information to be recognized is acquired, the module outputs the voiceprint information corresponding to it.
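  • By way of illustration only, the following minimal sketch shows one shape such a voiceprint functional module could take, using per-user Gaussian mixture models as mentioned above. The class, its method names, and the GMM settings are assumptions made for this sketch, not the implementation of this disclosure.

```python
# Sketch of a voiceprint-identification module: enroll one GMM per user over
# acoustic features (e.g., MFCC frames), then identify an utterance by the
# model with the highest average log-likelihood. Names are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

class VoiceprintModule:
    def __init__(self, n_components: int = 8):
        self.n_components = n_components
        self.models: dict[str, GaussianMixture] = {}  # voiceprint id -> GMM

    def enroll(self, voiceprint_id: str, features: np.ndarray) -> None:
        # features: (n_frames, n_dims) acoustic features for one user's speech
        gmm = GaussianMixture(n_components=self.n_components, covariance_type="diag")
        gmm.fit(features)
        self.models[voiceprint_id] = gmm

    def identify(self, features: np.ndarray) -> str | None:
        # Score the utterance under every enrolled model; best score wins.
        if not self.models:
            return None
        scores = {vid: gmm.score(features) for vid, gmm in self.models.items()}
        return max(scores, key=scores.get)
```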
  • S20: Determine a personalized speech-to-text model corresponding to the speech information based on the voiceprint information.
  • The personalized speech-to-text model is trained on the speech data corresponding to the voiceprint information; it corresponds to that voiceprint information and is used to convert speech information carrying that voiceprint into text information.
  • The speech data is speech data produced by the user corresponding to the voiceprint information, for example a segment of speech spoken by that user.
  • The personalized speech-to-text models corresponding to different voiceprints differ from one another. For example, for voiceprint information A and voiceprint information B, which are different, the personalized speech-to-text model corresponding to voiceprint A differs from the one corresponding to voiceprint B.
  • Personalized speech-to-text model A is trained on the speech data corresponding to voiceprint A, that speech data being produced by user A (for example, a segment of speech recorded by user A); personalized speech-to-text model B is trained on the speech data corresponding to voiceprint B, that speech data being produced by user B (for example, a segment of speech recorded by user B).
  • Then, when speech information A whose voiceprint is voiceprint A is acquired, personalized model A can serve as the personalized speech-to-text model corresponding to speech information A; but when speech information B whose voiceprint is voiceprint B is acquired, personalized model A cannot serve as the personalized speech-to-text model corresponding to speech information B.
  • The personalized speech-to-text model may be deployed on the terminal device or on a server connected to the terminal device, where the terminal device is the device that executes this speech recognition method, for example a smart phone or a tablet computer.
  • When the terminal device hosts the personalized speech-to-text model, the terminal device can perform the operation of determining the personalized speech-to-text model corresponding to the speech information based on the voiceprint information; when the personalized speech-to-text model is deployed on the server, the terminal device sends the speech information and its voiceprint information to the server, and the server determines the personalized speech-to-text model corresponding to the voiceprint information and recognizes, based on the determined model, the text information corresponding to the speech information.
  • In one implementation of this embodiment, the personalized speech-to-text model is deployed on the terminal device, so the terminal device does not need to communicate with the server, which reduces communication delay and improves the timeliness of speech recognition; at the same time, speech information can be recognized even without a network connection, which improves the success rate of speech-information recognition.
  • The personalized speech-to-text model is a pre-trained network model, and its training process can be executed by the terminal device or by a server connected to the terminal device.
  • When the personalized speech-to-text model is trained by the server connected to the terminal device, the server, after obtaining the model through training, deploys it on the terminal device corresponding to the model.
  • The terminal device corresponding to the personalized speech-to-text model may be the terminal device that sent the server the personalized training samples for that model, or a terminal device whose voiceprint information is the same as the voiceprint information corresponding to the model.
  • The training process of the personalized speech-to-text model may be: H10, acquiring the personalized training sample corresponding to the voiceprint information; and H20, training a preset network model on the personalized training sample to obtain the personalized speech-to-text model.
  • In step H10, the personalized training sample includes several personalized training voice groups, each corresponding to the voiceprint information. Each of these voice groups includes training voice data, the real text information corresponding to that training voice data, and the real confidence corresponding to that real text information.
  • The voiceprint information corresponding to each item of training voice data in every training voice group is the same, while the voice content of each item of training voice data differs.
  • For example, if the personalized training sample includes training voice data A and training voice data B, then the voiceprint information corresponding to A is the same as the voiceprint information corresponding to B, while the voice content contained in A differs from that contained in B; for instance, the voice content of A is "apple" and the voice content of B is "banana".
  • The real text information is the textual expression of the voice content contained in the training voice data; for example, if the voice content of the training voice data is the speech corresponding to "apple", the real text information is "apple".
  • The real confidence is the credibility of the real text information corresponding to the voice data: the higher the real confidence, the more credible the real text information; conversely, the lower the real confidence, the less credible it is. In one implementation of this embodiment, the real confidence may simply be 1.
  • The personalized training samples may be generated from the speech information input by the user and received by the terminal device.
  • The generation process may include: first, when a control instruction to establish a personalized training sample is received, establishing a preset data set in response, where the preset data set stores each personalized voice group of the sample; second, receiving input first speech information, acquiring its voiceprint information, and associating that voiceprint information with the preset data set as the set's identifier; third, receiving the real text information and real confidence corresponding to the first speech information, to form one personalized voice group; and finally, continuing to receive input second speech information together with its real text information and real confidence, until a preset number of personalized voice groups have been obtained.
  • The preset number is set in advance, for example 1000.
  • Whenever an item of speech information is obtained, it must be checked whether its voiceprint information matches the voiceprint information of the data set: if it matches, the speech information with its real text information and real confidence forms a personalized voice group; if not, a voice-input error is prompted and the speech information is discarded.
  • When the number of mismatches reaches a preset number of times, the generation of the personalized training samples is stopped.
  • The personalized training samples can be generated over multiple sessions. For each session, when the first item of speech information is obtained, it can be determined whether a data set corresponding to that speech information's voiceprint already exists. If it exists, the speech information, its real text information, and the real confidence are stored in that data set; if not, a data set is created for the speech information, and the speech information, its real text information, and the real confidence are stored in the newly created set. A sketch of this bookkeeping follows.
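  • A minimal sketch of that per-voiceprint bookkeeping, under assumed names and structures (the VoiceGroup fields mirror the training voice data, real text information, and real confidence described above; the mismatch limit stands in for the preset number of times after which collection stops):

```python
# Sketch: data sets keyed by voiceprint; a sample is accepted only when its
# observed voiceprint matches the set it targets, otherwise it is discarded.
from dataclasses import dataclass, field

@dataclass
class VoiceGroup:
    audio: bytes                  # training voice data
    real_text: str                # real text information
    real_confidence: float = 1.0  # the description notes this may simply be 1

@dataclass
class TrainingSampleStore:
    datasets: dict[str, list[VoiceGroup]] = field(default_factory=dict)
    max_mismatches: int = 5       # preset number of allowed mismatches
    mismatches: int = 0

    def add(self, target_voiceprint: str, observed_voiceprint: str,
            group: VoiceGroup) -> bool:
        if observed_voiceprint != target_voiceprint:
            self.mismatches += 1
            if self.mismatches >= self.max_mismatches:
                raise RuntimeError("too many voiceprint mismatches; stopping")
            return False  # prompt a voice-input error and discard the utterance
        # Create the data set on first use, then store the personalized group.
        self.datasets.setdefault(target_voiceprint, []).append(group)
        return True

    def ready(self, voiceprint_id: str, preset_count: int = 1000) -> bool:
        return len(self.datasets.get(voiceprint_id, [])) >= preset_count
```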
  • In step H20, training the preset network model on the personalized training sample specifically includes: inputting the training voice data of a personalized voice group in the sample into the preset network model, which outputs the generated text information corresponding to that training voice data and the generated confidence corresponding to the generated text information; correcting the network parameters of the preset network model according to the real text information, the real confidence, the generated text information, and the generated confidence; and continuing with the training voice data of the next personalized voice group in the sample, until the preset network model satisfies a preset condition, which yields the personalized speech-to-text model.
  • The preset network model is a deep learning network model.
  • The generated text information is the text information corresponding to the voice content recognized from the training voice data by the preset network model.
  • The generated confidence is the credibility of the generated text information: the higher the generated confidence, the more credible the generated text information; conversely, the lower the generated confidence, the less credible it is.
  • In one implementation of this embodiment, the generated confidence takes values in the range 0-1, and the larger its value, the more credible the generated text information; the smaller its value, the less credible.
  • For example, generated text information with a confidence of 0.9 is more credible than generated text information with a confidence of 0.1.
  • The preset condition includes the loss function value meeting a preset requirement or the training count reaching a preset number.
  • The preset requirement may be determined according to the accuracy required of the preset network model and is not detailed here.
  • The preset number may be the maximum training count of the initial network model, for example 2000.
  • After the preset network model outputs the generated text information and generated confidence, the loss function value of the model is computed from the real text information, the real confidence, the generated text information, and the generated confidence. Once the loss value is obtained, it is judged against the preset requirement.
  • If the loss function value meets the preset requirement, training ends; if not, it is judged whether the training count of the preset network model has reached the preset number. If the preset number has not been reached, the network parameters of the preset network model are corrected according to the loss value; if it has been reached, training ends. Judging whether training is complete through both the loss function value and the training count avoids the training entering an endless loop when the loss value cannot meet the preset requirement; the sketch below captures this stopping rule.
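  • A minimal sketch of that stopping rule, with the model, loss function, and parameter update passed in as placeholders (none of these names come from this disclosure):

```python
# Sketch: training ends when the loss meets the preset requirement or when
# the training count reaches the preset maximum, avoiding an endless loop.
def train(model, samples, loss_fn, update_fn,
          loss_target: float = 0.01, max_steps: int = 2000):
    steps = 0
    while True:
        for audio, real_text, real_conf in samples:  # samples: non-empty list
            gen_text, gen_conf = model(audio)
            loss = loss_fn(real_text, real_conf, gen_text, gen_conf)
            steps += 1
            if loss <= loss_target:   # loss function value meets the requirement
                return model
            if steps >= max_steps:    # preset training count reached
                return model
            update_fn(model, loss)    # correct the network parameters
```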
  • In one implementation of this embodiment, determining the personalized speech-to-text model corresponding to the speech information based on the voiceprint information specifically includes: S21, detecting the personalized speech-to-text model corresponding to the voiceprint information; S22, if one is detected, executing the step of determining the text information based on it; and S23, if none is detected, using the default speech-to-text model as the speech-to-text model corresponding to the speech information.
  • Detecting the personalized speech-to-text model corresponding to the voiceprint information can be understood as determining, based on the voiceprint information, whether the terminal device itself and/or the server connected to it has deployed the personalized speech-to-text model corresponding to that voiceprint. When the personalized model is deployed, speech recognition can be performed with it, so the personalized model serves as the speech-to-text model corresponding to the speech information; when it is not deployed, speech recognition cannot be performed with it, and the default speech-to-text model serves as the speech-to-text model corresponding to the speech information. The sketch after this paragraph illustrates the selection.
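  • A sketch of steps S21 through S23 under assumed registries (the two dictionaries standing in for the models deployed on the device and on the server are hypothetical):

```python
# Sketch: look up a personalized model for this voiceprint on the device
# and/or the server; fall back to the default model when none is deployed.
def select_model(voiceprint_id, device_models: dict, server_models: dict,
                 default_model):
    personalized = (device_models.get(voiceprint_id)
                    or server_models.get(voiceprint_id))   # S21: detect
    if personalized is not None:
        return personalized                                # S22: use it
    return default_model                                   # S23: fall back
```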
  • The default speech-to-text model is a speech-to-text model configured on the server and used to convert any speech information into text information.
  • The default speech-to-text model can be trained on a speech-data sample set formed in Mandarin.
  • The speech-data sample set includes several speech data samples and the text information corresponding to each sample; the samples are speech data samples in Mandarin, here called general speech data samples.
  • The training process of the default speech-to-text model may include: inputting the Mandarin speech data samples into the default speech-to-text model, outputting through it the predicted text information corresponding to the general speech data samples, and training the default model on the predicted text information and the text information corresponding to the general speech data samples, to obtain the trained default speech-to-text model.
  • After training is complete, the default speech-to-text model is configured on the server as the speech-to-text model that converts any speech information into text information.
  • From its training process it can be seen that the default model's input during training is general speech data samples and its output is predicted text information, while in use its input is speech data and its output is text information.
  • The speech data may be Mandarin speech data, dialect speech data, or accented speech data. It should be understood that the default speech-to-text model is not tied to any voiceprint information: when the default model is used, speech information of any voiceprint can be recognized through it.
  • When no personalized speech-to-text model corresponding to the voiceprint information is detected, the default speech-to-text model is used to recognize the speech information; in this way, even a user who has not set up a personalized model can still obtain the text information corresponding to the speech, carry out human-computer interaction, and use the device conveniently.
  • S30: Determine the text information corresponding to the speech information based on the personalized speech-to-text model.
  • The text information is the text output by the personalized speech-to-text model after the speech information is input into it; the text information expresses the voice content of the speech information. For example, if the text information is "apple", the voice content contained in the speech information is the speech corresponding to "apple".
  • Accordingly, this step may be: inputting the speech information into the personalized speech-to-text model and outputting, through it, the text information corresponding to the speech information, thereby obtaining that text information.
  • From the training process it can be seen that the outputs of the personalized speech-to-text model include the text information corresponding to the speech information and the confidence of that text information. Therefore, in an implementation of this embodiment, determining the text information corresponding to the speech information based on the personalized speech-to-text model specifically includes:
  • inputting the speech information into the personalized speech-to-text model and outputting, through it, the reference text information corresponding to the speech information;
  • when the confidence of the reference text information is less than the preset confidence threshold, inputting the speech information into the preset default speech-to-text model, determining through the default model the target text information corresponding to the speech information, and using the target text information as the text information corresponding to the speech information;
  • when the confidence of the reference text information is greater than or equal to the preset confidence threshold, using the reference text information as the text information corresponding to the speech information.
  • The reference text information is output by the personalized speech-to-text model, and the confidence of the reference text information is likewise output by the personalized speech-to-text model.
  • That is, the personalized speech-to-text model outputs both the reference text information corresponding to the speech information and the confidence of that reference text information.
  • The confidence threshold is set in advance and serves as the standard for judging whether the reference text information can be used as the text information corresponding to the speech information: when the confidence is greater than or equal to the threshold, the reference text information can serve as the text information corresponding to the speech information; when the confidence is less than the threshold, it cannot. For example, if the confidence of the reference text information is 0.7 and the preset confidence threshold is 0.8, the reference text information cannot be used as the text information corresponding to the speech information.
  • When the confidence is less than the threshold, the speech information is input into the default speech-to-text model, which outputs the target text information corresponding to the speech information, and the target text information is used as the text information corresponding to the speech information.
  • This avoids erroneous text output caused by insufficient personalized training samples and improves the recognition accuracy of the text information.
  • In practice, once the target text information is obtained, its confidence can be compared with the confidence of the reference text information, and the candidate with the higher confidence selected as the text information corresponding to the speech information, which further ensures accuracy. For example, if the confidence of the reference text information is 0.7 and the confidence of the target text information is 0.75, the target text information is used as the text information corresponding to the speech information. A sketch of this routing follows.
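  • A sketch of this confidence-gated routing, assuming each model is callable and returns a (text, confidence) pair (an interface chosen for the sketch, not specified by this disclosure):

```python
# Sketch: keep the personalized model's reference text when its confidence
# clears the preset threshold; otherwise run the default model and keep
# whichever of the two candidates scores higher.
def transcribe(audio, personalized_model, default_model,
               threshold: float = 0.8) -> str:
    ref_text, ref_conf = personalized_model(audio)
    if ref_conf >= threshold:
        return ref_text
    target_text, target_conf = default_model(audio)
    # Optional refinement described above: compare the two confidences.
    return target_text if target_conf >= ref_conf else ref_text
```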
  • In summary, this embodiment provides a speech recognition method that includes: acquiring voiceprint information of the speech information to be recognized; determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determining, based on the personalized speech-to-text model, the text information corresponding to the speech information.
  • Because the voiceprint of the speech data used to train the personalized speech-to-text model is the same as the voiceprint of the acquired speech information, the manner of speech expression of the acquired speech information matches that of the speech data the model was trained on; the personalized model therefore improves the accuracy of recognizing the speech information, better captures the user's intention, and is more convenient for the user.
  • This embodiment also provides a speech recognition system.
  • The speech recognition system includes a terminal device and a server; a personalized speech-to-text model is deployed on the terminal device, and a default speech-to-text model is deployed on the server.
  • The terminal device is used to acquire the voiceprint information of the speech information to be recognized; determine, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on the speech data corresponding to the voiceprint information; and determine, based on the personalized speech-to-text model, the text information corresponding to the speech information.
  • The server is configured to determine, through the default speech-to-text model, the target text information corresponding to the speech information when the confidence of the text information is less than a preset confidence threshold, and to use the target text information as the text information corresponding to the speech information.
  • This embodiment provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech recognition method described in the above embodiments.
  • The present disclosure also provides a terminal device, as shown in Fig. 3, which includes at least one processor 20, a display screen 21, and a memory 22, and may also include a communication interface 23 and a bus 24.
  • the processor 20, the display screen 21, the memory 22, and the communication interface 23 can communicate with each other through the bus 24.
  • the display screen 21 is set to display a user guide interface preset in the initial setting mode.
  • the communication interface 23 can transmit information.
  • the processor 20 may call the logic instructions in the memory 22 to execute the method in the foregoing embodiment.
  • The logic instructions in the memory 22 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium.
  • the memory 22 can be configured to store software programs and computer-executable programs, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure.
  • the processor 20 executes functional applications and data processing by running software programs, instructions or modules stored in the memory 22, that is, implements the methods in the foregoing embodiments.
  • the memory 22 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal device, and the like.
  • The memory 22 may include high-speed random-access memory and may also include non-volatile memory.
  • Media that can store program code include a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, and an optical disc; a transient storage medium may also be used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech recognition method and system are provided. The method includes: acquiring voiceprint information of the speech information to be recognized (S10); determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information (S20); and determining, based on the personalized speech-to-text model, the text information corresponding to the speech information (S30). The voiceprint of the speech data used to train the personalized speech-to-text model used by the method is the same as the voiceprint of the acquired speech information, so the manner of speech expression of the acquired speech information matches that of the speech data used by the personalized speech-to-text model; the personalized model therefore improves the accuracy of recognizing the speech information, better captures the user's intention, and is convenient for the user.

Description

Speech recognition method and system
Priority
This disclosure claims priority to the Chinese patent application filed with the China Patent Office on June 19, 2020, with application number "202010566791X" and application title "Speech recognition method and system", the entire contents of which are incorporated into this disclosure by reference.
Technical Field
The present disclosure relates to the field of speech recognition technology, and in particular to a speech recognition method and system.
Background
Natural language dialogue systems rely on artificial-intelligence technology to simulate the natural way people converse with one another, and have become a new form of human-computer interaction widely applied in terminal devices such as smart TVs, smart phones, smart speakers, and smart robots. The realization of a natural-speech dialogue system relies mainly on manually trained machine learning models, such as a speech-to-text model (ASR), a natural-language-understanding model (NLP), and a text-to-speech model (TTS). However, existing terminal devices all use default machine learning models pre-trained by developers; when facing individual users, speech frequently cannot be understood or is misunderstood (for example, when a user's pronunciation is non-standard or the user speaks a dialect), causing the natural language dialogue system to fail to execute correctly and thus to miss the user's true intention.
Summary
The technical problem to be solved by the present disclosure is, in view of the deficiencies of the prior art, to provide a speech recognition method and system.
To solve the above technical problem, a first aspect of the present disclosure provides a speech recognition method, the method including:
acquiring voiceprint information of speech information to be recognized;
determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information;
determining, based on the personalized speech-to-text model, the text information corresponding to the speech information.
In one embodiment, the voiceprint information is the sound-wave spectrum corresponding to the speech information, and the voiceprint information of each user differs from that of every other user.
In one embodiment, before the acquiring of the voiceprint information of the speech information to be recognized, the method includes:
acquiring a personalized training sample corresponding to the voiceprint information, wherein the personalized training sample includes several personalized training voice groups, each of which includes training voice data, the real text information corresponding to that training voice data, and the real confidence corresponding to that real text information; the voiceprint information corresponding to each item of training voice data in every personalized training voice group is the same;
training a preset network model on the personalized training sample to obtain the personalized speech-to-text model.
In one embodiment, the acquiring of the personalized training sample corresponding to the voiceprint information specifically includes:
receiving input first speech information, acquiring the voiceprint information corresponding to the first speech information, and associating the voiceprint information with a preset data set so that the voiceprint information serves as the data identifier of the preset data set;
receiving the real text information and real confidence corresponding to the first speech information, to form one personalized voice group;
continuing to receive input second speech information together with its corresponding real text information and real confidence, until a preset number of personalized voice groups have been obtained, so as to obtain the personalized training sample, wherein the voiceprint information corresponding to the second speech information is the same as the voiceprint information corresponding to the first speech information.
In one embodiment, the acquiring of the voiceprint information of the speech information to be recognized is specifically:
acquiring the voiceprint information of the speech information to be recognized through a trained deep learning network model, wherein the deep learning network model is trained on a preset training sample set, the preset training sample set including several speech information groups, each of which includes speech information and the real voiceprint information corresponding to the speech information.
In one embodiment, the determining, based on the voiceprint information, of the personalized speech-to-text model corresponding to the speech information specifically includes:
detecting a personalized speech-to-text model corresponding to the voiceprint information;
if a personalized speech-to-text model corresponding to the voiceprint information is detected, executing the step of determining, based on the personalized speech-to-text model, the text information corresponding to the speech information;
if no personalized speech-to-text model corresponding to the voiceprint information is detected, using a default speech-to-text model as the speech-to-text model corresponding to the speech information.
In one embodiment, the determining, based on the personalized speech-to-text model, of the text information corresponding to the speech information specifically includes:
inputting the speech information into the personalized speech-to-text model, and outputting, through the personalized speech-to-text model, the reference text information corresponding to the speech information;
when the confidence of the reference text information is less than a preset confidence threshold, inputting the speech information into a preset default speech-to-text model, and determining, through the default speech-to-text model, the target text information corresponding to the speech information;
using the target text information as the text information corresponding to the speech information.
In one embodiment, the outputting, through the personalized speech-to-text model, of the text information corresponding to the speech information includes:
when the confidence of the reference text information is greater than or equal to the preset confidence threshold, using the reference text information as the text information corresponding to the speech information.
In one embodiment, the default speech-to-text model is deployed on a server, and the default speech-to-text model is used to convert speech information corresponding to any voiceprint information into text information.
In one embodiment, the default speech-to-text model is trained on a speech-data sample set, wherein the speech-data sample set includes several speech data samples and the text information corresponding to each speech data sample, the speech data samples being speech data samples in Mandarin.
In one embodiment, the personalized speech-to-text model is deployed on a terminal device, the terminal device being the terminal device that performs the acquiring of the voiceprint information of the speech information to be recognized.
In one embodiment, after the determining, based on the personalized speech-to-text model, of the text information corresponding to the speech information, the method further includes:
determining, based on the text information, a response voice corresponding to the speech information, and playing the response voice, so that the user corresponding to the speech information receives the response.
A second aspect of the present disclosure provides a speech recognition system. The speech recognition system includes a terminal device and a server; a personalized speech-to-text model is deployed on the terminal device, and a default speech-to-text model is deployed on the server;
the terminal device is the terminal device that performs the acquiring of the voiceprint information of the speech information to be recognized; determines, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determines, based on the personalized speech-to-text model, the text information corresponding to the speech information;
the server is configured to determine, through the default speech-to-text model, the target text information corresponding to the speech information when the confidence of the text information is less than a preset confidence threshold, and to use the target text information as the text information corresponding to the speech information.
A third aspect of the present disclosure provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of any of the speech recognition methods described above.
A fourth aspect of the present disclosure provides a terminal device, which includes a processor, a memory, and a communication bus; the memory stores a computer-readable program executable by the processor;
the communication bus realizes connection and communication between the processor and the memory;
the processor, when executing the computer-readable program, implements the steps of any of the speech recognition methods described above.
Beneficial effects: compared with the prior art, the present disclosure provides a speech recognition method including: acquiring voiceprint information of the speech information to be recognized; determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determining, based on the personalized speech-to-text model, the text information corresponding to the speech information. The voiceprint of the speech data used to train the personalized speech-to-text model is the same as the voiceprint of the acquired speech information, so the manner of speech expression of the acquired speech information matches that of the speech data the model was trained on; the personalized model therefore improves the accuracy of recognizing the speech information, better captures the user's intention, and is convenient for the user.
Brief Description of the Drawings
Fig. 1 is a flowchart of the speech recognition method provided by the present disclosure.
Fig. 2 is a structural schematic diagram of the speech recognition system provided by the present disclosure.
Fig. 3 is a structural schematic diagram of the terminal device provided by the present disclosure.
Detailed Description
The present disclosure provides a speech recognition method and system. To make the purpose, technical solution, and effects of the present disclosure clearer and more definite, the disclosure is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present disclosure and are not intended to limit it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the", and "said" used here may also include the plural. It should be further understood that the word "include" used in the specification of the present disclosure refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. Moreover, "connected" or "coupled" as used here may include wireless connection or wireless coupling. The expression "and/or" as used here includes all or any unit of, and all combinations of, one or more of the associated listed items.
Those skilled in the art will understand that, unless defined otherwise, all terms used here (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present disclosure belongs. It should also be understood that terms such as those defined in general dictionaries should be understood to have meanings consistent with their meaning in the context of the prior art and, unless specifically defined as they are here, will not be interpreted in an idealized or overly formal sense.
The inventors found through research that current terminal devices (smart TVs, smart phones, smart speakers, smart robots, etc.) generally use natural-speech dialogue systems for voice interaction, and the speech-to-text model used by the dialogue system of each terminal device is the default speech-to-text model configured on the server. However, the users facing each terminal device differ from one another, and each user's manner of speech also differs (for example, some users speak dialects, some users' pronunciation is inaccurate, etc.), so the speech-to-text model frequently fails to understand speech or misunderstands it, causing the natural language dialogue system to fail to execute correctly and thus to miss the user's true intention.
To solve the above problem, in the embodiments of the present disclosure, when the speech information to be recognized is acquired, the voiceprint information of the speech information is acquired, and the personalized speech-to-text model corresponding to that voiceprint information is used to determine the text information corresponding to the speech information. The personalized speech-to-text model is trained on the speech data corresponding to the voiceprint information, so the manner of speech expression of the acquired speech information is the same as that of the speech data used by the personalized speech-to-text model; the personalized model can therefore improve the accuracy of recognizing the speech information, better capture the user's intention, and be convenient for the user.
For example, the embodiments of the present disclosure can be applied to a terminal device. The terminal device can receive speech information, acquire the voiceprint information of the speech information, determine, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, and determine, based on the personalized speech-to-text model, the text information corresponding to the speech information.
It should be understood that, although in the above application scenario the actions of the embodiments of the present disclosure are described as being performed entirely by the terminal device, these actions may also be performed partly by the terminal device and partly by a server connected to the terminal device. For example, the terminal device receives speech information and sends it to the server; the server responds to the speech information sent by the terminal device, acquires its voiceprint information, determines the personalized speech-to-text model corresponding to the speech information, and determines, based on that model, the text information corresponding to the speech information. It should be understood that the present disclosure is not limited with respect to the executing entity, as long as the actions disclosed in the embodiments of the present disclosure are performed.
It should be noted that the above application scenario is presented only to facilitate understanding of the present disclosure, and the embodiments of the present disclosure are not limited in this respect; on the contrary, the embodiments of the present disclosure can be applied in any applicable scenario.
The disclosure is further explained below through the description of embodiments, with reference to the drawings.
This embodiment provides a speech recognition method, as shown in Fig. 1. The method includes:
S10: acquiring voiceprint information of the speech information to be recognized.
Specifically, the speech information to be recognized may be speech information input by a user (for example, speech picked up by a microphone), speech information sent by an external device, or speech information downloaded from the network (for example, from Baidu). The voiceprint information is the sound-wave spectrum corresponding to the speech information, and each user's voiceprint information is different. Thus, after the speech information is acquired, the voiceprint information serves as the identification information of the speech information to be recognized, so that the personalized speech-to-text model corresponding to the speech information can subsequently be determined on the basis of this identification information.
Further, after the speech information to be recognized is acquired, the voiceprint information corresponding to it can be obtained through traditional algorithms, for example the hidden Markov model (HMM) approach, which usually uses a single-state HMM or a Gaussian mixture model (GMM); a deep learning network model may also be used to obtain the voiceprint information corresponding to the speech information. The deep learning network model may be trained on a preset training sample set that includes several speech information groups, each of which includes speech information and the real voiceprint information corresponding to it. Of course, in practical applications, recognizing the voiceprint information of speech information can be packaged as a functional module: once the speech information to be recognized is acquired, the module outputs the voiceprint information corresponding to it.
S20: determining, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information.
Specifically, the personalized speech-to-text model is trained on the speech data corresponding to the voiceprint information; it corresponds to that voiceprint information and is used to convert speech information carrying that voiceprint into text information. It should be understood that the speech data is speech data produced by the user corresponding to the voiceprint information, for example a segment of speech spoken by that user. Moreover, for different voiceprint information, the corresponding personalized speech-to-text models differ from one another. For example, for voiceprint information A and voiceprint information B, which are different, the personalized speech-to-text model corresponding to voiceprint A differs from the one corresponding to voiceprint B: personalized speech-to-text model A is trained on the speech data corresponding to voiceprint A, that speech data being produced by user A (for example, a segment of speech recorded by user A), and personalized speech-to-text model B is trained on the speech data corresponding to voiceprint B, that speech data being produced by user B (for example, a segment of speech recorded by user B). Then, when speech information A whose voiceprint is voiceprint A is acquired, personalized model A can serve as the personalized speech-to-text model corresponding to speech information A; but when speech information B whose voiceprint is voiceprint B is acquired, personalized model A cannot serve as the personalized speech-to-text model corresponding to speech information B.
Further, the personalized speech-to-text model may be deployed on the terminal device or on a server connected to the terminal device, where the terminal device is the device that executes this speech recognition method, for example a smart phone, a tablet computer, and so on. When the terminal device hosts the personalized speech-to-text model, the terminal device can perform the operation of determining, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information; when the personalized speech-to-text model is deployed on the server, the terminal device sends the speech information and its corresponding voiceprint information to the server, and the server determines the personalized speech-to-text model corresponding to the voiceprint information and recognizes, based on the determined model, the text information corresponding to the speech information. In one implementation of this embodiment, the personalized speech-to-text model is deployed on the terminal device, so the terminal device does not need to communicate with the server, which reduces the time delay caused by communication and improves the timeliness of speech recognition; at the same time, speech information can be recognized even without a network connection, which improves the success rate of speech-information recognition.
Further, the personalized speech-to-text model is a pre-trained network model, and its training process may be executed by the terminal device or by a server connected to the terminal device. When the personalized speech-to-text model is trained by the server connected to the terminal device, the server, after obtaining the model through training, deploys it on the terminal device corresponding to the model; the terminal device corresponding to the personalized speech-to-text model may be the terminal device that sent the server the personalized training sample corresponding to the model, or a terminal device whose corresponding voiceprint information is the same as the voiceprint information corresponding to the model.
Further, in one implementation of this embodiment, the training process of the personalized speech-to-text model may be:
H10: acquiring the personalized training sample corresponding to the voiceprint information;
H20: training a preset network model on the personalized training sample to obtain the personalized speech-to-text model.
Specifically, in step H10, the personalized training sample includes several personalized training voice groups, each corresponding to the voiceprint information; each of the personalized voice groups includes training voice data, the real text information corresponding to that training voice data, and the real confidence corresponding to that real text information. The voiceprint information corresponding to each item of training voice data in every personalized training voice group is the same, while the voice content of each item of training voice data differs. For example, if the personalized training sample includes training voice data A and training voice data B, the voiceprint information corresponding to A is the same as that corresponding to B, while the voice content contained in A differs from that contained in B; for instance, the voice content of A is "apple" and that of B is "banana".
Further, the real text information is the textual expression corresponding to the voice content contained in the training voice data; for example, if the voice content of the training voice data is the speech corresponding to "apple", the real text information is "apple". The real confidence is the credibility of the real text information corresponding to the voice data: the higher the real confidence, the more credible the real text information; conversely, the lower the real confidence, the less credible it is. In one implementation of this embodiment, the real confidence may be 1.
Further, the personalized training sample may be generated from the speech information input by the user and received by the terminal device. The generation process may include: first, when a control instruction to establish a personalized training sample is received, establishing a preset data set in response to the instruction, where the preset data set is used to store each personalized voice group of the personalized training sample; second, receiving input first speech information, acquiring its corresponding voiceprint information, and associating the voiceprint information with the preset data set so that the voiceprint information serves as the identifier of the data set; third, receiving the real text information and real confidence corresponding to the first speech information, to form one personalized voice group; finally, continuing to receive input second speech information together with its corresponding real text information and real confidence, until a preset number of personalized voice groups have been obtained. The preset number is set in advance, for example 1000. It is worth noting that, whenever an item of speech information is obtained, it must be determined whether the voiceprint information of that speech information is the same as the voiceprint information corresponding to the data set: if it is, the speech information and its corresponding real text information and real confidence are accepted to form a personalized voice group; if not, a voice-input error is prompted and the speech information is discarded. Of course, in practical applications, when the number of mismatches reaches a preset number of times, the generation of the personalized training sample is stopped.
Further, in one implementation of this embodiment, the personalized training sample may be generated over multiple sessions. For each generation session, when the first item of speech information is obtained, it can be determined whether a data set corresponding to that speech information's voiceprint information already exists. If it exists, the speech information, its corresponding real text information, and the real confidence are stored in that data set; if not, a data set is created for the speech information, and the speech information, its corresponding real text information, and the real confidence are stored in the newly created data set.
Further, in step H20, training the preset network model on the personalized training sample specifically includes: inputting the training voice data of a personalized voice group in the personalized training sample into the preset network model, which outputs the generated text information corresponding to that training voice data and the generated confidence corresponding to the generated text information; correcting the network parameters of the preset network model according to the real text information, the real confidence, the generated text information, and the generated confidence; and continuing to execute the step of inputting the training voice data of the next personalized voice group in the sample into the preset network model, until the preset network model satisfies a preset condition, thereby obtaining the personalized speech-to-text model.
Further, the preset network model is a deep learning network model; the generated text information is the text information corresponding to the voice content recognized from the training voice data by the preset network model; the generated confidence is the credibility of the generated text information: the higher the generated confidence, the more credible the generated text information, and conversely, the lower the generated confidence, the less credible it is. In one implementation of this embodiment, the generated confidence takes values in the range 0-1, and the larger its value, the more credible the generated text information; conversely, the smaller its value, the less credible. For example, generated text information with a generated confidence of 0.9 is more credible than generated text information with a generated confidence of 0.1.
Further, the preset condition includes the loss function value meeting a preset requirement or the training count reaching a preset number. The preset requirement may be determined according to the accuracy of the preset network model and is not described in detail here; the preset number may be the maximum training count of the initial network model, for example 2000. Thus, after the preset network model outputs the generated text information and the generated confidence, the loss function value of the model is computed from the real text information, the real confidence, the generated text information, and the generated confidence. Once the loss function value is obtained, it is judged whether it meets the preset requirement: if it does, training ends; if not, it is judged whether the training count of the preset network model has reached the preset number. If the preset number has not been reached, the network parameters of the preset network model are corrected according to the loss function value; if it has been reached, training ends. Judging whether training of the preset network model is complete through both the loss function value and the training count avoids the training entering an endless loop because the loss function value cannot meet the preset requirement.
Further, in one implementation of this embodiment, the determining, based on the voiceprint information, of the personalized speech-to-text model corresponding to the speech information specifically includes:
S21: detecting the personalized speech-to-text model corresponding to the voiceprint information;
S22: if a personalized speech-to-text model corresponding to the voiceprint information is detected, executing the step of determining, based on the personalized speech-to-text model, the text information corresponding to the speech information;
S23: if no personalized speech-to-text model corresponding to the voiceprint information is detected, using the default speech-to-text model as the speech-to-text model corresponding to the speech information.
Specifically, detecting the personalized speech-to-text model corresponding to the voiceprint information can be understood as determining, based on the voiceprint information, whether the terminal device itself and/or the server connected to the terminal device has deployed the personalized speech-to-text model corresponding to that voiceprint. When the personalized model is deployed, speech recognition can be performed with it, so the personalized speech-to-text model serves as the speech-to-text model corresponding to the speech information; when it is not deployed, speech recognition cannot be performed with it, and the default speech-to-text model serves as the speech-to-text model corresponding to the speech information. The default speech-to-text model is configured on the server and used to convert any speech information into text information; it may be trained on a speech-data sample set formed in Mandarin, where the sample set includes several speech data samples and the text information corresponding to each sample, the samples being speech data samples in Mandarin, here called general speech data samples. The training process of the default speech-to-text model may be: inputting the Mandarin speech data samples into the default speech-to-text model, outputting through it the predicted text information corresponding to the general speech data samples, and training the default speech-to-text model on the predicted text information and the text information corresponding to the general speech data samples, to obtain the trained default speech-to-text model. After the training of the default speech-to-text model is complete, it is configured on the server as the speech-to-text model that converts any speech information into text information.
Further, from the training process of the default speech-to-text model it can be seen that its input during training is general speech data samples and its output is predicted text information, while in use its input is speech data and its output is text information, where the speech data may be Mandarin speech data, dialect speech data, or accented speech data, among others. It should be understood that the default speech-to-text model is not tied to any voiceprint information: when the default model is used, speech information of any voiceprint can be recognized through it to obtain the corresponding text information. In this embodiment, when no personalized speech-to-text model corresponding to the voiceprint information is detected, the default speech-to-text model is used to recognize the speech information, so that even a user who has not set up a personalized model can still obtain the text information corresponding to the speech information and carry out human-computer interaction, which is convenient for the user.
S30: determining, based on the personalized speech-to-text model, the text information corresponding to the speech information.
Specifically, the text information is the text output by the personalized speech-to-text model after the speech information is input into it; the text information expresses the voice content of the speech information. For example, if the text information is "apple", the voice content contained in the speech information is the speech corresponding to "apple". Thus, the process of determining the text information corresponding to the speech information based on the personalized speech-to-text model may be: inputting the speech information into the personalized speech-to-text model and outputting, through it, the text information corresponding to the speech information, thereby obtaining that text information.
Further, from the training process of the personalized speech-to-text model it can be seen that its outputs include the text information corresponding to the speech information and the confidence of that text information. Accordingly, in one implementation of this embodiment, the determining, based on the personalized speech-to-text model, of the text information corresponding to the speech information specifically includes:
inputting the speech information into the personalized speech-to-text model, and outputting, through the personalized speech-to-text model, the reference text information corresponding to the speech information;
when the confidence of the reference text information is less than the preset confidence threshold, inputting the speech information into the preset default speech-to-text model, determining through the default speech-to-text model the target text information corresponding to the speech information, and using the target text information as the text information corresponding to the speech information;
when the confidence of the reference text information is greater than or equal to the preset confidence threshold, using the reference text information as the text information corresponding to the speech information.
Specifically, the reference text information is output by the personalized speech-to-text model, and the confidence of the reference text information is likewise output by the personalized speech-to-text model. It should be understood that the personalized speech-to-text model outputs both the reference text information corresponding to the speech information and the confidence of the reference text information. The confidence threshold is set in advance and serves as the standard for judging whether the reference text information can be used as the text information corresponding to the speech information: when the confidence is greater than or equal to the confidence threshold, the reference text information can serve as the text information corresponding to the speech information; when the confidence is less than the threshold, it cannot. For example, if the confidence of the reference text information is 0.7 and the preset confidence threshold is 0.8, the reference text information cannot be used as the text information corresponding to the speech information.
Further, when the confidence is less than the confidence threshold, the speech information is input into the default speech-to-text model, which outputs the target text information corresponding to the speech information, and the target text information is used as the text information corresponding to the speech information. This avoids erroneous text output caused by insufficient personalized training samples and improves the recognition accuracy of the text information. Of course, in practical applications, when the target text information is obtained, its confidence can be compared with that of the reference text information, and the one with the higher confidence selected as the text information corresponding to the speech information, which further ensures the accuracy of the text information. For example, if the confidence of the reference text information is 0.7 and that of the target text information is 0.75, the target text information is used as the text information corresponding to the speech information.
In summary, this embodiment provides a speech recognition method that includes: acquiring voiceprint information of the speech information to be recognized; determining, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determining, based on the personalized speech-to-text model, the text information corresponding to the speech information. The voiceprint of the speech data used to train the personalized speech-to-text model is the same as the voiceprint of the acquired speech information, so the manner of speech expression of the acquired speech information matches that of the speech data the model was trained on; the personalized model therefore improves the accuracy of recognizing the speech information, better captures the user's intention, and is convenient for the user.
Based on the above speech recognition method, as shown in Fig. 2, this embodiment provides a speech recognition system. The speech recognition system includes a terminal device and a server; a personalized speech-to-text model is deployed on the terminal device, and a default speech-to-text model is deployed on the server;
the terminal device is used to acquire the voiceprint information of the speech information to be recognized; determine, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on the speech data corresponding to the voiceprint information; and determine, based on the personalized speech-to-text model, the text information corresponding to the speech information;
the server is used to determine, through the default speech-to-text model, the target text information corresponding to the speech information when the confidence of the text information is less than a preset confidence threshold, and to use the target text information as the text information corresponding to the speech information.
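By way of illustration, the division of labor in this system can be sketched as follows, under assumed interfaces; every class and method name here is hypothetical, and the models are taken to be callables returning a (text, confidence) pair:

```python
# Sketch: the terminal device recognizes locally with the personalized model
# and consults the server's default model only when the local confidence
# falls below the preset threshold.
class Server:
    def __init__(self, default_model):
        self.default_model = default_model

    def recognize_default(self, audio) -> str:
        text, _conf = self.default_model(audio)
        return text

class TerminalDevice:
    def __init__(self, voiceprint_module, personalized_models: dict,
                 server: Server, threshold: float = 0.8):
        self.voiceprint_module = voiceprint_module
        self.personalized_models = personalized_models  # voiceprint id -> model
        self.server = server
        self.threshold = threshold

    def recognize(self, audio) -> str:
        voiceprint = self.voiceprint_module.identify(audio)        # S10
        model = self.personalized_models.get(voiceprint)           # S20
        if model is None:
            return self.server.recognize_default(audio)
        text, conf = model(audio)                                  # S30
        if conf < self.threshold:
            return self.server.recognize_default(audio)
        return text
```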
Based on the above speech recognition method, this embodiment provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech recognition method described in the above embodiments.
Based on the above speech recognition method, the present disclosure also provides a terminal device, as shown in Fig. 3, which includes at least one processor 20; a display screen 21; and a memory 22, and may also include a communication interface 23 and a bus 24. The processor 20, the display screen 21, the memory 22, and the communication interface 23 can communicate with one another through the bus 24. The display screen 21 is configured to display a user-guidance interface preset in the initial setup mode. The communication interface 23 can transmit information. The processor 20 can call logic instructions in the memory 22 to execute the methods of the above embodiments.
In addition, the logic instructions in the memory 22 can be implemented in the form of software functional units and, when sold or used as an independent product, can be stored in a computer-readable storage medium.
As a computer-readable storage medium, the memory 22 can be configured to store software programs and computer-executable programs, such as the program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes functional applications and data processing by running the software programs, instructions, or modules stored in the memory 22; that is, it implements the methods of the above embodiments.
The memory 22 may include a program storage area and a data storage area, where the program storage area may store an operating system and the application required by at least one function, and the data storage area may store data created according to the use of the terminal device, among other things. In addition, the memory 22 may include high-speed random-access memory and may also include non-volatile memory. For example, a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc, among various media that can store program code, may be used; a transient storage medium is also possible.
In addition, the specific process by which the processor loads and executes the instructions in the above storage medium and terminal device has been described in detail in the method above and is not restated here.
Finally, it should be noted that the above embodiments serve only to illustrate the technical solution of the present disclosure, not to limit it. Although the present disclosure has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure.

Claims (15)

  1. A speech recognition method, characterized in that the method comprises:
    acquiring voiceprint information of speech information to be recognized;
    determining, based on the voiceprint information, a personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information;
    determining, based on the personalized speech-to-text model, text information corresponding to the speech information.
  2. The speech recognition method according to claim 1, characterized in that the voiceprint information is the sound-wave spectrum corresponding to the speech information, and the voiceprint information of each user differs from that of every other user.
  3. The speech recognition method according to claim 1, characterized in that, before the acquiring of the voiceprint information of the speech information to be recognized, the method comprises:
    acquiring a personalized training sample corresponding to the voiceprint information, wherein the personalized training sample comprises several personalized training voice groups, each of which comprises training voice data, real text information corresponding to the training voice data, and a real confidence corresponding to the real text information; the voiceprint information corresponding to each item of training voice data in every personalized training voice group is the same;
    training a preset network model on the personalized training sample to obtain the personalized speech-to-text model.
  4. The speech recognition method according to claim 3, characterized in that the acquiring of the personalized training sample corresponding to the voiceprint information specifically comprises:
    receiving input first speech information, acquiring the voiceprint information corresponding to the first speech information, and associating the voiceprint information with a preset data set so that the voiceprint information serves as the data identifier of the preset data set;
    receiving real text information and a real confidence corresponding to the first speech information, to form one personalized voice group;
    continuing to receive input second speech information together with real text information and a real confidence corresponding to the second speech information, until a preset number of personalized voice groups have been obtained, so as to obtain the personalized training sample, wherein the voiceprint information corresponding to the second speech information is the same as the voiceprint information corresponding to the first speech information.
  5. The speech recognition method according to claim 1, characterized in that the acquiring of the voiceprint information of the speech information to be recognized is specifically:
    acquiring the voiceprint information of the speech information to be recognized through a trained deep learning network model, wherein the deep learning network model is trained on a preset training sample set, the preset training sample set comprising several speech information groups, each of which comprises speech information and real voiceprint information corresponding to the speech information.
  6. The speech recognition method according to claim 1, characterized in that the determining, based on the voiceprint information, of the personalized speech-to-text model corresponding to the speech information specifically comprises:
    detecting a personalized speech-to-text model corresponding to the voiceprint information;
    if a personalized speech-to-text model corresponding to the voiceprint information is detected, executing the step of determining, based on the personalized speech-to-text model, the text information corresponding to the speech information;
    if no personalized speech-to-text model corresponding to the voiceprint information is detected, using a default speech-to-text model as the speech-to-text model corresponding to the speech information.
  7. The speech recognition method according to claim 1, characterized in that the determining, based on the personalized speech-to-text model, of the text information corresponding to the speech information specifically comprises:
    inputting the speech information into the personalized speech-to-text model, and outputting, through the personalized speech-to-text model, reference text information corresponding to the speech information;
    when the confidence of the reference text information is less than a preset confidence threshold, inputting the speech information into a preset default speech-to-text model, and determining, through the default speech-to-text model, target text information corresponding to the speech information;
    using the target text information as the text information corresponding to the speech information.
  8. The speech recognition method according to claim 7, characterized in that the outputting, through the personalized speech-to-text model, of the text information corresponding to the speech information comprises:
    when the confidence of the reference text information is greater than or equal to the preset confidence threshold, using the reference text information as the text information corresponding to the speech information.
  9. The speech recognition method according to claim 7, characterized in that the default speech-to-text model is deployed on a server, and the default speech-to-text model is used to convert speech information corresponding to any voiceprint information into text information.
  10. The speech recognition method according to claim 7, characterized in that the default speech-to-text model is trained on a speech-data sample set, wherein the speech-data sample set comprises several speech data samples and text information corresponding to each speech data sample, the speech data samples being speech data samples in Mandarin.
  11. The speech recognition method according to claim 1, characterized in that the personalized speech-to-text model is deployed on a terminal device, the terminal device being the terminal device that performs the acquiring of the voiceprint information of the speech information to be recognized.
  12. The speech recognition method according to any one of claims 1-11, characterized in that, after the determining, based on the personalized speech-to-text model, of the text information corresponding to the speech information, the method further comprises:
    determining, based on the text information, a response voice corresponding to the speech information, and playing the response voice, so that the user corresponding to the speech information receives the response.
  13. A speech recognition system, characterized in that the speech recognition system comprises a terminal device and a server; a personalized speech-to-text model is deployed on the terminal device, and a default speech-to-text model is deployed on the server;
    the terminal device is the terminal device that performs the acquiring of the voiceprint information of the speech information to be recognized; determines, based on the voiceprint information, the personalized speech-to-text model corresponding to the speech information, wherein the personalized speech-to-text model is trained on speech data corresponding to the voiceprint information; and determines, based on the personalized speech-to-text model, the text information corresponding to the speech information;
    the server is configured to determine, through the default speech-to-text model, target text information corresponding to the speech information when the confidence of the text information is less than a preset confidence threshold, and to use the target text information as the text information corresponding to the speech information.
  14. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, the one or more programs being executable by one or more processors to implement the steps of the speech recognition method according to any one of claims 1-12.
  15. A terminal device, characterized in that it comprises a processor, a memory, and a communication bus; the memory stores a computer-readable program executable by the processor;
    the communication bus realizes connection and communication between the processor and the memory;
    the processor, when executing the computer-readable program, implements the steps of the speech recognition method according to any one of claims 1-12.
PCT/CN2020/138443 2020-06-19 2020-12-23 Speech recognition method and system WO2021253779A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010566791.XA CN113823263A (zh) 2020-06-19 2020-06-19 Speech recognition method and system
CN202010566791.X 2020-06-19

Publications (1)

Publication Number Publication Date
WO2021253779A1 (zh)

Family

ID=78911606

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/138443 WO2021253779A1 (zh) 2020-06-19 2020-12-23 一种语音识别方法以及系统

Country Status (2)

Country Link
CN (1) CN113823263A (zh)
WO (1) WO2021253779A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000214880A (ja) * 1999-01-20 2000-08-04 Sony Internatl Europ Gmbh Speech recognition method and speech recognition apparatus
CN103456303A (zh) * 2013-08-08 2013-12-18 Sichuan Changhong Electric Co., Ltd. Voice control method and intelligent air-conditioning system
CN105096940A (zh) * 2015-06-30 2015-11-25 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for speech recognition
CN105096941A (zh) * 2015-09-02 2015-11-25 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method and apparatus
CN110634472A (zh) * 2018-06-21 2019-12-31 ZTE Corporation Speech recognition method, server, and computer-readable storage medium
CN111261168A (zh) * 2020-01-21 2020-06-09 Hangzhou Zhongke Advanced Technology Research Institute Co., Ltd. Speech recognition engine and method supporting multiple tasks and multiple models

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116530944A (zh) * 2023-07-06 2023-08-04 Honor Device Co., Ltd. Sound processing method and electronic device
CN116530944B (zh) * 2023-07-06 2023-10-20 Honor Device Co., Ltd. Sound processing method and electronic device

Also Published As

Publication number Publication date
CN113823263A (zh) 2021-12-21

Similar Documents

Publication Publication Date Title
EP3770905B1 (en) Speech recognition method, apparatus and device, and storage medium
US10891952B2 (en) Speech recognition
JP6633008B2 (ja) Voice dialogue device and voice dialogue method
US20210183392A1 (en) Phoneme-based natural language processing
CN110047481B (zh) Method and apparatus for speech recognition
KR101183344B1 (ko) Automatic speech recognition learning using user corrections
KR20210009596A (ko) Intelligent speech recognition method, speech recognition apparatus, and intelligent computing device
KR20190104941A (ko) Speech synthesis method and apparatus based on emotion information
US20190385608A1 (en) Intelligent voice recognizing method, apparatus, and intelligent computing device
KR20190098110A (ko) Intelligent presentation method
JP2017058673A (ja) Dialogue processing apparatus and method, and intelligent dialogue processing system
KR20190101329A (ko) Intelligent voice output method, voice output apparatus, and intelligent computing device
US20190385607A1 (en) Intelligent voice outputting method, apparatus, and intelligent computing device
KR102321801B1 (ko) Intelligent speech recognition method, speech recognition apparatus, and intelligent computing device
US20230206897A1 (en) Electronic apparatus and method for controlling thereof
US11790893B2 (en) Voice processing method based on artificial intelligence
KR20190106890A (ko) Speech synthesis method and apparatus based on emotion information
KR20190104278A (ko) Intelligent speech recognition method, speech recognition apparatus, and intelligent computing device
CN109712610A (zh) Method and apparatus for recognizing speech
US20200020337A1 (en) Intelligent voice recognizing method, apparatus, and intelligent computing device
CN113674746A (zh) Human-computer interaction method, apparatus, device, and storage medium
WO2014173325A1 (zh) Throat sound recognition method and device
CN113611316A (zh) Human-computer interaction method, apparatus, device, and storage medium
KR20240122776A (ko) Adaptation and learning of neural speech synthesis
US10847154B2 (en) Information processing device, information processing method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20941135

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20941135

Country of ref document: EP

Kind code of ref document: A1