WO2020186712A1 - Speech recognition method, apparatus and terminal (一种语音识别方法、装置及终端) - Google Patents

Speech recognition method, apparatus and terminal (一种语音识别方法、装置及终端)

Info

Publication number
WO2020186712A1
WO2020186712A1 · PCT/CN2019/106806
Authority
WO
WIPO (PCT)
Prior art keywords
recognition result
target
voice recognition
voice
speech recognition
Prior art date
Application number
PCT/CN2019/106806
Other languages
English (en)
French (fr)
Inventor
任晓楠
崔保磊
戴磊
Original Assignee
海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海信视像科技股份有限公司 (Hisense Visual Technology Co., Ltd.)
Publication of WO2020186712A1 publication Critical patent/WO2020186712A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/06Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063Training
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/26Speech to text systems

Definitions

  • the present disclosure mainly relates to the field of smart home technology, and in particular to a voice recognition method, device and terminal.
  • the embodiment of the present disclosure provides a voice recognition method applied to a terminal, and the method includes:
  • Each voice recognition result and the target file corresponding to the target voice recognition result are displayed on the display interface, wherein the target voice recognition result is displayed in a first display mode, and other voice recognition results are displayed in a second display mode.
  • the obtaining the target file corresponding to the target speech recognition result includes:
  • the performing semantic recognition on the target speech recognition result and determining the service type corresponding to the target speech recognition result includes:
  • according to a preset dictionary library, perform word segmentation on the target speech recognition result, perform semantic recognition on each segmented word in the target speech recognition result, and determine the service type corresponding to each segmented word;
  • according to the weight of the service type corresponding to each segmented word, the service type corresponding to the target speech recognition result is determined.
  • the displaying each voice recognition result and the target file corresponding to the target voice recognition result on a display interface includes:
  • the target file corresponding to the target voice recognition result is displayed on the display interface of the terminal.
  • the method further includes:
  • the modified target voice recognition result is displayed in a first display mode, and other voice recognition results are displayed in a second display mode; at the same time, the target file corresponding to the modified target voice recognition result is displayed.
  • the determining the voice recognition result that the voice information meets the first matching threshold according to the pre-trained voice matching model includes:
  • the Chinese character sequence whose score meets the first matching threshold is used as the speech recognition result.
  • the embodiment of the present disclosure provides a terminal, including a processor, a communication interface, a memory, and a communication bus; wherein the processor, the communication interface, and the memory complete mutual communication through the communication bus;
  • a computer program is stored in the memory
  • the processor is configured to run the computer program to enable the terminal to realize:
  • Each voice recognition result and the target file corresponding to the target voice recognition result are displayed on the display interface, wherein the target voice recognition result is displayed in a first display mode, and other voice recognition results are displayed in a second display mode.
  • the processor is configured to obtain the target file corresponding to the target speech recognition result by executing the following:
  • the processor is configured to perform semantic recognition on the target speech recognition result by executing the following to determine the service type corresponding to the target speech recognition result:
  • according to a preset dictionary library, perform word segmentation on the target speech recognition result, perform semantic recognition on each segmented word in the target speech recognition result, and determine the service type corresponding to each segmented word;
  • according to the weight of the service type corresponding to each segmented word, the service type corresponding to the target speech recognition result is determined.
  • the processor is configured to display each voice recognition result and the target file corresponding to the target voice recognition result to a display interface by executing the following:
  • the target file corresponding to the target voice recognition result is displayed on the display interface of the terminal.
  • the processor is further configured to run the computer program to enable the terminal to realize:
  • the modified target voice recognition result is displayed in a first display mode, and other voice recognition results are displayed in a second display mode; at the same time, the target file corresponding to the modified target voice recognition result is displayed.
  • the processor is configured to determine, according to the pre-trained voice matching model, a speech recognition result of the voice information that meets the first matching threshold by executing the following:
  • the Chinese character sequence whose score meets the first matching threshold is used as the speech recognition result.
  • the embodiments of the present disclosure provide a computer-readable non-volatile storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above methods applied to a terminal.
  • FIG. 1 is a schematic diagram of a process of a speech recognition method provided in Embodiment 1 of the present disclosure
  • FIG. 2 is an example diagram of a voice matching model provided by an embodiment of the disclosure
  • FIG. 3 is a schematic diagram of a process of a voice recognition method provided by an embodiment of the disclosure.
  • FIG. 4 is a schematic diagram of a process of displaying a voice recognition result provided by an embodiment of the disclosure.
  • FIG. 4a is a schematic diagram of a voice recognition result display provided by an embodiment of the disclosure.
  • FIG. 5 is a schematic diagram of a process of a voice recognition method provided by an embodiment of the disclosure.
  • FIG. 6 is a schematic diagram of a process of displaying a voice recognition result provided by an embodiment of the disclosure.
  • FIG. 7 is a schematic diagram of a process of displaying a voice recognition result provided by an embodiment of the disclosure.
  • FIG. 8 is a schematic diagram of a voice recognition result display provided by an embodiment of the disclosure.
  • FIG. 9 is a schematic diagram of a voice recognition result display provided by an embodiment of the disclosure.
  • FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the disclosure.
  • FIG. 11 is a schematic structural diagram of a terminal provided by an embodiment of the disclosure.
  • Voice recognition allows machines to receive, recognize and understand voice signals and convert them into corresponding digital signals.
  • although speech recognition has produced a large number of applications in many industries, much work remains before true natural human-machine communication is achieved; for example, greater improvement is needed in self-adaptation so that recognition is unaffected by accents, dialects and particular speakers.
  • real-world voice types are diverse; in terms of voice characteristics they can be divided into male, female and child voices.
  • many people's pronunciation is far from the standard pronunciation, which requires handling of accents and dialects. This leads to the problem that the recognition result obtained when the user inputs speech is inconsistent with the user's intention.
  • the embodiments of the present disclosure provide a voice recognition method, device, and terminal.
  • the method includes: receiving input voice information; determining, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets a first matching threshold; determining the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result; obtaining the target file corresponding to the target speech recognition result; and displaying each speech recognition result and the target file corresponding to the target speech recognition result on the display interface, wherein the target speech recognition result is displayed in a first display mode and the other speech recognition results are displayed in a second display mode.
  • by displaying the speech recognition result with the highest matching degree in the first display mode, quick display is achieved and user convenience is improved; semantic recognition is performed separately on at least one speech recognition result to obtain more possible user search intentions, and each speech recognition result is displayed on the display interface of the terminal in the second display mode, which effectively provides the user with more search results, improves the coverage of speech recognition results over user intentions, increases the success rate of voice search, and improves the usability of speech recognition products.
  • Step 101 Receive input voice information
  • the terminal can obtain the voice information input by the user through the terminal's own voice device, or through an external voice device; specifically, a voice recognition module is provided in the terminal to recognize and collect voice information.
  • a communication module, such as a Wi-Fi wireless communication module, is provided in the terminal, so that the terminal can connect to the server and send the collected voice information to the server.
  • processing can also be performed entirely by the terminal, or only the part of the voice information that needs server-side processing can be sent, which is not limited here.
  • Step 102 According to a pre-trained voice matching model, determine at least one voice recognition result of the voice information that meets the first matching threshold;
  • the voice matching model can be set on the terminal or on the server, which is not limited here. If it is set on a server, the server sends the at least one voice recognition result to the terminal after determining at least one voice recognition result of the voice information that meets the first matching threshold.
  • Step 103 Determine that the speech recognition result with the highest matching degree among the at least one speech recognition result is the target speech recognition result
  • the terminal may determine the target speech recognition result with the highest matching degree based on the scores of the at least one speech recognition result; alternatively, the server may determine it based on those scores and then send the target speech recognition result to the terminal.
  • Step 104 Obtain the target file corresponding to the target voice recognition result
  • the terminal can search a local or network resource library for the target file corresponding to the target speech recognition result; alternatively, the server can search the network resource library according to the target speech recognition result and, after determining the target file, send the target file or its identification information to the terminal, so that the terminal determines the target file corresponding to the target speech recognition result.
  • Step 105 Display each voice recognition result and the target file corresponding to the target voice recognition result on a display interface, where the target voice recognition result is displayed in a first display mode, and other voice recognition results are displayed in a second display mode.
  • the second display mode may be a display mode opposite to the first display mode.
  • the first display mode may be a display mode such as highlighting and a selected box
  • the second display mode may be a display mode such as no highlighting and no selected box.
  • the target display result shown in FIG. 7 is the first display mode where the selected box is highlighted
  • the second display mode is the mode where the selected box is not highlighted.
  • the specific display method is not limited here.
  • the voice recognition method provided by the embodiments of the present disclosure displays the speech recognition result with the highest matching degree in the first display mode, enabling quick display and improving user convenience; semantic recognition is performed on at least one speech recognition result to obtain more possible user search intentions, and each speech recognition result is displayed on the display interface of the terminal in the second display mode, which effectively provides the user with more search results, improves the coverage of speech recognition results over user intentions, increases the success rate of voice search, and improves the usability of speech recognition products.
  • a method for a voice recognition model to determine a voice recognition result including:
  • Step 1 Acquire the voice information input by the user, and determine the characteristic acoustic probability of the voice information through the acoustic characteristics
  • acoustic feature extraction, that is, extracting acoustic feature information from the voice information.
  • to ensure recognition accuracy, the extracted features should discriminate well between the modeling units of the acoustic model.
  • the acoustic features in the embodiments of the present disclosure may include: Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP) coefficients, and the like.
  • Step 2 Input the voice information after acoustic feature extraction into the voice matching model, which includes a language model and an acoustic model;
  • the training process of the voice matching model may include:
  • Step 1 Obtain sample voice information, where the sample voice information carries tagging information of the voice to which it belongs;
  • Step 2 Input the voice information of each sample into the voice matching model
  • Step 3 Training the voice matching model according to the voice information of each sample and the output of the voice matching model.
  • sample voice information can be collected by the terminal or obtained by other means; for the sample voice information, the sample voice information can be labeled.
  • the model can be based on dynamic time warping, a hidden Markov model (HMM), an artificial neural network, a support vector machine, or other models. Each piece of sample voice information is input into the model, and the voice matching model is trained according to the labeling information of each sample and the output of the model.
  • HMM Hidden Markov Model
  • a voice matching model is obtained by training a large number of sample voice information, and the collected voice information can be voice recognized through the voice matching model.
  • the acoustic model is built in a supervised manner from the training speech features and their corresponding annotation information.
  • the acoustic model constructs the mapping relationship between the observed features in the speech signal and the speech modeling unit, so as to classify the phoneme or phoneme state.
  • the acoustic model may use HMM as the modeling basis of the acoustic model.
  • the language model can adopt an N-gram statistical language model within the statistical-learning speech recognition framework, in which a Markov chain represents the generation process of a word sequence, and the probability p(W) of generating the word sequence W is expressed as p(W) = ∏_k p(w_k | w_{k-n+1}, …, w_{k-1}).
  • w_k denotes the k-th word in the word sequence.
  • the formula shows that the probability of generating the current word depends only on the preceding n-1 words;
  • the training and evaluation metric of the language model may be the language model perplexity (PP), defined as the reciprocal of the geometric average of the word-sequence generation probability, i.e. PP(W) = p(W)^{-1/K} for a K-word sequence.
  • the probability of each word and related word combination appearing in the training set corpus is first counted, and the relevant parameters of the language model are estimated based on this.
  • the language model can be optimized by methods such as discounting and backing-off, and by modeling the language model with a recurrent neural network (RNN).
  • RNN recurrent neural network
  • Step 3 Input the results obtained by the above models to the decoder for decoding to obtain the possible text of the speech.
  • the decoder is used to find the most likely word sequence W′ through a search algorithm to output the possible text in the voice information.
  • step 102 determining the voice recognition result that the voice information meets the first matching threshold according to the pre-trained voice matching model includes:
  • Step 1 Input the voice information into the voice matching model, recognize the pinyin sequence in the voice information, and form all possible candidate characters;
  • Step 2 For each possible candidate character, determine the possible sequence of Chinese characters and the score of the sequence of Chinese characters through grammatical rules and statistical methods;
  • the score of the Chinese character sequence is obtained, and some pinyin recognition errors are corrected.
  • the probability and statistics language model is used to search for possible correct paths in character sequences or word sequences.
  • the decoder in the embodiments of the present disclosure can adopt the Viterbi algorithm based on dynamic programming, and use certain algorithms (such as Gaussian selection and language model look-ahead) for fast synchronous probability calculation and pruning of the search space, so as to reduce computational complexity and memory overhead and improve the efficiency of the search algorithm.
  • Step 3 Use the Chinese character sequence whose score meets the first matching threshold as the speech recognition result.
  • the at least one Chinese character sequence matched by the language model is sorted by score. Judging from the template matching results, a recognition result with a high matching degree has a higher probability of being correct. Of course, there are also cases where a model's matching score is high but the result is not correct because the relevant corpus is missing from the model. Therefore, speech recognition results whose matching scores meet the first threshold can be selected for semantic recognition.
  • the score needs to meet the first threshold, that is, speech recognition results with scores greater than the first threshold ρ are taken as possibly correct recognition results.
  • the recognition results are as follows:
  • the first threshold may be 0.4; the speech recognition results at this point are then: 郑恺的视频 (Zheng Kai's video), 郑凯的视频 (a homophone) and 正楷的视频 (regular-script video).
  • when the model matching scores of all recognition results are less than the first threshold ρ, the recognition result with the highest score is taken and step 104 is performed for semantic processing.
  • the target voice recognition result can also be output to the interface of the terminal and displayed.
  • the interface of the terminal may be the display interface of the client terminal of the voice assistant that collects voice information, or may be other interfaces of the terminal, which is not limited here.
  • the target voice recognition result is "Zheng Kai's video".
  • the process of displaying the recognition result includes the following steps:
  • Step 1 Create the layout file of the interface
  • the layout file includes a text control that displays the voice recognition result.
  • Step 2 Create the interface to load the layout file and initialize the text control.
  • Step 3 Display the voice recognition result, that is, the recognized text information on the display interface of the terminal.
  • the server stores a preset dictionary library, which contains a large amount of corpus data and has a semantic parsing function; after the cloud server receives the voice information, it uses its own semantic parsing function to parse the speech recognition result.
  • a semantic recognition model is stored in the server; the semantic recognition model can segment the voice information into words, determine the segmented words, recognize their semantics, and determine the target file corresponding to each meaning.
  • if the dictionary library to be searched is small, semantic recognition can be performed on the terminal to increase the parsing speed, which is not limited here.
  • step 104 it includes:
  • Step 1 Perform semantic recognition on the target speech recognition result, and determine the service type corresponding to the target speech recognition result;
  • if the terminal performs semantic recognition, it can use the semantic recognition model on the terminal to output the segmented words of the speech recognition result, parse the semantics of each segmented word and the corresponding labeling result, and check whether the labeling result contains a related service type.
  • if the server performs semantic recognition, then after receiving the speech recognition result sent by the terminal, the server uses the semantic recognition model on the server to output the segmented words of the speech recognition result, parse the semantics of each segmented word and the corresponding labeling result, and check whether the labeling result contains a related service type.
  • Step 2 Search for the target file corresponding to the target voice recognition result in the service type corresponding to the target voice recognition result from the resource library.
  • the specific implementation process of semantic recognition model recognition may include:
  • Step 1 According to the preset dictionary library, perform word segmentation processing on the target speech recognition result, and perform semantic recognition on each word segmentation in the target speech recognition result, and determine the business type corresponding to each word segmentation;
  • the preset dictionary library can obtain the corpus through methods such as web crawlers to update the word segmentation and the annotation of the corresponding business type.
  • Step 2 Determine the service type corresponding to the target speech recognition result according to the weight of the service type corresponding to each word segmentation.
  • to further improve retrieval efficiency, the above operations may also be performed on the other speech recognition results exceeding the first threshold at the same time as on the target speech recognition result.
  • the above operations can also be performed after receiving the user's switching instruction, which is not limited here.
  • Step 1 Perform semantic recognition on each voice recognition result in the at least one voice recognition result, and determine the service type corresponding to the voice recognition result;
  • the speech recognition result 1 is input into the semantic recognition model. If the output result of the semantic recognition model contains service type 1, it is considered that the speech recognition result 1 contains service type 1, and it needs to be included in the application corresponding to service type 1. Perform subsequent processing.
  • for example, speech recognition result 1 is 郑恺的视频 ("Zheng Kai's video"), and the segmentation output by the semantic recognition model is: 郑恺, 的, 视频. The service type of 视频 (video) is the video type, so the service type of this speech recognition result is the video type.
  • the business type can also be determined according to the attributes of the word segmentation.
  • for example, speech recognition result 2 is 天气预报 ("weather forecast"), and the segmentation determined by the semantic recognition model is: 天气, 预报; since 天气 (weather) has weather attributes (weatherKeys), the service type is determined to be the weather query type.
  • the specific implementation process of semantic recognition model recognition may include:
  • Step 1 Perform word segmentation processing on the speech recognition result according to the preset dictionary library, perform semantic recognition on each word segmentation in the speech recognition result, and determine the service type corresponding to each word segmentation;
  • the preset dictionary library can obtain the corpus through methods such as web crawlers to update the word segmentation and the annotation of the corresponding business type.
  • Step 2 Determine the service type corresponding to the voice recognition result according to the weight of the service type corresponding to each word segmentation.
  • the weight of the service type is based on the priority of the service type in the terminal, the priority of the database from which the word segmentation in the preset dictionary database comes from, or the terminal user At least one of the user preferences is determined.
  • for example, speech recognition result 3 is 正楷的视频 ("regular-script video"), and the segmentation determined by the semantic recognition model is: 正楷, 的, 视频.
  • the service type of 视频 is the video type, and the service type of 正楷 (regular script) is the education type; if the weight of the video type corresponding to 视频 is determined to be greater than the weight of the education type corresponding to 正楷, the service type of speech recognition result 3 is determined to be the video type. If the two weights are the same, the service types corresponding to speech recognition result 3 may be determined to be both the education type and the video type.
  • as another example, speech recognition result 4 is 天气预爆 ("Weather Pre-Explosion"), and the segmentation determined by the semantic recognition model is: 天气预爆; according to the preset dictionary library, 天气预爆 is determined to be a film whose corresponding service types include the video type, the song type and so on; the service type of speech recognition result 4 is then determined according to the weight of the video type corresponding to 天气预爆 and the weight of the song type corresponding to 天气预爆.
  • the resource library is searched, within the service type corresponding to the at least one speech recognition result, for the target file corresponding to the at least one speech recognition result.
  • for speech recognition result 1, the target files of 郑恺 can be searched for within the video type in the resource library.
  • for speech recognition result 2, the weather forecast target file can be searched for within the weather query service in the resource library.
  • for speech recognition result 3, regular-script target files can be searched for within the video type, the education type or the education-video type in the resource library.
  • for speech recognition result 4, the 天气预爆 target file can be searched for within the video type or the song type in the resource library.
  • step 105 it specifically includes:
  • Step 1 Determine the priority of each voice recognition result in the at least one voice recognition result
  • combined with semantic analysis, the user interface (UI) presents the search results in the form of tab (TAB) entries, and the displayed results are mainly ordered by hot-search ranking.
  • TAB tabulator key
  • Step 2 Display each voice recognition result according to the priority order on the display interface of the terminal;
  • the priority can be determined based on user big data analysis, score, user preference, etc., and is not limited here.
  • Step 3 Display each voice recognition result and the target file corresponding to the target voice recognition result on the display interface.
  • the semantic recognition module converts the TAB data and target file corresponding to the speech recognition result into JSON data, and transmits it to the display module of the terminal;
  • the display module of the terminal parses the corresponding speech recognition result and target file
  • each speech recognition result and the corresponding target file are displayed.
  • the display result can be as shown in Figure 7.
  • when semantic analysis cannot determine the service type or the corresponding target file cannot be determined, the speech recognition result is not displayed on the terminal. For example, if the semantics of 正楷的视频 ("regular-script video") cannot be understood or no content related to 正楷 calligraphy can be found in the resource library, that speech recognition result is not displayed on the terminal.
  • the display result can be as shown in Figure 8 and Figure 9.
  • further, if the user wants to switch the target speech recognition result, the speech recognition result can be switched, which specifically includes:
  • the modified target voice recognition result is displayed in a first display mode, and other voice recognition results are displayed in a second display mode; at the same time, the target file corresponding to the modified target voice recognition result is displayed.
  • the method further includes:
  • for example, if the user selects 天气预爆 on the display interface, 天气预爆 is recorded in the user's preferences, and the matching degree of 天气预爆 is increased.
  • the embodiments of the present disclosure also provide in some implementation manners, including:
  • the first control instruction is executed in the terminal.
  • if the voice information also contains an operation-type word, it means the terminal needs to perform a corresponding operation based on the voice information.
  • in this case an instruction to process according to the voice information can be sent directly to the terminal, for example for operation-type words such as open, watch and play.
  • if the recognized speech recognition result is 打开郑凯的视频 ("open Zheng Kai's video"), the first control instruction can be determined to be "open".
  • if the target file of 郑凯的视频 ("Zheng Kai's video") is unique, it can be opened directly.
  • multiple target files may be displayed first, and after obtaining the user's operation instruction, the open control instruction is executed.
  • the recognition result with the highest matching score is displayed to the user, while semantic recognition is performed separately on the at least one speech recognition result that meets the first matching threshold; combined with the semantic processing results, different service search results can be presented to the user through UI interaction so that the user's intention is better understood, and the user can select the desired result.
  • an embodiment of the present disclosure further provides a server 1100, as shown in FIG. 10, including a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, wherein the processor 1101, the communication interface 1102 and the memory 1103 communicate with one another through the communication bus 1104;
  • a computer program is stored in the memory 1103;
  • the processor 1101 is configured to run the computer program to enable the server 1100 to realize:
  • Receive voice information sent by the terminal; determine, according to the pre-trained voice matching model, at least one speech recognition result of the voice information that meets the first matching threshold; determine the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result; obtain the target file corresponding to the target speech recognition result; and display each speech recognition result and the target file corresponding to the target speech recognition result on the display interface of the terminal, wherein the target speech recognition result is displayed in the first display mode and the other speech recognition results are displayed in the second display mode.
  • the processor 1101 is configured to obtain the target file corresponding to the target speech recognition result by executing the following:
  • the processor 1101 is configured to perform semantic recognition on the target speech recognition result by executing the following to determine the service type corresponding to the target speech recognition result:
  • according to the preset dictionary library, perform word segmentation on the target speech recognition result, perform semantic recognition on each segmented word in the target speech recognition result, and determine the service type corresponding to each segmented word; according to the weight of the service type corresponding to each segmented word, determine the service type corresponding to the target speech recognition result.
  • the processor 1101 is configured to display each voice recognition result and the target file corresponding to the target voice recognition result to a display interface by executing the following:
  • the processor 1101 is further configured to run the computer program to enable the server 1100 to implement:
  • the target voice recognition result is displayed in a first display mode, other voice recognition results are displayed in a second display mode, and the target file corresponding to the modified target voice recognition result is displayed at the same time.
  • the processor 1101 is configured to determine, according to the pre-trained voice matching model, a speech recognition result of the voice information that meets the first matching threshold by executing the following:
  • Input the voice information into the voice matching model, recognize the pinyin sequence in the voice information, and form all possible candidate characters; for each possible candidate character, determine the possible Chinese character sequences and their scores through grammar rules and statistical methods; and take a Chinese character sequence whose score meets the first matching threshold as the speech recognition result.
  • the communication bus mentioned by the above server may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • PCI Peripheral Component Interconnect
  • EISA Extended Industry Standard Architecture
  • the communication bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 1102 is used for communication between the aforementioned server and other devices.
  • the memory may include random access memory (Random Access Memory, RAM), and may also include non-volatile memory (Non-Volatile Memory, NVM), such as at least one disk storage.
  • NVM non-Volatile Memory
  • the memory may also be at least one storage device located far away from the foregoing processor.
  • the foregoing processor may be a general-purpose processor, including a central processing unit, a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit, a field programmable gate array, or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • NP Network Processor
  • DSP digital signal processor
  • the embodiments of the present disclosure also provide a computer-readable non-volatile storage medium, and the computer-readable non-volatile storage medium stores a computer program executable by a server, When the program is running on the server, any one of the methods in the foregoing embodiments is realized when the server is executed.
  • the aforementioned computer-readable non-volatile storage medium may be any available medium or data storage device that can be accessed by the processor in the server, including but not limited to magnetic storage such as floppy disks, hard disks, magnetic tapes and magneto-optical (MO) disks; optical storage such as compact discs (CD), digital versatile discs (DVD), Blu-ray discs (BD) and high-definition versatile discs (HVD); and semiconductor memory such as read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), non-volatile memory (NAND FLASH) and solid-state drives (SSD).
  • magnetic storage such as floppy disks, hard disks, magnetic tapes, magneto-optical disks (MO), etc.
  • optical Storage such as compact disc (CD), digital versatile disc (DVD), Blu-ray disc (BD), high-definition versatile disc (HVD), etc.
  • semiconductor memory such as read only memory (ROM
  • an embodiment of the present disclosure further provides a terminal 1200, as shown in FIG. 11, including a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, wherein the processor 1201, the communication interface 1202 and the memory 1203 communicate with one another through the communication bus 1204;
  • a computer program is stored in the memory 1203;
  • the processor 1201 is configured to run the computer program to enable the terminal 1200 to realize:
  • Receive input voice information; determine, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets the first matching threshold; determine the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result; obtain the target file corresponding to the target speech recognition result; and display each speech recognition result and the target file corresponding to the target speech recognition result on the display interface, wherein the target speech recognition result is displayed in the first display mode and the other speech recognition results are displayed in the second display mode.
  • the processor 1201 is configured to obtain the target file corresponding to the target speech recognition result by executing the following:
  • the processor 1201 is configured to perform semantic recognition on the target voice recognition result by executing the following to determine the service type corresponding to the target voice recognition result:
  • according to the preset dictionary library, perform word segmentation on the target speech recognition result, perform semantic recognition on each segmented word in the target speech recognition result, and determine the service type corresponding to each segmented word; according to the weight of the service type corresponding to each segmented word, determine the service type corresponding to the target speech recognition result.
  • the processor 1201 is configured to display each voice recognition result and the target file corresponding to the target voice recognition result to a display interface by executing the following:
  • the processor 1201 is further configured to run the computer program to enable the terminal to realize:
  • the switching instruction is a user's switching instruction to the target voice recognition result obtained through the communication interface 1202.
  • the processor 1201 is configured to determine, according to the pre-trained voice matching model, a speech recognition result of the voice information that meets the first matching threshold by executing the following:
  • Input the voice information into the voice matching model, recognize the pinyin sequence in the voice information, and form all possible candidate characters; for each possible candidate character, determine the possible Chinese character sequences and their scores through grammar rules and statistical methods; and take a Chinese character sequence whose score meets the first matching threshold as the speech recognition result.
  • the communication bus mentioned by the above terminal may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • PCI Peripheral Component Interconnect
  • EISA Extended Industry Standard Architecture
  • the communication bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface 1202 is used for communication between the aforementioned terminal and other devices.
  • the memory may include RAM or NVM, such as at least one disk storage.
  • the memory may also be at least one storage device located far away from the foregoing processor.
  • the above-mentioned processor may be a general-purpose processor, including a central processing unit, NP, etc.; it may also be a DSP, an application specific integrated circuit, a field programmable gate array or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the embodiments of the present disclosure also provide a computer-readable non-volatile storage medium, and the computer-readable non-volatile storage medium stores a computer program executable by a terminal, When the program runs on the terminal, any method in the foregoing embodiments is implemented when the terminal is executed.
  • the above-mentioned computer-readable non-volatile storage medium may be any available medium or data storage device that can be accessed by the processor in the terminal, including but not limited to magnetic storage such as floppy disks, hard disks, magnetic tapes and magneto-optical (MO) disks; optical storage such as CD, DVD, BD and HVD; and semiconductor memory such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH) and solid-state drives (SSD).
  • magnetic storage such as floppy disk, hard disk, magnetic tape, magneto-optical disk (MO), etc.
  • optical Memory such as CD, DVD, BD, HVD, etc.
  • semiconductor memory such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid state drive (SSD), etc.
  • the embodiments of the present application can be provided as methods, systems or computer program products. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • a computer-usable storage media including but not limited to disk storage, CD-ROM, optical storage, etc.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure discloses a speech recognition method, apparatus and terminal. The method includes: receiving input voice information; determining, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets a first matching threshold; determining the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result; obtaining the target file corresponding to the target speech recognition result; and displaying each speech recognition result and the target file corresponding to the target speech recognition result on a display interface, wherein the target speech recognition result is displayed in a first display mode and the other speech recognition results are displayed in a second display mode.

Description

Speech recognition method, apparatus and terminal
Cross-Reference to Related Applications
This application claims priority to the Chinese patent application No. 201910211472.4, entitled "Speech recognition method, apparatus and terminal", filed with the Chinese Patent Office on March 20, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure mainly relates to the field of smart home technology, and in particular to a speech recognition method, apparatus and terminal.
Background
At present there are more and more speech recognition products, and with advances in the technology and growing adoption, users are gradually accepting and endorsing this mode of interaction. With the continuous improvement of voice interaction technology and artificial intelligence, application scenarios are expanding rapidly from voice assistants, smart speakers and the like. During use, a speech recognition product collects the sounds of the surrounding environment, performs semantic parsing, and executes the user's voice command operations.
At present, speech recognition technology still needs considerable improvement in self-adaptation. Real-world voice types are diverse: in terms of sound characteristics they can be divided into male, female and child voices. In addition, many people's pronunciation deviates greatly from the standard pronunciation, and homophones also occur, so that the recognition result obtained when the user inputs speech is inconsistent with the user's intention, which impairs the usability of speech recognition products.
Summary
An embodiment of the present disclosure provides a speech recognition method applied to a terminal, the method including:
receiving input voice information;
determining, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets a first matching threshold;
determining the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result;
obtaining the target file corresponding to the target speech recognition result;
displaying each speech recognition result and the target file corresponding to the target speech recognition result on a display interface, wherein the target speech recognition result is displayed in a first display mode, and the other speech recognition results are displayed in a second display mode.
In some implementations, obtaining the target file corresponding to the target speech recognition result includes:
performing semantic recognition on the target speech recognition result and determining the service type corresponding to the target speech recognition result;
searching a resource library, within the service type corresponding to the target speech recognition result, for the target file corresponding to the target speech recognition result.
In some implementations, performing semantic recognition on the target speech recognition result and determining the service type corresponding to the target speech recognition result includes:
performing word segmentation on the target speech recognition result according to a preset dictionary library, performing semantic recognition on each segmented word in the target speech recognition result, and determining the service type corresponding to each segmented word;
determining the service type corresponding to the target speech recognition result according to the weight of the service type corresponding to each segmented word.
In some implementations, displaying each speech recognition result and the target file corresponding to the target speech recognition result on the display interface includes:
determining the priority of each speech recognition result;
displaying each speech recognition result on the display interface of the terminal arranged according to the priority;
displaying the target file corresponding to the target speech recognition result on the display interface of the terminal.
In some implementations, after displaying each speech recognition result and the target file corresponding to the target speech recognition result on the display interface, the method further includes:
obtaining the user's switching instruction for the target speech recognition result;
determining, according to the switching instruction, the target file corresponding to the changed target speech recognition result;
displaying the changed target speech recognition result in the first display mode and the other speech recognition results in the second display mode, while displaying the target file corresponding to the changed target speech recognition result.
In some implementations, determining, according to the pre-trained voice matching model, a speech recognition result of the voice information that meets the first matching threshold includes:
inputting the voice information into the voice matching model, recognizing the pinyin sequence in the voice information, and forming all possible candidate characters;
for each possible candidate character, determining the possible Chinese character sequences and their scores through grammar rules and statistical methods;
taking a Chinese character sequence whose score meets the first matching threshold as the speech recognition result.
An embodiment of the present disclosure provides a terminal, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a computer program is stored in the memory;
the processor is configured to run the computer program to cause the terminal to implement:
receiving input voice information;
determining, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets a first matching threshold;
determining the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result;
obtaining the target file corresponding to the target speech recognition result;
displaying each speech recognition result and the target file corresponding to the target speech recognition result on a display interface, wherein the target speech recognition result is displayed in a first display mode, and the other speech recognition results are displayed in a second display mode.
In some implementations, the processor is configured to obtain the target file corresponding to the target speech recognition result by executing the following:
performing semantic recognition on the target speech recognition result and determining the service type corresponding to the target speech recognition result;
searching a resource library, within the service type corresponding to the target speech recognition result, for the target file corresponding to the target speech recognition result.
In some implementations, the processor is configured to perform semantic recognition on the target speech recognition result and determine the service type corresponding to the target speech recognition result by executing the following:
performing word segmentation on the target speech recognition result according to a preset dictionary library, performing semantic recognition on each segmented word in the target speech recognition result, and determining the service type corresponding to each segmented word;
determining the service type corresponding to the target speech recognition result according to the weight of the service type corresponding to each segmented word.
In some implementations, the processor is configured to display each speech recognition result and the target file corresponding to the target speech recognition result on the display interface by executing the following:
determining the priority of each speech recognition result;
displaying each speech recognition result on the display interface of the terminal arranged according to the priority;
displaying the target file corresponding to the target speech recognition result on the display interface of the terminal.
In some implementations, the processor is further configured to run the computer program to cause the terminal to implement:
obtaining the user's switching instruction for the target speech recognition result;
determining, according to the switching instruction, the target file corresponding to the changed target speech recognition result;
displaying the changed target speech recognition result in the first display mode and the other speech recognition results in the second display mode, while displaying the target file corresponding to the changed target speech recognition result.
In some implementations, the processor is configured to determine, according to the pre-trained voice matching model, a speech recognition result of the voice information that meets the first matching threshold by executing the following:
inputting the voice information into the voice matching model, recognizing the pinyin sequence in the voice information, and forming all possible candidate characters;
for each possible candidate character, determining the possible Chinese character sequences and their scores through grammar rules and statistical methods;
taking a Chinese character sequence whose score meets the first matching threshold as the speech recognition result.
An embodiment of the present disclosure provides a computer-readable non-volatile storage medium storing a computer program which, when executed by a processor, implements the steps of any one of the above methods applied to a terminal.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of the present disclosure more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a speech recognition method provided in Embodiment 1 of the present disclosure;
FIG. 2 is an example diagram of a voice matching model provided by an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of displaying a speech recognition result provided by an embodiment of the present disclosure;
FIG. 4a is a schematic diagram of a speech recognition result display provided by an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of a speech recognition method provided by an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of displaying a speech recognition result provided by an embodiment of the present disclosure;
FIG. 7 is a schematic flowchart of displaying a speech recognition result provided by an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a speech recognition result display provided by an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a speech recognition result display provided by an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a server provided by an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a terminal provided by an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in further detail below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them. Based on the embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present disclosure.
Speech recognition lets machines receive, recognize and understand speech signals and convert them into corresponding digital signals. Although speech recognition has produced a large number of applications in many industries, much work remains before true natural human-machine communication is achieved; for example, greater improvement is needed in self-adaptation so that recognition is unaffected by accents, dialects and particular speakers. Real-world voice types are diverse: in terms of sound characteristics they can be divided into male, female and child voices. In addition, many people's pronunciation deviates greatly from the standard pronunciation, which requires handling of accents and dialects; this leads to the problem that the recognition result obtained when the user inputs speech is inconsistent with the user's intention.
After the user performs voice input, many of the words the user needs recognized are homophone-based renderings of common phrases, for example 四大名助 → 四大名著, 陆垚知马莉 → 路遥知马力, 天气预报 → 天气预爆. When a terminal with a speech recognition function performs recognition, polyphonic words, accents, dialects, particular speakers, meaningless filler words in continuous speech and the like introduce many confounding factors, so the result the user wants may well not be recognized; some user intentions may not be realized and misrecognition occurs, which further affects the recognition speed and efficiency of the speech recognition system and degrades the user experience.
In order to solve the problems in the prior art, taking a scenario with a television as the terminal as an example, the embodiments of the present disclosure provide a speech recognition method, apparatus and terminal. The method includes: receiving input voice information; determining, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets a first matching threshold; determining the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result; obtaining the target file corresponding to the target speech recognition result; and displaying each speech recognition result and the target file corresponding to the target speech recognition result on a display interface, wherein the target speech recognition result is displayed in a first display mode and the other speech recognition results are displayed in a second display mode. By displaying the speech recognition result with the highest matching degree in the first display mode, quick display is achieved and user convenience is improved; semantic recognition is performed separately on at least one speech recognition result to obtain more possible user search intentions, and each speech recognition result is displayed on the display interface of the terminal in the second display mode, which effectively provides the user with more search results, improves the coverage of speech recognition results over user intentions, increases the success rate of voice search, and improves the usability of speech recognition products.
All the solutions provided by the embodiments of the present disclosure may be executed by a terminal or by a server, and may be configured as needed; this is not limited here. As shown in FIG. 1, the method includes:
Step 101: receive input voice information;
Here, the terminal may obtain the voice information input by the user through the terminal's own voice device, or through an external voice device. Specifically, a voice recognition module is provided in the terminal to recognize and collect voice information.
In addition, a communication module, such as a Wi-Fi wireless communication module, is provided in the terminal, so that the terminal can connect to a server and send the collected voice information to the server. Of course, processing may also be performed entirely by the terminal, or only the part of the voice information that needs server-side processing may be sent; this is not limited here.
Step 102: determine, according to a pre-trained voice matching model, at least one speech recognition result of the voice information that meets a first matching threshold;
In specific implementation, the voice matching model may be deployed on the terminal or on the server; this is not limited here. If it is deployed on the server, the server, after determining at least one speech recognition result of the voice information that meets the first matching threshold, sends the at least one speech recognition result to the terminal.
Step 103: determine the speech recognition result with the highest matching degree among the at least one speech recognition result as the target speech recognition result;
In specific implementation, the terminal may determine the target speech recognition result with the highest matching degree based on the scores of the at least one speech recognition result; alternatively, the server may determine it based on those scores and then send the target speech recognition result to the terminal.
Step 104: obtain the target file corresponding to the target speech recognition result;
In specific implementation, the terminal may search a local or network resource library for the target file corresponding to the target speech recognition result; alternatively, the server may search the network resource library according to the target speech recognition result and, after determining the target file, send the target file or its identification information to the terminal, so that the terminal determines the target file corresponding to the target speech recognition result.
Step 105: display each speech recognition result and the target file corresponding to the target speech recognition result on a display interface, wherein the target speech recognition result is displayed in a first display mode and the other speech recognition results are displayed in a second display mode.
In specific implementation, the second display mode may be a display mode opposite to the first display mode. For example, the first display mode may be highlighting, a selection box, or the like, and the second display mode may be no highlighting, no selection box, or the like. For example, the target display result shown in FIG. 7 uses the first display mode with a highlighted selection box, and the second display mode is without a highlighted selection box. The specific display modes are not limited here.
The speech recognition method provided by the embodiments of the present disclosure displays the speech recognition result with the highest matching degree in the first display mode, enabling quick display and improving user convenience; semantic recognition is performed separately on at least one speech recognition result to obtain more possible user search intentions, and each speech recognition result is displayed on the display interface of the terminal in the second display mode, which effectively provides the user with more search results, improves the coverage of speech recognition results over user intentions, increases the success rate of voice search, and improves the usability of speech recognition products.
In the embodiments of the present disclosure, as shown in FIG. 2, a method is provided for a speech recognition model to determine a speech recognition result, including:
Step 1: obtain the voice information input by the user, and determine the acoustic feature probability of the voice information through acoustic features;
Specifically, acoustic feature extraction means extracting acoustic feature information from the voice information. To ensure recognition accuracy, the extracted features should discriminate well between the modeling units of the acoustic model. The acoustic features in the embodiments of the present disclosure may include: Mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), perceptual linear prediction (PLP) coefficients, and the like.
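As a hedged illustration of this step, the sketch below extracts MFCC features from one utterance; the librosa library and the 16 kHz sampling rate are assumptions, since the disclosure does not prescribe a toolkit.

```python
# A minimal sketch of the acoustic feature extraction step, assuming the
# librosa library (the disclosure does not name a toolkit).
import librosa

def extract_mfcc(wav_path: str, n_mfcc: int = 13):
    """Return the MFCC matrix (n_mfcc x frames) of one utterance."""
    signal, sr = librosa.load(wav_path, sr=16000)  # 16 kHz is a common ASR rate
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
```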
Step 2: input the voice information after acoustic feature extraction into the voice matching model, which includes a language model and an acoustic model;
For example, in the embodiments of the present disclosure, the training process of the voice matching model may include:
Step 1: obtain sample voice information, where the sample voice information carries labeling information of the speech to which it belongs;
Step 2: input each piece of sample voice information into the voice matching model;
Step 3: train the voice matching model according to each piece of sample voice information and the output of the voice matching model.
To facilitate training of the voice matching model, a large amount of sample voice information can be collected; the sample voice information may be collected by the terminal or obtained through other channels, and the sample voice information can be labeled.
The sample voice information is input into the voice matching model to train it; the model may be based on dynamic time warping, a hidden Markov model (HMM), an artificial neural network, a support vector machine, or other models. Each piece of sample voice information is input into the model, and the voice matching model is trained according to the labeling information of each sample and the output of the model.
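Since an HMM is one of the models listed, the following sketch fits one Gaussian HMM per label on sample features; the hmmlearn library, the state count and the per-label training scheme are illustrative assumptions, not the disclosed implementation.

```python
# A sketch of fitting one HMM per label on sample voice features,
# assuming the hmmlearn library; hyperparameters are placeholders.
import numpy as np
from hmmlearn import hmm

def train_label_hmm(feature_list):
    """feature_list: list of (frames x dims) feature matrices of one label."""
    X = np.concatenate(feature_list)            # stack all utterances
    lengths = [len(f) for f in feature_list]    # frames per utterance
    model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    model.fit(X, lengths)                       # Baum-Welch (EM) training
    return model
```

At recognition time, each label's model would score the incoming features and the best-scoring label wins.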
In the embodiments of the present disclosure, the voice matching model is obtained by training on a large amount of sample voice information, and the collected voice information can be recognized through the voice matching model.
Here, the acoustic model is built in a supervised manner from the training speech features and their corresponding annotations. The acoustic model constructs the mapping between the observed features in the speech signal and the speech modeling units, so as to classify phonemes or phoneme states. In the embodiments of the present disclosure, the acoustic model may use an HMM as its modeling basis.
The language model may adopt an N-gram statistical language model within the statistical-learning speech recognition framework, in which a Markov chain represents the generation process of a word sequence, and the probability p(W) of generating the word sequence W is expressed as:
p(W) = ∏_{k=1}^{K} p(w_k | w_{k-n+1}, …, w_{k-1})
where w_k denotes the k-th word in the word sequence. The formula above shows that the probability of generating the current word depends only on the preceding n-1 words.
In the embodiments of the present disclosure, the training and evaluation metric of the language model may be the language model perplexity (PP), defined as the reciprocal of the geometric average of the word-sequence generation probability, namely:
PP(W) = p(W)^{-1/K}
As can be seen from the formula, the smaller the expected perplexity of the language model over generated word sequences, the more accurately the language model predicts the current word given the history of preceding words; the training objective of the language model is therefore to minimize the perplexity of the training corpus.
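The bigram (n = 2) case of the two formulas above can be illustrated with the toy sketch below; the corpus, add-one smoothing and the sentence-start marker are simplifying assumptions.

```python
# Toy bigram language model and perplexity, matching p(W) and PP(W) above.
import math
from collections import Counter

def train_bigram(sentences):
    uni, bi = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split()
        uni.update(toks)
        bi.update(zip(toks, toks[1:]))
    vocab = len(uni)
    # add-one smoothing keeps unseen pairs at a small nonzero probability
    return lambda prev, w: (bi[(prev, w)] + 1) / (uni[prev] + vocab)

def perplexity(prob, sentence):
    toks = ["<s>"] + sentence.split()
    logp = sum(math.log(prob(a, b)) for a, b in zip(toks, toks[1:]))
    return math.exp(-logp / (len(toks) - 1))    # PP(W) = p(W)^(-1/K)
```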
During training, the probabilities of the individual words and related word combinations appearing in the training corpus are first counted, and the relevant parameters of the language model are estimated on that basis.
However, the number of related word combinations grows geometrically with the size of the possible vocabulary, so counting every possible case is infeasible; moreover, in practice the training data are usually sparse, and some word combinations occur with very low probability or not at all. To address these problems, the language model can be optimized by methods such as discounting and backing-off, and by modeling the language model with a recurrent neural network (RNN).
Step 3: input the results obtained by the above models into the decoder for decoding to obtain the possible text of the speech.
Combining the acoustic probability of the speech features computed by the acoustic model with the language model probability computed by the language model, the decoder analyzes the most likely word sequence W′ through a search algorithm to output the possible text in the voice information.
In step 102, determining, according to the pre-trained voice matching model, a speech recognition result of the voice information that meets the first matching threshold includes:
Step 1: input the voice information into the voice matching model, recognize the pinyin sequence in the voice information, and form all possible candidate characters;
To confirm the correct character for each syllable, all possible character hypotheses, or monosyllabic and polysyllabic word hypotheses, are first formed from the input pinyin sequence. For example, for the input 郑凯的视频 ("Zheng Kai's video"), the corresponding pinyin sequence is [zheng4, kai3, de1, shi4, pin2]. As shown in FIG. 3, each path is a possible recognition result.
Step 2: for each possible candidate character, determine the possible Chinese character sequences and their scores through grammar rules and statistical methods;
Specifically, through the multiple candidate characters of each sound to be recognized, the scores of Chinese character sequences are obtained using grammar rules and statistical principles, and some pinyin recognition errors are corrected. A probabilistic-statistical language model is applied to search for possibly correct paths among the character or word sequences. The decoder in the embodiments of the present disclosure may adopt the Viterbi algorithm based on dynamic programming, and use certain algorithms (such as Gaussian selection and language model look-ahead) for fast synchronous probability calculation and pruning of the search space, so as to reduce computational complexity and memory overhead and improve the efficiency of the search algorithm.
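As a hedged sketch of the search just described, the minimal Viterbi decoder below picks the best path through a lattice of per-step candidate characters; the transition scorer is assumed to combine the acoustic and language-model probabilities upstream, and no pruning is shown.

```python
# A minimal Viterbi search over per-step candidates (a simplification of
# the pruned, synchronized search described in the disclosure).
def viterbi(candidates, trans_score):
    """candidates: list of candidate-word lists, one list per time step;
    trans_score(prev, cur): log score of appending cur after prev."""
    best = {w: (0.0, [w]) for w in candidates[0]}
    for step in candidates[1:]:
        nxt = {}
        for cur in step:
            score, path = max(
                (best[p][0] + trans_score(p, cur), best[p][1]) for p in best
            )
            nxt[cur] = (score, path + [cur])
        best = nxt
    return max(best.values())  # (best score, best character sequence)
```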
Step 3: take a Chinese character sequence whose score meets the first matching threshold as the speech recognition result.
Specifically, the at least one Chinese character sequence matched by the language model is sorted by score. Judging from the template matching results, a recognition result with a high matching degree has a higher probability of being correct. Of course, there are also cases where a model's matching score is high but the result is not correct because the relevant corpus is missing from the model. Therefore, speech recognition results whose matching scores meet the first threshold can be selected for semantic recognition.
Specifically, the score needs to meet the first threshold, that is, speech recognition results with scores greater than the first threshold ρ are taken as possibly correct recognition results. For example, the recognition results are as follows:
Recognition result | Matching score
郑恺的视频 (Zheng Kai's video) | 0.641
郑凯的视频 (homophonous "Zheng Kai's video") | 0.629
正楷的视频 (regular-script video) | 0.457
正凯德食品 (Zhengkaide Foods) | 0.231
As shown in the table above, the first threshold may be 0.4; the speech recognition results at this point are then 郑恺的视频, 郑凯的视频 and 正楷的视频.
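Selecting candidates by the first threshold, together with the highest-score fallback described next, could look like the sketch below; the scores are the example values from the table and ρ = 0.4 is the example threshold.

```python
# Threshold filtering of candidate sequences (a sketch, not the disclosed code).
RESULTS = {"郑恺的视频": 0.641, "郑凯的视频": 0.629,
           "正楷的视频": 0.457, "正凯德食品": 0.231}

def pick_results(results, threshold=0.4):
    kept = {t: s for t, s in results.items() if s > threshold}
    # fallback: if nothing passes, keep only the single best-scoring result
    return kept or dict([max(results.items(), key=lambda kv: kv[1])])
```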
In some implementations, when the model matching scores of all recognition results are less than the first threshold ρ, the recognition result with the highest score is taken and step 104 is performed for semantic processing.
To improve the user experience and show the search progress, before step 103 the target speech recognition result may also be output to and displayed on the interface of the terminal. The interface of the terminal may be the display interface of the voice assistant client that collects the voice information, or another interface of the terminal; this is not limited here. For example, as shown in FIG. 4a, the target speech recognition result is 郑恺的视频 ("Zheng Kai's video").
As shown in FIG. 4, the process of displaying the recognition result includes the following steps:
Step 1: create the layout file of the interface;
Here, the layout file includes a text control that displays the speech recognition result.
Step 2: create the interface, load the layout file, and initialize the text control.
Step 3: display the speech recognition result, i.e. the recognized text, on the display interface of the terminal.
To effectively improve recognition accuracy and coverage, the server stores a preset dictionary library containing a large amount of corpus data and has a semantic parsing function; after the cloud server receives the voice information, it uses its own semantic parsing function to parse the speech recognition result. Specifically, a semantic recognition model is stored in the server; the model can segment the voice information into words, determine the segmented words, recognize their semantics, and determine the target file corresponding to each meaning. Of course, if the dictionary library to be searched is small, semantic recognition can be performed on the terminal to increase the parsing speed; this is not limited here.
Step 104 includes:
Step 1: perform semantic recognition on the target speech recognition result and determine the service type corresponding to the target speech recognition result;
If the terminal performs semantic recognition, it can use the semantic recognition model on the terminal to output the segmented words of the speech recognition result, parse the semantics of each segmented word and the corresponding labeling result, and check whether the labeling result contains a related service type.
If the server performs semantic recognition, then after receiving the speech recognition result sent by the terminal, the server uses the semantic recognition model on the server to output the segmented words of the speech recognition result, parse the semantics of each segmented word and the corresponding labeling result, and check whether the labeling result contains a related service type.
Step 2: search the resource library, within the service type corresponding to the target speech recognition result, for the target file corresponding to the target speech recognition result.
To further improve the accuracy of semantic recognition, in the embodiments of the present disclosure the specific implementation of semantic recognition model recognition may include the following two steps (a sketch follows the list):
Step 1: perform word segmentation on the target speech recognition result according to the preset dictionary library, perform semantic recognition on each segmented word in the target speech recognition result, and determine the service type corresponding to each segmented word;
Here, the preset dictionary library can obtain corpus data through methods such as web crawling, so as to update the segmented words and the annotations of the corresponding service types.
Step 2: determine the service type corresponding to the target speech recognition result according to the weight of the service type corresponding to each segmented word.
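A hedged sketch of these two steps: segmented words are mapped to service types via a dictionary, and the final type is resolved by accumulated weight. The dictionary entries and weight values are invented placeholders, not the disclosed dictionary library.

```python
# Word-to-service-type lookup with weight-based resolution (illustrative).
DICT = {"视频": ("video", 0.9), "正楷": ("education", 0.6),
        "天气": ("weather", 0.8)}

def service_type(tokens):
    scores = {}
    for t in tokens:
        if t in DICT:
            svc, w = DICT[t]
            scores[svc] = scores.get(svc, 0.0) + w
    # equal weights may legitimately yield several service types (see below)
    return max(scores, key=scores.get) if scores else None
```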
为进一步提高检索效率,针对除目标语音识别结果,超过第一阈值的语音识别结果中的其他语音识别结果也可以与目标语音识别结果同时执行上述操作。当然,也可以在收到用户的切换指令后,再执行上述操作,在此不做限定。
如图5所示,具体的,可以包括:
步骤一、对所述至少一个语音识别结果中的每个语音识别结果进行语义识别,确定所述语音识别结果对应的业务类型;
具体地,将语音识别结果1输入语义识别模型,如果语义识别模型输出的结果中包含的业务类型1,则认为该语音识别结果1中包含业务类型1,需要在业务类型1对应的应用程序中执行后续处理。
举例来说,语音识别结果1为“郑恺的视频”,语义识别模型输出的分词结果为:郑恺,的,视频。其中,视频的业务类型为视频类型,则语音识别结果的业务类型为视频类型。
在一些实施方式中,业务类型,还可以根据分词的属性确定。例如,语音识别结果2为“天气预报”,语义识别模型确定的分词结果为:天气,预报;“天气”具有天气属性(weatherKeys),则确定业务类型为天气查询类型。
To further improve the accuracy of semantic recognition, in the embodiments of the present disclosure the specific implementation of recognition by the semantic recognition model may include:

Step 1: Perform word segmentation on the voice recognition result according to the preset dictionary library, perform semantic recognition on each word segment in the voice recognition result, and determine the service type corresponding to each word segment;

The preset dictionary library may acquire corpora through methods such as web crawlers, so as to update the word segments and the annotations of their corresponding service types.

Step 2: Determine the service type corresponding to the voice recognition result according to the weights of the service types corresponding to the word segments.

In some implementations, the weight of a service type is determined according to at least one of: the priority of the service type in the terminal, the priority of the repository from which the word segments in the preset dictionary library originate, or the user preferences of the terminal user.

For example, voice recognition result 3 is “正楷的视频”; the word segmentation determined by the semantic recognition model is: 正楷, 的, 视频. The service type of 视频 is the video type, and the service type of 正楷 (regular script) is the education type. If the weight of the video type corresponding to “视频” is determined to be greater than the weight of the education type corresponding to “正楷”, the service type of voice recognition result 3 is determined to be the video type. If the two weights are determined to be the same, the service types corresponding to voice recognition result 3 may also be determined to be both the education type and the video type.

As another example, voice recognition result 4 is “天气预爆”; the word segmentation determined by the semantic recognition model is: 天气预爆. According to the preset dictionary library, 天气预爆 is determined to be a movie, whose corresponding service types include the video type, the song type and so on; the service type of voice recognition result 4 is then determined according to the weight of the video type corresponding to “天气预爆” and the weight of the song type corresponding to “天气预爆”.
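For illustration only, a minimal Python sketch of the weight-based decision just described; the segment-to-type mapping and the weights are invented for the example:

```python
# Hypothetical annotations: word segment -> (service type, weight).
segment_types = {
    "视频": ("video", 0.8),
    "正楷": ("education", 0.5),
    "天气": ("weather", 0.9),
}

def service_types(segments):
    """Return the service type(s) with the highest weight among segments."""
    scored = [segment_types[s] for s in segments if s in segment_types]
    if not scored:
        return []
    best = max(w for _, w in scored)
    # On a tie, keep every top-weighted type (cf. voice recognition result 3).
    return sorted({t for t, w in scored if w == best})

print(service_types(["正楷", "的", "视频"]))  # ['video']
```

With equal invented weights for "正楷" and "视频", the same function would return both the education and video types, mirroring the tie handling above.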
In step 2, the target files corresponding to the at least one voice recognition result are searched for in the resource library within the service types corresponding to the at least one voice recognition result.

Continuing the above examples: for voice recognition result 1, target files of 郑恺 may be searched within the video type of the resource library; for voice recognition result 2, target files of the weather forecast may be searched within the weather query service of the resource library; for voice recognition result 3, target files of 正楷 may be searched within the video type, the education type, or the education-video type of the resource library; for voice recognition result 4, target files of 天气预爆 may be searched within the video type or the song type of the resource library.
Step 105 specifically includes:

Step 1: Determine the priority of each of the at least one voice recognition result;

Specifically, combined with the semantic analysis, the user interface (User Interface, UI) presents the search results in the form of tabs (Tabulator key, TAB), and the displayed results are ordered mainly by ranking the tabs according to the hot-search ranking.

Step 2: Display each voice recognition result on the display interface of the terminal, arranged according to the priority;

The priority may be determined based on big-data analysis of users, the matching scores, user preferences and the like, which is not limited here.

Step 3: Display each voice recognition result and the target file corresponding to the target voice recognition result on the display interface.
In a specific implementation, as shown in Fig. 6, the process includes:

the semantic recognition module converts the TAB data and the target files corresponding to the voice recognition results into JSON data and transmits them to the display module of the terminal;

after obtaining the JSON data, the display module of the terminal parses out the corresponding voice recognition results and target files;

each voice recognition result and its corresponding target file are displayed according to the parsing result.
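For illustration only, a minimal Python sketch of such a JSON hand-off between the two modules; the field names (tabs, target_files) and file identifiers are invented, as the disclosure does not specify the schema:

```python
import json

# Semantic recognition module: serialize the TAB data and target files.
payload = json.dumps({
    "tabs": ["郑恺的视频", "郑凯的视频", "正楷的视频"],
    "target_files": {"郑恺的视频": ["file_001", "file_002"]},
}, ensure_ascii=False)

# Display module: parse the JSON and render each result with its files.
data = json.loads(payload)
for tab in data["tabs"]:
    files = data["target_files"].get(tab, [])
    print(tab, files)
```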
Continuing the above examples, if the determined ordering is 郑恺 > 郑凯 > 正楷, the displayed result may be as shown in Fig. 7.

In some implementations, when semantic analysis cannot determine the service type, or the corresponding target file cannot be determined, the voice recognition result is not displayed on the terminal. For example, if the semantics of “正楷的视频” cannot be understood, or no content related to 正楷 (regular-script) calligraphy can be found in the resource library, the voice recognition result is not displayed on the terminal.

Continuing the above examples, if the determined ordering is 天气预报 > 天气预爆, the displayed result may be as shown in Fig. 8 and Fig. 9.
Further, if the user wants to switch the target voice recognition result, the voice recognition result can be switched, which specifically includes:

obtaining a switch instruction of the user for the target voice recognition result;

determining, according to the switch instruction, the target file corresponding to the changed target voice recognition result;

displaying the changed target voice recognition result in the first display mode and the other voice recognition results in the second display mode, while displaying the target file corresponding to the changed target voice recognition result.

Specifically, for determining the target file corresponding to the changed target voice recognition result, reference may be made to the above embodiments, and details are not repeated here.
To further improve the accuracy of speech recognition, in the embodiments of the present disclosure the method further includes:

obtaining an operation instruction of the user for a voice recognition result or a target file;

increasing the matching degree of the voice recognition result or target file corresponding to the operation instruction, so as to update the user preferences.

For example, if the user selects “天气预爆” on the display interface, “天气预爆” is recorded in the user's preferences and the matching degree of “天气预爆” is increased.
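For illustration only, a minimal Python sketch of such a preference update; the storage format and the size of the increment are invented:

```python
from collections import defaultdict

# Hypothetical per-user preference store: phrase -> matching-degree boost.
preferences = defaultdict(float)

def record_selection(phrase, boost=0.05):
    """Raise the matching degree of whatever the user actually picked."""
    preferences[phrase] += boost

record_selection("天气预爆")
print(dict(preferences))  # {'天气预爆': 0.05}
```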
To further improve the accuracy of speech recognition, the embodiments of the present disclosure further provide, in some implementations, the following:

judging whether the voice information includes a first control instruction for controlling the terminal;

if the user's voice information is a first control instruction for controlling the terminal, executing the first control instruction in the terminal.

In some implementations, if the voice information also contains a word segment of an operation type, the terminal needs to perform a corresponding operation according to the voice information. In this case, an instruction to process according to the voice information may be sent to the terminal directly. Operation-type word segments include, for example, open, watch and play.

In some implementations, it is judged whether the semantics of the voice information contain a target control instruction set for the terminal; if so, the first control instruction is executed in the terminal.

For example, if the recognized voice recognition result is “打开郑凯的视频” (“open Zheng Kai's videos”), the first control instruction may be determined to be open.

In some implementations, if it is determined that the target file of “郑凯的视频” is unique, the target file of “郑凯的视频” may be opened directly.

In some implementations, if it is determined that there are multiple target files of “郑凯的视频”, the multiple target files may be displayed first, and the open control instruction is executed after an operation instruction of the user is obtained.
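For illustration only, a minimal Python sketch of this dispatch logic; the operation-word list and the file lookup are invented for the example:

```python
OPERATION_WORDS = ("打开", "观看", "播放")  # open, watch, play

# Hypothetical resource lookup: query -> list of matching target files.
def find_target_files(query):
    return {"郑凯的视频": ["video_042"]}.get(query, [])

def handle(utterance):
    for op in OPERATION_WORDS:
        if utterance.startswith(op):
            query = utterance[len(op):]
            files = find_target_files(query)
            if len(files) == 1:  # unique target file: execute directly
                return f"execute {op}: {files[0]}"
            return f"display candidates: {files}"  # wait for user choice
    return "no control instruction"

print(handle("打开郑凯的视频"))  # execute 打开: video_042
```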
In the embodiments of the present disclosure, the recognition result with the highest matching score from the voice matching model is presented to the user, while semantic recognition is performed separately on the at least one voice recognition result that satisfies the first matching threshold. Combined with the semantic processing results, different service search results can be presented to the user through UI interaction, so that the user's intent is better understood. Compared with speech recognition methods in the related art, searching and presenting services for homophone names is achieved through multiple semantic analysis requests, and the user can select the desired result according to his or her intention.
On the basis of the above embodiments, an embodiment of the present disclosure further provides a server 1100. As shown in Fig. 10, the server includes: a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, where the processor 1101, the communication interface 1102 and the memory 1103 communicate with each other through the communication bus 1104;

a computer program is stored in the memory 1103;

the processor 1101 is configured to run the computer program to cause the server 1100 to implement:

receiving voice information sent by a terminal; determining, according to a pre-trained voice matching model, at least one voice recognition result of the voice information that satisfies a first matching threshold; determining the voice recognition result with the highest matching degree among the at least one voice recognition result as the target voice recognition result; obtaining the target file corresponding to the target voice recognition result; and displaying each voice recognition result and the target file corresponding to the target voice recognition result on a display interface of the terminal, where the target voice recognition result is displayed in a first display mode and the other voice recognition results are displayed in a second display mode.

In some implementations, the processor 1101 is configured to perform the obtaining of the target file corresponding to the target voice recognition result by:

performing semantic recognition on the target voice recognition result to determine the service type corresponding to the target voice recognition result; and searching, in a resource library, within the service type corresponding to the target voice recognition result, for the target file corresponding to the target voice recognition result.

In some implementations, the processor 1101 is configured to perform the semantic recognition on the target voice recognition result to determine the service type corresponding to the target voice recognition result by:

performing word segmentation on the target voice recognition result according to a preset dictionary library, performing semantic recognition on each word segment in the target voice recognition result, and determining the service type corresponding to each word segment; and determining the service type corresponding to the target voice recognition result according to the weights of the service types corresponding to the word segments.

In some implementations, the processor 1101 is configured to perform the displaying of each voice recognition result and the target file corresponding to the target voice recognition result on the display interface by:

determining the priority of each voice recognition result and sending the priority of each voice recognition result to the terminal, so that the terminal displays each voice recognition result on the display interface arranged according to the priority, and displays the target file corresponding to the target voice recognition result on the display interface of the terminal.

In some implementations, the processor 1101 is further configured to run the computer program to cause the server 1100 to implement:

receiving, through the communication interface 1102, a switch instruction of the user for the target voice recognition result sent by the terminal; and determining, according to the switch instruction, the target file corresponding to the changed target voice recognition result and sending it to the terminal, so that the terminal displays the changed target voice recognition result in the first display mode and the other voice recognition results in the second display mode, while displaying the target file corresponding to the changed target voice recognition result.

In some implementations, the processor 1101 is configured to perform the determining, according to the pre-trained voice matching model, of the voice recognition results of the voice information that satisfy the first matching threshold by:

inputting the voice information into the voice matching model, recognizing the pinyin sequence in the voice information, and forming all possible candidate characters; for each possible candidate character, determining the possible Chinese character sequences and their scores through grammar rules and statistical methods; and taking the Chinese character sequences whose scores satisfy the first matching threshold as the voice recognition results.
The communication bus mentioned for the above server may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface 1102 is used for communication between the above server and other devices.

The memory may include a random access memory (Random Access Memory, RAM), and may also include a non-volatile memory (Non-Volatile Memory, NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit, a network processor (Network Processor, NP) and the like; it may also be a digital signal processor (Digital Signal Processing, DSP), an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

On the basis of the above embodiments, an embodiment of the present disclosure further provides a computer-readable non-volatile storage medium. The computer-readable non-volatile storage medium stores a computer program executable by a server, and when the program runs on the server, the server implements any one of the methods in the above embodiments.

The above computer-readable non-volatile storage medium may be any available medium or data storage device accessible by the processor in the server, including but not limited to magnetic memories such as floppy disks, hard disks, magnetic tapes and magneto-optical discs (MO), optical memories such as compact discs (CD), digital versatile discs (DVD), Blu-ray discs (BD) and high-definition versatile discs (HVD), and semiconductor memories such as read-only memory (ROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), non-volatile memory (NAND FLASH) and solid-state drives (SSD).
On the basis of the above embodiments, an embodiment of the present disclosure further provides a terminal 1200. As shown in Fig. 11, the terminal includes: a processor 1201, a communication interface 1202, a memory 1203 and a communication bus 1204, where the processor 1201, the communication interface 1202 and the memory 1203 communicate with each other through the communication bus 1204;

a computer program is stored in the memory 1203;

the processor 1201 is configured to run the computer program to cause the terminal 1200 to implement:

receiving input voice information; determining, according to a pre-trained voice matching model, at least one voice recognition result of the voice information that satisfies a first matching threshold; determining the voice recognition result with the highest matching degree among the at least one voice recognition result as the target voice recognition result; obtaining the target file corresponding to the target voice recognition result; and displaying each voice recognition result and the target file corresponding to the target voice recognition result on a display interface, where the target voice recognition result is displayed in a first display mode and the other voice recognition results are displayed in a second display mode.

In some implementations, the processor 1201 is configured to perform the obtaining of the target file corresponding to the target voice recognition result by:

performing semantic recognition on the target voice recognition result to determine the service type corresponding to the target voice recognition result; and searching, in a resource library, within the service type corresponding to the target voice recognition result, for the target file corresponding to the target voice recognition result.

In some implementations, the processor 1201 is configured to perform the semantic recognition on the target voice recognition result to determine the service type corresponding to the target voice recognition result by:

performing word segmentation on the target voice recognition result according to a preset dictionary library, performing semantic recognition on each word segment in the target voice recognition result, and determining the service type corresponding to each word segment; and determining the service type corresponding to the target voice recognition result according to the weights of the service types corresponding to the word segments.

In some implementations, the processor 1201 is configured to perform the displaying of each voice recognition result and the target file corresponding to the target voice recognition result on the display interface by:

determining the priority of each voice recognition result; displaying each voice recognition result on the display interface of the terminal arranged according to the priority; and displaying the target file corresponding to the target voice recognition result on the display interface of the terminal.

In some implementations, the processor 1201 is further configured to run the computer program to cause the terminal to implement:

obtaining a switch instruction of the user for the target voice recognition result; determining, according to the switch instruction, the target file corresponding to the changed target voice recognition result; and displaying the changed target voice recognition result in the first display mode and the other voice recognition results in the second display mode, while displaying the target file corresponding to the changed target voice recognition result. The switch instruction is a switch instruction of the user for the target voice recognition result obtained through the communication interface 1202.

In some implementations, the processor 1201 is configured to perform the determining, according to the pre-trained voice matching model, of the voice recognition results of the voice information that satisfy the first matching threshold by:

inputting the voice information into the voice matching model, recognizing the pinyin sequence in the voice information, and forming all possible candidate characters; for each possible candidate character, determining the possible Chinese character sequences and their scores through grammar rules and statistical methods; and taking the Chinese character sequences whose scores satisfy the first matching threshold as the voice recognition results.
The communication bus mentioned for the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface 1202 is used for communication between the above terminal and other devices.

The memory may include a RAM, and may also include an NVM, for example at least one disk memory. Optionally, the memory may also be at least one storage device located away from the aforementioned processor.

The above processor may be a general-purpose processor, including a central processing unit, an NP and the like; it may also be a DSP, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

On the basis of the above embodiments, an embodiment of the present disclosure further provides a computer-readable non-volatile storage medium. The computer-readable non-volatile storage medium stores a computer program executable by a terminal, and when the program runs on the terminal, the terminal implements any one of the methods in the above embodiments.

The above computer-readable non-volatile storage medium may be any available medium or data storage device accessible by the processor in the terminal, including but not limited to magnetic memories such as floppy disks, hard disks, magnetic tapes and magneto-optical discs (MO), optical memories such as CD, DVD, BD and HVD, and semiconductor memories such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH) and solid-state drives (SSD).
As for the system/device embodiments, since they are basically similar to the method embodiments, the description is relatively simple, and for relevant parts reference may be made to the description of the method embodiments.

It should be noted that, herein, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk memory, CD-ROM, optical memory and the like) containing computer-usable program code.

The present disclosure is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present disclosure have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present disclosure.

Obviously, those skilled in the art can make various changes and variations to the present disclosure without departing from the spirit and scope of the present disclosure. In this way, if these modifications and variations of the present disclosure fall within the scope of the claims of the present disclosure and their technical equivalents, the present disclosure is also intended to include these changes and variations.

Claims (13)

  1. A voice recognition method, the method comprising:
    receiving input voice information;
    determining, according to a pre-trained voice matching model, at least one voice recognition result of the voice information that satisfies a first matching threshold;
    determining the voice recognition result with the highest matching degree among the at least one voice recognition result as a target voice recognition result;
    obtaining a target file corresponding to the target voice recognition result;
    displaying each voice recognition result and the target file corresponding to the target voice recognition result on a display interface, wherein the target voice recognition result is displayed in a first display mode, and the other voice recognition results are displayed in a second display mode.
  2. The method according to claim 1, wherein the obtaining the target file corresponding to the target voice recognition result comprises:
    performing semantic recognition on the target voice recognition result to determine a service type corresponding to the target voice recognition result;
    searching, in a resource library, within the service type corresponding to the target voice recognition result, for the target file corresponding to the target voice recognition result.
  3. The method according to claim 2, wherein the performing semantic recognition on the target voice recognition result to determine the service type corresponding to the target voice recognition result comprises:
    performing word segmentation on the target voice recognition result according to a preset dictionary library, performing semantic recognition on each word segment in the target voice recognition result, and determining the service type corresponding to each word segment;
    determining the service type corresponding to the target voice recognition result according to weights of the service types corresponding to the word segments.
  4. The method according to claim 1, wherein the displaying each voice recognition result and the target file corresponding to the target voice recognition result on the display interface comprises:
    determining a priority of each voice recognition result;
    displaying each voice recognition result on the display interface of the terminal arranged according to the priority;
    displaying the target file corresponding to the target voice recognition result on the display interface of the terminal.
  5. The method according to claim 4, wherein after the displaying each voice recognition result and the target file corresponding to the target voice recognition result on the display interface, the method further comprises:
    obtaining a switch instruction of a user for the target voice recognition result;
    determining, according to the switch instruction, a target file corresponding to the changed target voice recognition result;
    displaying the changed target voice recognition result in the first display mode and the other voice recognition results in the second display mode, while displaying the target file corresponding to the changed target voice recognition result.
  6. The method according to any one of claims 1 to 5, wherein the determining, according to the pre-trained voice matching model, the voice recognition results of the voice information that satisfy the first matching threshold comprises:
    inputting the voice information into the voice matching model, recognizing a pinyin sequence in the voice information, and forming all possible candidate characters;
    for each possible candidate character, determining possible Chinese character sequences and scores of the Chinese character sequences through grammar rules and statistical methods;
    taking the Chinese character sequences whose scores satisfy the first matching threshold as the voice recognition results.
  7. A terminal, comprising a memory and a processor, wherein
    the memory stores a computer program;
    the processor, in communication with the memory, is configured to run the computer program to cause the terminal to implement:
    receiving input voice information;
    determining, according to a pre-trained voice matching model, at least one voice recognition result of the voice information that satisfies a first matching threshold;
    determining the voice recognition result with the highest matching degree among the at least one voice recognition result as a target voice recognition result;
    obtaining a target file corresponding to the target voice recognition result;
    displaying each voice recognition result and the target file corresponding to the target voice recognition result on a display interface, wherein the target voice recognition result is displayed in a first display mode, and the other voice recognition results are displayed in a second display mode.
  8. The terminal according to claim 7, wherein the processor is configured to perform the obtaining of the target file corresponding to the target voice recognition result by:
    performing semantic recognition on the target voice recognition result to determine a service type corresponding to the target voice recognition result;
    searching, in a resource library, within the service type corresponding to the target voice recognition result, for the target file corresponding to the target voice recognition result.
  9. The terminal according to claim 8, wherein the processor is configured to perform the semantic recognition on the target voice recognition result to determine the service type corresponding to the target voice recognition result by:
    performing word segmentation on the target voice recognition result according to a preset dictionary library, performing semantic recognition on each word segment in the target voice recognition result, and determining the service type corresponding to each word segment;
    determining the service type corresponding to the target voice recognition result according to weights of the service types corresponding to the word segments.
  10. The terminal according to claim 7, wherein the processor is configured to perform the displaying of each voice recognition result and the target file corresponding to the target voice recognition result on the display interface by:
    determining a priority of each voice recognition result;
    displaying each voice recognition result on the display interface of the terminal arranged according to the priority;
    displaying the target file corresponding to the target voice recognition result on the display interface of the terminal.
  11. The terminal according to claim 10, wherein the processor is further configured to run the computer program to cause the terminal to implement:
    obtaining a switch instruction of a user for the target voice recognition result;
    determining, according to the switch instruction, a target file corresponding to the changed target voice recognition result;
    displaying the changed target voice recognition result in the first display mode and the other voice recognition results in the second display mode, while displaying the target file corresponding to the changed target voice recognition result.
  12. The terminal according to any one of claims 7 to 11, wherein the processor is configured to perform the determining, according to the pre-trained voice matching model, of the voice recognition results of the voice information that satisfy the first matching threshold by:
    inputting the voice information into the voice matching model, recognizing a pinyin sequence in the voice information, and forming all possible candidate characters;
    for each possible candidate character, determining possible Chinese character sequences and scores of the Chinese character sequences through grammar rules and statistical methods;
    taking the Chinese character sequences whose scores satisfy the first matching threshold as the voice recognition results.
  13. A computer-readable non-volatile storage medium, the readable non-volatile storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 6.
PCT/CN2019/106806 2019-03-20 2019-09-19 一种语音识别方法、装置及终端 WO2020186712A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910211472.4 2019-03-20
CN201910211472.4A CN109976702A (zh) 2019-03-20 2019-03-20 一种语音识别方法、装置及终端

Publications (1)

Publication Number Publication Date
WO2020186712A1 true WO2020186712A1 (zh) 2020-09-24

Family

ID=67079603

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/106806 WO2020186712A1 (zh) 2019-03-20 2019-09-19 一种语音识别方法、装置及终端

Country Status (2)

Country Link
CN (1) CN109976702A (zh)
WO (1) WO2020186712A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109976702A (zh) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 一种语音识别方法、装置及终端
CN110427459B (zh) * 2019-08-05 2021-09-17 思必驰科技股份有限公司 语音识别网络的可视化生成方法、系统及平台
CN110335606B (zh) * 2019-08-07 2022-04-19 广东电网有限责任公司 一种用于工器具管控的语音交互装置
CN112447168A (zh) * 2019-09-05 2021-03-05 阿里巴巴集团控股有限公司 语音识别系统、方法、音箱、显示设备和交互平台
CN112802474A (zh) * 2019-10-28 2021-05-14 中国移动通信有限公司研究院 语音识别方法、装置、设备及存储介质
CN110931018A (zh) * 2019-12-03 2020-03-27 珠海格力电器股份有限公司 智能语音交互的方法、装置及计算机可读存储介质
CN111192572A (zh) * 2019-12-31 2020-05-22 斑马网络技术有限公司 语义识别的方法、装置及系统
CN112735394B (zh) * 2020-12-16 2022-12-30 青岛海尔科技有限公司 一种语音的语义解析方法及装置
CN113823271B (zh) * 2020-12-18 2024-07-16 京东科技控股股份有限公司 语音分类模型的训练方法、装置、计算机设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1021254A (ja) * 1996-06-28 1998-01-23 Toshiba Corp 音声認識機能付き情報検索装置
CN101557651A (zh) * 2008-04-08 2009-10-14 Lg电子株式会社 移动终端及其菜单控制方法
CN101557432A (zh) * 2008-04-08 2009-10-14 Lg电子株式会社 移动终端及其菜单控制方法
CN106356056A (zh) * 2016-10-28 2017-01-25 腾讯科技(深圳)有限公司 语音识别方法和装置
CN109976702A (zh) * 2019-03-20 2019-07-05 青岛海信电器股份有限公司 一种语音识别方法、装置及终端

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7577569B2 (en) * 2001-09-05 2009-08-18 Voice Signal Technologies, Inc. Combined speech recognition and text-to-speech generation
US20070055520A1 (en) * 2005-08-31 2007-03-08 Microsoft Corporation Incorporation of speech engine training into interactive user tutorial
JP5042799B2 (ja) * 2007-04-16 2012-10-03 ソニー株式会社 音声チャットシステム、情報処理装置およびプログラム
CN101604520A (zh) * 2009-07-16 2009-12-16 北京森博克智能科技有限公司 基于统计模型和语法规则的口语语音识别方法
CN102867512A (zh) * 2011-07-04 2013-01-09 余喆 自然语音识别方法和装置
CN103176591A (zh) * 2011-12-21 2013-06-26 上海博路信息技术有限公司 一种基于语音识别的文本定位和选择方法
CN103176998A (zh) * 2011-12-21 2013-06-26 上海博路信息技术有限公司 一种基于语音识别的阅读辅助系统
KR101990037B1 (ko) * 2012-11-13 2019-06-18 엘지전자 주식회사 이동 단말기 및 그것의 제어 방법
CN105489220B (zh) * 2015-11-26 2020-06-19 北京小米移动软件有限公司 语音识别方法及装置
CN105679318A (zh) * 2015-12-23 2016-06-15 珠海格力电器股份有限公司 一种基于语音识别的显示方法、装置、显示系统和空调
CN105869636A (zh) * 2016-03-29 2016-08-17 上海斐讯数据通信技术有限公司 一种语音识别装置及其方法、一种智能电视及其控制方法
CN106098063B (zh) * 2016-07-01 2020-05-22 海信集团有限公司 一种语音控制方法、终端设备和服务器
CN109492175A (zh) * 2018-10-23 2019-03-19 青岛海信电器股份有限公司 应用程序界面的显示方法及装置、电子设备、存储介质

Also Published As

Publication number Publication date
CN109976702A (zh) 2019-07-05

Similar Documents

Publication Publication Date Title
WO2020186712A1 (zh) 一种语音识别方法、装置及终端
CN111933129B (zh) 音频处理方法、语言模型的训练方法、装置及计算机设备
CN108711421B (zh) 一种语音识别声学模型建立方法及装置和电子设备
CN108711420B (zh) 多语言混杂模型建立、数据获取方法及装置、电子设备
US10176804B2 (en) Analyzing textual data
CN110364171B (zh) 一种语音识别方法、语音识别系统及存储介质
KR102390940B1 (ko) 음성 인식을 위한 컨텍스트 바이어싱
US11823678B2 (en) Proactive command framework
US20200327883A1 (en) Modeling method for speech recognition, apparatus and device
JP2021033255A (ja) 音声認識方法、装置、機器及びコンピュータ可読記憶媒体
Żelasko et al. Punctuation prediction model for conversational speech
CN103956169B (zh) 一种语音输入方法、装置和系统
CN109754809A (zh) 语音识别方法、装置、电子设备及存储介质
US10909972B2 (en) Spoken language understanding using dynamic vocabulary
US20170032781A1 (en) Collaborative language model biasing
Scharenborg et al. Building an ASR system for a low-research language through the adaptation of a high-resource language ASR system: preliminary results
WO2018192186A1 (zh) 语音识别方法及装置
JP2021018413A (ja) ストリーミングアテンションモデルに基づく音声認識復号化方法、装置、機器及びコンピュータ可読記憶媒体
US11810556B2 (en) Interactive content output
US20240211206A1 (en) System command processing
Li et al. Combining CNN and BLSTM to Extract Textual and Acoustic Features for Recognizing Stances in Mandarin Ideological Debate Competition.
Pan et al. A chapter-wise understanding system for text-to-speech in Chinese novels
Mary et al. Searching speech databases: features, techniques and evaluation measures
US20220392439A1 (en) Rescoring Automatic Speech Recognition Hypotheses Using Audio-Visual Matching
CN112037772A (zh) 基于多模态的响应义务检测方法、系统及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19920045; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19920045; Country of ref document: EP; Kind code of ref document: A1)