CN110070859B - Voice recognition method and device - Google Patents

Voice recognition method and device

Info

Publication number
CN110070859B
CN110070859B (application CN201810063341.1A)
Authority
CN
China
Prior art keywords
voice data
real
keywords
candidate word
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810063341.1A
Other languages
Chinese (zh)
Other versions
CN110070859A (en)
Inventor
万玉龙
高杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810063341.1A
Publication of CN110070859A
Application granted
Publication of CN110070859B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/18 - Speech classification or search using natural language modelling
    • G10L15/1822 - Parsing for meaning understanding
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application discloses a voice recognition method and device. The method comprises the following steps: acquiring voice data in an offline environment; performing semantic parsing on the voice data and identifying a real-time keyword in the voice data; acquiring, from a local data source, at least one candidate word matching the real-time keyword; generating a search model based on the at least one candidate word; and determining a recognition result of the voice data using the search model. The embodiments of the present application improve the efficiency of voice data recognition while ensuring high recognition accuracy for voice data under offline conditions.

Description

Voice recognition method and device
Technical Field
The present disclosure relates to the field of intelligent speech recognition technologies, and in particular, to a speech recognition method and apparatus.
Background
In recent years, intelligent speech interaction technology has developed rapidly. Built on speech recognition, speech synthesis, natural language understanding, and related technologies, it allows enterprises to give their products a human-computer interaction experience that can "listen, speak, and understand you" in a wide range of practical application scenarios. Intelligent speech interaction suits scenarios such as intelligent question answering, intelligent quality inspection, real-time court-trial transcription, real-time speech captioning, and interview recording and transcription, and has been applied in many fields, including finance, insurance, the judiciary, and e-commerce.
In most current intelligent voice systems, the intelligent voice interaction device works while connected to the Internet, and often relies on the cloud: for example, during voice recognition, data such as the language model may be hosted in the cloud so that recognition is performed there. In an offline environment, however, the limited capabilities of the smart device itself lead to problems such as the following:
1. The data required by common usage scenarios (such as navigation or music playback), including acoustic models and language models, is large, and the data available in an offline environment cannot meet the needs of voice recognition. The offline recognition rate therefore cannot be guaranteed, leaving the intelligent voice device essentially unusable offline.
2. Even if data sufficient for voice recognition is placed in a local database, a large volume of data, including acoustic models and language models, must be stored locally. During offline voice recognition, a large amount of this data must be loaded, and the more data is loaded, the slower the loading becomes. Moreover, acoustic models and language models are updated continuously, so the local database must be updated as well, and these updates consume device resources such as storage and network bandwidth.
Therefore, there is a need in the art for a way to accurately recognize speech in an offline environment.
Disclosure of Invention
An object of the embodiments of the present application is to provide a voice recognition method and device that improve the efficiency of voice data recognition while ensuring high recognition accuracy for voice data under offline conditions.
The voice recognition method and device provided by the embodiments of the present application are specifically implemented as follows:
A method of speech recognition, the method comprising:
acquiring voice data in an offline environment;
performing semantic parsing on the voice data and identifying a real-time keyword in the voice data;
acquiring, from a local data source, at least one candidate word matching the real-time keyword;
generating a search model based on the at least one candidate word;
and determining a recognition result of the voice data using the search model.
A speech recognition device, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement:
acquiring voice data in an offline environment;
performing semantic parsing on the voice data and identifying a real-time keyword in the voice data;
acquiring, from a local data source, at least one candidate word matching the real-time keyword;
generating a search model based on the at least one candidate word;
and determining a recognition result of the voice data using the search model.
A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the speech recognition method.
An in-vehicle system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the speech recognition method.
A conference system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the speech recognition method.
The voice recognition method and device provided by the embodiments of the present application can process voice data in an offline environment. During processing, a keyword in the voice data is identified, and several candidate words whose pronunciation is similar to the keyword's are obtained from a local data source based on that keyword. The voice data is then decoded again, one or more times, using the candidate words, and a recognition result is finally generated. In this technical solution, the key information in the voice data narrows the search range, and a small set of candidates is retrieved from the local data source. Because the candidate range is small and the target is clear, decoding the voice data a second or further time against these candidates improves recognition efficiency while ensuring high recognition accuracy under offline conditions. In addition, the solution avoids deploying a large-scale language model on the client, as the prior art does, which greatly reduces the storage burden and performance requirements on the smart device.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the prior-art description are briefly introduced below. The drawings described below cover only some embodiments of the present application; a person of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of the speech recognition method provided in the present application;
fig. 2 is a schematic diagram of an application scenario of the speech recognition method provided in the present application;
fig. 3 is a schematic diagram of an application scenario of the speech recognition method provided in the present application;
fig. 4 is a method flow diagram of one embodiment of the speech recognition method provided in the present application;
fig. 5 is a schematic block diagram of an embodiment of the speech recognition device provided in the present application.
Detailed Description
To help those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without inventive effort shall fall within the scope of protection of the present application.
To help those skilled in the art understand the technical solutions provided by the embodiments of the present application, the technical environment in which they are implemented is described first.
As noted above, intelligent voice interaction devices in the prior art generally operate while connected to the Internet. When such a device is offline, the voice recognition rate often cannot be guaranteed, leaving the device essentially unusable. Some intelligent voice interaction devices are provisioned with the databases needed for voice recognition, such as acoustic models, language models, maps, and music. To implement offline recognition, several search spaces must be set up in the device's local memory for the different application scenarios (such as navigation and music playback), including finite-state-graph language models for command-word recognition and N-gram language models for personal points of interest (POI) or other large-vocabulary recognition. However, the storage and performance limits of the smart device cannot meet these requirements: the acoustic model and language model on existing smart devices are weak, and the language model in particular, constrained by limited computing performance, cannot recognize the voice data accurately. For example, when a user utters the query "Zirui Fushi du ma?" ("Is Zirui Fushi congested?", asking about a road), the intelligent voice interaction device often follows the pronunciation sequence and produces the result "Sirui Fushi du ma?". Such an incorrect result arises precisely because the local acoustic model and/or language model of the device is too weak to perform the large-scale, precise computation needed to obtain the accurate result.
Based on technical needs similar to those described above, the voice recognition method provided by the present application uses the limited local resources of the smart device to perform multiple decoding passes on the voice data and generate an accurate recognition result.
A specific implementation of the method of this embodiment is described below through a concrete application scenario.
As shown in the scenario of fig. 1, the car of user Xiao Ming has an intelligent speech recognition function. When Xiao Ming drives through places with weak communication signals, such as open country, expressways, forests, or tunnels, the intelligent voice recognition client in the vehicle can recognize the user's voice in an offline voice recognition mode. For example, while driving through a tunnel, Xiao Ming utters the voice data "Please navigate to the Ganjiang Road exit." When the client detects that the current network signal is weak, it decides to recognize the voice data in offline mode. The client may first perform a first speech recognition pass on the voice data. It can then identify the static keyword "navigate to" in the voice signal according to a locally stored static keyword set, and match the grammar rule "navigate to <place name>" associated with that static keyword. According to this grammar rule, the real-time keyword in the voice signal is determined to be a road name pronounced "gan jiang lu", and the real-time keyword is a place name. Having determined that the real-time keyword is a place name, the client searches the local offline map for several candidate words with similar pronunciation, such as "Ganjiang Road" and "Ganqiang Road". After the candidate words are obtained, a search model, such as an FSG network, may be generated based on them. Finally, based on the search model, the voice data can be recognized again and the recognition result determined.
Of course, the technical solution provided by the present application can also be applied to other scenarios, such as the public intelligent ticket machine shown in fig. 2: even if an emergency leaves it with no network signal, or the signal is poor, the machine can still recognize users' voice data and work normally, helping maintain public order. As another example, the solution can be applied to the conference scenario shown in fig. 3: likewise, an intelligent conference assistant can accurately recognize users' speech and keep normal meeting records even without a network signal. The solution can also be applied to other intelligent clients, where a client may be any electronic device with a recording function, such as a desktop computer, tablet computer, notebook computer, smartphone, digital assistant, smart wearable device, shopping guide terminal, television, smart speaker, or microphone. Smart wearable devices include, but are not limited to, smart bands, smart watches, smart glasses, smart helmets, and smart necklaces. Alternatively, the client may be software that runs on an electronic device: for example, where the electronic device provides a recording function, the software can record an audio file by calling that function.
The speech recognition method of the present application is described in detail below with reference to the drawings. Fig. 4 is a method flow diagram of one embodiment of the speech recognition method provided herein. Although the present application provides the method operation steps shown in the following embodiments or figures, the method may include more or fewer steps on a conventional basis or without inventive effort. Where steps have no logically necessary causal relationship, their execution order is not limited to the order provided in the embodiments. When the method is executed in an actual speech recognition process or apparatus, the steps may be executed sequentially or in parallel (for example, by a parallel processor or in a multithreaded environment) according to the method shown in the embodiments or figures.
S401: acquiring voice data in an offline environment.
S403: performing semantic parsing on the voice data and identifying a real-time keyword in the voice data.
S405: acquiring, from a local data source, at least one candidate word whose pronunciation is similar to the real-time keyword's.
S407: generating a search model based on the at least one candidate word.
S409: determining a recognition result of the voice data using the search model.
In this embodiment, voice data is first acquired in an offline environment. The offline environment may be a weak-network or no-network environment, which can be determined by detecting the network bandwidth. Specifically, when the client detects voice data, it may measure the bandwidth of the current network; when the measured bandwidth is below a preset bandwidth threshold, the current environment may be determined to be offline. For example, when a user driving in the countryside issues the voice command "navigate to XX" to the intelligent voice interaction device, the device measures the current network bandwidth. If the bandwidth is, say, 30 KB/s and the threshold is set to 200 KB/s, the current environment is determined to be offline. The bandwidth threshold may be set according to the requirements of voice recognition: if analysis shows that voice recognition needs a bandwidth of at least 250 KB/s, the threshold may be set to 250 KB/s.
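As a concrete illustration of this check, the sketch below measures throughput with a small download probe and compares it against a threshold. This is a minimal sketch only: the probe URL, probe size, timeout, and 200 KB/s default are illustrative assumptions rather than values prescribed by this application.

```python
import time
import urllib.request

BANDWIDTH_THRESHOLD_KBPS = 200  # hypothetical threshold chosen from recognition requirements


def measure_bandwidth_kbps(probe_url: str, probe_bytes: int = 64 * 1024) -> float:
    """Download a small probe and return the observed throughput in KB/s."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(probe_url, timeout=2) as resp:
            data = resp.read(probe_bytes)
    except OSError:
        return 0.0  # no connectivity at all is treated as offline
    elapsed = max(time.monotonic() - start, 1e-6)
    return len(data) / 1024 / elapsed


def is_offline(probe_url: str) -> bool:
    """True when the measured bandwidth falls below the preset threshold."""
    return measure_bandwidth_kbps(probe_url) < BANDWIDTH_THRESHOLD_KBPS
```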
In this embodiment, when the network environment is determined to be offline, the offline speech recognition described in the following embodiments may be performed. During offline recognition, semantic parsing is performed on the voice data and the real-time keyword in it is identified. Here, the real-time keyword is a word in the voice data that expresses the user's intended target, and may be the name of a specific thing, such as a person's name (a contact name in an address book, an artist's name), a song title, or a place (a road name, a mall name, a scenic-spot name). For example, in the voice data "navigate to Zhongguancun", "Zhongguancun" is the real-time keyword, and in the voice data "play My Sky", "My Sky" is the real-time keyword. In the present application, semantic parsing of the voice data is required to obtain its real-time keyword.
In one embodiment of the present application, to identify the real-time keyword, a static keyword in the voice data may first be obtained. The static keyword expresses the user's intent and corresponds to a specific rule action, such as "navigate" or "play" in the examples above, or, in smart-vehicle voice commands, patterns such as "is XX blocked", "call XX", "open XX", and "close XX". Accordingly, a static keyword set may be configured on the client, containing not only the static keywords of each application scenario but also the grammar rule corresponding to each static keyword. For example, the static keyword "blocked" may be given the grammar rule "is <road> blocked?", and the static keyword "call" may be given the grammar rule "call <name>". Because grammar rules are simple to configure and occupy little storage, they are well suited to an offline environment. In use, the voice data is matched against the grammar rules in the static keyword set; if a rule matches, the static keyword in the voice data is determined. Once the static keyword is known, the real-time keyword can be determined from it. For example, if certain voice data matches the grammar rule "is <road> blocked?", the word adjacent to the static keyword is determined to be the real-time keyword <road> of the voice data; for the grammar rule "call <name>", if a voice signal matches, the word after the static keyword "call" is determined to be the real-time keyword <name>.
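The sketch below shows one way such grammar rules could be represented and matched against a first-pass transcript. The rule table, its regular-expression encoding, and the domain tags are illustrative assumptions; the application does not prescribe a rule format.

```python
import re

# Each static keyword carries a grammar rule. The named group captures the
# real-time keyword, and the domain tag narrows the later candidate search.
GRAMMAR_RULES = [
    {"static": "navigate to", "pattern": re.compile(r"navigate to (?P<slot>.+)"), "domain": "geo"},
    {"static": "is ... blocked", "pattern": re.compile(r"is (?P<slot>.+) blocked"), "domain": "geo"},
    {"static": "call", "pattern": re.compile(r"call (?P<slot>.+)"), "domain": "contacts"},
    {"static": "play", "pattern": re.compile(r"play (?P<slot>.+)"), "domain": "music"},
]


def extract_realtime_keyword(first_pass_text: str):
    """Match the first-pass transcript against the grammar rules and return
    (static keyword, real-time keyword, domain), or None if nothing matches."""
    for rule in GRAMMAR_RULES:
        m = rule["pattern"].search(first_pass_text.lower())
        if m:
            return rule["static"], m.group("slot").strip(" ?"), rule["domain"]
    return None

# extract_realtime_keyword("Navigate to Ganjiang Road exit")
# -> ("navigate to", "ganjiang road exit", "geo")
```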
In one embodiment of the present application, the domain information of the real-time keyword may also be determined from the grammar rule. For example, if the voice data matches the grammar rule "is <road> blocked?", the real-time keyword can be determined to be a road name, i.e., it belongs to geographic information. If the voice data matches the grammar rule "play <music>", the real-time keyword can be determined to be a music or radio-station name, i.e., entertainment information. The domain information of the real-time keyword can therefore be set when the grammar rule is configured; knowing the domain narrows the data search range when candidate words for the real-time keyword are later retrieved.
In this embodiment, after the real-time keyword is determined, at least one candidate word matching it may be obtained from a local data source. The local data source may reside in the smart device's own memory or in memory attached externally to the device; no limitation is intended here. The local data source may store data for multiple application scenarios, such as offline maps, music, and radio-station names. In one embodiment, the at least one candidate word matching the real-time keyword includes at least one candidate word whose pronunciation is similar to the real-time keyword's. For example, for the voice data "Is Zirui Fushi congested?", several candidate words with pronunciation similar to "Zirui Fushi", such as "Sirui Fushi", "Mirui Fushi", "Lirui Fushi", "Sican Fushi", and "Siyu Fushi", may be obtained from the local data source. In one embodiment, based on the domain information of the real-time keyword, a sub-data source matching that domain may be selected from the local data source, and the candidate words retrieved from that sub-data source. For example, when the real-time keyword is determined to be a road name, the offline map can be selected from the client's local data source and road names with similar pronunciation retrieved from it; when the real-time keyword is determined to be a music name, the music library can be selected and song titles with similar pronunciation retrieved from it. This greatly shortens the candidate search time and improves the efficiency of offline speech recognition.
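A minimal sketch of the domain-scoped lookup follows. The layout of LOCAL_DATA_SOURCES (one sub-data source per domain, each mapping a word to its pinyin syllables) is an assumption for illustration, and the similarity callback is sketched in a later example.

```python
# Hypothetical sub-data sources keyed by domain; each maps a word to its
# pinyin syllable sequence so pronunciation comparison needs no extra lookup.
LOCAL_DATA_SOURCES = {
    "geo": {  # offline map: road names
        "Ganjiang Road": ["gan", "jiang", "lu"],
        "Ganqiang Road": ["gan", "qiang", "lu"],
        "Zhongguancun": ["zhong", "guan", "cun"],
    },
    "music": {  # local music library: song titles
        "My Sky": ["wo", "de", "tian", "kong"],
    },
}


def find_candidates(keyword_syllables, domain, similarity, threshold=0.6):
    """Select the sub-data source for the keyword's domain and keep the words
    whose pronunciation is close enough to the real-time keyword's."""
    sub_source = LOCAL_DATA_SOURCES.get(domain, {})
    return [word for word, syllables in sub_source.items()
            if similarity(keyword_syllables, syllables) > threshold]
```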
In this embodiment, after the voice data is acquired, a preliminary recognition pass, i.e., a first acoustic decoding, may be performed on it using the local acoustic model and language model. During the first acoustic decoding, the voice data is preprocessed and features are extracted: preprocessing removes non-speech data at the beginning and end, and a feature extraction scheme such as MFCC converts the voice data from waveform form to matrix form. An acoustic model is then used to extract the pronunciation sequence of the voice data; specifically, the pronunciation sequence may be obtained with the help of a pronunciation dictionary. For example, for the voice data "Is Zirui Fushi congested?", the extracted pronunciation sequence is "zi rui fu shi du ma". Based on this pronunciation sequence, the first decoding may produce the recognition result "Is Sirui Fushi congested?". Clearly this result is inaccurate, and such a result is likely to cause an incorrect location to be recognized, or even navigation to the wrong place.
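As a sketch of this front end, the snippet below trims non-speech from the edges of a recording and converts the waveform into an MFCC feature matrix using the librosa library. The choice of librosa, the 16 kHz sample rate, and the 13 coefficients are assumptions for illustration; the application does not mandate any particular toolkit or parameterization.

```python
import librosa


def extract_features(wav_path: str, n_mfcc: int = 13):
    """Load audio, drop leading/trailing non-speech, and return MFCC features."""
    y, sr = librosa.load(wav_path, sr=16000)   # mono waveform at 16 kHz
    y, _ = librosa.effects.trim(y, top_db=30)  # crude removal of silent edges
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc                                # shape: (n_mfcc, n_frames)
```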
In this embodiment, the pronunciation sequence of the voice data has already been obtained in the first recognition pass. Therefore, when retrieving candidate words, the pronunciation sequence of the real-time keyword may be extracted from that of the voice data: for example, in the voice data "Is Zirui Fushi congested?", the pronunciation sequence of the real-time keyword "Zirui Fushi" is "zi rui fu shi". To obtain candidate words with similar pronunciation, the similarity between the pronunciation sequence of each word in the local data source and that of the keyword may be computed; when the similarity exceeds a preset threshold, the corresponding word is taken as a candidate word for the real-time keyword.
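One plausible realization of this similarity test, assumed here for illustration, is a normalized Levenshtein distance over pinyin syllables; the preset threshold mentioned above would be applied to its output.

```python
def syllable_similarity(a: list[str], b: list[str]) -> float:
    """1.0 for identical syllable sequences, approaching 0.0 as they diverge
    (normalized Levenshtein distance over syllables)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return 1.0 - dp[m][n] / max(m, n, 1)

# syllable_similarity(["zi", "rui", "fu", "shi"], ["si", "rui", "fu", "shi"])
# -> 0.75 (one substituted syllable out of four)
```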
After at least one candidate word for the real-time keyword is obtained, the candidate words may be substituted into the voice data and acoustic decoding performed again. As noted above, the recognition result of the first acoustic decoding is not accurate; a search model may therefore be generated based on the at least one candidate word, and the recognition result of the voice data searched for in that model. In a specific embodiment, the search model may be a finite state graph (FSG), in which, during construction, each of the at least one candidate words corresponds to its own path. Acoustic decoding then determines the path in the finite state graph that is closest to the voice data, and this closest path is taken as the final recognition result.
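The sketch below builds such a graph in miniature: each candidate word becomes its own path between a shared prefix (the static-keyword portion of the utterance) and a shared suffix. The list-of-paths representation is an assumption for illustration; any FSG encoding with one path per candidate would serve.

```python
def build_fsg(prefix_syllables, candidates, suffix_syllables):
    """candidates: {word: pinyin syllables}. Each path is the shared prefix,
    one candidate word, then the shared suffix."""
    return [{"word": word,
             "syllables": prefix_syllables + syllables + suffix_syllables}
            for word, syllables in candidates.items()]

# "dao hang dao ... chu kou" ~ "navigate to ... exit" in the tunnel example
fsg = build_fsg(["dao", "hang", "dao"],
                {"Ganjiang Road": ["gan", "jiang", "lu"],
                 "Ganqiang Road": ["gan", "qiang", "lu"]},
                ["chu", "kou"])
```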
In one embodiment of the present application, the finite state graph may be decoded with a Viterbi search that finds the globally optimal path. In a specific embodiment, the voice data along each path is divided into frames to produce a number of speech frames, each of which may contain a preset number of phonemes; phonemes are the minimal units of sound, and in Chinese include initials and finals. The probability (for example, the maximum posterior probability) of the audio frames on each path is computed step by step and accumulated, and the path with the largest accumulated probability is finally taken as the recognition result. The probability characterizes how close the pronunciation of the words on a path is to the voice data: the larger the probability, the closer the match. The probabilities here may include observation probabilities, transition probabilities, language probabilities, and so on.
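The sketch below shows the scoring idea in its simplest form: accumulate per-frame log probabilities along each path of the graph built in the previous sketch and keep the best-scoring path. The acoustic_logprob placeholder stands in for the real observation, transition, and language probabilities, and the one-to-one syllable alignment is a simplifying assumption; a full decoder would run Viterbi over HMM states.

```python
import math


def acoustic_logprob(frame, syllable):
    """Placeholder for log P(frame | syllable) from a real acoustic model."""
    return math.log(0.9) if frame == syllable else math.log(0.05)


def best_path(fsg_paths, observed_syllables):
    """Return the path whose pronunciation best matches the observation."""
    def score(path):
        pairs = zip(observed_syllables, path["syllables"])
        return sum(acoustic_logprob(frame, syl) for frame, syl in pairs)
    return max(fsg_paths, key=score)

# best_path(fsg, ["dao", "hang", "dao", "gan", "jiang", "lu", "chu", "kou"])
# -> the "Ganjiang Road" path
```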
The voice recognition method provided by the present application can process voice data in an offline environment: a keyword in the voice data is identified, several candidate words whose pronunciation is similar to the keyword's are obtained from a local data source, the voice data is decoded again using the candidate words, and a recognition result is finally generated. The key information in the voice data narrows the search range, and a small set of candidates is retrieved from the local data source. Because the candidate range is small and the target is clear, decoding the voice data a second or further time against these candidates improves recognition efficiency while ensuring high recognition accuracy under offline conditions. In addition, the method avoids deploying a large-scale language model on the client, as the prior art does, which greatly reduces the storage burden and performance requirements on the smart device.
As shown in fig. 5, another aspect of the present application further provides a voice recognition device comprising a processor and a memory for storing processor-executable instructions. When executing the instructions, the processor may implement:
acquiring voice data in an offline environment;
performing semantic parsing on the voice data and identifying a real-time keyword in the voice data;
acquiring, from a local data source, at least one candidate word matching the real-time keyword;
generating a search model based on the at least one candidate word;
and determining a recognition result of the voice data using the search model.
Optionally, in an embodiment of the present application, when performing semantic parsing on the voice data and identifying the real-time keyword in the voice data, the processor may:
match a static keyword in the voice data against a preset static keyword set;
and determine the real-time keyword in the voice data according to the static keyword.
Optionally, in an embodiment of the present application, when acquiring at least one candidate word matching the real-time keyword from a local data source, the processor may:
determine the domain information to which the real-time keyword belongs;
obtain, from the local data source, a sub-data source matching the domain according to the domain information;
and acquire, from the sub-data source, at least one candidate word matching the real-time keyword.
Optionally, in an embodiment of the present application, the at least one candidate word matching the real-time keyword includes at least one candidate word whose pronunciation is similar to that of the real-time keyword.
Optionally, in an embodiment of the present application, when acquiring at least one candidate word matching the real-time keyword from a local data source, the processor may:
identify the pronunciation sequence of the real-time keyword;
separately calculate the similarity between the pronunciation sequence of each word in the local data source and the pronunciation sequence of the real-time keyword;
and take words whose similarity is greater than a preset threshold as candidate words for the real-time keyword.
Optionally, in an embodiment of the present application, when generating the search model based on the at least one candidate word, the processor may:
construct a finite state graph using the at least one candidate word, where each of the at least one candidate word corresponds to a respective path in the finite state graph.
Optionally, in an embodiment of the present application, when determining the recognition result of the voice data using the search model, the processor may:
acoustically decode the paths in the finite state graph;
and take the path in the finite state graph whose pronunciation is closest to the voice data as the recognition result of the voice data.
Optionally, in an embodiment of the present application, when acoustically decoding the paths in the finite state graph, the processor may:
acquire a first decoding result of the voice data;
and acoustically decode the finite state graph using a Viterbi algorithm based on the first decoding result.
Optionally, in an embodiment of the present application, when acquiring voice data in an offline environment, the processor may:
acquire voice data;
detect the bandwidth of the network;
and when the bandwidth is less than a preset bandwidth threshold, determine that the voice data is voice data in an offline environment.
Optionally, in an embodiment of the present application, the preset bandwidth threshold may be set according to the requirements of speech recognition.
Another aspect of the present application further provides a computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any of the embodiments above.
A computer-readable storage medium may include a physical device for storing information; typically the information is digitized and then stored in the medium using an electrical, magnetic, or optical method. The computer-readable storage medium of this embodiment may include: devices that store information using electrical energy, such as various memories (RAM, ROM, etc.); devices that store information using magnetic energy, such as hard disks, floppy disks, magnetic tape, magnetic-core memory, bubble memory, and USB flash drives; and devices that store information optically, such as CDs or DVDs. Of course, other kinds of readable storage media exist, such as quantum memory and graphene memory.
In another aspect, the present application further provides an in-vehicle system comprising a processor and a memory for storing processor-executable instructions; when executing the instructions, the processor implements the steps of the method of any of the embodiments above.
In another aspect, the present application further provides a conference system comprising a processor and a memory for storing processor-executable instructions; when executing the instructions, the processor implements the steps of the method of any of the embodiments above.
In this embodiment, the client may be an electronic device with a recording function. Depending on their data processing capabilities, clients may be divided into the categories in Table 1 below.
TABLE 1

Category | Typical processor / memory / storage | Typical capability
Primary network device | no substantial processing hardware | records audio and uploads it; essentially no local data processing
Simple network device | processor for simple logic operations | preliminary preprocessing, such as generating a feature matrix
Intermediate network device | main frequency usually below 2.0 GHz; memory usually below 2 GB; storage usually below 128 GB | feature matrix generation, endpoint detection, noise reduction, speech recognition
Intelligent network device | main frequency usually above 2.0 GHz; memory usually below 12 GB; storage usually below 1 TB | the above, plus speech feature vectors and limited user-identity matching
High-performance device | main frequency usually above 3.0 GHz; memory usually above 12 GB; storage can exceed 1 TB | large-scale data processing; matching against many stored user feature vectors
In this embodiment, the hardware of a primary network device is relatively simple: it can record through its microphone to generate audio information and send the generated audio to a server through a network communication module. A primary network device may include a microphone, a network communication unit, a sensor, and a speaker, and essentially does not need to process the data itself. It may also carry other sensors for collecting its operating parameters. For example, a primary network device may be an Internet-of-Things device or an edge node device.
In this embodiment, a simple network device mainly includes a microphone, a network communication unit, a processor, a memory, a speaker, and so on. Compared with a primary network device, its data processing ability is enhanced: its processor can handle simple logic operations, so after collecting data the device can perform preliminary preprocessing, such as generating a feature matrix from the audio information. A simple network device may have a display module with simple display functions for feeding information back to the user. For example, a simple network device may be a smart wearable device or a POS (point of sale) machine, such as a smart band, an entry-level smart watch, smart glasses, a settlement device at an offline shopping site (for example, a POS machine), or a mobile settlement device (for example, a handheld POS machine or a settlement module attached to a handheld device).
In this embodiment, an intermediate network device mainly includes a microphone, a network communication unit, a processor, a memory, a display, a speaker, and so on. The main frequency of its processor is usually below 2.0 GHz, its memory capacity is usually below 2 GB, and its storage capacity is usually below 128 GB. It can process recorded audio to a certain degree, for example generating a feature matrix and performing endpoint detection, noise reduction, and speech recognition on it. For example, intermediate network devices include smart home appliances, smart home terminals, smart speakers, higher-end smart watches, entry-level smartphones (priced around 1,000 yuan), and vehicle-mounted smart terminals.
In this embodiment, an intelligent network device mainly includes hardware such as a microphone, a network communication unit, a processor, a memory, a display, and a speaker, and can have relatively strong data processing capability. The main frequency of its processor is usually above 2.0 GHz, its memory capacity is usually below 12 GB, and its storage capacity is usually below 1 TB. After generating a feature matrix for the audio information, it can perform endpoint detection, noise reduction, speech recognition, and so on. It can further generate a speech feature vector from the audio information; in some cases the speech feature vector can be matched against user feature vectors to identify the user, though the matching is limited to a small number of user feature vectors, such as those of the members of one household. For example, intelligent network devices include better-performing smartphones, tablet computers, desktop computers, and notebook computers.
In this embodiment, a high-performance device mainly includes hardware such as a microphone, a network communication unit, a processor, a memory, a display, and a speaker. It can have large-scale data processing capability and also provides powerful data storage. The main frequency of its processor is usually above 3.0 GHz, its memory capacity is usually above 12 GB, and its storage capacity can exceed 1 TB. It can generate a feature matrix for the audio information, perform endpoint detection, noise reduction, and speech recognition, generate speech feature vectors, and match them against a large number of stored user feature vectors. For example, a high-performance device may be a workstation, a highly configured desktop computer, a kiosk, or the like.
Of course, the above lists only a few clients by way of example. As science and technology progress, hardware performance will improve, and an electronic device that currently has weak data processing ability may later acquire strong processing ability. The hardware categories of Table 1 referenced by the embodiments above are therefore only examples and impose no limitation.
It should be noted that the speech recognition method described above can be implemented on all five types of hardware shown in Table 1.
Although the present application provides the method operation steps described in the embodiments or flowcharts, more or fewer steps may be included on a conventional or non-inventive basis. The order of steps recited in the embodiments is only one of many possible execution orders and does not represent the only one. When executed by an actual device or client product, the steps may be executed sequentially or in parallel (for example, in a parallel-processor or multithreaded environment) according to the methods shown in the embodiments or figures.
Those skilled in the art also know that, besides implementing a controller in pure computer-readable program code, the method steps can be logically programmed so that the controller realizes the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included in it for realizing various functions can be regarded as structures within the hardware component, or even as both software modules implementing the method and structures within the hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product, which may be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions that cause a computer device (a personal computer, mobile terminal, server, network device, or the like) to execute the methods described in the embodiments of the present application, or parts of them.
The embodiments in this specification are described in a progressive manner; identical or similar parts can be referenced across embodiments, and each embodiment focuses on its differences from the others. The present application is operational with numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable electronic devices, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above systems or devices.
Although the present application has been described through embodiments, those of ordinary skill in the art will recognize that many variations and modifications are possible without departing from its spirit, and the appended claims are intended to encompass such variations and modifications.

Claims (21)

1. A method of speech recognition, the method comprising:
acquiring voice data in an offline environment;
performing semantic parsing on the voice data and identifying a real-time keyword in the voice data;
acquiring, from a local data source, at least one candidate word matching the real-time keyword;
generating a search model based on the at least one candidate word;
and determining a recognition result of the voice data using the search model;
wherein the performing of semantic parsing on the voice data and the identifying of the real-time keyword in the voice data comprise:
acquiring a preset static keyword set, the static keyword set comprising a static keyword for each of a plurality of application scenarios and a grammar rule corresponding to the static keyword of each application scenario;
matching the voice data against the grammar rules corresponding to the static keywords of each application scenario to obtain the static keyword corresponding to the voice data;
and determining the real-time keyword in the voice data according to the static keyword.
2. The method of claim 1, wherein the acquiring, from a local data source, of at least one candidate word matching the real-time keyword comprises:
determining domain information to which the real-time keyword belongs;
obtaining, from the local data source, a sub-data source matching the domain according to the domain information;
and acquiring, from the sub-data source, at least one candidate word matching the real-time keyword.
3. The method of claim 1 or 2, wherein the at least one candidate word matching the real-time keyword comprises at least one candidate word whose pronunciation is similar to that of the real-time keyword.
4. The method of claim 3, wherein the acquiring, from a local data source, of at least one candidate word matching the real-time keyword comprises:
identifying the pronunciation sequence of the real-time keyword;
separately calculating the similarity between the pronunciation sequence of each word in the local data source and the pronunciation sequence of the real-time keyword;
and taking words whose similarity is greater than a preset threshold as candidate words for the real-time keyword.
5. The method of claim 1, wherein the generating of a search model based on the at least one candidate word comprises:
constructing a finite state graph using the at least one candidate word, wherein each of the at least one candidate word corresponds to a respective path in the finite state graph.
6. The method of claim 5, wherein the determining of the recognition result of the voice data using the search model comprises:
acoustically decoding the paths in the finite state graph;
and taking the path in the finite state graph whose pronunciation is closest to the voice data as the recognition result of the voice data.
7. The method of claim 6, wherein the acoustically decoding of the paths in the finite state graph comprises:
acquiring a first decoding result of the voice data;
and acoustically decoding the finite state graph using a Viterbi algorithm based on the first decoding result.
8. The method of claim 1, wherein the acquiring of voice data in an offline environment comprises:
acquiring voice data;
detecting the bandwidth of the network;
and when the bandwidth is less than a preset bandwidth threshold, determining that the voice data is voice data in an offline environment.
9. The method of claim 8, wherein the preset bandwidth threshold is set according to the requirements of speech recognition.
10. A speech recognition apparatus, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement:
acquiring voice data in an offline environment;
performing semantic parsing on the voice data and identifying a real-time keyword in the voice data;
acquiring, from a local data source, at least one candidate word matching the real-time keyword;
generating a search model based on the at least one candidate word;
and determining a recognition result of the voice data using the search model;
wherein the performing of semantic parsing on the voice data and the identifying of the real-time keyword in the voice data comprise:
acquiring a preset static keyword set, the static keyword set comprising a static keyword for each of a plurality of application scenarios and a grammar rule corresponding to the static keyword of each application scenario;
matching the voice data against the grammar rules corresponding to the static keywords of each application scenario to obtain the static keyword corresponding to the voice data;
and determining the real-time keyword in the voice data according to the static keyword.
11. The apparatus of claim 10, wherein the acquiring, from a local data source, of at least one candidate word matching the real-time keyword comprises:
determining domain information to which the real-time keyword belongs;
obtaining, from the local data source, a sub-data source matching the domain according to the domain information;
and acquiring, from the sub-data source, at least one candidate word matching the real-time keyword.
12. The apparatus of claim 10 or 11, wherein the at least one candidate word matching the real-time keyword comprises at least one candidate word whose pronunciation is similar to that of the real-time keyword.
13. The apparatus of claim 12, wherein the acquiring, from a local data source, of at least one candidate word matching the real-time keyword comprises:
identifying the pronunciation sequence of the real-time keyword;
separately calculating the similarity between the pronunciation sequence of each word in the local data source and the pronunciation sequence of the real-time keyword;
and taking words whose similarity is greater than a preset threshold as candidate words for the real-time keyword.
14. The apparatus of claim 10, wherein the generating of a search model based on the at least one candidate word comprises:
constructing a finite state graph using the at least one candidate word, wherein each of the at least one candidate word corresponds to a respective path in the finite state graph.
15. The apparatus of claim 14, wherein the determining of the recognition result of the voice data using the search model comprises:
acoustically decoding the paths in the finite state graph;
and taking the path in the finite state graph whose pronunciation is closest to the voice data as the recognition result of the voice data.
16. The apparatus of claim 15, wherein the acoustically decoding of the paths in the finite state graph comprises:
acquiring a first decoding result of the voice data;
and acoustically decoding the finite state graph using a Viterbi algorithm based on the first decoding result.
17. The apparatus of claim 10, wherein the acquiring of voice data in an offline environment comprises:
acquiring voice data;
detecting the bandwidth of the network;
and when the bandwidth is less than a preset bandwidth threshold, determining that the voice data is voice data in an offline environment.
18. The apparatus of claim 17, wherein the preset bandwidth threshold is set according to the requirements of speech recognition.
19. A computer-readable storage medium having stored thereon computer instructions which, when executed, implement the steps of the method of any one of claims 1 to 9.
20. An in-vehicle system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 9.
21. A conference system, comprising a processor and a memory for storing processor-executable instructions which, when executed by the processor, implement the steps of the method of any one of claims 1 to 9.
CN201810063341.1A 2018-01-23 2018-01-23 Voice recognition method and device Active CN110070859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810063341.1A CN110070859B (en) 2018-01-23 2018-01-23 Voice recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810063341.1A CN110070859B (en) 2018-01-23 2018-01-23 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN110070859A CN110070859A (en) 2019-07-30
CN110070859B true CN110070859B (en) 2023-07-14

Family

ID=67365091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810063341.1A Active CN110070859B (en) 2018-01-23 2018-01-23 Voice recognition method and device

Country Status (1)

Country Link
CN (1) CN110070859B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473524B (en) * 2019-08-30 2022-03-15 思必驰科技股份有限公司 Method and device for constructing voice recognition system
CN110808032B (en) * 2019-09-20 2023-12-22 平安科技(深圳)有限公司 Voice recognition method, device, computer equipment and storage medium
CN111128183B (en) * 2019-12-19 2023-03-17 北京搜狗科技发展有限公司 Speech recognition method, apparatus and medium
CN111292721A (en) * 2020-02-20 2020-06-16 深圳壹账通智能科技有限公司 Code compiling method and device and computer equipment
CN111343660B (en) * 2020-02-26 2024-03-22 平安银行股份有限公司 Application program testing method and device
CN111488485B (en) * 2020-04-16 2023-11-17 北京雷石天地电子技术有限公司 Music recommendation method based on convolutional neural network, storage medium and electronic device
CN111797617A (en) * 2020-05-26 2020-10-20 北京捷通华声科技股份有限公司 Data processing method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102282608A (en) * 2008-12-09 2011-12-14 诺基亚公司 Adaptation of automatic speech recognition acoustic models
CN102682763A (en) * 2011-03-10 2012-09-19 北京三星通信技术研究有限公司 Method, device and terminal for correcting named entity vocabularies in voice input text
CN103021408A (en) * 2012-12-04 2013-04-03 中国科学院自动化研究所 Method and device for speech recognition, optimizing and decoding assisted by stable pronunciation section
CN103730115A (en) * 2013-12-27 2014-04-16 北京捷成世纪科技股份有限公司 Method and device for detecting keywords in voice
CN103903619A (en) * 2012-12-28 2014-07-02 安徽科大讯飞信息科技股份有限公司 Method and system for improving accuracy of speech recognition
CN105489222A (en) * 2015-12-11 2016-04-13 百度在线网络技术(北京)有限公司 Speech recognition method and device
CN105845133A (en) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 Voice signal processing method and apparatus
US20180108346A1 (en) * 2014-09-11 2018-04-19 Apple Inc. Method and apparatus for discovering trending terms in speech requests

Also Published As

Publication number Publication date
CN110070859A (en) 2019-07-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40010993

Country of ref document: HK

GR01 Patent grant