CN110060662B - Voice recognition method and device - Google Patents


Info

Publication number: CN110060662B
Application number: CN201910293280.2A
Authority: CN (China)
Prior art keywords: voice, recognition, mode, belongs, speech
Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN110060662A
Inventors: 马赛, 杜念冬
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority claimed from application CN201910293280.2A
Publication of application CN110060662A; application granted; publication of CN110060662B

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1815 Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26 Speech to text systems
    • G10L17/00 Speaker identification or verification
    • G10L17/22 Interactive procedures; Man-machine interfaces
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2015/223 Execution procedure of a spoken command


Abstract

The invention provides a voice recognition method and device. The method includes: obtaining the voice to be recognized and its parameter information, where the parameter information includes the current mode, the recognition serial number of the voice, internal and external noise information, and direction information; extracting a feature vector corresponding to the voice; determining, according to the current mode and the recognition serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode; if it is, obtaining a speech recognition result, an acoustic judgment result, and a semantic judgment result according to the parameter information and the feature vector, and determining whether the voice belongs to the music field according to the two judgment results; and, if the voice belongs to the music field, determining the instruction and/or resource corresponding to the voice according to the speech recognition result. Because internal and external noise information, direction information, and the like are collected, the accuracy of voice recognition is improved, voices in the music field can be recognized automatically and accurately in the single-wake multi-recognition mode, and high-quality music services can subsequently be provided to users.

Description

Voice recognition method and device
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech recognition method and apparatus.
Background
With the development of artificial intelligence technology, more and more conversational artificial intelligence products are emerging. Taking a smart sound box as an example of a conversational artificial intelligence product, voice wake-up and voice recognition are its core technologies, and they directly affect the interaction experience between the user and the sound box.
In practical application scenarios, the smart sound box faces a complex environment: various internal and external noises (such as the sound of an indoor television program or outdoor traffic), the speech of people other than the user, halting speaking styles, and so on can all interfere with voice recognition. At present, parameters such as internal and external noise information and direction information are not collected when voice is collected, so the voice recognition result is prone to deviation, its accuracy is poor, and the interaction effect suffers.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present invention is to provide a speech recognition method, which is used to solve the problems in the prior art that the speech recognition result is prone to have deviation, the accuracy is poor, and the interaction effect is affected.
A second object of the present invention is to provide a speech recognition apparatus.
A third object of the present invention is to propose another speech recognition apparatus.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
A fifth object of the invention is to propose a computer program product.
In order to achieve the above objects, an embodiment of a first aspect of the present invention provides a speech recognition method, including:
acquiring the voice to be recognized and its parameter information, where the parameter information includes the current mode, the recognition serial number of the voice, internal and external noise information, and direction information;
extracting a feature vector corresponding to the voice;
determining, according to the current mode and the recognition serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode;
if the voice is a non-first voice in the single-wake multi-recognition mode, obtaining a speech recognition result, an acoustic judgment result, and a semantic judgment result according to the parameter information and the feature vector, and determining whether the voice belongs to the music field according to the acoustic and semantic judgment results;
and, if the voice belongs to the music field, determining the instruction and/or resource corresponding to the voice according to the speech recognition result.
Further, determining whether the voice is a non-first voice in the single-wake multi-recognition mode according to the current mode and the recognition serial number includes:
judging whether the current mode is the single-wake multi-recognition mode;
and, if it is, determining whether the voice is a non-first voice according to the recognition serial number.
Further, obtaining the speech recognition result, the acoustic judgment result, and the semantic judgment result according to the parameter information and the feature vector, and determining whether the voice belongs to the music field according to the two judgment results, includes:
inputting the internal and external noise information, the direction information, and the feature vector into an acoustic recognition model to obtain the speech recognition result and the acoustic judgment result;
determining, according to the acoustic judgment result, whether the voice belongs to the music field acoustically;
if the voice belongs to the music field acoustically, inputting the speech recognition result, the internal and external noise information, and the direction information into a semantic recognition model to obtain the semantic judgment result;
determining, according to the semantic judgment result, whether the voice belongs to the music field semantically;
and, if the voice belongs to the music field semantically, determining that the voice belongs to the music field.
Further, the above procedure also includes:
if the voice does not belong to the music field either acoustically or semantically, determining that the voice does not belong to the music field.
Further, the method further includes:
if the current mode is the single-wake single-recognition mode or the geek mode, or the voice is the first voice in the single-wake multi-recognition mode, obtaining a speech recognition result according to the parameter information and the feature vector;
and determining the corresponding instruction and/or resource according to the speech recognition result.
Further, after determining the instruction and/or resource corresponding to the voice according to the speech recognition result, the method further includes:
executing the instruction, and/or providing the resource to the user of the smart sound box.
The voice recognition method of the embodiment of the present invention obtains the voice to be recognized and its parameter information (the current mode, the recognition serial number of the voice, internal and external noise information, and direction information); extracts the feature vector corresponding to the voice; determines, according to the current mode and the recognition serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode; if it is, obtains a speech recognition result, an acoustic judgment result, and a semantic judgment result according to the parameter information and the feature vector, and determines whether the voice belongs to the music field according to the two judgment results; and, if the voice belongs to the music field, determines the instruction and/or resource corresponding to the voice according to the speech recognition result. Because internal and external noise information, direction information, and the like are collected, the accuracy of voice recognition is improved, voices in the music field can be recognized automatically and accurately in the single-wake multi-recognition mode, and high-quality music services can subsequently be provided to users.
In order to achieve the above objects, an embodiment of a second aspect of the present invention provides a speech recognition apparatus, including:
an acquisition module, configured to acquire the voice to be recognized and its parameter information, where the parameter information includes the current mode, the recognition serial number of the voice, internal and external noise information, and direction information;
an extraction module, configured to extract the feature vector corresponding to the voice;
a determining module, configured to determine, according to the current mode and the recognition serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode;
the determining module is further configured to, when the voice is a non-first voice in the single-wake multi-recognition mode, obtain a speech recognition result, an acoustic judgment result, and a semantic judgment result according to the parameter information and the feature vector, and determine whether the voice belongs to the music field according to the two judgment results;
and the determining module is further configured to, when the voice belongs to the music field, determine the instruction and/or resource corresponding to the voice according to the speech recognition result.
Further, the determining module is specifically configured to:
judge whether the current mode is the single-wake multi-recognition mode;
and, if it is, determine whether the voice is a non-first voice according to the recognition serial number.
Further, the determining module is specifically configured to:
input the internal and external noise information, the direction information, and the feature vector into an acoustic recognition model to obtain the speech recognition result and the acoustic judgment result;
determine, according to the acoustic judgment result, whether the voice belongs to the music field acoustically;
if the voice belongs to the music field acoustically, input the speech recognition result, the internal and external noise information, and the direction information into a semantic recognition model to obtain the semantic judgment result;
determine, according to the semantic judgment result, whether the voice belongs to the music field semantically;
and, if the voice belongs to the music field semantically, determine that the voice belongs to the music field.
Further, the determining module is also configured to:
determine that the voice does not belong to the music field if it does not belong to the music field either acoustically or semantically.
Further, the acquisition module is further configured to obtain a speech recognition result according to the parameter information and the feature vector when the current mode is the single-wake single-recognition mode or the geek mode, or when the voice is the first voice in the single-wake multi-recognition mode;
and the determining module is further configured to determine the corresponding instruction and/or resource according to the speech recognition result.
Further, the apparatus also includes an execution module, configured to execute the instruction and/or provide the resource to the user of the smart sound box.
The voice recognition device of the embodiment of the present invention obtains the voice to be recognized and its parameter information (the current mode, the recognition serial number of the voice, internal and external noise information, and direction information); extracts the feature vector corresponding to the voice; determines, according to the current mode and the recognition serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode; if it is, obtains a speech recognition result, an acoustic judgment result, and a semantic judgment result according to the parameter information and the feature vector, and determines whether the voice belongs to the music field according to the two judgment results; and, if the voice belongs to the music field, determines the instruction and/or resource corresponding to the voice according to the speech recognition result. Because internal and external noise information, direction information, and the like are collected, the accuracy of voice recognition is improved, voices in the music field can be recognized automatically and accurately in the single-wake multi-recognition mode, and high-quality music services can subsequently be provided to users.
In order to achieve the above objects, an embodiment of a third aspect of the present invention provides another speech recognition apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the speech recognition method described above when executing the program.
In order to achieve the above objects, an embodiment of a fourth aspect of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the speech recognition method described above is implemented.
In order to achieve the above objects, an embodiment of a fifth aspect of the present invention provides a computer program product; when instructions in the computer program product are executed by a processor, the speech recognition method described above is implemented.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;
Fig. 2 is a schematic flow chart of another speech recognition method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of another speech recognition apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
A speech recognition method and apparatus according to an embodiment of the present invention will be described with reference to the drawings.
Fig. 1 is a flowchart illustrating a speech recognition method according to an embodiment of the present invention. As shown in fig. 1, the speech recognition method includes the steps of:
s101, voice to be recognized and parameter information are obtained.
The executing entity of the voice recognition method provided by the present invention is a voice recognition apparatus, which may be a hardware device such as a terminal device or a server, or software installed on a hardware device. The apparatus has a conversational artificial-intelligence function: besides controlling smart home devices through voice interaction with the user, it provides basic functions such as weather queries, information, audio playback, alarm setting, dressing advice, traffic, and stock market quotations.
In this embodiment, after the voice recognition apparatus is woken up, the user can talk with it. Generally, the first voice the apparatus receives is the user's voice; a received non-first voice, however, may not be.
Taking the voice recognition apparatus as a smart sound box as an example: in practical application scenarios, the smart sound box faces a complex environment, where various internal and external noises (such as the sound of an indoor television program or outdoor traffic), the speech of people other than the user, and halting speaking styles can all interfere with voice processing. Generally, after the user wakes up the smart sound box, the first voice it receives can be understood as the user's voice, reflecting the user's real need. For example, a user who wants to listen to a song says "I want to listen to the songs I often listen to" immediately after waking up the sound box, and the sound box plays a corresponding song to meet the request. If, after the song starts, the user turns on the television, the second voice received by the sound box may be the sound of the television program, or it may still be the user's voice but mixed with a great deal of noise such as the television sound; at this point, the voice to be recognized must be discriminated.
In this embodiment, the parameter information of the voice to be recognized includes the current mode, the recognition serial number of the voice, internal and external noise information, and direction information. By collecting the noise information, the direction information, and so on, the accuracy of voice recognition can be improved, and with it the interactive experience.
In this embodiment, the current mode is any one of, but not limited to, the following modes: a single-wake single-recognition mode, a single-wake multi-recognition mode, and a geek mode.
The single-wake single-recognition mode can be understood as follows: each time the user interacts with the voice recognition apparatus, the user must speak the voice containing the wake-up word to wake the apparatus before speaking the interactive voice. Taking the apparatus as a smart sound box, the voice containing the wake-up word is, for example, "Xiaodu, Xiaodu (wake-up word), activate the single-wake single-recognition mode". The first time, a user who wants to check the weather must first say "Xiaodu, Xiaodu, activate the single-wake single-recognition mode", and the sound box enters sleep mode after performing the weather-check task. The next time, to check stocks, the user must again say "Xiaodu (wake-up word), activate the single-wake single-recognition mode" to wake the sound box.
The single-wake multi-recognition mode can be understood as follows: in each interaction, after speaking the voice containing the wake-up word once to wake the apparatus, the user can speak to it multiple times. For the smart sound box, the voice containing the wake-up word is, for example, "Xiaodu (wake-up word), activate the single-wake multi-recognition mode". If the user wants to check the weather the first time, listen to songs the second time, and check stocks the third time, the apparatus remains awake across all three interactions.
The geek mode can be understood as follows: in each interaction, the user can hold many consecutive conversations with the apparatus in a short time without speaking a voice containing the wake-up word each time. For the smart sound box, the voice containing the wake-up word in the geek mode is, for example, "Xiaodu (wake-up word), activate the geek mode". Again, if the user checks the weather, then listens to songs, then checks stocks, the apparatus remains awake across the three interactions. The difference between the geek mode and the single-wake multi-recognition mode is that the single-wake multi-recognition mode can collect several voices continuously, whereas the geek mode does not collect another voice until it has finished processing the current one.
In this embodiment, the recognition serial number indicates which voice, in order, the received voice to be recognized is after the apparatus wakes up. For example, a recognition serial number of 1 means the voice to be recognized is the first voice; 2, the second voice; 3, the third voice. The internal and external noise information can be understood as the ambient noise of the environment in which the apparatus is located. Taking the smart sound box as an example, when it is placed in a living room, bedroom, kitchen, and so on, it can first measure the ambient noise there; when a subsequent voice to be recognized is processed, the voice can be denoised according to this noise information, improving recognition accuracy.
In this embodiment, the direction information can be understood as sound-source position information: the apparatus can determine the sound-source position of the voice to be recognized through sound-source localization, improving recognition accuracy.
It should be noted that the parameters of the voice to be recognized are not limited to the current mode, the recognition serial number, the internal and external noise information, and the direction information; for example, they may also include a recognition identifier, which characterizes the number of times the current mode has been activated. Taking the single-wake multi-recognition mode as the current mode, a recognition identifier of 1 means the mode has been activated once; 2, twice; 3, three times.
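As a concrete, purely illustrative sketch, the parameter information described above could be grouped into one structure. All names below (the `Mode` enum, `ParameterInfo` fields) are assumptions of this sketch, not terms from the patent:

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    SINGLE_WAKE_SINGLE_RECOGNITION = 1
    SINGLE_WAKE_MULTI_RECOGNITION = 2
    GEEK = 3  # continuous multi-turn dialogue without repeating the wake-up word

@dataclass
class ParameterInfo:
    mode: Mode
    serial_number: int       # 1 = first voice after wake-up, 2 = second, ...
    noise_info: float        # internal/external ambient-noise estimate (e.g. a dB level)
    direction: float         # sound-source direction, e.g. azimuth in degrees
    recognition_id: int = 1  # optional: how many times the current mode was activated

params = ParameterInfo(Mode.SINGLE_WAKE_MULTI_RECOGNITION, serial_number=2,
                       noise_info=45.0, direction=90.0)
```

The `recognition_id` field stands in for the optional recognition identifier; the numeric representations of noise and direction are placeholders for whatever the device actually measures.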
S102, extracting the feature vector corresponding to the voice.
In this embodiment, the feature vector of the voice to be recognized is the basis of speech recognition; it is obtained by performing feature extraction on the voice. The feature extraction method may be any one of, but is not limited to: linear prediction coefficients (LPC), perceptual linear prediction (PLP), linear prediction cepstral coefficients (LPCC), or mel-frequency cepstral coefficients (MFCC).
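To make the MFCC option concrete, here is a minimal MFCC-style extractor in plain NumPy: framing, Hamming windowing, power spectrum, a triangular mel filterbank, log, and a DCT. It is a simplification for illustration only; the frame length, hop, FFT size, and filter counts are typical but arbitrary choices, not values from the patent:

```python
import numpy as np

def mfcc_like(signal, sr=16000, frame_len=400, hop=160, n_mels=26, n_ceps=13):
    """Minimal MFCC-style features: one n_ceps-dim vector per frame."""
    # Split the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum of each frame (512-point FFT -> 257 bins).
    power = np.abs(np.fft.rfft(frames, n=512)) ** 2 / 512
    # Triangular mel filterbank between 0 Hz and sr/2.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((512 + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, 512 // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    log_mel = np.log(power @ fbank.T + 1e-10)
    # Keep the first n_ceps DCT-II coefficients as the per-frame feature vector.
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T

# One second of a 440 Hz tone yields 98 frames of 13 coefficients each.
feats = mfcc_like(np.sin(2 * np.pi * 440 * np.arange(16000) / 16000))
```

Real systems would add pre-emphasis, energy normalization, and delta features; any of the other listed methods (LPC, PLP, LPCC) could be substituted at this step.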
S103, determining, according to the current mode and the recognition serial number, whether the voice is a non-first voice in the single-wake multi-recognition mode.
In this embodiment, the voice recognition apparatus may execute step S103 by first judging whether the current mode is the single-wake multi-recognition mode and, if it is, determining whether the voice is a non-first voice according to the recognition serial number.
It should be noted that if the current mode is the single-wake single-recognition mode or the geek mode, or the voice is the first voice in the single-wake multi-recognition mode, a speech recognition result is obtained directly according to the parameter information and the feature vector, and the corresponding instruction and/or resource is determined according to that result.
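The decision in step S103, together with the fallback for the other modes just described, amounts to a small gate function. Sketched below with illustrative mode names (strings chosen for this sketch, not from the patent):

```python
def is_non_first_in_multi_mode(mode: str, serial_number: int) -> bool:
    """Return True only when the device is in single-wake multi-recognition
    mode AND the voice is not the first one after wake-up (step S103)."""
    if mode != "single_wake_multi_recognition":
        return False          # single-wake single-recognition or geek mode
    return serial_number > 1  # serial number 1 = first voice after wake-up

# First voice in multi-recognition mode: recognized directly, no domain check.
assert is_non_first_in_multi_mode("single_wake_multi_recognition", 1) is False
# Second voice: must pass the acoustic + semantic music-field discrimination.
assert is_non_first_in_multi_mode("single_wake_multi_recognition", 2) is True
```

Voices for which this gate returns False take the direct path: recognition result first, then instruction/resource lookup.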
In this embodiment, the instruction and/or resource corresponding to a voice is set according to the actual situation. For example, if the voice to be recognized is "turn on the air conditioner", the smart sound box executes an air-conditioner start instruction; if the voice is "turn on the light", it executes a light-on instruction. If the voice is "I want to listen to songs of a certain singer", the resource the sound box provides is that singer's songs.
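A minimal sketch of such an instruction/resource lookup, using the patent's own examples; the table entries, instruction names, and the `resolve` helper are all hypothetical:

```python
# Hypothetical mapping from recognized text to a (instruction, resource) pair.
COMMANDS = {
    "turn on the air conditioner": ("start_air_conditioner", None),
    "turn on the light": ("turn_on_light", None),
}

def resolve(recognized_text: str):
    """Return (instruction, resource) for a speech recognition result."""
    if recognized_text in COMMANDS:
        return COMMANDS[recognized_text]
    if recognized_text.startswith("i want to listen to"):
        # Resource request: pass the remainder to a (hypothetical) music search.
        return ("play_music", recognized_text[len("i want to listen to"):].strip())
    return (None, None)  # unrecognized: neither instruction nor resource

assert resolve("turn on the light") == ("turn_on_light", None)
assert resolve("i want to listen to songs of a singer") == ("play_music", "songs of a singer")
```

A production system would of course use intent classification rather than exact string matching; the sketch only shows the instruction-versus-resource split.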
S104, if the voice belongs to non-primary voice in a single-awakening multi-recognition mode, acquiring a voice recognition result, an acoustic judgment result and a semantic judgment result according to the parameter information and the feature vector, and determining whether the voice belongs to the music field according to the acoustic judgment result and the semantic judgment result.
In this embodiment, when the current mode of the speech recognition apparatus is the single-wake multi-recognition mode, the received non-primary voice is discriminated: only when the fields to which the non-primary voice belongs acoustically and semantically are consistent is the non-primary voice considered to be speech from the user, which indicates that the user wants to obtain a service in that field again.
In this embodiment, according to the parameter information and the feature vector of the speech to be recognized, the speech recognition result, the acoustic determination result, and the semantic determination result may be obtained, and then, according to the acoustic determination result and the semantic determination result, it is determined whether the speech to be recognized belongs to the music field.
The voice recognition result is obtained by performing voice recognition on the voice to be recognized.
The acoustic determination result is used to represent the field to which the acoustic features of the speech to be recognized belong. For example, an acoustic determination result of 1 indicates that the speech to be recognized belongs to the music field; 2, the weather-query field; 3, the stock-query field; and 4, the smart-home device control field.
The semantic judgment result is used to represent the field to which the semantic features of the speech to be recognized belong. For example, a semantic judgment result of 1 indicates that the speech to be recognized belongs to the music field; 2, the weather-query field; 3, the stock-query field; 4, the smart-home device control field; and so on.
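The consistency check between the two results can be sketched in a few lines. The numeric domain codes follow the examples just given (1 = music, 2 = weather query, 3 = stock query, 4 = smart-home control), which the text presents as examples rather than fixed values:

```python
# Domain codes as in the examples above (illustrative, not mandated).
MUSIC, WEATHER, STOCK, SMART_HOME = 1, 2, 3, 4

def belongs_to_music_domain(acoustic_result, semantic_result):
    """The speech is treated as music-domain only when BOTH the acoustic
    and the semantic determination point at the music domain."""
    return acoustic_result == MUSIC and semantic_result == MUSIC
```

Any disagreement between the two results, in either direction, makes the speech fall outside the music field (step S1046 below).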
And S105, if the voice belongs to the music field, determining the instruction and/or the resource corresponding to the voice according to the voice recognition result.
Further, after step S105, the method may further include the steps of: executing the instructions, and/or providing the resources to a user of the smart sound box.
In this embodiment, the instructions and/or resources corresponding to the voice are set according to the actual situation. For example, if the voice to be recognized is "the song volume is too low" or "the song volume is too high", the instruction executed by the smart sound box is a volume adjustment instruction; if the voice to be recognized is "change the song", the instruction executed by the smart sound box is a song-switching instruction; and if the voice to be recognized is "I want to listen to a song by a certain singer", the resource provided by the smart sound box is a song by that singer.
According to the voice recognition method of the embodiment of the present invention, the voice to be recognized and the parameter information are obtained, the parameter information including the current mode, the recognition serial number of the voice, the internal and external noise information, and the orientation information; a feature vector corresponding to the voice is extracted; whether the voice belongs to a non-primary voice in a single-wake multi-recognition mode is determined according to the current mode and the recognition serial number; if the voice belongs to a non-primary voice in the single-wake multi-recognition mode, a voice recognition result, an acoustic judgment result and a semantic judgment result are obtained according to the parameter information and the feature vector, and whether the voice belongs to the music field is determined according to the acoustic judgment result and the semantic judgment result; and if the voice belongs to the music field, the instruction and/or resource corresponding to the voice is determined according to the voice recognition result. Because the internal and external noise information, the orientation information and the like are taken into account, the accuracy of voice recognition is improved, voices in the music field can be recognized automatically and accurately in the single-wake multi-recognition mode, and high-quality music services can subsequently be provided to users.
Fig. 2 is a flowchart illustrating another speech recognition method according to an embodiment of the present invention. As shown in fig. 2, based on the embodiment shown in fig. 1, a specific implementation manner of "obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining whether the speech belongs to the music field according to the acoustic determination result and the semantic determination result" includes the following steps:
S1041, inputting the internal and external noise information, the orientation information and the feature vector into an acoustic recognition model to obtain the voice recognition result and the acoustic judgment result, and executing step S1042.
In this embodiment, the acoustic recognition model is obtained by training any one of a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and the like with a large number of training samples. It should be noted that, because information such as the internal and external noise information and the orientation information is taken into account when the acoustic recognition model is trained, the accuracy of speech recognition can be improved.
Each training sample consists of the feature vector of a known voice, the internal and external noise information of the known voice, the orientation information of the known voice, and the field to which the known voice belongs acoustically. For example, for the known voice "I want to listen to a cheerful song", the field to which the known voice belongs acoustically is the music field; for "what is the weather today", it is the weather-query field; and for "how is the stock market today", it is the stock-query field. The internal and external noise information of the known voice can be obtained by existing noise detection technology, and the orientation information of the known voice can be obtained by existing sound source localization technology. Specific model training methods are detailed in the related art and are not further described here.
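The way the extra inputs enter the model can be illustrated with a toy stand-in for the training step: instead of a DNN/CNN/RNN, a one-layer softmax classifier is trained by full-batch gradient descent on synthetic data, with the noise and orientation information simply concatenated onto the feature vector. All dimensions, labels and hyperparameters are invented for this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n, d_feat, n_domains = 300, 8, 4
X_feat = rng.standard_normal((n, d_feat))    # acoustic feature vectors
X_noise = rng.standard_normal((n, 2))        # internal/external noise info
X_orient = rng.standard_normal((n, 1))       # orientation info
X = np.hstack([X_feat, X_noise, X_orient])   # concatenated model input

# Synthetic labels produced by a hidden linear rule so the toy model can fit them
W_true = rng.standard_normal((X.shape[1], n_domains))
y = np.argmax(X @ W_true, axis=1)
onehot = np.eye(n_domains)[y]

W = np.zeros((X.shape[1], n_domains))
b = np.zeros(n_domains)
losses = []
for _ in range(300):  # plain full-batch gradient descent on cross-entropy
    logits = X @ W + b
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    losses.append(-np.mean(np.log(p[np.arange(n), y] + 1e-12)))
    g = (p - onehot) / n
    W -= 0.5 * X.T @ g
    b -= 0.5 * g.sum(axis=0)

accuracy = np.mean(np.argmax(p, axis=1) == y)  # training accuracy
```

The point of the sketch is the input layout, not the model class: noise and orientation features sit in the same input vector as the acoustic features, so the classifier can learn to exploit them.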
S1042, determining whether the voice belongs to the music field acoustically according to the acoustic judgment result, if so, executing step S1043, and if not, executing step S1046.
S1043, if the voice belongs to the music field acoustically, inputting the voice recognition result, the internal and external noise information and the direction information into a semantic recognition model, obtaining the semantic judgment result, and executing the step S1044.
In this embodiment, after determining that the speech to be recognized belongs to the music field according to the acoustic determination result, the semantic determination result of the speech to be recognized is obtained, and when determining that the speech to be recognized belongs to the music field according to the semantic determination result, the speech to be recognized is considered to belong to the music field.
In this embodiment, the semantic recognition model is likewise obtained by training any one of a Deep Neural Network (DNN), a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN) and the like with a large number of training samples. Each training sample consists of the speech recognition result of a known voice, the internal and external noise information of the known voice, the orientation information of the known voice, and the field to which the known voice belongs semantically. For example, for the speech recognition result "I want to listen to a cheerful song", the field to which the known voice semantically belongs is the music field; for "what is the weather today", it is the weather-query field; and for "how is the stock market today", it is the stock-query field. The internal and external noise information of the known voice can be obtained by existing noise detection technology, and the orientation information of the known voice can be obtained by existing sound source localization technology. Specific model training methods are detailed in the related art and are not further described here.
S1044, determining whether the voice semantically belongs to the music field according to the semantic judgment result, if so, executing a step S1045, and if not, executing a step S1046.
S1045, if the voice semantically belongs to the music field, determining that the voice belongs to the music field.
S1046, if the voice does not belong to the music field acoustically or does not belong to the music field semantically, determining that the voice does not belong to the music field.
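The two-stage decision of steps S1041-S1046 can be sketched as follows. The two models are stand-in callables, and the numeric code 1 for the music field follows the examples given earlier; both are assumptions for illustration:

```python
MUSIC_DOMAIN = 1  # illustrative domain code, as in the earlier examples

def classify_music(noise_info, orientation, feature_vec,
                   acoustic_model, semantic_model):
    """Two-stage check: the acoustic model runs first (S1041); the
    semantic model is consulted only when the acoustic stage already
    points at the music field (S1042/S1043). Returns the recognized
    text and whether the voice belongs to the music field."""
    text, acoustic_domain = acoustic_model(noise_info, orientation, feature_vec)
    if acoustic_domain != MUSIC_DOMAIN:          # S1042 -> S1046
        return text, False
    semantic_domain = semantic_model(text, noise_info, orientation)  # S1043
    return text, semantic_domain == MUSIC_DOMAIN  # S1044 -> S1045/S1046
```

Short-circuiting on the acoustic result means the semantic model is never invoked for speech that already fails the acoustic check.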
According to the voice recognition method of the embodiment of the present invention, the voice to be recognized and the parameter information are obtained, the parameter information including the current mode, the recognition serial number of the voice, the internal and external noise information, and the orientation information; a feature vector corresponding to the voice is extracted; whether the voice belongs to a non-primary voice in a single-wake multi-recognition mode is determined according to the current mode and the recognition serial number; if the voice belongs to a non-primary voice in the single-wake multi-recognition mode, a voice recognition result, an acoustic judgment result and a semantic judgment result are obtained according to the parameter information and the feature vector, and whether the voice belongs to the music field is determined according to the acoustic judgment result and the semantic judgment result; and if the voice belongs to the music field, the instruction and/or resource corresponding to the voice is determined according to the voice recognition result. Because the internal and external noise information, the orientation information and the like are taken into account, the accuracy of voice recognition is improved, voices in the music field can be recognized automatically and accurately in the single-wake multi-recognition mode, and high-quality music services can subsequently be provided to users.
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present invention. As shown in fig. 3, the apparatus includes: an acquisition module 31, an extraction module 32, and a determination module 33.
An obtaining module 31, configured to obtain a voice to be recognized and parameter information; the parameter information includes: the current mode, the recognition serial number of the voice, the internal and external noise information and the direction information;
an extracting module 32, configured to extract a feature vector corresponding to the speech;
a determining module 33, configured to determine whether the voice belongs to a non-primary voice in a single wake-up multiple recognition mode according to the current mode and the recognition serial number;
the determining module 33 is further configured to, when the speech belongs to a non-primary speech in a single-wake multi-recognition mode, obtain a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determine whether the speech belongs to the music field according to the acoustic determination result and the semantic determination result;
the determining module 33 is further configured to determine, according to the speech recognition result, an instruction and/or a resource corresponding to the speech when the speech belongs to the music field.
Further, the determining module 33 is specifically configured to,
judging whether the current mode is a single-time awakening multi-time identification mode;
and if the current mode is a single-time awakening multi-time recognition mode, determining whether the voice is non-primary voice according to the recognition serial number.
Further, the determining module 33 is specifically configured to,
inputting the internal and external noise information, the azimuth information and the feature vector into an acoustic recognition model to obtain the voice recognition result and an acoustic judgment result;
determining whether the voice belongs to the music field acoustically according to the acoustic judgment result;
if the voice belongs to the music field acoustically, inputting the voice recognition result, the internal and external noise information and the orientation information into a semantic recognition model to obtain the semantic judgment result;
determining whether the voice belongs to the music field semantically according to the semantic judgment result;
and if the voice semantically belongs to the music field, determining that the voice belongs to the music field.
Further, the determining module 33 is specifically further configured to,
and if the voice does not belong to the music field acoustically or semantically, determining that the voice does not belong to the music field.
Further, the obtaining module 31 is further configured to obtain a speech recognition result according to the parameter information and the feature vector when the current mode is a single-time awakening single recognition mode or a guest mode, or the speech belongs to a first speech in a single-time awakening multiple recognition mode;
the determining module 33 is further configured to determine a corresponding instruction and/or resource according to the voice recognition result.
Further, the apparatus further comprises: and the execution module is used for executing the instruction and/or providing the resource for a user of the intelligent sound box.
It should be noted that the foregoing explanation of the embodiment of the speech recognition method is also applicable to the speech recognition apparatus of the embodiment, and is not repeated herein.
According to the voice recognition device of the embodiment of the present invention, the voice to be recognized and the parameter information are obtained, the parameter information including the current mode, the recognition serial number of the voice, the internal and external noise information, and the orientation information; a feature vector corresponding to the voice is extracted; whether the voice belongs to a non-primary voice in a single-wake multi-recognition mode is determined according to the current mode and the recognition serial number; if the voice belongs to a non-primary voice in the single-wake multi-recognition mode, a voice recognition result, an acoustic judgment result and a semantic judgment result are obtained according to the parameter information and the feature vector, and whether the voice belongs to the music field is determined according to the acoustic judgment result and the semantic judgment result; and if the voice belongs to the music field, the instruction and/or resource corresponding to the voice is determined according to the voice recognition result. Because the internal and external noise information, the orientation information and the like are taken into account, the accuracy of voice recognition is improved, voices in the music field can be recognized automatically and accurately in the single-wake multi-recognition mode, and high-quality music services can subsequently be provided to users.
Fig. 4 is a schematic structural diagram of another speech recognition apparatus according to an embodiment of the present invention. The speech recognition apparatus includes:
memory 1001, processor 1002, and computer programs stored on memory 1001 and executable on processor 1002.
The processor 1002, when executing the program, implements the speech recognition method provided in the above-described embodiments.
Further, the speech recognition apparatus further includes:
a communication interface 1003 for communicating between the memory 1001 and the processor 1002.
A memory 1001 for storing computer programs that may be run on the processor 1002.
Memory 1001 may include high-speed RAM memory and may also include non-volatile memory (e.g., at least one disk memory).
The processor 1002 is configured to implement the speech recognition method according to the foregoing embodiment when executing the program.
If the memory 1001, the processor 1002, and the communication interface 1003 are implemented independently, the communication interface 1003, the memory 1001, and the processor 1002 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 1001, the processor 1002, and the communication interface 1003 are integrated on one chip, the memory 1001, the processor 1002, and the communication interface 1003 may complete communication with each other through an internal interface.
The processor 1002 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.
The invention also provides a non-transitory computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements a speech recognition method as described above.
The invention also provides a computer program product; when instructions in the computer program product are executed by a processor, the speech recognition method as described above is implemented.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (14)

1. A speech recognition method, comprising:
acquiring voice to be recognized and parameter information; the parameter information includes: the current mode, the recognition serial number of the voice, the internal and external noise information and the direction information;
extracting a feature vector corresponding to the voice;
determining whether the voice belongs to a non-first voice in a single-wake multi-recognition mode according to the current mode and the recognition serial number, wherein the single-wake multi-recognition mode is an interaction mode in which, each time the user interacts with a voice recognition device, the user speaks a voice containing a wake-up word once to wake up the voice recognition device and can then speak voices to the voice recognition device a plurality of times;
if the voice belongs to non-primary voice in a single awakening multi-recognition mode, acquiring a voice recognition result, an acoustic judgment result and a semantic judgment result according to the parameter information and the feature vector, and determining whether the voice belongs to the music field according to the acoustic judgment result and the semantic judgment result;
and if the voice belongs to the music field, determining the instruction and/or the resource corresponding to the voice according to the voice recognition result.
2. The method according to claim 1, wherein the determining whether the speech belongs to a non-first speech in a single wake-up multiple recognition mode according to the current mode and the recognition sequence number comprises:
judging whether the current mode is a single-time awakening multi-time identification mode;
and if the current mode is a single-time awakening multi-time recognition mode, determining whether the voice is non-primary voice according to the recognition serial number.
3. The method according to claim 1, wherein the obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining whether the speech belongs to the music field according to the acoustic determination result and the semantic determination result comprises:
inputting the internal and external noise information, the azimuth information and the feature vector into an acoustic recognition model to obtain the voice recognition result and an acoustic judgment result;
determining whether the voice belongs to the music field acoustically according to the acoustic judgment result;
if the voice belongs to the music field acoustically, inputting the voice recognition result, the internal and external noise information and the orientation information into a semantic recognition model to obtain the semantic judgment result;
determining whether the voice belongs to the music field semantically according to the semantic judgment result;
and if the voice semantically belongs to the music field, determining that the voice belongs to the music field.
4. The method according to claim 3, wherein the obtaining a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determining whether the speech belongs to the music field according to the acoustic determination result and the semantic determination result further comprises:
and if the voice does not belong to the music field acoustically or semantically, determining that the voice does not belong to the music field.
5. The method of claim 1, further comprising:
if the current mode is a single-wake single-recognition mode or a guest mode, or the voice belongs to the first voice in the single-wake multi-recognition mode, acquiring a voice recognition result according to the parameter information and the feature vector, wherein the single-wake single-recognition mode is an interaction mode in which, each time the user interacts with the voice recognition device, the user needs to first speak a voice containing the wake-up word to wake up the voice recognition device and then speak one interactive voice; and the guest mode is an interaction mode in which the user can carry out continuous, repeated conversations with the voice recognition device within a short time without having to speak a voice containing the wake-up word each time;
and determining corresponding instructions and/or resources according to the voice recognition result.
6. The method according to claim 1, wherein after determining the instruction and/or resource corresponding to the speech according to the speech recognition result, further comprising:
executing the instructions, and/or providing the resources to a user of the smart sound box.
7. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring the voice to be recognized and the parameter information; the parameter information includes: the current mode, the recognition serial number of the voice, the internal and external noise information and the direction information;
the extraction module is used for extracting the feature vector corresponding to the voice;
a determining module, configured to determine whether the voice belongs to a non-primary voice in a single-wake multi-recognition mode according to the current mode and the recognition serial number, wherein the single-wake multi-recognition mode is an interaction mode in which, each time the user interacts with a voice recognition apparatus, the user speaks a voice containing a wake-up word once to wake up the voice recognition apparatus and can then speak voices to the voice recognition apparatus a plurality of times;
the determining module is further configured to, when the speech belongs to a non-primary speech in a single-wake multi-recognition mode, obtain a speech recognition result, an acoustic determination result, and a semantic determination result according to the parameter information and the feature vector, and determine whether the speech belongs to the music field according to the acoustic determination result and the semantic determination result;
and the determining module is further used for determining the instruction and/or the resource corresponding to the voice according to the voice recognition result when the voice belongs to the music field.
8. The apparatus of claim 7, wherein the means for determining is configured to,
judging whether the current mode is a single-time awakening multi-time identification mode;
and if the current mode is a single-time awakening multi-time recognition mode, determining whether the voice is non-primary voice according to the recognition serial number.
9. The apparatus of claim 7, wherein the determining module is configured to,
input the internal and external noise information, the orientation information, and the feature vector into an acoustic recognition model to obtain the voice recognition result and the acoustic judgment result;
determine, according to the acoustic judgment result, whether the voice belongs to the music field acoustically;
if the voice belongs to the music field acoustically, input the voice recognition result, the internal and external noise information, and the orientation information into a semantic recognition model to obtain the semantic judgment result;
determine, according to the semantic judgment result, whether the voice belongs to the music field semantically;
and if the voice belongs to the music field semantically, determine that the voice belongs to the music field.
10. The apparatus of claim 9, wherein the determining module is further configured to,
if the voice does not belong to the music field acoustically or does not belong to it semantically, determine that the voice does not belong to the music field.
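The acoustic-then-semantic cascade of claims 9 and 10 (the semantic model is consulted only when the acoustic check already points to the music field, and the voice counts as music only when both checks agree) can be sketched roughly as below. The model call signatures here are invented placeholders, not interfaces defined by the patent:

```python
def classify_music_field(features, noise_info, orientation,
                         acoustic_model, semantic_model):
    """Return (recognized_text, belongs_to_music_field).

    acoustic_model and semantic_model are caller-supplied callables
    standing in for the acoustic and semantic recognition models.
    """
    # Step 1: acoustic model yields the recognition result and an
    # acoustic music-field judgment (claim 9, first step).
    text, acoustic_is_music = acoustic_model(noise_info, orientation, features)
    if not acoustic_is_music:
        # Fails acoustically -> not music; semantic model is skipped (claim 10).
        return text, False
    # Step 2: semantic model is consulted only after the acoustic check passes.
    semantic_is_music = semantic_model(text, noise_info, orientation)
    # Music field only when both checks agree.
    return text, semantic_is_music
```

Short-circuiting on the acoustic result means the (typically heavier) semantic model never runs for voices that already fail the acoustic check.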
11. The apparatus of claim 7, wherein
the obtaining module is further configured to obtain a voice recognition result according to the parameter information and the feature vector when the current mode is a single-wake single-recognition mode or a geek mode, or when the voice belongs to the first voice in the single-wake multi-recognition mode, where the single-wake single-recognition mode is an interaction mode in which, each time the user interacts with the voice recognition apparatus, the user needs to speak a "voice containing a wake-up word" to wake up the apparatus before speaking one interactive voice; and the geek mode is an interaction mode in which the user can hold continuous, repeated dialogues with the voice recognition apparatus within a short time without speaking the "voice containing the wake-up word" each time;
and the determining module is further configured to determine a corresponding instruction and/or resource according to the voice recognition result.
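As a rough illustration of the dispatch described across claims 7 and 11: only a non-first voice in single-wake multi-recognition mode takes the acoustic-plus-semantic music-field path, while single-wake single-recognition mode, geek mode, and the first voice after a wake-up go straight from the recognition result to an instruction or resource. The mode names below are hypothetical labels, not terms from the patent:

```python
def choose_pipeline(current_mode: str, recognition_serial: int) -> str:
    """Select the processing path for an incoming voice.

    Mode names ("single_wake_single", "geek", "single_wake_multi")
    are illustrative placeholders.
    """
    if current_mode == "single_wake_multi" and recognition_serial > 1:
        # Non-first voice in single-wake multi-recognition mode:
        # run the music-field check of claims 9-10 (claim 7 path).
        return "music_field_check"
    # Single-wake single-recognition mode, geek mode, and the first
    # voice of single-wake multi-recognition mode recognize directly
    # (claim 11 path).
    return "direct_recognition"
```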
12. The apparatus of claim 7, further comprising: an execution module, configured to execute the instruction and/or provide the resource to a user of the smart speaker.
13. A voice recognition apparatus, comprising:
a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice recognition method according to any one of claims 1-6 when executing the program.
14. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the voice recognition method according to any one of claims 1-6.
CN201910293280.2A 2019-04-12 2019-04-12 Voice recognition method and device Active CN110060662B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910293280.2A CN110060662B (en) 2019-04-12 2019-04-12 Voice recognition method and device

Publications (2)

Publication Number Publication Date
CN110060662A CN110060662A (en) 2019-07-26
CN110060662B true CN110060662B (en) 2021-02-23

Family

ID=67318947

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112735394B (en) * 2020-12-16 2022-12-30 青岛海尔科技有限公司 Semantic parsing method and device for voice

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9922650B1 (en) * 2013-12-20 2018-03-20 Amazon Technologies, Inc. Intent-specific automatic speech recognition result generation
CN107464564B (en) * 2017-08-21 2023-05-26 腾讯科技(深圳)有限公司 Voice interaction method, device and equipment
CN109036411A (en) * 2018-09-05 2018-12-18 深圳市友杰智新科技有限公司 A kind of intelligent terminal interactive voice control method and device
CN109509470B (en) * 2018-12-11 2024-05-07 平安科技(深圳)有限公司 Voice interaction method and device, computer readable storage medium and terminal equipment
CN109545219A (en) * 2019-01-09 2019-03-29 北京新能源汽车股份有限公司 Vehicle-mounted voice exchange method, system, equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant