CN117133274A - Voice recognition method, device, equipment and medium - Google Patents


Info

Publication number
CN117133274A
CN117133274A (application CN202210547781.0A)
Authority
CN
China
Prior art keywords: voice recognition, network, voice, recognition, vertical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210547781.0A
Other languages
Chinese (zh)
Inventor
王智超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Rockwell Technology Co Ltd
Original Assignee
Beijing Rockwell Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Rockwell Technology Co Ltd filed Critical Beijing Rockwell Technology Co Ltd
Priority to CN202210547781.0A
Publication of CN117133274A

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 — Speech recognition
    • G10L 15/06 — Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/01 — Assessment or evaluation of speech recognition systems
    • G10L 15/08 — Speech classification or search
    • G10L 15/18 — Speech classification or search using natural language modelling
    • G10L 15/1822 — Parsing for meaning understanding

Abstract

The present disclosure relates to a voice recognition method, apparatus, device and medium. The method comprises: acquiring the voice to be recognized collected by a terminal device and a vertical voice recognition network constructed by the terminal device; acquiring a universal voice recognition network; and decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to determine a voice recognition result, thereby ensuring the accuracy of the voice recognition result while protecting user privacy.

Description

Voice recognition method, device, equipment and medium
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a speech recognition method, apparatus, device, and medium.
Background
With the rapid development of technologies such as the mobile internet and artificial intelligence, human-computer interaction scenarios are widely used in people's daily life and work, and voice recognition, as an important interface for human-computer interaction, is applied more and more widely.
In the prior art, voice recognition methods find it difficult to balance protecting user privacy against ensuring recognition accuracy. For example, when user information such as the address book, song lists and high-frequency instructions is not used, voice recognition protects user privacy but has lower recognition accuracy; when such user information is used, recognition accuracy is higher but there is a risk that the user information is leaked.
Therefore, how to ensure the accuracy of the voice recognition result while protecting user privacy has become a problem to be solved.
Disclosure of Invention
To solve or at least partially solve the above technical problems, the present disclosure provides a method, apparatus, device, and medium for speech recognition.
In a first aspect, an embodiment of the present disclosure provides a method for voice recognition, including:
acquiring voice to be recognized collected by terminal equipment and a vertical voice recognition network constructed by the terminal equipment, wherein the vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model;
acquiring a universal voice recognition network, wherein the universal voice recognition network comprises a pronunciation dictionary, an acoustic model and a universal language model;
and respectively decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network to determine a voice recognition result.
Optionally, the decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively, to determine a voice recognition result, includes:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain voice recognition results and recognition scores respectively corresponding to the universal voice recognition network and the vertical voice recognition network;
And determining a voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score corresponding to the recognition result of the vertical voice recognition network.
Optionally, the identification score includes a language score;
the decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a voice recognition result and a recognition score respectively corresponding to the universal voice recognition network and the vertical voice recognition network, comprising:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a first recognition result of the universal voice recognition network, a first language score corresponding to the first recognition result, a second recognition result of the vertical voice recognition network and a second language score corresponding to the second recognition result;
the determining the voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score of the recognition result of the vertical voice recognition network comprises the following steps:
and determining a voice recognition result according to the relation between the first language score corresponding to the first recognition result of the universal voice recognition network and the second language score corresponding to the second recognition result of the vertical voice recognition network.
Optionally, the identification score comprises an acoustic score;
the decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a voice recognition result and a recognition score respectively corresponding to the universal voice recognition network and the vertical voice recognition network, comprising:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a first recognition result of the universal voice recognition network, a first acoustic score corresponding to the first recognition result, a second recognition result of the vertical voice recognition network and a second acoustic score corresponding to the second recognition result;
the determining the voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score of the recognition result of the vertical voice recognition network comprises the following steps:
and determining a voice recognition result according to the relation between the first acoustic score corresponding to the first recognition result of the universal voice recognition network and the second acoustic score corresponding to the second recognition result of the vertical voice recognition network.
Optionally, determining the voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score corresponding to the recognition result of the vertical voice recognition network includes:
when the recognition score corresponding to the recognition result of the universal voice recognition network is greater than that of the vertical voice recognition network, determining the recognition result of the universal voice recognition network as the voice recognition result;
when the recognition score corresponding to the recognition result of the universal voice recognition network is less than that of the vertical voice recognition network, determining the recognition result of the vertical voice recognition network as the voice recognition result;
and when the two recognition scores are equal, determining either the recognition result of the vertical voice recognition network or the recognition result of the universal voice recognition network as the voice recognition result.
Optionally, the vertical voice recognition network is a vertical voice recognition network based on a weighted finite state machine; the generic speech recognition network is a weighted finite state machine based generic speech recognition network.
Optionally, the method further comprises:
and sending the voice recognition result to the terminal equipment so that the terminal equipment can execute target control operation based on the voice recognition result.
In a second aspect, embodiments of the present disclosure provide a voice recognition apparatus, including:
an acquisition module, configured to acquire the voice to be recognized collected by a terminal device and the vertical voice recognition network constructed by the terminal device, wherein the vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model;
a universal voice recognition network acquisition module, configured to acquire a universal voice recognition network, wherein the universal voice recognition network comprises a pronunciation dictionary, an acoustic model and a universal language model;
and a recognition module, configured to decode the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to determine a voice recognition result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
one or more processors;
storage means for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of the first aspects.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, implements a method according to any one of the first aspects.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
according to the voice recognition method, the device, the equipment and the medium, the voice to be recognized collected by the terminal equipment and the vertical voice recognition network constructed by the terminal equipment are obtained, then the general voice recognition network is obtained, finally the voice to be recognized is decoded based on the general voice recognition network and the vertical voice recognition network, and the voice recognition result is determined, namely, after the general voice recognition network is obtained and the vertical voice recognition network constructed by the terminal equipment is received, the cloud server decodes the obtained voice to be recognized based on the vertical voice recognition network and the general voice recognition network respectively, and the voice recognition result is determined, wherein the vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model (the vertical language model is constructed based on user information such as an address book, a song list and a high-frequency instruction).
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the solutions in the prior art, the drawings that are required for the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a voice recognition method according to an embodiment of the disclosure;
FIG. 2 is a flow chart of another speech recognition method provided by an embodiment of the present disclosure;
FIG. 3 is a flow chart of yet another speech recognition method provided by an embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of a voice recognition device according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, a further description of aspects of the present disclosure will be provided below. It should be noted that, without conflict, the embodiments of the present disclosure and features in the embodiments may be combined with each other.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the disclosure.
With the rapid development of technologies such as mobile internet and artificial intelligence, human-computer interaction scenes are widely used in the daily life production process of people, and voice recognition is used as an important interface for human-computer interaction, so that the application of the voice recognition is more and more widespread.
In general, it is difficult for a voice recognition method to balance protecting user privacy against ensuring the quality of the voice recognition result. For example, when user information such as the user's address book, song lists and high-frequency instructions is not used, user privacy is protected but the recognition accuracy of voice recognition is low; when such user information is used, recognition accuracy is high but there is a risk that the user information is leaked during the voice recognition process.
In order to solve the above problems, embodiments of the present disclosure provide a method, apparatus, device, and medium for voice recognition.
The following first describes a speech recognition method provided in an embodiment of the present disclosure with reference to fig. 1 to 3.
In an embodiment of the present disclosure, the voice recognition method is performed by a cloud server. The cloud server may be a server serving terminal devices such as a mobile phone, a tablet computer, a desktop computer, a notebook computer, a vehicle-mounted terminal, a wearable electronic device or a smart home device.
Fig. 1 shows a flowchart of a voice recognition method according to an embodiment of the present disclosure.
As shown in fig. 1, the voice recognition method may include the following steps.
S10, acquiring the voice to be recognized collected by the terminal equipment and the vertical voice recognition network constructed by the terminal equipment.
Wherein the vertical speech recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model.
The embodiments of the present disclosure take a vehicle-mounted terminal device as an example. The vehicle-mounted terminal device comprises a voice acquisition module; when a user enters the cabin and speaks, the voice acquisition module collects the voice to be recognized uttered by the user.
The vertical language model is constructed based on vertical keywords on the vehicle-mounted terminal device. Vertical keywords generally refer to different keywords belonging to the same category; in the embodiments of the present disclosure, keywords involving user privacy, such as personal names, place names, song lists and high-frequency instructions, are taken as the vertical keywords.
On the vehicle-mounted terminal device, the service scenarios related to vertical keywords are those whose interactive voice contains vertical keywords, such as voice dialing and voice navigation, because the user must speak the name of the person to call or the destination to navigate to. For example, the user may say "call XX" or "navigate to YY", where "XX" may be any name in the user's mobile phone address book and "YY" may be a place name in the user's region. Since the voice in these service scenarios contains vertical keywords (such as personal names and place names), they are service scenarios related to vertical keywords.
With the widespread adoption of artificial intelligence and intelligent terminals, human-computer interaction scenarios are becoming more and more common, and voice recognition is an important interface for human-computer interaction. For example, many manufacturers build voice assistants into the terminal operating system so that a user can control the terminal through voice: the user can call a contact, send a short message, query city weather, or open or close a terminal application by voice. Relative to common voice recognition service scenarios, these interaction scenarios belong to specific service scenarios in which most of the voice is related to vertical keywords (such as address book names, place names and terminal application names).
Compared with common text keywords, vertical keywords involve a degree of user privacy, and voice recognition methods in the prior art are prone to leaking user privacy.
To address the problems in the prior art, in the voice recognition method provided by the embodiments of the present disclosure, after the terminal device constructs the vertical language model and generates the vertical voice recognition network, it sends the constructed vertical voice recognition network, together with the collected voice to be recognized, to the vehicle-mounted cloud server. The recognition of the voice to be recognized is performed on the vehicle-mounted cloud server, which avoids leaking user privacy by sending the user's plaintext information directly to the cloud.
Specifically, a vertical speech recognition network based on a weighted finite state machine is constructed based on a pronunciation dictionary, an acoustic model and a vertical language model.
For example, the acoustic features of the voice to be recognized are input into a pre-trained acoustic model, which outputs the syllable information of the voice to be recognized. From the perspective of a weighted finite state machine, the acoustic model can be viewed as a state search space from acoustic features to pronunciation information, containing a plurality of search paths. The pronunciation information is then input into the pronunciation dictionary, which outputs the corresponding characters or words; likewise, the pronunciation dictionary can be viewed as a state search space from pronunciation information to characters or words, also containing a plurality of search paths. Next, the words output by the pronunciation dictionary are input into the vertical language model to obtain their associated probabilities. Combining the state search spaces of the acoustic model, the pronunciation dictionary and the vertical language model generates the vertical voice recognition network.
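As a non-limiting illustration (the patent provides no code), the chain described above can be sketched as follows. The lookup tables, words and probabilities are toy assumptions, not the patent's data; a real network would search weighted paths rather than do exact dictionary lookups.

```python
PRONUNCIATION_DICT = {            # syllable sequence -> word (toy lexicon)
    ("dao", "hang"): "navigate",
    ("da", "dian", "hua"): "call",
}

VERTICAL_LM = {                   # word -> probability (toy vertical LM)
    "navigate": 0.6,
    "call": 0.4,
}

def decode_syllables(syllables):
    """Map a syllable sequence to a word, then score it with the LM."""
    word = PRONUNCIATION_DICT.get(tuple(syllables))
    if word is None:
        return None, 0.0
    return word, VERTICAL_LM.get(word, 0.0)

word, prob = decode_syllables(["dao", "hang"])
```

In this sketch the acoustic model's output (the syllable sequence) is assumed as input; composing the three stages into one searchable space corresponds to the weighted-finite-state-machine combination described above.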
S20, acquiring a general voice recognition network.
Wherein the generic speech recognition network comprises a pronunciation dictionary, an acoustic model and a generic language model.
After acquiring the voice to be recognized collected by the terminal device and the vertical voice recognition network constructed by the terminal device, the vehicle-mounted cloud server acquires the universal voice recognition network. The universal voice recognition network is pre-constructed by the cloud server based on a universal language model, a pronunciation dictionary and an acoustic model, and is stored in a storage module of the cloud server; once the cloud server has acquired the voice to be recognized and the vertical voice recognition network, it fetches the universal voice recognition network from its storage module. In the acquired universal voice recognition network, the pronunciation dictionary contains the set of words the voice recognition network can process and annotates their pronunciations. Through the pronunciation dictionary, the mapping between the modeling units of the acoustic model and those of the universal language model is obtained, connecting the acoustic model and the universal language model and forming, together with the pronunciation dictionary, a searchable state space for decoding.
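The build-once, fetch-on-demand behavior described above can be sketched minimally as follows; the dict standing in for the storage module and the placeholder builder function are assumptions of this sketch, not the patent's implementation.

```python
_STORAGE = {}  # stands in for the cloud server's storage module

def build_universal_network():
    # Placeholder for composing the pronunciation dictionary, acoustic
    # model and universal language model into one searchable state space.
    return {"components": ("lexicon", "acoustic_model", "universal_lm")}

def get_universal_network():
    # Build the network once, then serve the stored copy on later requests.
    if "universal" not in _STORAGE:
        _STORAGE["universal"] = build_universal_network()
    return _STORAGE["universal"]

net = get_universal_network()
```

The second and subsequent calls to `get_universal_network()` return the same stored object, so the expensive construction is skipped once the network exists.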
As one implementation, the generic speech recognition network is a generic speech recognition network based on a weighted finite state machine built based on a pronunciation dictionary, an acoustic model, and a generic language model.
For example, the acoustic features of the voice to be recognized are input into a pre-trained acoustic model, which outputs the syllable information of the voice to be recognized. From the perspective of a weighted finite state machine, the acoustic model can be viewed as a state search space from acoustic features to pronunciation information, containing a plurality of search paths. The pronunciation information is then input into the pronunciation dictionary, which outputs the corresponding characters or words; likewise, the pronunciation dictionary can be viewed as a state search space from pronunciation information to characters or words, also containing a plurality of search paths. Next, the words output by the pronunciation dictionary are input into the generic language model to obtain their associated probabilities. Combining the state search spaces of the acoustic model, the pronunciation dictionary and the generic language model generates the generic speech recognition network.
S30, decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively, and determining a voice recognition result.
In the present application, after acquiring the universal voice recognition network and the vertical voice recognition network constructed by the terminal device, the vehicle-mounted cloud server decodes the acquired voice to be recognized and determines the voice recognition result. The decoding of the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network is carried out on the vehicle-mounted cloud server, which has far greater computing power than the terminal device, thereby ensuring the accuracy of voice recognition.
When the vehicle-mounted cloud server has acquired the universal voice recognition network and the vertical voice recognition network constructed by the terminal device, the voice to be recognized collected by the terminal device is input into the universal voice recognition network and the vertical voice recognition network respectively, decoded by each network, and the voice recognition result is determined from the two decoding results.
According to the voice recognition method provided by the embodiments of the present disclosure, the cloud server first acquires the voice to be recognized collected by the terminal device and the vertical voice recognition network constructed by the terminal device, then acquires the universal voice recognition network, and finally decodes the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network to determine the voice recognition result. In other words, after constructing the universal voice recognition network and receiving the vertical voice recognition network constructed by the terminal device, the cloud server decodes the acquired voice to be recognized based on each of the two networks respectively and determines the voice recognition result. The vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model, the vertical language model being constructed from user information such as the address book, song lists and high-frequency instructions.
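Steps S10 through S30 as a whole can be sketched as follows: the same utterance is decoded by both networks and the higher-scoring hypothesis is kept. The `decode_with` function and the toy network objects are stand-ins for illustration; the patent does not specify decoder internals.

```python
def decode_with(network, speech):
    # Stand-in: a real decoder would search the network's state space
    # for the best path given the input speech.
    return network["hypothesis"], network["score"]

def recognize(speech, universal_net, vertical_net):
    """Decode with both networks and return the higher-scoring result."""
    u_text, u_score = decode_with(universal_net, speech)
    v_text, v_score = decode_with(vertical_net, speech)
    return u_text if u_score > v_score else v_text

# Toy example: the vertical network (built from the user's own data)
# scores the privacy-sensitive hypothesis higher.
universal = {"hypothesis": "call eleven", "score": 0.30}
vertical = {"hypothesis": "call Evan", "score": 0.55}
result = recognize(b"<audio bytes>", universal, vertical)
```

Here the vertical network wins because the utterance contains a vertical keyword (a contact name) that the universal language model scores poorly.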
Fig. 2 is a schematic flow chart of another voice recognition method provided by an embodiment of the present disclosure, where, based on the foregoing embodiment, as shown in fig. 2, a specific implementation manner of step S30 includes:
s31, decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively, and obtaining voice recognition results and recognition scores corresponding to the universal voice recognition network and the vertical voice recognition network respectively.
Specifically, the voice to be recognized is respectively input to the universal voice recognition network and the vertical voice recognition network, at this time, the universal voice recognition network and the vertical voice recognition network respectively decode the input voice to be recognized, and respectively output the voice recognition result and the recognition score corresponding to the decoded voice.
Optionally, as an implementation manner, the voice to be recognized is decoded based on the universal voice recognition network and the vertical voice recognition network respectively, so as to obtain a first recognition result of the universal voice recognition network and a first acoustic score corresponding to the first recognition result, and a second recognition result of the vertical voice recognition network and a second acoustic score corresponding to the second recognition result.
After the first recognition result and the second recognition result are obtained, the first acoustic score corresponding to the first recognition result is compared with the second acoustic score corresponding to the second recognition result, and the recognition result with the higher acoustic score is selected as the final voice recognition result.
The first acoustic score of the first recognition result and the second acoustic score of the second recognition result refer to scores of the whole decoding result determined according to the decoding scores of the elements of the acoustic state sequence when the acoustic state sequence of the voice to be recognized is decoded. For example, the decoding scores of the individual acoustic state sequence elements are summed, i.e., can be used as a score for the overall decoding result. The decoding score of an acoustic state sequence element refers to a probability score that the acoustic state sequence element (e.g., a phoneme or a phoneme unit) is decoded into a certain text, and thus the score of the entire decoding result is the probability score that the entire acoustic state sequence is decoded into a certain text.
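The summation rule above can be written out directly. Treating each per-element decoding score as a log-probability, so that the sum is the log-probability of the whole state sequence being decoded into a given text, is an assumption of this sketch; the patent only says the per-element scores are summed.

```python
import math

def sequence_score(element_probs):
    """Score of the whole decoding: sum of per-element log-prob scores."""
    return sum(math.log(p) for p in element_probs)

# Toy per-element probabilities for a three-element state sequence.
total = sequence_score([0.9, 0.8, 0.95])
```

A higher (less negative) total means the sequence as a whole is a more probable decoding, which is what the score comparison in the next step relies on.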
Alternatively, as another implementation manner, the voice to be recognized is decoded based on the universal voice recognition network and the vertical voice recognition network respectively, so as to obtain a first recognition result of the universal voice recognition network and a first language score corresponding to the first recognition result, and a second recognition result of the vertical voice recognition network and a second language score corresponding to the second recognition result.
After the first recognition result and the second recognition result are obtained, the first language score corresponding to the first recognition result is compared with the second language score corresponding to the second recognition result, and the recognition result with the higher language score is selected as the final voice recognition result.
The first language score of the first recognition result and the second language score of the second recognition result refer to scores of the whole decoding result, determined from the decoding scores of the elements of the language state sequence when the language state sequence of the voice to be recognized is decoded. For example, the decoding scores of the individual language state sequence elements may be summed to give the score of the overall decoding result. The decoding score of a language state sequence element refers to the probability score that the element (e.g., a word or a character) is decoded into a certain text, and thus the score of the entire decoding result is the probability score that the entire language state sequence is decoded into a certain text.
S32, determining a voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score corresponding to the recognition result of the vertical voice recognition network.
Specifically, when the recognition score corresponding to the recognition result of the universal voice recognition network is larger than the recognition score corresponding to the recognition result of the vertical voice recognition network, determining that the recognition result of the universal voice recognition network is a voice recognition result;
when the recognition score corresponding to the recognition result of the universal voice recognition network is smaller than the recognition score corresponding to the recognition result of the vertical voice recognition network, determining that the recognition result of the vertical voice recognition network is a voice recognition result;
when the recognition score corresponding to the recognition result of the universal voice recognition network is equal to the recognition score corresponding to the recognition result of the vertical voice recognition network, determining the recognition result of the vertical voice recognition network or the recognition result of the universal voice recognition network as the voice recognition result.
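The selection rule above can be sketched as follows (a minimal illustration; the function and variable names are hypothetical, and ties are resolved here in favor of the vertical-class network, which the method permits):

```python
def select_recognition_result(general_result, general_score,
                              vertical_result, vertical_score):
    """Return the recognition result whose network scored higher;
    on a tie, either result may be chosen (here: the vertical one)."""
    if general_score > vertical_score:
        return general_result
    return vertical_result
```

For instance, with a general-network score of 8 against a vertical-network score of 7, the general network's recognition result is returned.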
For example, suppose the recognition result of the universal voice recognition network for the voice to be recognized is A1 with a recognition score of 8, and the recognition result of the vertical voice recognition network is A2 with a recognition score of 7. The voice recognition result is then determined according to the recognition scores corresponding to the two recognition results: since the recognition score corresponding to the recognition result of the universal voice recognition network is greater than that of the vertical voice recognition network, the recognition result of the universal voice recognition network is selected as the voice recognition result, i.e., the voice recognition result is A1. Conversely, if the recognition result of the universal voice recognition network is A1 with a recognition score of 7, and the recognition result of the vertical voice recognition network is A2 with a recognition score of 8, then the recognition score corresponding to the recognition result of the universal voice recognition network is smaller than that of the vertical voice recognition network, so the recognition result of the vertical voice recognition network is selected as the voice recognition result, i.e., the voice recognition result is A2.
If the recognition result of the universal voice recognition network for the voice to be recognized is A1 with a recognition score of 8, and the recognition result of the vertical voice recognition network is A2 with a recognition score of 8, the two recognition scores are equal; in this case either the recognition result of the universal voice recognition network or the recognition result of the vertical voice recognition network may be selected as the voice recognition result, i.e., the voice recognition result is A1 or A2.
The corresponding recognition score may be an acoustic score, a language score, or a weighted value of the acoustic score and the language score. For example, if the recognition result of the voice to be recognized by the universal voice recognition network is A1, with a first acoustic score X1 and a first language score Y1, and the recognition result of the voice to be recognized by the vertical voice recognition network is A2, with a second acoustic score X2 and a second language score Y2, then the recognition score of the universal voice recognition network is a·X1+b·Y1 and the recognition score of the vertical voice recognition network is a·X2+b·Y2, where a and b are the weights of the acoustic score and the language score, respectively.
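The weighted combination can be sketched as follows (a minimal illustration; the default weights a and b are assumptions, as the patent does not fix their values):

```python
def combined_recognition_score(acoustic_score, language_score, a=0.5, b=0.5):
    """Weighted value of the acoustic score X and language score Y,
    i.e. a*X + b*Y as in the formulation above."""
    return a * acoustic_score + b * language_score
```

Setting a=1, b=0 recovers the pure acoustic-score comparison, and a=0, b=1 the pure language-score comparison, so this form subsumes both earlier implementations.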
Optionally, as an implementation manner, the voice recognition result is determined according to a relation between a first acoustic score corresponding to a first recognition result of the universal voice recognition network and a second acoustic score corresponding to a second recognition result of the vertical voice recognition network.
The acoustic score of a voice recognition result represents the score with which the voice is recognized as that result, and can be used to characterize the accuracy of the recognition result. Therefore, the first acoustic score of the first recognition result and the second acoustic score of the second recognition result each characterize the accuracy of the corresponding recognition result, and by comparing the first acoustic score with the second acoustic score, the recognition result with the higher acoustic score can be selected from the recognition results and used as the final voice recognition result.
Alternatively, as another implementation manner, the voice recognition result is determined according to the relationship between the first language score corresponding to the first recognition result of the universal voice recognition network and the second language score corresponding to the second recognition result of the vertical voice recognition network.
The language score of a voice recognition result represents the score with which the voice is recognized as that result, and can be used to characterize the accuracy of the recognition result. Therefore, the first language score of the first recognition result and the second language score of the second recognition result each characterize the accuracy of the corresponding recognition result, and by comparing the first language score with the second language score, the recognition result with the higher language score can be selected from the recognition results and used as the final voice recognition result.
According to the voice recognition method provided by the embodiment of the disclosure, the voice to be recognized is first decoded based on the universal voice recognition network and the vertical voice recognition network respectively, so as to obtain the recognition result and recognition score corresponding to each network. The voice recognition result is then determined according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score corresponding to the recognition result of the vertical voice recognition network, that is, the recognition result of the network with the higher recognition score is selected as the voice recognition result, which further ensures the accuracy of the voice recognition result.
Fig. 3 is a schematic flow chart of yet another voice recognition method according to an embodiment of the present disclosure, where, based on the foregoing embodiment, as shown in fig. 3, the voice recognition method further includes:
and S40, sending the voice recognition result to the terminal equipment so that the terminal equipment can execute target control operation based on the voice recognition result.
After the vehicle-mounted cloud server determines the voice recognition result, the vehicle-mounted cloud server sends the voice recognition result to the vehicle-mounted terminal equipment, so that the vehicle-mounted terminal equipment executes target control operation corresponding to the voice recognition result based on the voice recognition result.
Specifically, the vehicle-mounted terminal device queries, among the control instructions of the control instruction set, the target control instruction matching the voice recognition result, so as to determine the voice control intention of the user.
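The instruction lookup can be sketched as follows (a minimal illustration using substring matching; a production system might use fuzzy or semantic matching, and the instruction names and example phrases are hypothetical):

```python
def find_target_instruction(recognition_result, instruction_set):
    """Query each control instruction in the control instruction set
    for one matching the voice recognition result."""
    for phrase, instruction in instruction_set.items():
        if phrase in recognition_result:
            return instruction
    return None  # no matching target control instruction

# Hypothetical control instruction set for a vehicle-mounted terminal.
instructions = {"open the window": "CMD_OPEN_WINDOW",
                "play music": "CMD_PLAY_MUSIC"}
```

The returned target control instruction then drives the target control operation executed by the terminal device.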
Fig. 4 is a schematic structural diagram of a voice recognition device according to an embodiment of the present disclosure, where the voice recognition device shown in fig. 4 includes:
an obtaining module 410, configured to obtain a voice to be recognized collected by a terminal device and a vertical voice recognition network constructed by the terminal device, where the vertical voice recognition network includes a pronunciation dictionary, an acoustic model and a vertical language model;
A universal speech recognition network acquisition module 420 for acquiring a universal speech recognition network, wherein the universal speech recognition network comprises a pronunciation dictionary, an acoustic model and a universal language model;
the recognition module 430 is configured to decode the speech to be recognized based on the generic speech recognition network and the vertical speech recognition network, respectively, and determine a speech recognition result.
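Under the module division above, the device's data flow can be sketched as follows (a minimal illustration; the network objects and their decode interface are hypothetical placeholders, not the patent's implementation):

```python
class SpeechRecognitionDevice:
    """Mirrors the acquisition, network-acquisition and recognition modules."""

    def __init__(self, general_network, vertical_network):
        self.general_network = general_network    # from module 420
        self.vertical_network = vertical_network  # from module 410

    def recognize(self, speech):
        # Recognition module 430: decode with both networks and keep
        # the higher-scoring result (ties go to the vertical network).
        result_1, score_1 = self.general_network.decode(speech)
        result_2, score_2 = self.vertical_network.decode(speech)
        return result_1 if score_1 > score_2 else result_2
```

Each network object only needs to expose a decode method returning a (result, score) pair, matching the score-comparison logic of the method embodiments.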
According to the voice recognition device provided by the embodiment of the disclosure, the cloud server first acquires the voice to be recognized collected by the terminal equipment and the vertical voice recognition network constructed by the terminal equipment, then acquires the universal voice recognition network, and finally decodes the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to determine the voice recognition result. The vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model, the vertical language model being constructed based on information of the user, such as an address book, a song list and high-frequency instructions.
Optionally, the identification module includes:
the recognition result determining unit is used for respectively decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network to obtain voice recognition results and recognition scores respectively corresponding to the universal voice recognition network and the vertical voice recognition network;
and the recognition unit is used for determining the voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score corresponding to the recognition result of the vertical voice recognition network.
Optionally, an embodiment of the identification result determining unit includes:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a first recognition result of the universal voice recognition network, a first language score corresponding to the first recognition result, a second recognition result of the vertical voice recognition network and a second language score corresponding to the second recognition result;
one embodiment of the identification unit comprises:
and determining a voice recognition result according to the relation between the first language score corresponding to the first recognition result of the universal voice recognition network and the second language score corresponding to the second recognition result of the vertical voice recognition network.
Alternatively, another embodiment of the identification result determining unit includes:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a first recognition result of the universal voice recognition network, a first acoustic score corresponding to the first recognition result, a second recognition result of the vertical voice recognition network and a second acoustic score corresponding to the second recognition result;
another embodiment of the identification unit comprises:
and determining a voice recognition result according to the relation between the first acoustic score corresponding to the first recognition result of the universal voice recognition network and the second acoustic score corresponding to the second recognition result of the vertical voice recognition network.
Optionally, a specific embodiment of the identification module includes:
when the recognition score corresponding to the recognition result of the universal voice recognition network is larger than the recognition score corresponding to the recognition result of the vertical voice recognition network, determining that the recognition result of the universal voice recognition network is a voice recognition result;
when the recognition score corresponding to the recognition result of the universal voice recognition network is smaller than the recognition score corresponding to the recognition result of the vertical voice recognition network, determining that the recognition result of the vertical voice recognition network is a voice recognition result;
When the recognition score corresponding to the recognition result of the universal voice recognition network is equal to the recognition score corresponding to the recognition result of the vertical voice recognition network, determining the recognition result of the vertical voice recognition network or the recognition result of the universal voice recognition network as the voice recognition result.
Optionally, in one embodiment of the voice recognition device:
the vertical voice recognition network is a vertical voice recognition network based on a weighted finite state machine; the generic speech recognition network is a weighted finite state machine based generic speech recognition network.
Optionally, the method further comprises:
and the sending module is used for sending the voice recognition result to the terminal equipment so that the terminal equipment can execute target control operation based on the voice recognition result.
The device provided by the embodiment of the invention can execute the method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method.
It should be noted that, in the embodiment of the apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding function can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
The present disclosure also provides an electronic device, including a memory and a processor, the processor being configured to execute a computer program stored in the memory, the computer program, when executed by the processor, implementing the steps of the method embodiments described above.
Fig. 5 is a schematic structural diagram of an electronic device provided in the present disclosure, and fig. 5 shows a block diagram of an exemplary electronic device suitable for implementing the embodiment of the present invention. The electronic device shown in fig. 5 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 is embodied in the form of a general purpose computing device. The components of electronic device 500 may include, but are not limited to: one or more processors 510, a system memory 520, and a bus 530 that connects the different system components (including the system memory 520 and the processor 510).
Bus 530 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Electronic device 500 typically includes many types of computer system readable media. Such media can be any medium that is accessible by electronic device 500 and includes both volatile and non-volatile media, removable and non-removable media.
The system memory 520 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 540 and/or cache memory 550. Electronic device 500 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 560 may be used to read from or write to non-removable, nonvolatile magnetic media (commonly referred to as a "hard disk drive"). Disk drives for reading from and writing to removable nonvolatile magnetic disks (e.g., a "floppy disk"), and optical disk drives for reading from and writing to removable nonvolatile optical disks (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 530 through one or more data media interfaces. The system memory 520 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 580 having a set (at least one) of program modules 570 may be stored in, for example, system memory 520, such program modules 570 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 570 generally perform the functions and/or methodologies of the embodiments described herein.
The processor 510 executes various functional applications and information processing, such as implementing method embodiments provided by embodiments of the present invention, by running at least one program of a plurality of programs stored in the system memory 520.
The present disclosure also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the above-described method embodiments.
Any combination of one or more computer readable media may be employed. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk or C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The present disclosure also provides a computer program product which, when run on a computer, causes the computer to perform the steps of implementing the method embodiments described above.
It should be noted that in this document, relational terms such as "first" and "second" and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing is merely a specific embodiment of the disclosure to enable one skilled in the art to understand or practice the disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown and described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method of speech recognition, comprising:
acquiring voice to be recognized collected by terminal equipment and a vertical voice recognition network constructed by the terminal equipment, wherein the vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model;
acquiring a universal voice recognition network, wherein the universal voice recognition network comprises a pronunciation dictionary, an acoustic model and a universal language model;
and respectively decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network to determine a voice recognition result.
2. The method of claim 1, wherein the decoding the speech to be recognized based on the generic speech recognition network and the vertical-class speech recognition network, respectively, to determine a speech recognition result comprises:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain voice recognition results and recognition scores respectively corresponding to the universal voice recognition network and the vertical voice recognition network;
and determining a voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score corresponding to the recognition result of the vertical voice recognition network.
3. The method of claim 2, wherein the identification score comprises a language score;
the decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a voice recognition result and a recognition score respectively corresponding to the universal voice recognition network and the vertical voice recognition network, comprising:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a first recognition result of the universal voice recognition network, a first language score corresponding to the first recognition result, a second recognition result of the vertical voice recognition network and a second language score corresponding to the second recognition result;
the determining the voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score of the recognition result of the vertical voice recognition network comprises the following steps:
and determining a voice recognition result according to the relation between the first language score corresponding to the first recognition result of the universal voice recognition network and the second language score corresponding to the second recognition result of the vertical voice recognition network.
4. The method of claim 2, wherein the identification score comprises an acoustic score;
the decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a voice recognition result and a recognition score respectively corresponding to the universal voice recognition network and the vertical voice recognition network, comprising:
decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network respectively to obtain a first recognition result of the universal voice recognition network, a first acoustic score corresponding to the first recognition result, a second recognition result of the vertical voice recognition network and a second acoustic score corresponding to the second recognition result;
the determining the voice recognition result according to the recognition score corresponding to the recognition result of the universal voice recognition network and the recognition score of the recognition result of the vertical voice recognition network comprises the following steps:
and determining a voice recognition result according to the relation between the first acoustic score corresponding to the first recognition result of the universal voice recognition network and the second acoustic score corresponding to the second recognition result of the vertical voice recognition network.
5. The method according to claim 2, wherein the determining the speech recognition result according to the recognition score corresponding to the recognition result of the generic speech recognition network and the recognition score corresponding to the recognition result of the vertical type speech recognition network comprises:
when the recognition score corresponding to the recognition result of the universal voice recognition network is larger than the recognition score corresponding to the recognition result of the vertical voice recognition network, determining the recognition result of the universal voice recognition network as the voice recognition result;
when the recognition score corresponding to the recognition result of the universal voice recognition network is smaller than the recognition score corresponding to the recognition result of the vertical voice recognition network, determining that the recognition result of the vertical voice recognition network is the voice recognition result;
and when the recognition score corresponding to the recognition result of the universal voice recognition network is equal to the recognition score corresponding to the recognition result of the vertical voice recognition network, determining the recognition result of the vertical voice recognition network or the recognition result of the universal voice recognition network as the voice recognition result.
6. The method of claim 1, wherein the vertical voice recognition network is a weighted finite state machine based vertical voice recognition network; the generic speech recognition network is a weighted finite state machine based generic speech recognition network.
7. The method according to claim 1, wherein the method further comprises:
and sending the voice recognition result to the terminal equipment so that the terminal equipment can execute target control operation based on the voice recognition result.
8. A speech recognition apparatus, comprising:
the acquisition module is used for acquiring the voice to be recognized collected by the terminal equipment and the vertical voice recognition network constructed by the terminal equipment, wherein the vertical voice recognition network comprises a pronunciation dictionary, an acoustic model and a vertical language model;
the universal voice recognition network acquisition module is used for acquiring a universal voice recognition network, wherein the universal voice recognition network comprises a pronunciation dictionary, an acoustic model and a universal language model;
and the recognition module is used for respectively decoding the voice to be recognized based on the universal voice recognition network and the vertical voice recognition network to determine a voice recognition result.
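The module structure of claim 8 can be sketched as a single class whose methods mirror the claimed modules. Everything below is an illustrative assumption: the class, the stub network, and the tie-breaking choice are not specified by the patent:

```python
class StubNetwork:
    """Hypothetical stand-in for a decoding network; returns a
    fixed (text, score) hypothesis for any input audio."""
    def __init__(self, text, score):
        self.text, self.score = text, score

    def decode(self, audio):
        return (self.text, self.score)


class VoiceRecognitionApparatus:
    """Sketch of claim 8's modules in one class."""
    def __init__(self, vertical_network, universal_network):
        # acquisition modules: both networks are assumed already
        # built (the vertical one by the terminal device)
        self.vertical_network = vertical_network
        self.universal_network = universal_network

    def recognize(self, audio):
        # recognition module: decode with both networks and keep
        # the higher-scoring hypothesis (vertical wins ties here)
        u = self.universal_network.decode(audio)
        v = self.vertical_network.decode(audio)
        return v[0] if v[1] >= u[1] else u[0]
```

Keeping the two networks behind a common `decode` interface lets the recognition module compare their hypotheses without knowing how either network was constructed.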
9. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210547781.0A 2022-05-18 2022-05-18 Voice recognition method, device, equipment and medium Pending CN117133274A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210547781.0A CN117133274A (en) 2022-05-18 2022-05-18 Voice recognition method, device, equipment and medium


Publications (1)

Publication Number Publication Date
CN117133274A true CN117133274A (en) 2023-11-28

Family

ID=88861482


Country Status (1)

Country Link
CN (1) CN117133274A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination