CN114299945A - Voice signal recognition method and device, electronic equipment, storage medium and product - Google Patents

Voice signal recognition method and device, electronic equipment, storage medium and product Download PDF

Info

Publication number
CN114299945A
CN114299945A (application CN202111539867.0A)
Authority
CN
China
Prior art keywords
decoding, path, determining, speech, voice signal
Prior art date
Legal status
Pending
Application number
CN202111539867.0A
Other languages
Chinese (zh)
Inventor
李良斌
李志勇
陈孝良
Current Assignee
Beijing SoundAI Technology Co Ltd
Original Assignee
Beijing SoundAI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing SoundAI Technology Co Ltd filed Critical Beijing SoundAI Technology Co Ltd
Priority to CN202111539867.0A
Publication of CN114299945A

Landscapes

  • Telephone Function (AREA)

Abstract

The application provides a voice signal recognition method and apparatus, an electronic device, a storage medium, and a program product, and belongs to the technical field of voice interaction. The method comprises the following steps: receiving a target voice signal, and determining a plurality of voice frames included in the target voice signal; determining a first decoding parameter of a first path of the plurality of voice frames in a first decoding graph, and determining a second decoding parameter of a second path of the plurality of voice frames in a second decoding graph; under the condition that the difference between the first decoding parameter and the second decoding parameter is not greater than a preset difference, determining a plurality of first nodes included in the first path and the decoding parameter of each first node; and determining a recognition result of the target voice signal based on the first decoding parameter, the plurality of first nodes, and the decoding parameter of each first node, the recognition result indicating whether to wake up the electronic device. Because both the similarity between the voice signal and the wake-up signal and the decoding path information of the voice signal are considered, the accuracy of the recognition result is improved.

Description

Voice signal recognition method and device, electronic equipment, storage medium and product
Technical Field
The present application relates to the field of voice interaction technologies, and in particular, to a method and an apparatus for recognizing a voice signal, an electronic device, a storage medium, and a product.
Background
At present, the voice wake-up function is widely applied to the voice interaction technology. Before a user performs voice interaction with the electronic device, the electronic device needs to be awakened through an awakening word.
In the related art, the electronic device recognizes the received voice signal and determines the similarity between the recognition result and a wake-up word; if the similarity is greater than a preset threshold, the electronic device is woken up, and if the similarity is not greater than the preset threshold, the electronic device is not woken up.
However, in the related art, in order to meet the requirements of different users, the preset threshold is generally set to a moderate fixed value. A word whose pronunciation is similar to the wake-up word (for example, "Xiaotu", "little rabbit", sounds similar to the wake word "Xiaodu") may then produce a recognition result whose similarity to the wake-up word is greater than the preset threshold, so that the electronic device is woken up by mistake; this method therefore has a high false wake-up rate.
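The related-art rule described above can be sketched in a few lines. The threshold and similarity values below are made-up numbers chosen only to show the failure mode: a fixed moderate threshold lets a near-homophone of the wake word through.

```python
# Hypothetical illustration of the related-art approach: a single fixed
# similarity threshold cannot separate the wake word from near-homophones.
def should_wake(similarity: float, threshold: float = 0.6) -> bool:
    """Related-art rule: wake the device iff similarity exceeds the threshold."""
    return similarity > threshold

wake_word_score = 0.92   # user actually said the wake word
homophone_score = 0.71   # user said a similar-sounding non-wake word

assert should_wake(wake_word_score)   # correct wake-up
assert should_wake(homophone_score)   # false wake-up: the problem addressed here
```

Lowering the threshold would reject the homophone but also reject quiet or accented utterances of the real wake word, which is why the application adds decoding-path information instead of tuning the threshold.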
Disclosure of Invention
The embodiment of the application provides a voice signal identification method, a voice signal identification device, electronic equipment, a storage medium and a product, which can reduce the false awakening rate of voice awakening. The technical scheme is as follows:
in one aspect, a method for recognizing a speech signal is provided, the method including:
receiving a target voice signal, and determining a plurality of voice frames included in the target voice signal;
determining a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph, and determining a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph, wherein the first decoding graph comprises decoding paths corresponding to a plurality of basic speech signals, and the second decoding graph comprises decoding paths corresponding to a plurality of wake-up speech signals;
determining a plurality of first nodes included in the first path and a decoding parameter of each first node under the condition that a difference value between the first decoding parameter and the second decoding parameter is not larger than a preset difference value;
determining a recognition result of the target voice signal based on the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node, wherein the recognition result is used for indicating whether to wake up the electronic equipment.
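The four steps above can be sketched end to end. Everything in this sketch is a toy stand-in: the "frames" are phoneme labels, the decoding graphs are plain lists of candidate paths, the path score is a simple match fraction, and the final recognition model is replaced by an average-and-threshold rule; none of these concrete choices come from the patent itself.

```python
def score_path(frames, path):
    # Toy decoding parameter: fraction of frames whose label matches the path node.
    hits = sum(1 for frame, node in zip(frames, path) if frame == node)
    return hits / max(len(frames), 1)

def best_path(frames, decoding_graph):
    # The path with the largest decoding parameter in the graph.
    return max(((score_path(frames, p), p) for p in decoding_graph),
               key=lambda scored: scored[0])

def recognize(frames, first_graph, second_graph, preset_diff=0.3):
    p1, first_path = best_path(frames, first_graph)   # basic-signal graph
    p2, _ = best_path(frames, second_graph)           # wake-signal graph
    # Only continue when the two decoding parameters are close enough.
    if abs(p1 - p2) > preset_diff:
        return False
    # Stub for the final step: combine the path score with per-node scores;
    # the patent feeds these into a trained recognition model instead.
    node_params = [1.0 if f == n else 0.0 for f, n in zip(frames, first_path)]
    return (p1 + sum(node_params) / len(node_params)) / 2 > 0.5

frames = ["x", "i", "a", "o"]
first_graph = [["x", "i", "a", "o"], ["n", "i", "h", "a"]]
second_graph = [["x", "i", "a", "o"]]
assert recognize(frames, first_graph, second_graph) is True
```

A signal far from any wake path (for example frames `["n", "i", "h", "a"]` against the same graphs) fails the difference check and does not wake the device.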
In one possible implementation, the determining the recognition result of the target speech signal based on the first decoding parameters, the plurality of first nodes, and the decoding parameters of each first node includes:
and inputting the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node into a voice recognition model to obtain a recognition result of the target voice signal, wherein the voice recognition model is used for obtaining the recognition result based on the decoding parameters of the path, the plurality of nodes included by the path and the decoding parameters of each node.
In another possible implementation, the process of training the speech recognition model includes:
acquiring a sample voice signal, wherein the sample voice signal comprises a first voice signal and a second voice signal, the first voice signal is a voice signal corresponding to a wakeup word, and the second voice signal is a voice signal corresponding to a non-wakeup word;
training an initial recognition model based on the first voice signal and the second voice signal until the accuracy of the initial recognition model reaches a preset threshold value, and obtaining the voice recognition model.
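The training loop claimed above (train until accuracy reaches a preset threshold) can be sketched generically. The `step` and `evaluate` callables are hypothetical stand-ins for real model updates and held-out evaluation; the accuracy numbers are arbitrary.

```python
# Sketch of the claimed loop: keep updating the initial recognition model
# until its accuracy reaches the preset threshold.
def train_until_accuracy(step, evaluate, target_accuracy, max_epochs=100):
    for epoch in range(1, max_epochs + 1):
        step()                          # one update of the initial model
        if evaluate() >= target_accuracy:
            return epoch                # accuracy threshold reached
    raise RuntimeError("accuracy target not reached within max_epochs")

# Toy stand-in: each "step" raises accuracy by 0.3.
state = {"acc": 0.2}
epochs = train_until_accuracy(
    step=lambda: state.__setitem__("acc", state["acc"] + 0.3),
    evaluate=lambda: state["acc"],
    target_accuracy=0.75,
)
assert epochs == 2   # accuracy ~0.5 after epoch 1, ~0.8 after epoch 2
```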
In another possible implementation manner, the training an initial recognition model based on the first speech signal and the second speech signal includes:
determining a third path of a plurality of speech frames included in the first speech signal in the first decoding graph and a fourth path of a plurality of speech frames included in the second speech signal in the first decoding graph;
determining first path information and second path information, wherein the first path information comprises decoding parameters of a third path, a plurality of third nodes included in the third path and decoding parameters of each third node, and the second path information comprises decoding parameters of a fourth path, a plurality of fourth nodes included in the fourth path and decoding parameters of each fourth node;
and training an initial recognition model based on the first path information and the second path information.
In another possible implementation manner, the obtaining a sample speech signal includes:
receiving a voice signal corresponding to a wake-up word and a voice signal corresponding to a non-wake-up word;
and adding noise to the voice signal corresponding to the wake-up word to obtain the first voice signal, and adding noise to the voice signal corresponding to the non-wake-up word to obtain the second voice signal.
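The sample-preparation step can be sketched as below. Additive Gaussian noise is an assumed choice for illustration; the patent only states that noise-addition processing is performed, without fixing a noise model, and the signal values are toy numbers.

```python
import random

# Sketch of the sample-preparation step: noise is added to recorded
# wake-word and non-wake-word signals to form the first and second
# voice signals used for training (assumed additive Gaussian noise).
def add_noise(signal, noise_std=0.01, seed=0):
    rng = random.Random(seed)            # fixed seed for reproducibility
    return [s + rng.gauss(0.0, noise_std) for s in signal]

wake_signal = [0.1, 0.5, -0.2]           # toy recording of the wake word
first_sample = add_noise(wake_signal)    # the claim's "first voice signal"
assert first_sample != wake_signal       # noise actually perturbed the signal
```

Training on noised copies of both wake-word and non-wake-word recordings is what makes the resulting recognition model robust to real acoustic conditions.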
In another possible implementation manner, the determining a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph includes:
determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in the first decoding graph;
and determining the decoding parameter with the largest value as the first decoding parameter of the first path from the decoding parameters of the plurality of decoding paths.
In another possible implementation manner, the determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in the first decoding graph includes:
for each decoding path in the first decoding graph, determining a basic voice signal corresponding to the decoding path; determining a first language decoding parameter and a first acoustic decoding parameter of the plurality of speech frames under the decoding path, wherein the first language decoding parameter is used for representing the matching probability between the plurality of speech frames and a word sequence corresponding to the basic speech signal, the first acoustic decoding parameter is used for representing the matching probability between the plurality of speech frames and a first phoneme sequence, and the first phoneme sequence is obtained based on the word sequence decomposition;
and determining the product of the first language decoding parameter and the first acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
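The per-path computation above multiplies a language score (frames vs. word sequence) against an acoustic score (frames vs. phoneme sequence). The probabilities below are illustrative numbers, not model outputs; the log-space variant is a common implementation note, not something the patent specifies.

```python
import math

# The path's decoding parameter is the product of the language decoding
# parameter and the acoustic decoding parameter, as stated in the claim.
def path_decoding_parameter(language_prob: float, acoustic_prob: float) -> float:
    return language_prob * acoustic_prob

# Practical decoders usually sum log-probabilities instead, to avoid
# numerical underflow when many frame probabilities are multiplied.
def log_path_decoding_parameter(language_prob: float, acoustic_prob: float) -> float:
    return math.log(language_prob) + math.log(acoustic_prob)

assert path_decoding_parameter(0.8, 0.5) == 0.4
```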
In another possible implementation manner, the determining a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph includes:
determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in the second decoding graph;
and determining the decoding parameter with the largest value as the second decoding parameter of the second path from the decoding parameters of the plurality of decoding paths.
In another possible implementation manner, the determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in the second decoding graph includes:
for each decoding path in the second decoding graph, determining a wake-up voice signal corresponding to the decoding path; determining a second language decoding parameter and a second acoustic decoding parameter of the plurality of voice frames under the decoding path, wherein the second language decoding parameter is used for representing the matching probability between the plurality of voice frames and a wake-up word sequence corresponding to a wake-up voice signal, the second acoustic decoding parameter is used for representing the matching probability between the plurality of voice frames and a second phoneme sequence, and the second phoneme sequence is obtained by decomposing based on the wake-up word sequence;
and determining the product of the second language decoding parameter and the second acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
In another possible implementation manner, each voice frame includes a voice signal of a first preset duration;
the determining a plurality of speech frames included in the target speech signal comprises:
and dividing the target voice signal according to a preset period to obtain a plurality of voice frames included by the target voice signal.
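The frame-division step is a sliding window over the sampled signal. The patent only states a "first preset duration" and a "preset period"; the window and hop sizes below are arbitrary toy values (25 ms windows with a 10 ms hop are a common convention in speech processing, but are not fixed by the patent).

```python
# Sketch of frame division: slice the target signal into fixed-duration
# frames advanced by a preset period (the hop).
def split_frames(samples, frame_len, hop):
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

signal = list(range(10))                 # toy stand-in for audio samples
frames = split_frames(signal, frame_len=4, hop=2)
assert frames[0] == [0, 1, 2, 3]         # first frame
assert frames[1] == [2, 3, 4, 5]         # overlaps the first by frame_len - hop
assert len(frames) == 4
```

Because the hop is smaller than the frame length, consecutive frames overlap, which is what lets the decoder track phonemes that straddle frame boundaries.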
In another possible implementation manner, the determining the plurality of first nodes included in the first path and the decoding parameter of each first node includes:
determining a jump order of the plurality of first nodes included in the first path;
determining the voice frame corresponding to each first node according to the jump order;
and determining a probability value that the phoneme corresponding to each first node is consistent with the phoneme corresponding to the voice frame, and taking the probability value as the decoding parameter of that first node.
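The per-node step can be sketched as below: walk the first path's nodes in jump order, pair each node with its voice frame, and take the probability that the node's phoneme matches the frame's phoneme as that node's decoding parameter. The phoneme posteriors are made-up numbers standing in for an acoustic model's per-frame output; none of them come from the patent.

```python
def node_decoding_parameters(path_phonemes, frame_posteriors):
    """path_phonemes: one phoneme per first node, already in jump order.
    frame_posteriors: one dict per voice frame mapping phoneme -> probability."""
    return [posterior.get(phoneme, 0.0)
            for phoneme, posterior in zip(path_phonemes, frame_posteriors)]

path = ["x", "i", "ao"]                  # phonemes along the first path
posteriors = [{"x": 0.9, "i": 0.1},      # frame 1
              {"i": 0.8},                # frame 2
              {"ao": 0.7, "o": 0.2}]     # frame 3
assert node_decoding_parameters(path, posteriors) == [0.9, 0.8, 0.7]
```

These per-node values, together with the first path's overall decoding parameter, are exactly the inputs the recognition model consumes in the final step of the method.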
In another aspect, an apparatus for recognizing a speech signal is provided, the apparatus including:
the receiving module is used for receiving a target voice signal and determining a plurality of voice frames included in the target voice signal;
a first determining module, configured to determine a first decoding parameter of a first path of the multiple speech frames in a first decoding graph, and determine a second decoding parameter of a second path of the multiple speech frames in a second decoding graph, where the first decoding graph includes decoding paths corresponding to multiple base speech signals, and the second decoding graph includes decoding paths corresponding to multiple wake-up speech signals;
a second determining module, configured to determine, when a difference between the first decoding parameter and the second decoding parameter is not greater than a preset difference, a plurality of first nodes included in the first path and a decoding parameter of each first node;
a third determining module, configured to determine, based on the first decoding parameter, the plurality of first nodes, and the decoding parameter of each first node, a recognition result of the target speech signal, where the recognition result is used to indicate whether to wake up the electronic device.
In a possible implementation manner, the third determining module is configured to input the first decoding parameter, the plurality of first nodes, and the decoding parameter of each first node into a speech recognition model to obtain a recognition result of the target speech signal, where the speech recognition model is configured to obtain the recognition result based on the decoding parameters of the path, the plurality of nodes included in the path, and the decoding parameter of each node.
In another possible implementation manner, the apparatus further includes a training module, where the training module includes:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a sample voice signal, the sample voice signal comprises a first voice signal and a second voice signal, the first voice signal is a voice signal corresponding to an awakening word, and the second voice signal is a voice signal corresponding to a non-awakening word;
and the training unit is used for training an initial recognition model based on the first voice signal and the second voice signal until the accuracy of the initial recognition model reaches a preset threshold value, so as to obtain the voice recognition model.
In another possible implementation manner, the training unit is configured to determine a third path of a plurality of speech frames included in the first speech signal in the first decoding graph, and a fourth path of a plurality of speech frames included in the second speech signal in the first decoding graph; determine first path information and second path information, wherein the first path information comprises decoding parameters of the third path, a plurality of third nodes included in the third path, and decoding parameters of each third node, and the second path information comprises decoding parameters of the fourth path, a plurality of fourth nodes included in the fourth path, and decoding parameters of each fourth node; and train an initial recognition model based on the first path information and the second path information.
In another possible implementation manner, the obtaining unit is configured to receive a voice signal corresponding to a wakeup word and a voice signal corresponding to a non-wakeup word; and carrying out noise adding processing on the voice signals corresponding to the awakening words to obtain first voice signals, and carrying out noise adding processing on the voice signals corresponding to the non-awakening words to obtain second voice signals.
In another possible implementation manner, the first determining module is configured to determine decoding parameters of a plurality of decoding paths of the plurality of speech frames in the first decoding graph; and determining the decoding parameter with the largest value as the first decoding parameter of the first path from the decoding parameters of the plurality of decoding paths.
In another possible implementation manner, the first determining module is configured to determine, for each decoding path in the first decoding graph, a base speech signal corresponding to the decoding path; determining a first language decoding parameter and a first acoustic decoding parameter of the plurality of speech frames under the decoding path, wherein the first language decoding parameter is used for representing the matching probability between the plurality of speech frames and a word sequence corresponding to the basic speech signal, the first acoustic decoding parameter is used for representing the matching probability between the plurality of speech frames and a first phoneme sequence, and the first phoneme sequence is obtained based on the word sequence decomposition; and determining the product of the first language decoding parameter and the first acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
In another possible implementation manner, the first determining module is further configured to determine decoding parameters of a plurality of decoding paths of the plurality of speech frames in the second decoding graph; and determining the decoding parameter with the largest value as the second decoding parameter of the second path from the decoding parameters of the plurality of decoding paths.
In another possible implementation manner, the first determining module is further configured to determine, for each decoding path in the second decoding graph, a wake-up speech signal corresponding to the decoding path; determining a second language decoding parameter and a second acoustic decoding parameter of the plurality of voice frames under the decoding path, wherein the second language decoding parameter is used for representing the matching probability between the plurality of voice frames and a wakeup word sequence corresponding to the wakeup voice signal, the second acoustic decoding parameter is used for representing the matching probability between the plurality of voice frames and a second phoneme sequence, and the second phoneme sequence is obtained by decomposing based on the wakeup word sequence; and determining the product of the second language decoding parameter and the second acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
In another possible implementation manner, each voice frame includes a voice signal of a first preset duration;
the receiving module is used for dividing the target voice signal according to a preset period to obtain a plurality of voice frames included by the target voice signal.
In another possible implementation manner, the second determining module is configured to determine a jump order of the plurality of first nodes included in the first path; determine the voice frame corresponding to each first node according to the jump order; and determine a probability value that the phoneme corresponding to each first node is consistent with the phoneme corresponding to the voice frame, and take the probability value as the decoding parameter of that first node.
In another aspect, an electronic device is provided, which includes one or more processors and one or more memories, where at least one program code is stored in the one or more memories, and the at least one program code is loaded by the one or more processors and executed to implement the method for recognizing a voice signal according to any of the above implementations.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, the at least one program code being loaded and executed by a processor to implement the method for recognizing a speech signal according to any of the above-mentioned implementations.
In another aspect, a computer program product is provided, which comprises at least one program code, which is loaded and executed by a processor, to implement the method for recognizing a speech signal according to any of the above-mentioned implementations.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
the embodiment of the application provides a voice signal identification method, because the similarity relation between a voice signal and an awakening signal is considered through the difference value of a first decoding parameter and a second decoding parameter, the decoding path information corresponding to the voice signal is considered through a plurality of parameters such as the decoding parameter corresponding to a first path, a plurality of first nodes on the first path, the decoding parameter of each first node and the like, so that the similarity relation between the voice signal and the awakening signal is considered, the decoding path information corresponding to the voice signal is also considered, the accuracy of an identification result is improved, and the false awakening rate is reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of a speech signal recognition method according to an embodiment of the present application;
fig. 3 is a flowchart of a speech signal recognition method according to an embodiment of the present application;
fig. 4 is a block diagram of a speech signal recognition apparatus according to an embodiment of the present application;
fig. 5 is a block diagram of a speech signal recognition apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "comprising" and "having," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes an electronic device 101 and a server 102. A client served by the server 102 is installed on the electronic device 101, and the user of the electronic device 101 can implement functions such as data transmission and voice interaction with the server 102 through the client. The client at least has the function of recognizing a voice signal, that is, determining whether the voice signal should wake up the electronic device 101; the client may also have functions such as voice control. The client may be a voice assistant, a voice control application, or the like.
In a possible implementation manner, the electronic device 101 recognizes a voice signal; after the voice signal is recognized as one that wakes up the electronic device 101, the electronic device 101 is woken up, then collects a further voice signal, recognizes the control instruction corresponding to the collected voice signal, and executes the operation corresponding to the control instruction. The electronic device 101 may recognize the control instruction corresponding to the newly collected voice signal itself, or may send the newly collected voice signal to the server 102, which recognizes the corresponding control instruction and returns it to the electronic device 101.
The electronic device 101 may be a computer, a mobile phone, a stereo, an air conditioner, a television, or other electronic devices. The server 102 may be a server, a server cluster composed of several servers, or a cloud computing service center.
Fig. 2 is a flowchart of a speech signal recognition method provided in an embodiment of the present application, and referring to fig. 2, the method includes:
201. receiving a target speech signal, and determining a plurality of speech frames included in the target speech signal.
202. Determining a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph, and determining a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph, wherein the first decoding graph comprises decoding paths corresponding to the plurality of basic speech signals, and the second decoding graph comprises decoding paths corresponding to the plurality of wake-up speech signals.
203. And under the condition that the difference value between the first decoding parameter and the second decoding parameter is not larger than the preset difference value, determining a plurality of first nodes included in the first path and the decoding parameter of each first node.
204. And determining the identification result of the target voice signal based on the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node, wherein the identification result is used for indicating whether the electronic equipment is awakened or not.
In one possible implementation, determining the recognition result of the target speech signal based on the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node includes:
and inputting the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node into a voice recognition model to obtain a recognition result of the target voice signal, wherein the voice recognition model is used for obtaining the recognition result based on the decoding parameters of the path, the plurality of nodes included by the path and the decoding parameters of each node.
In another possible implementation, a process of training a speech recognition model includes:
acquiring a sample voice signal, wherein the sample voice signal comprises a first voice signal and a second voice signal, the first voice signal is a voice signal corresponding to an awakening word, and the second voice signal is a voice signal corresponding to a non-awakening word;
training the initial recognition model based on the first voice signal and the second voice signal until the accuracy of the initial recognition model reaches a preset threshold value, and obtaining the voice recognition model.
In another possible implementation manner, training the initial recognition model based on the first speech signal and the second speech signal includes:
determining a third path of a plurality of speech frames included in the first speech signal in the first decoding graph and a fourth path of a plurality of speech frames included in the second speech signal in the first decoding graph;
determining first path information and second path information, wherein the first path information comprises decoding parameters of a third path, a plurality of third nodes included in the third path and decoding parameters of each third node, and the second path information comprises decoding parameters of a fourth path, a plurality of fourth nodes included in the fourth path and decoding parameters of each fourth node;
and training the initial recognition model based on the first path information and the second path information.
In another possible implementation, obtaining a sample speech signal includes:
receiving a voice signal corresponding to a wakeup word and a voice signal corresponding to a non-wakeup word;
and carrying out noise adding processing on the voice signals corresponding to the awakening words to obtain first voice signals, and carrying out noise adding processing on the voice signals corresponding to the non-awakening words to obtain second voice signals.
In another possible implementation manner, determining a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph includes:
determining decoding parameters of a plurality of decoding paths of a plurality of speech frames in a first decoding graph;
and determining the decoding parameter with the largest value as the first decoding parameter of the first path from the decoding parameters of the plurality of decoding paths.
In another possible implementation manner, determining decoding parameters of a plurality of decoding paths of a plurality of speech frames in a first decoding graph includes:
for each decoding path in the first decoding graph, determining a basic voice signal corresponding to the decoding path; determining a first language decoding parameter and a first acoustic decoding parameter of the plurality of speech frames under a decoding path, wherein the first language decoding parameter is used for representing the matching probability between the plurality of speech frames and a word sequence corresponding to a basic speech signal, the first acoustic decoding parameter is used for representing the matching probability between the plurality of speech frames and a first phoneme sequence, and the first phoneme sequence is obtained based on word sequence decomposition;
and determining the product of the first language decoding parameter and the first acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
In another possible implementation manner, determining a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph includes:
determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in a second decoding graph;
and determining the decoding parameter with the largest value as the second decoding parameter of the second path from the decoding parameters of the plurality of decoding paths.
In another possible implementation, one wake-up speech signal corresponds to one wake-up word sequence;
determining decoding parameters for a plurality of decoding paths of the plurality of speech frames in a second decoding graph, comprising:
for each decoding path in the second decoding graph, determining a wake-up voice signal corresponding to the decoding path;
determining a second language decoding parameter and a second acoustic decoding parameter of the plurality of voice frames under the decoding path, wherein the second language decoding parameter is used for representing the matching probability between the plurality of voice frames and an awakening word sequence corresponding to the awakening voice signal, the second acoustic decoding parameter is used for representing the matching probability between the plurality of voice frames and a second phoneme sequence, and the second phoneme sequence is obtained by decomposing based on the awakening word sequence;
and determining the product of the second language decoding parameter and the second acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
In another possible implementation manner, each voice frame includes a voice signal of a first preset duration;
determining a plurality of speech frames comprised by the target speech signal, comprising:
and dividing the target voice signal according to a preset period to obtain a plurality of voice frames included by the target voice signal.
In another possible implementation manner, determining the plurality of first nodes included in the first path and the decoding parameter of each first node includes:
determining a jump sequence of a plurality of first nodes included in a first path;
determining a voice frame corresponding to each first node according to the skipping sequence;
and determining the probability value of the phoneme corresponding to each first node being consistent with the phoneme corresponding to the speech frame, and taking the probability value as the decoding parameter of each first node.
The embodiment of the application provides a voice signal identification method. The similarity relation between the voice signal and the wake-up signal is considered through the difference value between the first decoding parameter and the second decoding parameter, and the decoding path information corresponding to the voice signal is considered through parameters such as the decoding parameter corresponding to the first path, the plurality of first nodes on the first path, and the decoding parameter of each first node. Because both the similarity relation and the decoding path information are taken into account, the accuracy of the identification result is improved and the false wake-up rate is reduced.
Fig. 3 is a flowchart of a speech signal recognition method provided in an embodiment of the present application, which is executed by an electronic device, and referring to fig. 3, the method includes:
301. the electronic device receives a target speech signal and determines a plurality of speech frames included in the target speech signal.
The electronic equipment has a dormant state and an awake state. When the electronic equipment is in the dormant state and is woken up by a voice signal, it switches from the dormant state to the awake state. In one possible implementation, the target speech signal is any speech signal received by the electronic device in the dormant state. Optionally, the voice signal is a voice signal corresponding to a wake-up word uttered by the user.
In one possible implementation, each speech frame includes a speech signal of a first preset duration. Correspondingly, the step of the electronic device determining the plurality of speech frames included in the target speech signal is as follows: the electronic equipment divides the target voice signal according to a preset period to obtain the plurality of voice frames included in the target voice signal. Optionally, the preset period equals the first preset duration, and the electronic equipment performs one division every first preset duration. In the embodiment of the present application, the value of the first preset duration is not specifically limited, and may be set and modified as needed. Optionally, the first preset duration is any value between 0.01s and 0.1s, for example: 0.01s, 0.05s, or 0.1s.
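As an illustrative sketch of the frame division described above (the function name, 16 kHz sample rate, and 0.01 s frame length are assumptions of the example, not fixed by the application), the target voice signal could be cut into frames of the first preset duration like this:

```python
def split_into_frames(samples, sample_rate=16000, frame_seconds=0.01):
    """Divide a speech signal into consecutive frames, each covering
    the first preset duration (frame_seconds)."""
    frame_len = int(sample_rate * frame_seconds)
    # Drop any trailing partial frame shorter than the preset duration.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

frames = split_into_frames(list(range(16000)))  # a 1 s signal at 16 kHz
print(len(frames))     # 100 frames
print(len(frames[0]))  # 160 samples per frame
```

With these assumed values, a 1 s signal yields 100 frames of 160 samples each; in this sketch a trailing partial frame is simply discarded.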
In one possible implementation manner, the target speech signal is a speech signal with a signal duration greater than a second preset duration. Correspondingly, the step of receiving the target voice signal by the electronic device is as follows: the electronic equipment receives the voice signal, determines the signal duration of the voice signal, and determines the voice signal as a target voice signal if the signal duration is greater than a second preset duration. In the embodiment of the present application, the value of the second preset duration is not specifically limited, and may be set and modified as needed. Optionally, the second preset time period is any value between 0.5s and 5s, for example: the second preset time period is 0.5s, 1s, 1.5s, etc.
In the embodiment of the application, the voice signal is determined to be the target voice signal only when its signal duration exceeds the second preset duration, so invalid voice signals that are too short can be screened out, the effectiveness of the target voice signal is improved, and the accuracy of the identification method is further improved.
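A minimal sketch of this duration screening (the 0.5 s default is one of the example values of the second preset duration above; the helper name is hypothetical):

```python
def is_valid_target(signal_seconds, min_seconds=0.5):
    """A received voice signal becomes the target voice signal only if
    its duration is greater than the second preset duration."""
    return signal_seconds > min_seconds

print(is_valid_target(1.2))  # True: long enough to be a target signal
print(is_valid_target(0.3))  # False: screened out as invalid
```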
302. The electronic device determines a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph, wherein the first decoding graph comprises decoding paths corresponding to the plurality of basic speech signals.
In one possible implementation, the first decoding graph is a basic decoding graph among WFST (Weighted Finite-State Transducer) decoding graphs, where the basic decoding graph includes decoding paths corresponding to a plurality of basic voice signals, and the plurality of basic voice signals include voice signals corresponding to wake-up words and voice signals corresponding to non-wake-up words. When the voice signal is decoded through the first decoding graph, the optimal path with the highest path score in the basic decoding graph is determined as the decoding path corresponding to the voice signal.
In one possible implementation manner, when the target speech signal is decoded through the first decoding graph, the value of the first decoding parameter of the first path in the first decoding graph is the maximum; the first decoding parameter is the path score of the first path, that is, the first path is the optimal path with the highest score for the target speech signal in the first decoding graph. Correspondingly, the step of the electronic device determining the first decoding parameter of the first path of the plurality of speech frames in the first decoding graph is: the electronic device determines decoding parameters of a plurality of decoding paths of the plurality of speech frames in the first decoding graph; and determines, from the decoding parameters of the plurality of decoding paths, the decoding parameter with the largest value as the first decoding parameter of the first path.
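The selection of the first decoding parameter reduces to taking the maximum over per-path scores. A sketch, with hypothetical path identifiers and score values:

```python
def best_path(path_scores):
    """Return (path_id, score) for the decoding path with the largest
    decoding parameter; that path is the first path and its score is
    the first decoding parameter."""
    return max(path_scores.items(), key=lambda kv: kv[1])

scores = {"path_a": 0.012, "path_b": 0.0279, "path_c": 0.004}
path_id, first_decoding_parameter = best_path(scores)
print(path_id, first_decoding_parameter)  # path_b 0.0279
```

The same maximum-selection applies unchanged to the second decoding graph in step 303.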
In one possible implementation manner, the electronic device determines the decoding parameters of the plurality of speech frames under each decoding path of the first decoding graph through the acoustic decoding parameters and the language decoding parameters. Correspondingly, the step of determining, by the electronic device, the decoding parameters of the multiple decoding paths of the multiple speech frames in the first decoding graph is: the electronic equipment determines a basic voice signal corresponding to each decoding path in the first decoding diagram; determining a first language decoding parameter and a first acoustic decoding parameter of the plurality of speech frames under the decoding path, wherein the first language decoding parameter is used for representing the matching probability between the plurality of speech frames and a word sequence corresponding to a basic speech signal, the first acoustic decoding parameter is used for representing the matching probability between the plurality of speech frames and a first phoneme sequence, and the first phoneme sequence is obtained by decomposing the word sequence corresponding to the basic speech signal; and determining the product of the first language decoding parameter and the first acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
Optionally, in the first decoding graph, the decoding parameters of the multiple speech frames under a decoding path represent the path score for decoding the multiple speech frames through that decoding path. In a possible implementation manner, the electronic device determines, through a linguistic model in a chain model, the matching probability between the plurality of speech frames and the word sequence corresponding to the basic speech signal, and determines this matching probability as the first language decoding parameter of the plurality of speech frames under the decoding path. The electronic equipment determines the matching probability between the plurality of speech frames and the first phoneme sequence through an acoustic model in the chain model, and determines this matching probability as the first acoustic decoding parameter of the plurality of speech frames under the decoding path.
In the embodiment of the application, the electronic device determines the decoding parameters of the decoding path through the linguistic model and the acoustic model in the chain model, so that the linguistic decoding parameters and the acoustic decoding parameters are comprehensively referred to, and the accuracy of the determined decoding parameters is improved.
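The per-path score of step 302 is stated as the product of the language decoding parameter and the acoustic decoding parameter. In the sketch below, the acoustic parameter is additionally taken as a product of per-frame phoneme match probabilities — that per-frame factorization is an assumption of the example, not something the application specifies:

```python
import math

def path_score(language_prob, frame_phoneme_probs):
    """Decoding parameter of the speech frames under one decoding path:
    language decoding parameter times acoustic decoding parameter."""
    acoustic_prob = math.prod(frame_phoneme_probs)  # assumed factorization
    return language_prob * acoustic_prob

score = path_score(0.6, [0.9, 0.8, 0.7])
print(round(score, 4))  # 0.3024
```

The identical computation applies in step 303 for the second decoding graph, with the wake-up word sequence and second phoneme sequence in place of the basic word sequence and first phoneme sequence.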
303. The electronic device determines a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph, where the second decoding graph includes decoding paths corresponding to the plurality of wake-up speech signals.
In one possible implementation, the WFST decoding graph can decode a wake-up voice signal to obtain the decoding path corresponding to that wake-up voice signal. The second decoding graph includes the decoding paths corresponding to a plurality of wake-up voice signals. A wake-up voice signal may be a voice signal corresponding to a wake-up word stored in the electronic device, and the stored wake-up word may be any wake-up word. For example: if the wake-up word stored in the electronic equipment is 'hello', the wake-up voice signal is the voice signal corresponding to the wake-up word 'hello'.
In one possible implementation, when the target speech signal is decoded through the second decoding graph, the value of the second decoding parameter of the second path in the second decoding graph is the maximum; the second decoding parameter is the path score of the second path, that is, the second path is the optimal path with the highest score for the target speech signal in the second decoding graph. Correspondingly, the step of the electronic device determining the second decoding parameter of the second path of the plurality of speech frames in the second decoding graph is: the electronic device determines decoding parameters of a plurality of decoding paths of the plurality of speech frames in the second decoding graph; and determines, from the decoding parameters of the plurality of decoding paths, the decoding parameter with the largest value as the second decoding parameter of the second path.
In one possible implementation manner, the electronic device determines the decoding parameters of the plurality of speech frames under each decoding path of the second decoding graph through the acoustic decoding parameters and the language decoding parameters. Correspondingly, the step of determining, by the electronic device, the decoding parameters of the multiple decoding paths of the multiple speech frames in the second decoding graph is: the electronic equipment determines a wake-up voice signal corresponding to each decoding path in the second decoding graph; determining a second language decoding parameter and a second acoustic decoding parameter of the plurality of voice frames under the decoding path, wherein the second language decoding parameter is used for representing the matching probability between the plurality of voice frames and an awakening word sequence corresponding to the awakening voice signal, the second acoustic decoding parameter is used for representing the matching probability between the plurality of voice frames and a second phoneme sequence, and the second phoneme sequence is obtained by decomposing the awakening word sequence; and determining the product of the second language decoding parameter and the second acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
Optionally, in the second decoding graph, the decoding parameters of the multiple speech frames under a decoding path represent the path score for decoding the multiple speech frames through that decoding path. In a possible implementation manner, the electronic device determines, through a linguistic model in a chain model, the matching probability between the plurality of speech frames and the wake-up word sequence corresponding to the wake-up speech signal, and determines this matching probability as the second language decoding parameter of the plurality of speech frames under the decoding path. The electronic equipment determines the matching probability between the plurality of speech frames and the second phoneme sequence through an acoustic model in the chain model, and determines this matching probability as the second acoustic decoding parameter of the plurality of speech frames under the decoding path.
In the embodiment of the application, the electronic device determines the decoding parameters of the decoding path through the linguistic model and the acoustic model in the chain model, so that the linguistic decoding parameters and the acoustic decoding parameters are comprehensively referred to, and the accuracy of the determined decoding parameters is improved.
It should be noted that, there is no necessary order between step 302 and step 303, and the electronic device may execute step 302 first and then execute step 303; step 303 may be performed first, then step 302 may be performed, or step 302 and step 303 may be performed simultaneously.
304. The electronic equipment determines a plurality of first nodes included in the first path and the decoding parameter of each first node under the condition that the difference value between the first decoding parameter and the second decoding parameter is not larger than the preset difference value.
In a possible implementation manner, if the target speech signal is a speech signal corresponding to a wake-up word, decoding the target speech signal through the first decoding graph and the second decoding graph yields a first decoding parameter and a second decoding parameter that are close in value; if the target voice signal is a voice signal corresponding to a non-wake-up word, decoding it through the first decoding graph and the second decoding graph yields a first decoding parameter and a second decoding parameter that differ greatly. Before the electronic device determines the plurality of first nodes included in the first path and the decoding parameter of each first node, it needs to determine whether the difference between the first decoding parameter and the second decoding parameter is greater than a preset difference. The electronic equipment determines the plurality of first nodes included in the first path and the decoding parameter of each first node only when the difference between the first decoding parameter and the second decoding parameter is not larger than the preset difference; when the difference is larger than the preset difference, the target voice signal is determined to be invalid, and the electronic equipment is not woken up. In this step, the value of the preset difference is not specifically limited, and may be set and modified as needed. Optionally, the preset difference is any value between 0.001 and 0.1, for example: 0.005, 0.05, or 0.1.
In the embodiment of the application, the electronic device performs primary judgment on the target voice signal through the difference value between the first decoding parameter and the second decoding parameter, that is, when the target voice signal meets the condition of voice awakening, the recognition result of the target voice signal is further determined according to the path information characteristic, so that the interference of invalid voice signals is effectively avoided, and the recognition efficiency of the recognition method is improved.
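The primary judgment in step 304 can be sketched as a simple threshold test (the 0.05 default is one of the example values of the preset difference; comparing the absolute difference is an assumption of the example):

```python
def passes_first_check(first_param, second_param, preset_diff=0.05):
    """Proceed to the path-information check only when the first and
    second decoding parameters are close to each other."""
    return abs(first_param - second_param) <= preset_diff

print(passes_first_check(0.0279, 0.031))  # True: inspect the first-path nodes
print(passes_first_check(0.0279, 0.30))   # False: signal treated as invalid
```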
In a possible implementation manner, the step of determining, by the electronic device, the plurality of first nodes included in the first path and the decoding parameter of each first node is: the electronic equipment determines the jump sequence of the plurality of first nodes included in the first path; determines the voice frame corresponding to each first node according to the jump sequence; and determines the probability value that the phoneme corresponding to each first node is consistent with the phoneme corresponding to its voice frame, taking that probability value as the decoding parameter of the first node. Optionally, a phoneme is the smallest phonetic unit, for example vowels and consonants in English, or initials and finals in Chinese.
In a possible implementation manner, the step of determining, by the electronic device, a jump sequence of the plurality of first nodes included in the first path includes: the electronic equipment decodes the voice frames in sequence according to the time sequence of the voice frames to obtain a plurality of first nodes; and determining the jump sequence of a plurality of first nodes included in the first path according to the decoding sequence.
For example, the time sequence of the plurality of speech frames is speech frame 1 → speech frame 2 → speech frame 3 → speech frame 4 → speech frame 5; decoding the plurality of voice frames in sequence to obtain a plurality of first nodes which are a node 1, a node 2, a node 3, a node 4 and a node 5; according to the decoding sequence, determining the jumping sequence of the plurality of first nodes as follows: node 1 → node 2 → node 3 → node 4 → node 5, and determines node 1 corresponding to speech frame 1, node 2 corresponding to speech frame 2, node 3 corresponding to speech frame 3, node 4 corresponding to speech frame 4, and node 5 corresponding to speech frame 5.
For example, the phonemes corresponding to the node 1, the node 2, the node 3, the node 4 and the node 5 in sequence are "x, i, ao, y, i"; determining a probability value that the phoneme x corresponding to the node 1 is consistent with the phoneme corresponding to the voice frame 1 to obtain a decoding parameter of the node 1; determining a probability value that the phoneme i corresponding to the node 2 is consistent with the phoneme corresponding to the voice frame 2 to obtain a decoding parameter of the node 2; determining a probability value that the phoneme ao corresponding to the node 3 is consistent with the phoneme corresponding to the voice frame 3 to obtain a decoding parameter of the node 3; determining a probability value that the phoneme y corresponding to the node 4 is consistent with the phoneme corresponding to the voice frame 4 to obtain a decoding parameter of the node 4; and determining the probability value of the phoneme i corresponding to the node 5 being consistent with the phoneme corresponding to the voice frame 5 to obtain the decoding parameter of the node 5.
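The per-node decoding parameters in the example above can be sketched as a lookup of each node's phoneme in the phoneme probabilities of its corresponding speech frame (the dictionary-of-posteriors representation and all probability values are illustrative, not from the application):

```python
def node_decoding_parameters(node_phonemes, frame_phoneme_probs):
    """For each first node, in jump order, take the probability that its
    phoneme is consistent with the phoneme of the corresponding speech
    frame; that probability is the node's decoding parameter."""
    return [frame_probs.get(phoneme, 0.0)
            for phoneme, frame_probs in zip(node_phonemes, frame_phoneme_probs)]

# Phonemes on nodes 1..5 in jump order, as in the "x, i, ao, y, i" example.
nodes = ["x", "i", "ao", "y", "i"]
# Per-frame phoneme probabilities (illustrative values).
frames = [{"x": 0.35}, {"i": 0.5}, {"ao": 0.25}, {"y": 0.75}, {"i": 0.85}]
print(node_decoding_parameters(nodes, frames))  # [0.35, 0.5, 0.25, 0.75, 0.85]
```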
305. The electronic equipment determines the identification result of the target voice signal based on the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node, wherein the identification result is used for indicating whether the electronic equipment is awakened or not.
In one possible implementation, the electronic device determines a recognition result of the target speech signal according to the speech recognition model. Correspondingly, the method comprises the following steps: the electronic equipment inputs the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node into a voice recognition model to obtain a recognition result of the target voice signal, and the voice recognition model is used for obtaining the recognition result based on the decoding parameters of the path, the plurality of nodes included in the path and the decoding parameters of each node. Optionally, the speech recognition model is a fully connected neural network model.
In a possible implementation manner, the electronic device inputs the first decoding parameter, the plurality of first nodes, and the feature vector corresponding to the decoding parameter of each first node into the speech recognition model, so as to obtain a recognition result of the target speech signal. Optionally, the first decoding parameter is a path score, and the decoding parameter of the first node is a node score. For example, the first decoding parameter is: 0.0279, the decoding parameters of the plurality of first nodes and each first node are: decoding parameter 0.35 for node 1 and node 1, decoding parameter 0.5 for node 2 and node 2, decoding parameter 0.25 for node 3 and node 3, decoding parameter 0.75 for node 4 and node 4, decoding parameter 0.85 for node 5 and node 5. The first decoding parameter, the plurality of first nodes and the feature vector corresponding to the decoding parameter of each first node are as follows: {0.0279, node 1, 0.35, node 2, 0.5, node 3, 0.25, node 4, 0.75, node 5, 0.85 }.
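A sketch of assembling the feature vector fed to the speech recognition model, mirroring the {0.0279, node 1, 0.35, ...} layout above (integer node identifiers stand in for the actual node representation, which the application does not specify):

```python
def build_feature_vector(path_score, node_ids, node_scores):
    """Interleave node identifiers and node scores after the path score,
    producing the input vector for the speech recognition model."""
    vec = [path_score]
    for node_id, score in zip(node_ids, node_scores):
        vec.extend([node_id, score])
    return vec

vec = build_feature_vector(0.0279, [1, 2, 3, 4, 5],
                           [0.35, 0.5, 0.25, 0.75, 0.85])
print(vec)  # [0.0279, 1, 0.35, 2, 0.5, 3, 0.25, 4, 0.75, 5, 0.85]
```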
In the embodiment of the application, the recognition result is determined through the voice recognition model, and the voice recognition model can determine the recognition result by combining the decoding parameters corresponding to the path, the plurality of nodes and the decoding parameters of the nodes, so that the decoding path information such as the decoding parameters corresponding to the path, the decoding parameters of the plurality of nodes and the like can be considered when the voice signal is recognized, and the accuracy of the determined recognition result is improved.
It should be noted that before obtaining the recognition result through the speech recognition model, the electronic device may first obtain a sample speech signal, and obtain the speech recognition model through training.
In one possible implementation manner, the process of the electronic device training the speech recognition model is as follows: the electronic equipment acquires a sample voice signal, wherein the sample voice signal includes a first voice signal and a second voice signal, the first voice signal is a voice signal corresponding to a wake-up word, and the second voice signal is a voice signal corresponding to a non-wake-up word; and trains the initial recognition model based on the first voice signal and the second voice signal until the accuracy of the initial recognition model reaches a preset threshold, obtaining the speech recognition model. Optionally, the non-wake-up word is a false wake-up word, that is, a non-wake-up word that can nevertheless wake up the electronic device under a prior-art voice wake-up method. In the embodiment of the present application, the value of the preset threshold is not specifically limited, and may be set and modified as needed. Optionally, the preset threshold is any value between 80% and 100%, for example: 85%, 90%, or 95%.
In one possible implementation manner, the step of acquiring, by the electronic device, the sample voice signal is: the electronic equipment receives a voice signal corresponding to the awakening word and a voice signal corresponding to the non-awakening word; and carrying out noise adding processing on the voice signals corresponding to the awakening words to obtain first voice signals, and carrying out noise adding processing on the voice signals corresponding to the non-awakening words to obtain second voice signals. Optionally, the noise is background noise. Correspondingly, the method comprises the following steps: the electronic equipment superposes the voice signal corresponding to the awakening word and the background noise signal to obtain a first voice signal, and superposes the voice signal corresponding to the non-awakening word and the background noise signal to obtain a second voice signal.
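The noise-adding step — superposing a background noise signal on the speech signal — can be sketched sample by sample (the noise gain is an assumption of the example; the application only states superposition):

```python
def add_background_noise(speech, noise, noise_gain=0.1):
    """Superpose a scaled background-noise signal on a speech signal,
    sample by sample, to produce a noise-augmented training sample."""
    return [s + noise_gain * n for s, n in zip(speech, noise)]

clean = [0.5, -0.2, 0.1]
noise = [1.0, 1.0, -1.0]
noisy = add_background_noise(clean, noise)
print([round(x, 6) for x in noisy])  # [0.6, -0.1, 0.0]
```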
In the embodiment of the application, the sample voice signal is obtained by adding noise to the voice signal, so that the obtained voice recognition model has higher anti-noise capability after the sample voice signal is trained, and the accuracy of the determined recognition result based on the voice recognition model is further improved.
In one possible implementation, the initial recognition model is trained according to decoding path information of the first speech signal and the second speech signal in the first decoding graph. Correspondingly, the step of training the initial recognition model by the electronic device based on the first speech signal and the second speech signal is as follows: the electronic device determines a third path of a plurality of voice frames included in the first voice signal in the first decoding diagram and a fourth path of a plurality of voice frames included in the second voice signal in the first decoding diagram; determining first path information and second path information, wherein the first path information comprises decoding parameters of a third path, a plurality of third nodes included in the third path and decoding parameters of each third node, and the second path information comprises decoding parameters of a fourth path, a plurality of fourth nodes included in the fourth path and decoding parameters of each fourth node; and training the initial recognition model based on the first path information and the second path information.
Optionally, the initial identification model is a fully-connected neural network model, and the input sample is a feature vector corresponding to the first path information and the second path information. Correspondingly, the step of training the initial recognition model by the electronic device based on the first path information and the second path information is as follows: the electronic equipment determines a feature vector corresponding to the first path information and a feature vector corresponding to the second path information, inputs the feature vector corresponding to the first path information and the feature vector corresponding to the second path information into an initial recognition model, and trains the initial recognition model.
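As a stand-in for training the fully connected initial recognition model (the application does not give the training algorithm), the sketch below fits a single-layer logistic classifier on hypothetical path-information feature vectors, labeling wake-word samples 1 and non-wake-word samples 0:

```python
import math

def train_initial_model(samples, labels, epochs=200, lr=0.5):
    """Minimal logistic-regression stand-in for the initial recognition
    model, trained by stochastic gradient descent."""
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted wake probability
            g = p - y                        # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, features):
    """True means the recognition result indicates waking the device."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, features)) + b
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

# Toy path-information vectors: wake-word paths score high, others low.
X = [[0.9, 0.8], [0.85, 0.9], [0.1, 0.2], [0.2, 0.1]]
y = [1, 1, 0, 0]
model = train_initial_model(X, y)
print([predict(model, x) for x in X])  # [True, True, False, False]
```

In the application's scheme the model would instead be a fully connected neural network trained until the accuracy reaches the preset threshold; this sketch only illustrates the supervised setup with positive and negative sample signals.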
Optionally, the decoding parameter of a path is a path score, and the decoding parameter of a node is a node score. For example, the first path information includes: decoding parameter 0.025 of the third path, decoding parameter 0.25 of the third node A, ..., decoding parameter 0.5 of the third node P; the feature vector corresponding to the first path information is {0.025, A, 0.25, ..., P, 0.5}. The number of the plurality of third nodes is positively correlated with the number of the plurality of speech frames included in the first speech signal. For example, the second path information includes: decoding parameter 0.015 of the fourth path, decoding parameter 0.35 of the fourth node a, ..., decoding parameter 0.45 of the fourth node p; the feature vector corresponding to the second path information is {0.015, a, 0.35, ..., p, 0.45}. The number of the plurality of fourth nodes is positively correlated with the number of the plurality of speech frames included in the second speech signal.
In one possible implementation manner, there are multiple sample speech signals, and the feature vectors corresponding to the multiple sample speech signals all have the same dimension. Correspondingly, the step of determining, by the electronic device, the feature vectors corresponding to the first path information and the second path information is: the electronic equipment determines the dimension of the feature vector corresponding to the first path information and the second path information; if the dimension is smaller than a preset dimension, the feature vector is padded to the preset dimension, and if the dimension is larger than the preset dimension, the feature vector is truncated to the preset dimension. Optionally, the value filled into each padded dimension is 0. In the embodiment of the present application, the numerical value of the preset dimension is not specifically limited, and may be set and modified as needed. Optionally, the preset dimension is any value between 30 and 100 dimensions, for example: 50 dimensions, 60 dimensions, or 80 dimensions.
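The dimension alignment can be sketched as pad-or-truncate (a preset dimension of 5 is used here for readability; the application suggests values between 30 and 100):

```python
def to_preset_dimension(vec, preset_dim=5, pad_value=0.0):
    """Pad with zeros or truncate so every sample feature vector has
    the same preset dimension before training."""
    if len(vec) < preset_dim:
        return vec + [pad_value] * (preset_dim - len(vec))
    return vec[:preset_dim]

short = to_preset_dimension([0.025, 0.25, 0.5])
clipped = to_preset_dimension([0.1] * 8)
print(short)        # [0.025, 0.25, 0.5, 0.0, 0.0]
print(len(clipped)) # 5
```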
In the embodiment of the application, because the dimensions of the feature vectors corresponding to the plurality of sample voice signals are the same, when the initial recognition model is trained through the feature vectors with the same dimensions, the influence of the dimensions on the training result is avoided, and the efficiency of training the voice recognition model is improved.
306. The electronic device is woken up when the recognition result indicates that the electronic device is to be woken up, and is not woken up when the recognition result indicates that the electronic device is not to be woken up.
In one possible implementation, the electronic device is woken up through a wake-up module. Correspondingly, the step is: the electronic equipment sends a wake-up instruction to the wake-up module when the recognition result indicates that the electronic equipment is to be woken up, and the wake-up module receives the wake-up instruction and wakes up the electronic equipment; the electronic equipment does not send the wake-up instruction to the wake-up module when the recognition result indicates that the electronic equipment is not to be woken up.
In a possible implementation manner, after the electronic device is awakened, the electronic device is in an awakening state, the electronic device collects a new voice signal, and identifies the new voice signal to obtain a control instruction corresponding to the new voice signal; and controlling the electronic equipment to execute related operation or feedback according to the control instruction so as to realize the control of the voice signal.
The embodiment of the application provides a voice signal identification method. The similarity relation between the voice signal and the wake-up signal is considered through the decoding parameter corresponding to the first path and the decoding parameter corresponding to the second path, and the decoding path information corresponding to the voice signal is considered through parameters such as the decoding parameter corresponding to the first path, the plurality of first nodes on the first path, and the decoding parameter of each first node. Because both the similarity relation and the decoding path information are taken into account, the accuracy of the identification result is improved and the false wake-up rate is reduced.
Fig. 4 is a block diagram of a speech signal recognition apparatus provided in an embodiment of the present application, and referring to fig. 4, the apparatus includes:
a receiving module 401, configured to receive a target speech signal, and determine a plurality of speech frames included in the target speech signal;
a first determining module 402, configured to determine a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph, and determine a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph, where the first decoding graph includes decoding paths corresponding to a plurality of basic speech signals, and the second decoding graph includes decoding paths corresponding to a plurality of wake-up speech signals;
a second determining module 403, configured to determine, when a difference between the first decoding parameter and the second decoding parameter is not greater than a preset difference, a plurality of first nodes included in the first path and a decoding parameter of each first node;
a third determining module 404, configured to determine a recognition result of the target speech signal based on the first decoding parameter, the plurality of first nodes, and the decoding parameter of each first node, where the recognition result is used to indicate whether to wake up the electronic device.
In a possible implementation manner, the third determining module 404 is configured to input the first decoding parameter, the plurality of first nodes, and the decoding parameter of each first node into a speech recognition model to obtain a recognition result of the target speech signal, where the speech recognition model is configured to obtain the recognition result based on the decoding parameter of the path, the plurality of nodes included in the path, and the decoding parameter of each node.
In another possible implementation manner, referring to fig. 5, the apparatus further includes a training module 405, where the training module 405 includes:
the acquiring unit 4051 is configured to acquire a sample voice signal, where the sample voice signal includes a first voice signal and a second voice signal, the first voice signal is a voice signal corresponding to a wake-up word, and the second voice signal is a voice signal corresponding to a non-wake-up word;
the training unit 4052 is configured to train the initial recognition model based on the first voice signal and the second voice signal until the accuracy of the initial recognition model reaches a preset threshold, so as to obtain a voice recognition model.
In another possible implementation manner, the training unit 4052 is configured to determine a third path of the multiple speech frames included in the first speech signal in the first decoding graph, and a fourth path of the multiple speech frames included in the second speech signal in the first decoding graph; determine first path information and second path information, where the first path information includes the decoding parameter of the third path, a plurality of third nodes included in the third path, and the decoding parameter of each third node, and the second path information includes the decoding parameter of the fourth path, a plurality of fourth nodes included in the fourth path, and the decoding parameter of each fourth node; and train the initial recognition model based on the first path information and the second path information.
In another possible implementation manner, the obtaining unit 4051 is configured to receive a voice signal corresponding to a wake-up word and a voice signal corresponding to a non-wake-up word; and perform noise-adding processing on the voice signal corresponding to the wake-up word to obtain the first voice signal, and perform noise-adding processing on the voice signal corresponding to the non-wake-up word to obtain the second voice signal.
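The noise-adding processing above can be sketched as follows. The application does not specify how noise is added; mixing in SNR-scaled Gaussian white noise is a common, illustrative choice, and the `snr_db` parameter and the fixed seed are assumptions introduced here.

```python
import math
import random

def add_noise(signal, snr_db, seed=0):
    """Mix zero-mean Gaussian noise into a clean waveform at a target
    signal-to-noise ratio (in dB), producing an augmented training signal."""
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    # Average power of the clean signal.
    power = sum(s * s for s in signal) / len(signal)
    # Noise power implied by the requested SNR.
    noise_power = power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]
```

The same function would be applied to both the wake-up-word and non-wake-up-word recordings to obtain the first and second voice signals.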
In another possible implementation manner, the first determining module 402 is configured to determine decoding parameters of a plurality of decoding paths of a plurality of speech frames in a first decoding graph; and determining the decoding parameter with the largest value as the first decoding parameter of the first path from the decoding parameters of the plurality of decoding paths.
In another possible implementation manner, the first determining module 402 is configured to determine, for each decoding path in the first decoding graph, a basic speech signal corresponding to the decoding path; determine a first language decoding parameter and a first acoustic decoding parameter of the plurality of speech frames under the decoding path, where the first language decoding parameter represents the matching probability between the plurality of speech frames and a word sequence corresponding to the basic speech signal, the first acoustic decoding parameter represents the matching probability between the plurality of speech frames and a first phoneme sequence, and the first phoneme sequence is obtained by decomposing the word sequence; and determine the product of the first language decoding parameter and the first acoustic decoding parameter to obtain the decoding parameter of the plurality of speech frames under the decoding path.
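The per-path score and the selection of the first path (the path with the largest decoding parameter) can be sketched as follows. The path identifiers and probability values are hypothetical examples, not data from the application.

```python
def path_decoding_parameter(language_param, acoustic_param):
    """Decoding parameter of one path: the product of the language
    decoding parameter (word-sequence match probability) and the
    acoustic decoding parameter (phoneme-sequence match probability)."""
    return language_param * acoustic_param

def first_path(paths):
    """Select the path with the largest decoding parameter.
    `paths` maps a hypothetical path id to a (language, acoustic) pair."""
    scores = {pid: path_decoding_parameter(lm, am)
              for pid, (lm, am) in paths.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

The same product-then-maximum computation applies to the second decoding graph when determining the second path.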
In another possible implementation manner, the first determining module 402 is further configured to determine decoding parameters of a plurality of decoding paths of the plurality of speech frames in the second decoding graph; and determine, from the decoding parameters of the plurality of decoding paths, the decoding parameter with the largest value as the second decoding parameter of the second path.
In another possible implementation manner, the first determining module 402 is further configured to determine, for each decoding path in the second decoding graph, a wake-up speech signal corresponding to the decoding path; determine a second language decoding parameter and a second acoustic decoding parameter of the plurality of speech frames under the decoding path, where the second language decoding parameter represents the matching probability between the plurality of speech frames and a wake-up word sequence corresponding to the wake-up speech signal, the second acoustic decoding parameter represents the matching probability between the plurality of speech frames and a second phoneme sequence, and the second phoneme sequence is obtained by decomposing the wake-up word sequence; and determine the product of the second language decoding parameter and the second acoustic decoding parameter to obtain the decoding parameter of the plurality of speech frames under the decoding path.
In another possible implementation manner, each voice frame includes a voice signal of a first preset duration;
the receiving module 401 is configured to divide the target speech signal according to a preset period to obtain a plurality of speech frames included in the target speech signal.
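The division of the target voice signal into frames of the first preset duration can be sketched as follows. Non-overlapping frames are an assumption made here for simplicity; real speech front ends often use overlapping windows.

```python
def split_into_frames(samples, sample_rate, frame_ms):
    """Divide a sampled signal into consecutive frames of `frame_ms`
    milliseconds each (the 'first preset duration'). Any trailing
    samples shorter than one frame are dropped (an assumption)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

For example, one second of audio at 16 kHz split into 25 ms frames yields 40 frames of 400 samples each.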
In another possible implementation manner, the second determining module 403 is configured to determine the jump sequence of the plurality of first nodes included in the first path; determine the speech frame corresponding to each first node according to the jump sequence; and determine a probability value that the phoneme corresponding to each first node is consistent with the phoneme corresponding to the speech frame, taking that probability value as the decoding parameter of each first node.
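The per-node lookup above can be sketched as follows. The one-node-per-frame alignment and the source of the per-frame phoneme probabilities are simplifying assumptions; in practice the alignment comes from the decoding itself and the probabilities from an acoustic model.

```python
def node_decoding_parameters(jump_sequence, frame_posteriors):
    """For each first node, taken in jump (traversal) order, look up the
    probability that the node's phoneme matches the phoneme of its
    corresponding speech frame.

    jump_sequence    : phoneme label of each node, in jump order.
    frame_posteriors : one dict per frame, mapping phoneme -> probability
                       (assumed to come from an acoustic model).
    """
    return [frame_posteriors[i].get(phoneme, 0.0)
            for i, phoneme in enumerate(jump_sequence)]
```

The resulting list is the set of per-node decoding parameters that, together with the first decoding parameter, is fed to the recognition model.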
The embodiment of the application provides a voice signal recognition apparatus. The decoding parameter corresponding to the first path and the decoding parameter corresponding to the second path reflect the similarity between the voice signal and the wake-up signal, while the decoding parameter corresponding to the first path, the plurality of first nodes on the first path, and the decoding parameter of each first node reflect the decoding-path information of the voice signal. Because both the similarity to the wake-up signal and the decoding-path information are taken into account, the accuracy of the recognition result is improved and the false wake-up rate is reduced.
Fig. 6 shows a block diagram of an electronic device 600 according to an exemplary embodiment of the present invention. The electronic device 600 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The electronic device 600 may also be referred to by other names such as user equipment, portable electronic device, laptop electronic device, desktop electronic device, and so on.
In general, the electronic device 600 includes: a processor 601 and a memory 602.
The processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the speech signal recognition methods provided by the method embodiments herein.
In some embodiments, the electronic device 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over its surface. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 605, providing the front panel of the electronic device 600; in other embodiments, there may be at least two displays 605, respectively disposed on different surfaces of the electronic device 600 or in a foldable design; in still other embodiments, the display 605 may be a flexible display disposed on a curved surface or a folded surface of the electronic device 600. Moreover, the display 605 may be arranged in an irregular, non-rectangular pattern, i.e., a shaped screen. The display 605 may be an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode) display, or the like.
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of an electronic apparatus, and a rear camera is disposed on a rear surface of the electronic apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and disposed at different locations of the electronic device 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used to locate the current geographic location of the electronic device 600 to implement navigation or LBS (Location Based Service). The positioning component 608 may be a positioning component based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
The power supply 609 is used to supply power to various components in the electronic device 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the electronic device 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of a coordinate system established with the electronic device 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the display screen 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the electronic device 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the electronic device 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side bezel of the electronic device 600 and/or on a lower layer of the display screen 605. When the pressure sensor 613 is disposed on a side frame of the electronic device 600, a user's holding signal of the electronic device 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be disposed on the front, back, or side of the electronic device 600. When a physical button or vendor Logo is provided on the electronic device 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, processor 601 may control the display brightness of display screen 605 based on the ambient light intensity collected by optical sensor 615. Specifically, when the ambient light intensity is high, the display brightness of the display screen 605 is increased; when the ambient light intensity is low, the display brightness of the display screen 605 is adjusted down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
The proximity sensor 616, also referred to as a distance sensor, is typically disposed on the front panel of the electronic device 600. The proximity sensor 616 is used to capture the distance between the user and the front of the electronic device 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front of the electronic device 600 gradually decreases, the processor 601 controls the display 605 to switch from the screen-on state to the screen-off state; when the proximity sensor 616 detects that the distance between the user and the front of the electronic device 600 gradually increases, the processor 601 controls the display 605 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not constitute a limitation of the electronic device 600, and may include more or fewer components than those shown, or combine certain components, or employ a different arrangement of components.
The embodiment of the present application further provides a computer-readable storage medium, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the method for recognizing a speech signal according to any of the above-mentioned implementations.
An embodiment of the present application further provides a computer program product, which includes at least one program code, and the at least one program code is loaded and executed by a processor to implement the method for recognizing a speech signal according to any of the above implementation manners.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device, on multiple computer devices located at one site, or on multiple computer devices distributed at multiple sites and interconnected by a communication network; the multiple computer devices distributed at multiple sites and interconnected by a communication network may constitute a blockchain system.
The present application is intended to cover various modifications, alternatives, and equivalents, which may be included within the spirit and scope of the present application.

Claims (15)

1. A method for recognizing a speech signal, the method comprising:
receiving a target voice signal, and determining a plurality of voice frames included in the target voice signal;
determining a first decoding parameter of a first path of the plurality of speech frames in a first decoding graph, and determining a second decoding parameter of a second path of the plurality of speech frames in a second decoding graph, wherein the first decoding graph comprises decoding paths corresponding to a plurality of basic speech signals, and the second decoding graph comprises decoding paths corresponding to a plurality of wake-up speech signals;
determining a plurality of first nodes included in the first path and a decoding parameter of each first node under the condition that a difference value between the first decoding parameter and the second decoding parameter is not larger than a preset difference value;
determining a recognition result of the target voice signal based on the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node, wherein the recognition result is used for indicating whether to wake up the electronic equipment.
2. The method according to claim 1, wherein the determining the recognition result of the target speech signal based on the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node comprises:
and inputting the first decoding parameters, the plurality of first nodes and the decoding parameters of each first node into a voice recognition model to obtain a recognition result of the target voice signal, wherein the voice recognition model is used for obtaining the recognition result based on the decoding parameters of the path, the plurality of nodes included in the path and the decoding parameters of each node.
3. The method of claim 2, wherein training the speech recognition model comprises:
acquiring a sample voice signal, wherein the sample voice signal comprises a first voice signal and a second voice signal, the first voice signal is a voice signal corresponding to a wakeup word, and the second voice signal is a voice signal corresponding to a non-wakeup word;
training an initial recognition model based on the first voice signal and the second voice signal until the accuracy of the initial recognition model reaches a preset threshold value, and obtaining the voice recognition model.
4. The method of claim 3, wherein training an initial recognition model based on the first speech signal and the second speech signal comprises:
determining a third path of a plurality of speech frames included in the first speech signal in the first decoding graph and a fourth path of a plurality of speech frames included in the second speech signal in the first decoding graph;
determining first path information and second path information, wherein the first path information comprises decoding parameters of a third path, a plurality of third nodes included in the third path and decoding parameters of each third node, and the second path information comprises decoding parameters of a fourth path, a plurality of fourth nodes included in the fourth path and decoding parameters of each fourth node;
and training an initial recognition model based on the first path information and the second path information.
5. The method of claim 3, wherein the obtaining a sample speech signal comprises:
receiving a voice signal corresponding to a wakeup word and a voice signal corresponding to a non-wakeup word;
and performing noise-adding processing on the voice signal corresponding to the wakeup word to obtain the first voice signal, and performing noise-adding processing on the voice signal corresponding to the non-wakeup word to obtain the second voice signal.
6. The method of claim 1, wherein determining the first decoding parameter for the first path of the plurality of speech frames in the first decoding graph comprises:
determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in the first decoding graph;
and determining the decoding parameter with the largest value as the first decoding parameter of the first path from the decoding parameters of the plurality of decoding paths.
7. The method of claim 6, wherein the determining decoding parameters for a plurality of decoding paths of the plurality of speech frames in the first decoding graph comprises:
for each decoding path in the first decoding graph, determining a basic speech signal corresponding to the decoding path;
determining a first language decoding parameter and a first acoustic decoding parameter of the plurality of speech frames under the decoding path, wherein the first language decoding parameter is used for representing the matching probability between the plurality of speech frames and a word sequence corresponding to the basic speech signal, the first acoustic decoding parameter is used for representing the matching probability between the plurality of speech frames and a first phoneme sequence, and the first phoneme sequence is obtained based on the word sequence decomposition;
and determining the product of the first language decoding parameter and the first acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
8. The method of claim 1, wherein determining the second decoding parameter for the second path of the plurality of speech frames in the second decoding graph comprises:
determining decoding parameters of a plurality of decoding paths of the plurality of speech frames in the second decoding graph;
and determining the decoding parameter with the largest value as the second decoding parameter of the second path from the decoding parameters of the plurality of decoding paths.
9. The method of claim 8, wherein the determining decoding parameters for a plurality of decoding paths of the plurality of speech frames in the second decoding graph comprises:
for each decoding path in the second decoding graph, determining a wake-up voice signal corresponding to the decoding path;
determining a second language decoding parameter and a second acoustic decoding parameter of the plurality of voice frames under the decoding path, wherein the second language decoding parameter is used for representing the matching probability between the plurality of voice frames and a wakeup word sequence corresponding to the wakeup voice signal, the second acoustic decoding parameter is used for representing the matching probability between the plurality of voice frames and a second phoneme sequence, and the second phoneme sequence is obtained by decomposing based on the wakeup word sequence;
and determining the product of the second language decoding parameter and the second acoustic decoding parameter to obtain the decoding parameters of the plurality of speech frames under the decoding path.
10. The method of claim 1, wherein each speech frame comprises a speech signal of a first preset duration;
the determining a plurality of speech frames included in the target speech signal comprises:
and dividing the target voice signal according to a preset period to obtain a plurality of voice frames included by the target voice signal.
11. The method of claim 1, wherein determining the plurality of first nodes included in the first path and the decoding parameters of each first node comprises:
determining a jump sequence of a plurality of first nodes included in the first path;
determining the voice frame corresponding to each first node according to the jump sequence;
and determining a probability value that the phoneme corresponding to each first node is consistent with the phoneme corresponding to the voice frame, and taking the probability value as a decoding parameter of each first node.
12. An apparatus for recognizing a speech signal, the apparatus comprising:
the receiving module is used for receiving a target voice signal and determining a plurality of voice frames included in the target voice signal;
a first determining module, configured to determine a first decoding parameter of a first path of the multiple speech frames in a first decoding graph, and determine a second decoding parameter of a second path of the multiple speech frames in a second decoding graph, where the first decoding graph includes decoding paths corresponding to multiple base speech signals, and the second decoding graph includes decoding paths corresponding to multiple wake-up speech signals;
a second determining module, configured to determine, when a difference between the first decoding parameter and the second decoding parameter is not greater than a preset difference, a plurality of first nodes included in the first path and a decoding parameter of each first node;
a third determining module, configured to determine, based on the first decoding parameter, the plurality of first nodes, and the decoding parameter of each first node, a recognition result of the target speech signal, where the recognition result is used to indicate whether to wake up the electronic device.
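Putting the modules of claim 12 together, the wake-up decision could be sketched as below. The threshold names (`max_diff`, `node_floor`, `path_floor`) and the final rule, requiring the path parameter and every node parameter to clear a floor, are assumptions: the claims leave the exact combination of the first decoding parameter and the node parameters open:

```python
def should_wake(first_param: float, second_param: float,
                node_params: list[float],
                max_diff: float = 0.1,
                node_floor: float = 0.5,
                path_floor: float = 0.5) -> bool:
    """Decide whether the target speech signal wakes the device.

    Only when the two path decoding parameters are within max_diff of
    each other is the first path inspected node by node, mirroring the
    second determining module's condition.
    """
    if abs(first_param - second_param) > max_diff:
        return False
    return first_param >= path_floor and all(p >= node_floor for p in node_params)

# Close path scores and confident nodes -> wake up
wake = should_wake(0.9, 0.85, [0.8, 0.7, 0.9])
```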
13. An electronic device, characterized in that the electronic device comprises:
a processor and a memory, the memory storing at least one program code, the at least one program code being loaded and executed by the processor to implement the method for recognizing a speech signal according to any one of claims 1 to 11.
14. A computer-readable storage medium, characterized in that at least one program code is stored in the storage medium, the at least one program code being loaded and executed by a processor to implement the method for recognizing a speech signal according to any one of claims 1 to 11.
15. A computer program product, characterized in that the computer program product comprises at least one program code, the at least one program code being loaded and executed by a processor to implement the method for recognizing a speech signal according to any one of claims 1 to 11.
CN202111539867.0A 2021-12-15 2021-12-15 Voice signal recognition method and device, electronic equipment, storage medium and product Pending CN114299945A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111539867.0A CN114299945A (en) 2021-12-15 2021-12-15 Voice signal recognition method and device, electronic equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN114299945A 2022-04-08

Family

ID=80968538

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111539867.0A Pending CN114299945A (en) 2021-12-15 2021-12-15 Voice signal recognition method and device, electronic equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN114299945A (en)

Similar Documents

Publication Publication Date Title
CN111933112B (en) Awakening voice determination method, device, equipment and medium
CN112907725B (en) Image generation, training of image processing model and image processing method and device
CN111696570B (en) Voice signal processing method, device, equipment and storage medium
CN108922531B (en) Slot position identification method and device, electronic equipment and storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111681655A (en) Voice control method and device, electronic equipment and storage medium
CN111613213B (en) Audio classification method, device, equipment and storage medium
CN114299933A (en) Speech recognition model training method, device, equipment, storage medium and product
CN114333774A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN111341317B (en) Method, device, electronic equipment and medium for evaluating wake-up audio data
CN111862972A (en) Voice interaction service method, device, equipment and storage medium
CN110992954A (en) Method, device, equipment and storage medium for voice recognition
CN110837557A (en) Abstract generation method, device, equipment and medium
CN113362836B (en) Vocoder training method, terminal and storage medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN113782025B (en) Speech recognition method, device, terminal and storage medium
CN114333821A (en) Elevator control method, device, electronic equipment, storage medium and product
CN115035187A (en) Sound source direction determining method, device, terminal, storage medium and product
CN113162837B (en) Voice message processing method, device, equipment and storage medium
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN111145723B (en) Method, device, equipment and storage medium for converting audio
CN111028846B (en) Method and device for registration of wake-up-free words
CN111063372B (en) Method, device and equipment for determining pitch characteristics and storage medium
CN111681654A (en) Voice control method and device, electronic equipment and storage medium
CN114299945A (en) Voice signal recognition method and device, electronic equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination