CN118098206A

CN118098206A - Command word score calculating method, device, equipment and medium

Info

Publication number: CN118098206A
Application number: CN202410464876.5A
Authority: CN
Inventors: 李�杰
Original assignee: Shenzhen Youjie Zhixin Technology Co ltd
Current assignee: Shenzhen Youjie Zhixin Technology Co ltd
Priority date: 2024-04-18
Filing date: 2024-04-18
Publication date: 2024-05-28

Abstract

The invention belongs to the field of speech technology and discloses a command word score calculating method, device, equipment and medium. The method comprises the following steps: caching the result output by the voice recognition network to form a decoding matrix, wherein the shape of the decoding matrix is T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the extra 1 corresponding to the blank class; constructing, according to the decoding matrix, a matrix for calculating the score of a preset command word, and taking this matrix as a first matrix, wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the extra 1 corresponding to the blank class; performing recursion based on the first matrix and calculating two values for each node reached by the recursion, namely the total probability of a phoneme when the node is reached and the total probability of blank when the node is reached; determining the node used for calculating the score of the preset command word, and adding the two values of that node to obtain the score of the preset command word. The invention can reduce the omission of feasible paths and improve the accuracy of command word recognition.

Description

Command word score calculating method, device, equipment and medium

Technical Field

The present application relates to the field of speech technologies, and in particular, to a command word score calculating method, device, apparatus, and medium.

Background

Command word recognition is a form of voice recognition and is widely applied in the smart home field, for example in intelligent voice speakers, intelligent voice headphones, intelligent voice lamps and intelligent voice fans. For cost reasons, and compared with intelligent devices such as mobile phones, embedded devices have low computing power and small memory and flash. Because the ctc decoding algorithm needs no alignment for sequence tasks, decodes efficiently and saves memory, voice recognition on common embedded devices generally adopts the ctc decoding algorithm. The general decoding algorithm calculates the path score according to the forward algorithm, which roughly amounts to merging repeated symbols and then removing blanks, so as to calculate the score of the command word. A drawback of this approach is as follows. Take the phoneme sequence d a k ai corresponding to the command word "open". According to the existing ctc decoding method, the feasible path can only be, for example, d a _ _ k ai or d _ _ a k ai, but in reality the path d a _ a k ai is also possible (each such sequence lists the phoneme or blank, denoted by _, corresponding to the maximum value of each column of the ctc decoding matrix), and this path may even be the main path. Because of the repeat-merging and blank-removal rules of the ctc decoding method, this path is missed, and since multiple paths may map to one command word, missing it lowers the score of the command word "open". Therefore, how to solve the problem that the existing ctc decoding algorithm misses feasible paths, which reduces the accuracy of command word recognition, is a technical problem that needs to be solved at present.
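The missed path described above can be illustrated with a minimal sketch of the standard ctc mapping rule (merge consecutive repeats, then drop blanks). The function name `ctc_collapse`, the six-frame paths and the use of "_" for blank are assumptions made for illustration only.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol


def ctc_collapse(path):
    """Standard ctc mapping: merge consecutive repeats, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != BLANK]


target = ["d", "a", "k", "ai"]
paths = (
    ["d", "a", "_", "_", "k", "ai"],
    ["d", "_", "_", "a", "k", "ai"],
    ["d", "a", "_", "a", "k", "ai"],  # the path that is missed
)
for path in paths:
    collapsed = ctc_collapse(path)
    print(" ".join(path), "->", " ".join(collapsed), collapsed == target)
```

Under this rule the third path collapses to d a a k ai rather than d a k ai, because the blank separates the two a frames, so it does not contribute to the score of the command word.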

Disclosure of Invention

The invention mainly aims to provide a command word score calculating method, device, equipment and medium, and aims to solve the problem that the accuracy of command word recognition is reduced because a feasible path is missed by the conventional ctc decoding algorithm.

In order to achieve the above object, a first aspect of the present invention provides a command word score calculating method, the method comprising:

Caching the result output by the voice recognition network to form a decoding matrix; wherein the decoding matrix has a shape of T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the extra 1 corresponding to the blank class;

Constructing a matrix for calculating the score of the preset command word according to the decoding matrix, and marking the matrix for calculating the score of the preset command word as a first matrix; wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the extra 1 corresponding to the blank class;

performing recursion based on the first matrix, and calculating two values for each recursively-obtained node; one value is the total probability of a phoneme when reaching the node, and the other value is the total probability of a blank when reaching the node;

determining a node for calculating the score of the preset command word, and adding the two values of the node for calculating the score of the preset command word to obtain the score of the preset command word.

Further, the speech recognition network outputs the probability of each phoneme and the probability of blank at each moment, and the step of caching the result output by the speech recognition network to form a decoding matrix includes: filling, in time order, the probabilities of blank output at each moment into the first row of a first blank matrix, filling the probabilities of the first phoneme output at each moment into the second row of the first blank matrix, filling the probabilities of the second phoneme output at each moment into the third row of the first blank matrix, and so on, until the probabilities of the last phoneme (the (C-1)-th phoneme, since C includes the blank class) output at each moment are filled into the last row of the first blank matrix to form the decoding matrix; wherein the first row is the uppermost row of the decoding matrix.
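The caching step above can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_decoding_matrix`, the representation of each network output as a Python list of C probabilities, and the `blank_id` parameter (the index of blank in the network output) are all hypothetical.

```python
def build_decoding_matrix(frames, blank_id=0):
    """Cache per-moment outputs into a matrix with blank in the first row.

    frames: one list of C probabilities per moment, in time order.
    Returns C rows of length T: row 0 holds blank, the remaining rows hold
    the phonemes in their original relative order.
    """
    if not frames:
        return []
    num_classes = len(frames[0])
    # Row order of the result: blank first, then every other class.
    order = [blank_id] + [c for c in range(num_classes) if c != blank_id]
    return [[frame[c] for frame in frames] for c in order]


frames = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]  # T = 2, C = 3 (made-up values)
for row in build_decoding_matrix(frames):
    print(row)
```

Each column of the result corresponds to one moment, so a device could append a column as each new output arrives.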

Further, the step of constructing a matrix for calculating a preset command word score according to the decoding matrix includes:

copying the numerical value corresponding to the first row of the decoding matrix to the first row of a second blank matrix;

copying a numerical value corresponding to a row of a first phoneme of a phoneme sequence corresponding to the preset command word from the decoding matrix to a second row of a second blank matrix;

copying the numerical value corresponding to the row of the second phoneme of the phoneme sequence corresponding to the preset command word from the decoding matrix to the third row of the second blank matrix, and so on until the numerical value corresponding to the row of the last phoneme of the phoneme sequence corresponding to the preset command word is copied to the last row of the second blank matrix.
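The three copying steps above can be sketched as a single helper. The name `build_first_matrix` and the `phoneme_rows` argument (the row index, in the decoding matrix, of each phoneme of the command word) are assumptions for illustration; the decoding matrix is assumed to hold blank in its first row.

```python
def build_first_matrix(decoding_matrix, phoneme_rows):
    """Copy the blank row, then one row per phoneme of the command word."""
    first_matrix = [list(decoding_matrix[0])]      # first row: blank
    for row in phoneme_rows:                       # following rows: phonemes, in order
        first_matrix.append(list(decoding_matrix[row]))
    return first_matrix


# Made-up decoding matrix with blank plus three phoneme classes and T = 2.
decoding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# Hypothetical command word whose phonemes sit in rows 3 and 1.
print(build_first_matrix(decoding_matrix, [3, 1]))
```

The result has S rows, where S is the command word length plus 1.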

Further, the step of determining a node for calculating the score of the preset command word includes:

and taking the last phoneme node corresponding to the last moment in the first matrix as the node of the score of the preset command word.

Further, when the phoneme sequence corresponding to the preset command word is abc, where a is the first phoneme, b is the second phoneme and c is the third phoneme of the phoneme sequence abc, and T=5: at t=0, the two nodes blank and a can be reached; next, the blank node at t=0 can move to the two nodes blank and a at t=1, and the a node at t=0 can move to the three nodes blank, a and b at t=1; next, the blank node at t=1 can move to the two nodes blank and a at t=2, the a node at t=1 can move to the three nodes blank, a and b at t=2, and the b node at t=1 can move to the three nodes blank, b and c at t=2; next, the blank node at t=2 can move to the two nodes blank and a at t=3, the a node at t=2 can move to the three nodes blank, a and b at t=3, the b node at t=2 can move to the three nodes blank, b and c at t=3, and the c node at t=2 can move to the two nodes blank and c at t=3; next, the blank node at t=3 can move to the two nodes blank and a at t=4, the a node at t=3 can move to the three nodes blank, a and b at t=4, the b node at t=3 can move to the three nodes blank, b and c at t=4, and the c node at t=3 can move to the two nodes blank and c at t=4.

Further, the step of recursively calculating two values for each node to which recursion is performed based on the first matrix includes:

For the blank node, problabel is equal to 0, and probblank is calculated as: blank[t].probblank = blank[t-1].probblank × ctc[t][blank_id]; wherein t represents the time, t > 0, blank[t].probblank represents the total probability of blank when reaching the node blank[t], blank[t-1].probblank represents the total probability of blank when reaching the node blank[t-1], ctc[t][blank_id] represents the value at position blank_id at time t taken from the decoding matrix, and blank_id represents the position of blank;

For the first phoneme node of the phoneme sequence corresponding to the preset command word, its probblank and problabel are calculated as follows: a[t].probblank = sum(a[t-1]) × ctc[t][blank_id]; a[t].problabel = (blank[t-1].probblank + sum(a[t-1])) × ctc[t][a_id]; wherein a represents the first phoneme node of the phoneme sequence corresponding to the preset command word, t > 0, a[t].probblank represents the total probability of blank when reaching the node a[t], a[t].problabel represents the total probability of a phoneme when reaching the node a[t], sum(a[t-1]) = a[t-1].probblank + a[t-1].problabel, a[t-1].probblank represents the total probability of blank when reaching the node a[t-1], a[t-1].problabel represents the total probability of a phoneme when reaching the node a[t-1], ctc[t][blank_id] represents the value at position blank_id at time t taken from the decoding matrix, blank_id represents the position of blank, blank[t-1].probblank represents the total probability of blank when reaching the node blank[t-1], ctc[t][a_id] represents the value at position a_id at time t taken from the decoding matrix, and a_id represents the position of a;

For the non-initial phoneme nodes of the phoneme sequence corresponding to the preset command word, probblank and problabel are respectively calculated as follows: b[t].probblank = sum(b[t-1]) × ctc[t][blank_id]; b[t].problabel = (sum(a[t-1]) + sum(b[t-1])) × ctc[t][b_id]; wherein b represents a non-initial phoneme node of the phoneme sequence corresponding to the preset command word and a here represents the phoneme node immediately preceding b, t > 0, b[t].probblank represents the total probability of blank when reaching the node b[t], b[t].problabel represents the total probability of a phoneme when reaching the node b[t], sum(b[t-1]) = b[t-1].probblank + b[t-1].problabel, b[t-1].probblank represents the total probability of blank when reaching the node b[t-1], b[t-1].problabel represents the total probability of a phoneme when reaching the node b[t-1], ctc[t][b_id] represents the value at position b_id at time t taken from the decoding matrix, and b_id represents the position of b;

For the last phoneme node of the phoneme sequence corresponding to the preset command word, its probblank and problabel are calculated as follows: c[t].probblank = sum(c[t-1]) × ctc[t][blank_id]; c[t].problabel = (sum(b[t-1]) + sum(c[t-1])) × ctc[t][c_id]; wherein c represents the last phoneme node of the phoneme sequence corresponding to the preset command word, t > 0, c[t].probblank represents the total probability of blank when reaching the node c[t], c[t].problabel represents the total probability of a phoneme when reaching the node c[t], sum(c[t-1]) = c[t-1].probblank + c[t-1].problabel, c[t-1].probblank represents the total probability of blank when reaching the node c[t-1], c[t-1].problabel represents the total probability of a phoneme when reaching the node c[t-1], ctc[t][blank_id] represents the value at position blank_id at time t taken from the decoding matrix, and ctc[t][c_id] represents the value at position c_id at time t taken from the decoding matrix;

When t=0, blank[0].probblank = ctc[0][blank_id] and label[0].problabel = ctc[0][label_id]; wherein ctc[0][blank_id] represents the value at position blank_id at time t=0 taken from the decoding matrix, ctc[0][label_id] represents the value at position label_id at time t=0 taken from the decoding matrix, blank[0].probblank represents the total probability of blank when reaching the node blank[0], and label[0].problabel represents the total probability of a phoneme when reaching the node label[0]; wherein label represents a phoneme.
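The recurrence formulas above can be put together in a short sketch. It is an illustration under stated assumptions rather than the exact implementation: `command_word_score` and its parameters are hypothetical names, `ctc[t][k]` is taken to be indexed by time first as in the formulas, adjacent phonemes of the command word are assumed to differ, and at t=0 only the blank node and the first phoneme node are given non-zero values (as in the abc example).

```python
def command_word_score(ctc, blank_id, phoneme_ids):
    """Score a command word with the two-value recursion.

    ctc[t][k]: probability of class k at time t.
    phoneme_ids: class ids of the command word's phonemes, in order.
    """
    n = len(phoneme_ids)
    # t = 0: only the blank node and the first phoneme node are reachable.
    lead_blank = ctc[0][blank_id]               # blank[0].probblank
    prob_blank = [0.0] * n                      # x[0].probblank for each phoneme x
    prob_label = [0.0] * n                      # x[0].problabel for each phoneme x
    prob_label[0] = ctc[0][phoneme_ids[0]]

    for t in range(1, len(ctc)):
        # sum(x[t-1]) = x[t-1].probblank + x[t-1].problabel
        total = [pb + pl for pb, pl in zip(prob_blank, prob_label)]
        p_blank = ctc[t][blank_id]
        new_blank = [s * p_blank for s in total]
        new_label = []
        for i, pid in enumerate(phoneme_ids):
            # The first phoneme is entered from the blank node, later ones
            # from the preceding phoneme node; each node may also be kept.
            incoming = lead_blank if i == 0 else total[i - 1]
            new_label.append((incoming + total[i]) * ctc[t][pid])
        lead_blank *= p_blank                   # blank[t].probblank
        prob_blank, prob_label = new_blank, new_label

    # Score node: last phoneme at the last time step.
    return prob_blank[-1] + prob_label[-1]


# Made-up probabilities: T = 3, classes are blank (0), a (1), b (2).
ctc = [[0.5, 0.4, 0.1], [0.2, 0.5, 0.3], [0.3, 0.1, 0.6]]
print(command_word_score(ctc, blank_id=0, phoneme_ids=[1, 2]))
```

Only the values of the previous time step are kept, so memory use grows with the command word length and not with T.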

Further, when the phoneme sequence corresponding to the command word has continuous repeated phonemes, only a downward oblique path exists between two adjacent repeated phonemes during recursion, and no translation path exists.

In a second aspect, an embodiment of the present application provides a command word score calculating apparatus, including:

The buffer module is used for caching the result output by the voice recognition network to form a decoding matrix; wherein the decoding matrix has a shape of T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the extra 1 corresponding to the blank class;

The construction module is used for constructing a matrix for calculating the score of the preset command word according to the decoding matrix, and marking the matrix for calculating the score of the preset command word as a first matrix; wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the extra 1 corresponding to the blank class;

The recurrence module is used for performing recurrence on the basis of the first matrix, and calculating two values for each node to which recurrence is performed; one value is the total probability of a phoneme when reaching the node, and the other value is the total probability of a blank when reaching the node;

And the determining module is used for determining a node for calculating the score of the preset command word, and adding the two values of the node for calculating the score of the preset command word to obtain the score of the command word.

In a third aspect, an embodiment of the present application provides a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the command word score calculating method according to any one of the above when executing the computer program.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the command word score calculating method of any one of the above.

The beneficial effects are that:

The embodiment of the application forms a decoding matrix by caching the result output by the voice recognition network, and constructs, according to the decoding matrix, a matrix for calculating the score of the preset command word. This matrix has only one blank row, which provides the basis for reducing the omission of feasible paths. In addition, recursion is performed based on the first matrix and two values are calculated for each node reached by the recursion: one value is the total probability of a phoneme when reaching the node, and the other is the total probability of blank when reaching the node. Calculating these two values is what makes it possible to reduce the omission of feasible paths. Finally, the node used for calculating the score of the preset command word is determined, and the two values of that node are added to obtain the score of the preset command word. Furthermore, whereas the conventional approach must store the probability values of multiple blank rows, the first matrix of the present application stores the probability values of only a single blank row, so it requires less storage space than the conventional approach.

Drawings

FIG. 1 is a schematic diagram of a command word score calculating method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a decoding matrix according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a first matrix according to an embodiment of the present invention;

FIG. 4 is a prior art ctc decoding topology;

FIG. 5 is a ctc decoding topology provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of a flow of a command word score calculating apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a computer device according to an embodiment of the present invention;

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, modules, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, modules, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

Referring to fig. 1, an embodiment of the present application provides a command word score calculating method, including:

S1, caching the result output by a voice recognition network to form a decoding matrix; wherein the decoding matrix has a shape of T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the extra 1 corresponding to the blank class;

S2, constructing a matrix for calculating the score of the preset command word according to the decoding matrix, and marking the matrix for calculating the score of the preset command word as a first matrix; wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the extra 1 corresponding to the blank class;

S3, performing recursion based on the first matrix, and calculating two values for each node reached by the recursion; one value is the total probability of a phoneme when reaching the node, and the other value is the total probability of a blank when reaching the node;

and S4, determining a node for calculating the score of the preset command word, and adding the two values of the node for calculating the score of the preset command word to obtain the score of the preset command word.

In step S1, the speech recognition network may be one of various network structures, for example a recurrent neural network (Recurrent Neural Networks, RNN) such as a Long Short-Term Memory (LSTM) network, a convolutional neural network (Convolutional Neural Networks, CNN), or a combination thereof. After the speech to be recognized is input into the speech recognition network, the network outputs the probability value of each phoneme and the probability value of blank at each moment. The matrix formed by the probability values of each phoneme and of blank output at a plurality of moments is the decoding matrix; its abscissa is time and its ordinate is the phoneme categories plus one blank, as shown in fig. 2. It should be understood that blank refers to the blank symbol, generally denoted by "_". For example, if the number of phoneme categories in the decoding matrix is 65, the shape of the decoding matrix is 66×T.

In step S2, a conventional matrix for calculating the score of a preset command word inserts a blank between every two adjacent phonemes of the phoneme sequence corresponding to the preset command word, and also inserts a blank before the first phoneme and after the last phoneme. By contrast, the matrix for calculating the score of the preset command word in the present application (the first matrix, see fig. 3) needs only a single blank, generally inserted before the first phoneme or after the last phoneme. This design provides the basis for reducing the omission of feasible paths in the subsequent steps. In addition, it should be understood that the preset command word length refers to the number of phonemes in the phoneme sequence corresponding to the preset command word; for example, the phoneme sequence abc has a length of 3.
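The difference between the two layouts can be made concrete with a small sketch. The helper names `conventional_layout` and `single_blank_layout` are hypothetical, and the single blank is placed before the first phoneme here, which is one of the two placements mentioned above.

```python
BLANK = "_"  # assumed blank symbol


def conventional_layout(phonemes):
    """Blank before, between and after the phonemes: 2L + 1 rows."""
    rows = [BLANK]
    for phoneme in phonemes:
        rows += [phoneme, BLANK]
    return rows


def single_blank_layout(phonemes):
    """One blank row followed by the phonemes: S = L + 1 rows."""
    return [BLANK] + list(phonemes)


print(conventional_layout("abc"))   # 7 rows for a command word of length 3
print(single_blank_layout("abc"))   # 4 rows for the same command word
```

For a command word of length L, the row count therefore drops from 2L + 1 to L + 1.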

In step S3, when a conventional method calculates the score of the preset command word by recursion, a single value is calculated for each node reached by the recursion, namely the total probability of reaching that node. Unlike the conventional method, the present application calculates two values: one is the total probability of a phoneme (label) when reaching the node, denoted by problabel, and the other is the total probability of blank when reaching the node, denoted by probblank. It should be noted that both values must be calculated in order to reduce the omission of feasible paths; calculating only one value, as in the conventional method, cannot reduce the omission. The sum mentioned below represents the sum of these two probabilities, i.e. sum = problabel + probblank.

In step S4, after the two values (problabel and probblank) of each node have been calculated according to the recurrence principle, the node used for calculating the score of the preset command word is determined among these nodes, and the two values of that node are added to obtain the score of the preset command word.

In an embodiment, the speech recognition network outputs the probability of each phoneme and the probability of blank at each moment, and the step of caching the result output by the speech recognition network to form the decoding matrix includes: filling, in time order, the probabilities of blank output at each moment into the first row of a first blank matrix, filling the probabilities of the first phoneme output at each moment into the second row of the first blank matrix, filling the probabilities of the second phoneme output at each moment into the third row of the first blank matrix, and so on, until the probabilities of the last phoneme (the (C-1)-th phoneme, since C includes the blank class) output at each moment are filled into the last row of the first blank matrix to form the decoding matrix; wherein the first row is the uppermost row of the decoding matrix.

In the embodiment of the present application, the speech recognition network outputs the probability value of each phoneme and the probability value of blank at each moment, and the matrix formed by these values over a plurality of moments is the decoding matrix; its abscissa is time and its ordinate is the phoneme categories plus one blank, as shown in fig. 2. It should be understood that blank refers to the blank symbol, generally denoted by "_". For example, if the number of phoneme categories in the decoding matrix is 65, the shape of the decoding matrix is 66×T.

In one embodiment, the step of constructing a matrix for calculating a preset command word score according to the decoding matrix includes:

In the embodiment of the present application, assuming that the phoneme sequence corresponding to the preset command word is abc, the rows of the first matrix are, from top to bottom, the probability values of blank at each time, the probability values of a at each time, the probability values of b at each time, and the probability values of c at each time, as shown in fig. 3, which depicts the matrix for calculating the score of the command word.

The first matrix constructed in this way provides the basis for the subsequent implementation of reduced omission of viable paths.

In an embodiment, as shown in fig. 5, when the phoneme sequence corresponding to the preset command word is abc, where a is the first phoneme, b is the second phoneme and c is the third phoneme of the phoneme sequence abc, and T=5: at t=0, the two nodes blank and a can be reached; next, the blank node at t=0 can move to the two nodes blank and a at t=1, and the a node at t=0 can move to the three nodes blank, a and b at t=1; next, the blank node at t=1 can move to the two nodes blank and a at t=2, the a node at t=1 can move to the three nodes blank, a and b at t=2, and the b node at t=1 can move to the three nodes blank, b and c at t=2; next, the blank node at t=2 can move to the two nodes blank and a at t=3, the a node at t=2 can move to the three nodes blank, a and b at t=3, the b node at t=2 can move to the three nodes blank, b and c at t=3, and the c node at t=2 can move to the two nodes blank and c at t=3; next, the blank node at t=3 can move to the two nodes blank and a at t=4, the a node at t=3 can move to the three nodes blank, a and b at t=4, the b node at t=3 can move to the three nodes blank, b and c at t=4, and the c node at t=3 can move to the two nodes blank and c at t=4.

In this embodiment, the ctc decoding topology of the present invention is shown in fig. 5 and the conventional ctc decoding topology is shown in fig. 4. A topology represents a recurrence formula, and the recurrence formula of the present invention differs from the conventional one. Compared with the conventional ctc forward algorithm (recursion), in this embodiment at most the four nodes blank, a, b and c need to be recorded at each time step, and each node keeps two values: the total probability of a phoneme (label) when reaching the node, recorded as problabel, and the total probability of blank when reaching the node, recorded as probblank. With this newly designed topology and the two values recorded for each node, the omission of a main path can be reduced.
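One way to read the topology of fig. 5 is that a blank or a repeat of the current phoneme keeps the path on the current node, while the next phoneme moves it one node down. The sketch below checks a frame-level path against that reading; it is an interpretation offered for illustration, and the function name `follows_fig5_topology`, the "_" blank symbol and the assumption that adjacent phonemes of the command word differ are not taken from the original text.

```python
BLANK = "_"  # assumed blank symbol


def follows_fig5_topology(path, phonemes):
    """True if the frame labels trace a path ending on the last phoneme node."""
    state = -1                       # -1: still on the leading blank node
    for symbol in path:
        if symbol == BLANK:
            continue                 # blank: stay on the current node
        if state >= 0 and symbol == phonemes[state]:
            continue                 # same phoneme again: stay on the node
        if state + 1 < len(phonemes) and symbol == phonemes[state + 1]:
            state += 1               # next phoneme: move one node down
        else:
            return False             # any other symbol leaves the topology
    return state == len(phonemes) - 1


word = ["d", "a", "k", "ai"]
print(follows_fig5_topology(["d", "a", "_", "_", "k", "ai"], word))
print(follows_fig5_topology(["d", "a", "_", "a", "k", "ai"], word))
```

Under this reading, the path d a _ a k ai, which the conventional rule maps to a different sequence, is accepted for the command word d a k ai.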

Unlike the prior art, the following recurrence formulas reflect the above topology. In fig. 5, when t=0, the two nodes blank and a can be reached, and their scores are respectively:

blank[0].probblank = ctc[0][blank_id] and a[0].problabel = ctc[0][a_id];

When t=1, the three nodes blank, a and b can be reached, and their scores are respectively:

blank[1].probblank = blank[0].probblank × ctc[1][blank_id];

a[1].probblank = sum(a[0]) × ctc[1][blank_id];

a[1].problabel = (blank[0].probblank + sum(a[0])) × ctc[1][a_id];

b[1].problabel = sum(a[0]) × ctc[1][b_id];

When t=2, the four nodes blank, a, b and c can be reached:

blank[2].probblank = blank[1].probblank × ctc[2][blank_id];

a[2].probblank = sum(a[1]) × ctc[2][blank_id];

a[2].problabel = (blank[1].probblank + sum(a[1])) × ctc[2][a_id];

b[2].probblank = sum(b[1]) × ctc[2][blank_id];

b[2].problabel = (sum(a[1]) + sum(b[1])) × ctc[2][b_id];

c[2].problabel = sum(b[1]) × ctc[2][c_id]; (here, strictly, c[2].problabel = (sum(b[1]) + sum(c[1])) × ctc[2][c_id], and since sum(c[1]) = 0, c[2].problabel = sum(b[1]) × ctc[2][c_id]);

When t=3, the four nodes blank, a, b and c can be reached:

blank[3].probblank = blank[2].probblank × ctc[3][blank_id];

a[3].probblank = sum(a[2]) × ctc[3][blank_id];

a[3].problabel = (blank[2].probblank + sum(a[2])) × ctc[3][a_id];

b[3].probblank = sum(b[2]) × ctc[3][blank_id];

b[3].problabel = (sum(a[2]) + sum(b[2])) × ctc[3][b_id];

c[3].probblank = sum(c[2]) × ctc[3][blank_id];

c[3].problabel = (sum(b[2]) + sum(c[2])) × ctc[3][c_id];

When t=4, the four nodes blank, a, b and c can be reached:

blank[4].probblank = blank[3].probblank × ctc[4][blank_id];

a[4].probblank = sum(a[3]) × ctc[4][blank_id];

a[4].problabel = (blank[3].probblank + sum(a[3])) × ctc[4][a_id];

b[4].probblank = sum(b[3]) × ctc[4][blank_id];

b[4].problabel = (sum(a[3]) + sum(b[3])) × ctc[4][b_id];

c[4].probblank = sum(c[3]) × ctc[4][blank_id];

c[4].problabel = (sum(b[3]) + sum(c[3])) × ctc[4][c_id].

Note that, for the nodes reached by the recursion in this example, if the probblank value or problabel value of a node is not written out, that value is equal to 0 in this example. Further, blank[t].probblank represents the total probability of blank when reaching the node blank[t], ctc[t][blank_id] represents the value at position blank_id at time t taken from the decoding matrix, and blank_id represents the position of blank; a[t].probblank represents the total probability of blank when reaching the node a[t], a[t].problabel represents the total probability of a phoneme when reaching the node a[t], b[t].probblank represents the total probability of blank when reaching the node b[t], b[t].problabel represents the total probability of a phoneme when reaching the node b[t], c[t].probblank represents the total probability of blank when reaching the node c[t], and c[t].problabel represents the total probability of a phoneme when reaching the node c[t].

Note that the node c at t=4 is the node used for calculating the command word score; adding probblank and problabel of the node c at t=4 gives the final command word score.
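The node-by-node walk for abc with T=5 can be reproduced with a short sketch that runs the recursion and lists, for each time step, the nodes whose total probability is non-zero. The function name `reachable_nodes`, the uniform made-up probabilities and the class order (blank, a, b, c) are assumptions for illustration.

```python
def reachable_nodes(ctc, blank_id, phoneme_ids, names):
    """Return, for each t, the names of the nodes with non-zero total probability."""
    n = len(phoneme_ids)
    lead_blank = ctc[0][blank_id]             # blank[0].probblank
    total = [0.0] * n                         # sum(x[t]) for each phoneme node x
    total[0] = ctc[0][phoneme_ids[0]]         # a[0].problabel
    history = []
    for t in range(len(ctc)):
        if t > 0:
            p_blank = ctc[t][blank_id]
            new_total = []
            for i, pid in enumerate(phoneme_ids):
                incoming = lead_blank if i == 0 else total[i - 1]
                prob_label = (incoming + total[i]) * ctc[t][pid]
                prob_blank = total[i] * p_blank
                new_total.append(prob_label + prob_blank)
            lead_blank *= p_blank             # blank[t].probblank
            total = new_total
        row = ["blank"] if lead_blank > 0 else []
        row += [names[i] for i in range(n) if total[i] > 0]
        history.append(row)
    return history


uniform = [[0.25] * 4 for _ in range(5)]      # T = 5, classes: blank, a, b, c
for t, nodes in enumerate(reachable_nodes(uniform, 0, [1, 2, 3], "abc")):
    print(t, nodes)
```

With these inputs the listing shows blank and a at t=0, then blank, a and b at t=1, and all four nodes from t=2 onward, which matches the walk-through above.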

In one embodiment, the step of recursively calculating two values for each node recursively based on the first matrix includes:

For the first phoneme node of the phoneme sequence corresponding to the preset command word, its probblank and problabel are calculated as follows: at. probblank =sum (a t-1) ctc[t][blank_id]；a[t].problabel = (blank[t -1].probblank + sum(a[t -1]))/>Ctc [ t ] [ a_id ]; wherein a represents a first phoneme node of a phoneme sequence corresponding to a preset command word, t >0, a [ t ]. Probblank represents a total probability of blank when reaching the node a [ t ], a [ t ]. Problabel represents a total probability of phoneme when reaching the node a [ t ], [ t-1 ])=a [ t-1]. Probblank +a [ t-1]. Problabel; a [ t-1]. Probblank represents the total probability of a blank when reaching the node of a [ t-1], a [ t-1]. Problabel represents the total probability of a phoneme when reaching the node of a [ t-1], ctc [ t ] [ blank_id ] represents the value at the position of blank_id at the time t taken from the decoding matrix, blank_id represents the position of blank, blank [ t-1]. Probblank represents the total probability of blank when reaching the node of blank [ t-1], ctc [ t ] [ a_id ] represents the value at the position of a_id at the time t taken from the decoding matrix, and a_id represents the position of a;

For the last phoneme node of the phoneme sequence corresponding to the preset command word, its probblank and problabel are calculated as follows: c[t].probblank = sum(c[t-1]) × ctc[t][blank_id], c[t].problabel = (sum(b[t-1]) + sum(c[t-1])) × ctc[t][c_id]; wherein c represents the last phoneme node of the phoneme sequence corresponding to the preset command word, t > 0, c[t].probblank represents the total probability of blank when the node c[t] is reached, c[t].problabel represents the total probability of the phoneme when the node c[t] is reached, sum(c[t-1]) = c[t-1].probblank + c[t-1].problabel, c[t-1].probblank represents the total probability of blank when the node c[t-1] is reached, c[t-1].problabel represents the total probability of the phoneme when the node c[t-1] is reached, ctc[t][blank_id] represents the value at position blank_id at time t taken from the decoding matrix, and ctc[t][c_id] represents the value at position c_id at time t taken from the decoding matrix;

For a phoneme node, when it is assigned a value for the first time, its probblank is 0 and its problabel is calculated according to the above formula.

It should be noted that the formulas given in this embodiment are the formulas on which the recursion of the present invention depends; the probblank and problabel of each node reached by recursion in fig. 5 (for the phoneme sequence abc) are calculated using these formulas.
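By way of illustration only, the recursion described above can be sketched in Python as follows. This is a minimal sketch and not the reference implementation of the application. The function name command_word_score, its argument names, the initialization at t=0 (only the leading blank node and the first phoneme node carry probability), and the update of the leading blank row are assumptions made for the example; the update for intermediate phoneme nodes is assumed to follow the same pattern as the last phoneme node formula above.

```python
def command_word_score(ctc, phoneme_ids, blank_id=0):
    """Score a command word from a T x C decoding matrix.

    ctc[t][k] is the probability of class k at time t; phoneme_ids lists
    the class indices of the command word's phonemes in order.
    All names here are hypothetical and chosen for illustration.
    """
    if not ctc or not phoneme_ids:
        raise ValueError("ctc and phoneme_ids must be non-empty")

    n = len(phoneme_ids)
    # Assumed initialization at t = 0: only blank and the first phoneme
    # are reachable; every value that is not written out is 0.
    lead_blank = ctc[0][blank_id]           # blank[0].probblank
    prob_blank = [0.0] * n                  # probblank of each phoneme node
    prob_label = [0.0] * n                  # problabel of each phoneme node
    prob_label[0] = ctc[0][phoneme_ids[0]]

    for t in range(1, len(ctc)):
        # sum(x[t-1]) = x[t-1].probblank + x[t-1].problabel
        total_prev = [pb + pl for pb, pl in zip(prob_blank, prob_label)]

        # x[t].probblank = sum(x[t-1]) * ctc[t][blank_id]
        new_blank = [s * ctc[t][blank_id] for s in total_prev]

        new_label = []
        for i, pid in enumerate(phoneme_ids):
            # The first phoneme is entered from the leading blank row,
            # every other phoneme from the phoneme node before it.
            incoming = lead_blank if i == 0 else total_prev[i - 1]
            new_label.append((incoming + total_prev[i]) * ctc[t][pid])

        # Assumed update of the leading blank row (uses the t-1 value above).
        lead_blank *= ctc[t][blank_id]
        prob_blank, prob_label = new_blank, new_label

    # Score = probblank + problabel of the last phoneme node at the last time.
    return prob_blank[-1] + prob_label[-1]
```

For a single-phoneme command word and two moments in which blank and the phoneme each have probability 0.5, this sketch returns 0.75, which is the sum of the three paths that contain the phoneme at least once.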

It should be noted that, when constructing the matrix for calculating the preset command word score, the blank may be placed in the first row of the matrix as described in the above embodiment, or in the last row of the matrix instead. In general, the step of constructing the matrix for calculating the preset command word score according to the decoding matrix includes:

In an embodiment, the step of determining a node for calculating the preset command word score includes:

In the example of fig. 5, the last phoneme node corresponding to the last moment in the first matrix is the node c at t=4; thus, the node c at t=4 is the node for calculating the preset command word score in the above example.

It should be noted that, regardless of whether the blank is placed in the first row or the last row of the first matrix, the node for calculating the preset command word score is the last phoneme node corresponding to the last moment in the first matrix.

In an embodiment, when the phoneme sequence corresponding to the command word has consecutively repeated phonemes, only a downward oblique path exists between two adjacent repeated phonemes during recursion, and no translation path exists.

In this embodiment of the present application, consider a command word whose phoneme sequence contains consecutively repeated phonemes, such as a b c c. The existing method inserts a blank between every two adjacent phonemes of the phoneme sequence corresponding to each preset command word, and also inserts a blank before the first phoneme and after the last phoneme, and arranges the rows of the first matrix from top to bottom accordingly. In the present application, the phoneme rows are a, b, c, c, and from the first c to the second c there is only an obliquely downward path and no translation path, so that, compared with the prior art, the omission of feasible paths is further reduced.

In an embodiment, before the step of constructing a matrix for calculating a preset command word score according to the decoding matrix, the method further includes:

counting the normalized path length within the window [0, T-1].

Specifically, the normalized path length is calculated according to the formula: normlen = (t + 1) - blanknum, where blanknum = ctc[t][blank_id].

Here, normlen denotes the normalized path length, t denotes the time, ctc[t][blank_id] denotes the value at position blank_id at time t taken from the decoding matrix, and blank_id denotes the position where blank is located.

Counting the normalized path length helps to improve the accuracy of command word recognition.
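As a small illustration, the formula above can be read literally as in the following Python sketch. The function name normalized_path_lengths and the choice to return one value per moment of the window [0, T-1] are assumptions made for the example; the text does not specify how the per-moment values are subsequently used.

```python
def normalized_path_lengths(ctc, blank_id=0):
    """Return normlen for every time t in the window [0, T-1].

    Literal reading of: normlen = (t + 1) - blanknum,
    with blanknum = ctc[t][blank_id]. Names are hypothetical.
    """
    lengths = []
    for t, row in enumerate(ctc):
        blanknum = row[blank_id]          # value at blank_id at time t
        lengths.append((t + 1) - blanknum)
    return lengths
```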

As shown in FIG. 6, in a second aspect, an embodiment of the present application provides a command word score calculating apparatus, the apparatus comprising:

the buffer module 1 is used for buffering the result output by the voice recognition network to form a decoding matrix; wherein the shape of the decoding matrix is T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the 1 corresponding to the blank class;

the construction module 2 is used for constructing a matrix for calculating the score of the preset command word according to the decoding matrix, and marking the matrix for calculating the score of the preset command word as a first matrix; wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the 1 corresponding to the blank class;

the recursion module 3 is configured to perform recursion based on the first matrix and calculate two values for each node reached by the recursion; one value is the total probability of a phoneme when reaching the node, and the other value is the total probability of a blank when reaching the node;

the determining module 4 is used for determining a node for calculating the score of the preset command word, and adding the two values of the node for calculating the score of the preset command word to obtain the score of the preset command word.

In an embodiment, the probability of each phoneme and the probability of blank are output by the speech recognition network at each moment, and the buffering the result output by the speech recognition network, forming the decoding matrix includes: filling probabilities of blank output at each moment into a first row of a first blank matrix according to time sequence, filling probabilities of first phonemes output at each moment into a second row of the first blank matrix, filling probabilities of second phonemes output at each moment into a third row of the first blank matrix, and analogizing until probabilities of C-th phonemes output at each moment are filled into a last row of the first blank matrix to form the decoding matrix; wherein the first row is the uppermost row of the decoding matrix.
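The row-by-row filling described above can be sketched as follows. This is an illustrative Python sketch only; the function name build_decoding_matrix and the assumption that each per-moment output is a list ordered as [blank, first phoneme, second phoneme, ...] are not taken from the application.

```python
def build_decoding_matrix(frames):
    """Cache per-moment network outputs row by row.

    frames: per-moment outputs in time order, each assumed to be
    [p_blank, p_phoneme_1, p_phoneme_2, ...] (hypothetical layout).
    Returns a list of rows: row 0 holds the blank probabilities over
    time, row k holds the probabilities of the k-th phoneme over time.
    """
    frames = [list(f) for f in frames]
    if not frames:
        return []
    num_classes = len(frames[0])
    if any(len(f) != num_classes for f in frames):
        raise ValueError("every moment must output the same number of classes")
    # One row per class, one column per moment, filled in time order.
    return [[f[k] for f in frames] for k in range(num_classes)]
```

With this layout the value for class k at time t is read as matrix[k][t]; transposing the result gives the ctc[t][k] indexing used in the recursion formulas.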

In an embodiment, the constructing a matrix for calculating a preset command word score according to the decoding matrix includes:

In an embodiment, the determining the node for calculating the preset command word score includes:

In an embodiment, when the phoneme sequence corresponding to the preset command word is abc, where a is the first phoneme of the phoneme sequence abc, b is the second phoneme, and c is the third phoneme, and T=5, the recursion proceeds as follows: at t=0, the two nodes blank and a can be reached; next, the blank node at t=0 can reach the two nodes blank and a at t=1, and the a node at t=0 can reach the two nodes a and b at t=1; the recursion then continues in the same manner for the subsequent moments up to t=4.

In an embodiment, the performing recursion based on the first matrix and calculating two values for each node reached by the recursion includes:

For the last phoneme node of the phoneme sequence corresponding to the preset command word, its probblank and problabel are calculated as follows: c[t].probblank = sum(c[t-1]) × ctc[t][blank_id], c[t].problabel = (sum(b[t-1]) + sum(c[t-1])) × ctc[t][c_id]; wherein c represents the last phoneme node of the phoneme sequence corresponding to the preset command word, t > 0, c[t].probblank represents the total probability of blank when the node c[t] is reached, c[t].problabel represents the total probability of the phoneme when the node c[t] is reached, sum(c[t-1]) = c[t-1].probblank + c[t-1].problabel, c[t-1].probblank represents the total probability of blank when the node c[t-1] is reached, c[t-1].problabel represents the total probability of the phoneme when the node c[t-1] is reached, ctc[t][blank_id] represents the value at position blank_id at time t taken from the decoding matrix, and ctc[t][c_id] represents the value at position c_id at time t taken from the decoding matrix.

Referring to fig. 7, an embodiment of the present invention further provides a computer device, and an internal structure of the computer device may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used for storing data of the command word score calculating method and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. Further, the above-mentioned computer device may be further provided with an input device, a display screen, and the like.
The computer program, when executed by the processor, implements a command word score calculating method comprising the following steps: caching the result output by the voice recognition network to form a decoding matrix; wherein the shape of the decoding matrix is T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the 1 corresponding to the blank class; constructing a matrix for calculating the score of the preset command word according to the decoding matrix, and marking the matrix for calculating the score of the preset command word as a first matrix; wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the 1 corresponding to the blank class; performing recursion based on the first matrix, and calculating two values for each node reached by the recursion; one value is the total probability of a phoneme when reaching the node, and the other value is the total probability of a blank when reaching the node; determining a node for calculating the score of the preset command word, and adding the two values of the node for calculating the score of the preset command word to obtain the score of the preset command word. It will be appreciated by those skilled in the art that the architecture shown in fig. 7 is merely a block diagram of a portion of the architecture related to the present inventive arrangements and is not intended to limit the computer devices to which the present inventive arrangements are applicable.

An embodiment of the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a command word score calculating method, including the steps of: caching the result output by the voice recognition network to form a decoding matrix; wherein the shape of the decoding matrix is T×C, T represents the time length, and C is equal to the number of phoneme classes plus 1, the 1 corresponding to the blank class; constructing a matrix for calculating the score of the preset command word according to the decoding matrix, and marking the matrix for calculating the score of the preset command word as a first matrix; wherein the shape of the first matrix is T×S, and S is equal to the length of the preset command word plus 1, the 1 corresponding to the blank class; performing recursion based on the first matrix, and calculating two values for each node reached by the recursion; one value is the total probability of a phoneme when reaching the node, and the other value is the total probability of a blank when reaching the node; determining a node for calculating the score of the preset command word, and adding the two values of the node for calculating the score of the preset command word to obtain the score of the preset command word. It is understood that the computer-readable storage medium in this embodiment may be a volatile readable storage medium or a non-volatile readable storage medium.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium provided by the present application and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

It should be noted that the information (including but not limited to user equipment information, user personal information, etc.), data (including but not limited to data for analysis, stored data, presented data, etc.) and signals involved in the embodiments of the present application are all authorized by the user or fully authorized by all parties, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

For example, the voices related to the application are all acquired under the condition of full authorization.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article, or method that comprises the element.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the invention, and all equivalent structures or equivalent processes using the descriptions and drawings of the present invention or directly or indirectly applied to other related technical fields are included in the scope of the invention.

Claims

1. A command word score calculating method, the method comprising:

2. The command word score calculating method according to claim 1, wherein the speech recognition network outputs the probability of each phoneme and the probability of blank at each moment, and the step of buffering the result outputted from the speech recognition network to form the decoding matrix comprises: filling probabilities of blank output at each moment into a first row of a first blank matrix according to time sequence, filling probabilities of first phonemes output at each moment into a second row of the first blank matrix, filling probabilities of second phonemes output at each moment into a third row of the first blank matrix, and analogizing until probabilities of C-th phonemes output at each moment are filled into a last row of the first blank matrix to form the decoding matrix; wherein the first row is the uppermost row of the decoding matrix.

3. The command word score calculating method according to claim 2, wherein the step of constructing a matrix for calculating a preset command word score from the decoding matrix comprises:

4. A command word score calculating method according to claim 3, wherein the step of determining a node for calculating the preset command word score comprises:

5. The command word score calculation method according to claim 3, wherein when a phoneme sequence corresponding to the preset command word is abc, wherein a is a first phoneme of the phoneme sequence abc, b is a second phoneme of the phoneme sequence abc, c is a third phoneme of the phoneme sequence, when t=5, when t=0, two nodes of the block and a can be walked, next, a node when t=0 can walk two nodes of the block and a, a node when t=0 can walk two nodes of the block=1, a and b, then a node when t=1 can walk two nodes of the block and a, a node when t=1 can walk three nodes of the block, a and b, when t=1 can walk two nodes of the block=2, b and c, then, when t=2 can walk two nodes of the block=3, b=3 can walk two nodes of the block=3, and b=3 can walk two nodes of the block and b=3, and a node when t=3 can walk two nodes of the block=3, and a node when t=3=3.

6. The command word score calculating method according to claim 3, wherein the step of calculating two values for each node to which the recurrence is performed based on the first matrix includes:

For the first phoneme node of the phoneme sequence corresponding to the preset command word, its probblank and problabel are calculated as follows: a[t].probblank = sum(a[t-1]) × ctc[t][blank_id]; a[t].problabel = (blank[t-1].probblank + sum(a[t-1])) × ctc[t][a_id]; wherein a represents the first phoneme node of the phoneme sequence corresponding to the preset command word, t > 0, a[t].probblank represents the total probability of blank when the node a[t] is reached, a[t].problabel represents the total probability of the phoneme when the node a[t] is reached, and sum(a[t-1]) = a[t-1].probblank + a[t-1].problabel; a[t-1].probblank represents the total probability of blank when the node a[t-1] is reached, a[t-1].problabel represents the total probability of the phoneme when the node a[t-1] is reached, ctc[t][blank_id] represents the value at position blank_id at time t taken from the first matrix, blank_id represents the position where blank is located, blank[t-1].probblank represents the total probability of blank when the node blank[t-1] is reached, ctc[t][a_id] represents the value at position a_id at time t taken from the decoding matrix, and a_id represents the position where a is located;

For the non-initial phoneme nodes of the phoneme sequence corresponding to the preset command word, probblank and problabel are respectively calculated as follows: b[t].probblank = sum(b[t-1]) × ctc[t][blank_id], b[t].problabel = (sum(a[t-1]) + sum(b[t-1])) × ctc[t][b_id]; wherein b represents a non-initial phoneme node of the phoneme sequence corresponding to the preset command word; t > 0, b[t].probblank represents the total probability of blank when the node b[t] is reached, b[t].problabel represents the total probability of the phoneme when the node b[t] is reached, and sum(b[t-1]) = b[t-1].probblank + b[t-1].problabel; b[t-1].probblank represents the total probability of blank when the node b[t-1] is reached, b[t-1].problabel represents the total probability of the phoneme when the node b[t-1] is reached, ctc[t][b_id] represents the value at position b_id at time t taken from the decoding matrix, and b_id represents the position where b is located.

7. A command word score calculating method according to claim 3, wherein when a phoneme sequence corresponding to the command word has consecutively repeated phonemes, only a diagonal downward path is provided between two adjacent repeated phonemes during recursion, and no translation path is provided.

8. A command word score calculating apparatus, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, characterized in that the processor, when executing the computer program, implements the steps of the command word score calculation method according to any one of claims 1 to 7.

10. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the steps of the command word score calculating method according to any of claims 1 to 7.