CN114399998B - Voice processing method, device, equipment, storage medium and program product

Voice processing method, device, equipment, storage medium and program product

Info

Publication number
CN114399998B
CN114399998B CN202111472014.XA
Authority
CN
China
Prior art keywords: network, word text, subgraph, graph, text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111472014.XA
Other languages
Chinese (zh)
Other versions
CN114399998A (en)
Inventor
王知践
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111472014.XA
Publication of CN114399998A
Application granted
Publication of CN114399998B

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L 15/00 Speech recognition
            • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
              • G10L 15/063 Training
                • G10L 2015/0631 Creating reference templates; Clustering
            • G10L 15/08 Speech classification or search
              • G10L 15/16 Speech classification or search using artificial neural networks
            • G10L 15/26 Speech to text systems


Abstract

The present disclosure provides a speech processing method, apparatus, device, storage medium, and program product, and relates to the field of speech technologies. In some embodiments of the present disclosure, a speech processing apparatus acquires speech to be recognized and inputs it into a trained speech recognition graph neural network to obtain a speech recognition text. The speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting pairs of word text nodes, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes. As syllables are input in sequence, the directed edges of the word text nodes corresponding to each syllable determine the jump relationships between word text nodes, forming the path along which the speech to be recognized jumps from node to node; this path yields the final speech recognition result and provides a new mode of speech recognition.

Description

Voice processing method, device, equipment, storage medium and program product
Technical Field
The present disclosure relates to the field of speech technologies, and in particular, to a speech processing method, apparatus, device, storage medium, and program product.
Background
In recent years, interest in extending deep learning methods to graphs has been growing. Driven by several successes, researchers have drawn on the ideas of convolutional networks, recurrent networks, and deep autoencoders to define and design neural network architectures for processing graph data; the graph neural network emerged from this work.
Graph neural networks are already widely applied in fields such as social networks, recommendation systems, physical systems, chemical molecule prediction, and knowledge graphs. However, graph neural networks have not yet been applied to speech recognition.
Disclosure of Invention
The present disclosure provides a voice processing method, apparatus, device, storage medium, and program product.
The embodiment of the disclosure provides a voice processing method, which includes:
acquiring a voice to be recognized, wherein the voice to be recognized comprises a plurality of syllables;
inputting the voice to be recognized into a voice recognition graph neural network to obtain a voice recognition text;
wherein the speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting the word text nodes; the word text nodes represent word texts, the directed edges between the word text nodes represent the jump relationships, in the time dimension, of the syllables input to the speech recognition graph neural network, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes.
An embodiment of the present disclosure further provides a speech processing apparatus, including:
an acquisition module, configured to acquire speech to be recognized, wherein the speech to be recognized comprises a plurality of syllables; and
a speech recognition module, configured to input the speech to be recognized into a speech recognition graph neural network to obtain a speech recognition text; wherein the speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting the word text nodes; the word text nodes represent word texts, the directed edges between the word text nodes represent the jump relationships, in the time dimension, of the syllables input to the speech recognition graph neural network, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes.
An embodiment of the present disclosure further provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method described above.
The disclosed embodiments also provide a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the steps of the above method.
Embodiments of the present disclosure also provide a computer program product comprising a computer program which, when executed by a processor, performs the steps of the method described above.
In some embodiments of the present disclosure, a speech processing apparatus acquires speech to be recognized and inputs it into a trained speech recognition graph neural network to obtain a speech recognition text. The speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting pairs of word text nodes, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes. As syllables are input in sequence, the directed edges of the word text nodes corresponding to each syllable determine the jump relationships between word text nodes, forming the path along which the speech to be recognized jumps from node to node; this path yields the final speech recognition result and provides a new mode of speech recognition.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart of a speech processing method according to a first embodiment of the disclosure;
fig. 2 is a schematic structural diagram of a first network relationship diagram according to an exemplary embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a second network relationship diagram provided in an exemplary embodiment of the present disclosure;
fig. 4 is a schematic diagram of a process for constructing a first network subgraph corresponding to an input syllable, provided by an exemplary embodiment of the present disclosure;
fig. 5 is a flowchart illustrating a speech processing method according to a second embodiment;
fig. 6 is a flowchart illustrating a speech processing method according to a third embodiment;
fig. 7a is a block diagram of a speech processing apparatus according to an exemplary embodiment of the present disclosure;
FIG. 7b is a block diagram of another speech processing apparatus according to an exemplary embodiment of the present disclosure;
fig. 8 is a schematic block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of any personal information of users involved all comply with the relevant laws and regulations and do not violate public order and good customs.
Graph neural networks are already widely applied in fields such as social networks, recommendation systems, physical systems, chemical molecule prediction, and knowledge graphs. However, graph neural networks have not yet been applied to speech recognition.
In some embodiments of the present disclosure, a speech processing apparatus acquires speech to be recognized and inputs it into a trained speech recognition graph neural network to obtain a speech recognition text. The speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting pairs of word text nodes, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes. As syllables are input in sequence, the directed edges of the word text nodes corresponding to each syllable determine the jump relationships between word text nodes, forming the path along which the speech to be recognized jumps from node to node; this path yields the final speech recognition result and provides a new mode of speech recognition.
Technical solutions provided by the embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a speech processing method according to a first embodiment of the disclosure. As shown in fig. 1, the speech processing method includes:
s101: acquiring a voice to be recognized, wherein the voice to be recognized comprises a plurality of syllables;
s102: inputting the speech to be recognized into a speech recognition graph neural network to obtain a speech recognition text; wherein the speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting the word text nodes; the word text nodes represent word texts, the directed edges between the word text nodes represent the jump relationships, in the time dimension, of the syllables input to the speech recognition graph neural network, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes.
In this embodiment, the speech processing apparatus serving as the execution subject of the method may be a terminal device or a server. When the execution subject is a terminal device, the embodiments of the present disclosure do not limit its implementation form; for example, the terminal device may be a smartphone, a tablet computer, a personal computer, a smart speaker, or a smart wearable device. When the execution subject is a server, this embodiment likewise does not limit its implementation form; the server may be a conventional server, a cloud host, a virtualization center, or another server device. A server device mainly comprises a processor, a hard disk, a memory, a system bus, and the like, similar to a general computer architecture.
In this embodiment, a syllable is taken as the minimum speech unit constituting speech, and the word text corresponding to each syllable is taken as a word text node; directed edges are added between the word text nodes corresponding to the syllables to represent the jump relationships of the syllables of the speech in the time dimension. A speech recognition network relationship graph is thereby formed, and the constructed graph is trained to obtain a trained speech recognition graph neural network. The speech processing apparatus acquires the speech to be recognized and inputs it into the speech recognition graph neural network to obtain the speech recognition text.
It should be noted that a word text takes a character or a word as a unit: one syllable corresponds to at least one word text, and one word text corresponds to one word text node in the speech recognition graph neural network, so a mapping relationship exists between one syllable and at least one word text node in the network. For example, the syllable "zhong" corresponds to the multiple word texts "middle", "species", "end", "weight", "loyalty", "intermediate", "noon", "seed", "various", and "important", among others, and therefore has a one-to-one mapping relationship with the word text nodes corresponding to each of these word texts. Similarly, the syllable "shang" corresponds to the multiple word texts "up", "hurt", "business", "up and down", "wounded", "businessman", and "business volume", among others, and has a mapping relationship with the word text nodes corresponding to each of these word texts.
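For illustration only, the Python sketch below stores such a one-to-many syllable-to-word-text mapping as a plain dictionary. The dictionary layout and the names SYLLABLE_TO_NODES and nodes_for_syllable are assumptions made for the example; the patent does not prescribe any particular data structure.

```python
# A minimal sketch, assuming the mapping is stored as a plain dictionary;
# the glosses mirror the "zhong" and "shang" examples in the paragraph above.
SYLLABLE_TO_NODES = {
    "zhong": ["middle", "species", "end", "weight", "loyalty",
              "intermediate", "noon", "seed", "various", "important"],
    "shang": ["up", "hurt", "business", "up and down", "businessman"],
}

def nodes_for_syllable(syllable):
    """Return every word text node the given syllable maps to."""
    return SYLLABLE_TO_NODES.get(syllable, [])

print(nodes_for_syllable("zhong"))  # all candidate word texts for "zhong"
```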
It should be noted that, compared with a conventional neural network, a graph neural network takes graph structure data (a graph) as input. A graph is composed of nodes and edges and may be represented as G = (V, E), where V denotes the nodes (vertices) and E denotes the edges; each node may represent an object, and an edge represents an association relationship between nodes. In practical application scenarios, much data can be represented in graph form: in the chemistry domain, each atom can be a node and the chemical bonds between atoms can be edges; in the social-network domain, each user can be a node and the relationships between users can be edges; in the map domain, each place can be a node and relationships such as paths or distances between places can be edges. In this embodiment, each word text (a character or a word) is represented as a node, and the jump relationship of the syllables of the speech in the time dimension is represented as directed edges, so that the graph neural network can be applied to the speech recognition scenario.
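As a minimal sketch of graph structure data in the sense just described, the container below represents G = (V, E) with a node set and an edge dictionary keyed by (source, destination) pairs. All class, field, and variable names are illustrative assumptions, not part of the disclosed method.

```python
# A minimal sketch of G = (V, E): V is the set of word text nodes and
# E maps each directed edge to its jump probability.
from dataclasses import dataclass, field

@dataclass
class DirectedGraph:
    nodes: set = field(default_factory=set)    # V: word text nodes
    edges: dict = field(default_factory=dict)  # E: (src, dst) -> jump probability

    def add_edge(self, src, dst, prob):
        self.nodes.update((src, dst))
        self.edges[(src, dst)] = prob

g = DirectedGraph()
g.add_edge("me", "yes", 0.98)  # time-order jump from word text "me" to "yes"
```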
In this embodiment, the speech to be recognized is input into the speech recognition graph neural network to obtain the speech recognition text. In one implementation, the speech recognition graph neural network performs a node relationship construction operation according to the syllables already input into it, so as to obtain a first network relationship graph; it then continues the node relationship construction operation on the first network relationship graph according to the remaining, subsequently input syllables of the speech to be recognized, until the whole speech has been input, so as to obtain a second network relationship graph; finally, the speech recognition text is determined according to the word text paths contained in the second network relationship graph. A word text path is a path formed by sequentially connecting word text nodes of the speech recognition graph neural network through their corresponding directed edges.

Specifically, the speech processing apparatus receives the speech to be recognized and feeds it into the speech recognition graph neural network integrated in the apparatus. Inside the network, the node relationship construction operation is performed among the word text nodes corresponding to the input syllables to obtain the first network relationship graph; each subsequent syllable triggers a further iteration of the node relationship construction on the first network relationship graph, until the input is exhausted, yielding the second network relationship graph. A target word text path matching the speech to be recognized is then selected from the word text paths of the second network relationship graph, and the text corresponding to the word text nodes on the target path is taken as the final speech recognition text. By computing directly on the graph with the speech recognition graph neural network, the speech recognition text is obtained and the accuracy of speech recognition is improved.
For example, fig. 2 is a schematic structural diagram of a first network relationship graph provided by an exemplary embodiment of the present disclosure, and fig. 3 is a schematic structural diagram of a second network relationship graph. As shown in fig. 2, the speech recognition graph neural network performs the node construction operation according to the already-input speech corresponding to the text "i am", so as to obtain the first network relationship graph. As shown in fig. 3, the network then continues the node relationship construction operation on the first network relationship graph according to the subsequently input remaining syllables corresponding to "chinese", until the speech to be recognized is completely input, so as to obtain the second network relationship graph. The second network relationship graph contains four word text paths; the target word text path with the highest path score is selected from them, and the speech recognition text "i is a chinese person", formed by the word text nodes "me", "yes", "medium", "country", and "person" on the target path, is taken as the final speech recognition text.
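The two-phase construction in this example can be summarized, under stated assumptions, by the sketch below; extend_graph, enumerate_paths, and score are hypothetical helpers standing in for the node relationship construction and path scoring steps described in the text.

```python
# Hedged sketch: extend the relationship graph syllable by syllable, then
# read out the best word text path once the input is complete.
def recognize(syllables, extend_graph, enumerate_paths, score):
    graph = None
    for syllable in syllables:                 # first network relationship graph,
        graph = extend_graph(graph, syllable)  # then the second once input ends
    paths = enumerate_paths(graph)             # e.g. the four paths of fig. 3
    best = max(paths, key=score)               # target path with the highest score
    return " ".join(best)                      # word texts along the path, in edge order
```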
In the above embodiment, the speech processing apparatus performs the node relationship construction operation according to the syllables input into the speech recognition graph neural network to obtain the first network relationship graph. In one implementation, a first network subgraph corresponding to the input syllable is obtained; a second network subgraph constructed from the word text nodes having a jump relationship with the first network subgraph is determined; and the node relationship construction operation is performed on the first network subgraph and the second network subgraph to obtain the first network relationship graph. Specifically, the speech processing apparatus acquires the input first syllable, determines the word text nodes corresponding to it, and takes those nodes as the first network subgraph; it then determines the word text nodes having a jump relationship with the nodes of the first syllable, forms them into the second network subgraph, and connects the corresponding word text nodes of the two subgraphs to obtain the first network relationship graph for the input first syllable. In this way, the node relationship construction operation proceeds directly from the input syllable, determining both the first network subgraph for the input syllable and the second network subgraph to which the input speech is about to jump.
In the above embodiment, after the node relationship construction operation on the first and second network subgraphs yields the first network relationship graph, the word text nodes in the first network relationship graph that have no mapping relationship with the subsequently input second syllable are removed, giving a new first network subgraph corresponding to the syllables input so far. That is, after obtaining the first network relationship graph for the first syllable, the speech processing apparatus determines the word text nodes corresponding to the second syllable, removes the nodes without a mapping relationship with it (equivalently, activates the nodes that do have a mapping relationship with the second syllable), and forms the first network subgraph corresponding to the first two syllables. Repeating this for each subsequent syllable yields a new first network subgraph for each input. Thus, whenever a target syllable arrives, the first network relationship graph activates the nodes corresponding to that syllable and removes the others, which improves the accuracy of speech recognition.
For example, fig. 4 shows the process of constructing the first network subgraph corresponding to the input syllables. The speech processing apparatus obtains the input first syllable "wo", determines the word text node corresponding to it, and takes that node as the first network subgraph; it then determines the word text nodes having a jump relationship with that node, such as "yes", "called", "have", "present", and "today", forms them into the second network subgraph, and connects the corresponding word text nodes of the two subgraphs to obtain the first network relationship graph for the input first syllable. On receiving the subsequently input second syllable "shi", the apparatus removes the word text nodes "called", "have", "present", and "today", which have no mapping relationship with that syllable, thereby activating the word text node "yes", which does, and forms the first network subgraph corresponding to the first two input syllables. The same applies, by analogy, to each subsequent input syllable.
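A minimal sketch of this activation step follows, assuming a set-based frontier of graph nodes and the toy syllable mapping shown in the call; both are assumptions for the example.

```python
# When the next syllable arrives, keep only frontier nodes it maps to
# ("activation") and remove the rest.
def activate(frontier_nodes, next_syllable, syllable_to_nodes):
    mapped = set(syllable_to_nodes.get(next_syllable, []))
    return frontier_nodes & mapped  # nodes without a mapping relation are removed

frontier = {"yes", "called", "have", "present", "today"}
print(activate(frontier, "shi", {"shi": ["yes"]}))  # -> {'yes'}
```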
In an alternative embodiment, the speech processing apparatus determines the second network subgraph constructed from the word text nodes having a jump relationship with the first network subgraph as follows: candidate word text nodes having a jump relationship with the first network subgraph are determined; according to the jump probabilities of the directed edges between the first network subgraph and the candidate word text nodes, target word text nodes whose jump probability is greater than or equal to a set probability threshold are selected from the candidates; and the subgraph having a jump relationship with the target word text nodes is determined, giving the second network subgraph. It should be noted that the embodiments of the present disclosure do not limit the set probability threshold, which may be adjusted according to the actual situation. Screening the candidate word text nodes that have a jump relationship with the first network subgraph yields a smaller second network subgraph, which reduces the computation required for speech recognition and improves its efficiency.
For example, as shown in fig. 4, the speech processing apparatus takes the word text node corresponding to the first syllable "wo" as the first network subgraph and determines the candidate word text nodes having a jump relationship with it: "yes", "called", "have", "present", and "today", whose directed edges from the first network subgraph carry jump probabilities of 0.98, 0.8, 0.85, 0.9, and 0.5 respectively. With a set probability threshold of 0.7, the target word text nodes "yes", "called", "have", and "present" are selected from the candidates, and a network graph constructed from these target word text nodes gives the second network subgraph.
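For illustration, the threshold screening of this example can be written as the short sketch below; the candidate names and dictionary layout are assumptions, while the probabilities and the 0.7 threshold are the figures quoted above.

```python
# Screen candidate word text nodes by the jump probability of their edges.
candidates = {"yes": 0.98, "called": 0.8, "have": 0.85,
              "present": 0.9, "today": 0.5}
PROBABILITY_THRESHOLD = 0.7

target_nodes = [node for node, p in candidates.items()
                if p >= PROBABILITY_THRESHOLD]
print(target_nodes)  # "today" (0.5) falls below the threshold and is screened out
```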
In an optional embodiment, the speech processing apparatus performs the node relationship construction operation on the first network subgraph and the second network subgraph to obtain the first network relationship graph as follows: the two subgraphs are connected to obtain a connected subgraph, and an optimization operation is performed on the connected subgraph to obtain the first network relationship graph. Optimizing the connected subgraph reduces the computation required for speech recognition and improves its efficiency. The optimization performed by the speech processing apparatus includes, but is not limited to, at least one of the following operations:
optimizing operation one: screening out word text paths which correspond to the same voice recognition texts from the word text paths contained in the unicom subgraphs; and carrying out duplication removal operation on the word text paths which are screened out and correspond to the same voice recognition texts.
Optimization operation two: if two identical syllables are input in succession, judging whether the two syllables are continuous; if they are continuous, removing one of the two word text nodes corresponding to them.
Optimization operation three: calculating the path score of each word text path contained in the connected subgraph, and removing word text paths whose path score is less than a path score threshold. It should be noted that the embodiments of the present application do not limit the path score threshold, which may be adjusted according to the actual situation.
In optimization operation one, the connected subgraph may contain word text paths corresponding to the same speech recognition text. For example, the path formed by the word text nodes "i", "yes", "medium", "country", and "person" and the path formed by the word text nodes "i", "yes", and "chinese" both correspond to the same speech recognition text "i is a chinese"; the duplicate paths are removed and only one word text path is retained. The rule for removal may be random, or may be based on the path scores of the word text paths, which the embodiments of the present application do not limit.
In optimization operation two, for example, for two identical syllables "zhong" input in succession, voice continuity detection is performed on the corresponding speech segments. If the segments are continuous, the two identical syllables are continuous and one of their two word text nodes is removed; if the segments are discontinuous, the two identical syllables are not continuous and both word text nodes are retained.
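A hedged sketch of this merging rule follows; is_continuous is a hypothetical voice continuity detector, since the patent does not specify how continuity detection is implemented.

```python
# Collapse two identical consecutive syllables onto one word text node only
# when their audio segments are continuous.
def merge_repeats(syllables, segments, is_continuous):
    kept = []
    for i, syllable in enumerate(syllables):
        if (i > 0 and syllable == syllables[i - 1]
                and is_continuous(segments[i - 1], segments[i])):
            continue  # drop one of the two word text nodes
        kept.append(syllable)
    return kept
```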
In optimization operation three, for each word text path contained in the connected subgraph, the transition probability of each directed edge is calculated from the transfer score and the weight of that edge; the sum of the transition probabilities of the directed edges between all word text nodes on a path is taken as that path's score, and word text paths whose score is below the path score threshold are removed.
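The scoring and pruning of optimization operation three can be sketched as below. Note that deriving an edge's transition probability as the product of its transfer score and weight is an assumption for the example; the text states only that the probability is calculated from the two quantities.

```python
# Path score = sum of per-edge transition probabilities along the path.
def path_score(path, transfer_score, weight):
    return sum(transfer_score[(a, b)] * weight[(a, b)]
               for a, b in zip(path, path[1:]))

def prune_paths(paths, transfer_score, weight, threshold):
    return [p for p in paths
            if path_score(p, transfer_score, weight) >= threshold]
```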
In the above embodiment, the speech processing apparatus determines the speech recognition text according to the word text paths contained in the second network relationship graph. In one implementation, the path score of each word text path in the second network relationship graph is calculated; a target word text path whose score satisfies a path score condition is selected; and the text formed by the word text nodes of the target path, read in the direction of the directed edges, is taken as the speech recognition text. For each word text path, the transition probabilities are computed from the transfer scores and weights of the directed edges between its word text nodes, and their sum is taken as the path score; the path with the highest score is then selected, and the text along it becomes the speech recognition text. Determining the word text paths of the second network relationship graph and selecting the target path that satisfies the score condition improves the accuracy of speech recognition.
For example, as shown in fig. 3, the second network relationship graph contains four word text paths. The first word text path is formed by the word text nodes "I", "yes", "middle", "country", and "person"; the second by "I", "yes", "middle", "Hua", and "person"; the third by "I", "yes", "seed", "plant", and "person"; and the fourth by "I", "yes", "seed", and "child". The path score of each word text path is calculated from the transfer scores and weights of the directed edges between its word text nodes: the transition probability of each directed edge is derived from its transfer score and weight, and the sum of the transition probabilities along a path is that path's score. The four path scores are 80, 75, 50, and 5 respectively.
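Putting the quoted figures together, a minimal sketch of the final selection, with English glosses standing in for the word texts of fig. 3:

```python
# The four word text paths of fig. 3 with the path scores quoted above;
# the highest-scoring path is read out as the speech recognition text.
paths = {
    ("I", "yes", "middle", "country", "person"): 80,
    ("I", "yes", "middle", "Hua", "person"): 75,
    ("I", "yes", "seed", "plant", "person"): 50,
    ("I", "yes", "seed", "child"): 5,
}
best_path = max(paths, key=paths.get)  # the first path, score 80
print(" ".join(best_path))             # the target path becomes the recognition text
```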
With reference to the above description of the embodiments, fig. 5 is a flowchart illustrating a speech processing method according to a second embodiment. As shown in fig. 5, the method comprises:
s501: acquiring a voice to be recognized; wherein the speech to be recognized comprises a plurality of syllables;
s502: inputting the voice to be recognized into a voice recognition graph neural network;
s503: utilizing the voice recognition graph neural network to carry out node relation construction operation according to the input syllables of the input voice recognition graph neural network so as to obtain a first network relation graph;
s504: according to the remaining syllables of the subsequently input speech to be recognized, the node relation construction operation is continuously carried out on the first network relation graph until the speech to be recognized is input completely, so that a second network relation graph is obtained;
s505: and determining the voice recognition text according to the word text path contained in the second network relation graph.
In this embodiment, the implementation manner of each step in this embodiment may refer to the description of the content of the corresponding part in each embodiment, and this embodiment may also achieve the beneficial effects of the corresponding embodiment part, which is not described herein again.
When the speech recognition graph neural network of the above embodiments is integrated in a smart device, with reference to the description of those embodiments, fig. 6 is a schematic flowchart of a speech processing method provided by the third embodiment. As shown in fig. 6, the method includes:
s601: responding to voice input operation by the intelligent equipment to obtain voice to be recognized;
s602: responding to the voice recognition operation by the intelligent equipment, and inputting the voice to be recognized into a voice recognition graph neural network for voice recognition to obtain a voice recognition text;
s603: and displaying the voice recognition text.
For example, the smart device presents an interface on the display screen, where the interface includes a voice entry control and a voice recognition control; responding to the triggering operation of the user on the voice input control by the intelligent equipment, and acquiring the voice to be recognized input by the user by using a voice acquisition device on the intelligent equipment; responding to the triggering operation of a user on the voice recognition control by the intelligent equipment, and performing voice recognition by using a voice recognition graph neural network integrated in the intelligent equipment to obtain a voice recognition text; the intelligent device displays the speech recognition text on the display screen.
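For illustration, this interaction flow might be organized as the sketch below; capture_audio, recognition_network, and display are hypothetical stand-ins for the device's voice collection apparatus, the integrated speech recognition graph neural network, and its display screen.

```python
# Hedged sketch of the device flow: one handler captures audio when the voice
# entry control is triggered, another runs recognition and shows the result.
class SmartDeviceUI:
    def __init__(self, capture_audio, recognition_network, display):
        self.capture_audio = capture_audio
        self.network = recognition_network
        self.display = display
        self.pending_speech = None

    def on_voice_entry_control(self):        # S601: acquire speech to be recognized
        self.pending_speech = self.capture_audio()

    def on_voice_recognition_control(self):  # S602/S603: recognize and display
        text = self.network(self.pending_speech)
        self.display(text)
```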
In the above method embodiments, the speech processing apparatus acquires speech to be recognized and inputs it into a trained speech recognition graph neural network to obtain a speech recognition text. The speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting pairs of word text nodes, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes. As syllables are input in sequence, the directed edges of the word text nodes corresponding to each syllable determine the jump relationships between word text nodes, forming the path along which the speech to be recognized jumps from node to node; this path yields the final speech recognition result and provides a new mode of speech recognition.
Fig. 7a is a block diagram of a speech processing apparatus 70 according to an exemplary embodiment of the present disclosure. As shown in fig. 7a, the speech processing apparatus 70 includes: an acquisition module 71 and a speech recognition module 72.
The acquisition module 71 is configured to acquire speech to be recognized, where the speech to be recognized comprises a plurality of syllables;
the speech recognition module 72 is configured to input the speech to be recognized into a speech recognition graph neural network to obtain a speech recognition text; wherein the speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting the word text nodes; the word text nodes represent word texts, the directed edges represent the jump relationships, in the time dimension, of the syllables input to the network, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes.
Alternatively, fig. 7b is a block diagram of another speech processing apparatus 70 according to an exemplary embodiment of the disclosure. As shown in fig. 7b, the speech recognition module 72 includes a first network construction sub-module 721, a second network construction sub-module 722 and a determination sub-module 723;
a first network construction sub-module 721, configured to perform, by means of the speech recognition graph neural network, a node relationship construction operation according to the syllables input into the network, so as to obtain a first network relationship graph;
the second network construction sub-module 722 is configured to continue the node relationship construction operation on the first network relationship graph according to the remaining syllables of the subsequently input speech to be recognized, until the speech to be recognized is completely input, so as to obtain a second network relationship graph;
the determining submodule 723 is configured to determine the speech recognition text according to a word text path included in the second network relationship graph, where the word text path is a path formed by sequentially connecting word text nodes included in the neural network of the speech recognition graph through corresponding directed edges.
Optionally, when performing the node relationship construction operation according to the syllables input into the speech recognition graph neural network to obtain the first network relationship graph, the first network construction sub-module 721 is configured to: obtain a first network subgraph corresponding to the input syllables; determine a second network subgraph constructed from word text nodes having a jump relationship with the first network subgraph; and perform the node relationship construction operation on the first network subgraph and the second network subgraph to obtain the first network relationship graph.
Optionally, the first network construction sub-module 721 is configured to, when determining a second network subgraph constructed by word text nodes having a jump relationship with the first network subgraph, determine candidate word text nodes having a jump relationship with the first network subgraph; selecting target word text nodes with the jump probability larger than or equal to a set probability threshold from the candidate word text nodes according to the jump probability of the directed edge between the first network subgraph and the candidate word text nodes; and determining a subgraph having a jumping relation with the target word text node to obtain a second network subgraph.
Optionally, the first network constructing sub-module 721 is configured to connect the first network subgraph and the second network subgraph to obtain a connected subgraph when performing a node relationship constructing operation on the first network subgraph and the second network subgraph to obtain a first network relationship graph; and carrying out optimization operation on the connected subgraphs to obtain a first network relation graph.
Optionally, the first network construction sub-module 721 performs an optimization operation on the connected subgraph, which includes at least one of the following operations:
screening out, from the word text paths contained in the connected subgraph, word text paths corresponding to the same speech recognition text, and performing a deduplication operation on them;
if two identical syllables are input continuously, judging whether the two identical syllables are continuous; if the two same syllables are continuous, removing a word text node from the two word text nodes corresponding to the two same syllables;
calculating the path score of each word text path contained in the connected subgraph, and removing word text paths whose path score is less than a path score threshold.
Optionally, the determining sub-module 723 is configured to calculate a path score of each word text path included in the second network relationship diagram when determining the speech recognition text according to the word text path included in the second network relationship diagram; selecting a target word text path with a path score meeting a path score condition from all word text paths contained in the second network relation graph; and taking a text formed by all word text nodes contained in the target word text path according to the directional edge direction as a voice recognition text.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 8 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the electronic device 800 includes a computing unit 801, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. The RAM 803 can also store the various programs and data required for the operation of the device 800. The computing unit 801, the ROM 802, and the RAM 803 are connected to one another by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
A number of components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard or a mouse; an output unit 807, such as various types of displays and speakers; a storage unit 808, such as a magnetic disk or an optical disc; and a communication unit 809, such as a network card, a modem, or a wireless communication transceiver. The communication unit 809 allows the device 800 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
The computing unit 801 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 801 executes the methods and processes described above, such as the speech processing method. For example, in some embodiments, the speech processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the speech processing method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the speech processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network; their relationship arises from computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that remedies the difficulty of management and the weak service scalability of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
Accordingly, an embodiment of the present disclosure further provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, causes the one or more processors to perform the steps in the method embodiment of fig. 1.
Accordingly, an embodiment of the present disclosure further provides a computer program product comprising a computer program/instructions which, when executed by a processor, perform the steps in the method embodiment of fig. 1.
In the above apparatus, device, storage medium, and program product embodiments of the present disclosure, the speech processing apparatus acquires speech to be recognized and inputs it into a trained speech recognition graph neural network to obtain a speech recognition text. The speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting pairs of word text nodes, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes. As syllables are input in sequence, the directed edges of the word text nodes corresponding to each syllable determine the jump relationships between word text nodes, forming the path along which the speech to be recognized jumps from node to node; this path yields the final speech recognition result and provides a new mode of speech recognition.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (18)

1. A method of speech processing comprising:
acquiring a voice to be recognized, wherein the voice to be recognized comprises a plurality of syllables;
inputting the voice to be recognized into a voice recognition graph neural network to obtain a voice recognition text;
wherein the speech recognition graph neural network comprises a plurality of word text nodes and directed edges connecting the word text nodes; the word text nodes represent word texts, the directed edges between the word text nodes represent the jump relationships, in the time dimension, of the syllables input to the speech recognition graph neural network, and at least one syllable in the speech to be recognized has a mapping relationship with at least one of the word text nodes.
2. The method of claim 1, wherein the inputting the speech to be recognized into a speech recognition graph neural network to obtain speech recognition text comprises:
performing, by the speech recognition graph neural network, a node relation construction operation according to the syllables already input into the speech recognition graph neural network, so as to obtain a first network relation graph;
continuing the node relation construction operation on the first network relation graph according to the subsequently input remaining syllables of the voice to be recognized, until the voice to be recognized has been completely input, so as to obtain a second network relation graph;
and determining the speech recognition text according to a word text path contained in the second network relation graph, wherein the word text path is a path formed by sequentially connecting word text nodes contained in the speech recognition graph neural network through the corresponding directed edges.
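A minimal sketch of claim 2's two-stage construction, under the assumption that a graph is a dict holding nodes, edges, and a frontier of the most recently added nodes; extend_graph and all data here are hypothetical:

syllable_to_nodes = {"jin1tian1": ["today"], "tian1qi4": ["weather"]}
directed_edges = {("today", "weather"): 0.8}

def extend_graph(graph, syllable):
    # Node relation construction: attach the word text nodes mapped to this
    # syllable to the current frontier via their directed edges.
    new_frontier = []
    for word in syllable_to_nodes.get(syllable, []):
        if not graph["frontier"]:  # the very first input syllable
            new_frontier.append(word)
            continue
        for prev in graph["frontier"]:
            prob = directed_edges.get((prev, word))
            if prob is not None:
                graph["edges"][(prev, word)] = prob
                new_frontier.append(word)
    graph["nodes"].update(new_frontier)
    graph["frontier"] = new_frontier
    return graph

graph = {"nodes": set(), "edges": {}, "frontier": []}
graph = extend_graph(graph, "jin1tian1")   # first network relation graph
graph = extend_graph(graph, "tian1qi4")    # extended to the second graph
print(graph["edges"])                      # -> {('today', 'weather'): 0.8}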
3. The method of claim 2, wherein the performing a node relation construction operation according to the syllables that have been input into the speech recognition graph neural network to obtain a first network relation graph comprises:
acquiring a first network subgraph corresponding to the input syllable;
determining a second network subgraph constructed by word text nodes having the jump relation with the first network subgraph;
and carrying out node relation construction operation on the first network subgraph and the second network subgraph to obtain the first network relation graph.
4. The method of claim 3, wherein the determining a second network subgraph constructed from word text nodes having the jump relation with the first network subgraph comprises:
determining candidate word text nodes having the jump relation with the first network subgraph;
selecting, from the candidate word text nodes, target word text nodes whose jump probability is greater than or equal to a set probability threshold, according to the jump probabilities of the directed edges between the first network subgraph and the candidate word text nodes;
and determining the subgraph having the jump relation with the target word text nodes, so as to obtain the second network subgraph.
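A hedged sketch of the threshold test in claim 4; the threshold value, node names, and helper function are assumptions:

JUMP_PROB_THRESHOLD = 0.3  # hypothetical value for the set probability threshold

def select_target_nodes(subgraph_nodes, directed_edges, threshold=JUMP_PROB_THRESHOLD):
    # Keep candidate word text nodes whose jump probability from any node of
    # the first network subgraph is at least the threshold.
    targets = set()
    for (src, dst), prob in directed_edges.items():
        if src in subgraph_nodes and prob >= threshold:
            targets.add(dst)
    return targets

edges = {("today", "weather"): 0.8, ("today", "whether"): 0.2}
print(select_target_nodes({"today"}, edges))  # -> {'weather'}; 'whether' is pruned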
5. The method of claim 3, wherein the performing a node relation construction operation on the first network subgraph and the second network subgraph to obtain the first network relation graph comprises:
connecting the first network subgraph and the second network subgraph to obtain a connected subgraph;
and performing an optimization operation on the connected subgraph to obtain the first network relation graph.
6. The method of claim 5, wherein the optimization operation on the connected subgraph comprises at least one of:
screening out, from the word text paths contained in the connected subgraph, word text paths whose corresponding speech recognition texts are identical, and performing a de-duplication operation on the screened word text paths;
if two identical syllables are input successively, determining whether the two identical syllables are consecutive, and if so, removing one word text node of the two word text nodes corresponding to the two identical syllables;
calculating path scores of all word text paths contained in the connected subgraph, and removing word text paths whose path scores are less than a path score threshold.
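The three optimization operations could be sketched as below; the path representation, a list of (speech recognition text, path score) pairs, and the threshold are assumptions rather than the patented implementation:

def deduplicate_paths(paths):
    # De-duplicate word text paths whose speech recognition texts are
    # identical, keeping the highest-scoring one.
    best = {}
    for text, score in paths:
        if text not in best or score > best[text]:
            best[text] = score
    return list(best.items())

def collapse_consecutive_duplicates(nodes):
    # If two identical syllables are input consecutively, keep only one of
    # the two corresponding word text nodes.
    out = []
    for node in nodes:
        if not out or out[-1] != node:
            out.append(node)
    return out

def prune_by_score(paths, threshold=0.5):
    # Remove word text paths whose path score is below the threshold.
    return [(t, s) for t, s in paths if s >= threshold]

paths = [("today weather", 0.9), ("today weather", 0.7), ("today whether", 0.2)]
print(prune_by_score(deduplicate_paths(paths)))  # -> [('today weather', 0.9)]
print(collapse_consecutive_duplicates(["today", "today", "weather"]))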
7. The method of claim 3, wherein after performing a node relation construction operation on the first network subgraph and the second network subgraph to obtain the first network relation graph, the method further comprises:
removing, according to a subsequently input target syllable, word text nodes in the first network relation graph that do not have the mapping relation with the target syllable, so as to obtain a third network subgraph corresponding to the currently input syllable.
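Claim 7's pruning step, sketched under the same assumed graph shape as in the earlier fragments (node and edge names are invented):

def prune_unmapped_nodes(graph, target_syllable, syllable_to_nodes):
    # Remove word text nodes that have no mapping relation with the target
    # syllable; the survivors form the third network subgraph.
    keep = set(syllable_to_nodes.get(target_syllable, []))
    nodes = {n for n in graph["nodes"] if n in keep}
    edges = {(s, d): p for (s, d), p in graph["edges"].items()
             if s in nodes and d in nodes}
    return {"nodes": nodes, "edges": edges}

graph = {"nodes": {"today", "weather", "whether"},
         "edges": {("today", "weather"): 0.8, ("today", "whether"): 0.2}}
print(prune_unmapped_nodes(graph, "tian1qi4", {"tian1qi4": ["weather"]}))
# -> {'nodes': {'weather'}, 'edges': {}}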
8. The method of claim 2, wherein the determining the speech recognition text according to a word text path contained in the second network relation graph comprises:
calculating a path score of each word text path contained in the second network relation graph;
selecting, from all word text paths contained in the second network relation graph, a target word text path whose path score meets a path score condition;
and taking, as the speech recognition text, a text formed by all word text nodes contained in the target word text path according to the direction of the directed edges.
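Finally, a sketch of the scoring-and-selection step of claim 8, scoring a word text path by the sum of log jump probabilities along its directed edges and emitting the nodes of the best path in edge order; both the representation and the scoring rule are assumptions:

import math

def best_path_text(paths, directed_edges):
    # paths: lists of word text nodes in directed-edge order (assumed form).
    def path_score(path):
        return sum(math.log(directed_edges[(a, b)])
                   for a, b in zip(path, path[1:]))
    return " ".join(max(paths, key=path_score))

edges = {("today", "weather"): 0.8, ("today", "whether"): 0.2}
print(best_path_text([["today", "weather"], ["today", "whether"]], edges))
# -> "today weather"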
9. A speech processing apparatus comprising:
an acquisition module, configured to acquire a voice to be recognized, wherein the voice to be recognized comprises a plurality of syllables;
a voice recognition module, configured to input the voice to be recognized into a speech recognition graph neural network to obtain a speech recognition text; wherein the speech recognition graph neural network comprises a plurality of word text nodes and directed edges connected between any two word text nodes; the word text nodes represent word texts, the directed edges between the word text nodes represent the jump relation, in the time dimension, of the syllables input to the speech recognition graph neural network, and at least one syllable in the speech to be recognized has a mapping relation with at least one of the word text nodes.
10. The speech processing apparatus of claim 9, wherein the speech recognition module comprises: a first network construction submodule, a second network construction submodule and a determination submodule;
the first network construction submodule is configured to perform, using the speech recognition graph neural network, a node relation construction operation according to the syllables already input into the speech recognition graph neural network, so as to obtain a first network relation graph;
the second network construction submodule is configured to continue the node relation construction operation on the first network relation graph according to the subsequently input remaining syllables of the voice to be recognized, until the voice to be recognized has been completely input, so as to obtain a second network relation graph;
the determining submodule is configured to determine the speech recognition text according to a word text path contained in the second network relation graph, wherein the word text path is a path formed by sequentially connecting word text nodes contained in the speech recognition graph neural network through the corresponding directed edges.
11. The speech processing apparatus according to claim 10, wherein the first network construction sub-module, when performing a node relation construction operation based on an input syllable that has been input to the speech recognition graph neural network to obtain a first network relation graph, is configured to obtain a first network subgraph corresponding to the input syllable; determining a second network subgraph constructed by word text nodes having the jump relation with the first network subgraph; and carrying out node relation construction operation on the first network subgraph and the second network subgraph to obtain the first network relation graph.
12. The speech processing apparatus according to claim 11, wherein the first network construction submodule, when determining a second network subgraph constructed from word text nodes having the jump relation with the first network subgraph, is configured to determine candidate word text nodes having the jump relation with the first network subgraph; select, from the candidate word text nodes, target word text nodes whose jump probability is greater than or equal to a set probability threshold, according to the jump probabilities of the directed edges between the first network subgraph and the candidate word text nodes; and determine the subgraph having the jump relation with the target word text nodes, so as to obtain the second network subgraph.
13. The speech processing apparatus according to claim 11, wherein the first network construction submodule, when performing a node relation construction operation on the first network subgraph and the second network subgraph to obtain the first network relation graph, is configured to connect the first network subgraph and the second network subgraph to obtain a connected subgraph; and perform an optimization operation on the connected subgraph to obtain the first network relation graph.
14. The speech processing apparatus according to claim 13, wherein the optimization operation performed by the first network construction submodule on the connected subgraph comprises at least one of:
screening out, from the word text paths contained in the connected subgraph, word text paths whose corresponding speech recognition texts are identical, and performing a de-duplication operation on the screened word text paths;
if two identical syllables are input successively, determining whether the two identical syllables are consecutive, and if so, removing one word text node of the two word text nodes corresponding to the two identical syllables;
calculating path scores of all word text paths contained in the connected subgraph, and removing word text paths whose path scores are less than a path score threshold.
15. The speech processing apparatus according to claim 11, wherein the first network construction submodule, after performing a node relation construction operation on the first network subgraph and the second network subgraph to obtain the first network relation graph, is further configured to remove, according to a subsequently input target syllable, word text nodes in the first network relation graph that do not have the mapping relation with the target syllable, so as to obtain a third network subgraph corresponding to the currently input syllable.
16. The speech processing apparatus according to claim 10, wherein the determining submodule, when determining the speech recognition text according to the word text paths contained in the second network relation graph, is configured to calculate a path score of each word text path contained in the second network relation graph; select, from all word text paths contained in the second network relation graph, a target word text path whose path score meets a path score condition; and take, as the speech recognition text, a text formed by all word text nodes contained in the target word text path according to the direction of the directed edges.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the steps of the method according to any one of claims 1-8.
CN202111472014.XA 2021-12-03 2021-12-03 Voice processing method, device, equipment, storage medium and program product Active CN114399998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111472014.XA CN114399998B (en) 2021-12-03 2021-12-03 Voice processing method, device, equipment, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111472014.XA CN114399998B (en) 2021-12-03 2021-12-03 Voice processing method, device, equipment, storage medium and program product

Publications (2)

Publication Number Publication Date
CN114399998A CN114399998A (en) 2022-04-26
CN114399998B true CN114399998B (en) 2022-09-27

Family

ID=81225603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111472014.XA Active CN114399998B (en) 2021-12-03 2021-12-03 Voice processing method, device, equipment, storage medium and program product

Country Status (1)

Country Link
CN (1) CN114399998B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050433B (en) * 2023-02-13 2024-03-26 北京百度网讯科技有限公司 Scene adaptation method, device, equipment and medium of natural language processing model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN112435652A (en) * 2020-09-29 2021-03-02 江苏清微智能科技有限公司 Voice keyword recognition system and method based on graph convolution neural network
CN113436612A (en) * 2021-06-23 2021-09-24 平安科技(深圳)有限公司 Intention recognition method, device and equipment based on voice data and storage medium
CN113590798A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Dialog intention recognition, training method for model for recognizing dialog intention
CN113593606A (en) * 2021-09-30 2021-11-02 清华大学 Audio recognition method and device, computer equipment and computer-readable storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102380833B1 (en) * 2014-12-02 2022-03-31 삼성전자주식회사 Voice recognizing method and voice recognizing appratus
US10585988B2 (en) * 2018-06-08 2020-03-10 Microsoft Technology Licensing, Llc Graph representations for identifying a next word

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111243599A (en) * 2020-01-13 2020-06-05 网易有道信息技术(北京)有限公司 Speech recognition model construction method, device, medium and electronic equipment
CN112435652A (en) * 2020-09-29 2021-03-02 江苏清微智能科技有限公司 Voice keyword recognition system and method based on graph convolution neural network
CN112417539A (en) * 2020-11-16 2021-02-26 杭州群核信息技术有限公司 Method, device and system for designing house type based on language description
CN113436612A (en) * 2021-06-23 2021-09-24 平安科技(深圳)有限公司 Intention recognition method, device and equipment based on voice data and storage medium
CN113590798A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Dialog intention recognition, training method for model for recognizing dialog intention
CN113593606A (en) * 2021-09-30 2021-11-02 清华大学 Audio recognition method and device, computer equipment and computer-readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Graph-Based Semisupervised Learning for Acoustic; Yuzong Liu et al.; IEEE/ACM Transactions on Audio, Speech, and Language Processing; 2016-11-30; Vol. 24, No. 11; pp. 1941-1955 *

Also Published As

Publication number Publication date
CN114399998A (en) 2022-04-26

Similar Documents

Publication Publication Date Title
CN112487173B (en) Man-machine conversation method, device and storage medium
CN112597754B (en) Text error correction method, apparatus, electronic device and readable storage medium
CN112560496A (en) Training method and device of semantic analysis model, electronic equipment and storage medium
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
CN113129870A (en) Training method, device, equipment and storage medium of speech recognition model
CN112307188B (en) Dialog generation method, system, electronic device and readable storage medium
CN112528641A (en) Method and device for establishing information extraction model, electronic equipment and readable storage medium
US20220358955A1 (en) Method for detecting voice, method for training, and electronic devices
CN113360700A (en) Method, device, equipment and medium for training image-text retrieval model and image-text retrieval
CN114399998B (en) Voice processing method, device, equipment, storage medium and program product
CN113468857A (en) Method and device for training style conversion model, electronic equipment and storage medium
CN114758649B (en) Voice recognition method, device, equipment and medium
CN115292467A (en) Information processing and model training method, apparatus, device, medium, and program product
CN114973333A (en) Human interaction detection method, human interaction detection device, human interaction detection equipment and storage medium
CN114756691A (en) Structure chart generation method, model training method, map generation method and device
CN114817476A (en) Language model training method and device, electronic equipment and storage medium
CN114067805A (en) Method and device for training voiceprint recognition model and voiceprint recognition
CN114119972A (en) Model acquisition and object processing method and device, electronic equipment and storage medium
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN113553413A (en) Dialog state generation method and device, electronic equipment and storage medium
CN114078274A (en) Face image detection method and device, electronic equipment and storage medium
CN113689866A (en) Training method and device of voice conversion model, electronic equipment and medium
CN113689867B (en) Training method and device of voice conversion model, electronic equipment and medium
US20230004774A1 (en) Method and apparatus for generating node representation, electronic device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant