CN114360528B - Speech recognition method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114360528B
CN114360528B (application CN202210006614.5A)
Authority
CN
China
Prior art keywords
node
target
text data
elements
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210006614.5A
Other languages
Chinese (zh)
Other versions
CN114360528A (en)
Inventor
田晋川
余剑威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210006614.5A
Publication of CN114360528A
Application granted
Publication of CN114360528B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a voice recognition method, a voice recognition device, computer equipment and a storage medium, belonging to the technical field of computers. The method comprises the following steps: recognizing the voice data to obtain target text data, and determining each character in the target text data and each word obtained by dividing the target text data into words as a first element of the target text data; acquiring the weight of each first element; determining the confidence of the target text data based on a plurality of first elements in the target text data and the weight of each first element; and determining the target text data and the confidence of the target text data as the recognition result of the voice data. When determining whether the text data can serve as the text data matched with the voice data, the method considers whether each first element accords with language logic, thereby drawing on more information and improving the accuracy of voice recognition.

Description

Speech recognition method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voice recognition method, a voice recognition device, computer equipment and a storage medium.
Background
Speech recognition technology recognizes voice data to obtain text data matched with the voice data, and in recent years it has been widely applied. In the related art, when recognizing voice data, the voice data itself is generally recognized directly and the recognition result is used as the text data corresponding to the voice data; because little information is drawn on in the recognition process, the recognition accuracy is low.
Disclosure of Invention
The embodiment of the application provides a voice recognition method, a voice recognition device, computer equipment and a storage medium, which improve the recognition accuracy of voice signals. The technical scheme is as follows:
in one aspect, a method for speech recognition is provided, the method comprising:
identifying voice data to obtain target text data, determining each character in the target text data and each word obtained by dividing words of the target text data as a first element of the target text data, wherein the words comprise at least two characters;
Acquiring a weight of each first element, wherein the weight represents the correlation degree between the first element and a preamble element of the first element, and the preamble element of the first element comprises at least one element which is positioned in front of the first element and is adjacent to the first element in the target text data;
determining the confidence of the target text data based on a plurality of first elements in the target text data and the weight of each first element, wherein the confidence represents the credibility of the target text data matched with the voice data;
and determining the confidence level of the target text data and the target text data as a recognition result of the voice data.
Optionally, the acquiring the second architecture diagram includes:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein the second elements belonging to the same element set form the target text data, the numbers of characters contained in the second elements belonging to the same element set are the same, and the numbers of characters contained in the second elements belonging to different element sets are different;
and acquiring a second architecture diagram of the target text data based on the plurality of element sets of the target text data, wherein the second elements corresponding to each path in the second architecture diagram form the target text data, the weight of each second element represents the degree of correlation between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is located before the second element and is adjacent to the second element.
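As an illustrative sketch only (not part of the patent disclosure), the element-set division above can be pictured as cutting the text into consecutive chunks of each candidate length. The function name and the handling of a trailing short chunk are assumptions, since the claim does not spell out the case where the length does not divide the text evenly:

```python
def element_sets(text: str, max_len: int) -> dict:
    # For each chunk length k, cut the text into consecutive
    # k-character second elements; concatenated in order, the
    # elements of one set re-form the target text data.
    sets = {}
    for k in range(1, max_len + 1):
        sets[k] = [text[i:i + k] for i in range(0, len(text), k)]
    return sets
```

For "abcde" with max_len=2 this yields the single-character set and the two-character set ["ab", "cd", "e"], the final chunk being shorter by assumption.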
Optionally, the acquiring the second architecture diagram of the target text data based on the multiple element sets of the target text data includes:
creating a first node and M1 second nodes, wherein M1 is the number of first target elements in the plurality of second elements, and the first target elements comprise the first character in the target text data; based on each of the M1 first target elements, creating a connecting line pointing from the first node to one of the M1 second nodes, wherein the M1 connecting lines respectively correspond to a first target element and the weight of that first target element, and the first target elements corresponding to different connecting lines are different;
for each of the second nodes, creating M2 third nodes, wherein M2 is the number of second target elements in the plurality of second elements, and the second target elements comprise the first character after the first target element corresponding to the second node; based on each of the M2 second target elements, creating a connecting line pointing from the second node to one of the M2 third nodes, wherein the M2 connecting lines respectively correspond to a second target element and the weight of that second target element, and the second target elements corresponding to different connecting lines are different, until the second elements corresponding to each path from the first node form the target text data, thereby obtaining the second architecture diagram of the target text data.
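The node-by-node construction above can be simplified, for illustration, by identifying each node with a character offset into the text. This collapsed form (names and the weight function are hypothetical, not from the patent) keeps the key property that every path from offset 0 to the end re-forms the target text data:

```python
def build_second_graph(text, max_len, weight_fn):
    # Nodes are character offsets 0..len(text); an edge (i, j, elem, w)
    # carries the second element text[i:j] and its weight. Every path
    # from node 0 to node len(text) spells out the target text data.
    edges = []
    for i in range(len(text)):
        for k in range(1, max_len + 1):
            if i + k <= len(text):
                elem = text[i:i + k]
                edges.append((i, i + k, elem, weight_fn(elem)))
    return edges
```

With `weight_fn=len` and text "ab", the graph has three edges: the characters "a" and "b", and the two-character element "ab" spanning both.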
Optionally, the acquiring the second architecture diagram of the target text data based on the multiple element sets of the target text data further includes:
creating a first empty node, and creating a connecting line pointing from the first node to the first empty node;
creating M1 fourth nodes; based on each of the M1 first target elements, creating a connecting line pointing from the first empty node to one of the M1 fourth nodes, wherein the M1 connecting lines respectively correspond to a first target element and the weight of that first target element, and the first target elements corresponding to different connecting lines are different.
Optionally, the obtaining the target structure diagram of the target text data based on the plurality of first elements of the target text data and the weight of each first element includes:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
And adding the weight of each first element in the first architecture diagram to acquire the target architecture diagram.
Optionally, the acquiring the weight of each first element includes:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein the second elements belonging to the same element set form the target text data, the numbers of characters contained in the second elements belonging to the same element set are the same, and the numbers of characters contained in the second elements belonging to different element sets are different;
and determining the weight of a second element which is the same as the first element corresponding to each connecting line in the first architecture diagram as the weight of the first element, wherein the weight of each second element represents the correlation degree between the second element and the preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is positioned before the second element and is adjacent to the second element.
Optionally, before determining the weight of the second element, which is the same as the first element corresponding to each connection line in the first architecture diagram, as the weight of the first element, the method further includes:
For each second element in each set of elements:
acquiring a third number of the second elements in the corpus data;
acquiring a fourth number of second target segments in the corpus data, wherein the second target segments comprise the second elements and the preamble elements of the second elements;
a weight of the second element is determined based on a ratio between the fourth number and the third number.
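A minimal sketch of this count-based weighting (assuming, in parallel with the first-element case, that the ratio is the fourth number over the third; plain substring counting stands in for whatever corpus statistics a real implementation would use):

```python
def element_weight(element: str, preamble: str, corpus: str) -> float:
    # Third number: occurrences of the second element in the corpus.
    # Fourth number: occurrences of the second target segment, i.e.
    # the preamble element immediately followed by the element.
    third = corpus.count(element)
    fourth = corpus.count(preamble + element)
    return fourth / third if third else 0.0
```

The ratio is high when the element almost always follows that preamble in the corpus, i.e. when the pairing accords with language logic.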
Optionally, the determining the confidence of the first text data based on the plurality of first elements in the first text data and the weight of each first element includes:
acquiring a target architecture diagram of the first text data based on a plurality of first elements of the first text data and the weight of each first element;
and determining the confidence of the first text data based on the target architecture diagram of the first text data.
In another aspect, there is provided a speech recognition apparatus, the apparatus comprising:
the voice recognition module is used for recognizing the voice data to obtain target text data;
the element acquisition module is used for determining each character in the target text data and each word obtained by dividing the target text data into words as a first element of the target text data, wherein the words comprise at least two characters;
A weight acquisition module, configured to acquire a weight of each first element, where the weight represents a degree of correlation between the first element and a preamble element of the first element, and the preamble element of the first element includes at least one element located before and adjacent to the first element in the target text data;
the confidence determining module is used for determining the confidence of the target text data based on a plurality of first elements in the target text data and the weight of each first element, wherein the confidence represents the confidence degree of the target text data matched with the voice data;
and the recognition result determining module is used for determining the confidence level of the target text data and the target text data as the recognition result of the voice data.
Optionally, the confidence determining module includes:
an architecture diagram obtaining unit, configured to obtain a target architecture diagram of the target text data based on a plurality of first elements in the target text data and weights of each of the first elements, where the target architecture diagram includes a plurality of nodes and a plurality of connection lines;
And the confidence determining unit is used for determining the confidence of the target text data based on the target architecture diagram.
Each connecting line corresponds to a first element and the weight of the first element, each connecting line points to a termination node from a start node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the start node of the connecting line corresponding to the start character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and located before the first element corresponding to the connecting line taking the node as the start node.
Optionally, the architecture diagram obtaining unit is configured to:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
Acquiring a second architecture diagram, wherein the second architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each connecting line corresponds to a second element and a weight corresponding to the second element, and each connecting line points to a termination node from a starting node;
selecting a connecting wire corresponding to a second element identical to each first element in the first architecture diagram in the second architecture diagram according to the first architecture diagram, and a node connected with the connecting wire;
and removing nodes or connecting lines which do not belong to any path corresponding to the target text data from the selected nodes and connecting lines to obtain the target architecture diagram.
Optionally, the architecture diagram obtaining unit is configured to:
creating U+1 nodes, wherein U is the number of characters in the target text data, the xth node corresponds to the xth character in the target text data, U is a positive integer greater than 1, and x is any positive integer not greater than U;
and creating a connecting line from the node corresponding to the initial character in each first element to the next node of the node corresponding to the termination character in each first element, and obtaining the first architecture diagram.
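The U+1-node construction can be sketched as follows (illustrative only; 0-based offsets are used, so node x here corresponds to the patent's (x+1)-th node, and the word list is assumed to come from the word-division step):

```python
def build_first_graph(text, words):
    # U + 1 nodes for U characters; node i sits before the i-th
    # character (0-based). Each character contributes an edge
    # (i, i + 1, ch); each word found in the text contributes an edge
    # from the node before its start character to the node after its
    # end character.
    edges = [(i, i + 1, ch) for i, ch in enumerate(text)]
    for w in words:
        start = text.find(w)
        while start != -1:
            edges.append((start, start + len(w), w))
            start = text.find(w, start + 1)
    return len(text) + 1, edges
```

For "abc" with the word "ab", this gives 4 nodes, three character edges, and a word edge from node 0 to node 2.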
Optionally, the architecture diagram obtaining unit is configured to:
Dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein the second elements belonging to the same element set form the target text data, the numbers of characters contained in the second elements belonging to the same element set are the same, and the numbers of characters contained in the second elements belonging to different element sets are different;
and acquiring a second architecture diagram of the target text data based on the plurality of element sets of the target text data, wherein the second elements corresponding to each path in the second architecture diagram form the target text data, the weight of each second element represents the degree of correlation between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is located before the second element and is adjacent to the second element.
Optionally, the architecture diagram obtaining unit is configured to:
creating a first node and M1 second nodes, wherein M1 is the number of first target elements in the plurality of second elements, and the first target elements comprise the first character in the target text data; based on each of the M1 first target elements, creating a connecting line pointing from the first node to one of the M1 second nodes, wherein the M1 connecting lines respectively correspond to a first target element and the weight of that first target element, and the first target elements corresponding to different connecting lines are different;
for each of the second nodes, creating M2 third nodes, wherein M2 is the number of second target elements in the plurality of second elements, and the second target elements comprise the first character after the first target element corresponding to the second node; based on each of the M2 second target elements, creating a connecting line pointing from the second node to one of the M2 third nodes, wherein the M2 connecting lines respectively correspond to a second target element and the weight of that second target element, and the second target elements corresponding to different connecting lines are different, until the second elements corresponding to each path from the first node form the target text data, thereby obtaining the second architecture diagram of the target text data.
Optionally, the architecture diagram obtaining unit is configured to:
creating a first empty node, and creating a connecting line pointing from the first node to the first empty node;
creating M1 fourth nodes; based on each of the M1 first target elements, creating a connecting line pointing from the first empty node to one of the M1 fourth nodes, wherein the M1 connecting lines respectively correspond to a first target element and the weight of that first target element, and the first target elements corresponding to different connecting lines are different.
Optionally, the architecture diagram obtaining unit is configured to:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
and adding the weight of each first element in the first architecture diagram to acquire the target architecture diagram.
Optionally, the confidence determining module includes:
a node characteristic obtaining unit, configured to determine a node characteristic of a target node based on a weight corresponding to a connection line pointing to the target node in the target architecture diagram and a node characteristic of another node connected to the connection line, where the target node is any node in the target architecture diagram except a first node, and the node characteristic includes a text segment formed by a first element located on a connection line before the target node and a text confidence corresponding to the text segment;
a text confidence acquiring unit, configured to determine, when the target node is the last node in the target architecture diagram, a text confidence included in a node feature of the target node as a text confidence of the target text data, where the text confidence represents a probability that the target text data conforms to language logic;
the confidence coefficient obtaining unit is used for adjusting the text confidence coefficient to obtain the confidence coefficient of the target text data.
Optionally, the node feature acquiring unit is configured to:
under the condition that n connecting lines point to the target node in the target structure diagram, for each connecting line in the n connecting lines, obtaining the node characteristic corresponding to the connecting line based on the weight corresponding to the connecting line and the node characteristic of another node connected by the connecting line;
And determining the node characteristics of the target node based on the node characteristics corresponding to the n connecting lines.
Optionally, the node feature acquiring unit is configured to:
determining the node characteristic corresponding to the maximum text confidence in the node characteristics corresponding to the n connecting lines as the node characteristic of the target node; or,
and determining the average value of the node characteristics corresponding to the n connecting lines as the node characteristic of the target node.
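The two aggregation rules above amount to a forward pass over the target architecture diagram. A sketch under the assumptions (not stated in the patent) that nodes are numbered in topological order and weights multiply along a path:

```python
def node_confidence(edges, num_nodes, mode="max"):
    # edges: (src, dst, weight) triples; node 0 is the first node.
    # A node's feature is accumulated from its incoming connecting
    # lines: either the best incoming path ("max") or the average of
    # the incoming values ("mean").
    feat = [None] * num_nodes
    feat[0] = 1.0
    for dst in range(1, num_nodes):
        incoming = [feat[src] * w for src, d, w in edges
                    if d == dst and feat[src] is not None]
        if incoming:
            feat[dst] = (max(incoming) if mode == "max"
                         else sum(incoming) / len(incoming))
    return feat[num_nodes - 1]  # text confidence at the last node
```

With "max" this behaves like a best-path (Viterbi-style) score; with "mean" the competing paths into a node are averaged instead.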
Optionally, the voice recognition module is configured to:
the voice data is identified, so that the target text data and the identification confidence coefficient of the target text data are obtained, and the identification confidence coefficient represents the matching degree of the target text data and the voice data;
the confidence coefficient obtaining unit is used for:
and weighting the recognition confidence coefficient and the text confidence coefficient to obtain the confidence coefficient of the target text data.
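The weighting step admits a one-line sketch (the mixing coefficient alpha is an assumption; the patent leaves the exact weighting scheme open):

```python
def overall_confidence(recognition_conf, text_conf, alpha=0.5):
    # Weighted combination of the acoustic-side recognition confidence
    # and the language-side text confidence.
    return alpha * recognition_conf + (1 - alpha) * text_conf
```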
Optionally, the weight obtaining module is configured to:
for each first element:
acquiring a first number of the first element in corpus data;
acquiring a second number of first target segments in the corpus data, wherein the first target segments comprise the first elements and the preamble elements of the first elements;
A weight of the first element is determined based on a ratio between the second number and the first number.
Optionally, the weight obtaining module is configured to:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein the second elements belonging to the same element set form the target text data, the numbers of characters contained in the second elements belonging to the same element set are the same, and the numbers of characters contained in the second elements belonging to different element sets are different;
and determining the weight of a second element which is the same as the first element corresponding to each connecting line in the first architecture diagram as the weight of the first element, wherein the weight of each second element represents the correlation degree between the second element and the preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is positioned before the second element and is adjacent to the second element.
Optionally, the weight acquisition module is further configured to:
for each second element in each set of elements:
acquiring a third number of the second elements in the corpus data;
Acquiring a fourth number of second target segments in the corpus data, wherein the second target segments comprise the second elements and the preamble elements of the second elements;
a weight of the second element is determined based on a ratio between the fourth number and the third number.
Optionally, the voice recognition module is configured to: recognize the voice data to obtain a first character and a second character; combine the first character and the second character to obtain first text data and obtain the confidence of the first text data; and continue in this way until the last character is obtained by recognizing the voice data and is combined with the previously recognized characters to obtain the target text data.
Optionally, the element obtaining module is further configured to determine each character in the first text data and each word obtained by word division of the first text data as a first element of the first text data, where the word includes at least two characters;
the weight acquisition module is further configured to acquire a weight of each first element in the first text data, where the weight represents a degree of correlation between the first element and a preamble element of the first element, and the preamble element of the first element includes at least one element located before and adjacent to the first element in the first text data;
The confidence determining module is further configured to determine a confidence of the first text data based on a plurality of first elements in the first text data and a weight of each first element.
Optionally, the confidence determining module is further configured to:
acquiring a target architecture diagram of the first text data based on a plurality of first elements of the first text data and the weight of each first element;
and determining the confidence of the first text data based on the target architecture diagram of the first text data.
In another aspect, a computer device is provided, the computer device comprising a processor and a memory, the memory storing at least one computer program, the at least one computer program loaded and executed by the processor to implement the operations performed by the speech recognition method as described in the above aspects.
In another aspect, there is provided a computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the operations performed by the speech recognition method as described in the above aspects.
In another aspect, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the operations performed by the speech recognition method of the above aspect.
According to the technical scheme provided by the embodiment of the application, after text data is obtained by recognizing the voice data, the text data is not directly taken as the text data matched with the voice data; instead, the weight of each first element in the text data is acquired. Because the weight of each first element represents the degree of association between the first element and its preamble element, that is, whether the first element and its preamble element accord with language logic, this conformity to language logic is taken into account when determining whether the text data can serve as the text data matched with the voice data. More information is thereby considered, and the accuracy of voice recognition is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment provided by embodiments of the present application;
FIG. 2 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for speech recognition according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an architecture diagram provided in an embodiment of the present application;
FIG. 5 is a flow chart for creating a first architecture diagram provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a first architecture diagram provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of a first architecture diagram provided in an embodiment of the present application;
FIG. 8 is a flow chart for creating a second architecture diagram provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a target architecture diagram provided in an embodiment of the present application;
FIG. 10 is a flowchart of a method for speech recognition according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It will be understood that the terms "first," "second," and the like, as used herein, may be used to describe various concepts, but are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, the first permutation may be referred to as a second permutation and the second permutation may be referred to as a first permutation without departing from the scope of the present application.
The terms "at least one," "a plurality," "each," "any" and the like as used herein, wherein at least one includes one, two or more, and a plurality includes two or more, each referring to each of a corresponding plurality, and any one referring to any one of the plurality. For example, the plurality of elements includes 3 elements, and each element refers to each of the 3 elements, and any one refers to any one of the 3 elements, which may be the first, the second, or the third.
In order to facilitate understanding of the embodiments of the present application, the keywords related to the embodiments of the present application are explained first:
FST (Finite-State Transducer): an abstract model with a strict algebraic basis. In the broad sense, FSTs include the FSA (Finite-State Acceptor), the FST, the WFSA (Weighted Finite-State Acceptor) and the WFST (Weighted Finite-State Transducer). Each type consists of nodes and connecting lines: the connecting lines in an FSA correspond to input labels only, those in a WFSA correspond to input labels and weights, those in a WFST correspond to input labels, output labels and weights, and those in an FST correspond to input labels and output labels.
Language model: a mathematical model based on probability statistics, used to measure whether text data conforms to language logic. Language models mainly fall into N-gram language models and neural-network language models; an N-gram language model can be represented as a WFSA, while a neural-network language model is represented as a neural network. The main advantage of the N-gram language model over the neural-network language model is customizability: because it is typically represented as a WFSA, it can be modified and customized at small cost. For some languages, typified by Chinese, using a word-level N-gram language model poses practical difficulties in a speech recognition system. Unlike languages with explicit word boundaries such as spaces (e.g., English), languages such as Chinese, Japanese, and Korean have no explicit word boundaries. To use a word-level language model, the text must first be divided into words (word segmentation). However, such segmentation is ambiguous, and different segmentation modes lead to different results. The method provided by the embodiments of the present application uses another WFSA jointly with the N-gram language model, thereby avoiding the ambiguity of segmented words.
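A classic illustration of the segmentation ambiguity described here (the example sentence is a well-known textbook case, not drawn from this application):

```python
# The same Chinese string admits more than one plausible word segmentation,
# and the two readings differ in meaning.
text = "南京市长江大桥"
segmentation_a = ["南京市", "长江大桥"]    # "Nanjing City / Yangtze River Bridge"
segmentation_b = ["南京", "市长", "江大桥"]  # "Nanjing / mayor / Jiang Daqiao"

# both segmentations reassemble to the same string, yet yield
# different word-level N-gram queries
assert "".join(segmentation_a) == text == "".join(segmentation_b)
```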
Speech recognition system: a system for converting voice data into corresponding text data. Speech recognition systems mainly fall into hybrid systems and end-to-end systems. A hybrid system is operated cooperatively by an FST and a neural network; an end-to-end system performs all the work with a single neural network.
In the embodiment of the application, the end-to-end system is assisted to carry out voice recognition based on the N-gram language model and the FST.
Artificial intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technology mainly includes directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, the Internet of Vehicles, and smart transportation. It is believed that with the development of technology, artificial intelligence technology will be applied in more fields and become increasingly important.
The scheme provided by the embodiments of the present application relates to artificial intelligence technologies such as natural language processing and machine learning: voice data is recognized through a speech recognition system and a language model to obtain text data matched with the voice data.
The speech recognition method provided by the embodiments of the present application is executed by a computer device. The computer device is a terminal (a mobile phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, a vehicle-mounted terminal, or another type of terminal) or a server. Optionally, the server is an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms.
Alternatively, the speech recognition method provided by the embodiments of the present application is executed interactively by a terminal and a server. FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102 connected by a wireless or wired network. A target application served by the server 102 is installed on the terminal 101, through which the terminal 101 can realize functions such as information search, data transmission, and voice recognition. Optionally, the target application is a target application in the operating system of the terminal 101 or one provided by a third party. For example, the target application is a speech recognition application with a speech recognition function; of course, the application can also have other functions, such as a comment function, a shopping function, a navigation function, a game function, and the like.
The terminal 101 logs in to the target application based on a user identification, acquires voice data through the target application, and transmits the voice data to the server 102. The server 102 receives the voice data transmitted by the terminal 101, recognizes it to obtain text data, and then determines, based on the text data, whether the text data conforms to language logic, so as to determine whether it can be used as the text data matched with the voice data. Optionally, the server 102 can also transmit the recognition result to the terminal 101; in the case where the text data is determined to match the voice data, the server transmits the text data to the terminal 101.
Fig. 2 is a flowchart of a voice recognition method according to an embodiment of the present application. The execution subject of the embodiments of the present application is a computer device. Referring to fig. 2, the method includes the steps of:
201. the computer device recognizes the voice data to obtain target text data.
In the embodiments of the present application, the voice data is any voice data to be recognized, and the target text data is obtained by directly recognizing the voice data; whether the target text data can be used as the text data matched with the voice data has not yet been determined. The voice data is recognized by an end-to-end speech recognition system or another model.
202. The computer device determines each character in the target text data, and each word obtained by segmenting the target text data, as a first element of the target text data.
The target text data includes a plurality of characters. After the target text data is recognized, it is divided in different dividing modes to obtain a plurality of first elements, where each first element is a character or a word. A word obtained by dividing the target text data includes at least two characters.
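A minimal sketch of step 202 (not the disclosed implementation): every character of the target text data, plus every word found by a simple dictionary scan, becomes a first element. The `WORD_CORPUS` contents and the greedy scan are illustrative assumptions; the word corpus itself is described later in the text.

```python
# hypothetical word corpus: two-character words assumed for illustration
WORD_CORPUS = {"语音", "识别"}

def first_elements(text):
    """Return the first elements of `text`: all characters plus all
    corpus words (length >= 2) found as substrings."""
    elements = list(text)  # each single character is a first element
    for i in range(len(text)):
        for j in range(i + 2, len(text) + 1):
            if text[i:j] in WORD_CORPUS:
                elements.append(text[i:j])
    return elements

print(first_elements("语音识别"))
# characters 语, 音, 识, 别 plus words 语音, 识别
```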
203. The computer device obtains a weight for each first element, the weight representing a degree of correlation between the first element and a preamble element of the first element.
The preamble element of a first element comprises at least one element that is located before, and adjacent to, the first element in the target text data. The greater the weight of a first element, i.e., the greater the degree of correlation between the first element and its preamble element, the more the first element and its preamble element conform to language logic; the smaller the weight, i.e., the smaller the degree of correlation, the less they conform to language logic.
204. The computer device determines a confidence level of the target text data based on the plurality of first elements in the target text data and the weight of each first element, the confidence level representing a confidence level of the target text data matching the speech data.
In the embodiments of the present application, for each first element in the target text data, since the weight of the first element indicates whether the first element and its preamble element conform to language logic, comprehensively considering the weights of the plurality of first elements makes it possible to determine whether the target text data as a whole conforms to language logic, and thereby whether the target text data is the text data matched with the voice data.
205. The computer device determines the target text data and the confidence level of the target text data as the recognition result of the voice data.
According to the method provided by the embodiments of the present application, after text data is obtained by recognizing the voice data, whether the text data matches the voice data is not determined directly. Instead, the weight of each first element in the text data is obtained. Because the weight of a first element represents the degree of association between the first element and its preamble element, i.e., whether the first element and its preamble element in the text data conform to language logic, determining whether the text data can be used as the text data matched with the voice data takes this conformance into account. More information is thus considered, which improves the accuracy of speech recognition.
The embodiment shown in fig. 2 briefly introduces that, when determining whether the target text data matches the voice data, whether the target text data conforms to language logic is considered. In one possible implementation, a target architecture diagram corresponding to the target text data can be constructed, and whether the target text data matches the voice data is determined based on the target architecture diagram.
Fig. 3 is a flowchart of a voice recognition method according to an embodiment of the present application. The execution subject of the embodiments of the present application is a computer device. Referring to fig. 3, the method includes the steps of:
301. the computer equipment identifies the voice data to obtain target text data and identification confidence of the target text data.
The voice data is any voice data; its duration and content are not limited in the embodiments of the present application. The recognition confidence represents the degree to which the target text data matches the voice data. Optionally, the recognition confidence is represented by a probability, a score, or in another manner; its representation is not limited in the embodiments of the present application.
In one possible implementation, the computer device invokes a speech recognition system to recognize the speech data to obtain the target text data and the recognition confidence. For example, the voice recognition system is an end-to-end system or other system capable of performing voice data recognition, and the embodiment of the present application does not limit the structure of the voice recognition system. Of course, the computer device can also recognize the voice data in other manners, and the recognition manner of the voice data is not limited in the embodiment of the present application.
302. The computer device determines each character in the target text data and each word obtained by word segmentation of the target text data as a first element of the target text data.
For the target text data, different dividing modes are adopted, so that different first elements can be obtained by dividing the target text data. In one possible implementation, the target text data is divided according to characters, each character in the target text data is obtained, and then each character is determined to be a first element of the target text data; dividing the target text data according to words to obtain each word in the target text data, and determining each obtained word as a first element of the target text data.
Optionally, the computer device stores a word corpus containing a plurality of words, and divides the target text data based on the word corpus, so that the words obtained from the target text data are contained in the word corpus. Alternatively, the computer device performs semantic recognition on the target text data to determine its semantic information, and then divides the target text data based on the semantic information to obtain the words in it. Other word segmentation manners can also be adopted; the method of dividing the words in the target text data is not limited in the embodiments of the present application.
303. The computer device obtains a first architecture diagram of the target text data based on a plurality of first elements of the target text data.
In the embodiments of the present application, in order to represent the division modes of the target text data, and whether the target text data conforms to language logic, accurately and simply, an architecture diagram (lattice) is used. For example, referring to the architecture diagram shown in fig. 4, the diagram includes 7 nodes and 8 connecting lines, where a node drawn with two circles represents an end node. On each connecting line, the first character represents the input character, the second the output character, and the third the weight. Taking the connecting line between node 0 and node 1 as an example, the input character is d, the output character is data, and the weight is 1; the representations on the connecting lines between the other nodes are similar.
The first architecture diagram in the embodiments of the present application includes a plurality of nodes and a plurality of connecting lines; each first element corresponds to one connecting line, and each connecting line points from a start node to a termination node. That is, each connecting line in the first architecture diagram carries an input character, but no output character or weight.
For any character, the corresponding connecting line points from its start node to its termination node. For any word, the corresponding connecting line points from the start node of the connecting line corresponding to the word's first character to the termination node of the connecting line corresponding to the word's last character. The first element corresponding to a connecting line that points to a node is adjacent to, and located before, the first element corresponding to any connecting line that takes that node as its start node.
In this embodiment, the creation process of the first architecture diagram, referring to the embodiment shown in fig. 5, is not described herein again.
304. The computer device obtains a second architecture diagram.
The second architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each connecting line corresponds to a second element and a weight corresponding to the second element, and each connecting line points to a termination node from a starting node.
In one possible implementation, the second architecture diagram is an N-gram language model, and the computer device creates the N-gram language model based on a plurality of preset second elements and corpus data containing the plurality of second elements before performing speech recognition. The preset second elements include a plurality of characters and a plurality of words. Alternatively, the N-gram language model is sent to the computer device by another computer device; the manner of constructing the N-gram language model is not limited in the embodiments of the present application.
In another possible implementation manner, the computer device creates a second architecture diagram corresponding to the target text data based on the identified target text data, and the creation process of the second architecture diagram refers to the embodiment shown in fig. 8, which is not described herein again.
It should be noted that the embodiments of the present application are described only by taking as an example creating the first architecture diagram first and then acquiring the second architecture diagram. In another embodiment, the computer device can execute step 304 first and then steps 302-303, or execute step 304 and steps 302-303 simultaneously.
305. And the computer equipment performs intersection operation on the first architecture diagram and the second architecture diagram to obtain a target architecture diagram.
In the embodiment of the application, the target architecture diagram can be obtained by performing an intersection (intersection) operation on the first architecture diagram and the second architecture diagram.
In one possible implementation, according to the first architecture diagram, the computer device selects, in the second architecture diagram, the connecting lines corresponding to the second elements identical to the first elements in the first architecture diagram, together with the nodes connected by those connecting lines. It then removes, from the selected nodes and connecting lines, any node or connecting line that does not belong to a path corresponding to the target text data, obtaining the target architecture diagram. Each path starting from the first node in the target architecture diagram contains the complete target text data, and the target architecture diagram includes a plurality of paths representing the various division modes of the target text data.
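The selection-and-pruning described above can be sketched as follows. This is a simplified illustration, not the disclosed implementation: a diagram is flattened to a list of tuples, the language model is reduced to an element-to-weight mapping, and pruning keeps only arcs that lie on some complete path from the start node to the final node.

```python
def intersect(first_arcs, lm_weights, start, final):
    """first_arcs: list of (start_node, end_node, element) tuples;
    lm_weights: element -> weight; returns weighted, pruned arcs."""
    # keep connecting lines whose element appears in the second diagram,
    # attaching that element's weight
    weighted = [(s, e, el, lm_weights[el])
                for (s, e, el) in first_arcs if el in lm_weights]

    def reachable(arcs, origin, forward=True):
        # nodes reachable from `origin` following arcs forward or backward
        seen, frontier = {origin}, [origin]
        while frontier:
            n = frontier.pop()
            for (s, e, _el, _w) in arcs:
                here, there = (s, e) if forward else (e, s)
                if here == n and there not in seen:
                    seen.add(there)
                    frontier.append(there)
        return seen

    # prune arcs not on any complete start -> final path
    fwd = reachable(weighted, start, forward=True)
    bwd = reachable(weighted, final, forward=False)
    return [a for a in weighted if a[0] in fwd and a[1] in bwd]
```

For example, an arc whose element is absent from the language model, or whose termination node cannot reach the final node, is dropped.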
306. The computer device obtains text confidence of the target text data based on the target architecture diagram.
The computer device determines the node characteristic of a target node based on the weight corresponding to a connecting line pointing to the target node in the target architecture diagram and the node characteristic of the other node connected by that connecting line; when the target node is the last node in the target architecture diagram, the text confidence contained in its node characteristic is determined as the text confidence of the target text data. The target node is any node other than the first node in the target architecture diagram, and a node characteristic comprises the text fragment formed by the first elements on the connecting lines located before the target node and the text confidence corresponding to that text fragment; the text confidence represents the probability that the target text data conforms to language logic.
That is, since the connecting lines in the target architecture diagram are directed, the computer device first determines the node to which no connecting line points and takes it as the first node. Starting from the first node, it determines the first element and weight corresponding to the connecting line between the first node and a next node (a next node being the termination node of some connecting line whose start node is the first node), and determines that first element and weight as the node characteristic of the next node. It then determines the node characteristics of subsequent nodes in the same way until the node characteristic of the last node is determined. The greater the weight of a first element, the more relevant the first element and its preamble element are, i.e., the more they conform to language logic; the smaller the weight, the less relevant they are, i.e., the less they conform to language logic.
In one possible implementation manner, under the condition that n connecting lines point to the target node in the target architecture diagram, for each connecting line in the n connecting lines, obtaining a node characteristic corresponding to the connecting line based on the weight corresponding to the connecting line and the node characteristic of another node connected with the connecting line; and determining the node characteristics of the target node based on the node characteristics corresponding to the n connecting lines.
Optionally, a semiring rule is used to determine node characteristics: the node characteristic with the maximum text confidence among the node characteristics corresponding to the n connecting lines is determined as the node characteristic of the target node; or the average of the node characteristics corresponding to the n connecting lines is determined as the node characteristic of the target node. For the same target node, the text fragments corresponding to the multiple paths before that node are the same, so averaging the node characteristics of the n connecting lines means averaging the text confidences in the n node characteristics to obtain an average confidence, and determining the average confidence together with the text fragment corresponding to any one of the paths as the node characteristic of the target node.
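The propagation of node characteristics under the two semiring-like rules can be sketched as below. This is an illustrative assumption-laden sketch: nodes are assumed to be numbered in topological order (node 0 is the first node), weights are kept in the log domain, and only the confidence part of the node characteristic is tracked.

```python
import math

def node_confidences(arcs, num_nodes, rule="max"):
    """arcs: list of (start_node, end_node, log_weight); returns the
    accumulated log-domain text confidence at each reachable node."""
    conf = {0: 0.0}  # log-confidence of the first node (log 1)
    for node in range(1, num_nodes):  # nodes assumed topologically numbered
        incoming = [conf[s] + w
                    for (s, e, w) in arcs if e == node and s in conf]
        if incoming:
            # semiring rule: keep the maximum, or average the n candidates
            conf[node] = (max(incoming) if rule == "max"
                          else sum(incoming) / len(incoming))
    return conf

arcs = [(0, 1, math.log(0.5)), (0, 1, math.log(0.25)), (1, 2, math.log(0.5))]
print(node_confidences(arcs, 3, rule="max")[2])  # ≈ log(0.25)
```

With `rule="max"`, the last node's confidence follows the best path; with averaging, parallel connecting lines into the same node are pooled as the text describes.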
For example, the text confidence is determined using the following formula:

$$P(W) = \prod_{u=1}^{U} P(w_u \mid w_1, \ldots, w_{u-1})$$

where $P(W)$ represents the text confidence of the target text data, $W$ represents the target text data, $w_u$ represents the u-th first element in the target text data, $w_1, \ldots, w_{u-1}$ are the preamble elements of $w_u$, $P(w_u \mid w_1, \ldots, w_{u-1})$ represents the probability that, on the basis of $w_1, \ldots, w_{u-1}$, the next first element is $w_u$, and $U$ represents the number of first elements in the target text data.

Alternatively, $P(w_u \mid w_1, \ldots, w_{u-1})$ is approximated as $P(w_u \mid w_{u-N+1}, \ldots, w_{u-1})$, namely:

$$P(W) \approx \prod_{u=1}^{U} P(w_u \mid w_{u-N+1}, \ldots, w_{u-1})$$

where $P(w_u \mid w_{u-N+1}, \ldots, w_{u-1})$ is determined according to $C(w_{u-N+1}, \ldots, w_u)$ and $C(w_{u-N+1}, \ldots, w_{u-1})$, with $C(\cdot)$ representing the number of occurrences of a fragment in the corpus data: $C(w_{u-N+1}, \ldots, w_u)$ represents the number of target fragments consisting of the u-th first element and its preamble elements, and $C(w_{u-N+1}, \ldots, w_{u-1})$ represents the number of fragments consisting of the preamble elements alone; the two fragments differ in the number of first elements they contain. The order N of the N-gram language model is any number such as 3, 4, or 5.
In one possible implementation, after the text confidence P(W) is obtained using the above formula, the logarithm log P(W) is taken, and log P(W) is used as the text confidence of the target text data.
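The chain-rule product and its count-ratio estimate can be sketched as follows. This is a minimal illustration under stated assumptions: counts are stored as a `Counter` over element tuples, and unseen fragments simply yield negative infinity, whereas a real N-gram model would apply smoothing or back-off.

```python
import math
from collections import Counter

def text_confidence(elements, corpus_ngrams, N=3):
    """log P(W): sum over elements of log( C(context + w_u) / C(context) ),
    with the context limited to the preceding N-1 elements."""
    log_p = 0.0
    for u in range(len(elements)):
        context = tuple(elements[max(0, u - N + 1):u])
        num = corpus_ngrams[context + (elements[u],)]
        den = (corpus_ngrams[context] if context
               else sum(v for k, v in corpus_ngrams.items() if len(k) == 1))
        if num == 0 or den == 0:
            return float("-inf")  # unseen fragment; a real model backs off
        log_p += math.log(num / den)
    return log_p

corpus = Counter({("a",): 2, ("b",): 2, ("a", "b"): 1})
print(text_confidence(["a", "b"], corpus))  # ≈ log(0.5) + log(0.5)
```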
In the embodiments of the present application, since the end-to-end speech recognition system works stepwise, each iteration is based on the already decoded local hypothesis $W^{u-1} = [w_1, \ldots, w_{u-1}]$ and generates at most one new character $w_u$. For example, given the local hypothesis "today", a new character "me" is predicted in a new iteration step. To verify at the end of the new iteration whether the proposed character "me" is correct, the logarithmic probability $\log P(w_u \mid W^{u-1})$ needs to be calculated, i.e., the probability that the character "me" is correct.
307. The computer equipment carries out weighting processing on the recognition confidence and the text confidence to obtain the confidence of the target text data.
In the embodiments of the present application, in order to determine the final confidence of the target text data, the recognition confidence obtained when recognizing the voice data and the text confidence of the target text data are considered together, i.e., the confidence determines whether the target text data can be used as the text data matched with the voice data.
In one possible implementation, the weight corresponding to the recognition confidence and the weight corresponding to the text confidence are set according to the importance degree of the recognition confidence and the text confidence respectively. And carrying out weighted summation on the recognition confidence coefficient and the text confidence coefficient based on the weight corresponding to the recognition confidence coefficient and the weight corresponding to the text confidence coefficient to obtain the confidence coefficient of the target text data.
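The weighted summation of step 307 can be sketched in a few lines. The 0.7/0.3 split below is an illustrative assumption, not a value specified by the text; in practice the weights would be set according to the importance of each confidence.

```python
def fuse_confidence(recognition_conf, text_conf, w_rec=0.7, w_text=0.3):
    """Weighted sum of the recognition confidence and the text confidence,
    giving the final confidence of the target text data."""
    return w_rec * recognition_conf + w_text * text_conf

print(fuse_confidence(0.9, 0.6))  # weighted sum, ≈ 0.81
```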
In the above embodiment, the first architecture diagram and the second architecture diagram are acquired, and the intersection operation is performed on them to obtain the target architecture diagram. In another embodiment, after obtaining the first architecture diagram and the weights of the first elements, the computer device adds the weight of each first element to the first architecture diagram to obtain the target architecture diagram; an architecture diagram obtained in this way represents fewer composition cases of the target text data than the one obtained in the foregoing embodiment.
For the weight of a first element, in one possible implementation, for each first element, a first number of occurrences of the first element in the corpus data is obtained, and a second number of occurrences of a first target fragment in the corpus data is obtained, where the first target fragment comprises the first element and the preamble element of the first element; the weight of the first element is determined based on the ratio between the second number and the first number. The corpus data is preset and contains a plurality of pieces of text data; for example, the corpus data is articles, passages, and the like downloaded from a network. The more target fragments the corpus data contains, the more often the collocation of the first elements is used in actual language, and the more it conforms to language logic.
In another possible implementation, the computer device divides the target text data according to different numbers of characters to obtain a plurality of element sets, where the second elements belonging to the same element set together form the target text data, the second elements belonging to the same element set contain the same number of characters, and the second elements belonging to different element sets contain different numbers of characters. The weight of the second element identical to the first element corresponding to each connecting line in the first architecture diagram is determined as the weight of that first element. The weight of each second element represents the degree of correlation between the second element and its preamble element, and the preamble element of a second element comprises at least one second element that belongs to the same element set, is located before the second element, and is adjacent to it.
Optionally, for each second element in each element set, a third number of occurrences of the second element in the corpus data is obtained, and a fourth number of occurrences of a second target fragment in the corpus data is obtained, where the second target fragment contains the second element and the preamble element of the second element; the weight of the second element is determined based on the ratio between the fourth number and the third number.
It should be noted that the embodiments of the present application describe obtaining the text confidence of the target text data through the target architecture diagram merely as an example; in another embodiment, the computer device may obtain the text confidence in other manners based on the plurality of first elements of the target text data and the weight of each first element, which is not limited in the embodiments of the present application.
308. The computer device determines the target text data and the confidence level of the target text data as the recognition result of the voice data.
In the embodiment of the application, after the computer equipment obtains the confidence coefficient of the target text data, the target text data obtained through recognition and the confidence coefficient are used as the recognition result of the voice data.
Optionally, the computer device is capable of storing the target text data and the confidence level; alternatively, the computer device can store the target text data and confidence level in correspondence with the voice data; alternatively, the computer device can also send the target text data and confidence level to other devices; alternatively, other operations can be performed on the recognition result, which is not limited by the embodiments of the present application.
According to the method provided by the embodiments of the present application, after text data is obtained by recognizing the voice data, whether the text data matches the voice data is not determined directly. Instead, the weight of each first element in the text data is obtained. Because the weight of a first element represents the degree of association between the first element and its preamble element, i.e., whether the first element and its preamble element in the text data conform to language logic, determining whether the text data can be used as the text data matched with the voice data takes this conformance into account. More information is thus considered, which improves the accuracy of speech recognition.
Moreover, by determining the target architecture diagram of the text data, each connection line in the target architecture diagram has a corresponding first element and the weight of that first element, so the target architecture diagram reflects whether the text data conforms to language logic. When determining, based on the target architecture diagram, whether the text data can serve as the text data matching the voice data, whether each first element conforms to language logic is considered, thereby improving the accuracy of voice recognition. In addition, the target architecture diagram intuitively and accurately represents the language-logic information of the text data, and when the confidence is obtained, it can be determined sequentially based on the connection order of the nodes and connection lines in the target architecture diagram, so the efficiency of obtaining the confidence is improved.
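As a minimal illustration of how per-element weights can be combined into a single text confidence (the element names, the weight values, and the log-sum formulation below are all assumptions for illustration, not values given in this embodiment):

```python
import math

# Illustrative per-element weights: the degree of association between an
# element and its preamble element (hypothetical values, not corpus-derived).
element_weights = {"to": 0.20, "people": 0.35, "show": 0.30}

def text_confidence(elements, weights):
    """Sum of log weights along one segmentation of the text.

    A larger (less negative) score means the element sequence better
    conforms to language logic."""
    return sum(math.log(weights[e]) for e in elements)

score = text_confidence(["to", "people", "show"], element_weights)
```

Working in the log domain keeps the score numerically stable and matches the later observation that the weights in the target architecture diagram are represented by negative numbers.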
Fig. 5 is a flowchart for creating a first architecture diagram provided in an embodiment of the present application. Referring to fig. 5, the method is performed by a computer device, and includes:
501. The computer device creates U+1 nodes.
In the embodiment of the present application, the number of characters in the target text data is denoted as U, and the number of nodes to be created by the computer device is one more than the number of characters, that is, U+1 nodes are created, where the xth node corresponds to the xth character in the target text data, U is a positive integer greater than 1, and x is any positive integer not greater than U. For example, referring to FIG. 6, for the target text data "show to people", 6 nodes are created.
502. The computer device creates a connection line from a node corresponding to a start character in each first element to a next node of a node corresponding to a stop character in each first element, resulting in a first architecture diagram.
In the case that the first element is a character in the target text data, for each character, a connection line is created pointing from the node corresponding to that character to the node corresponding to the next character; that is, the node corresponding to the character is the start node and the node corresponding to the next character is the termination node, so that a connection line corresponding to each character is created. For a single character, the start character and the termination character are the same, namely the character itself. In the case that the first element is a word in the target text data, for each word, a connection line is created pointing from the node corresponding to the first character of the word to the node following the node corresponding to the last character of the word; that is, the node corresponding to the first character is the start node and the node corresponding to the character after the last character is the termination node, so that a connection line corresponding to each word is created. For example, referring to FIG. 6, the plurality of first elements "to", "people", "exhibited", "show", "people", and "show" each have a corresponding connection line.
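Under the assumption that nodes are numbered by character position, the construction in steps 501-502 can be sketched as follows (the helper name and the toy text are hypothetical, not part of the embodiment):

```python
def build_first_diagram(text, elements):
    """Create U+1 nodes and one connection line per element occurrence.

    Each line points from the node of the element's start character to
    the node following its end character (0-based positions)."""
    U = len(text)
    nodes = list(range(U + 1))
    arcs = []
    for start in range(U):
        for elem in elements:
            end = start + len(elem)
            if text[start:end] == elem:
                # (start node, termination node, corresponding first element)
                arcs.append((start, end, elem))
    return nodes, arcs

nodes, arcs = build_first_diagram("abcd", ["a", "b", "c", "d", "ab", "cd"])
```

A single character yields a line spanning one position; a word such as "ab" yields a line from its first character's node to the node after its last character.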
In another embodiment, referring to FIG. 7, the computer device creates U+2 nodes and takes the last node as an end node; the connection line between the last node and the other nodes does not correspond to any first element. In order to facilitate the subsequent intersection operation on the first architecture diagram and the second architecture diagram, a weight of 0 is set on each connection line, and a self-loop is set on each node except the end node; the self-loop does not correspond to any first element. Here ε does not represent any first element, that is, a self-loop labeled ε has no corresponding first element.
It should be noted that, in the embodiment of the present application, only the first architecture diagram is created in the above manner as an example, and in another embodiment, the first architecture diagram can be created in other manners, and the creation manner of the first architecture diagram is not limited in the embodiment of the present application.
According to the method provided by the embodiment of the application, the first architecture diagram can be created by considering the actual partitionable condition of the target text data, so that the subsequent determination of the partitionable mode of the target text data based on the architecture diagram is facilitated.
Fig. 8 is a flowchart for creating a second architecture diagram provided in an embodiment of the present application. Referring to fig. 8, the method is performed by a computer device, and includes:
801. The computer device divides the target text data according to different character numbers to obtain a plurality of element sets.
The second elements belonging to the same element set together form the target text data; the number of characters contained in second elements belonging to the same element set is the same, and the numbers of characters contained in second elements belonging to different element sets are different. The number of characters is 1, 2, 3, 4, or another number.
Optionally, the numbers of characters used for division are determined according to the number of characters contained in the target text data. For example, if the target text data contains 10 characters, the computer device can divide the target text data according to character numbers of 1 to 9 respectively, so as to obtain 9 element sets.
For example, the target text data is "show to people"; when the number of characters is 1, the element set {show, people, show} is obtained, and when the number of characters is 2, the element set {show, people show, show} is obtained.
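One simple reading of step 801 tiles the text with fixed-size segments, one element set per character count (the handling of a trailing segment shorter than n is an assumption here, since the embodiment does not spell that case out):

```python
def element_sets(text):
    """One element set per character count n: consecutive n-character
    segments of the text that together re-form it. The final segment may
    be shorter when n does not divide the text length."""
    sets = {}
    for n in range(1, len(text)):  # character counts 1 .. len-1, as in the 10-character example
        sets[n] = [text[i:i + n] for i in range(0, len(text), n)]
    return sets

sets = element_sets("abcde")
# sets[1] == ["a", "b", "c", "d", "e"]; sets[2] == ["ab", "cd", "e"]
```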
In one possible implementation manner, after the computer device obtains the plurality of element sets, the weight of each second element is obtained, or in the process of creating the second architecture diagram, before each connecting line is created, the weight of the second element corresponding to the connecting line is obtained. The manner of obtaining the weight of the second element is the same as that of obtaining the weight of the second element in the embodiment shown in fig. 3, and is not described herein.
802. The computer device creates a first node and M₁ second nodes.
M₁ is the number of first target elements among the plurality of second elements, where a first target element contains the first character of the target text data. For example, for the target text data "show to people", the first target elements are "to" and "to people".
803. The computer device creates, based on the M₁ first target elements, connection lines pointing from the first node to the M₁ second nodes respectively.
The M₁ connection lines each correspond to one first target element and the weight of that first target element, and the first target elements corresponding to different connection lines are different. For example, if there are two first target elements "to" and "to people", two connection lines are created with the first node as the start node and the two second nodes as the termination nodes, where the two connection lines correspond to "to" and "to people" respectively, and a corresponding weight is set on each connection line.
804. For each second node, the computer device creates M₂ third nodes and, based on the M₂ second target elements, creates connection lines pointing from the second node to the M₂ third nodes respectively, until the second elements corresponding to each path starting from the first node form the target text data, so as to obtain a second architecture diagram of the target text data.
M₂ is the number of second target elements among the plurality of second elements, where a second target element contains the first character after the first target element corresponding to the second node; the M₂ connection lines each correspond to one second target element and the weight of that second target element, and the second target elements corresponding to different connection lines are different.
In one possible implementation, after creating the first node and the M₁ second nodes, the computer device also creates a first empty node and establishes a connection line pointing from the first node to the first empty node; it then creates M₁ fourth nodes and, based on the M₁ first target elements, creates connection lines pointing from the first empty node to the M₁ fourth nodes, where the M₁ connection lines each correspond to one first target element and the weight of that first target element, and the first target elements corresponding to different connection lines are different.
In this embodiment of the present application, the first node in the target architecture diagram is a start node and cannot serve as a termination node; the last node in the target architecture diagram is a termination node and cannot serve as a start node; the other nodes can serve as both start nodes and termination nodes.
In this embodiment of the present application, the second elements corresponding to each path in the created second architecture diagram form the target text data, and the weight of each second element indicates the degree of correlation between the second element and its preamble element, where the preamble element of the second element includes at least one second element that belongs to the same element set as the second element, is located before it, and is adjacent to it. The main differences between the second architecture diagram and the first architecture diagram are that the elements on which they are based differ, and that the connection lines in the second architecture diagram have corresponding weights, whereas those in the first architecture diagram do not.
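Since every path of the second architecture diagram spells out the target text data, the diagram can be viewed as enumerating weighted segmentations of the text. A recursive sketch under assumed toy weights (the vocabulary and values are illustrative only):

```python
def segmentations(text, weights):
    """Enumerate each path: a sequence of second elements whose
    concatenation re-forms the text, scored by the product of the
    element weights (hypothetical values)."""
    if not text:
        return [([], 1.0)]
    paths = []
    for n in range(1, len(text) + 1):
        head = text[:n]
        if head in weights:
            for tail, w in segmentations(text[n:], weights):
                paths.append(([head] + tail, weights[head] * w))
    return paths

paths = segmentations("abc", {"a": 0.5, "b": 0.4, "c": 0.9, "ab": 0.2})
# two paths: ["a", "b", "c"] and ["ab", "c"]
```

The branching structure of steps 802-804 corresponds to the recursion here: each second node fans out to the elements that can follow it.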
After the second architecture diagram is created in the above manner, an intersection operation can be performed on the first architecture diagram and the second architecture diagram to obtain the target architecture diagram. For example, referring to the target architecture diagram shown in FIG. 9, the target architecture diagram includes at least one connection line corresponding to each first element, and also connection lines not corresponding to any first element. Here ε does not represent any first element, that is, a connection line labeled ε has no corresponding first element. When the confidence is calculated, the logarithm of the computed probability P(W) is taken, so the weights in the target architecture diagram are represented by negative numbers.
It should be noted that, numbers marked on nodes in any architecture diagram in the embodiments of the present application are used to identify corresponding nodes, and do not indicate the order of the nodes.
The embodiment shown in FIG. 3 merely illustrates recognizing the voice data to obtain the target text data and then obtaining the confidence of the target text data. In another embodiment, during recognition of the voice data, the computer device sequentially recognizes a plurality of characters, and each time a character is recognized, it performs the process of obtaining the confidence; this embodiment is described in FIG. 10 below.
Fig. 10 is a flowchart of a voice recognition method according to an embodiment of the present application. Referring to fig. 10, the method is performed by a computer device, and includes:
1001. the computer device recognizes the voice data to obtain a first character and a second character.
In the embodiment of the application, the computer equipment identifies the voice data based on the end-to-end voice identification system, and in the identification process, the end-to-end voice identification system sequentially outputs the first character and the second character which are obtained by identification.
1002. The computer device combines the first character and the second character to obtain first text data.
That is, the computer device acquires the first text data obtained by currently recognizing the voice data, and then continues to recognize the voice data, and further obtains the third character, the fourth character, and other characters.
1003. The computer device determines each character in the first text data and each word obtained by word segmentation of the first text data as a first element of the first text data.
1004. The computer equipment acquires a target architecture diagram corresponding to the first text data based on a plurality of first elements of the first text data.
1005. The computer device determines a confidence level of the first text data based on a target architecture diagram corresponding to the first text data.
The confidence indicates the degree to which the first text data matches the currently recognized voice segment in the voice data. For example, the voice data is recognized to sequentially obtain the first character "to" and the second character "person", which are then combined to obtain the first text data "to person", and a corresponding confidence is determined for the first text data.
In this embodiment, the implementation manners of step 1003-step 1005 are the same as the implementation manners of step 302-step 307, and are not described herein.
It should be noted that, after the second character is recognized, the computer device continues to recognize the voice data to obtain a third character, then combines the first text data with the third character (or combines the first, second, and third characters) to obtain second text data, and continues to perform operations similar to steps 1003 to 1005, until the last character is recognized from the voice data. The last character is combined with the previously recognized characters to obtain the target text data, and the embodiment shown in FIG. 3 is executed for the target text data.
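The character-by-character loop of FIG. 10 can be sketched as follows; `score_fn` is a caller-supplied placeholder standing in for the confidence computation of steps 1003-1005, and the toy inputs are hypothetical:

```python
def incremental_confidences(chars, score_fn):
    """After each newly recognized character, score the text recognized
    so far, mirroring the iterative process of steps 1001-1005."""
    results = []
    text = ""
    for ch in chars:
        text += ch
        if len(text) >= 2:  # scoring starts once two characters exist, as in step 1002
            results.append((text, score_fn(text)))
    return results

out = incremental_confidences(list("abcd"), len)
```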
It should be noted that, in this embodiment of the present application, after obtaining a plurality of first elements of first text data, the computer device may obtain a weight of each first element in the first text data, where the weight indicates a degree of correlation between the first element in the first text data and a preamble element of the first element, where the preamble element of the first element includes at least one element located before and adjacent to the first element in the first text data, and then determine a confidence level of the first text data based on the plurality of first elements in the first text data and the weight of each first element. In the above embodiment, only the example of determining the confidence level of the first text data based on the target structure diagram corresponding to the first text data is described, and in another embodiment, the computer device may also determine the confidence level of the first text data in a manner other than the target structure diagram of the first text data.
According to the method provided by the embodiment of the present application, in the voice recognition process, the characters corresponding to the voice data are recognized sequentially, and each time a character is recognized, the confidence of the currently recognized text data is obtained based on the corresponding target architecture diagram. By iteratively scoring the text data as each character is predicted, the recognition accuracy of each character is improved, thereby improving the recognition accuracy of the voice data.
In some embodiments, the above speech recognition approach is tested using the Chinese speech recognition data sets Aishell-1 and Aishell-2. For the neural transducer, a framework of neural transducer + MMI training + CTC training + MMI decoding (a speech recognition framework) is employed; for the Hybrid CTC/Attention framework, the corresponding standard model is employed. Below, neural transducer + MMI training + CTC training + MMI decoding is denoted NT, and Hybrid CTC/Attention is denoted HCA. The character error rate (CER) when the above two frameworks employ the method provided by the embodiment of the present application (referred to as "the present solution" in the table) and the CER when they do not are shown in Table 1 below.
TABLE 1
Model   Uses the present solution   dev(%)   test(%)   ios(%)   android(%)   mic(%)
NT      No                          4.2      4.5       5.4      6.6          6.5
NT      Yes                         3.8      4.2       5.1      6.1          6.0
HCA     No                          4.6      5.0       5.9      7.0          6.8
HCA     Yes                         4.2      4.6       5.3      6.2          6.0
In Table 1, dev and test represent the character error rates obtained by testing on the Aishell-1 test sets, and ios, android, and mic represent the character error rates obtained by testing on the Aishell-2 test sets. As can be seen from Table 1, the character error rate of the speech recognition system using the method provided by the embodiment of the present application is lower than that of the speech recognition system not using it.
In another embodiment, the recognition accuracy of the method provided by the embodiments of the present application is tested using other test data. The character error rate (CER) for NT without the method provided by the embodiments of the present application and the CER with the method (referred to as "the present solution" in the table) are shown in Table 2 below.
TABLE 2
In Table 2 above, Re represents a speakable test set; Tr represents a Min-Han translation test set; Gu represents a vehicle navigation test set; Tv represents a TV-anytime data set; Mu represents a music on-demand data set; Ed represents an online education test set; and Mean represents the mean of the test results on the aforementioned test sets and data sets. The result corresponding to "NT" is the CER obtained when the present solution is not used on the NT basis, and the result corresponding to "NT + the present solution" is the CER when the present solution is used on the NT basis. As can be seen from Table 2, the character error rate of the speech recognition system using the method provided by the embodiment of the present application is lower than that of the speech recognition system not using it.
Fig. 11 is a schematic structural diagram of a voice recognition device according to an embodiment of the present application. Referring to fig. 11, the apparatus includes:
the voice recognition module 1101 is configured to recognize voice data to obtain target text data;
an element obtaining module 1102, configured to determine, as a first element of the target text data, each character in the target text data and each word obtained by performing word division on the target text data, where the word includes at least two characters;
a weight obtaining module 1103, configured to obtain a weight of each first element, where the weight represents a degree of correlation between the first element and a preamble element of the first element, and the preamble element of the first element includes at least one element located before and adjacent to the first element in the target text data;
a confidence determining module 1104, configured to determine a confidence of the target text data based on a plurality of first elements in the target text data and a weight of each of the first elements, where the confidence represents a confidence level of the target text data being the target text data to which the speech data matches;
the recognition result determining module 1105 is configured to determine the confidence level of the target text data and the target text data as a recognition result of the voice data.
According to the apparatus provided by the embodiment of the present application, after the text data is obtained by recognizing the voice data, whether the text data matches the voice data is not determined directly; instead, the weight of each first element in the text data is obtained. Because the weight of each first element represents the degree of association between the first element and its preamble element, that is, whether the first element and its preamble element in the text data conform to language logic, determining whether the text data can serve as the text data matching the voice data takes this language logic into account. More information is thus considered, and the accuracy of voice recognition is improved.
Optionally, referring to fig. 12, the confidence determining module 1104 includes:
an architecture diagram acquiring unit 1114 configured to acquire a target architecture diagram of the target text data, based on a plurality of first elements in the target text data and a weight of each of the first elements, the target architecture diagram including a plurality of nodes and a plurality of connection lines;
a confidence determining unit 1124 for determining the confidence of the target text data based on the target architecture diagram.
Each connecting line corresponds to a first element and the weight of the first element, each connecting line points to a termination node from a start node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the start node of the connecting line corresponding to the start character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and located before the first element corresponding to the connecting line taking the node as the start node.
Alternatively, referring to fig. 12, the architecture diagram acquiring unit 1114 includes:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned in front of the first element corresponding to the connecting line taking the node as the starting node;
Acquiring a second architecture diagram, wherein the second architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each connecting line corresponds to a second element and a weight corresponding to the second element, and each connecting line points to a termination node from a starting node;
selecting a connecting wire corresponding to a second element identical to each first element in the first architecture diagram in the second architecture diagram according to the first architecture diagram, and a node connected with the connecting wire;
and removing nodes or connecting lines which do not belong to any path corresponding to the target text data from the selected nodes and connecting lines, and obtaining the target structure diagram.
Alternatively, referring to fig. 12, the architecture diagram acquiring unit 1114 is configured to:
creating U+1 nodes, wherein U is the number of characters in the target text data, the xth node corresponds to the xth character in the target text data, U is a positive integer greater than 1, and x is any positive integer not greater than U;
and creating a connecting line from the node corresponding to the initial character in each first element to the next node of the node corresponding to the termination character in each first element, and obtaining the first architecture diagram.
Alternatively, referring to fig. 12, the architecture diagram acquiring unit 1114 is configured to:
dividing the target text data according to different numbers of characters to obtain a plurality of element sets, wherein the second elements belonging to the same element set form the target text data, the numbers of characters contained in second elements belonging to the same element set are the same, and the numbers of characters contained in second elements belonging to different element sets are different;
based on a plurality of element sets of the target text data, a second structure diagram of the target text data is obtained, second elements corresponding to each path in the second structure diagram form the target text data, the weight of each second element represents the correlation degree between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is positioned in front of the second element and is adjacent to the second element.
Alternatively, referring to fig. 12, the architecture diagram acquiring unit 1114 is configured to:
creating a first node and M₁ second nodes, M₁ being the number of first target elements among the plurality of second elements, where a first target element contains the first character of the target text data; creating, based on the M₁ first target elements, connection lines pointing from the first node to the M₁ second nodes respectively, where the M₁ connection lines each correspond to one first target element and the weight of that first target element, and the first target elements corresponding to different connection lines are different;
for each second node, creating M₂ third nodes, M₂ being the number of second target elements among the plurality of second elements, where a second target element contains the first character after the first target element corresponding to the second node; creating, based on the M₂ second target elements, connection lines pointing from the second node to the M₂ third nodes respectively, where the M₂ connection lines each correspond to one second target element and the weight of that second target element, and the second target elements corresponding to different connection lines are different, until the second elements corresponding to each path starting from the first node form the target text data, so as to obtain the second architecture diagram of the target text data.
Alternatively, referring to fig. 12, the architecture diagram acquiring unit 1114 is configured to:
creating a first null node, and creating a connecting line pointing from the first node to the first null node;
creating M₁ fourth nodes and, based on the M₁ first target elements, creating connection lines pointing from the first empty node to the M₁ fourth nodes, where the M₁ connection lines each correspond to one first target element and the weight of that first target element, and the first target elements corresponding to different connection lines are different.
Alternatively, referring to fig. 12, the architecture diagram acquiring unit 1114 is configured to:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned in front of the first element corresponding to the connecting line taking the node as the starting node;
and adding the weight of each first element in the first architecture diagram to acquire the target architecture diagram.
Optionally, referring to fig. 12, the confidence determining module 1104 includes:
A node feature obtaining unit 1134, configured to determine a node feature of a target node based on a weight corresponding to a connection line pointing to the target node in the target architecture diagram and a node feature of another node connected to the connection line, where the target node is any node in the target architecture diagram except a first node, and the node feature includes a text segment formed by a first element located on a connection line before the target node and a text confidence corresponding to the text segment;
a text confidence acquiring unit 1144, configured to determine, when the target node is the last node in the target architecture diagram, a text confidence included in a node feature of the target node as a text confidence of the target text data, where the text confidence represents a probability that the target text data conforms to language logic;
and the confidence acquiring unit 1124 is configured to adjust the text confidence to obtain the confidence of the target text data.
Alternatively, referring to fig. 12, the node characteristic acquiring unit 1134 is configured to:
under the condition that n connecting lines point to the target node in the target structure diagram, for each connecting line in the n connecting lines, obtaining the node characteristic corresponding to the connecting line based on the weight corresponding to the connecting line and the node characteristic of the other node connected with the connecting line;
And determining the node characteristics of the target node based on the node characteristics corresponding to the n connecting lines.
Alternatively, referring to fig. 12, the node characteristic acquiring unit 1134 is configured to:
determining the node characteristic corresponding to the maximum text confidence in the node characteristics corresponding to the n connecting lines as the node characteristic of the target node; or,
and determining the average value of the node characteristics corresponding to the n connecting lines as the node characteristic of the target node.
Optionally, the voice recognition module 1101 is configured to:
recognizing the voice data to obtain the target text data and a recognition confidence of the target text data, wherein the recognition confidence represents the degree of matching between the target text data and the voice data;
the confidence acquiring unit 1124 is configured to:
and weighting the recognition confidence and the text confidence to obtain the confidence of the target text data.
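A minimal sketch of this weighting, assuming a simple linear interpolation (the parameter `alpha` and its value are assumptions; the embodiment does not fix the weighting scheme):

```python
def combined_confidence(recognition_conf, text_conf, alpha=0.5):
    """Weighted sum of the recognition confidence (acoustic match) and
    the text confidence (language logic)."""
    return alpha * recognition_conf + (1 - alpha) * text_conf

c = combined_confidence(0.8, 0.6, alpha=0.7)
```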
Optionally, the weight obtaining module 1103 is configured to:
for each first element:
acquiring a first number of the first element in corpus data;
acquiring a second number of first target segments in the corpus data, wherein the first target segments comprise the first element and a preamble element of the first element;
A weight of the first element is determined based on a ratio between the second number and the first number.
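The count-ratio computation of a first element's weight can be sketched as follows, treating the corpus as plain text and counting overlapping occurrences; the function name, parameters, and counting scheme are illustrative assumptions, not the patent's implementation.

```python
def element_weight(element: str, preamble: str, corpus: str) -> float:
    """Weight of a first element: the ratio of the number of corpus
    occurrences of the first target segment (preamble element followed by
    the element) to the number of occurrences of the element itself, as in
    the count ratio described above."""
    def count(sub: str, text: str) -> int:
        # count overlapping occurrences of `sub` in `text`
        n = start = 0
        while (start := text.find(sub, start)) != -1:
            n += 1
            start += 1
        return n
    first_number = count(element, corpus)            # occurrences of the element
    second_number = count(preamble + element, corpus)  # occurrences of the segment
    return second_number / first_number if first_number else 0.0
```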
Optionally, the weight obtaining module 1103 is configured to:
dividing the target text data according to different numbers of characters to obtain a plurality of element sets, wherein the second elements belonging to the same element set form the target text data, the number of characters contained in second elements belonging to the same element set is the same, and the numbers of characters contained in second elements belonging to different element sets are different;
the weight of a second element which is the same as the first element corresponding to each connecting line in the first structure graph is determined as the weight of the first element, and the weight of each second element represents the correlation degree between the second element and the preamble element of the second element, wherein the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is positioned before the second element and is adjacent to the second element.
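The division into element sets described above, one set per character count, with the elements of each set concatenating back to the target text, might look like this sketch; the handling of a tail shorter than the set's character count is an assumption the patent does not address.

```python
def element_sets(text: str, sizes=(1, 2)) -> dict:
    """Split the target text into element sets, keyed by character count.
    All second elements in a set have the same length and, concatenated in
    order, reproduce the text. A non-divisible tail is kept as a shorter
    final element (an illustrative assumption)."""
    return {n: [text[i:i + n] for i in range(0, len(text), n)] for n in sizes}
```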
Optionally, the weight obtaining module 1103 is further configured to:
for each second element in each set of elements:
acquiring a third number of the second element in the corpus data;
acquiring a fourth number of second target segments in the corpus data, wherein the second target segments comprise the second element and the precursor element of the second element;
A weight of the second element is determined based on a ratio between the fourth number and the third number.
Optionally, the voice recognition module 1101 is configured to:
after the voice data is recognized to obtain a first character and a second character, the first character and the second character are combined to obtain first text data, and the confidence of the first text data is obtained; this continues until the voice data is recognized to obtain the last character, and the last character is combined with the previously recognized characters to obtain the target text data.
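The incremental decoding loop described above can be sketched as follows, where `char_stream` stands in for the per-character recognizer output and `score_text` for the confidence computation on each intermediate first text data; both are hypothetical stand-ins.

```python
def streaming_recognition(char_stream, score_text):
    """Sketch of the incremental loop: after each newly recognized character,
    combine it with the text recognized so far and score the intermediate
    first text data; the final combination is the target text data."""
    text = ""
    history = []  # (intermediate text, its confidence) at each step
    for ch in char_stream:
        text += ch  # combine the new character with the prior text
        history.append((text, score_text(text)))
    return text, history
```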
Optionally, the element obtaining module 1102 is further configured to determine each character in the first text data and each word obtained by performing word segmentation on the first text data as a first element of the first text data, where the word includes at least two characters;
the weight obtaining module 1103 is further configured to obtain a weight of each first element in the first text data, where the weight represents a degree of correlation between the first element and a preceding element of the first element, and the preceding element of the first element includes at least one element located before and adjacent to the first element in the first text data;
The confidence determining module 1104 is further configured to determine a confidence of the first text data based on the plurality of first elements in the first text data and the weight of each first element.
Optionally, the confidence determining module 1104 is further configured to:
acquiring a target architecture diagram of the first text data based on a plurality of first elements of the first text data and the weight of each first element;
a confidence level of the first text data is determined based on a target architecture diagram of the first text data.
Any combination of the above optional solutions may be adopted to form an optional embodiment of the present application, which is not described herein in detail.
It should be noted that: the voice recognition device provided in the above embodiment is illustrated only by the division into the above functional modules. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the computer device may be divided into different functional modules to perform all or part of the functions described above. In addition, the voice recognition device and the voice recognition method provided in the foregoing embodiments belong to the same concept; their specific implementation processes are detailed in the method embodiments and are not described here again.
The embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein at least one computer program is stored in the memory, and the at least one computer program is loaded and executed by the processor to realize the operation executed by the voice recognition method of the embodiment.
Optionally, the computer device is provided as a terminal. Fig. 13 is a schematic structural diagram of a terminal 1300 according to an embodiment of the present application. The terminal 1300 includes: a processor 1301, and a memory 1302.
Processor 1301 may include one or more processing cores, such as a 4-core processor or an 8-core processor. Processor 1301 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). Processor 1301 may also include a main processor and a coprocessor; the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, processor 1301 may integrate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, processor 1301 may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1302 may include one or more computer-readable storage media, which may be non-transitory. Memory 1302 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 1302 is used to store at least one computer program for execution by processor 1301 to implement the speech recognition methods provided by the method embodiments herein.
In some embodiments, the terminal 1300 may further optionally include: a peripheral interface 1303 and at least one peripheral. The processor 1301, the memory 1302, and the peripheral interface 1303 may be connected by a bus or signal lines. The respective peripheral devices may be connected to the peripheral device interface 1303 through a bus, a signal line, or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1304, a display screen 1305, a camera assembly 1306, audio circuitry 1307, and a power supply 1308.
The peripheral interface 1303 may be used to connect at least one I/O (Input/Output) related peripheral to the processor 1301 and the memory 1302. In some embodiments, the processor 1301, the memory 1302, and the peripheral interface 1303 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1301, the memory 1302, and the peripheral interface 1303 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The radio frequency circuit 1304 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 1304 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 1304 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1304 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1304 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: the World Wide Web, metropolitan area networks, intranets, mobile communication networks of various generations (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 1304 may also include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 1305 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1305 is a touch display, it also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 1301 as a control signal for processing. At this point, the display screen 1305 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 1305, disposed on the front panel of the terminal 1300; in other embodiments, there may be at least two display screens 1305, disposed on different surfaces of the terminal 1300 or in a folded configuration; in still other embodiments, the display screen 1305 may be a flexible display disposed on a curved or folded surface of the terminal 1300. The display screen 1305 may even be arranged in an irregular, non-rectangular shape, i.e., a shaped screen. The display screen 1305 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 1306 is used to capture images or video. Optionally, the camera assembly 1306 includes a front camera and a rear camera. The front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the back of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, or the main camera and the wide-angle camera can be fused to realize panoramic shooting and virtual reality (VR) shooting functions or other fusion shooting functions. In some embodiments, the camera assembly 1306 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash refers to a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
The audio circuit 1307 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 1301 for processing, or inputting the electric signals to the radio frequency circuit 1304 for voice communication. For purposes of stereo acquisition or noise reduction, a plurality of microphones may be provided at different portions of the terminal 1300, respectively. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is then used to convert electrical signals from the processor 1301 or the radio frequency circuit 1304 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 1307 may also comprise a headphone jack.
A power supply 1308 is used to power the various components in terminal 1300. The power source 1308 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power source 1308 comprises a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
Those skilled in the art will appreciate that the structure shown in fig. 13 is not limiting of terminal 1300 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Optionally, the computer device is provided as a server. Fig. 14 is a schematic structural diagram of a server provided in an embodiment of the present application. The server 1400 may vary greatly in configuration or performance, and may include one or more processors (Central Processing Unit, CPU) 1401 and one or more memories 1402, where at least one computer program is stored in the memories 1402, and the at least one computer program is loaded and executed by the processors 1401 to implement the methods provided in the respective method embodiments described above. Of course, the server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described herein.
The present application also provides a computer readable storage medium having at least one computer program stored therein, the at least one computer program being loaded and executed by a processor to implement the operations performed by the speech recognition method of the above embodiments.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the operations performed by the speech recognition method of the above embodiments.
In some embodiments, the computer program related to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
It will be appreciated that in the specific embodiments of the present application, related data such as voice data is referred to, and when the above embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the above storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing is merely an alternative embodiment of the present application and is not intended to limit the embodiments of the present application, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principles of the embodiments of the present application are intended to be included in the scope of the present application.

Claims (30)

1. A method of speech recognition, the method comprising:
recognizing the voice data to obtain target text data and a recognition confidence of the target text data, the recognition confidence representing the degree of matching between the target text data and the voice data;
determining each character in the target text data and each word obtained by dividing the target text data into words as a first element of the target text data, wherein the words comprise at least two characters;
acquiring a weight of each first element, wherein the weight represents the correlation degree between the first element and a preamble element of the first element, and the preamble element of the first element comprises at least one element which is positioned in front of the first element and is adjacent to the first element in the target text data;
Determining the confidence of the target text data based on a plurality of first elements in the target text data and the weight of each first element, wherein the confidence represents the credibility of the target text data matched with the voice data;
determining the confidence level of the target text data and the target text data as a recognition result of the voice data;
the determining the confidence of the target text data based on the plurality of first elements in the target text data and the weight of each first element comprises:
acquiring a target architecture diagram of the target text data based on a plurality of first elements in the target text data and the weight of each first element, wherein the target architecture diagram comprises a plurality of nodes and a plurality of connecting lines; each connecting line corresponds to a first element and the weight of the first element, each connecting line points to a termination node from a starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and located before the first element corresponding to the connecting line taking the node as the starting node;
Determining node characteristics of a target node based on a weight corresponding to a connecting line pointing to the target node in the target architecture diagram and node characteristics of another node connected with the connecting line, wherein the target node is any node except a first node in the target architecture diagram, and the node characteristics comprise text fragments formed by a first element positioned on a connecting line before the target node and text confidence corresponding to the text fragments;
determining text confidence contained in node characteristics of the target node as text confidence of the target text data under the condition that the target node is the last node in the target structure graph, wherein the text confidence represents probability that the target text data accords with language logic;
and weighting the recognition confidence coefficient and the text confidence coefficient to obtain the confidence coefficient of the target text data.
2. The method of claim 1, wherein the obtaining the target structure graph of the target text data based on the plurality of first elements of the target text data and the weight of each of the first elements comprises:
Acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
acquiring a second architecture diagram, wherein the second architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each connecting line corresponds to a second element and a weight corresponding to the second element, and each connecting line points to a termination node from a starting node;
selecting a connecting wire corresponding to a second element identical to each first element in the first architecture diagram in the second architecture diagram according to the first architecture diagram, and a node connected with the connecting wire;
Removing nodes or connecting lines which do not belong to any path corresponding to the target text data from the selected nodes and connecting lines to obtain the target structure diagram;
the second architecture diagram is an N-gram language model, and the second elements are preset elements in the N-gram language model; alternatively, the acquiring the second architecture diagram includes:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein second elements belonging to the same element set form the target text data, the number of characters contained in the second elements belonging to the same element set is the same, and the number of characters contained in the second elements belonging to different element sets is different;
and acquiring a second structure diagram of the target text data based on a plurality of element sets of the target text data, wherein a second element corresponding to each path in the second structure diagram forms the target text data, the weight of each second element represents the correlation degree between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is positioned in front of the second element and is adjacent to the second element.
3. The method of claim 2, wherein the obtaining the first frame of the target text data based on the plurality of first elements of the target text data comprises:
creating U+1 nodes, wherein U is the number of characters in the target text data, the xth node corresponds to the xth character in the target text data, U is a positive integer greater than 1, and x is any positive integer not greater than U;
and creating a connecting line from the node corresponding to the initial character in each first element to the next node of the node corresponding to the termination character in each first element, and obtaining the first architecture diagram.
4. The method of claim 2, wherein the obtaining a second structure diagram of the target text data based on the plurality of element sets of the target text data comprises:
creating a first node and M1 second nodes, wherein M1 is the number of first target elements in the plurality of second elements, and the first target elements comprise the first character in the target text data; based on the M1 first target elements, creating M1 connecting lines pointing from the first node to the M1 second nodes, wherein the M1 connecting lines respectively correspond to one first target element and the weight of the first target element, and the first target elements corresponding to different connecting lines are different;
for each of the second nodes, creating M2 third nodes, wherein M2 is the number of second target elements in the plurality of second elements, and the second target elements comprise the first character after the first target element corresponding to the second node; based on the M2 second target elements, creating M2 connecting lines pointing from the second node to the M2 third nodes, wherein the M2 connecting lines respectively correspond to one second target element and the weight of the second target element, and the second target elements corresponding to different connecting lines are different; and so on, until the second elements corresponding to each path starting from the first node form the target text data, thereby obtaining the second structure diagram of the target text data.
5. The method of claim 2, wherein the obtaining a second structure diagram of the target text data based on the plurality of element sets of the target text data comprises:
creating a first empty node, and creating a connecting line pointing to the first empty node from the first node;
creating M1 fourth nodes; and based on the M1 first target elements, creating M1 connecting lines pointing from the first node to the M1 fourth nodes, wherein the M1 connecting lines respectively correspond to one first target element and the weight of the first target element, and the first target elements corresponding to different connecting lines are different.
6. The method of claim 1, wherein the obtaining the target structure graph of the target text data based on the plurality of first elements of the target text data and the weight of each of the first elements comprises:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
And adding the weight of each first element in the first architecture diagram to acquire the target architecture diagram.
7. The method according to claim 1, wherein the determining the node characteristic of the target node based on the weight corresponding to the connection line in the target architecture diagram to the target node and the node characteristic of another node to which the connection line is connected includes:
under the condition that n connecting lines point to the target node in the target structure diagram, for each connecting line in the n connecting lines, obtaining the node characteristic corresponding to the connecting line based on the weight corresponding to the connecting line and the node characteristic of another node connected by the connecting line;
and determining the node characteristics of the target node based on the node characteristics corresponding to the n connecting lines.
8. The method of claim 7, wherein determining the node characteristic of the target node based on the node characteristics corresponding to the n connection lines comprises:
determining the node characteristic corresponding to the maximum text confidence in the node characteristics corresponding to the n connecting lines as the node characteristic of the target node; or,
And determining the average value of the node characteristics corresponding to the n connecting lines as the node characteristic of the target node.
9. The method of claim 1, wherein the obtaining the weight of each first element comprises:
for each first element:
acquiring a first number of the first element in corpus data;
acquiring a second number of first target segments in the corpus data, wherein the first target segments comprise the first elements and the preamble elements of the first elements;
a weight of the first element is determined based on a ratio between the second number and the first number.
10. The method of claim 1, wherein the obtaining the weight of each first element comprises:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein second elements belonging to the same element set form the target text data, the number of characters contained in the second elements belonging to the same element set is the same, and the number of characters contained in the second elements belonging to different element sets is different;
the method comprises the steps of determining the weight of a second element which is the same as a first element corresponding to each connecting line in a first structure graph as the weight of the first element, wherein the weight of each second element represents the correlation degree between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set with the second element, is positioned before the second element and is adjacent to the second element.
11. The method of claim 10, wherein the determining the weight of a second element that is the same as a first element corresponding to each connection line in the first architecture diagram is preceded by determining the weight of the first element, the method further comprising:
for each second element in each set of elements:
acquiring a third number of the second elements in the corpus data;
acquiring a fourth number of second target segments in the corpus data, wherein the second target segments comprise the second elements and the preamble elements of the second elements;
a weight of the second element is determined based on a ratio between the fourth number and the third number.
12. The method according to any one of claims 1-11, wherein identifying the speech data to obtain the target text data comprises:
and after the voice data is recognized to obtain a first character and a second character, combining the first character with the second character to obtain first text data, and obtaining the confidence of the first text data; continuing until the voice data is recognized to obtain the last character, and combining the last character with the previously recognized characters to obtain the target text data.
13. The method of claim 12, wherein the obtaining the confidence level of the first text data comprises:
determining each character in the first text data and each word obtained by carrying out word division on the first text data as a first element of the first text data, wherein the word comprises at least two characters;
acquiring a weight of each first element in the first text data, wherein the weight represents a degree of correlation between the first element and a preamble element of the first element, and the preamble element of the first element comprises at least one element which is positioned before the first element and is adjacent to the first element in the first text data;
a confidence level of the first text data is determined based on a plurality of first elements in the first text data and a weight of each first element.
14. The method of claim 13, wherein the determining the confidence level of the first text data based on the plurality of first elements in the first text data and the weight of each of the first elements comprises:
acquiring a target architecture diagram of the first text data based on a plurality of first elements of the first text data and the weight of each first element;
And determining the confidence of the first text data based on the target architecture diagram of the first text data.
15. A speech recognition device, the device comprising:
the voice recognition module is used for recognizing the voice data to obtain target text data and recognition confidence of the target text data, wherein the recognition confidence represents the matching degree of the target text data and the voice data;
the element acquisition module is used for determining each character in the target text data and each word obtained by dividing the target text data into words as a first element of the target text data, wherein the words comprise at least two characters;
a weight acquisition module, configured to acquire a weight of each first element, where the weight represents a degree of correlation between the first element and a preamble element of the first element, and the preamble element of the first element includes at least one element located before and adjacent to the first element in the target text data;
the confidence determining module is used for determining the confidence of the target text data based on a plurality of first elements in the target text data and the weight of each first element, wherein the confidence represents the confidence degree of the target text data matched with the voice data;
The recognition result determining module is used for determining the confidence level of the target text data and the target text data as a recognition result of the voice data;
the confidence determining module includes:
an architecture diagram obtaining unit, configured to obtain a target architecture diagram of the target text data based on a plurality of first elements in the target text data and weights of each of the first elements, where the target architecture diagram includes a plurality of nodes and a plurality of connection lines; each connecting line corresponds to a first element and the weight of the first element, each connecting line points to a termination node from a starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and located before the first element corresponding to the connecting line taking the node as the starting node;
a node characteristic obtaining unit, configured to determine a node characteristic of a target node based on a weight corresponding to a connection line pointing to the target node in the target architecture diagram and a node characteristic of another node connected to the connection line, where the target node is any node in the target architecture diagram except a first node, and the node characteristic includes a text segment formed by a first element located on a connection line before the target node and a text confidence corresponding to the text segment;
A text confidence acquiring unit, configured to determine, when the target node is the last node in the target architecture diagram, a text confidence included in a node feature of the target node as a text confidence of the target text data, where the text confidence represents a probability that the target text data conforms to language logic;
and the confidence coefficient acquisition unit is used for carrying out weighting processing on the recognition confidence coefficient and the text confidence coefficient to obtain the confidence coefficient of the target text data.
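Claim 15's final step says only that the recognition confidence and the text confidence undergo "weighting processing". One common reading, offered purely as a hypothetical sketch (the interpolation weight `alpha` is an assumption, not taught by the claims), is a linear interpolation:

```python
def combined_confidence(recognition_conf, text_conf, alpha=0.5):
    """Weight the acoustic (recognition) confidence against the language
    (text) confidence; alpha is a hypothetical interpolation weight."""
    return alpha * recognition_conf + (1 - alpha) * text_conf

# E.g. favouring the acoustic score slightly:
print(combined_confidence(0.9, 0.7, alpha=0.6))
```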
16. The apparatus of claim 15, wherein the architecture diagram acquisition unit is configured to:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
Acquiring a second architecture diagram, wherein the second architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each connecting line corresponds to a second element and a weight corresponding to the second element, and each connecting line points to a termination node from a starting node;
selecting, in the second architecture diagram and according to the first architecture diagram, the connecting line corresponding to a second element identical to each first element in the first architecture diagram, and the nodes connected by the connecting line;
removing, from the selected nodes and connecting lines, the nodes or connecting lines which do not belong to any path corresponding to the target text data, to obtain the target architecture diagram;
the second architecture diagram is an N-gram language model, and the second elements are preset elements in the N-gram language model; or, the architecture diagram obtaining unit is configured to:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein second elements belonging to the same element set form the target text data, the number of characters contained in the second elements belonging to the same element set is the same, and the number of characters contained in the second elements belonging to different element sets is different;
and acquiring a second architecture diagram of the target text data based on a plurality of element sets of the target text data, wherein the second elements corresponding to each path in the second architecture diagram form the target text data, the weight of each second element represents the degree of correlation between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is located before the second element, and is adjacent to the second element.
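Claim 16's pruning step — removing nodes or connecting lines that do not belong to any path corresponding to the target text data — amounts to keeping a node only if it is both reachable from the first node and able to reach the last node. A hypothetical Python sketch (the `(start, end, element)` edge representation is an assumption):

```python
from collections import defaultdict

def prune_to_paths(edges, start, end):
    """Keep only nodes/lines lying on some start->end path: a node
    survives iff it is reachable from start (forward) and reaches end
    (backward)."""
    fwd, bwd = defaultdict(set), defaultdict(set)
    for s, e, _ in edges:
        fwd[s].add(e)
        bwd[e].add(s)

    def reachable(src, adj):
        seen, stack = {src}, [src]
        while stack:
            for nxt in adj[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    alive = reachable(start, fwd) & reachable(end, bwd)
    kept = [(s, e, el) for s, e, el in edges if s in alive and e in alive]
    return sorted(alive), kept

# Node 2 is a dead end (no outgoing line), so it and its line are pruned.
nodes, kept = prune_to_paths([(0, 1, "a"), (1, 3, "bc"), (1, 2, "b")], 0, 3)
print(nodes, kept)  # [0, 1, 3] [(0, 1, 'a'), (1, 3, 'bc')]
```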
17. The apparatus of claim 16, wherein the architecture diagram acquisition unit is configured to:
creating U+1 nodes, wherein U is the number of characters in the target text data, the xth node corresponds to the xth character in the target text data, U is a positive integer greater than 1, and x is any positive integer not greater than U;
and creating a connecting line from the node corresponding to the initial character in each first element to the next node of the node corresponding to the termination character in each first element, and obtaining the first architecture diagram.
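Claim 17's construction — U+1 nodes, with each element contributing a connecting line from the node of its start character to the node after its end character — might look like this toy sketch (the tuple-based edge representation and the occurrence search are assumptions):

```python
def first_architecture_diagram(text, elements):
    """Create U+1 nodes (U = number of characters; node x corresponds to
    character x) and, for each element occurrence, a directed line from
    the node of its start character to the node after its end character.
    Lines are returned as (start_node, end_node, element) triples."""
    U = len(text)
    nodes = list(range(U + 1))
    edges = []
    for el in elements:
        start = text.find(el)
        while start != -1:  # cover every occurrence of the element
            edges.append((start, start + len(el), el))
            start = text.find(el, start + 1)
    return nodes, edges

nodes, edges = first_architecture_diagram("abc", ["a", "b", "c", "ab"])
print(nodes)   # [0, 1, 2, 3]
print(edges)   # character lines plus the word line (0, 2, 'ab')
```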
18. The apparatus of claim 16, wherein the architecture diagram acquisition unit is configured to:
creating a first node and M1 second nodes, M1 being the number of first target elements in the plurality of second elements, wherein the first target elements comprise the first character in the target text data; and creating, based on the M1 first target elements, connecting lines pointing from the first node to the M1 second nodes, wherein the M1 connecting lines respectively correspond to one first target element and the weight of the first target element, and the first target elements corresponding to different connecting lines are different;
for each of the second nodes, creating M2 third nodes, M2 being the number of second target elements in the plurality of second elements, wherein the second target elements comprise the first character after the first target element corresponding to the second node; and creating, based on the M2 second target elements, connecting lines pointing from the second node to the M2 third nodes, wherein the M2 connecting lines respectively correspond to one second target element and the weight of the second target element, and the second target elements corresponding to different connecting lines are different, until the second elements corresponding to each path starting from the first node form the target text data, to obtain a second architecture diagram of the target text data.
19. The apparatus of claim 16, wherein the architecture diagram acquisition unit is configured to:
creating a first empty node, and creating a connecting line pointing to the first empty node from the first node;
creating M1 fourth nodes; and creating, based on the M1 first target elements, connecting lines pointing from the first node to the M1 fourth nodes, wherein the M1 connecting lines respectively correspond to one first target element and the weight of the first target element, and the first target elements corresponding to different connecting lines are different.
20. The apparatus of claim 15, wherein the architecture diagram acquisition unit is configured to:
acquiring a first architecture diagram of the target text data based on a plurality of first elements of the target text data, wherein the first architecture diagram comprises a plurality of nodes and a plurality of connecting lines, each first element corresponds to one connecting line, each connecting line points to one termination node from one starting node, the connecting line corresponding to the word points to the termination node of the connecting line corresponding to the termination character of the word from the starting node of the connecting line corresponding to the starting character of the word, and the first element corresponding to the connecting line pointing to any node is adjacent to and positioned before the first element corresponding to the connecting line taking the node as the starting node;
And adding the weight of each first element in the first architecture diagram to acquire the target architecture diagram.
21. The apparatus according to claim 15, wherein the node characteristic obtaining unit is configured to:
under the condition that n connecting lines point to the target node in the target architecture diagram, for each connecting line in the n connecting lines, obtaining the node characteristic corresponding to the connecting line based on the weight corresponding to the connecting line and the node characteristic of the other node connected by the connecting line;
and determining the node characteristics of the target node based on the node characteristics corresponding to the n connecting lines.
22. The apparatus according to claim 21, wherein the node characteristic obtaining unit is configured to:
determining the node characteristic corresponding to the maximum text confidence in the node characteristics corresponding to the n connecting lines as the node characteristic of the target node; or,
and determining the average value of the node characteristics corresponding to the n connecting lines as the node characteristic of the target node.
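Claims 21 and 22 describe a forward pass in which a target node's feature is obtained from its incoming connecting lines, either as the candidate with the maximum text confidence or as an average. A toy sketch of the maximum variant (all representations are assumptions; confidences multiply along a path):

```python
def propagate(nodes, edges, weights):
    """Forward pass over the diagram: the feature of a node is the
    (text segment, text confidence) pair of the best path reaching it,
    taking the maximum confidence over incoming lines (claim 22's first
    alternative)."""
    feat = {nodes[0]: ("", 1.0)}  # first node: empty segment, confidence 1
    for node in nodes[1:]:
        candidates = []
        for s, e, el in edges:
            if e == node and s in feat:
                seg, conf = feat[s]
                candidates.append((seg + el, conf * weights[el]))
        if candidates:
            feat[node] = max(candidates, key=lambda c: c[1])
    return feat

nodes = [0, 1, 2, 3]
edges = [(0, 1, "a"), (1, 2, "b"), (2, 3, "c"), (0, 2, "ab")]
weights = {"a": 0.5, "b": 0.5, "c": 0.9, "ab": 0.4}
feat = propagate(nodes, edges, weights)
# Feature of the last node = text confidence of the whole target text;
# the word line "ab" (0.4) beats the character path a*b (0.25).
print(feat[3])
```

Swapping `max` for an average over `candidates` would give claim 22's second alternative.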
23. The apparatus of claim 15, wherein the weight acquisition module is configured to:
for each first element:
Acquiring a first number of the first element in corpus data;
acquiring a second number of first target segments in the corpus data, wherein the first target segments comprise the first elements and the preamble elements of the first elements;
a weight of the first element is determined based on a ratio between the second number and the first number.
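Claim 23 defines the weight via the ratio of the count of the first target segment (the preamble followed by the element) to the count of the element itself in the corpus. A toy sliding-window sketch (the string-based corpus representation is an assumption):

```python
def element_weight(corpus, element, preamble):
    """weight = count(preamble + element) / count(element) in the corpus,
    counting overlapping occurrences with a sliding window."""
    def count(needle):
        return sum(1 for i in range(len(corpus) - len(needle) + 1)
                   if corpus[i:i + len(needle)] == needle)
    first_number = count(element)              # occurrences of the element
    second_number = count(preamble + element)  # occurrences of the target segment
    return second_number / first_number if first_number else 0.0

# "b" occurs 3 times and "ab" occurs 2 times, so the weight is 2/3.
print(element_weight("ab ab cb", "b", "a"))
```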
24. The apparatus of claim 15, wherein the weight acquisition module is configured to:
dividing the target text data according to different character numbers to obtain a plurality of element sets, wherein second elements belonging to the same element set form the target text data, the number of characters contained in the second elements belonging to the same element set is the same, and the number of characters contained in the second elements belonging to different element sets is different;
determining the weight of a second element which is the same as the first element corresponding to each connecting line in a first architecture diagram as the weight of the first element, wherein the weight of each second element represents the degree of correlation between the second element and a preamble element of the second element, and the preamble element of the second element comprises at least one second element which belongs to the same element set as the second element, is located before the second element, and is adjacent to the second element.
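The division "according to different character numbers" restated in claim 24 can be read as partitioning the text once per element length. A minimal sketch under that assumption (the claims do not say how a remainder shorter than n is handled — here it is kept as a shorter final piece):

```python
def element_sets(text, sizes=(1, 2)):
    """One element set per character count n: the text partitioned into
    consecutive pieces of length n, so the pieces of a set, read in
    order, re-form the text."""
    return {n: [text[i:i + n] for i in range(0, len(text), n)] for n in sizes}

print(element_sets("abcd"))  # {1: ['a', 'b', 'c', 'd'], 2: ['ab', 'cd']}
```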
25. The apparatus of claim 24, wherein the weight acquisition module is further configured to:
for each second element in each set of elements:
acquiring a third number of the second elements in the corpus data;
acquiring a fourth number of second target segments in the corpus data, wherein the second target segments comprise the second elements and the preamble elements of the second elements;
a weight of the second element is determined based on a ratio between the fourth number and the third number.
26. The apparatus according to any one of claims 15 to 25, wherein the speech recognition module is configured to: after recognizing the voice data to obtain a first character and a second character, combine the first character with the second character to obtain first text data and obtain a confidence level of the first text data, until the voice data is recognized to obtain a last character, and combine the last character with the characters obtained by previous recognition to obtain the target text data.
27. The apparatus of claim 26, wherein the element obtaining module is further configured to determine each character in the first text data and each word obtained by word segmentation of the first text data as a first element of the first text data, the word including at least two characters;
The weight acquisition module is further configured to acquire a weight of each first element in the first text data, where the weight represents a degree of correlation between the first element and a preamble element of the first element, and the preamble element of the first element includes at least one element located before and adjacent to the first element in the first text data;
the confidence determining module is further configured to determine a confidence of the first text data based on a plurality of first elements in the first text data and a weight of each first element.
28. The apparatus of claim 27, wherein the confidence determination module is further configured to:
acquiring a target architecture diagram of the first text data based on a plurality of first elements of the first text data and the weight of each first element;
and determining the confidence of the first text data based on the target architecture diagram of the first text data.
29. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one computer program that is loaded and executed by the processor to perform the operations performed by the speech recognition method of any one of claims 1 to 14.
30. A computer readable storage medium having stored therein at least one computer program loaded and executed by a processor to implement the operations performed by the speech recognition method of any one of claims 1 to 14.
CN202210006614.5A 2022-01-05 2022-01-05 Speech recognition method, device, computer equipment and storage medium Active CN114360528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210006614.5A CN114360528B (en) 2022-01-05 2022-01-05 Speech recognition method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210006614.5A CN114360528B (en) 2022-01-05 2022-01-05 Speech recognition method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114360528A CN114360528A (en) 2022-04-15
CN114360528B true CN114360528B (en) 2024-02-06

Family

ID=81106640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210006614.5A Active CN114360528B (en) 2022-01-05 2022-01-05 Speech recognition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114360528B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115273824B (en) * 2022-05-18 2024-06-14 江苏苏云信息科技有限公司 English end-to-end speech recognition system online decoding method integrating word N-gram language model

Citations (1)

Publication number Priority date Publication date Assignee Title
CN110176230A (en) * 2018-12-11 2019-08-27 腾讯科技(深圳)有限公司 A kind of audio recognition method, device, equipment and storage medium

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20070150279A1 (en) * 2005-12-27 2007-06-28 Oracle International Corporation Word matching with context sensitive character to sound correlating


Also Published As

Publication number Publication date
CN114360528A (en) 2022-04-15

Similar Documents

Publication Publication Date Title
US20220309088A1 (en) Method and apparatus for training dialog model, computer device, and storage medium
CN112989767B (en) Medical term labeling method, medical term mapping device and medical term mapping equipment
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN114360528B (en) Speech recognition method, device, computer equipment and storage medium
CN111753498A (en) Text processing method, device, equipment and storage medium
CN111444321B (en) Question answering method, device, electronic equipment and storage medium
CN113763931B (en) Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN112989134A (en) Node relation graph processing method, device, equipment and storage medium
CN117454954A (en) Model training method, device, computer equipment and storage medium
CN110990549A (en) Method and device for obtaining answers, electronic equipment and storage medium
CN115130456A (en) Sentence parsing and matching model training method, device, equipment and storage medium
CN115312040A (en) Voice wake-up method and device, electronic equipment and computer readable storage medium
CN111597823B (en) Method, device, equipment and storage medium for extracting center word
CN114510942A (en) Method for acquiring entity words, and method, device and equipment for training model
CN113822084A (en) Statement translation method and device, computer equipment and storage medium
CN114281937A (en) Training method of nested entity recognition model, and nested entity recognition method and device
CN115116437A (en) Speech recognition method, apparatus, computer device, storage medium and product
CN113823266A (en) Keyword detection method, device, equipment and storage medium
CN113515943A (en) Natural language processing method and method, device and storage medium for acquiring model thereof
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium
CN113761195A (en) Text classification method and device, computer equipment and computer readable storage medium
CN111737415A (en) Entity relationship extraction method, and method and device for acquiring entity relationship learning model
CN116644762A (en) Translation model training method, device, computer equipment and storage medium
CN117807993A (en) Word segmentation method, word segmentation device, computer equipment and storage medium
CN114328858A (en) Information processing method, information processing device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40071034

Country of ref document: HK

GR01 Patent grant