CN112489630A - Voice recognition method and device


Info

Publication number: CN112489630A
Application number: CN201910866542.XA
Authority: CN (China)
Prior art keywords: information, voice, sample, vector, speech
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 陈明
Current Assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Original Assignee: Wuhan TCL Group Industrial Research Institute Co Ltd
Application filed by: Wuhan TCL Group Industrial Research Institute Co Ltd
Priority to: CN201910866542.XA
Publication of: CN112489630A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Abstract

The application is applicable to the field of communication technologies and provides a voice recognition method and device. The voice recognition method comprises: acquiring voice information to be recognized; converting the voice information into an initial voice vector of a preset dimension; and processing the initial voice vector with a preset voice recognition model to obtain predicted text information corresponding to the voice information. The voice recognition model is obtained by training a deep learning network based on sample voice vectors of a plurality of pieces of sample voice information and the sample text information corresponding to the sample voice information. Because the original voice information is input into the voice recognition model directly, no voice features need to be extracted before the data enters the model, which improves recognition speed; and because the model can extract the complete feature information of the original voice information, no part of the original information is lost, which improves the accuracy of voice recognition.

Description

Voice recognition method and device
Technical Field
The present application relates to the field of communications technologies, and in particular, to a speech recognition method and device.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input, such as keystrokes, binary codes, character sequences or text. In prior-art speech recognition methods, speech feature information is usually first extracted from the speech information to be recognized, and the extracted feature information is then input into an acoustic model trained with a machine learning algorithm to obtain the speech recognition result.
However, extracting the voice features consumes hardware resources of the voice recognition device and slows down data processing, so voice recognition through the acoustic model becomes slower; at the same time, part of the information in the original signal is lost during feature extraction, which makes the voice recognition result inaccurate.
Disclosure of Invention
In view of this, embodiments of the present application provide a speech recognition method and device to solve the problems that the existing speech recognition method is slow and that part of the information in the original signal is lost during feature extraction, which makes the speech recognition result inaccurate.
A first aspect of an embodiment of the present application provides a speech recognition method, including:
acquiring voice information to be recognized;
converting the voice information into an initial voice vector with a preset dimension;
processing the initial voice vector by adopting a preset voice recognition model to obtain predicted text information corresponding to the voice information; the speech recognition model is obtained by training a deep learning network based on sample speech vectors of a plurality of sample speech information and sample text information corresponding to the sample speech information.
A second aspect of an embodiment of the present application provides a speech recognition apparatus, including:
an acquisition unit, configured to acquire voice information to be recognized;
a conversion unit, configured to convert the voice information into an initial voice vector of a preset dimension; and
a recognition unit, configured to process the initial voice vector with a preset voice recognition model to obtain predicted text information corresponding to the voice information; the voice recognition model is obtained by training a deep learning network based on sample voice vectors of a plurality of pieces of sample voice information and the sample text information corresponding to the sample voice information.
A third aspect of embodiments of the present application provides a speech recognition device, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the speech recognition method according to the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium, which stores a computer program that, when executed by a processor, implements the steps of the speech recognition method according to the first aspect.
A fifth aspect of embodiments of the present application provides a computer program product, which, when run on a speech recognition device, causes the speech recognition device to perform the steps of the speech recognition method according to the first aspect.
According to the embodiments of the application, after the voice information to be recognized is converted into an initial voice vector of a preset dimension, the initial voice vector is input into the voice recognition model for processing to obtain the text information corresponding to the voice information. During recognition, the voice recognition model processes the vector corresponding to the original voice information directly; the feature information of the original voice information does not need to be extracted before the model is applied, which avoids the slowdown in data processing caused by feature extraction occupying hardware resources (memory, processor resources and the like), leaves the available hardware resources for voice recognition itself, and therefore improves recognition efficiency. In addition, because the voice recognition model performs recognition on the vector corresponding to the complete original audio information, it can obtain complete voice feature information; compared with schemes that first extract feature information and then input the extracted features into a model, this avoids losing part of the original audio information during feature extraction and therefore improves the accuracy of voice recognition.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flow chart diagram of a speech recognition method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a network architecture of a speech recognition model according to an embodiment of the present application;
FIG. 3 is a schematic network structure diagram of a speech recognition model according to another embodiment of the present application;
FIG. 4 is a flow chart illustrating a speech recognition method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of a speech recognition apparatus provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a speech recognition device according to an embodiment of the present application.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
In order to explain the technical solution described in the present application, the following description will be given by way of specific examples.
Referring to fig. 1, fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present application. In this embodiment, the main execution body of the speech recognition method is a speech recognition device, and the speech recognition device includes but is not limited to a mobile terminal such as a smart phone, a tablet computer, and a wearable device, and may also be a desktop computer, a robot, a server, and the like. The speech recognition method as shown in fig. 1 may include:
s101: and acquiring the voice information to be recognized.
When a user needs to recognize voice information, the user can trigger a voice recognition instruction through an interaction interface of the voice recognition device, trigger the device by voice to generate the voice recognition instruction, or send, in a voice-controlled manner, an instruction identifying the current voice information to be recognized to the voice recognition device.
When detecting a voice recognition instruction, the voice recognition device can acquire voice information to be recognized sent by a speaker in the surrounding environment through a built-in sound pickup device (such as a microphone); or the voice recognition device acquires the audio file or the video file corresponding to the file identifier according to the file identifier contained in the voice recognition instruction, extracts the sound information in the audio file or the video file, and recognizes the sound information as the voice information to be recognized. The audio file or the video file may be uploaded by a user, or may be downloaded from a server or a database for storing the audio file or the video file, which is not limited herein.
The voice recognition device can also receive voice information to be recognized sent by other devices.
S102: and converting the voice information into an initial voice vector with a preset dimension.
The voice recognition equipment converts voice information to be recognized into an initial voice vector with preset dimensionality so as to input a vector corresponding to the original voice information into a voice recognition model for processing to obtain a corresponding prediction result. Before inputting data into the speech recognition model, the speech recognition device does not need to extract the characteristic information of the original speech information, so that the problem that the data processing speed is slow because the hardware resources (memory, processor resources and the like) are occupied by extracting the characteristic information of the original speech information can be solved.
The preset dimension may be two-dimensional, but is not limited thereto, and may be set according to the actual situation, and is not limited herein. The following description will take the initial speech vector as a two-dimensional vector as an example.
Assuming that the speech information to be recognized is an audio signal with a duration of n seconds and a sampling rate of 16000 Hz, the audio signal can be converted into a one-dimensional vector, which can be denoted as (1, n × 16000).
Then, the speech recognition apparatus converts the one-dimensional vector into an initial speech vector of a preset dimension. In the process, the one-dimensional vector can be regarded as a matrix, and the matrix is converted through a reshape function in MATLAB to obtain an initial voice vector with a preset dimension. The reshape function can readjust the number of rows, columns, and dimensions of the matrix.
It should be noted that the numbers of elements of the two matrices before and after conversion using the reshape function must be the same, in this embodiment, the elements included in the one-dimensional vector before conversion and the initial speech vector after conversion need to be the same, and if an element in the initial speech vector is empty during the conversion process, the element can be filled with 0.
For example, the speech recognition device may call the function B = reshape(A, m, n) to convert the one-dimensional vector corresponding to the speech information to be recognized into an initial speech vector of the preset dimension, where A denotes the one-dimensional vector corresponding to the speech information to be recognized, and B denotes the initial speech vector of the preset dimension; the size of B is m × n (m rows and n columns). The number of elements contained in A is m × n.
Assuming that the preset dimension of the initial speech vector is two, when the speech information to be recognized is an audio signal with a duration of n seconds and a sampling rate of 16000 Hz, the one-dimensional vector corresponding to the speech information is denoted as (1, n × 16000). Taking 400 sampling points as one frame (i.e. 25 milliseconds per frame), a reshape operation is performed on the one-dimensional vector (1, n × 16000) to convert it into a two-dimensional vector, so the initial speech vector can be denoted as ([n × 16000/400], 400); if n × 16000/400 is not an integer, the vector is padded with 0 before the reshape operation.
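As an illustration only, the conversion described above can be sketched as follows in NumPy; the function name is hypothetical, and NumPy stands in for the MATLAB reshape function mentioned in the text.

```python
import numpy as np

def to_initial_speech_vector(audio, frame_len=400):
    """Reshape a 1-D audio signal (e.g. n seconds at 16000 Hz) into a two-dimensional
    initial speech vector of shape ([n*16000/400], 400); the tail is padded with zeros
    when the number of samples is not divisible by frame_len, as described above."""
    audio = np.asarray(audio, dtype=np.float32).reshape(-1)   # flatten the (1, n*16000) vector
    remainder = audio.size % frame_len
    if remainder:
        audio = np.pad(audio, (0, frame_len - remainder))     # fill missing elements with 0
    return audio.reshape(-1, frame_len)

# Example: a 3-second signal sampled at 16000 Hz -> shape (120, 400)
signal = np.random.randn(3 * 16000)
print(to_initial_speech_vector(signal).shape)
```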
S103: processing the initial voice vector by adopting a preset voice recognition model to obtain predicted text information corresponding to the voice information; the speech recognition model is obtained by training a deep learning network based on sample speech vectors of a plurality of sample speech information and sample text information corresponding to the sample speech information.
A pre-trained voice recognition model is stored in the voice recognition device in advance. The model is obtained by training, with a machine learning algorithm, on the sample voice vectors corresponding to a plurality of pieces of sample voice information in a training sample set. Each piece of sample voice information in the set is stored in association with its corresponding sample text information; the sample text information identifies the text recognition result corresponding to the sample voice information, i.e. the result of converting the sample voice information into text.
The input of the speech recognition model is sample speech information in the training sample and corresponding sample text information, and the output of the speech recognition model is a text recognition result corresponding to the sample speech information.
It is understood that the speech recognition model may be trained in advance by the speech recognition device, or a file corresponding to the speech recognition model may be transplanted to the speech recognition device after being trained in advance by another device, that is, an execution subject for training the speech recognition model may be the same as or different from an execution subject for performing speech recognition using the speech recognition model. Specifically, when the training of the deep learning network is finished, the other devices fix model parameters of the deep learning network, and transplant the voice recognition model file corresponding to the deep learning network into the voice recognition device.
It is understood that the network structure of the speech recognition model during the training process is the same as the network structure corresponding to the application process (predicting text information corresponding to speech information). For example, in the training process, the speech recognition model includes a sampling layer, a semantic analysis layer and a speech recognition layer, and accordingly, when text information corresponding to the speech information is predicted by the speech recognition model, the speech recognition model also includes the sampling layer, the semantic analysis layer and the speech recognition layer.
Further, for example, if during training the sampling layer of the speech recognition model includes 3 sub-sampling layers, the semantic analysis layer includes 2 sub-semantic analysis layers and the speech recognition layer includes 2 sub-speech recognition layers, then, when text information corresponding to the speech information is predicted by the speech recognition model, the sampling layer likewise includes 3 sub-sampling layers, the semantic analysis layer includes 2 sub-semantic analysis layers, and the speech recognition layer includes 2 sub-speech recognition layers. In the application process, the working principle of each layer is the same as in the training process, so the input and output of each layer of the neural network during application can refer to the related description of the training process and are not repeated here.
And the voice recognition equipment inputs the initial voice vector corresponding to the voice information to be recognized into a pre-trained voice recognition model to perform processing such as feature extraction, feature analysis, feature recognition and the like, so as to obtain a text recognition result corresponding to the voice information to be recognized.
Specifically, the voice recognition device extracts local feature information of an initial voice vector corresponding to the voice information to be recognized through a preset voice recognition model, extracts context information of all the local feature information, and predicts predicted text information corresponding to the voice information to be recognized based on the context information. The context information may be semantic context information or spatial context information. The semantic context information is used for identifying the semantic relevance between adjacent words and between adjacent sentences. The spatial context information is used for identifying position correlation between adjacent words and between preceding and succeeding sentences.
In one embodiment, the speech recognition model may have the network architecture shown in FIG. 2, in which the speech recognition model comprises a sampling layer, a semantic analysis layer and a speech recognition layer.
In another embodiment, please refer to fig. 3, in which fig. 3 is a schematic diagram of a network structure of a speech recognition model according to another embodiment of the present application.
The sampling layer comprises a plurality of sub-sampling layers; the feature vector output by each sub-sampling layer serves as the input of the next adjacent sub-sampling layer, and the output of the last sub-sampling layer is the final output of the sampling layer, i.e. the local feature information vector of the voice information. Each sub-sampling layer includes a convolutional layer and a pooling layer. The sampling layer is used to extract local features of the voice information to be recognized. The number of sub-sampling layers may be 5, but is not limited thereto and may be set according to actual needs. It can be understood that the more sub-sampling layers there are, the more local features are extracted, and the more accurate the speech recognition result is.
The semantic analysis layer includes a plurality of Bi-directional Long Short-Term Memory (BiLSTM) networks and is used to extract the context information of the voice information to be recognized. BiLSTM is prior art and is not described here again.
The speech recognition layer comprises an attention mechanism layer and can also comprise a full connection layer. Wherein, the attention mechanism layer can be composed of a plurality of Long Short-Term Memory networks (LSTM). LSTM is prior art and will not be described herein. The voice recognition layer is used for processing the context information and outputting text information corresponding to the voice information to be recognized. When the voice recognition layer does not comprise a full connection layer, outputting a recognition result by the attention mechanism layer; when the speech recognition layer comprises the full connection layer, the recognition result is output by the full connection layer.
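For orientation, the three-part structure described above can be sketched roughly as follows. PyTorch is used only as an example framework; the channel count, hidden size, vocabulary size, and the reduction of the attention mechanism to a single attention-pooling step followed by a fully connected output are assumptions made for brevity, not details taken from the patent.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Rough sketch of the three-part model: sampling layer (convolution + max pooling
    sub-layers), semantic analysis layer (stacked BiLSTM) and speech recognition layer
    (here simplified to attention pooling followed by a fully connected output)."""

    def __init__(self, frame_len=400, channels=8, hidden=128, vocab_size=5000):
        super().__init__()
        def sub_sampling(c_in, c_out):          # one sub-sampling layer = conv + pooling
            return nn.Sequential(nn.Conv2d(c_in, c_out, kernel_size=5, padding=2),
                                 nn.MaxPool2d(kernel_size=2))
        self.sampling = nn.Sequential(sub_sampling(1, channels),
                                      sub_sampling(channels, channels),
                                      sub_sampling(channels, channels))
        feat = channels * (frame_len // 8)      # feature size per time step after 3 poolings
        self.semantic = nn.LSTM(feat, hidden, num_layers=2,
                                bidirectional=True, batch_first=True)
        self.attention = nn.Linear(2 * hidden, 1)           # attention score per time step
        self.classifier = nn.Linear(2 * hidden, vocab_size)

    def forward(self, x):                        # x: (batch, num_frames, frame_len)
        x = self.sampling(x.unsqueeze(1))        # add a channel dim, extract local features
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)      # back to (batch, time, features)
        seq, _ = self.semantic(x)                # context information via the BiLSTM stack
        weights = torch.softmax(self.attention(seq), dim=1)
        summary = (weights * seq).sum(dim=1)     # attention-weighted summary of the sequence
        return self.classifier(summary)          # scores over the preset text labels

model = SpeechRecognitionModel()
print(model(torch.randn(2, 120, 400)).shape)     # torch.Size([2, 5000])
```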
Further, in order to obtain the local feature and the context information of the speech information to be recognized, so as to accurately recognize the text information corresponding to the speech information by combining the local feature and the context information, and improve the accuracy of the speech recognition result, as shown in fig. 2, S103 may include S1031 to S1033, which are specifically as follows:
s1031: inputting the initial voice vector into a sampling layer of the voice recognition model to carry out convolution and down-sampling processing to obtain a local characteristic information vector corresponding to the initial voice vector; and the local feature information vector is used for identifying the local feature corresponding to the initial voice vector.
The voice recognition device inputs the initial voice vector corresponding to the voice information to be recognized into the sampling layer of the voice recognition model, and obtains the local feature information vector corresponding to the initial voice vector through convolution and down-sampling of the initial voice vector in the sampling layer; the local feature information vector is used for identifying the local features corresponding to the initial voice vector.
Specifically, referring to fig. 3, the speech recognition apparatus may input the initial speech vector into the kth sub-sampling layer of the sampling layer, where k is an integer greater than or equal to 1 and each sub-sampling layer includes a convolutional layer and a pooling layer. The initial speech vector is convolved by the convolutional layer in the kth sub-sampling layer, the convolution result is input into the pooling layer of the kth sub-sampling layer, and the convolution result is down-sampled by a maximum pooling method to obtain the kth local feature information vector. Then, the kth local feature information vector is input into the (k + 1)th sub-sampling layer of the sampling layer; the convolutional layer in the (k + 1)th sub-sampling layer performs convolution processing on it, the convolution result is input into the pooling layer of the (k + 1)th sub-sampling layer, and down-sampling is performed by the maximum pooling method to obtain the (k + 1)th local feature information vector.
It can be understood that the output result of the (k + 1) th sub-sampling layer is the final output of the sampling layer, and all local feature information corresponding to the initial speech vector is identified. The number of local feature information vectors output by the sampling layer may be one, or may be at least two, and is not limited herein. When the local feature information vector output by the sampling layer is one, the local feature information vector can identify all local feature information corresponding to the initial voice vector; when the number of the local feature information vectors output by the sampling layer is at least two, the sum of the local feature information identified by the at least two local feature information vectors is all the local feature information corresponding to the initial voice vector.
S1032: inputting the local feature information vectors into a semantic analysis layer of the voice recognition model for processing, determining context information of all the local feature information vectors, and generating voice sequence feature vectors based on the local feature information vectors and the context information; wherein the speech sequence feature vector is used to identify the context of all the local features.
The voice recognition device takes all the local feature information vectors output by the sampling layer as the input of the semantic analysis layer of the voice recognition model, inputs them into the semantic analysis layer for processing, analyzes the context information of all the local feature information vectors, and determines the order of all the local feature information corresponding to the feature information vectors. The context information may be semantic context information or spatial context information. The semantic context information is used for identifying the semantic relevance between adjacent words and between adjacent sentences. The spatial context information is used for identifying the positional correlation between adjacent words and between preceding and succeeding sentences. Then, the voice recognition device sorts all the local feature information corresponding to the local feature information vectors based on the local feature information vectors and the context information to generate a voice sequence feature vector.
Specifically, the speech recognition device may input all the local feature information vectors output by the sampling layer into the plurality of BiLSTMs shown in FIG. 2 for processing to obtain the speech sequence feature vector. It is understood that for the first BiLSTM, the input is all the local feature information vectors output by the sampling layer, and the output is a speech sequence feature vector. The output of the current BiLSTM serves as the input of the next adjacent BiLSTM, and the output of the last BiLSTM of the semantic analysis layer is the final output of the semantic analysis layer and serves as the input of the speech recognition layer of the speech recognition model. The next BiLSTM adjacent to the current BiLSTM is determined by the network connection relationship and the data transfer direction of each BiLSTM in the semantic analysis layer.
A BiLSTM is formed by combining a forward Long Short-Term Memory (LSTM) network and a backward LSTM. Both forward and backward LSTMs are commonly used to model context information in natural language processing tasks.
When word vectors are combined into a sentence representation, an additive method may be used: for example, all vectors representing single words are summed, or summed and averaged, to obtain the sentence representation. However, these methods do not consider the order of the words in the sentence. Take the sentence "I don't feel that he is good" as an example: the word "don't" negates the following "good", i.e. the emotional polarity of the sentence is negative. Longer-distance dependencies of this kind can be captured better using an LSTM, because an LSTM learns during training which information to remember and which to forget.
In finer-grained classification, such as a five-class task of strongly positive, weakly positive, neutral, weakly negative and strongly negative, the interaction between emotion words, degree words and negation words needs to be taken into account. For example, in a sentence such as "this restaurant is too dirty, not as good as the one next door", the degree word modifies "dirty" and the negation word "not" interacts with "good"; these relationships can be captured better by a BiLSTM, whose output takes both the preceding and the following context into account.
S1033: and inputting the voice sequence feature vector into a voice recognition layer of the voice recognition model for processing to obtain predicted text information corresponding to the voice information.
The speech recognition device inputs the speech sequence feature vector output by the semantic analysis layer into a speech recognition layer of a speech recognition model for processing, analyzes the speech sequence feature vector based on an Attention (Attention) mechanism, determines text information corresponding to the speech information, and outputs the recognized predicted text information.
In terms of the role Attention plays, it can be classified from two perspectives: Spatial Attention and Temporal Attention. From a more practical perspective, Attention can be divided into Soft Attention and Hard Attention. With Soft Attention, every piece of data receives attention and a corresponding attention weight is calculated; no screening condition is set. With Hard Attention, after the attention weights are generated, the parts that do not meet a condition are screened out and their attention weights are set to 0, which can be understood as no longer paying attention to those parts.
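As a minimal numeric illustration of the distinction (the scores and the 0.2 threshold are arbitrary values chosen for the example):

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0, 1.5])        # raw attention scores for four items
soft = np.exp(scores) / np.exp(scores).sum()     # Soft Attention: every item keeps a weight

hard = soft.copy()
hard[hard < 0.2] = 0.0                           # Hard Attention: weights that fail the
print(soft, hard)                                # condition are set to 0 (no longer attended)
```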
Specifically, when the speech recognition layer consists of the attention mechanism layer shown in fig. 3, the speech recognition device inputs the speech sequence feature vector output by the semantic analysis layer into the attention mechanism layer; the multiple LSTM networks in the attention mechanism layer analyze the speech sequence feature vector based on the attention mechanism and screen out the speech feature information that meets the conditions; the screened speech feature information is mapped to the corresponding text labels (word ids) based on the preset correspondence between speech feature information and text labels; and the characters corresponding to the mapped text labels are sorted according to the order of the speech feature information to obtain the text information corresponding to the speech information.
When the speech recognition layer is composed of the attention mechanism layer and the full-connection layer as shown in fig. 3, the speech recognition device inputs the speech sequence feature vectors output by the semantic analysis layer into the attention mechanism layer for processing so as to screen out speech feature information meeting the conditions, and inputs the processing results output by the attention mechanism layer into the full-connection layer for processing so as to map the processing results output by the attention mechanism layer onto a preset text label, so as to obtain text information corresponding to the speech information.
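Putting S101 to S103 together, a hypothetical end-to-end call might look as follows, reusing the conversion helper and model sketched earlier. Loading of trained weights is omitted, and the greedy argmax read-out is only a stand-in for the label mapping and sorting described above.

```python
import numpy as np
import torch

audio = np.random.randn(3 * 16000)                 # S101: voice information to be recognized
initial_vec = to_initial_speech_vector(audio)      # S102: initial speech vector, shape (120, 400)

model = SpeechRecognitionModel()                   # S103: preset speech recognition model
model.eval()                                       # pretrained weights would be loaded here
with torch.no_grad():
    scores = model(torch.from_numpy(initial_vec).unsqueeze(0))
predicted_word_id = scores.argmax(dim=-1)          # index into the preset text labels (word ids)
```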
According to the above scheme, after the voice information to be recognized is converted into an initial voice vector of a preset dimension, the initial voice vector is input into the voice recognition model for processing to obtain the text information corresponding to the voice information. During recognition, the voice recognition model processes the vector corresponding to the original voice information directly; no feature information needs to be extracted before the model is applied, which avoids the slowdown caused by feature extraction occupying hardware resources (memory, processor resources and the like), leaves those resources for voice recognition and thus improves recognition efficiency. Moreover, because the model performs recognition on the vector corresponding to the complete original audio information, it can obtain complete voice feature information; compared with schemes that extract feature information first and then input it into a model, this avoids losing part of the original audio information during feature extraction and thus improves the accuracy of voice recognition.
Referring to fig. 4, fig. 4 is a schematic flow chart of a speech recognition method according to another embodiment of the present application. In this embodiment, the main execution body of the speech recognition method is a speech recognition device, which includes but is not limited to mobile terminals such as smart phones, tablet computers and wearable devices, and may also be a desktop computer, a robot, a server, and the like. The difference between this embodiment and the previous embodiment is that the previous embodiment does not include S201 to S203. S204 to S206 in this embodiment are the same as S101 to S103 in the previous embodiment, so for the detailed implementation of S204 to S206 please refer to the related description of the embodiment corresponding to fig. 1, which is not repeated here. S201 to S203 may be executed before S206, and are specifically as follows:
s201: and converting the sample voice information in the training sample set into a sample voice vector with a preset dimension.
The voice recognition device can obtain the training sample set from a database for storing training samples, or obtain a training sample set stored by other devices; the training sample set may also be preset and input by relevant personnel. The training sample set comprises a plurality of pieces of sample voice information and the corresponding sample text information. The number of pieces of sample voice information is not limited and can be set according to the actual situation; to a certain extent, the more sample voice information the training sample set contains, the more accurate the recognition result is when speech recognition is performed with the voice recognition model trained on it.
It is understood that the training sample set may include sample speech information spoken by multiple speakers for the same text information, may also include sample speech information spoken by different speakers for different text information, or sample speech information spoken by the same speaker for different text information.
The speech recognition device can divide the sample speech information in the training sample set into a plurality of batches, so that the sample speech information of different batches can be used for training. The sample speech information of the same batch can be regarded as a sub-sample set.
It can be understood that, in the training process, all sample voice information in the training sample set may be used for training, or part of sample voice information in the training sample set may be used for training; the sample speech information used in each training may be the same or different, and is not limited herein. For example, the first time S201 is executed, the first batch of sample speech information may be used, and the second time S201 is executed, the first batch of sample speech information may be used, or any other batch of sample speech information may be used.
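A minimal sketch of the batching described above; the batch size and the pairing of speech with text are illustrative assumptions.

```python
def make_batches(training_samples, batch_size=32):
    """Split the training sample set into batches (sub-sample sets); each element of
    training_samples is assumed to be a (sample_speech, sample_text) pair."""
    return [training_samples[i:i + batch_size]
            for i in range(0, len(training_samples), batch_size)]
```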
It can be understood that the speech recognition device may convert all sample speech information participating in training into sample speech vectors of preset dimensions in advance, and then perform S202; or when training is performed using the sample speech information, the sample speech information may be converted into a sample speech vector of a preset dimension.
Before the sample voice information is input into the deep learning network, the voice recognition equipment converts the sample voice information into vectors with preset dimensionality, and the efficiency of extracting the voice feature information can be improved.
Specifically, the speech recognition device may convert the sample speech information into a one-dimensional vector based on the respective speech durations and sampling rates corresponding to the sample speech information, and call a reshape function in the MATLAB to convert the one-dimensional vector into a sample speech vector of a preset dimension. The preset dimension may be two-dimensional, but is not limited thereto, and may be set according to the actual situation, and is not limited herein.
The reshape function can readjust the number of rows, columns, and dimensions of the matrix. It should be noted that the numbers of elements of the two matrices before and after conversion using the reshape function must be the same, in this embodiment, the elements included in the one-dimensional vector before conversion and the sample speech vector after conversion need to be the same, and if an element in the sample speech vector is absent during the conversion process, the element may be filled with 0.
For example, the speech recognition device may call the function B = reshape(A, m, n) to convert the one-dimensional vector corresponding to each piece of sample speech information into a sample speech vector of the preset dimension, where A represents the one-dimensional vector corresponding to the sample speech information and B represents the sample speech vector of the preset dimension, with size m × n. The number of elements contained in A is m × n; when m × n is larger than the number of elements contained in A, the positions in B that are not filled by elements of A are filled with 0.
S202: and inputting a sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result.
After the voice recognition equipment inputs the sample voice vector corresponding to the sample voice information into the deep learning network for processing, local feature information and context information of the sample voice information are extracted, and the local feature information and the context information of the sample voice information are analyzed to obtain a text recognition result of the sample voice information. The speech recognition model comprises a sampling layer, a semantic analysis layer and a speech recognition layer.
It can be understood that, in the training process, when it is necessary to return to executing S202, the sample speech information corresponding to executing S202 for the first time may be the same as or different from the sample speech information corresponding to executing S202 for the non-first time.
Further, S202 may include S2021 to S2023, specifically as follows:
s2021: inputting a sample voice vector corresponding to sample voice information into a sampling layer of the deep learning network for convolution and downsampling processing to obtain a sample local characteristic information vector corresponding to the sample voice information; and the sample local feature information vector is used for identifying the local feature corresponding to the sample voice information.
The voice recognition equipment inputs sample voice vectors corresponding to the sample voice information into a sampling layer of a voice recognition model, and each sample voice vector is subjected to convolution and down-sampling processing through the sampling layer to obtain local characteristic information vectors corresponding to the sample voice vectors; the local feature information vector is used for identifying the local feature corresponding to the sample voice vector.
Further, when the sampling layer includes a plurality of sub-sampling layers as shown in fig. 3, the feature vector output by each sub-sampling layer serves as the input of the next adjacent sub-sampling layer, and the last sub-sampling layer outputs the sample local feature information vector. It can be understood that the more sub-sampling layers the sampling layer includes, the more local feature information of the sample speech information is extracted, and the closer the text recognition result corresponding to the sample speech information is to the sample text information. Taking a sampling layer that includes three sub-sampling layers as an example, S2021 may include the following steps:
s20211: and inputting a sample voice vector corresponding to the sample voice information into a first sub-sampling layer in the deep learning network for convolution and down-sampling processing to obtain a first sample local characteristic information vector.
In the present embodiment, the sampling layer includes three sub-sampling layers, each of which includes a convolutional layer and a pooling layer, and the sampling layer is used to extract local features of the speech information to be recognized.
It is understood that in other embodiments, the number of sub-sampling layers may be greater than 3, for example, 5, but is not limited thereto, and may be set according to actual needs, and is not limited herein.
The voice recognition equipment inputs a sample voice vector corresponding to the sample voice information into a first sub-sampling layer of a voice recognition model, convolves the sample voice vector on a convolution layer of the first sub-sampling layer according to a preset convolution window and convolution step length, inputs a convolution result into a pooling layer of the first sub-sampling layer, and performs down-sampling processing on the convolution result by adopting a maximum pooling method based on the preset pooling window and pooling step length to obtain a first sample local feature information vector.
The convolution window size of the preset convolution layer is (5, 5), that is, the convolution kernel is 5 × 5, and the pooling window size of the preset pooling layer is (2, 2). The step size of convolution and pooling is 1, and the number and dimension of channels before and after convolution are unchanged.
The output of each sub-sampling layer is a four-dimensional vector that reflects information in the following dimensions: the number of pieces of sample speech information used in this training, the horizontal size of the convolution kernel, the vertical size of the convolution kernel, and the number of convolution kernels. It can be denoted as (batch_size, left_size, right_size, num_filters), where batch_size is the number of pieces of sample voice information used in this training (i.e. the number of pieces of sample voice information participating in this training), left_size is the horizontal size of the convolution kernel, right_size is the vertical size of the convolution kernel, and num_filters is the number of convolution kernels.
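A sketch of one sub-sampling layer with the parameters given above (a 5 × 5 convolution kernel, a 2 × 2 pooling window, stride 1 for both, channel count unchanged). PyTorch and the channel count of 8 are assumptions, and PyTorch orders dimensions as (batch, channels, height, width) rather than the (batch_size, left_size, right_size, num_filters) ordering used in the text.

```python
import torch
import torch.nn as nn

class SubSamplingLayer(nn.Module):
    """Convolution (5x5 kernel, stride 1, channel count unchanged) followed by
    max pooling (2x2 window, stride 1), as described for each sub-sampling layer."""
    def __init__(self, channels=8):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=(5, 5), stride=1, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=(2, 2), stride=1)

    def forward(self, x):
        return self.pool(self.conv(x))

# Three sub-sampling layers stacked as in S20211-S20213; the output is a four-dimensional tensor
sampling = nn.Sequential(SubSamplingLayer(), SubSamplingLayer(), SubSamplingLayer())
print(sampling(torch.randn(4, 8, 120, 400)).shape)   # torch.Size([4, 8, 117, 397])
```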
S20212: and inputting the first sample local characteristic information vector into a second sub-sampling layer adjacent to the first sub-sampling layer to carry out convolution and down-sampling treatment to obtain a second sample local characteristic information vector.
The convolution and downsampling process is the same as in S20211, and is not described here. And the second sub-sampling layer is an intermediate sampling layer between the first sub-sampling layer and the third sub-sampling layer. The number of the second sub-sampling layers may be one or at least two. And when the number of the second sub-sampling layers is at least two, the output of the adjacent previous second sub-sampling layer is used as the input of the next second sub-sampling layer, and the output of the last second sub-sampling layer is used as the input of the third sub-sampling layer.
S20213: and inputting the second sample local characteristic information vector into a third sub-sampling layer for convolution and down-sampling processing to obtain the sample local characteristic information vector.
And the sample local characteristic information vectors output by the third sub-sampling layer are all local characteristic information corresponding to sample voice information respectively corresponding to each sample voice vector input into the deep learning network. The third sub-sampling layer is the last sub-sampling layer of the sampling layers, so that the output of the third sub-sampling layer is the output corresponding to the whole sampling layer, that is, the output result of the third sub-sampling layer is the sample local feature information vector corresponding to the sample speech information in S2021.
S2022: inputting the sample local characteristic information vector into a semantic analysis layer of the deep learning network for processing, determining context information of all the sample local characteristic information vectors, and generating a sample voice sequence characteristic vector based on the sample local characteristic information vector and the context information; wherein the sample speech sequence feature vector is used to identify the context of all the local features.
The speech recognition equipment inputs all sample local characteristic information vectors corresponding to the sample speech information output by the sampling layer into a semantic analysis layer of the deep learning network for processing, analyzes the context information of all sample local characteristic information vectors corresponding to the sample speech information, and determines the sequence of all local characteristic information corresponding to the sample speech information. Then, the voice recognition device sorts all local feature information corresponding to the sample local feature information vector based on the sample local feature information vector and the context information, and generates a sample voice sequence feature vector of the sample voice information.
The semantic analysis layer is used for extracting context information of the voice information to be recognized.
Further, the semantic analysis layer may include at least two sub-semantic analysis layers, each consisting of a BiLSTM. The output result of each sub-semantic analysis layer serves as the input of the next sub-semantic analysis layer, and the output result of the last sub-semantic analysis layer is the output of the semantic analysis layer. The output result of each sub-semantic analysis layer comprises a speech sequence vector and a state information vector.
It can be understood that the more the number of sub-semantic analysis layers included in the semantic analysis layer is, the more comprehensive the context information corresponding to the sample speech information extracted by the semantic analysis layer is, and the closer the text recognition result corresponding to the sample speech information obtained through the deep learning network processing is to the sample text information.
The speech sequence vector can be expressed as output(batch_size, sequence_length, hidden_size), and the state information vector as state(batch_size, hidden_size), where batch_size is the number of pieces of sample speech information used in this training, sequence_length is the length of the sample speech information, and hidden_size is the size of the hidden layer; the size of the hidden layer is preset.
Further, when the semantic analysis layer includes two sub-semantic analysis layers, S2022 may include:
s20221: and inputting the sample local characteristic information vector into a first sub-semantic analysis layer of the deep learning network for processing, and outputting a first voice sequence vector and a first state information vector.
The voice recognition device inputs the sample local feature information vector corresponding to each piece of sample voice information output by the sampling layer into the first BiLSTM of the deep learning network for processing, analyzes the context information corresponding to each piece of sample voice information, and outputs a first voice sequence vector and a first state information vector. The first voice sequence vector reflects the number of pieces of sample voice information used in this training, the length of the sample voice information, and the size of the hidden layer in the first BiLSTM. The first state information vector reflects the state of the hidden layer in the first BiLSTM and the number of pieces of sample voice information used in this training.
The first speech sequence vector may be represented as output1(batch_size, sequence_length, hidden_size), and the first state information vector as state1(batch_size, hidden_size).
S20222: and inputting the first voice sequence vector and the first state information into the second sub-semantic analysis layer for processing, and outputting the sample voice sequence feature vector and the second state information vector.
In the present embodiment, the sub-semantic analysis layer has 2 layers, and the second sub-semantic analysis layer is connected to the speech recognition layer.
The sample speech sequence feature vector reflects the number of pieces of sample speech information used in this training, the length of the sample speech information, and the size of the hidden layer in the second BiLSTM; it can be represented as output(batch_size, sequence_length, hidden_size).
The second state information vector, which may be denoted as state2(batch_size, hidden_size), reflects the state of the hidden layer in the second BiLSTM and the number of pieces of sample speech information used in this training.
In other embodiments, when the semantic analysis layer includes 3 or more sub-semantic analysis layers, the first sub-semantic analysis layer is the first layer of the semantic analysis layer, and the second sub-semantic analysis layer is the last layer of the semantic analysis layer. The speech recognition equipment inputs the first speech sequence vector and the first state information into a next sub-semantic analysis layer adjacent to the first sub-semantic analysis layer for processing, inputs an output result into the next sub-semantic analysis layer adjacent to the output result, and so on.
It is understood that the hidden layer size of each layer may be the same, or may be different, and is specifically set according to actual requirements, and is not limited herein.
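A sketch of the two sub-semantic-analysis layers (BiLSTMs) and the shapes involved; the sizes are placeholders, and PyTorch returns the hidden state with an extra leading direction dimension rather than the state(batch_size, hidden_size) form used in the text.

```python
import torch
import torch.nn as nn

batch_size, sequence_length, feature_size, hidden_size = 4, 100, 400, 128

bilstm1 = nn.LSTM(feature_size, hidden_size, bidirectional=True, batch_first=True)
bilstm2 = nn.LSTM(2 * hidden_size, hidden_size, bidirectional=True, batch_first=True)

x = torch.randn(batch_size, sequence_length, feature_size)  # sample local feature information vectors
output1, state1 = bilstm1(x)          # first speech sequence vector + first state information
output2, state2 = bilstm2(output1)    # sample speech sequence feature vector + second state information
print(output1.shape, output2.shape)   # both (batch_size, sequence_length, 2 * hidden_size)
print(state2[0].shape)                # hidden state, here (2, batch_size, hidden_size)
```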
S2023: and inputting the sample voice sequence feature vector into a voice recognition layer of the deep learning network for processing to obtain a text recognition result corresponding to the sample voice information.
The speech recognition equipment inputs the sample speech sequence feature vector corresponding to each sample speech information output by the semantic analysis layer into the speech recognition layer of the deep learning network for processing, analyzes the sample speech sequence feature vector based on an attention mechanism, determines a recognition result corresponding to the sample speech information, and outputs recognized text information.
Further, the speech recognition layer may include at least two sub-speech recognition layers, each sub-speech recognition layer being formed of an LSTM. The feature vector output by each sub-speech recognition layer is used as the input of the next sub-speech recognition layer adjacent to the sub-speech recognition layer, and the last sub-speech recognition layer outputs the text recognition result. It can be understood that the more sub-speech recognition layers included in the speech recognition layer, the more speech feature information reflected by the speech feature vector output by the speech recognition layer, and the closer the text recognition result of the output sample speech information is to the sample text information corresponding to the sample speech information.
Further, when the speech recognition layer includes two sub-speech recognition layers, S2023 may include:
s20231: and inputting the sample voice sequence feature vector and the second state information vector into a first sub-voice recognition layer of the deep learning network for processing to obtain a voice feature vector and a third state information vector.
The speech recognition device inputs a sample speech sequence feature vector and a second state information vector corresponding to each sample speech information output by the semantic analysis layer into a first sub-speech recognition layer of the deep learning network for processing, and analyzes the sample speech sequence feature vector and the second state information vector based on an attention mechanism to obtain a speech feature vector and a state information vector (namely, a third state information vector). The speech feature vector is a three-dimensional vector that reflects information in the following dimensions: the number of sample voice messages, the length of the sample voice messages and the size of the hidden layer adopted by the training are determined. The third state information vector is a two-dimensional vector reflecting information in the following dimensions: the number of sample speech information and the hidden layer size of the LSTM used in this training.
For example, the speech feature vector may be represented as output(batch_size, sequence_length, hidden_size), and the third state information vector as state(batch_size, hidden_size), where batch_size represents the number of pieces of sample speech information used in this training, sequence_length represents the length of the sample speech information, and hidden_size represents the hidden layer size of the LSTM.
It is to be understood that when the voice recognition layer includes 3 or more sub voice recognition layers, the first sub voice recognition layer is a first layer of the voice recognition layer, and the second sub voice recognition layer is a last layer of the voice recognition layer. The output of the previous sub-speech recognition layer is used as the input of the next sub-speech recognition layer adjacent to the previous sub-speech recognition layer, and the output results of the other sub-speech recognition layers except the last sub-speech recognition layer comprise a speech feature vector and a state information vector. The speech feature vector reflects information in the following dimensions: the number of sample voice messages of the current batch, the length of the sample voice messages, and the hidden layer size information. The state information vector reflects information in the following dimensions: the number of sample speech information per batch and the hidden layer size.
S20232: and inputting the voice feature vector and the third state information vector into the second sub-voice recognition layer for processing, and outputting a text recognition result corresponding to the sample voice information.
Specifically, when the speech recognition layer is composed of the attention mechanism layer shown in fig. 3, the speech recognition device inputs the sample speech sequence feature vector corresponding to the sample speech information output by the semantic analysis layer into the attention mechanism layer. The plurality of LSTMs included in the attention mechanism layer analyze the sample speech sequence feature vector based on the attention mechanism and screen out the speech feature information meeting the conditions. The screened speech feature information is then mapped to corresponding text labels based on the preset correspondence between speech feature information and text labels, and the characters corresponding to the mapped text labels are sorted according to the order of the speech feature information, so as to obtain the text recognition result corresponding to the sample speech information.
Further, the speech recognition layer may include at least two sub-speech recognition layers and a fully connected layer.
Specifically, when the speech recognition layer is composed of the attention mechanism layer and the fully connected layer shown in fig. 3, the speech recognition device inputs the sample speech sequence feature vector corresponding to each piece of sample speech information output by the semantic analysis layer into the attention mechanism layer for processing, so as to screen out the speech feature information meeting the conditions. The processing result output by the attention mechanism layer is then input into the fully connected layer, which maps it to corresponding text labels, and the characters corresponding to the mapped text labels are sorted according to the order of the speech feature information, so as to obtain the text recognition result corresponding to each piece of sample speech information.
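The mapping from screened speech features to text labels might look like the following sketch, which simplifies the full attention mechanism to a learned frame weighting followed by a fully connected classifier; the vocabulary size, shapes and class name are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class RecognitionHead(nn.Module):
    """Sketch: frame-level attention weighting plus a fully connected layer
    mapping each frame to scores over a text-label vocabulary."""
    def __init__(self, hidden_size=256, num_labels=5000):
        super().__init__()
        self.attn_score = nn.Linear(hidden_size, 1)         # scores each frame
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, seq_features):                        # (batch, seq_len, hidden)
        weights = torch.softmax(self.attn_score(seq_features), dim=1)
        attended = seq_features * weights                   # re-weight frames, keep their order
        return self.classifier(attended)                    # (batch, seq_len, num_labels)

head = RecognitionHead()
logits = head(torch.randn(2, 100, 256))
text_label_ids = logits.argmax(dim=-1)                      # mapped text labels, in frame order
print(text_label_ids.shape)                                 # torch.Size([2, 100])
```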
S203: and correcting the model parameters of the deep learning network according to the text recognition result corresponding to the sample voice information and the sample text information corresponding to the sample voice information, and returning to execute the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result until the training condition of the deep learning network meets a first preset condition, so as to obtain the voice recognition model.
And the voice recognition equipment compares a text recognition result corresponding to the sample voice information with sample text information corresponding to the sample voice information, determines the difference between the text recognition result and the sample text information, and adjusts the model parameters of the deep learning network according to the difference. And then returning to S202, and continuing to execute S202-S203, so that the deep learning network continues to train after the model parameters are adjusted until the training condition of the deep learning network meets the first preset condition.
The training condition of the deep learning network meeting the first preset condition may be: training is stopped when the accumulated total number of training iterations reaches a preset threshold, or when the difference between the text recognition result output by the deep learning network and the corresponding sample text information meets a preset requirement, and the trained deep learning network is taken as the speech recognition model; that is, the deep learning network whose model parameters were adjusted most recently is used as the speech recognition model.
The preset requirement may be that the difference degree is smaller than or equal to a preset difference degree threshold, or that the difference degree belongs to a preset error range, but is not limited to this, and may also be set according to an actual situation.
Further, in the training process, in order to prevent the deep learning network from being over-fitted, S203 may include S2031 to S2033, specifically as follows:
S2031: and evaluating the difference degree between the text recognition result corresponding to the sample voice information and the sample text information through a preset loss function.
In machine learning, it is desirable that the predicted data distribution learned by the model on the training data be as close as possible to the real data distribution, and cross entropy is therefore often used as the loss function. The preset loss function may be set according to the actual situation, and is not limited herein.
The difference between the text recognition result corresponding to each sample voice message and the sample text message is used for measuring the accuracy of the recognition result.
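As a sketch of how the difference degree could be computed with cross entropy (assuming PyTorch and per-frame label targets; the vocabulary size and shapes are placeholders, not values from the patent):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = torch.randn(16, 5000)            # text recognition result, flattened over frames
targets = torch.randint(0, 5000, (16,))   # sample text information as label indices
difference = criterion(logits, targets)   # scalar difference degree
print(difference.item())
```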
When obtaining the difference degree between the text recognition result corresponding to the sample voice information and the sample text information, the voice recognition device judges whether the difference degree meets a second preset condition, and when the difference degree does not meet the second preset condition, the voice recognition device jumps to S202 and continues to execute S202 and S2031; when the degree of difference satisfies the second preset condition, S2033 is performed. The second preset condition may be that the difference degree is smaller than or equal to a preset difference degree threshold, or the difference degree belongs to a preset error range, but is not limited thereto, and may also be set according to an actual situation, and is not limited herein.
S2032: and when the difference does not meet a second preset condition, adjusting the model parameters of the deep learning network, and returning to execute the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result.
For example, when the second preset condition is that the degree of difference is less than or equal to the preset difference degree threshold, the speech recognition device determines that the current speech recognition accuracy has not yet met the requirement when it confirms that the current degree of difference is greater than the threshold, adjusts the model parameters of the deep learning network, and then returns to S202, continuing to execute S202 and S2031 until the degree of difference determined in S2031 is less than or equal to the threshold, after which S2033 is executed.
S2033: and when the difference degree meets the second preset condition, stopping training the deep learning network, and taking the trained deep learning model as the voice recognition model.
For example, when the second preset condition is that the degree of difference is less than or equal to the preset difference degree threshold, the speech recognition device determines that the training meets the expected requirement when it confirms that the degree of difference is less than or equal to the threshold, stops training the deep learning network, and takes the trained deep learning network as the speech recognition model.
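Putting S2031 to S2033 together, a training loop of the following shape could be used. This is a minimal sketch under assumed PyTorch conventions; the model, data loader, threshold and step limit are placeholders, not values from the patent.

```python
import torch

def train_until_condition(model, loader, max_steps=10000, diff_threshold=0.05, lr=1e-3):
    """Sketch: adjust parameters while the difference degree exceeds the threshold,
    stop when it no longer does or when a maximum number of steps is reached."""
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while True:
        for speech_vectors, text_labels in loader:
            logits = model(speech_vectors)                       # text recognition result
            loss = criterion(logits.flatten(0, 1), text_labels.flatten())
            if loss.item() <= diff_threshold or step >= max_steps:
                return model                                     # trained speech recognition model
            optimizer.zero_grad()
            loss.backward()                                      # adjust the model parameters
            optimizer.step()
            step += 1
```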
At this point, the deep learning network whose model parameters have been adjusted has been trained on a large number of samples and its difference degree is kept within a small range, so it can be used to process voice information and obtain a more accurate recognition result.
According to the above scheme, the local feature information of the original voice information is extracted through the convolution layer and the pooling layer in the sampling layer of the voice recognition model, the context information of the original voice information is extracted through the BiLSTM of the semantic analysis layer, and the context information of the local feature information is analyzed through the attention mechanism to obtain the text information corresponding to the original voice information. Since the original voice information is input directly into the voice recognition model, feature information does not need to be extracted in advance, which saves the hardware resources occupied by feature extraction and speeds up recognition through the voice recognition model. Moreover, the voice recognition model can extract the complete feature information of the original voice information, which avoids the problem that the recognition result is not accurate enough due to the loss of part of the original audio information during feature extraction, and thus improves the accuracy of voice recognition.
The voice recognition model comprises a sampling layer, a semantic analysis layer and a voice recognition layer, wherein the sampling layer is provided with a plurality of sub-sampling layers, so that more local characteristic information can be extracted through the sampling layer; the semantic analysis layer is provided with at least two sub-semantic analysis layers, so that the context information extracted by the semantic analysis layers is more comprehensive; the speech recognition layer is provided with at least two sub-speech recognition layers, so that the speech feature vectors output by the speech recognition layer can reflect more speech feature information, and the accuracy of the text recognition result of the speech information output by the speech recognition model is further improved.
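To make the layered structure concrete, the network could be assembled as in the sketch below. This is an illustrative PyTorch assumption only: the kernel sizes, channel counts, hidden sizes and vocabulary size are placeholders, and the attention mechanism of the speech recognition layer is simplified to stacked LSTMs followed by a fully connected layer.

```python
import torch
import torch.nn as nn

class SpeechRecognitionNet(nn.Module):
    """Sketch: sampling layer (conv + pooling) -> semantic analysis layer (BiLSTM)
    -> speech recognition layer (LSTM + fully connected)."""
    def __init__(self, hidden=256, num_labels=5000):
        super().__init__()
        # sampling layer: three conv + down-sampling sub-layers on the raw waveform vector
        self.sampling = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=11, padding=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=11, padding=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=11, padding=5), nn.ReLU(), nn.MaxPool1d(2),
        )
        # semantic analysis layer: two BiLSTM sub-layers extracting context information
        self.semantic = nn.LSTM(128, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        # speech recognition layer: two LSTM sub-layers plus a fully connected layer
        self.recognition = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_labels)

    def forward(self, waveform):                  # (batch, samples)
        x = self.sampling(waveform.unsqueeze(1))  # (batch, 128, samples / 8)
        x = x.transpose(1, 2)                     # (batch, time, 128)
        x, _ = self.semantic(x)
        x, _ = self.recognition(x)
        return self.fc(x)                         # (batch, time, num_labels)

net = SpeechRecognitionNet()
logits = net(torch.randn(2, 16000))               # two one-second clips at 16 kHz
print(logits.shape)                               # torch.Size([2, 2000, 5000])
```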
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
Referring to fig. 5, fig. 5 is a schematic diagram of a speech recognition apparatus according to an embodiment of the present application. The included units are used for executing steps in the embodiments corresponding to fig. 1 and fig. 4, and refer to the related description in the embodiments corresponding to fig. 1 and fig. 4. For convenience of explanation, only the portions related to the present embodiment are shown. Referring to fig. 5, the speech recognition apparatus 5 includes:
an obtaining unit 510, configured to obtain voice information to be recognized;
a converting unit 520, configured to convert the voice information into an initial voice vector with a preset dimension;
the recognition unit 530 is configured to process the initial speech vector by using a preset speech recognition model to obtain predicted text information corresponding to the speech information; the speech recognition model is obtained by training a deep learning network based on sample speech vectors of a plurality of sample speech information and sample text information corresponding to the sample speech information.
Further, the voice recognition model comprises a sampling layer, a semantic analysis layer and a voice recognition layer; the recognition unit 530 includes:
the sampling unit is used for inputting the initial voice vector into a sampling layer of the voice recognition model to carry out convolution and downsampling processing so as to obtain a local characteristic information vector corresponding to the initial voice vector; the local feature information vector is used for identifying local features corresponding to the initial voice vector;
a semantic analysis unit, configured to input the local feature information vectors into a semantic analysis layer of the speech recognition model for processing, determine context information of all the local feature information vectors, and generate speech sequence feature vectors based on the local feature information vectors and the context information; wherein the speech sequence feature vector is used for identifying the context of all the local features;
and the voice recognition unit is used for inputting the voice sequence feature vector into a voice recognition layer of the voice recognition model for processing to obtain predicted text information corresponding to the voice information.
Further, the speech recognition apparatus further includes:
the sample conversion unit is used for converting the sample voice information in the training sample set into a sample voice vector with a preset dimension; the training sample set comprises a plurality of sample voice messages and sample text messages corresponding to the sample voice messages respectively;
the first training unit is used for inputting a sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result;
and the second training unit is used for correcting the model parameters of the deep learning network according to the text recognition result corresponding to the sample voice information and the sample text information corresponding to the sample voice information, and returning to execute the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain the text recognition result until the training condition of the deep learning network meets a first preset condition, so as to obtain the voice recognition model.
Further, the second training unit comprises:
the difference degree analysis unit is used for evaluating the difference degree between the text recognition result corresponding to the sample voice information and the sample text information through a preset loss function;
the adjusting unit is used for adjusting the model parameters of the deep learning network when the difference does not meet a second preset condition, and returning to execute the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result;
and the locking unit is used for stopping training the deep learning network when the difference degree meets the second preset condition, and taking the trained deep learning model as the voice recognition model.
Further, the voice recognition model comprises a sampling layer, a semantic analysis layer and a voice recognition layer; the first training unit includes:
the sampling training unit is used for inputting a sample voice vector corresponding to the sample voice information into a sampling layer of the deep learning network for convolution and downsampling processing to obtain a sample local characteristic information vector corresponding to the sample voice information; the sample local feature information vector is used for identifying local features corresponding to the sample voice information;
a semantic training unit, configured to input the sample local feature information vector into a semantic analysis layer of the deep learning network for processing, determine context information of all the sample local feature information vectors, and generate a sample speech sequence feature vector based on the sample local feature information vector and the context information; wherein the sample speech sequence feature vector is used to identify the context of all the local features;
and the text training unit is used for inputting the sample voice sequence feature vector into the voice recognition layer of the deep learning network for processing to obtain a text recognition result corresponding to the sample voice information.
Further, the sampling layer comprises three sub-sampling layers, the feature vector output by each sub-sampling layer is used as the input of the next sub-sampling layer adjacent to the sub-sampling layer, and the last sub-sampling layer outputs the sample local feature information vector;
the sampling training unit is specifically configured to:
inputting a sample voice vector corresponding to the sample voice information into a first sub-sampling layer in the deep learning network for convolution and down-sampling processing to obtain a first sample local characteristic information vector;
inputting the first sample local feature information vector into a second sub-sampling layer adjacent to the first sub-sampling layer to carry out convolution and down-sampling processing to obtain a second sample local feature information vector;
and inputting the second sample local characteristic information vector into a third sub-sampling layer for convolution and down-sampling processing to obtain the sample local characteristic information vector.
Further, the semantic analysis layer comprises two sub-semantic analysis layers; the output result of each sub-semantic analysis layer is used as the input of the next sub-semantic analysis layer, and the output result of the last sub-semantic analysis layer is used as the input of the semantic analysis layer; wherein, the output result of each sub-semantic analysis layer comprises a voice sequence vector and a state information vector;
the semantic training unit is specifically configured to: inputting the sample local characteristic information vector into a first sub-semantic analysis layer of the deep learning network for processing, and outputting a first voice sequence vector and a first state information vector;
and inputting the first voice sequence vector and the first state information vector into the second sub-semantic analysis layer for processing, and outputting the sample voice sequence feature vector and the second state information vector.
Further, the speech recognition layer comprises two sub-speech recognition layers; the text training unit is specifically configured to: inputting the sample voice sequence feature vector and the second state information vector into a first sub-voice recognition layer of the deep learning network for processing to obtain a voice feature vector and a third state information vector; and inputting the voice feature vector and the third state information vector into the second sub-voice recognition layer for processing, and outputting a text recognition result corresponding to each sample voice information.
Fig. 6 is a schematic diagram of a speech recognition device provided in an embodiment of the present application. As shown in fig. 6, the speech recognition apparatus 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62, such as a speech recognition program, stored in said memory 61 and operable on said processor 60. The processor 60, when executing the computer program 62, implements the steps in the various speech recognition method embodiments described above, such as the steps 101 to 103 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, implements the functions of the modules/units in the above-mentioned device embodiments, such as the functions of the modules 510 to 530 shown in fig. 5.
Illustratively, the computer program 62 may be partitioned into one or more modules/units that are stored in the memory 61 and executed by the processor 60 to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution of the computer program 62 in the speech recognition device 6. For example, the computer program 62 may be divided into an obtaining unit, a converting unit, and an identifying unit, and specific functions of each unit are described in the embodiment corresponding to fig. 5, which is not described herein again.
The speech recognition device 6 may be a desktop computer, a notebook, a palmtop computer, a cloud server, or another computing device. The speech recognition device may include, but is not limited to, a processor 60 and a memory 61. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the speech recognition device 6 and does not limit it; the device may include more or fewer components than shown, or combine some components, or use different components. For example, the speech recognition device may also include input/output devices, network access devices, buses, and the like.
The processor 60 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 61 may be an internal storage unit of the speech recognition device 6, such as a hard disk or a memory of the speech recognition device 6. The memory 61 may also be an external storage device of the speech recognition device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the speech recognition device 6. Further, the memory 61 may also include both an internal storage unit of the voice recognition device 6 and an external storage device. The memory 61 is used for storing the computer program and other programs and data required by the speech recognition device. The memory 61 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or illustrated in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal device and method may be implemented in other ways. For example, the above-described embodiments of the apparatus/terminal device are merely illustrative, and for example, the division of the modules or units is only one logical division, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in the jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunications signals.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A speech recognition method, comprising:
acquiring voice information to be recognized;
converting the voice information into an initial voice vector with a preset dimension;
processing the initial voice vector by adopting a preset voice recognition model to obtain predicted text information corresponding to the voice information; the speech recognition model is obtained by training a deep learning network based on sample speech vectors of a plurality of sample speech information and sample text information corresponding to the sample speech information.
2. The speech recognition method of claim 1, wherein the speech recognition model comprises a sampling layer, a semantic analysis layer, and a speech recognition layer; processing the initial voice vector by adopting a preset voice recognition model to obtain text information corresponding to the voice information, and the method comprises the following steps:
inputting the initial voice vector into a sampling layer of the voice recognition model to carry out convolution and down-sampling processing to obtain a local characteristic information vector corresponding to the initial voice vector; the local feature information vector is used for identifying local features corresponding to the initial voice vector;
inputting the local feature information vectors into a semantic analysis layer of the voice recognition model for processing, determining context information of all the local feature information vectors, and generating voice sequence feature vectors based on the local feature information vectors and the context information; wherein the speech sequence feature vector is used for identifying the context of all the local features;
and inputting the voice sequence feature vector into a voice recognition layer of the voice recognition model for processing to obtain predicted text information corresponding to the voice information.
3. The speech recognition method according to claim 1 or 2, wherein before the processing the initial speech vector by using the preset speech recognition model to obtain the text information corresponding to the speech information, the method further comprises:
converting sample voice information in a training sample set into a sample voice vector with a preset dimension; the training sample set comprises a plurality of sample voice messages and sample text messages corresponding to the sample voice messages respectively;
inputting a sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result;
and correcting the model parameters of the deep learning network according to the text recognition result corresponding to the sample voice information and the sample text information corresponding to the sample voice information, and returning to execute the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result until the training condition of the deep learning network meets a first preset condition, so as to obtain the voice recognition model.
4. The speech recognition method according to claim 3, wherein the step of correcting the model parameters of the deep learning network according to the text recognition result corresponding to the sample speech information and the sample text information corresponding to the sample speech information, and returning to perform the step of inputting the sample speech vector corresponding to the sample speech information into the deep learning network for processing to obtain the text recognition result until the training condition of the deep learning network satisfies a first preset condition to obtain the speech recognition model comprises:
evaluating the difference degree between the text recognition result corresponding to the sample voice information and the sample text information through a preset loss function;
when the difference does not meet a second preset condition, adjusting the model parameters of the deep learning network, and returning to execute the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result;
and when the difference degree meets the second preset condition, stopping training the deep learning network, and taking the trained deep learning network as the voice recognition model.
5. The speech recognition method of claim 3, wherein the speech recognition model comprises a sampling layer, a semantic analysis layer, and a speech recognition layer; the step of inputting the sample voice vector corresponding to the sample voice information into the deep learning network for processing to obtain a text recognition result includes:
inputting a sample voice vector corresponding to sample voice information into a sampling layer of the deep learning network for convolution and downsampling processing to obtain a sample local characteristic information vector corresponding to the sample voice information; the sample local feature information vector is used for identifying local features corresponding to the sample voice information;
inputting the sample local characteristic information vector into a semantic analysis layer of the deep learning network for processing, determining context information of all the sample local characteristic information vectors, and generating a sample voice sequence characteristic vector based on the sample local characteristic information vector and the context information; wherein the sample speech sequence feature vector is used to identify the context of all the local features;
and inputting the sample voice sequence feature vector into a voice recognition layer of the deep learning network for processing to obtain a text recognition result corresponding to the sample voice information.
6. The speech recognition method of claim 5, wherein the sampling layer comprises three sub-sampling layers, wherein each sub-sampling layer outputs a feature vector as an input to a next sub-sampling layer adjacent thereto, and a last sub-sampling layer outputs the sample local feature information vector;
the method for inputting the sample voice vector corresponding to the sample voice information into the sampling layer of the deep learning network for convolution and downsampling processing to obtain the sample local feature information vector corresponding to the sample voice information includes:
inputting a sample voice vector corresponding to the sample voice information into a first sub-sampling layer in the deep learning network for convolution and down-sampling processing to obtain a first sample local characteristic information vector;
inputting the first sample local feature information vector into a second sub-sampling layer adjacent to the first sub-sampling layer to carry out convolution and down-sampling processing to obtain a second sample local feature information vector;
and inputting the second sample local characteristic information vector into a third sub-sampling layer for convolution and down-sampling processing to obtain the sample local characteristic information vector.
7. The speech recognition method according to claim 5 or 6, wherein the semantic analysis layer comprises two sub-semantic analysis layers;
the inputting the sample local feature information vector into a semantic analysis layer of the deep learning network for processing, determining context information of all the sample local feature information vectors, and generating a sample voice sequence feature vector based on the sample local feature information vector and the context information includes:
inputting the sample local characteristic information vector into a first sub-semantic analysis layer of the deep learning network for processing, and outputting a first voice sequence vector and a first state information vector;
and inputting the first voice sequence vector and the first state information vector into the second sub-semantic analysis layer for processing, and outputting the sample voice sequence feature vector and the second state information vector.
8. The speech recognition method of claim 5 or 6, wherein the speech recognition layer comprises two sub-speech recognition layers;
the step of inputting the sample voice sequence feature vector into a voice recognition layer of the deep learning network for processing to obtain a text recognition result corresponding to the sample voice information includes:
inputting the sample voice sequence feature vector and the second state information vector into a first sub-voice recognition layer of the deep learning network for processing to obtain a voice feature vector and a third state information vector;
and inputting the voice feature vector and the third state information vector into the second sub-voice recognition layer for processing, and outputting a text recognition result corresponding to the sample voice information.
9. A speech recognition device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the speech recognition method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the speech recognition method according to any one of claims 1 to 8.
CN201910866542.XA 2019-09-12 2019-09-12 Voice recognition method and device Pending CN112489630A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910866542.XA CN112489630A (en) 2019-09-12 2019-09-12 Voice recognition method and device

Publications (1)

Publication Number Publication Date
CN112489630A (en) 2021-03-12

Family

ID=74920917

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170125020A1 (en) * 2015-10-29 2017-05-04 Samsung Sds Co., Ltd. System and method for voice recognition
WO2019019667A1 (en) * 2017-07-28 2019-01-31 深圳光启合众科技有限公司 Speech processing method and apparatus, storage medium and processor

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131347A (en) * 2023-10-25 2023-11-28 上海为旌科技有限公司 Method and device for generating driver dynamic image, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination