CN112712797A - Voice recognition method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN112712797A
CN112712797A
Authority
CN
China
Prior art keywords
voice
speech
model
feature
feature extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011600083.XA
Other languages
Chinese (zh)
Inventor
王健宗 (Wang Jianzong)
瞿晓阳 (Qu Xiaoyang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202011600083.XA (published as CN112712797A)
Priority to PCT/CN2021/084048 (published as WO2022141867A1)
Publication of CN112712797A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/16 - Speech classification or search using artificial neural networks
    • G10L 15/26 - Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to voice processing technology, and discloses a voice recognition method comprising the following steps: training a preset contrastive predictive coding model by using a first voice set to obtain a voice feature extraction model; performing feature extraction on a second voice set by using the voice feature extraction model to obtain a voice feature set; training a preset deep learning model by using the voice feature set to obtain a voice recognition model; when a voice to be recognized is received, performing feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set; and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text. The invention also relates to blockchain technology: the target voice feature set can be stored in a blockchain. The invention further provides a voice recognition device, an electronic device and a readable storage medium. The invention can improve the accuracy of voice recognition.

Description

Voice recognition method and device, electronic equipment and readable storage medium
Technical Field
The present invention relates to the field of speech processing, and in particular, to a speech recognition method, apparatus, electronic device, and readable storage medium.
Background
With the development of artificial intelligence, voice recognition technology, by which a machine converts a voice signal into the corresponding text through a process of recognition and understanding, allows machines to understand voice commands more easily and accelerates the intelligentization of daily life; it has therefore received more and more attention.
However, current voice recognition technology needs to extract mel-frequency cepstral coefficient (MFCC) features from the voice, and these features are very sensitive to noise: noise degrades them significantly, so the accuracy of voice recognition is low.
Disclosure of Invention
The invention provides a voice recognition method, a voice recognition device, electronic equipment and a computer readable storage medium, and mainly aims to improve the accuracy of voice recognition.
In order to achieve the above object, the present invention provides a speech recognition method, including:
acquiring a first voice set, and training a preset contrastive predictive coding model by using the first voice set to obtain a voice feature extraction model;
acquiring a second voice set, and performing feature extraction on the second voice set by using the voice feature extraction model to obtain a voice feature set;
training a preset deep learning model by using the voice feature set to obtain the voice recognition model;
when receiving a voice to be recognized, performing feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set;
and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text.
Optionally, the performing feature extraction on the second speech set by using the speech feature extraction model to obtain a speech feature set includes:
resampling each voice in the second voice set to obtain corresponding digital voice;
pre-emphasis is carried out on the digital voice to obtain corresponding standard digital voice;
performing feature extraction on the standard digital voice by using the voice feature extraction model to obtain a voice feature subset;
and summarizing all the voice feature subsets to obtain the voice feature set.
Optionally, the performing feature extraction on the standard digital speech by using the speech feature extraction model to obtain a speech feature subset includes:
dividing the standard digital voice into a plurality of voice paragraphs according to a preset time scale to obtain a voice paragraph set;
and performing feature extraction on each speech paragraph in the speech paragraph set by using the speech feature extraction model to obtain the speech feature subset.
Optionally, the training a preset deep learning model by using the speech feature set to obtain the speech recognition model includes:
performing character marking on each voice feature contained in the voice feature set to obtain a training set;
and performing iterative training on the deep learning model by using the training set to obtain the voice recognition model.
Optionally, the performing iterative training on the deep learning model by using the training set to obtain the speech recognition model includes:
a feature extraction step: performing convolution pooling operation on the training set according to preset convolution pooling times to obtain a feature set;
and a loss calculation step: calculating the feature set by using a preset activation function to obtain a predicted value, vectorizing characters marked by each voice feature in the training set to obtain a label value, and calculating by using a pre-constructed first loss function according to the predicted value and the label value to obtain a first loss value;
training and judging: comparing the first loss value with a preset first loss threshold, and returning to the feature extraction step when the first loss value is greater than or equal to the preset first loss threshold; and when the first loss value is smaller than the preset first loss threshold, stopping training to obtain the voice recognition model.
Optionally, the performing feature extraction on the speech to be recognized by using the speech feature extraction model to obtain a target speech feature set includes:
dividing the voice to be recognized into a plurality of target voice paragraphs according to the time scale;
marking the sequence number of each target speech paragraph to obtain a target speech paragraph set;
and performing voice feature extraction on each target voice paragraph in the target voice paragraph set by using the voice feature extraction model to obtain the target voice feature set.
Optionally, the recognizing the target speech feature set by using the speech recognition model to obtain a recognition text includes:
recognizing each target voice feature contained in the target voice feature set by using the voice recognition model to obtain a corresponding recognition character;
and sequentially combining the recognition characters according to the sequence numbers of the target voice paragraphs corresponding to the target voice paragraph set to obtain the recognition text.
In order to solve the above problem, the present invention also provides a speech recognition apparatus, comprising:
the feature extraction model construction module is used for acquiring a first voice set, and training a preset contrastive predictive coding model by using the first voice set to obtain a voice feature extraction model;
the voice recognition model construction module is used for acquiring a second voice set and extracting the characteristics of the second voice set by using the voice characteristic extraction model to obtain a voice characteristic set; training a preset deep learning model by using the voice feature set to obtain the voice recognition model;
the voice recognition module is used for performing feature extraction on the voice to be recognized by utilizing the voice feature extraction model when receiving the voice to be recognized to obtain a target voice feature set; and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one computer program; and
a processor executing the computer program stored in the memory to implement the speech recognition method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the speech recognition method described above.
In the embodiment of the invention, the first speech set is used to train a preset contrastive predictive coding model to obtain a speech feature extraction model; because the contrastive predictive coding model is unsupervised, training samples are easy to obtain at low cost, so the speech feature extraction model has stronger feature extraction capability and higher robustness. A second speech set is acquired, and feature extraction is performed on it with the speech feature extraction model to obtain a speech feature set, which improves the quality of the training data for the deep learning model. The preset deep learning model is trained with the speech feature set to obtain the speech recognition model; relying on the speech feature extraction model saves labeling effort, improves the quality of the training data, and improves the recognition capability of the speech recognition model. When a speech to be recognized is received, feature extraction is performed on it with the speech feature extraction model to obtain a target speech feature set, and the target speech feature set is recognized with the speech recognition model to obtain a recognized text. Therefore, the speech recognition method, apparatus, electronic device and computer-readable storage medium provided by the embodiment of the invention improve the accuracy of speech recognition.
Drawings
Fig. 1 is a schematic flow chart of a speech recognition method according to an embodiment of the present invention;
fig. 2 is a schematic flow chart illustrating a process of obtaining a speech feature set in a speech recognition method according to an embodiment of the present invention;
fig. 3 is a block diagram of a speech recognition apparatus according to an embodiment of the present invention;
fig. 4 is a schematic internal structural diagram of an electronic device implementing a speech recognition method according to an embodiment of the present invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the invention provides a voice recognition method. The execution subject of the voice recognition method includes, but is not limited to, at least one of the electronic devices, such as a server or a terminal, that can be configured to execute the method provided by the embodiments of the present application. In other words, the voice recognition method may be performed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
Referring to fig. 1, a flow diagram of a speech recognition method according to an embodiment of the present invention is shown, where in the embodiment of the present invention, the speech recognition method includes:
s1, acquiring a first voice set, and training a preset contrast prediction coding model by using the first voice set to obtain a voice feature extraction model;
In an embodiment of the present invention, the first speech set includes speech in multiple languages and multiple dialects, with multiple kinds of background noise.
Further, in the embodiment of the present invention, in order to give the model a speech feature extraction capability, the first speech set is used to iteratively train a preset contrastive predictive coding model until the model converges, so as to obtain the speech feature extraction model. The contrastive predictive coding (CPC) model is unsupervised, so the training data does not need to be labeled and a large amount of training data can be obtained at low cost, which gives the model stronger feature extraction capability.
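For illustration, a minimal sketch of contrastive predictive coding training follows, assuming PyTorch; the encoder layout, layer sizes, number of prediction steps and the InfoNCE-style loss below are illustrative assumptions, not parameters taken from this disclosure.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CPC(nn.Module):
    """Sketch of a CPC model: waveform encoder plus autoregressive context model."""
    def __init__(self, feat_dim=256, ctx_dim=256, pred_steps=4):
        super().__init__()
        # Encoder: strided 1-D convolutions map the raw waveform to latent frames z_t.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, feat_dim, kernel_size=10, stride=5, padding=3), nn.ReLU(),
            nn.Conv1d(feat_dim, feat_dim, kernel_size=8, stride=4, padding=2), nn.ReLU(),
        )
        # Autoregressive model: summarizes z_1..z_t into a context vector c_t.
        self.ar = nn.GRU(feat_dim, ctx_dim, batch_first=True)
        # One linear predictor per future step k.
        self.predictors = nn.ModuleList(
            [nn.Linear(ctx_dim, feat_dim) for _ in range(pred_steps)])

    def forward(self, wav):                      # wav: (batch, 1, samples)
        z = self.encoder(wav).transpose(1, 2)    # (batch, frames, feat_dim)
        c, _ = self.ar(z)                        # (batch, frames, ctx_dim)
        return z, c

def info_nce_loss(z, c, predictors):
    """Each context vector must identify its true future frame among the batch."""
    loss = 0.0
    for k, w in enumerate(predictors, start=1):
        pred = w(c[:, :-k])                      # predictions for z_{t+k}
        target = z[:, k:]                        # actual z_{t+k}
        # Score every prediction against every candidate frame in the batch.
        logits = torch.einsum('btd,ntd->btn', pred, target)
        labels = torch.arange(z.size(0), device=z.device)
        labels = labels.view(-1, 1).expand(-1, logits.size(1))
        loss = loss + F.cross_entropy(logits.transpose(1, 2), labels)
    return loss / len(predictors)

wav = torch.randn(4, 1, 16000)                   # a batch of unlabeled 1-second clips
model = CPC()
z, c = model(wav)
info_nce_loss(z, c, model.predictors).backward() # self-supervised: no labels needed

After such training, the encoder (and optionally the context network) would be kept as the speech feature extraction model.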
S2, acquiring a second voice set, and performing feature extraction on the second voice set by using the voice feature extraction model to obtain a voice feature set;
in the embodiment of the present invention, the second speech set is a set of speech having corresponding text labels.
Further, in the embodiment of the present invention, in order to determine the voice features corresponding to different characters and make subsequent voice recognition more accurate, feature extraction is performed on the second voice set: the voice feature of each voice in the second voice set is extracted to obtain the voice feature set.
In detail, in the embodiment of the present invention, referring to fig. 2, performing feature extraction on the second speech set by using the speech feature extraction model to obtain the speech feature set includes:
s11, resampling each voice in the second voice set to obtain corresponding digital voice;
In the embodiment of the present invention, in order to facilitate data processing of each voice in the second voice set, the sample audio is resampled to obtain the digital voice; preferably, the embodiment of the present invention uses an analog-to-digital converter to resample the sample audio.
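A minimal resampling sketch follows, assuming the audio is already a float array; a production system would use a dedicated resampler with proper anti-aliasing filtering, so the linear interpolation here is only an illustration.

import numpy as np

def resample(wav: np.ndarray, src_rate: int, dst_rate: int) -> np.ndarray:
    """Resample a 1-D waveform from src_rate to dst_rate by linear interpolation."""
    dst_len = int(round(len(wav) / src_rate * dst_rate))
    src_times = np.arange(len(wav)) / src_rate
    dst_times = np.arange(dst_len) / dst_rate
    return np.interp(dst_times, src_times, wav)

digital_voice = resample(np.random.randn(44100), src_rate=44100, dst_rate=16000)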
S12, pre-emphasis is carried out on the digital voice to obtain standard digital voice;
in detail, the embodiment of the present invention performs the pre-emphasis operation by using the following formula:
y(t)=x(t)-μx(t-1)
where x(t) is the digital voice, t is time, y(t) is the standard digital voice, and μ is a preset adjustment value of the pre-emphasis operation; preferably, μ has a value range of [0.9, 1.0].
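The pre-emphasis formula can be vectorized directly; a minimal sketch assuming numpy, with μ = 0.97 as a common choice within the stated [0.9, 1.0] range:

import numpy as np

def pre_emphasis(x: np.ndarray, mu: float = 0.97) -> np.ndarray:
    y = np.empty_like(x)
    y[0] = x[0]                  # t = 0 has no predecessor, so keep it unchanged
    y[1:] = x[1:] - mu * x[:-1]  # y(t) = x(t) - mu * x(t-1)
    return y

standard_digital_voice = pre_emphasis(digital_voice)  # digital_voice from the sketch above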
S13, performing feature extraction on the standard digital voice by using the voice feature extraction model to obtain a voice feature subset;
In detail, since the durations of different standard digital speech samples may differ, in order to make the speech features uniform and facilitate subsequent model recognition, the standard digital speech is divided into a plurality of speech paragraphs according to a preset time scale to obtain a speech paragraph set, and feature extraction is performed on each speech paragraph in the speech paragraph set by using the speech feature extraction model to obtain the speech feature subset.
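Dividing the standard digital speech on a preset time scale might look like the sketch below; the 2-second scale and 16 kHz sampling rate are assumptions for illustration.

import numpy as np

def split_paragraphs(speech: np.ndarray, sample_rate: int = 16000,
                     time_scale: float = 2.0) -> list:
    """Split a waveform into consecutive paragraphs of time_scale seconds each."""
    seg_len = int(sample_rate * time_scale)
    # The last paragraph may be shorter; a real pipeline might pad or drop it.
    return [speech[i:i + seg_len] for i in range(0, len(speech), seg_len)]

paragraph_set = split_paragraphs(np.random.randn(6 * 16000))  # three 2-second paragraphs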
And S14, summarizing all the voice feature subsets to obtain the voice feature set.
In the embodiment of the invention, all the voice features are collected to obtain the voice feature set.
S3, training a preset deep learning model by utilizing the voice feature set to obtain the voice recognition model;
in the embodiment of the invention, the deep learning model is a convolutional neural network model.
Preferably, in the embodiment of the present invention, each speech feature included in the speech feature set is subjected to text labeling to obtain a training set, and the deep learning model is subjected to iterative training by using the training set to obtain the speech recognition model.
In detail, iteratively training the deep learning model by using the training set includes:
step A: performing convolution pooling operation on the training set according to preset convolution pooling times to obtain a feature set;
and B: calculating the feature set by using a preset activation function to obtain a predicted value, vectorizing characters marked by each voice feature in the training set to obtain a label value, and calculating by using a pre-constructed first loss function according to the predicted value and the label value to obtain a first loss value;
Preferably, in the embodiment of the present invention, a one-hot code is used to convert the text label of each voice feature in the training set into a vector, so as to obtain the label value.
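One-hot encoding of the text labels can be sketched as follows; the three-character vocabulary is a hypothetical example, not the patent's.

import numpy as np

vocab = {"I": 0, "am": 1, "who": 2}   # hypothetical character vocabulary

def one_hot(label: str) -> np.ndarray:
    v = np.zeros(len(vocab))
    v[vocab[label]] = 1.0             # tag value: 1 at the label's index, 0 elsewhere
    return v

label_value = one_hot("who")          # array([0., 0., 1.])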
And C: comparing the first loss value with the preset first loss threshold, and returning to the step A when the first loss value is greater than or equal to the preset first loss threshold; and when the first loss value is smaller than the preset first loss threshold, stopping training to obtain the voice recognition model.
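Steps A to C amount to a training loop that stops once the first loss value falls below the threshold. A minimal sketch, assuming a PyTorch model and data loader; the threshold, learning rate and epoch cap are illustrative, not values from the disclosure.

import torch

def train(model, train_loader, loss_fn, loss_threshold=0.01, max_epochs=100):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        for features, labels in train_loader:
            optimizer.zero_grad()
            predictions = model(features)        # step A: convolution pooling forward pass
            loss = loss_fn(predictions, labels)  # step B: first loss value
            loss.backward()
            optimizer.step()
            if loss.item() < loss_threshold:     # step C: stop below the threshold
                return model
    return model                                 # fallback if the threshold is never reached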
In detail, in the embodiment of the present invention, the performing convolution pooling on the training set to obtain the feature set includes: performing a convolution operation on the training set to obtain a first convolution data set; and performing a maximum pooling operation on the first convolution data set to obtain the feature set.
Further, the convolution operation determines the size of its output as:
ω'=(ω-k+2p)/f+1
where ω' represents the size of the first convolution data set, ω represents the size of the training set, k is the size of a preset convolution kernel, f is the stride of the preset convolution operation, and p is the preset zero padding.
Further, in a preferred embodiment of the present invention, the first activation function is a softmax function:
μ_t=e^(s_t)/Σ_j e^(s_j)
where μ_t represents the predicted value and s represents data in the feature set.
The first loss function is a cross-entropy loss:
L=-Σ_i y_i·log(p_i)
where y_i is the tag value and p_i is the predicted value.
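The three formulas above should be read as reconstructions: the equation images of the source are not reproduced, and these are the standard output-size, softmax and cross-entropy forms implied by the variable definitions. The sketch below evaluates them with illustrative values.

import numpy as np

def conv_output_size(w, k, f, p):
    return (w - k + 2 * p) // f + 1      # omega' = (omega - k + 2p) / f + 1

def softmax(s):
    e = np.exp(s - s.max())              # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y, p):
    return -np.sum(y * np.log(p))        # L = -sum_i y_i * log(p_i)

p = softmax(np.array([2.0, 1.0, 0.1]))   # predicted values from the activation
y = np.array([1.0, 0.0, 0.0])            # one-hot tag value
print(conv_output_size(16000, 3, 2, 1))  # 8000
print(cross_entropy(y, p))               # first loss value, about 0.42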
S4, when receiving the voice to be recognized, performing feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set;
in this embodiment of the present invention, the speech to be recognized is divided into a plurality of target speech paragraphs according to the time scale, and each target speech paragraph is subjected to sequence number marking to obtain a target speech paragraph set, for example: the time scale is 2 seconds, the number of the voices to be recognized is 6 seconds, the voices to be recognized are divided into target voice paragraphs A, B, C according to the time scale, a target voice paragraph A is 0-2 seconds of voices, a target voice paragraph B is 2-4 seconds of voices, a target voice paragraph C is 4-6 seconds of voices, the target voice paragraph A is marked with a serial number 2, the target voice paragraph B is marked with a serial number 1, and the target voice paragraph C is marked with a serial number 3.
In another embodiment of the present invention, in order to ensure data privacy, the target speech feature set may be stored in a blockchain node.
And S5, recognizing the target voice feature set by using the voice recognition model to obtain a recognition text.
In detail, in the embodiment of the present invention, each target speech feature contained in the target speech feature set is recognized by using the speech recognition model to obtain a corresponding recognition character, and the recognition characters are combined in the order of the sequence numbers of the corresponding target speech paragraphs to obtain the recognition text. For example: the target speech paragraph set includes target speech paragraphs A, B and C, whose sequence numbers are 2, 1 and 3 respectively, and whose corresponding target speech features are a, b and c. The speech recognition model recognizes target speech feature a as the character '是' (am), target speech feature b as the character '我' (I), and target speech feature c as the character '谁' (who); combining the recognition characters in the order of the sequence numbers of the corresponding target speech paragraphs yields the recognition text '我是谁' ('Who am I').
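The combination step of S5 reduces to a sort on sequence numbers; a sketch with the example's characters (rendered in English) follows.

# (recognition character, sequence number) pairs from the example above.
recognized = [("am", 2), ("I", 1), ("who", 3)]
text = " ".join(ch for ch, _ in sorted(recognized, key=lambda item: item[1]))
print(text)  # "I am who", i.e. 我是谁 ("Who am I")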
Fig. 3 is a functional block diagram of the speech recognition apparatus according to the present invention.
The speech recognition apparatus 100 of the present invention can be installed in an electronic device. According to the functions implemented, the speech recognition apparatus may include a feature extraction model construction module 101, a speech recognition model construction module 102 and a speech recognition module 103. These modules, which may also be referred to as units, are a series of computer program segments that are stored in the memory of the electronic device, can be executed by its processor, and perform fixed functions.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the feature extraction model construction module 101 is configured to obtain a first speech set, and train a preset contrast prediction coding model by using the first speech set to obtain a speech feature extraction model.
In an embodiment of the present invention, the first speech set includes speech in multiple languages and multiple dialects, with multiple kinds of background noise.
Further, in this embodiment of the present invention, in order to give the model a speech feature extraction capability, the feature extraction model construction module 101 iteratively trains a preset contrastive predictive coding model by using the first speech set until the model converges, so as to obtain the speech feature extraction model. The contrastive predictive coding (CPC) model is unsupervised, so the training data does not need to be labeled and a large amount of training data can be obtained at low cost, which gives the model stronger feature extraction capability.
The speech recognition model construction module 102 is configured to obtain a second speech set, and perform feature extraction on the second speech set by using the speech feature extraction model to obtain a speech feature set; and training a preset deep learning model by utilizing the voice feature set to obtain the voice recognition model.
In the embodiment of the present invention, the second speech set is a set of speech having corresponding text labels.
Further, in the embodiment of the present invention, in order to determine the voice features corresponding to different characters and make subsequent voice recognition more accurate, the voice recognition model construction module 102 performs feature extraction on the second voice set, extracting the voice feature of each voice in the second voice set to obtain the voice feature set.
In detail, in the embodiment of the present invention, the voice recognition model construction module 102 performs feature extraction on the second voice set as follows to obtain the voice feature set:
resampling each voice in the second voice set to obtain corresponding digital voice;
In the embodiment of the present invention, in order to facilitate data processing of each voice in the second voice set, the sample audio is resampled to obtain the digital voice; preferably, the embodiment of the present invention uses an analog-to-digital converter to resample the sample audio.
Pre-emphasis is carried out on the digital voice to obtain standard digital voice;
in detail, the embodiment of the present invention performs the pre-emphasis operation by using the following formula:
y(t)=x(t)-μx(t-1)
where x(t) is the digital voice, t is time, y(t) is the standard digital voice, and μ is a preset adjustment value of the pre-emphasis operation; preferably, μ has a value range of [0.9, 1.0].
Performing feature extraction on the standard digital voice by using the voice feature extraction model to obtain a voice feature subset;
In detail, since the durations of different standard digital speech samples may differ, in order to make the speech features uniform and facilitate subsequent model recognition, the standard digital speech is divided into a plurality of speech paragraphs according to a preset time scale to obtain a speech paragraph set, and feature extraction is performed on each speech paragraph in the speech paragraph set by using the speech feature extraction model to obtain the speech feature subset.
And summarizing all the voice feature subsets to obtain the voice feature set.
In the embodiment of the invention, all the voice features are collected to obtain the voice feature set.
In the embodiment of the invention, the deep learning model is a convolutional neural network model.
Preferably, in the embodiment of the present invention, each speech feature included in the speech feature set is subjected to text labeling to obtain a training set, and the deep learning model is subjected to iterative training by using the training set to obtain the speech recognition model.
In detail, the speech recognition model construction module 102 iteratively trains the deep learning model as follows:
step A: performing convolution pooling operation on the training set according to preset convolution pooling times to obtain a feature set;
and B: calculating the feature set by using a preset activation function to obtain a predicted value, vectorizing characters marked by each voice feature in the training set to obtain a label value, and calculating by using a pre-constructed first loss function according to the predicted value and the label value to obtain a first loss value;
Preferably, in the embodiment of the present invention, a one-hot code is used to convert the text label of each voice feature in the training set into a vector, so as to obtain the label value.
And C: comparing the first loss value with the preset first loss threshold, and returning to the step A when the first loss value is greater than or equal to the preset first loss threshold; and when the first loss value is smaller than the preset first loss threshold, stopping training to obtain the voice recognition model.
In detail, in the embodiment of the present invention, the performing convolution pooling on the training set to obtain the feature set includes: performing a convolution operation on the training set to obtain a first convolution data set; and performing a maximum pooling operation on the first convolution data set to obtain the feature set.
Further, the convolution operation determines the size of its output as:
ω'=(ω-k+2p)/f+1
where ω' represents the size of the first convolution data set, ω represents the size of the training set, k is the size of a preset convolution kernel, f is the stride of the preset convolution operation, and p is the preset zero padding.
Further, in a preferred embodiment of the present invention, the first activation function is a softmax function:
μ_t=e^(s_t)/Σ_j e^(s_j)
where μ_t represents the predicted value and s represents data in the feature set.
The first loss function is a cross-entropy loss:
L=-Σ_i y_i·log(p_i)
where y_i is the tag value and p_i is the predicted value.
The voice recognition module 103 is configured to, when receiving a voice to be recognized, perform feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set, and to recognize the target voice feature set by using the voice recognition model to obtain a recognized text.
In this embodiment of the present invention, the speech recognition module 103 divides the speech to be recognized into a plurality of target speech paragraphs according to the time scale, and marks each target speech paragraph with a sequence number to obtain a target speech paragraph set. For example: the time scale is 2 seconds and the speech to be recognized is 6 seconds long, so the speech to be recognized is divided into target speech paragraphs A, B and C, where target speech paragraph A is the speech from 0 to 2 seconds, target speech paragraph B is the speech from 2 to 4 seconds, and target speech paragraph C is the speech from 4 to 6 seconds; target speech paragraph A is marked with sequence number 2, target speech paragraph B with sequence number 1, and target speech paragraph C with sequence number 3.
In another embodiment of the present invention, in order to ensure data privacy, the target speech feature set may be stored in a blockchain node.
In detail, in the embodiment of the present invention, the speech recognition module 103 uses the speech recognition model to recognize each target speech feature contained in the target speech feature set, obtaining a corresponding recognition character, and combines the recognition characters in the order of the sequence numbers of the corresponding target speech paragraphs to obtain the recognition text. For example: the target speech paragraph set includes target speech paragraphs A, B and C, whose sequence numbers are 2, 1 and 3 respectively, and whose corresponding target speech features are a, b and c. The speech recognition model recognizes target speech feature a as the character '是' (am), target speech feature b as the character '我' (I), and target speech feature c as the character '谁' (who); combining the recognition characters in the order of the sequence numbers of the corresponding target speech paragraphs yields the recognition text '我是谁' ('Who am I').
Fig. 4 is a schematic structural diagram of an electronic device implementing the speech recognition method according to the present invention.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a speech recognition program 12, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as codes of a voice recognition program, etc., but also to temporarily store data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device 1 by operating or executing programs or modules (e.g., voice recognition programs, etc.) stored in the memory 11 and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, and the like. The bus is arranged to enable communication between the memory 11, the at least one processor 10, and the other components.
Fig. 4 only shows an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 4 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech recognition program 12 stored in the memory 11 of the electronic device 1 is a combination of computer programs which, when run on the processor 10, can implement:
acquiring a first voice set, and training a preset contrastive predictive coding model by using the first voice set to obtain a voice feature extraction model;
acquiring a second voice set, and performing feature extraction on the second voice set by using the voice feature extraction model to obtain a voice feature set;
training a preset deep learning model by using the voice feature set to obtain the voice recognition model;
when receiving a voice to be recognized, performing feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set;
and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text.
Specifically, the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to fig. 1 for a specific implementation method of the computer program, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable medium may be non-volatile or volatile. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
Embodiments of the present invention may also provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor of an electronic device, the computer program may implement:
acquiring a first voice set, and training a preset contrastive predictive coding model by using the first voice set to obtain a voice feature extraction model;
acquiring a second voice set, and performing feature extraction on the second voice set by using the voice feature extraction model to obtain a voice feature set;
training a preset deep learning model by using the voice feature set to obtain the voice recognition model;
when receiving a voice to be recognized, performing feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set;
and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text.
Further, the computer usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
A blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information on a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. Terms such as first and second are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of speech recognition, the method comprising:
acquiring a first voice set, and training a preset contrastive predictive coding model by using the first voice set to obtain a voice feature extraction model;
acquiring a second voice set, and performing feature extraction on the second voice set by using the voice feature extraction model to obtain a voice feature set;
training a preset deep learning model by using the voice feature set to obtain the voice recognition model;
when receiving a voice to be recognized, performing feature extraction on the voice to be recognized by using the voice feature extraction model to obtain a target voice feature set;
and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text.
2. The speech recognition method of claim 1, wherein the performing feature extraction on the second speech set using the speech feature extraction model to obtain a speech feature set comprises:
resampling each voice in the second voice set to obtain corresponding digital voice;
pre-emphasis is carried out on the digital voice to obtain corresponding standard digital voice;
performing feature extraction on the standard digital voice by using the voice feature extraction model to obtain a voice feature subset;
and summarizing all the voice feature subsets to obtain the voice feature set.
3. The speech recognition method of claim 2, wherein said extracting features of the standard digital speech using the speech feature extraction model to obtain a speech feature subset comprises:
dividing the standard digital voice into a plurality of voice paragraphs according to a preset time scale to obtain a voice paragraph set;
and performing feature extraction on each speech paragraph in the speech paragraph set by using the speech feature extraction model to obtain the speech feature subset.
4. The speech recognition method of claim 1, wherein the training of the preset deep learning model by using the speech feature set to obtain the speech recognition model comprises:
performing character marking on each voice feature contained in the voice feature set to obtain a training set;
and performing iterative training on the deep learning model by using the training set to obtain the voice recognition model.
5. The speech recognition method of claim 4, wherein iteratively training the deep learning model using the training set to obtain the speech recognition model comprises:
a feature extraction step: performing convolution pooling operation on the training set according to preset convolution pooling times to obtain a feature set;
and a loss calculation step: calculating the feature set by using a preset activation function to obtain a predicted value, vectorizing characters marked by each voice feature in the training set to obtain a label value, and calculating by using a pre-constructed first loss function according to the predicted value and the label value to obtain a first loss value;
training and judging: comparing the first loss value with a preset first loss threshold, and returning to the feature extraction step when the first loss value is greater than or equal to the preset first loss threshold; and when the first loss value is smaller than the preset first loss threshold, stopping training to obtain the speech recognition model.
6. The speech recognition method of claim 1, wherein the performing feature extraction on the speech to be recognized by using the speech feature extraction model to obtain a target speech feature set comprises:
dividing the voice to be recognized into a plurality of target voice paragraphs according to the time scale;
marking the sequence number of each target speech paragraph to obtain a target speech paragraph set;
and performing voice feature extraction on each target voice paragraph in the target voice paragraph set by using the voice feature extraction model to obtain the target voice feature set.
7. The speech recognition method of claim 6, wherein the recognizing the target set of speech features using the speech recognition model to obtain a recognized text comprises:
recognizing each target voice feature contained in the target voice feature set by using the voice recognition model to obtain a corresponding recognition character;
and sequentially combining the recognition characters according to the sequence numbers of the target voice paragraphs corresponding to the target voice paragraph set to obtain the recognition text.
8. A speech recognition apparatus, comprising:
the feature extraction model construction module is used for acquiring a first voice set, and training a preset contrastive predictive coding model by using the first voice set to obtain a voice feature extraction model;
the voice recognition model construction module is used for acquiring a second voice set and extracting the characteristics of the second voice set by using the voice characteristic extraction model to obtain a voice characteristic set; training a preset deep learning model by using the voice feature set to obtain the voice recognition model;
the voice recognition module is used for performing feature extraction on the voice to be recognized by utilizing the voice feature extraction model when receiving the voice to be recognized to obtain a target voice feature set; and recognizing the target voice feature set by using the voice recognition model to obtain a recognized text.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and the number of the first and second groups,
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores computer program instructions executable by the at least one processor to enable the at least one processor to perform the speech recognition method of any of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech recognition method according to any one of claims 1 to 7.
CN202011600083.XA 2020-12-29 2020-12-29 Voice recognition method and device, electronic equipment and readable storage medium Pending CN112712797A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011600083.XA CN112712797A (en) 2020-12-29 2020-12-29 Voice recognition method and device, electronic equipment and readable storage medium
PCT/CN2021/084048 WO2022141867A1 (en) 2020-12-29 2021-03-30 Speech recognition method and apparatus, and electronic device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011600083.XA CN112712797A (en) 2020-12-29 2020-12-29 Voice recognition method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112712797A 2021-04-27

Family

ID=75546761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011600083.XA Pending CN112712797A (en) 2020-12-29 2020-12-29 Voice recognition method and device, electronic equipment and readable storage medium

Country Status (2)

Country Link
CN (1) CN112712797A (en)
WO (1) WO2022141867A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223502A (en) * 2021-04-28 2021-08-06 平安科技(深圳)有限公司 Speech recognition system optimization method, device, equipment and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102434604B1 * 2016-01-05 2022-08-23 한국전자통신연구원 (Electronics and Telecommunications Research Institute) Voice recognition terminal, voice recognition server and voice recognition method for performing personalized voice recognition
CN110797016B (en) * 2019-02-26 2020-12-29 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN111862945A (en) * 2019-05-17 2020-10-30 北京嘀嘀无限科技发展有限公司 Voice recognition method and device, electronic equipment and storage medium
CN110277088B (en) * 2019-05-29 2024-04-09 平安科技(深圳)有限公司 Intelligent voice recognition method, intelligent voice recognition device and computer readable storage medium
CN111613212B (en) * 2020-05-13 2023-10-31 携程旅游信息技术(上海)有限公司 Speech recognition method, system, electronic device and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223502A (en) * 2021-04-28 2021-08-06 平安科技(深圳)有限公司 Speech recognition system optimization method, device, equipment and readable storage medium
CN113223502B (en) * 2021-04-28 2024-01-30 平安科技(深圳)有限公司 Speech recognition system optimization method, device, equipment and readable storage medium

Also Published As

Publication number Publication date
WO2022141867A1 (en) 2022-07-07

Similar Documents

Publication Publication Date Title
CN112597312A (en) Text classification method and device, electronic equipment and readable storage medium
CN112527994A (en) Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium
CN112988963B (en) User intention prediction method, device, equipment and medium based on multi-flow nodes
CN112560453A (en) Voice information verification method and device, electronic equipment and medium
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
CN113157927A (en) Text classification method and device, electronic equipment and readable storage medium
CN113064994A (en) Conference quality evaluation method, device, equipment and storage medium
CN112233700A (en) Audio-based user state identification method and device and storage medium
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN113807973A (en) Text error correction method and device, electronic equipment and computer readable storage medium
CN114840684A (en) Map construction method, device and equipment based on medical entity and storage medium
CN113205814B (en) Voice data labeling method and device, electronic equipment and storage medium
CN113658002A (en) Decision tree-based transaction result generation method and device, electronic equipment and medium
CN113344125A (en) Long text matching identification method and device, electronic equipment and storage medium
CN112712797A (en) Voice recognition method and device, electronic equipment and readable storage medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN115203364A (en) Software fault feedback processing method, device, equipment and readable storage medium
CN115146064A (en) Intention recognition model optimization method, device, equipment and storage medium
CN114548114A (en) Text emotion recognition method, device, equipment and storage medium
CN113806540A (en) Text labeling method and device, electronic equipment and storage medium
CN113515591A (en) Text bad information identification method and device, electronic equipment and storage medium
CN113902404A (en) Employee promotion analysis method, device, equipment and medium based on artificial intelligence
CN114595321A (en) Question marking method and device, electronic equipment and storage medium
CN114186028A (en) Consult complaint work order processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination