CN116052648B - Training method, using method and training system of voice recognition model - Google Patents


Info

Publication number
CN116052648B
Authority
CN
China
Prior art keywords: text, word, training, sample, recognition model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210928141.4A
Other languages
Chinese (zh)
Other versions
CN116052648A (en)
Inventor
王满洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Honor Device Co Ltd
Original Assignee
Honor Device Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Honor Device Co Ltd filed Critical Honor Device Co Ltd
Priority to CN202210928141.4A priority Critical patent/CN116052648B/en
Publication of CN116052648A publication Critical patent/CN116052648A/en
Application granted granted Critical
Publication of CN116052648B publication Critical patent/CN116052648B/en


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems


Abstract

The embodiments of this application relate to the field of artificial-intelligence speech recognition and provide a training method, a using method, and a training system for a speech recognition model. The method comprises: obtaining a training sample; judging the current character type of the sample text; when the sample text contains only Chinese characters, segmenting it with a preset word bank to obtain a first segmentation result; inserting a first identifier before each first text unit whose character length is greater than 1 and a second identifier before each first text unit whose character length equals 1, to obtain a second segmentation result; splitting the first text unit after each first identifier into single characters and inserting separators between adjacent characters, to obtain a first training text; and training an initial speech recognition model on the sample audio and the first training text. With this technical scheme, text units based on character granularity can be disguised as text units based on word granularity, improving the recognition performance of the speech recognition model.

Description

Training method, using method and training system of voice recognition model
Technical Field
The embodiments of this application relate to the field of artificial-intelligence speech recognition, and in particular to a training method, a using method, and a training system for a speech recognition model.
Background
Automatic speech recognition (ASR) is an information technology by which an electronic device, through a process of recognition and understanding, converts the words, phrases, and other content of human speech into corresponding text or symbols, or produces a response. Speech recognition is already widely used in daily life; for example, a user can enter information into an electronic device by speaking to it.
Traditional ASR is usually modeled at word granularity. Because there are roughly 100,000 common phrases, word-granularity modeling places very high demands on memory and compute and is suitable only for the server side; end-side ASR is therefore usually modeled at character granularity with a recurrent neural network transducer model.
However, an end-side speech recognition model built at character granularity cannot receive word-level rule information, and word mismatches easily appear in its recognition results.
Disclosure of Invention
The embodiments of this application provide a training method, a using method, and a training system for a speech recognition model. They address the problems that a character-granularity model cannot receive word-level rule information and that word mismatches easily appear in recognition results, and they improve the recognition performance of the speech recognition model.
In a first aspect, an embodiment of this application provides a training method for a speech recognition model, comprising: obtaining a training sample, where the training sample comprises sample audio and a sample text corresponding to the sample audio, the sample text includes at least one character, and each character is a Chinese character and/or an English character; judging the current character type of the sample text, where the character types are: Chinese characters only, English characters only, and mixed Chinese and English characters; when the current character type of the sample text is Chinese characters only, segmenting the sample text with a preset word bank to obtain a first segmentation result comprising a plurality of first text units, where a first text unit is a first target word in the sample text that matches a preset word in the word bank and/or a first non-target word in the sample text other than the first target words, and a separator precedes each first target word and each first non-target word; for the first segmentation result, inserting a first identifier before each first text unit whose character length is greater than 1 and a second identifier before each first text unit whose character length equals 1, to obtain a second segmentation result; for the second segmentation result, splitting the first text unit after each first identifier into single characters and inserting separators between adjacent characters, to obtain a first training text; and training an initial speech recognition model on the sample audio and the first training text to obtain a trained speech recognition model. With this training method, by inserting the first and second identifiers into the sample text, the electronic device disguises character-granularity text units as word-granularity text units and lets the initial speech recognition model distinguish target words from non-target words in a more explicit way. Without changing the original character-granularity vocabulary or the structure of the initial model, the electronic device enables the trained model to learn more word-level and sentence-level information, improving its recognition performance.
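As a concrete illustration, the following Python sketch walks the Chinese-only branch end to end. The identifier spellings "<w>" and "<c>", the toy lexicon, and the greedy longest-match segmenter are assumptions made for illustration; the patent does not fix any of these choices.

```python
# Minimal sketch of the Chinese-only preprocessing (assumed identifier
# spellings "<w>"/"<c>" and a toy preset word bank).

PRESET_LEXICON = {"潮流", "衣服", "喜欢"}           # preset word bank (toy)
MAX_WORD_LEN = max(len(w) for w in PRESET_LEXICON)

def segment(text: str) -> list[str]:
    """First segmentation result: greedy longest match against the
    lexicon; characters covered by no preset word become length-1 units."""
    units, i = [], 0
    while i < len(text):
        for n in range(min(MAX_WORD_LEN, len(text) - i), 1, -1):
            if text[i:i + n] in PRESET_LEXICON:
                units.append(text[i:i + n])    # first target word
                i += n
                break
        else:
            units.append(text[i])              # first non-target word
            i += 1
    return units

def to_training_text(text: str) -> str:
    units = segment(text)
    # Second segmentation result: "<w>" before units longer than one
    # character, "<c>" before single-character units; units after "<w>"
    # are then split into characters with separators (spaces) between
    # adjacent characters, giving the first training text.
    marked = [("<w> " + " ".join(u)) if len(u) > 1 else ("<c> " + u)
              for u in units]
    return " ".join(marked)

print(to_training_text("我喜欢潮流的衣服"))
# -> <c> 我 <w> 喜 欢 <w> 潮 流 <c> 的 <w> 衣 服
```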
In one implementation, the method further comprises: when the current character type of the sample text is English characters only, segmenting the sample text with an English subword model to obtain a third segmentation result comprising a plurality of second text units, where a second text unit is a subword of a second target word that the subword model can split and/or a second non-target word that the subword model cannot split, and a separator precedes each second text unit; for the third segmentation result, inserting a first identifier before each second target word whose number of subwords is greater than 1 and a second identifier before each second non-target word whose number of subwords equals 1, to obtain a fourth segmentation result; for the fourth segmentation result, splitting the second text units and inserting separators between adjacent second text units, to obtain a second training text; and training the initial speech recognition model on the sample audio and the second training text to obtain a trained speech recognition model. With this implementation, the electronic device improves the model's recognition of English text.
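A parallel sketch of the English-only branch under the same caveats; here bpe_split stands in for a trained BPE subword model, with toy merge rules:

```python
# Hedged sketch of the English-only branch; TOY_BPE imitates an English
# subword model, and "<w>"/"<c>" remain assumed identifier spellings.

TOY_BPE = {"playing": ["play", "ing"], "worlds": ["world", "s"]}

def bpe_split(word: str) -> list[str]:
    return TOY_BPE.get(word, [word])      # unsplittable words stay whole

def to_english_training_text(text: str) -> str:
    out = []
    for word in text.split():             # third segmentation result
        subs = bpe_split(word)
        marker = "<w>" if len(subs) > 1 else "<c>"   # fourth result
        out.append(marker + " " + " ".join(subs))    # separators between units
    return " ".join(out)

print(to_english_training_text("hello worlds playing"))
# -> <c> hello <w> world s <w> play ing
```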
In one implementation, the method further comprises: when the current character type of the sample text is mixed Chinese and English characters, splitting the sample text into a first sub-text and a second sub-text; segmenting the first sub-text with the preset word bank to obtain a first segmentation result, and segmenting the second sub-text with the English subword model to obtain a third segmentation result; for the first segmentation result, inserting a first identifier before each first text unit whose character length is greater than 1 and a second identifier before each first text unit whose character length equals 1, to obtain a second segmentation result; for the third segmentation result, inserting a first identifier before each second target word whose number of subwords is greater than 1 and a second identifier before each second non-target word whose number of subwords equals 1, to obtain a fourth segmentation result; for the second segmentation result, splitting the first text unit after each first identifier into single characters and inserting separators between adjacent characters, to obtain a first training text; for the fourth segmentation result, splitting the second text units and inserting separators between adjacent second text units, to obtain a second training text; splicing the first training text and the second training text into a third training text; and training the initial speech recognition model on the sample audio and the third training text to obtain a trained speech recognition model. With this implementation, the electronic device improves the model's recognition of mixed Chinese and English text.
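Combining the two helpers above gives a hedged sketch of the mixed branch; the regular expression used to separate the Chinese and English sub-texts is an assumption, since the patent only requires that the sample be split into a first and a second sub-text:

```python
# Sketch of the mixed Chinese/English branch, reusing to_training_text()
# and to_english_training_text() from the previous two sketches.

import re

def to_mixed_training_text(text: str) -> str:
    pieces = re.findall(r"[\u4e00-\u9fff]+|[A-Za-z ]+", text)
    parts = []
    for piece in pieces:
        if re.match(r"[\u4e00-\u9fff]", piece):
            parts.append(to_training_text(piece))                   # Chinese sub-text
        else:
            parts.append(to_english_training_text(piece.strip()))   # English sub-text
    return " ".join(parts)          # splice into the third training text

print(to_mixed_training_text("我喜欢hello worlds"))
# -> <c> 我 <w> 喜 欢 <c> hello <w> world s
```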
In one implementation, before judging the current character type of the sample text, the method further includes: establishing a preset word bank containing at least one preset word, and establishing an English subword model, where the English subword model is a byte pair encoding (BPE) model. With this implementation, the electronic device can build the preset word bank on the end side, setting general words and automatic-speech-recognition (ASR) scenario words as the preset words, and can build the English subword model on the end side, so as to build an end-to-end speech recognition model.
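One plausible way to prepare those two assets is sketched below. The use of the sentencepiece library, the file names, and the vocabulary size are assumptions; the patent only requires a preset word bank and a BPE model.

```python
# Hedged sketch: build a preset word bank and train a BPE subword model.
# "english_corpus.txt" is a hypothetical training corpus.

import sentencepiece as spm

# Preset word bank: general words plus ASR-scenario words (toy content).
with open("lexicon.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(["潮流", "衣服", "喜欢"]))

# English subword model trained with the BPE algorithm.
spm.SentencePieceTrainer.train(
    input="english_corpus.txt",
    model_prefix="en_bpe",
    vocab_size=2000,
    model_type="bpe",
)
sp = spm.SentencePieceProcessor(model_file="en_bpe.model")
print(sp.encode("playing", out_type=str))   # e.g. ['▁play', 'ing']
```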
In one implementation, segmenting the sample text with the preset word bank to obtain the first segmentation result comprises: using the preset word bank to determine all first target words in the sample text that match preset words; and segmenting each first target word into a first text unit, and segmenting the first non-target words other than the first target words into first text units. With this implementation, the electronic device determines how to segment the sample text so as to obtain the first target words and the first non-target words.
In one implementation, segmenting the sample text with the preset word bank to obtain the first segmentation result comprises: using the preset word bank to determine all first non-target words in the sample text that match no preset word; and segmenting all first non-target words in the sample text into text units. With this implementation, the electronic device obtains the first non-target words.
In one implementation, segmenting the sample text with the English subword model to obtain the third segmentation result comprises: using the English subword model to determine, in the sample text, all subwords of second target words that the subword model can split, and all second non-target words that it cannot split; and segmenting each subword and each second non-target word into second text units and storing them in a first list. With this implementation, the electronic device determines the splittable second target words and the unsplittable second non-target words.
In one implementation, the initial speech recognition model comprises an input layer, an encoding layer, a prediction network layer, a joint network layer, and an output layer; and training the initial speech recognition model on the sample audio and the first training text to obtain a trained speech recognition model comprises: feeding the sample audio through the input layer into the encoding layer to obtain audio feature vectors, and feeding the first training text through the input layer into the prediction network layer to obtain text feature vectors; inputting the audio feature vectors and the text feature vectors into the joint network layer to obtain spliced feature vectors; and training on the spliced feature vectors through the loss function of the speech recognition model and producing output through the output layer, to obtain the trained speech recognition model. With this implementation, the electronic device has the speech recognition model learn the first target words and first non-target words in the sample text, which improves its recognition performance.
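The PyTorch sketch below mirrors that layer flow under stated assumptions (the layer sizes, the LSTM choices, and the torchaudio RNNT loss are illustrative, not the patent's implementation): encoder output and prediction-network output are spliced for every (frame, token) pair in the joint network before the output layer and the loss.

```python
# Minimal RNNT-style model: encoding layer, prediction network layer,
# joint network layer, output projection, and the RNNT loss.

import torch
import torch.nn as nn
import torchaudio

class TinyRNNT(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=5000, hidden=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)   # encoding layer
        self.embed = nn.Embedding(vocab_size, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)   # prediction network
        self.joint = nn.Linear(2 * hidden, vocab_size)               # joint network + output

    def forward(self, audio_feats, prev_tokens):
        enc, _ = self.encoder(audio_feats)                  # audio feature vectors (B,T,H)
        pred, _ = self.predictor(self.embed(prev_tokens))   # text feature vectors (B,U,H)
        T, U = enc.size(1), pred.size(1)
        joined = torch.cat([enc.unsqueeze(2).expand(-1, -1, U, -1),
                            pred.unsqueeze(1).expand(-1, T, -1, -1)],
                           dim=-1)                          # spliced feature vectors
        return self.joint(joined)                           # (B,T,U,vocab) logits

model = TinyRNNT()
audio = torch.randn(2, 100, 80)                             # 100 frames of 80-dim features
targets = torch.randint(1, 5000, (2, 12), dtype=torch.int32)
prev = nn.functional.pad(targets, (1, 0), value=0)          # blank-prepended text input
logits = model(audio, prev.long())
loss = torchaudio.functional.rnnt_loss(
    logits, targets,
    logit_lengths=torch.full((2,), 100, dtype=torch.int32),
    target_lengths=torch.full((2,), 12, dtype=torch.int32),
    blank=0)
loss.backward()
```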
In one implementation, the initial speech recognition model is a recurrent neural network transducer (RNNT) model. With this implementation, the RNNT model supports end-to-end joint optimization and has full language-model modeling capability.
In one implementation, the separator is a space character.
In a second aspect, an embodiment of this application provides a method for using a speech recognition model, comprising: obtaining speech data; and inputting the speech data into a speech recognition model trained by the training method of the first aspect and its implementations, to obtain recognized speech data.
With this method, the electronic device inputs the speech data into the trained speech recognition model, which can recognize the words in the speech data with high accuracy.
In one implementation, the recognized speech data includes recognized text data corresponding to the recognized audio data. The recognized text data includes a first identifier, a second identifier, and a plurality of text units: the first identifier marks a text unit as a first target word or as a subword of a second target word, and the second identifier marks a text unit as a first non-target word or a second non-target word, all of which are produced by the trained speech recognition model. After the speech data is input into the model trained by the training method of the first aspect and the recognized speech data is obtained, the method further includes: deleting the first identifiers and second identifiers from the recognized text data to obtain output text data; and displaying the output text data on the user interface. With this implementation, target words and non-target words are distinguished through the first and second identifiers, which are recognized normally as part of the vocabulary; the identifiers are deleted before the text is shown on the user interface, so only the final output text is displayed.
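A minimal sketch of that post-processing step, again assuming the "<w>"/"<c>" identifier spellings used in the earlier sketches:

```python
# Delete the identifiers from recognized text before display. Joining
# Chinese tokens without spaces is an assumption about the output style.

def _is_cjk(tok: str) -> bool:
    return all("\u4e00" <= ch <= "\u9fff" for ch in tok)

def to_output_text(recognized: str) -> str:
    tokens = [t for t in recognized.split() if t not in ("<w>", "<c>")]
    if all(_is_cjk(t) for t in tokens):
        return "".join(tokens)        # Chinese: no spaces between characters
    return " ".join(tokens)           # English / mixed: keep spaces

print(to_output_text("<c> 我 <w> 喜 欢 <w> 潮 流 <c> 的 <w> 衣 服"))
# -> 我喜欢潮流的衣服
```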
In a third aspect, an embodiment of this application provides a training system for a speech recognition model, comprising: an acquisition module for obtaining training samples, where a training sample comprises sample audio and a sample text corresponding to the sample audio, the sample text includes at least one character, and each character is a Chinese character and/or an English character; a judging module for judging the current character type of the sample text, where the character types are: Chinese characters only, English characters only, and mixed Chinese and English characters; a word segmentation module for segmenting the sample text with a preset word bank when the current character type is Chinese characters only, to obtain a first segmentation result comprising a plurality of first text units, where a first text unit is a first target word in the sample text that matches a preset word in the word bank and/or a first non-target word other than the first target words, and a separator precedes each first target word and each first non-target word; an identification module for inserting, in the first segmentation result, a first identifier before each first text unit whose character length is greater than 1 and a second identifier before each first text unit whose character length equals 1, to obtain a second segmentation result; a splitting module for splitting, in the second segmentation result, the first text unit after each first identifier into single characters and inserting separators between adjacent characters, to obtain a first training text; and a training module for training an initial speech recognition model on the sample audio and the first training text to obtain a trained speech recognition model.
With this training system, by inserting the first and second identifiers into the sample text, the electronic device disguises character-granularity text units as word-granularity text units, so that the initial speech recognition model can distinguish target words from non-target words in a more explicit way. Without changing the original character-granularity vocabulary or the structure of the initial model, the electronic device enables the trained model to learn more word-level and sentence-level information, improving its recognition performance.
In one implementation, the system further comprises a preset module for establishing the preset word bank, which contains at least one preset word, and for establishing the English subword model, where the English subword model is a byte pair encoding (BPE) model. With this implementation, the electronic device can build the preset word bank on the end side, set general words and ASR scenario words as the preset words, and build the English subword model on the end side, so as to build an end-to-end speech recognition model.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a processor and a memory; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the model training method of the first aspect and any of the various implementations described above.
In a fifth aspect, an embodiment of the present application further provides a chip system, where the chip system includes a processor and a memory, and the memory stores program instructions that, when executed by the processor, cause the chip system to perform the model training method of the first aspect and its implementations, for example to generate or process information involved in the above method.
In a sixth aspect, embodiments of the present application further provide a computer readable storage medium, where program instructions are stored, which when run on a computer, cause the computer to perform the model training method in the first aspect and its respective implementation manners.
In a seventh aspect, embodiments of the present application also provide a computer program product, which when run on a computer, causes the computer to perform the model training method of the first aspect and its respective implementations.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed by the embodiments are briefly described below; those skilled in the art can obtain further drawings from them without inventive effort.
FIG. 1 is a schematic diagram of an automated speech recognition scenario;
FIG. 2 is a schematic diagram of building a speech recognition model at word granularity;
FIG. 3 is a schematic diagram of building a speech recognition model at character granularity;
FIG. 4 is a schematic diagram of an artificial intelligence call scenario;
FIG. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application;
FIG. 6 is a block diagram of the software architecture of an electronic device 100 according to an embodiment of the application;
FIG. 7 is an exemplary flow chart of a method for training a speech recognition model provided by an embodiment of the present application;
FIG. 8 is a schematic representation of an RNNT model according to an embodiment of the present application;
FIG. 9 is another exemplary flow chart of a method for training a speech recognition model provided by an embodiment of the present application;
FIG. 10 is another exemplary flow chart of a method of training a speech recognition model provided by an embodiment of the present application;
FIG. 11 is an exemplary flow chart of a method of model use provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a model training system according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application.
In the description of the present application, "/" means "or" unless otherwise indicated; for example, A/B may mean A or B. "And/or" herein merely describes an association between objects and indicates that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. Furthermore, "at least one" means one or more, and "a plurality" means two or more. The terms "first", "second", and the like do not limit number or execution order, and objects labeled "first" and "second" are not necessarily different.
In the present application, the words "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In order to facilitate the technical solution of the embodiments of the present application to be understood by the skilled person, technical terms related to the embodiments of the present application are explained below.
1. The connectionist temporal classification (CTC) model is an end-to-end speech recognition model that models the conditional probability of all possible output paths for a given input sequence, enabling sequence-to-sequence conversion without forced alignment (a toy illustration follows this list of terms).
2. The recurrent neural network transducer (RNNT) model overcomes defects of the CTC model, which can lose information during recognition: the RNNT model has language-modeling capability, is convenient for end-to-end joint optimization, and can perform real-time speech recognition, among other advantages.
3. The byte pair encoding (BPE) algorithm is a simple form of data compression that repeatedly replaces the most common pair of consecutive bytes in the data with a byte that does not occur in the data; a replacement table is kept so that the original data can be reconstructed when used later.
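As a rough illustration of term 1 (not part of the patent), PyTorch's built-in CTC loss computes exactly this marginal over all alignment paths; the shapes below are toy values:

```python
# CTC loss over random toy data: (T=50 frames, batch=2, 30 classes),
# with class 0 reserved as the blank symbol.

import torch

ctc = torch.nn.CTCLoss(blank=0)
log_probs = torch.randn(50, 2, 30).log_softmax(-1)      # (T, batch, classes)
targets = torch.randint(1, 30, (2, 10), dtype=torch.long)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 50, dtype=torch.long),
           target_lengths=torch.full((2,), 10, dtype=torch.long))
print(loss)
```

And a toy implementation of the BPE algorithm from term 3 (symbol names assumed): each round replaces the most frequent adjacent pair with a fresh symbol and records it in the replacement table.

```python
# Classic byte-pair-encoding example: "aaabdaaabac" compresses in two
# rounds, and the table allows full reconstruction.

from collections import Counter

def bpe_compress(data: str, rounds: int = 2):
    table, symbols, fresh = {}, list(data), ord("Z")
    for _ in range(rounds):
        pairs = Counter(zip(symbols, symbols[1:]))
        (a, b), _n = pairs.most_common(1)[0]   # most frequent adjacent pair
        sym = chr(fresh)
        fresh -= 1
        table[sym] = a + b
        merged, i = [], 0
        while i < len(symbols):                # replace the pair left to right
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                merged.append(sym)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return "".join(symbols), table

print(bpe_compress("aaabdaaabac"))
# -> ('YbdYbac', {'Z': 'aa', 'Y': 'Za'})
```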
The application scenario of the embodiment of the present application is first described with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of an automatic speech recognition scenario. As shown in FIG. 1, automatic speech recognition (ASR) is a technology that converts human speech into text; its performance depends largely on the following factors: the size of the recognition vocabulary and the complexity of the speech, the quality of the speech signal, whether one or more people are speaking, and the hardware configuration.
FIG. 2 is a schematic diagram of building a speech recognition model at word granularity. As shown in FIG. 2, although there are only about 4,000 common Chinese characters, their many possible combinations yield roughly 100,000 common phrases. If a speech recognition model is built at word granularity, the vocabulary used for modeling expands rapidly, and modeling demands more memory and compute, so the model can only be built on the server side. When a user sends speech data to a server-side speech recognition model, the model cannot respond in time if, for example, the network is disconnected.
FIG. 3 is a schematic diagram of building a speech recognition model at character granularity. As shown in FIG. 3, to shorten the model's response time, protect the privacy of speech data, and remove the dependence on a network, the speech recognition model can be built on the end side; however, end-side processors have limited compute, so the model is usually built at character granularity. Current in-house end-side speech recognition models adopt character-granularity modeling on an RNNT structure, adding the model gain produced by the prediction network on top of CTC to improve recognition. Even so, character-granularity speech recognition models still recognize inaccurately in some cases; for example, "花花公主" ("Princess Huahua") may be recognized as the homophonous "华华公主".
Therefore, to solve the problems that word granularity cannot be modeled on the end side and that character-granularity end-side modeling recognizes inaccurately, an embodiment of the present application provides a training method for a speech recognition model that realizes pseudo-word-granularity modeling on the end side: without changing the character-granularity vocabulary, it achieves, on the end side, the recognition performance of a speech recognition model built at word granularity.
It should be noted that an end-to-end speech recognition model can take raw speech data directly as input without any preprocessing; if the final output deviates from the expected result, each layer of the model is optimized through back-propagation until the output matches expectations. By contrast, a non-end-to-end speech recognition model must first label the raw speech data to obtain processed feature data, which then serves as the recognition input; if errors occur during feature extraction, they cannot be corrected afterwards. The end-to-end speech recognition model is therefore applicable to more scenarios.
FIG. 4 is a schematic diagram of an artificial intelligence (AI) call scenario. As shown in FIG. 4, an AI call model is generally an end-to-end speech model. When the electronic device receives an incoming call from an external device, it obtains the incoming audio, inputs it into the speech recognition model to obtain the caller's meaning, processes that meaning in a processing unit to produce corresponding reply semantics, generates reply audio through text-to-speech (TTS), and outputs the reply audio to the external device.
The training method and the using method of the speech recognition model provided by the present application are suitable for the AI call scenario shown in FIG. 4 and can improve the accuracy of speech recognition in that scenario. It should be noted that the training method provided by the present application is applicable to, but not limited to, AI call scenarios.
The training method and the using method of the speech recognition model of the present application can be applied to electronic devices, including but not limited to mobile phones, tablet computers, personal computers, workstation devices, large-screen devices (such as smart screens and smart televisions), wearable devices (such as smart bands and smart watches), handheld game consoles, home game consoles, virtual reality devices, augmented reality devices, mixed reality devices, vehicle-mounted intelligent terminals, and the like.
Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device 100 may include a processor 110, a memory 120, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, a camera 192, a display 193, and a subscriber identity module (subscriber identification module, SIM) card interface 194, etc. The sensor module 180 may include a touch sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a geomagnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, and the like. Among them, the gyro sensor 180B, the air pressure sensor 180C, the geomagnetic sensor 180D, the acceleration sensor 180E, and the like can be used to detect a motion state of an electronic apparatus, and thus, may also be referred to as a motion sensor.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
Memory 120 may be used to store computer-executable program code that includes instructions. The memory 120 may include a stored program area and a stored data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the memory 120 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. The processor 110 performs various functional applications and data processing of the electronic device 100 by executing instructions stored in the memory 120 and/or instructions stored in a memory provided in the processor.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transfer data between the electronic device 100 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and is not meant to limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also employ different interfacing manners in the above embodiments, or a combination of multiple interfacing manners.
The charge management module 140 is configured to receive a charge input from a charger. The charger can be a wireless charger or a wired charger. In some wired charging embodiments, the charge management module 140 may receive a charging input of a wired charger through the USB interface 130. In some wireless charging embodiments, the charge management module 140 may receive wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.
The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charge management module 140 and provides power to the processor 110, the memory 120, the display 193, the camera 192, the wireless communication module 160, and the like. The power management module 141 may also be configured to monitor battery capacity, battery cycle number, battery health (leakage, impedance) and other parameters. In other embodiments, the power management module 141 may also be provided in the processor 110. In other embodiments, the power management module 141 and the charge management module 140 may be disposed in the same device.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas may also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed into a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.
The mobile communication module 150 may provide a solution for wireless communication including 2G/3G/4G/5G, etc., applied to the electronic device 100. The mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (low noise amplifier, LNA), etc. The mobile communication module 150 may receive electromagnetic waves from the antenna 1, perform processes such as filtering, amplifying, and the like on the received electromagnetic waves, and transmit the processed electromagnetic waves to the modem processor for demodulation. The mobile communication module 150 can amplify the signal modulated by the modem processor, and convert the signal into electromagnetic waves through the antenna 1 to radiate. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the same device as at least some of the modules of the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low frequency baseband signal to the baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then transferred to the application processor. The application processor outputs sound signals through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays images or videos through the display screen 193. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional module, independent of the processor 110.
The wireless communication module 160 may provide solutions for wireless communication including wireless local area network (wireless local area networks, WLAN) (e.g., wireless fidelity (wireless fidelity, wi-Fi) network), bluetooth (BT), global navigation satellite system (global navigation satellite system, GNSS), frequency modulation (frequency modulation, FM), near field wireless communication technology (near field communication, NFC), infrared technology (IR), etc., as applied to the electronic device 100. The wireless communication module 160 may be one or more devices that integrate at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, modulates the electromagnetic wave signals, filters the electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, frequency modulate it, amplify it, and convert it to electromagnetic waves for radiation via the antenna 2.
In some embodiments, antenna 1 and mobile communication module 150 of electronic device 100 are coupled, and antenna 2 and wireless communication module 160 are coupled, such that electronic device 100 may communicate with a network and other devices through wireless communication techniques. The wireless communication techniques may include the Global System for Mobile communications (global system for mobile communications, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), time division code division multiple access (time-division code division multiple access, TD-SCDMA), long term evolution (long term evolution, LTE), BT, GNSS, WLAN, NFC, FM, and/or IR techniques, among others. The GNSS may include a global satellite positioning system (global positioning system, GPS), a global navigation satellite system (global navigation satellite system, GLONASS), a beidou satellite navigation system (beidou navigation satellite system, BDS), a quasi zenith satellite system (quasi-zenith satellite system, QZSS) and/or a satellite based augmentation system (satellite based augmentation systems, SBAS).
The electronic device 100 implements display functions through a GPU, a display screen 193, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 193 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display 193 is used to display images, videos, and the like. The display 193 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, electronic device 100 may include 1 or N display screens 193, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 192, a video codec, a GPU, a display screen 193, an application processor, and the like.
The ISP is used to process the data fed back by the camera 192. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be located in the camera 192.
The camera 192 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, RYYB, YUV, or the like format. In some embodiments, the electronic device 100 may include 1 or N cameras 192, N being a positive integer greater than 1.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The touch sensor 180A, also referred to as a "touch device". The touch sensor 180A may be disposed on the display 193, and the touch sensor 180A and the display 193 form a touch screen, which is also referred to as a "touch screen". The touch sensor 180A is used to detect a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to the touch operation may be provided through the display 193. In other embodiments, the touch sensor 180A may also be disposed on a surface of the electronic device 100 at a location different from the location of the display 193.
The gyro sensor 180B may be used to determine a motion gesture of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., x, y, and z axes) may be determined by gyro sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance to be compensated by the lens module according to the angle, and makes the lens counteract the shake of the electronic device 100 through the reverse motion, so as to realize anti-shake. The gyro sensor 180B may also be used for navigating, somatosensory game scenes.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude from barometric pressure values measured by barometric pressure sensor 180C, aiding in positioning and navigation.
The geomagnetic sensor 180D includes a Hall sensor. The electronic device 100 may use the geomagnetic sensor 180D to detect the opening and closing of a flip cover. In some embodiments, when the electronic device 100 is a flip phone, it can detect the opening and closing of the flip according to the geomagnetic sensor 180D, and then set features such as automatic unlocking on flip-open according to the detected open or closed state of the case or flip.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes), and may detect the magnitude and direction of gravity when the electronic device 100 is stationary. It can also be used to recognize the attitude of the electronic device, for applications such as landscape/portrait switching and pedometers.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, the electronic device 100 may range using the distance sensor 180F to achieve quick focus.
The proximity light sensor 180G may include, for example, a light emitting diode and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light outward through the light emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it may be determined that there is an object in the vicinity of the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there is no object in the vicinity of the electronic device 100. The electronic device 100 can detect that the user holds the electronic device 100 close to the ear by using the proximity light sensor 180G, so as to automatically extinguish the screen for the purpose of saving power. The proximity light sensor 180G may also be used in holster mode, pocket mode to automatically unlock and lock the screen.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.
The temperature sensor 180J is for detecting temperature. In some embodiments, the electronic device 100 performs a temperature processing strategy using the temperature detected by the temperature sensor 180J. For example, when the temperature reported by temperature sensor 180J exceeds a threshold, electronic device 100 performs a reduction in the performance of a processor located in the vicinity of temperature sensor 180J in order to reduce power consumption to implement thermal protection. In other embodiments, when the temperature is below another threshold, the electronic device 100 heats the battery 142 to avoid the low temperature causing the electronic device 100 to be abnormally shut down. In other embodiments, when the temperature is below a further threshold, the electronic device 100 performs boosting of the output voltage of the battery 142 to avoid abnormal shutdown caused by low temperatures.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The electronic device 100 may receive key inputs, generating key signal inputs related to user settings and function controls of the electronic device 100.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. For example, touch operations acting on different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also correspond to different vibration feedback effects by touch operations applied to different areas of the display screen 193. Different application scenarios (such as time reminding, receiving information, alarm clock, game, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
The SIM card interface 194 is used to connect to a SIM card. The SIM card may be inserted into the SIM card interface 194, or removed from the SIM card interface 194, to make contact with or separate from the electronic device 100. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 194 may support Nano SIM cards, Micro SIM cards, and so on. The same SIM card interface 194 may hold multiple cards simultaneously, of the same or different types. The SIM card interface 194 may also be compatible with different types of SIM cards and with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 uses an eSIM, i.e., an embedded SIM card; the eSIM card can be embedded in the electronic device 100 and cannot be separated from it.
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, taking an Android system with a layered architecture as an example, a software structure of the electronic device 100 is illustrated.
Fig. 6 is a software configuration block diagram of the electronic device 100 according to the embodiment of the present application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android Runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 6, the application package may include applications such as battery management, camera, gallery, calendar, phone, maps, navigation, music, video, and messaging.
The application framework layer provides an application program interface (application programming interface, API) and programming framework for application programs of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 6, the application framework layer may include a window manager, an input manager InputManager, a sensor manager SensorManager, a phone manager, a resource manager, a notification manager, and so forth.
The input manager may be used to monitor input events of the user, such as click events, swipe events, etc., performed by the user's finger on the display screen 193 of the electronic device 100. By listening for input events, the electronic device 100 can determine whether the electronic device is being used.
The sensor manager is used to monitor data returned by various sensors in the electronic device, such as motion sensor data, proximity sensor data, temperature sensor data, and the like. Using the data returned by the various sensors, the electronic device can determine whether it is jittered, whether the display 193 is occluded, etc.
The Android Runtime includes a core library and a virtual machine. The Android Runtime is responsible for scheduling and management of the Android system.
The core library consists of two parts: one part contains the functions that the Java language needs to call, and the other part is the core library of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., openGL ES), 2D graphics engines (e.g., SGL), etc.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of a variety of commonly used audio and video formats, as well as still image files, etc. The media libraries may support a variety of audio and video encoding formats, such as MPEG-4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
It should be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, the electronic device may include more or fewer components than illustrated, some components may be combined or split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The following describes each step of the training method of the speech recognition model provided by the embodiment of the present application.
FIG. 7 is an exemplary flowchart of a method for training a speech recognition model according to an embodiment of the present application.
As shown in fig. 7, the training method of the speech recognition model may include the following steps S101 to S106.
Step S101, obtaining a training sample; the training sample comprises sample audio and sample text corresponding to the sample audio; the sample text includes at least one character, and the characters include at least one of Chinese characters and English characters.
The electronic device may pre-establish a training set and acquire training samples from it. The training set may be a Chinese speech training set or an English speech training set, and its training samples may be a comprehensive, professionally recorded corpus covering the number and types of syllables and phones, tones, liaison, rhythm, and the like. The sample audio in a training sample may be in the WAV audio format, and the sample text may be in the TXT text format; the application does not limit the audio format of the sample audio or the text format of the sample text. The sample text includes at least one of Chinese characters and English characters, and the embodiment of the application does not limit the character types of the sample text. The sample text may be a single word, a plurality of words, a single sentence, or a plurality of sentences; embodiments of the present application do not limit the length of the sample text. For example, the sample text may be "我喜欢潮流的衣服" ("I like trendy clothes"), "Hello world", or the like.
In some embodiments, the training method of the speech recognition model may further include the following step S100.
Step S100, a preset word stock is established, the preset word stock comprising at least one preset word, and an English subword model is built. The English subword model may be a BPE model.
It should be noted that the embodiment of the present application does not limit when the preset word stock is established; it may be established before or after step S101. The preset word stock is established on the device side, and the preset words in it may be general-purpose words and ASR scene words; exemplary preset words may be "喜欢" (like), "潮流" (trend), "衣服" (clothes), and so on.
Step S102, judging the current character type of the sample text, wherein the character types comprise: Chinese characters only, English characters only, and Chinese-English characters.
By way of example, if the sample text is "我喜欢潮流的衣服", the current character type is Chinese characters only; if the sample text is "Hello world", the current character type is English characters only; if the sample text is "Hello世界" (where "世界" is the Chinese word for "world"), the current character type is Chinese-English characters.
In a specific implementation, the electronic device may judge the current character type of the sample text by identifying the byte size of each character, so as to distinguish Chinese characters from English characters: if the number of bytes occupied by the current character is 2, the electronic device may determine that the current character is a Chinese character, and if the number of bytes occupied is 1, it may determine that the current character is an English character. After the electronic device identifies the character types in the sample text, the Chinese characters and the English characters are segmented in different word segmentation modes.
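As a minimal sketch, this judgment might look as follows in Python, assuming a legacy two-byte Chinese encoding such as GBK (the patent does not name the encoding; under UTF-8 a Chinese character occupies 3 bytes, so the threshold would differ):

    # Hedged sketch of step S102: classify characters by encoded byte width.
    def char_is_chinese(ch: str) -> bool:
        # Under GBK, a Chinese character occupies 2 bytes, ASCII occupies 1.
        return len(ch.encode("gbk", errors="replace")) == 2

    def judge_character_type(text: str) -> str:
        has_cn = any(char_is_chinese(ch) for ch in text)
        has_en = any(ch.isascii() and ch.isalpha() for ch in text)
        if has_cn and has_en:
            return "chinese-english characters"
        return "chinese characters only" if has_cn else "english characters only"

    # judge_character_type("Hello world")  -> "english characters only"
    # judge_character_type("Hello世界")     -> "chinese-english characters"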
Step S103, when the current character type of the sample text is only Chinese characters, word segmentation is carried out on the sample text by utilizing a preset word stock so as to obtain a first word segmentation result, wherein the first word segmentation result comprises a plurality of first text units; the first text unit is a first target word in the sample text, which is matched with a preset word in a preset word bank, and/or the first text unit is a first non-target word except the first target word in the sample text; each first target word and each first non-target word is preceded by a separator.
In a specific implementation, the electronic device uses the preset word stock to determine all target words in the sample text that match preset words. Taking the sample text "我喜欢潮流的衣服" as an example, after the electronic device inputs the sample text, it traverses the text to find whether it contains first target words matching preset words. Since the preset word stock contains "喜欢" (like), "潮流" (trend), and "衣服" (clothes), the electronic device can determine that these first target words match preset words in the stock and segment each of them into a first text unit; meanwhile, the electronic device segments the first non-target words other than the first target words, namely "我" (I) and "的" (of), each into a first text unit.
It should be noted that the electronic device only determines a successfully matched first target word as a first text unit and an unsuccessfully matched first non-target word as a first text unit; first target words and first non-target words are distinguished solely by whether matching succeeds. The electronic device does not specifically identify the meaning or attributes of these words — that is, at this point it cannot identify the word attribute of a first target word, nor that of a first non-target word.
In some embodiments, when the current character type of the sample text is Chinese characters only, the electronic device may match target words against preset words by forward matching, reverse matching, or bidirectional matching. By way of example, when the electronic device adopts forward matching, it may set the maximum matching length to 2, since most Chinese words have a character length of 2. The electronic device then matches the sample text from left to right: "我喜" fails to match, so "我" is output as a single character; "喜欢" matches; "潮流" matches; "的衣" fails to match, so "的" is output; and "衣服" matches. This yields the first word segmentation result "我 喜欢 潮流 的 衣服".
However, this matching method has limited applicability: for Chinese words with a character length of 3, setting the maximum matching length to only 2 easily leads to low accuracy in recognizing such words.
When the electronic device adopts forward matching with the maximum matching length set to 3, take the sample text "研究生命起源" ("research the origin of life") as an example. Matching from left to right, "研究生" (postgraduate) matches, "命起源" and "命起" fail to match, "命" is output as a single character, and "起源" (origin) matches, yielding the first word segmentation result "研究生 命 起源". It can be seen that this segmentation deviates from the actual meaning of the sample text.
When the electronic device adopts reverse matching with the maximum matching length set to 3, again taking the sample text "研究生命起源" as an example, the electronic device matches the sample text from right to left: "起源" (origin) matches, "生命" (life) matches, and "研究" (research) matches, yielding the first word segmentation result "研究 生命 起源". It can be seen that this segmentation approaches the actual meaning.
Owing to the word-formation rules of Chinese, reverse matching often gives better results. However, both forward matching and reverse matching have limitations, so the electronic device may perform bidirectional matching, that is, perform forward matching once and reverse matching once, and select one of the two results as the first word segmentation result according to the principle of preferring the result with more long target words and fewer single-character pieces.
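The three matching modes can be sketched as follows, under the assumption that `lexicon` stands in for the set of preset words and that the tie-breaking rule is "fewer single-character pieces" (one plausible reading of the selection principle above):

    # Hedged sketch of forward, reverse, and bidirectional maximum matching.
    def forward_match(text, lexicon, max_len=3):
        out, i = [], 0
        while i < len(text):
            for L in range(min(max_len, len(text) - i), 0, -1):
                if L == 1 or text[i:i + L] in lexicon:
                    out.append(text[i:i + L])
                    i += L
                    break
        return out

    def reverse_match(text, lexicon, max_len=3):
        out, j = [], len(text)
        while j > 0:
            for L in range(min(max_len, j), 0, -1):
                if L == 1 or text[j - L:j] in lexicon:
                    out.insert(0, text[j - L:j])
                    j -= L
                    break
        return out

    def bidirectional_match(text, lexicon, max_len=3):
        fwd = forward_match(text, lexicon, max_len)
        rev = reverse_match(text, lexicon, max_len)
        # Prefer the result with fewer single-character pieces, i.e. more
        # and longer matched target words.
        singles = lambda seg: sum(len(w) == 1 for w in seg)
        return fwd if singles(fwd) < singles(rev) else rev

    # lexicon = {"研究生", "研究", "生命", "起源"}
    # forward_match("研究生命起源", lexicon)       -> ["研究生", "命", "起源"]
    # reverse_match("研究生命起源", lexicon)       -> ["研究", "生命", "起源"]
    # bidirectional_match("研究生命起源", lexicon) -> ["研究", "生命", "起源"]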
In some embodiments, after determining the first target words and first non-target words, the electronic device sets a separator before each first target word and each first non-target word to divide the text units. The separator may be a space character. Taking the sample text "我喜欢潮流的衣服" as an example, the first target words are "喜欢", "潮流", and "衣服", and the remaining words "我" and "的" are first non-target words. The electronic device inserts a space character before each first target word and each first non-target word, and the first word segmentation result is "我 喜欢 潮流 的 衣服".
If the sample text does not include any first target word identical to a preset word, the electronic device may determine that the sample text consists entirely of first non-target words. For example, if the preset words do not include "喜欢", "潮流", or "衣服", the electronic device may determine that each character "我", "喜", "欢", "潮", "流", "的", "衣", "服" is a first non-target word; the electronic device inserts a space character before each first non-target word, and the first word segmentation result is "我 喜 欢 潮 流 的 衣 服".
Step S104, for the first word segmentation result, inserting a first identifier before the first text unit with the character length being greater than 1, and inserting a second identifier before the first text unit with the character length being equal to 1, so as to obtain a second word segmentation result.
Since one first text unit lies between every two separators in the first word segmentation result, the electronic device needs to determine the insertion positions of the first and second identifiers from character lengths. Because the character length of every preset word in the preset word stock is greater than 1, in the first word segmentation result obtained by the electronic device the character length of a successfully matched first target word is greater than 1, i.e., the character length of that first text unit is greater than 1, so the first identifier inserted before a text unit of character length greater than 1 can be used to identify a first target word; the character length of an unmatched first non-target word is equal to 1, i.e., the character length of that first text unit is equal to 1, so the second identifier inserted before a first text unit of character length equal to 1 can be used to identify a first non-target word. An identifier is a symbol used to identify an entity; it may be used to name variables, constants, functions, statement blocks, etc., to establish a relationship between a name and its use, and is typically composed of letters, numbers, and other characters. Illustratively, in embodiments of the present application the first identifier may be set to [PH] and the second identifier to [CH]. These identifiers are for illustrative purposes only; other forms of identifiers may be employed in particular implementations.
Taking the first word segmentation result "我 喜欢 潮流 的 衣服" as an example, the first text units acquired by the electronic device include five units: "我", "喜欢", "潮流", "的", and "衣服". The character lengths of "喜欢", "潮流", and "衣服" are greater than 1, so the electronic device inserts a first identifier before each of them; the character lengths of "我" and "的" are equal to 1, so the electronic device inserts a second identifier before each of them, obtaining the second word segmentation result "[CH] 我 [PH] 喜欢 [PH] 潮流 [CH] 的 [PH] 衣服".
Taking the first word segmentation result "潮流 衣服" ("trendy clothes") as an example, the first text units acquired by the electronic device comprise two units, "潮流" and "衣服", both of character length greater than 1; the electronic device inserts a first identifier before each first text unit, obtaining the second word segmentation result "[PH] 潮流 [PH] 衣服".
Taking the first word segmentation result "我 不 饿" ("I'm not hungry") as an example, the first text units acquired by the electronic device comprise three units, "我", "不", and "饿", all of character length equal to 1; the electronic device inserts a second identifier before each first text unit, obtaining the second word segmentation result "[CH] 我 [CH] 不 [CH] 饿".
Here, it should be noted that the first word segmentation result may take several forms: it may include only first text units of character length greater than 1, only first text units of character length equal to 1, or both. Accordingly, there are several cases: the electronic device inserts only first identifiers, only second identifiers, or both into the first word segmentation result. When only first identifiers are inserted, as in "[PH] 喜欢 [PH] 潮流 [PH] 衣服", the sample text can be used to train the speech recognition model to learn several words at once. When only second identifiers are inserted, as in "[CH] 我 [CH] 不 [CH] 饿", the sample text can be used to learn the individual characters "我", "不", "饿", with no word spanning two adjacent characters, which strengthens the division between non-words. When both identifiers are inserted, as in "[CH] 我 [PH] 喜欢 [PH] 潮流 [CH] 的 [PH] 衣服", the sample text can be used to learn words and non-words simultaneously, strengthening the division between words and non-words.
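A minimal sketch of the identifier insertion in step S104, assuming the separator is a space and the units of the first word segmentation result carry no identifiers yet:

    # Hedged sketch of step S104: tag each unit of the first segmentation
    # result with [PH] if its character length exceeds 1, else with [CH].
    def insert_identifiers(first_result: str) -> str:
        tagged = []
        for unit in first_result.split():
            tagged.append(("[PH] " if len(unit) > 1 else "[CH] ") + unit)
        return " ".join(tagged)

    # insert_identifiers("我 喜欢 潮流 的 衣服")
    #   -> "[CH] 我 [PH] 喜欢 [PH] 潮流 [CH] 的 [PH] 衣服"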
Step S105, for the second word segmentation result, splitting the first text units after each first identifier character by character, and inserting a separator between two adjacent split characters, so as to obtain a first training text.
In some embodiments, taking the second word segmentation result "[CH] 我 [PH] 喜欢 [PH] 潮流 [CH] 的 [PH] 衣服" as an example, the first text units after first identifiers include "喜欢", "潮流", and "衣服". These units are split character by character, and a separator is inserted between two adjacent split characters, obtaining the first training text "[CH] 我 [PH] 喜 欢 [PH] 潮 流 [CH] 的 [PH] 衣 服".
Since the second word segmentation result may not include any first identifier, in some embodiments the electronic device may further determine, before inserting the first and second identifiers, whether the first word segmentation result includes only first text units of character length equal to 1. If so, after executing step S104 the electronic device skips step S105 and directly executes step S106; that is, when the first word segmentation result includes only text units of character length equal to 1, the electronic device directly determines the second word segmentation result as the first training text and executes step S106.
For example, when the second word segmentation result is "[CH] 我 [CH] 不 [CH] 饿", the electronic device determines "[CH] 我 [CH] 不 [CH] 饿" as the first training text.
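The character-by-character split of step S105 might be sketched as follows, assuming the second word segmentation result is a space-separated string in which each identifier precedes exactly one unit:

    # Hedged sketch of step S105: units after [PH] are split character by
    # character with a space separator; units after [CH] are kept as-is.
    def split_word_units(second_result: str) -> str:
        out, split_next = [], False
        for tok in second_result.split():
            if tok in ("[PH]", "[CH]"):
                out.append(tok)
                split_next = (tok == "[PH]")
            else:
                out.append(" ".join(tok) if split_next else tok)
        return " ".join(out)

    # split_word_units("[CH] 我 [PH] 喜欢 [PH] 潮流 [CH] 的 [PH] 衣服")
    #   -> "[CH] 我 [PH] 喜 欢 [PH] 潮 流 [CH] 的 [PH] 衣 服"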
Step S106, training an initial speech recognition model according to the sample audio and the first training text to obtain a trained speech recognition model.
In some embodiments, the initial speech recognition model is an RNNT (RNN-Transducer) model, and the initial speech recognition model includes an input layer, an encoding layer (Encoder), a prediction network layer (Prediction Network), a joint network layer (Joint Network), and an output layer.
FIG. 8 is a schematic representation of an RNNT model according to an embodiment of the present application. As shown in FIG. 8, the embodiment of the application adopts the Transducer structure of the RNNT model; the model contains a language model component, naturally supports streaming decoding, and can predict information in the audio such as language and voiceprint.
The electronic device inputs the sample audio into the encoding layer through the input layer to obtain an audio feature vector, and inputs the sample text into the prediction network layer through the input layer to obtain a text feature vector. Further, the electronic device inputs the audio feature vector and the text feature vector into the joint network layer to obtain a spliced feature vector; the electronic device trains on the spliced feature vector through the loss function of the RNNT model and outputs through the output layer, obtaining the trained speech recognition model.
By way of example, take the sample text "[CH] 我 [PH] 喜 欢 [PH] 潮 流 [CH] 的 [PH] 衣 服". In the process of inputting the sample text into the prediction network layer through the input layer to obtain the text feature vector, although the prediction network layer actually models at character granularity, the presence of the first and second identifiers in the sample text lets the prediction network layer determine that "喜欢", "潮流", and "衣服" after the first identifier [PH] are words, and that "我" and "的" after the second identifier [CH] are non-words, so the prediction network layer is forced to learn word information. Even though the current vocabulary is not built at word granularity but only at character granularity, the prediction network layer can still learn more words, strengthening word recognition.
In a specific implementation, the electronic device first performs frame segmentation on the sample audio to obtain a corresponding speech sequence, then pre-emphasizes the framed speech sequence, and then sequentially extracts the acoustic features of each frame. An acoustic feature is feature data representing the voiceprint information of the corresponding speech frame; for example, it may be Mel-scale Frequency Cepstral Coefficients (MFCC) or Log Mel Filter Bank (FBANK) features, etc. The feature data of the voiceprint information is then input into the encoding layer through the input layer to obtain the audio feature vector corresponding to the sample audio. The electronic device inputs the sample text into the prediction network layer through the input layer to obtain a character probability distribution for each character in the sample text; the specific format of the character probability distribution corresponding to the sample text may depend on the practical situation (for example, on the architecture of the recognition network model or the language to which the characters belong).
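The acoustic front end described above can be sketched with torchaudio; the library choice, the file name "sample.wav", and all parameter values are assumptions for illustration, since the description names only framing, pre-emphasis, and MFCC/FBANK features:

    import torchaudio

    # Hedged sketch: framing, pre-emphasis, and Mel filtering are folded
    # into the Kaldi-compatible FBANK computation; "sample.wav" is a
    # hypothetical mono WAV file standing in for the sample audio.
    waveform, sample_rate = torchaudio.load("sample.wav")
    fbank = torchaudio.compliance.kaldi.fbank(
        waveform,
        sample_frequency=sample_rate,
        frame_length=25.0,             # 25 ms frames
        frame_shift=10.0,              # 10 ms hop between frames
        preemphasis_coefficient=0.97,  # pre-emphasis on the framed signal
        num_mel_bins=80,               # 80-dimensional FBANK features
    )
    # fbank has shape (num_frames, 80) and is fed to the encoding layer.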
It should be noted here that the text feature vector is essentially an M-dimensional vector for each character; that is, in the embodiment of the present application, the RNNT model computes over character vectors rather than word vectors. The electronic device thus achieves a pseudo-word-granularity result on the basis of a character-granularity character-vector computation.
The speech recognition model in the embodiment of the application is a neural network model obtained by training an attention-based neural network model in combination with the RNNT loss constraint function. The RNNT loss constraint function explicitly imposes alignment constraints between the input features and the output characters, so as to improve the generalization capability and recognition effect of the model and improve the accuracy of the speech recognition model.
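For concreteness, the following is a structural sketch of such a Transducer — an encoder, a prediction network acting as the language-model component, and a joint network that splices the two feature vectors. Layer types, sizes, and the use of PyTorch are illustrative assumptions, not the patent's exact network:

    import torch
    import torch.nn as nn

    class TinyRNNT(nn.Module):
        def __init__(self, n_feats=80, vocab_size=4000, hidden=256):
            super().__init__()
            self.encoder = nn.LSTM(n_feats, hidden, batch_first=True)
            self.embed = nn.Embedding(vocab_size, hidden)
            self.pred_net = nn.LSTM(hidden, hidden, batch_first=True)  # language-model component
            self.joint = nn.Sequential(
                nn.Linear(2 * hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, vocab_size))

        def forward(self, feats, labels):
            # feats: (B, T, n_feats) acoustic features; labels: (B, U)
            # character ids, assumed already prefixed with the blank symbol.
            enc, _ = self.encoder(feats)                 # audio feature vectors
            pred, _ = self.pred_net(self.embed(labels))  # text feature vectors
            t = enc.unsqueeze(2).expand(-1, -1, pred.size(1), -1)
            u = pred.unsqueeze(1).expand(-1, enc.size(1), -1, -1)
            # Splice every audio frame with every text step, then project.
            return self.joint(torch.cat([t, u], dim=-1))  # (B, T, U, vocab)

    # The logits can then be trained with an RNNT loss, e.g.
    # torchaudio.functional.rnnt_loss(logits, targets, feat_lens, target_lens, blank=0).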
According to the training method of the speech recognition model, by inserting the first and second identifiers into the sample text, the electronic device disguises character-granularity text units as word-granularity text units, making the initial speech recognition model distinguish target words from non-target words in a more explicit manner. Without changing the original character-granularity vocabulary and without changing the model structure of the initial speech recognition model, the electronic device enables the trained speech recognition model to learn more words and sentence-level information, improving the recognition effect of the trained speech recognition model.
In addition, in the embodiment of the application, because the electronic device disguises character-granularity text units as word-granularity text units, the recognition accuracy of Chinese speech can be improved — avoiding, for example, homophone errors in which a name such as "Princess Huahua" is transcribed with the wrong "hua" characters — and the recognition accuracy of English speech and mixed Chinese-English speech can also be improved.
FIG. 9 is another exemplary flow chart of a method for training a speech recognition model provided by an embodiment of the present application.
As shown in fig. 9, after step S102, the training method of the speech recognition model may further include the following steps S107 to S110.
Step S107, when the current character type of the sample text is only English characters, word segmentation is carried out on the sample text by utilizing an English sub-word model so as to obtain a third word segmentation result, wherein the third word segmentation result comprises a plurality of second text units; the second text unit is a sub word of a second target word which can be split by the English sub word model in the sample text and/or a second non-target word which can not be split by the English sub word model in the sample text.
In a specific implementation, after the electronic device inputs the sample text into the English subword model, the English subword model determines splittable words as second target words and unsplittable words as second non-target words, outputs the second target words after splitting together with the second non-target words, and stores them in a first list. Taking the sample text "Hello world" as an example, "Hello" can be split by the English subword model while "world" cannot; the second text units may then be the subwords "He" and "llo" of the second target word "Hello", and the second non-target word "world". The third word segmentation result may be the subwords and second non-target words stored in the first list, i.e., { "He", "llo", "world" }.
In the embodiment of the application, the electronic device may use a BPE model for word segmentation. Based on the BPE model, each English word in the sample text is first split into characters, which serve as the initial subwords, and a marker character is appended to the end of each English word; the split subwords form an initial subword list. The frequency of adjacent subword pairs within the English words of the sample text is then counted, the pair with the highest frequency is merged into a new subword, and the new subword is added to the subword list. The steps of counting adjacent-pair frequencies and merging pairs are repeated until the set number of merges is reached or the set subword list size is reached. It should be noted that embodiments of the present application include, but are not limited to, word segmentation using a BPE model.
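The BPE merge loop described above can be sketched as follows (a minimal illustration; production systems typically use libraries such as subword-nmt or SentencePiece, which the patent does not name):

    from collections import Counter

    # Hedged sketch of BPE training: repeatedly merge the most frequent
    # adjacent subword pair. "</w>" is the end-of-word marker character.
    def bpe_merges(words, num_merges=10):
        corpus = Counter(tuple(w) + ("</w>",) for w in words)
        merges = []
        for _ in range(num_merges):
            pairs = Counter()
            for word, freq in corpus.items():
                for a, b in zip(word, word[1:]):
                    pairs[(a, b)] += freq
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            merged = {}
            for word, freq in corpus.items():
                out, i = [], 0
                while i < len(word):
                    if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                        out.append(word[i] + word[i + 1])  # merge the pair
                        i += 2
                    else:
                        out.append(word[i])
                        i += 1
                merged[tuple(out)] = merged.get(tuple(out), 0) + freq
            corpus = merged
        return merges

    # bpe_merges(["hello", "hello", "help"], num_merges=3)
    #   -> learned merges such as ('h', 'e'), ('he', 'l'), ...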
Step S108, aiming at the third word segmentation result, inserting a first identifier before a second target word with the number of the sub words being greater than 1, and inserting a second identifier before a second non-target word with the number of the sub words being equal to 1, so as to obtain a fourth word segmentation result.
Taking the third word segmentation result { "He", "llo", "world" } as an example: the splittable second target word is "Hello", whose subwords are "He" and "llo" (2 in number); the second non-target word "world" cannot be split, i.e., its number of subwords is equal to 1. After the first identifier [PH] and the second identifier [CH] are inserted, the fourth word segmentation result may be "[PH] Hello [CH] world".
Step S109, for the fourth word segmentation result, segmenting each second text unit, and inserting separators between two adjacent segmented second text units to obtain a second training text;
taking the fourth word segmentation result "[PH] Hello [CH] world" as an example, the second text units comprise "He", "llo", and "world"; each second text unit is segmented, and a separator is inserted between two adjacent segmented second text units, obtaining the second training text "[PH] He llo [CH] world".
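A minimal sketch of steps S108 and S109 together, assuming the subword model returns each English word as its list of subwords (an unsplittable word yields a single subword):

    # Hedged sketch: tag each word with [PH] or [CH] by subword count, then
    # separate its subwords with spaces.
    def tag_and_split_english(segmented):
        pieces = []
        for subwords in segmented:          # e.g. [["He", "llo"], ["world"]]
            tag = "[PH]" if len(subwords) > 1 else "[CH]"
            pieces.append(tag + " " + " ".join(subwords))
        return " ".join(pieces)

    # tag_and_split_english([["He", "llo"], ["world"]])
    #   -> "[PH] He llo [CH] world"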
Step S110, training an initial speech recognition model according to the sample audio and the second training text to obtain a trained speech recognition model.
The step may be specifically described with reference to step S106.
FIG. 10 is another exemplary flow chart of a method for training a speech recognition model provided by an embodiment of the present application.
As shown in fig. 10, after step S102, the training method of the speech recognition model may further include the following steps S111-S116.
Step S111, when the current character type of the sample text is Chinese and English characters, splitting the sample text into a first sub-text and a second sub-text, wherein the first sub-text is used for recording all Chinese characters in the sample text; the second sub-text is used for recording all English characters in the sample text;
For example, when the sample text is "Hello世界" (where "世界" is the Chinese word for "world"), the first sub-text is used to record "世界", and the second sub-text is used to record "Hello".
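A hedged sketch of step S111; here Unicode ranges stand in for the byte-width test used earlier, which is an implementation choice rather than something mandated by the description:

    # Split a mixed sample text into a Chinese first sub-text and an
    # English second sub-text.
    def split_sample_text(text: str):
        first_sub = "".join(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
        second_sub = "".join(ch for ch in text if ch.isascii() and ch.isalpha())
        return first_sub, second_sub

    # split_sample_text("Hello世界") -> ("世界", "Hello")

Non-alphabetic characters such as punctuation are dropped here; how they should be handled is left open by the description.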
Step S112, word segmentation is carried out on the first sub-text by utilizing a preset word stock so as to obtain a first word segmentation result; and performing word segmentation on the second sub text by utilizing the English sub word model to obtain a third word segmentation result.
The step of segmenting the first sub-text by using the preset word stock may refer to step S103, and the step of segmenting the second sub-text by using the English subword model may refer to step S107.
Step S113, for the first word segmentation result, inserting a first identifier before a first text unit with a character length greater than 1, and inserting a second identifier before the first text unit with a character length equal to 1, so as to obtain a second word segmentation result; and for the third word segmentation result, inserting a first identifier before a second target word with the number of the sub words being greater than 1, and inserting a second identifier before a second non-target word with the number of the sub words being equal to 1, so as to obtain a fourth word segmentation result.
The specific implementation of this step can be seen in step S104 and step S108.
Step S114, aiming at the second word segmentation result, respectively carrying out word segmentation on the first text units after each first identifier, and inserting separators between two adjacent segmented characters to obtain a first training text; and aiming at the fourth word segmentation result, segmenting each second text unit, and inserting separators in front of two adjacent segmented second text units to obtain a second training text.
The specific implementation of this step can be seen in step S105 and step S109.
Step S115, the first training text and the second training text are spliced into a third training text.
For example, the first training text is "[PH] 世 界", the second training text is "[PH] He llo", and the spliced third training text is "[PH] He llo [PH] 世 界".
Step S116, training an initial speech recognition model according to the sample audio and the third training text to obtain a trained speech recognition model.
The specific implementation may be referred to step S106.
It should be noted that in BPE-based word segmentation, splitting English words and then restoring them can easily produce splicing errors: taking the English word "Hello" as an example, after being split and restored it is easily recognized incorrectly, e.g. with "He" and "llo" treated as separate words. In the implementation shown in the application, because the first identifier [PH] is inserted before "Hello" during splitting, the electronic device can determine "[PH] Hello" to be one word: another identifier lies between this word and the next word, while no identifier lies between "He" and "llo", so the electronic device will not mistake "He" for a separate word, and the splitting and restoring process does not produce splicing errors. Because the embodiment of the application can realize Chinese speech recognition, English speech recognition, and mixed Chinese-English speech recognition on the basis of character granularity, compared with a speech recognition model used only for Chinese or only for English, the embodiment of the application can be applied to more speech recognition scenarios.
The following describes exemplary steps of a method for using a speech recognition model according to an embodiment of the present application.
FIG. 11 is an exemplary flow chart of a method for using a speech recognition model according to an embodiment of the present application.
As shown in fig. 11, the model using method may include the following steps S201 to S202.
Step S201, voice data is acquired.
The voice data may be obtained according to actual needs, for example by recording. By way of example, telephone recordings or conference recordings from a user's daily life may be used as the voice data, so that the voice data can be recognized through the technical solution shown in this embodiment.
Step S202, inputting the speech data into the trained speech recognition model according to the embodiments of the training method of the speech recognition model, to obtain the recognized speech data.
In some embodiments, the recognized voice data includes recognized text data corresponding to the recognized audio data. The recognized text data includes a first identifier and a second identifier: the first identifier is used to identify a text unit as a first target word or as a subword of a second target word, and the second identifier is used to identify a text unit as a first non-target word or a second non-target word. The first target words, the subwords of the second target words, the first non-target words, and the second non-target words are recognized by the trained speech recognition model.
In a specific implementation, the electronic device obtains the voice data and inputs it into the encoding layer through the input layer, and the historical output of the joint network layer serves as the input of the prediction network layer. Take the user's voice data "我喜欢潮流的衣服" as an example, where the historical output of the joint network layer is the output after training on the training sample "[CH] 我 [PH] 喜 欢 [PH] 潮 流 [CH] 的 [PH] 衣 服". At this time, the text data recognized by the electronic device is "[CH] 我 [PH] 喜 欢 [PH] 潮 流 [CH] 的 [PH] 衣 服", wherein the recognized text data comprises a plurality of first identifiers [PH], a plurality of second identifiers [CH], and the text units "我", "喜欢", "潮流", "的", "衣服". The first identifier [PH] may identify the text units "喜欢", "潮流", "衣服" as first target words. The second identifier [CH] is used to identify the text units "我", "的" as first non-target words.
For example, if the text data identified by the electronic device is "[PH] He llo [CH] world", the first identifier [PH] may identify the text units "He" and "llo" as subwords of a second target word, and the second identifier [CH] is used to identify the text unit "world" as a second non-target word.
In some embodiments, since the electronic device treats the first and second identifiers as vocabulary entries in the recognized text data, the method of inputting the voice data into the speech recognition model trained according to the above training method further includes, after obtaining the recognized voice data: deleting the first identifiers and the second identifiers from the recognized text data to obtain output text data; and outputting the text data for display on a user interface.
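A minimal sketch of this post-processing, assuming the identifiers are exactly the strings "[PH]" and "[CH]":

    import re

    # Hedged sketch: delete the identifiers from the recognized text before
    # display, then collapse the leftover separator runs.
    def strip_identifiers(recognized: str) -> str:
        text = re.sub(r"\[(?:PH|CH)\]", "", recognized)
        return re.sub(r"\s+", " ", text).strip()

    # strip_identifiers("[PH] He llo [CH] world") -> "He llo world"
    # strip_identifiers("[CH] 我 [PH] 喜 欢")      -> "我 喜 欢"

For Chinese output, the residual spaces between characters would typically also be removed before display — an implementation choice the description leaves open.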
According to the application method of the voice recognition model, the electronic equipment inputs the voice data into the trained voice recognition model, the trained voice recognition model can recognize words in the voice data, and recognition accuracy is high.
In the embodiment provided by the application, the schemes of the model training method provided by the application are introduced from the aspects of the electronic equipment and the interaction between the electronic equipment and the user. It will be appreciated that the electronic device, in order to achieve the above-described functions, includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 12 is a schematic diagram of a training system for a speech recognition model according to an embodiment of the present application.
In some embodiments, a training system for a speech recognition model includes:
an obtaining module 301, configured to obtain a training sample; the training samples comprise sample audio and sample texts corresponding to the sample audio; the sample text includes at least one character; the characters include at least one of Chinese characters and English characters.
A judging module 302, configured to judge a current character type of the sample text; the character types include: Chinese characters only, English characters only, and Chinese-English characters.
The word segmentation module 303 is configured to segment the sample text by using a preset word stock when the current character type of the sample text is Chinese characters only, so as to obtain a first word segmentation result, where the first word segmentation result includes a plurality of first text units; the first text unit is a first target word in the sample text that matches a preset word in the preset word stock, and/or the first text unit is a first non-target word in the sample text other than the first target word; a separator is arranged in front of each first target word and each first non-target word;
the identification module 304 is configured to insert, for a first word segmentation result, a first identifier before a first text unit with a character length greater than 1, and insert a second identifier before the first text unit with a character length equal to 1, so as to obtain a second word segmentation result;
The segmentation module 305 is configured to segment the first text unit after each first identifier according to the word, and insert a separator between two adjacent segmented characters to obtain a first training text;
the training module 306 is configured to train the initial speech recognition model according to the sample audio and the first training text, and obtain a trained speech recognition model.
In one implementation, the system further comprises: a preset module 307, configured to establish a preset word stock, the preset word stock comprising at least one preset word, and to establish an English subword model, wherein the English subword model is a byte pair encoding (BPE) model. With this embodiment, the electronic device can establish a preset word stock on the device side and set general-purpose words and automatic speech recognition (ASR) scene words in the preset word stock as preset words, so as to establish an end-to-end speech recognition model.
In one implementation, the word segmentation module 303 is further configured to segment the sample text by using the English subword model when the current character type of the sample text is English characters only, so as to obtain a third word segmentation result, where the third word segmentation result includes a plurality of second text units; the second text unit is a subword of a second target word in the sample text that can be split by the English subword model, and/or a second non-target word in the sample text that cannot be split by the English subword model; a separator is arranged in front of each second text unit;
The identification module 304 is further configured to insert, for the third word segmentation result, a first identifier before a second target word with the number of sub words being greater than 1, and insert a second identifier before a second non-target word with the number of sub words being equal to 1, so as to obtain a fourth word segmentation result;
the segmentation module 305 is further configured to segment each second text unit according to the fourth word segmentation result, and insert a separator in front of two adjacent segmented second text units to obtain a second training text;
the training module 306 is further configured to train the initial speech model according to the sample audio and the second training text, and obtain a trained speech recognition model.
By adopting the embodiment, the electronic equipment can determine the word segmentation mode of the sample text so as to acquire the target word and the non-target word.
In one implementation, the system further comprises: a splitting module 308, configured to split the sample text into a first sub-text and a second sub-text when the current character type of the sample text is Chinese-English characters;
a synthesizing module 309, configured to splice the first training text and the second training text into a third training text;
the training module 306 is further configured to train the initial speech recognition model according to the sample audio and the third training text, and obtain a trained speech recognition model.
According to the training system for the speech recognition model provided by the embodiment of the application, by inserting the first and second identifiers into the sample text, the electronic device disguises character-granularity text units as word-granularity text units, so that the initial speech recognition model can distinguish target words from non-target words in a more explicit manner. Without changing the original character-granularity vocabulary and without changing the model structure of the initial speech recognition model, the electronic device enables the trained speech recognition model to learn more words and sentence-level information, improving the recognition effect of the trained speech recognition model.
Fig. 13 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application.
In some embodiments, the electronic device may implement the corresponding functions by the hardware apparatus shown in fig. 13. As shown in fig. 13, the model training apparatus may include: a memory 401 and a processor 402.
In one implementation, the processor 402 may include one or more processing units, e.g., the processor 402 may include an application processor, a controller, a video codec, a digital signal processor, and/or a neural network processor, etc., where the different processing units may be separate devices or may be integrated in one or more processors. Memory 401 is coupled to processor 402 for storing various software programs and/or sets of instructions. In some embodiments, memory 401 may include volatile memory and/or non-volatile memory.
In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
obtaining a training sample; the training samples comprise sample audio and sample texts corresponding to the sample audio; the sample text includes at least one character; the characters comprise at least one of Chinese characters and English characters;
judging the current character type of the sample text; the character types include: Chinese characters only, English characters only, and Chinese-English characters;
when the current character type of the sample text is Chinese characters only, performing word segmentation on the sample text by utilizing a preset word stock so as to obtain a first word segmentation result, wherein the first word segmentation result comprises a plurality of first text units; the first text unit is a first target word in the sample text that matches a preset word in the preset word stock, and/or the first text unit is a first non-target word in the sample text other than the first target word; a separator is arranged in front of each first target word and each first non-target word;
for a first word segmentation result, inserting a first identifier before a first text unit with a character length greater than 1, and inserting a second identifier before the first text unit with a character length equal to 1, so as to obtain a second word segmentation result;
aiming at the second word segmentation result, splitting the first text unit after each first identifier character by character, and inserting separators between two adjacent split characters to obtain a first training text;
and training the initial speech recognition model according to the sample audio and the first training text to obtain a trained speech recognition model.
In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
when the current character type of the sample text is English characters only, word segmentation is carried out on the sample text by utilizing an English sub word model so as to obtain a third word segmentation result, wherein the third word segmentation result comprises a plurality of second text units; the second text unit is a sub word of a second target word which can be split by the English sub word model in the sample text and/or a second non-target word which can not be split by the English sub word model in the sample text;
for the third word segmentation result, inserting a first identifier before a second target word with the number of the sub words being greater than 1, and inserting a second identifier before a second non-target word with the number of the sub words being equal to 1, so as to obtain a fourth word segmentation result;
Aiming at the fourth word segmentation result, segmenting each second text unit, and inserting separators in front of two adjacent segmented second text units to obtain a second training text;
and training the initial voice model according to the sample audio and the second training text to obtain a trained voice recognition model.
In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
when the current character type of the sample text is Chinese and English characters, splitting the sample text into a first sub-text and a second sub-text; the first sub-text is used for recording all Chinese characters in the sample text; the second sub-text is used for recording all English characters in the sample text;
performing word segmentation on the first sub-text by using a preset word stock to obtain a first word segmentation result; performing word segmentation on the second sub text by utilizing the English sub word model to obtain a third word segmentation result;
for the first word segmentation result, inserting a first identifier before a first text unit with the character length being greater than 1, and inserting a second identifier before the first text unit with the character length being equal to 1, so as to obtain a second word segmentation result; and for the third word segmentation result, inserting a first identifier before a second target word with the number of the sub words being greater than 1, and inserting a second identifier before a second non-target word with the number of the sub words being equal to 1, so as to obtain a fourth word segmentation result;
Aiming at the second word segmentation result, respectively carrying out word segmentation on a first text unit after each first identifier, and inserting separators between two adjacent segmented characters to obtain a first training text; aiming at the fourth word segmentation result, segmenting each second text unit, and inserting separators in front of two adjacent segmented second text units to obtain a second training text;
the first training text and the second training text are spliced into a third training text;
and training the initial speech recognition model according to the sample audio and the third training text to obtain a trained speech recognition model.
In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
establishing a preset word stock, wherein the preset word stock comprises at least one preset word, and establishing an English subword model, wherein the English subword model is a byte pair encoding (BPE) model.

In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
determining all first target words matched with the preset words in the sample text by using a preset word bank;
And respectively segmenting each first target word into first text units, and segmenting first non-target words except the first target words in the sample text into first text units.
In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
determining all first non-target words which are not matched with the preset words in the sample text by using a preset word stock; all first non-target words in the sample text are segmented into first text units.
In some embodiments, the software program and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
determining all sub-words of the second target words which can be split by the English sub-word model in the sample text by utilizing the English sub-word model, and determining all second non-target words which can not be split by the English sub-word model;
each subword and each second non-target word are segmented into second text units and stored in the first list.
In some embodiments, the initial speech recognition model includes an input layer, an encoding layer, a prediction network layer, a joint network layer, and an output layer; the software programs and/or sets of instructions in the memory 401, when executed by the processor 402, cause the electronic device to perform the method steps of:
Training an initial speech recognition model according to the sample audio and the first training text to obtain a trained speech recognition model, comprising:
inputting the sample audio into a coding layer through an input layer to obtain an audio feature vector, and inputting the first training text into a prediction network layer through the input layer to obtain a text feature vector;
inputting the audio feature vector and the text feature vector into a joint network layer to obtain a spliced feature vector;
training the spliced feature vector through a loss function of the voice recognition model and passing through an output layer.
The application also provides a chip system. The chip system comprises a processor for supporting the above apparatus or device in implementing the functions involved in the above aspects, for example, generating or processing the information involved in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the above apparatus or device. The chip system may consist of a chip, or may include a chip and other discrete devices.
Embodiments of the present application also provide a computer-readable storage medium having stored therein program instructions that, when executed on a computer, cause the computer to perform the methods of the above aspects and implementations thereof.
Embodiments of the present application also provide a computer program product which, when run on a computer, causes the computer to perform the methods of the above aspects and implementations thereof.
It should be understood that, in various embodiments of the present application, the sequence numbers of the processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments.
The embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are described relatively simply because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.
The foregoing embodiments are merely intended to illustrate the technical solutions of the present application in further detail and are not intended to limit its scope; any modifications, equivalent replacements, improvements, and the like made on the basis of the teachings of the application shall fall within its protection scope.

Claims (17)

1. A method for training a speech recognition model, comprising:
obtaining a training sample; the training sample comprises sample audio and sample text corresponding to the sample audio; the sample text includes at least one character; the characters comprise at least one of Chinese characters and English characters;
judging the current character type of the sample text; the character categories include: Chinese characters only, English characters only, and Chinese-English characters;
when the current character type of the sample text is Chinese character only, word segmentation is carried out on the sample text by utilizing a preset word stock so as to obtain a first word segmentation result, wherein the first word segmentation result comprises a plurality of first text units; the first text unit is a first target word in the sample text, which is matched with a preset word in the preset word stock, and/or the first text unit is a first non-target word in the sample text except the first target word; a separator is arranged in front of each first target word and each first non-target word;
for the first word segmentation result, inserting a first identifier before each first text unit whose character length is greater than 1, and inserting a second identifier before each first text unit whose character length is equal to 1, to obtain a second word segmentation result;
for the second word segmentation result, segmenting the first text unit following each first identifier into individual characters, and inserting a separator between every two adjacent segmented characters, to obtain a first training text;
and training an initial speech recognition model according to the sample audio and the first training text to obtain a trained speech recognition model, wherein the first training text is used to train the initial speech recognition model based on the word vector of each character.
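As an editorial illustration (not part of the claim language), the sketch below walks through the Chinese-only pipeline of claim 1: lexicon-based segmentation, identifier insertion, and per-character splitting. The claims do not name a matching algorithm, so greedy forward maximum matching is assumed; the identifier tokens "<w>" and "<c>", the toy lexicon, and all function names are hypothetical placeholders, while the space separator follows claim 10.

```python
# A minimal sketch of the claim-1 text pipeline for Chinese-only sample text.
# Assumptions: forward maximum matching stands in for the unspecified lexicon
# matching; "<w>" and "<c>" are hypothetical first/second identifier tokens;
# the separator is a space character (per claim 10).

PRESET_LEXICON = {"语音", "识别", "模型"}  # toy preset word library

def segment_with_lexicon(text: str, lexicon: set[str], max_len: int = 4) -> list[str]:
    """First word segmentation result: greedy forward maximum matching.
    Words found in the lexicon become multi-character units (first target
    words); everything else falls through as single characters (first
    non-target words)."""
    units, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in lexicon:
                units.append(text[i:i + n])  # first target word
                i += n
                break
        else:
            units.append(text[i])  # first non-target word (single character)
            i += 1
    return units

def mark_and_split(units: list[str]) -> str:
    """Second word segmentation result and first training text: insert the
    first identifier before units longer than one character, the second
    identifier before single-character units, then split multi-character
    units into characters joined by the space separator."""
    out = []
    for u in units:
        if len(u) > 1:
            out.append("<w> " + " ".join(u))  # first identifier + per-character split
        else:
            out.append("<c> " + u)            # second identifier
    return " ".join(out)

units = segment_with_lexicon("语音识别真好", PRESET_LEXICON)
print(mark_and_split(units))  # "<w> 语 音 <w> 识 别 <c> 真 <c> 好"
```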
2. The method for training a speech recognition model of claim 1, further comprising:
when the current character type of the sample text is English characters only, performing word segmentation on the sample text by using an English subword model to obtain a third word segmentation result, wherein the third word segmentation result comprises a plurality of second text units; each second text unit is a subword of a second target word in the sample text that can be split by the English subword model, and/or a second non-target word in the sample text that cannot be split by the English subword model;
for the third word segmentation result, inserting a first identifier before each second target word whose number of subwords is greater than 1, and inserting a second identifier before each second non-target word whose number of subwords is equal to 1, to obtain a fourth word segmentation result;
for the fourth word segmentation result, segmenting each second text unit, and inserting a separator between every two adjacent segmented second text units, to obtain a second training text;
and training an initial speech recognition model according to the sample audio and the second training text to obtain a trained speech recognition model.
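Continuing the illustration, the sketch below covers the English-only path of claim 2. A toy subword inventory stands in for the trained BPE model, and the greedy longest-match split is an assumption made for brevity; a real BPE model applies its learned merge operations instead.

```python
# A minimal sketch of the claim-2 path for English-only sample text.
# Assumptions: a toy subword vocabulary stands in for the trained English
# subword (BPE) model; "<w>"/"<c>" remain the hypothetical first/second
# identifiers, and the separator is a space character.

SUBWORD_VOCAB = {"play", "ing", "read", "er"}  # toy BPE subword inventory

def bpe_split(word: str, vocab: set[str]) -> list[str]:
    """Greedily split a word into the longest subwords found in the
    vocabulary. A word that cannot be split is returned whole (a second
    non-target word); one that splits yields the subwords of a second
    target word."""
    pieces, i = [], 0
    while i < len(word):
        for n in range(len(word) - i, 0, -1):
            if word[i:i + n] in vocab:
                pieces.append(word[i:i + n])
                i += n
                break
        else:
            return [word]  # unsplittable: second non-target word
    return pieces

def mark_english(text: str) -> str:
    """Third/fourth word segmentation results and second training text:
    identifier insertion plus space separators between adjacent subwords."""
    out = []
    for word in text.split():
        pieces = bpe_split(word, SUBWORD_VOCAB)
        marker = "<w>" if len(pieces) > 1 else "<c>"
        out.append(marker + " " + " ".join(pieces))
    return " ".join(out)

print(mark_english("playing cat"))  # "<w> play ing <c> cat"
```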
3. The method for training a speech recognition model of claim 2, further comprising:
when the current character type of the sample text is mixed Chinese and English characters, splitting the sample text into a first sub-text and a second sub-text; the first sub-text is used to record all Chinese characters in the sample text; the second sub-text is used to record all English characters in the sample text;
performing word segmentation on the first sub-text by using the preset word library to obtain the first word segmentation result; and performing word segmentation on the second sub-text by using the English subword model to obtain the third word segmentation result;
for the first word segmentation result, inserting a first identifier before each first text unit whose character length is greater than 1, and inserting a second identifier before each first text unit whose character length is equal to 1, to obtain the second word segmentation result; and for the third word segmentation result, inserting a first identifier before each second target word whose number of subwords is greater than 1, and inserting a second identifier before each second non-target word whose number of subwords is equal to 1, to obtain the fourth word segmentation result;
for the second word segmentation result, segmenting the first text unit following each first identifier into individual characters, and inserting a separator between every two adjacent segmented characters, to obtain the first training text; and for the fourth word segmentation result, segmenting each second text unit, and inserting a separator between every two adjacent segmented second text units, to obtain the second training text;
splicing the first training text and the second training text into a third training text;
and training an initial speech recognition model according to the sample audio and the third training text to obtain a trained speech recognition model.
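For the mixed-text path of claim 3, the sketch below reuses the helpers from the two previous sketches. Splitting the sub-texts by Unicode range, and keeping spaces with the English sub-text so word boundaries survive, are assumptions; the claims do not specify how the Chinese and English characters are extracted.

```python
# A minimal sketch of the claim-3 path for mixed Chinese-English sample text,
# reusing segment_with_lexicon/mark_and_split and mark_english from the
# sketches above. The Unicode-range split is an assumption.

def split_scripts(text: str) -> tuple[str, str]:
    """Split the sample text into a first sub-text (all Chinese characters)
    and a second sub-text (all English characters; spaces are kept so that
    English word boundaries are preserved)."""
    zh = "".join(c for c in text if "\u4e00" <= c <= "\u9fff")
    en = "".join(c for c in text if c.isascii() and (c.isalpha() or c == " "))
    return zh, en

def build_third_training_text(text: str) -> str:
    zh, en = split_scripts(text)
    first = mark_and_split(segment_with_lexicon(zh, PRESET_LEXICON))   # first training text
    second = mark_english(en)                                          # second training text
    return (first + " " + second).strip()  # splice into the third training text

print(build_third_training_text("打开 playing 模式"))
# e.g. "<c> 打 <c> 开 <c> 模 <c> 式 <w> play ing"
```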
4. The method for training a speech recognition model according to claim 1, wherein before the judging of the current character type of the sample text, the method further comprises:
establishing the preset word library, the preset word library comprising at least one preset word; and establishing an English subword model, the English subword model being a byte pair encoding (BPE) model.
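For illustration, a minimal sketch of learning the kind of byte pair encoding model claim 4 establishes, using the classic merge-counting procedure. The toy corpus and merge count are hypothetical; production systems would typically rely on an established BPE or SentencePiece toolkit rather than this hand-rolled loop.

```python
# A minimal sketch of byte pair encoding (BPE) training: repeatedly merge the
# most frequent adjacent symbol pair across a frequency-weighted corpus.
from collections import Counter

def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Represent each word as a tuple of symbols, weighted by frequency.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

print(learn_bpe(["playing", "played", "playful"], num_merges=4))
# e.g. [('p', 'l'), ('pl', 'a'), ('pla', 'y'), ('play', 'e')]
```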
5. The method for training a speech recognition model according to claim 4, wherein the performing word segmentation on the sample text by using a preset word library to obtain a first word segmentation result comprises:
determining, by using the preset word library, all first target words in the sample text that match the preset words;
and segmenting each first target word into a first text unit, and segmenting the first non-target words in the sample text other than the first target words into first text units.
6. The method for training a speech recognition model according to claim 5, wherein the performing word segmentation on the sample text by using a preset word library to obtain a first word segmentation result comprises:
determining, by using the preset word library, all first non-target words in the sample text that do not match the preset words; and segmenting all the first non-target words in the sample text into first text units.
7. The method for training a speech recognition model according to claim 2, wherein the performing word segmentation on the sample text by using the English subword model to obtain a third word segmentation result comprises:
determining, by using the English subword model, all subwords of the second target words in the sample text that can be split by the English subword model, and determining all the second non-target words in the sample text that cannot be split by the English subword model;
and segmenting each subword and each second non-target word into a second text unit, and storing the second text units in a first list.
8. The method of claim 1, wherein the initial speech recognition model comprises an input layer, a coding layer, a prediction network layer, a joint network layer, and an output layer;
and wherein the training an initial speech recognition model according to the sample audio and the first training text to obtain a trained speech recognition model comprises the following steps:
inputting the sample audio to the coding layer through the input layer to obtain an audio feature vector, and inputting the first training text to the prediction network layer through the input layer to obtain a text feature vector;
inputting the audio feature vector and the text feature vector into the joint network layer to obtain a spliced feature vector;
training on the spliced feature vector through a loss function of the speech recognition model and outputting through the output layer, to obtain the trained speech recognition model.
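As an illustration of the claim-8 layout, the PyTorch sketch below wires an encoder (coding layer), a prediction network, and a joint network that splices (concatenates) the audio and text feature vectors before the output layer, trained with an RNN-T loss consistent with claim 9. All dimensions are toy values; the concatenation-based joiner and the use of torchaudio's rnnt_loss (available from torchaudio 0.10) are assumptions about one common realization, not the patent's exact design.

```python
# A minimal sketch of the claim-8 layer layout: encoder for sample audio,
# prediction network for the first training text, joint network splicing
# the two feature vectors, output layer producing vocabulary logits.
import torch
import torch.nn as nn
from torchaudio.functional import rnnt_loss  # assumes torchaudio >= 0.10

class ToyRNNT(nn.Module):
    def __init__(self, n_mels=80, vocab=1000, d=256, blank=0):
        super().__init__()
        self.blank = blank
        self.encoder = nn.LSTM(n_mels, d, batch_first=True)       # coding layer
        self.embed = nn.Embedding(vocab, d)
        self.predictor = nn.LSTM(d, d, batch_first=True)          # prediction network layer
        self.joint = nn.Sequential(nn.Linear(2 * d, d), nn.Tanh(),
                                   nn.Linear(d, vocab))           # joint network + output layer

    def forward(self, audio, tokens):
        f, _ = self.encoder(audio)                  # (B, T, d) audio feature vectors
        g, _ = self.predictor(self.embed(tokens))   # (B, U, d) text feature vectors
        # Splice: broadcast-concatenate every audio frame with every text step.
        T, U = f.size(1), g.size(1)
        spliced = torch.cat([f.unsqueeze(2).expand(-1, -1, U, -1),
                             g.unsqueeze(1).expand(-1, T, -1, -1)], dim=-1)
        return self.joint(spliced)                  # (B, T, U, vocab) logits

model = ToyRNNT()
audio = torch.randn(1, 50, 80)          # 50 frames of 80-dim toy features
tokens = torch.tensor([[0, 5, 9]])      # blank-prefixed target tokens (U = targets + 1)
logits = model(audio, tokens)
loss = rnnt_loss(logits, torch.tensor([[5, 9]], dtype=torch.int32),
                 torch.tensor([50], dtype=torch.int32),
                 torch.tensor([2], dtype=torch.int32), blank=model.blank)
```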
9. The method of claim 1, wherein the initial speech recognition model is a recurrent neural network transducer (RNNT) model.
10. The method of claim 1, wherein the separator is a space character.
11. A method of using a speech recognition model, comprising:
acquiring voice data;
inputting the voice data into a speech recognition model trained by the training method of a speech recognition model according to any one of claims 1-10, to obtain recognized voice data.
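For the use method of claim 11, the sketch below shows a minimal greedy RNN-T decoding loop, reusing the hypothetical ToyRNNT model above. Real decoders add feature extraction, beam search, and streaming state management, all omitted here for brevity.

```python
# A minimal greedy RNN-T decoding sketch: for each encoder frame, emit
# non-blank tokens until the joint network predicts blank, then advance.
@torch.no_grad()
def greedy_decode(model: ToyRNNT, audio: torch.Tensor, max_symbols=3) -> list[int]:
    f, _ = model.encoder(audio)                 # (1, T, d) audio feature vectors
    tokens = [model.blank]
    g, state = model.predictor(model.embed(torch.tensor([[tokens[-1]]])))
    for t in range(f.size(1)):
        for _ in range(max_symbols):            # cap symbols emitted per frame
            joint_in = torch.cat([f[:, t], g[:, -1]], dim=-1)
            k = model.joint(joint_in).argmax(-1).item()
            if k == model.blank:
                break                           # advance to the next audio frame
            tokens.append(k)
            g, state = model.predictor(model.embed(torch.tensor([[k]])), state)
    return tokens[1:]                           # drop the leading blank

recognized = greedy_decode(model, torch.randn(1, 50, 80))
```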
12. The method according to claim 11, wherein the recognized voice data includes recognized text data corresponding to recognized audio data; the recognized text data includes a plurality of text units, a first identifier for identifying a text unit as a first target word or a second target word, and a second identifier for identifying a text unit as a first non-target word or a second non-target word; the first target words, second target words, first non-target words, and second non-target words are recognized by the trained speech recognition model;
wherein after the inputting the voice data into the speech recognition model trained by the training method of a speech recognition model according to any one of claims 1-10 to obtain the recognized voice data, the method further comprises:
deleting the first identifier and the second identifier from the recognized text data to obtain output text data; the output text data is for display on a user interface.
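Finally, a minimal sketch of the claim-12 post-processing step: deleting the hypothetical "<w>"/"<c>" identifiers from the recognized text data before it is displayed.

```python
# Strip the hypothetical identifier tokens from the recognized text so that
# only the output text data remains for display.
def to_output_text(recognized_text: str) -> str:
    cleaned = [tok for tok in recognized_text.split() if tok not in ("<w>", "<c>")]
    return " ".join(cleaned)

print(to_output_text("<w> play ing <c> cat"))  # "play ing cat"
```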
13. A training system for a speech recognition model, comprising:
the acquisition module is used for acquiring a training sample; the training sample comprises sample audio and sample text corresponding to the sample audio; the sample text includes at least one character; the characters comprise at least one of Chinese characters and English characters;
the judging module is used for judging the current character type of the sample text; the character types include: Chinese characters only, English characters only, and mixed Chinese and English characters;
the word segmentation module is used for performing word segmentation on the sample text by using a preset word library when the current character type of the sample text is Chinese characters only, to obtain a first word segmentation result, wherein the first word segmentation result comprises a plurality of first text units; each first text unit is a first target word in the sample text that matches a preset word in the preset word library, and/or a first non-target word in the sample text other than the first target words; a separator is arranged before each first target word and each first non-target word;
the identification module is used for inserting, for the first word segmentation result, a first identifier before each first text unit whose character length is greater than 1, and a second identifier before each first text unit whose character length is equal to 1, to obtain a second word segmentation result;
the segmentation module is used for segmenting, for the second word segmentation result, the first text unit following each first identifier into individual characters, and inserting a separator between every two adjacent segmented characters, to obtain a first training text;
and the training module is used for training an initial speech recognition model according to the sample audio and the first training text to obtain a trained speech recognition model, wherein the first training text is used to train the initial speech recognition model based on the word vector of each character.
14. The training system for a speech recognition model of claim 13, wherein:
the word segmentation module is further configured to perform word segmentation on the sample text by using an English subword model when the current character type of the sample text is English characters only, to obtain a third word segmentation result, wherein the third word segmentation result comprises a plurality of second text units; each second text unit is a subword of a second target word in the sample text that can be split by the English subword model, and/or a second non-target word in the sample text that cannot be split by the English subword model;
the identification module is further configured to insert, for the third word segmentation result, a first identifier before each second target word whose number of subwords is greater than 1, and a second identifier before each second non-target word whose number of subwords is equal to 1, to obtain a fourth word segmentation result;
the segmentation module is further configured to segment, for the fourth word segmentation result, each second text unit, and insert a separator between every two adjacent segmented second text units, to obtain a second training text;
the training module is further configured to train an initial speech recognition model according to the sample audio and the second training text to obtain a trained speech recognition model.
15. The training system for a speech recognition model of claim 14, further comprising:
the splitting module is used for splitting the sample text into a first sub-text and a second sub-text when the current character type of the sample text is mixed Chinese and English characters;
the synthesis module is used for splicing the first training text and the second training text into a third training text;
the training module is further configured to train an initial speech recognition model according to the sample audio and the third training text, and obtain a trained speech recognition model.
16. The training system for a speech recognition model of claim 13, further comprising:
the preset module is used for establishing the preset word library, the preset word library comprising at least one preset word, and for establishing an English subword model, the English subword model being a byte pair encoding (BPE) model.
17. An electronic device, comprising: a processor and a memory; the memory stores program instructions that, when executed by the processor, cause the electronic device to perform the training method of the speech recognition model of any one of claims 1-10.
CN202210928141.4A 2022-08-03 2022-08-03 Training method, using method and training system of voice recognition model Active CN116052648B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210928141.4A CN116052648B (en) 2022-08-03 2022-08-03 Training method, using method and training system of voice recognition model


Publications (2)

Publication Number Publication Date
CN116052648A CN116052648A (en) 2023-05-02
CN116052648B (en) 2023-10-20

Family

ID=86120592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210928141.4A Active CN116052648B (en) 2022-08-03 2022-08-03 Training method, using method and training system of voice recognition model

Country Status (1)

Country Link
CN (1) CN116052648B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2020064624A (en) * 2018-10-16 2020-04-23 株式会社リコー Sub-word-unit division method and device, and computer-readable storage medium
CN111160004A (en) * 2018-11-07 2020-05-15 北京猎户星空科技有限公司 Method and device for establishing sentence-breaking model
CN111414757A (en) * 2019-01-04 2020-07-14 阿里巴巴集团控股有限公司 Text recognition method and device
CN111859951A (en) * 2020-06-19 2020-10-30 北京百度网讯科技有限公司 Language model training method and device, electronic equipment and readable storage medium
CN111950291A (en) * 2020-06-22 2020-11-17 北京百度网讯科技有限公司 Semantic representation model generation method and device, electronic equipment and storage medium
CN112652295A (en) * 2020-12-22 2021-04-13 平安国际智慧城市科技股份有限公司 Language model training method, device, equipment and medium, and video subtitle checking method, device and medium
CN113033200A (en) * 2021-05-27 2021-06-25 北京世纪好未来教育科技有限公司 Data processing method, text recognition model generation method and text recognition method

Also Published As

Publication number Publication date
CN116052648A (en) 2023-05-02

Similar Documents

Publication Publication Date Title
CN113163470B (en) Method for identifying specific position on specific route and electronic equipment
CN113297843B (en) Reference resolution method and device and electronic equipment
WO2021254411A1 (en) Intent recognigion method and electronic device
CN111881315A (en) Image information input method, electronic device, and computer-readable storage medium
CN110866254B (en) Vulnerability detection method and electronic equipment
WO2022033556A1 (en) Electronic device and speech recognition method therefor, and medium
CN114242037A (en) Virtual character generation method and device
CN114691839A (en) Intention slot position identification method
CN115641867B (en) Voice processing method and terminal equipment
CN115437601B (en) Image ordering method, electronic device, program product and medium
CN112416984A (en) Data processing method and device
CN116052648B (en) Training method, using method and training system of voice recognition model
CN113506566B (en) Sound detection model training method, data processing method and related device
CN114822543A (en) Lip language identification method, sample labeling method, model training method, device, equipment and storage medium
CN114093368A (en) Cross-device voiceprint registration method, electronic device and storage medium
CN116665643B (en) Rhythm marking method and device and terminal equipment
CN117133311B (en) Audio scene recognition method and electronic equipment
CN116030817B (en) Voice wakeup method, equipment and storage medium
CN113470638B (en) Method for slot filling, chip, electronic device and readable storage medium
CN114942741B (en) Data transmission method and electronic equipment
CN116193275B (en) Video processing method and related equipment
CN116030787A (en) Age-based sound generation method and device
CN118042042A (en) Audio data processing method and related device
CN117690147A (en) Text recognition method and electronic equipment
CN117972134A (en) Tone color recommendation method, electronic device, and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant