CN115938369A

CN115938369A - Speech recognition method, electronic device, and storage medium

Info

Publication number: CN115938369A
Application number: CN202110958866.3A
Authority: CN
Inventors: 尹旭贤
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-08-20
Filing date: 2021-08-20
Publication date: 2023-04-07

Abstract

The embodiments of the present application provide a speech recognition method, an electronic device, and a storage medium, and relate to the field of information technology. The method comprises: acquiring a voice to be recognized; calculating the voice to be recognized by using a preset first model to obtain a general result; inputting the general result into a preset second model for calculation to obtain a vertical result; and obtaining a speech recognition result based on the vertical result and the general result. The method provided by the embodiments of the present application can improve the accuracy of speech recognition.

Description

Speech recognition method, electronic device, and storage medium

Technical Field

The embodiments of the present application relate to the field of artificial intelligence, and in particular to a speech recognition method, an electronic device, and a storage medium.

Background

Speech recognition, also known as automatic speech recognition (ASR), is a technique by which a computer converts speech into the corresponding text. Speech recognition technology is widely applied in scenarios such as vehicle navigation, smart home, social chat, application assistants, and entertainment games.

Due to the limitations of the training corpus, speech recognition generally performs well on general-domain speech but has a higher recognition error rate for specific vertical classes, such as navigation locations and song names. At present, vertical-class recognition in speech recognition is mostly implemented with a weighted finite state transducer (WFST). A WFST is based on semiring algebra and is internally a directed graph composed of a plurality of state transition arcs, where each state transition arc carries an input symbol, an output symbol, and a corresponding weight. When a WFST is applied to ASR vertical-class recognition, the characters to be recognized are usually phonetically annotated in advance, and the input symbol on a state transition arc is the phonetic annotation corresponding to the character. After the WFST graph is constructed, the result of the acoustic model is input at recognition time, and a path search is performed in the WFST graph to obtain the path with the highest probability.
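The graph structure and path search described above can be illustrated with a minimal sketch. The arcs, symbols, and weights below are hypothetical toy values, and the weights are assumed to be costs (for example negative log probabilities), so the lowest-cost path stands in for the path with the highest probability. This is not the WFST used in the application, only an illustration of arcs that carry an input symbol, an output symbol, and a weight.

```python
import heapq
from collections import defaultdict

# Hypothetical arcs: (source state, input symbol, output symbol, cost, target state).
# Costs are assumed to behave like negative log probabilities: lower is better.
ARCS = [
    (0, "bei", "北", 0.2, 1),
    (0, "bei", "背", 1.5, 1),
    (1, "jing", "京", 0.1, 2),
    (1, "jing", "景", 1.2, 2),
]
FINAL_STATE = 2


def best_path(arcs, inputs, start=0, final=FINAL_STATE):
    """Return (cost, outputs) of the cheapest accepting path, or None."""
    by_state = defaultdict(list)
    for src, isym, osym, cost, dst in arcs:
        by_state[(src, isym)].append((osym, cost, dst))
    # Priority queue of (cost so far, input position, state, outputs so far).
    queue = [(0.0, 0, start, ())]
    while queue:
        cost, pos, state, outs = heapq.heappop(queue)
        if pos == len(inputs):
            if state == final:
                return cost, list(outs)
            continue
        for osym, arc_cost, dst in by_state[(state, inputs[pos])]:
            heapq.heappush(queue, (cost + arc_cost, pos + 1, dst, outs + (osym,)))
    return None


result = best_path(ARCS, ["bei", "jing"])
```

Because all costs are non-negative, the first accepting entry popped from the queue is the cheapest one. Production systems typically rely on a dedicated WFST toolkit rather than a hand-written search like this.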

In a practical speech recognition system, the optimal path output by the acoustic model does not necessarily match the actual word sequence, so multiple candidate paths with the highest scores are generally obtained. Due to environmental noise, individual accents, uncommon words, and the like, the correct word sequence may not be among the candidate paths, so the correct word sequence cannot be generated after the candidates are fed into the WFST network, resulting in low speech recognition accuracy.

Disclosure of Invention

The embodiments of the present application provide a speech recognition method, an electronic device, and a storage medium, which aim to improve the accuracy of speech recognition.

In a first aspect, an embodiment of the present application provides a speech recognition method, including:

acquiring a voice to be recognized, where the voice to be recognized may be a voice signal collected by the electronic device through a microphone;

calculating the voice to be recognized by using a preset first model to obtain a general result, where the preset first model may be a general model;

inputting the general result into a preset second model for calculation to obtain a vertical result, where the preset second model may be a WFST model; and

obtaining a voice recognition result based on the vertical result and the general result, where the vertical result and the general result can be fused and a final voice recognition result obtained through a comprehensive decision.

In the embodiment of the application, the final voice recognition result is obtained from both the general result and the vertical result, so the accuracy of the voice recognition can be improved.
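The four steps of the first aspect can be sketched as a small pipeline. The function names, the toy models, and the placeholder fusion rule below are all hypothetical stand-ins introduced for illustration; the application does not specify these interfaces.

```python
from typing import Callable, Sequence


def recognize(
    audio: Sequence[float],
    general_model: Callable[[Sequence[float]], str],
    vertical_model: Callable[[str], str],
    fuse: Callable[[str, str], str],
) -> str:
    """Run the first model, feed its output to the second model, then fuse."""
    general_result = general_model(audio)              # preset first model
    vertical_result = vertical_model(general_result)   # preset second model
    return fuse(general_result, vertical_result)       # comprehensive decision


# Toy stand-ins (assumptions, not the models described in the application).
def toy_general_model(audio: Sequence[float]) -> str:
    return "导航到背景"


def toy_vertical_model(text: str) -> str:
    return text.replace("背景", "北京")


def toy_fuse(general: str, vertical: str) -> str:
    # Placeholder rule: prefer the vertical result whenever it differs.
    return vertical if vertical != general else general


final = recognize([0.0, 0.1], toy_general_model, toy_vertical_model, toy_fuse)
```

Passing the models in as callables keeps the sketch independent of any particular acoustic model or WFST implementation.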

In one possible implementation manner, the type of the general result is a Chinese character, and inputting the general result into the preset second model for calculation to obtain the vertical result includes:

converting the general result of the Chinese character type into a general result of the pinyin type, and inputting the general result of the pinyin type into the preset second model for calculation to obtain the vertical result.

In the embodiment of the application, the general result of the character type is converted into pinyin, and the pinyin is input into the vertical model (such as a WFST model) for vertical recognition, so that the parameters of the vertical model can be reduced, the amount of calculation is reduced, and the calculation efficiency can be improved.
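As a rough illustration of the character-to-pinyin conversion, the sketch below uses a tiny hypothetical lookup table; a real system would presumably use a full pronunciation dictionary, and the table contents and the `<unk>` placeholder are assumptions. It also shows why pinyin input can help: homophonous characters collapse to the same pinyin sequence, so the vertical model sees fewer distinct input symbols.

```python
# Hypothetical, deliberately tiny character-to-pinyin table (toneless pinyin).
CHAR_TO_PINYIN = {"北": "bei", "背": "bei", "京": "jing", "景": "jing"}


def to_pinyin(text, table=CHAR_TO_PINYIN, unknown="<unk>"):
    """Map each character to its pinyin; unknown characters get a placeholder."""
    return [table.get(ch, unknown) for ch in text]


# A misrecognized homophone yields the same pinyin sequence as the intended word.
same = to_pinyin("北京") == to_pinyin("背景")
```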

In one possible implementation, the preset second model is a weighted finite state transducer WFST model.

In one possible implementation manner, the WFST model includes a dictionary WFST and a language WFST, wherein the dictionary WFST is obtained by phonetically annotating segmented words with a preset pronunciation dictionary.

In one possible implementation manner, the pronunciation dictionary includes one of a pronunciation dictionary based on initials and finals, a dictionary based on pinyin, and a dictionary based on Chinese characters.

In one possible implementation manner, the pronunciation dictionary comprises a rarely-used word dictionary and a filtered Chinese character set, wherein the filtered Chinese character set comprises confused Chinese characters and fuzzy-phonetic Chinese characters, the confused Chinese characters and the fuzzy-phonetic Chinese characters have corresponding weights, and the rarely-used word dictionary is obtained by performing word frequency statistics on text corpora.

In the embodiment of the application, using the weighted pronunciation dictionary in the vertical model can improve the accuracy of vertical recognition in scenarios involving environmental noise, individual accents, and uncommon words.

In one possible implementation manner, obtaining the speech recognition result based on the vertical result and the general result includes:

calculating clustering results and confidence degrees of the vertical result and the general result based on the vertical result and the general result; and

determining the voice recognition result based on the clustering results and the confidence degrees of the vertical result and the general result.

In the embodiment of the application, the accuracy of the voice recognition can be improved by clustering the general result and the vertical result and calculating the corresponding confidence degrees.

In one possible implementation manner, calculating the clustering results of the vertical result and the general result based on the vertical result and the general result includes:

acquiring a general clustering center and a vertical clustering center; the general clustering centers comprise a general correct identification clustering center and a general error identification clustering center, and the vertical clustering centers comprise a vertical correct identification clustering center and a vertical error identification clustering center;

if the general result is closer to the general correct identification clustering center than to the general error identification clustering center, the clustering result of the general result is a positive class, and if the general result is closer to the general error identification clustering center than to the general correct identification clustering center, the clustering result of the general result is a negative class; and if the vertical result is closer to the vertical correct identification clustering center than to the vertical error identification clustering center, the clustering result of the vertical result is a positive class, and if the vertical result is closer to the vertical error identification clustering center than to the vertical correct identification clustering center, the clustering result of the vertical result is a negative class.
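The nearest-center rule above can be sketched as follows, assuming each result is summarized by a scalar score and each clustering center is a scalar as well. The example values are hypothetical, and the handling of an exact tie (assigned to the negative class here) is an assumption, since the text does not cover that case.

```python
def cluster_label(score, correct_center, error_center):
    """Return 'positive' if the score is nearer the correct-identification center."""
    if abs(score - correct_center) < abs(score - error_center):
        return "positive"
    # Nearer the error-identification center; ties also land here (assumption).
    return "negative"


# Hypothetical centers: 0.9 for correct identification, 0.3 for error identification.
general_label = cluster_label(0.8, 0.9, 0.3)
vertical_label = cluster_label(0.4, 0.9, 0.3)
```

The same function serves both the general result and the vertical result; only the pair of centers passed in differs.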

In one possible implementation, calculating the confidence degrees of the vertical result and the general result based on the vertical result and the general result includes:

if the clustering result of the general result is a positive class, the confidence of the general result can be calculated by the following formula:

Conf_g = 100% - abs(Pt - Gp)/abs(Gp - Gn);

where Conf_g is the confidence of the general result, Pt is the score of the general result, Gp is the general correct identification clustering center, and Gn is the general error identification clustering center;

if the clustering result of the general result is a negative class, the confidence of the general result can be calculated by the following formula:

Conf_g = 100% - abs(Pt - Gn)/abs(Gp - Gn);

if the clustering result of the vertical result is a positive class, the confidence of the vertical result can be calculated by the following formula:

Conf_w = 100% - abs(Pw - Wp)/abs(Wp - Wn);

where Conf_w is the confidence of the vertical result, Pw is the score of the vertical result, Wp is the vertical correct identification clustering center, and Wn is the vertical error identification clustering center;

if the clustering result of the vertical result is a negative class, the confidence of the vertical result can be calculated by the following formula:

Conf_w = 100% - abs(Pw - Wn)/abs(Wp - Wn).
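The four formulas share one shape: the confidence is 100% minus the distance from the score to the center of the class the result fell into, normalized by the distance between the two centers. A minimal sketch, assuming scalar scores and centers and expressing 100% as 1.0 (the function name and example values are hypothetical):

```python
def confidence(score, correct_center, error_center, label):
    """Confidence in [.., 1.0]; 1.0 means the score sits exactly on its class center."""
    # Positive class measures distance to the correct-identification center,
    # negative class measures distance to the error-identification center.
    nearest = correct_center if label == "positive" else error_center
    return 1.0 - abs(score - nearest) / abs(correct_center - error_center)


# Hypothetical values: Pt = 0.8, Gp = 0.9, Gn = 0.3, positive class.
conf_g = confidence(0.8, 0.9, 0.3, "positive")
# Hypothetical values: Pw = 0.4, Wp = 0.9, Wn = 0.3, negative class.
conf_w = confidence(0.4, 0.9, 0.3, "negative")
```

Note that the sketch raises `ZeroDivisionError` if the two centers coincide; the formulas in the text implicitly assume distinct centers.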

In one possible implementation manner, determining the speech recognition result based on the clustering results and the confidence degrees of the vertical result and the general result includes:

if the clustering result of the general result is a positive class and the clustering result of the vertical result is a negative class; or

the clustering result of the general result is a positive class, the clustering result of the vertical result is a positive class, and the confidence of the general result is greater than that of the vertical result; or

the clustering result of the general result is a negative class, the clustering result of the vertical result is a negative class, and the confidence of the general result is less than that of the vertical result;

determining the general result as the final voice recognition result; and

if the clustering result of the general result is a negative class and the clustering result of the vertical result is a positive class; or

the clustering result of the general result is a positive class, the clustering result of the vertical result is a positive class, and the confidence of the general result is less than that of the vertical result; or

the clustering result of the general result is a negative class, the clustering result of the vertical result is a negative class, and the confidence of the general result is greater than that of the vertical result;

determining the vertical result as the final voice recognition result.
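The six conditions reduce to three rules, sketched below with hypothetical names. One reading of the negative/negative case is that a negative-class confidence measures closeness to the error-identification center, so the result with the lower confidence is the one less likely to be wrong. The text does not say what happens when the two confidences are equal; the sketch returns the vertical result in that case, which is an assumption.

```python
def choose_result(general, vertical, g_label, w_label, g_conf, w_conf):
    """Pick the final result from cluster labels and confidences."""
    if g_label != w_label:
        # Exactly one result is in the positive class: take that one.
        return general if g_label == "positive" else vertical
    if g_label == "positive":
        # Both positive: the higher confidence wins.
        return general if g_conf > w_conf else vertical
    # Both negative: the lower confidence wins.
    return general if g_conf < w_conf else vertical


picked = choose_result("背景", "北京", "negative", "positive", 0.7, 0.9)
```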

In a second aspect, an embodiment of the present application provides a speech recognition apparatus, including:

the acquisition module is used for acquiring the voice to be recognized;

the first calculation module is used for calculating the voice to be recognized by using a preset first model to obtain a general result;

the second calculation module is used for inputting the general result into a preset second model for calculation to obtain a vertical result;

and the fusion module is used for obtaining a voice recognition result based on the vertical result and the general result.

In one possible implementation manner, the type of the general result is a Chinese character, and the second calculation module is further configured to convert the general result of the Chinese character type into a pinyin-type general result, and input the pinyin-type general result into a preset second model for calculation to obtain a vertical result.

In one possible implementation, the WFST model includes a dictionary WFST and a language WFST, wherein the dictionary WFST is obtained by phonetically annotating segmented words with a preset pronunciation dictionary.

In one possible implementation manner, the fusion module is further used for

acquiring a general clustering center and a vertical clustering center; the general clustering centers comprise general correct identification clustering centers and general error identification clustering centers, and the vertical clustering centers comprise vertical correct identification clustering centers and vertical error identification clustering centers;

In one possible implementation manner, the fusion module is further used for

Conf_g = 100% - abs(Pt - Gn)/abs(Gp - Gn);

Conf_w = 100% - abs(Pw - Wn)/abs(Wp - Wn).

In one possible implementation manner, the fusion module is further used for

The clustering result of the general result is a negative type, the clustering result of the vertical type result is a negative type, and the confidence coefficient of the general result is less than that of the vertical type result;

determining the general result as a final voice recognition result;

the vertical class result is determined as a final voice recognition result.

In a third aspect, an embodiment of the present application provides an electronic device, including:

a memory, where the memory is used to store computer program code, where the computer program code includes instructions, and when the electronic device reads the instructions from the memory, the electronic device executes the following steps:

acquiring a voice to be recognized;

calculating the voice to be recognized by using a preset first model to obtain a general result;

inputting the general result into a preset second model for calculation to obtain a vertical result;

and obtaining a voice recognition result based on the vertical result and the general result.

In one possible implementation manner, the type of the general result is a Chinese character, and when the instructions are executed by the electronic device, the step, performed by the electronic device, of inputting the general result into the preset second model for calculation to obtain the vertical result includes:

converting the general result of the Chinese character type into a general result of the pinyin type, and inputting the general result of the pinyin type into the preset second model for calculation to obtain the vertical result.

In one possible implementation manner, the pronunciation dictionary includes one of a pronunciation dictionary based on initial consonants and vowels, a dictionary based on pinyin, and a dictionary based on Chinese characters.

In one possible implementation manner, when the instruction is executed by the electronic device, the step of the electronic device executing the obtaining of the speech recognition result based on the vertical result and the general result includes:

In one possible implementation manner, when the instruction is executed by the electronic device, the step of the electronic device executing a clustering result that calculates the vertical result and the common result based on the vertical result and the common result includes:

In one possible implementation manner, when the instruction is executed by the electronic device, the step of the electronic device calculating the confidence levels of the vertical result and the common result based on the vertical result and the common result includes:

Conf_g = 100% - abs(Pt - Gn)/abs(Gp - Gn);

Conf_w = 100% - abs(Pw - Wn)/abs(Wp - Wn).

In one possible implementation manner, when the instruction is executed by the electronic device, the step of determining the speech recognition result based on the clustering results and the confidence degrees of the vertical result and the general result by the electronic device includes:

determining the general result as a final voice recognition result;

The clustering result of the general result is a negative type, the clustering result of the vertical type result is a negative type, and the confidence coefficient of the general result is greater than that of the vertical type result;

the vertical class result is determined as a final voice recognition result.

In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program, which, when run on a computer, causes the computer to perform the method according to the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program, which is configured to perform the method according to the first aspect when the computer program is executed by a computer.

In a possible design, the program of the fifth aspect may be stored in whole or in part on a storage medium packaged with the processor, or in part or in whole on a memory not packaged with the processor.

Drawings

FIG. 1 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of one embodiment of a speech recognition method provided herein;

FIG. 3 is a WFST model construction diagram provided in an embodiment of the present application;

FIG. 4 is a diagram illustrating construction of a pronunciation dictionary according to an embodiment of the present application;

FIG. 5 is a schematic diagram illustrating calculation of vertical results according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating fusion of general results and vertical results provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of clustering result calculation provided in an embodiment of the present application;

FIG. 8 is a schematic diagram illustrating a fusion effect provided by an embodiment of the present application;

FIG. 9 is a schematic flowchart of another embodiment of a speech recognition method provided herein;

FIG. 10 is a diagram illustrating speech recognition effects provided by an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a speech recognition apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may indicate A or B. "And/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone.

In the following, the terms "first", "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, the meaning of "a plurality" is two or more unless otherwise specified.

Speech recognition, also known as automatic speech recognition (ASR), is a technique by which a computer converts speech into the corresponding text. Speech recognition technology is widely applied in scenarios such as vehicle navigation, smart home, social chat, application assistants, and entertainment games.

Due to corpus constraints, speech recognition generally performs well on general-domain speech but has a higher recognition error rate for specific vertical classes, e.g., navigation locations and song names. At present, vertical-class recognition in speech recognition is mostly implemented with a weighted finite state transducer (WFST). A WFST is based on semiring algebra and is internally a directed graph formed by a plurality of state transition arcs, where each state transition arc carries an input symbol, an output symbol, and a corresponding weight. When a WFST is applied to ASR vertical-class recognition, the characters to be recognized are usually phonetically annotated in advance, and the input symbol on a state transition arc is the phonetic annotation corresponding to the character. After the WFST graph is constructed, the result of the acoustic model is input at recognition time, and a path search is performed in the WFST graph to obtain the path with the highest probability.

Based on the foregoing problem, the embodiment of the present application provides a speech recognition method, which is applied to the electronic device 100. The electronic device 100 may be, for example, a smart robot, a vehicle-mounted terminal, a mobile phone, a tablet, or other smart terminal. The embodiment of the present application does not particularly limit the specific form of the electronic device 100 for implementing the technical solution.

An exemplary electronic device provided in the following embodiments of the present application is first described below with reference to fig. 1. Fig. 1 shows a schematic structural diagram of an electronic device 100.

The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a key 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.

It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.

Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors.

The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.

A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or uses repeatedly. If the processor 110 needs to use the instructions or data again, they can be called directly from this memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby increasing the efficiency of the system.

In some embodiments, processor 110 may include one or more interfaces. The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.

The I2C interface is a bidirectional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In some embodiments, processor 110 may include multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, the charger, the flash, the camera 193, etc. through different I2C bus interfaces, respectively. For example: the processor 110 may be coupled to the touch sensor 180K through an I2C interface, so that the processor 110 and the touch sensor 180K communicate through an I2C bus interface to implement a touch function of the electronic device 100.

The I2S interface may be used for audio communication. In some embodiments, processor 110 may include multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 through an I2S bus, enabling communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through the I2S interface, so as to implement the function of answering a call through a Bluetooth headset.

The PCM interface may also be used for audio communication, and for sampling, quantizing, and encoding analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled by a PCM bus interface. In some embodiments, the audio module 170 may also transmit the audio signal to the wireless communication module 160 through the PCM interface, so as to implement the function of answering a call through a Bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.

The UART interface is a universal serial data bus used for asynchronous communications. The bus may be a bidirectional communication bus. It converts the data to be transmitted between serial and parallel forms. In some embodiments, a UART interface is generally used to connect the processor 110 and the wireless communication module 160. For example: the processor 110 communicates with a Bluetooth module in the wireless communication module 160 through a UART interface to implement a Bluetooth function. In some embodiments, the audio module 170 may transmit the audio signal to the wireless communication module 160 through a UART interface, so as to implement the function of playing music through a Bluetooth headset.

MIPI interfaces may be used to connect processor 110 with peripheral devices such as display screen 194, camera 193, and the like. The MIPI interface includes a Camera Serial Interface (CSI), a Display Serial Interface (DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the capture functionality of electronic device 100. Processor 110 and display screen 194 communicate via a DSI interface to implement display functions of electronic device 100.

The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal and may also be configured as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, I2S interface, UART interface, MIPI interface, and the like.

The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, and may also be used to transmit data between the electronic device 100 and a peripheral device. It may also be used to connect an earphone and play audio through the earphone. The interface may also be used to connect other electronic devices, such as AR devices.

It should be understood that the connection relationship between the modules illustrated in the embodiment of the present invention is only an example and does not constitute a structural limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.

The charging management module 140 is configured to receive a charging input from a charger. The charger may be a wireless charger or a wired charger. In some wired charging embodiments, the charging management module 140 may receive charging input from a wired charger via the USB interface 130. In some wireless charging embodiments, the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100. The charging management module 140 may also supply power to the electronic device through the power management module 141 while charging the battery 142.

The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the display 194, the camera 193, the wireless communication module 160, and the like. The power management module 141 may also be used to monitor parameters such as battery capacity, battery cycle count, battery state of health (leakage, impedance), etc. In some other embodiments, the power management module 141 may also be disposed in the processor 110. In other embodiments, the power management module 141 and the charging management module 140 may also be disposed in the same device.

The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.

The antennas 1 and 2 are used for transmitting and receiving electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single or multiple communication bands. Different antennas can also be multiplexed to improve the utilization of the antennas. For example: the antenna 1 may be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, the antenna may be used in conjunction with a tuning switch.

The mobile communication module 150 may provide a solution including 2G/3G/4G/5G wireless communication applied to the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a Low Noise Amplifier (LNA), and the like. The mobile communication module 150 may receive the electromagnetic wave from the antenna 1, filter, amplify, etc. the received electromagnetic wave, and transmit the electromagnetic wave to the modem processor for demodulation. The mobile communication module 150 may also amplify the signal modulated by the modem processor, and convert the signal into electromagnetic wave through the antenna 1 to radiate the electromagnetic wave. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the processor 110. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be disposed in the same device as at least some of the modules of the processor 110.

The modem processor may include a modulator and a demodulator. The modulator is used for modulating a low-frequency baseband signal to be transmitted into a medium-high frequency signal. The demodulator is used for demodulating the received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then passes the demodulated low frequency baseband signal to a baseband processor for processing. The low frequency baseband signal is processed by the baseband processor and then passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.) or displays an image or video through the display screen 194. In some embodiments, the modem processor may be a stand-alone device. In other embodiments, the modem processor may be provided in the same device as the mobile communication module 150 or other functional modules, independent of the processor 110.

The wireless communication module 160 may provide a solution for wireless communication applied to the electronic device 100, including wireless local area network (WLAN) (e.g., a wireless fidelity (Wi-Fi) network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering processing on electromagnetic wave signals, and transmits the processed signals to the processor 110. The wireless communication module 160 may also receive a signal to be transmitted from the processor 110, perform frequency modulation and amplification on the signal, and convert the signal into electromagnetic waves through the antenna 2 for radiation.

In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150 and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division synchronous code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, etc. The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a BeiDou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).

The electronic device 100 implements display functions via the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.

The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.

The electronic device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display 194, the application processor, and the like.

The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.

The camera 193 is used to capture still images or video. An object generates an optical image through the lens, and the optical image is projected onto the photosensitive element. The photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then passed to the ISP, where it is converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In some embodiments, the electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.

The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to perform fourier transform or the like on the frequency bin energy.

Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as Moving Picture Experts Group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and the like.

The NPU is a neural-network (NN) computing processor that processes input information quickly by using a biological neural network structure, for example, by using a transfer mode between neurons of a human brain, and can also learn by itself continuously. Applications such as intelligent recognition of the electronic device 100 can be realized through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and the like.

The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.

The internal memory 121 may be used to store computer-executable program code, which includes instructions. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like. The data storage area may store data created during use of the electronic device 100 (such as audio data and a phone book), and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, or a universal flash storage (UFS). The processor 110 executes various functional applications and data processing of the electronic device 100 by executing instructions stored in the internal memory 121 and/or instructions stored in a memory provided in the processor.

The electronic device 100 may implement audio functions, such as music playing and recording, via the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset interface 170D, and the application processor.

The audio module 170 is used to convert digital audio information into analog audio signals for output, and also used to convert analog audio inputs into digital audio signals. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or some functional modules of the audio module 170 may be disposed in the processor 110.

The speaker 170A, also called a "horn", is used to convert an audio electrical signal into a sound signal. The user can listen to music or to a hands-free call through the speaker 170A of the electronic device 100.

The receiver 170B, also called an "earpiece", is used to convert an audio electrical signal into a sound signal. When the electronic device 100 receives a call or voice information, the user can listen to the voice by placing the receiver 170B close to the ear.

The microphone 170C, also referred to as a "mic", is used to convert sound signals into electrical signals. When making a call or sending voice information, the user can input a sound signal to the microphone 170C by speaking with the mouth close to the microphone 170C. The electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C to achieve a noise reduction function in addition to collecting sound signals. In other embodiments, the electronic device 100 may further include three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, perform directional recording, and so on.

The headset interface 170D is used to connect a wired headset. The headset interface 170D may be the USB interface 130, or may be a 3.5 mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.

The pressure sensor 180A is used to sense a pressure signal and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are many types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, and capacitive pressure sensors. A capacitive pressure sensor may include at least two parallel plates made of an electrically conductive material. When a force acts on the pressure sensor 180A, the capacitance between the electrodes changes. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the intensity of the touch operation by means of the pressure sensor 180A. The electronic device 100 may also calculate the touched position from the detection signal of the pressure sensor 180A. In some embodiments, touch operations that are applied to the same touch position but have different intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction for viewing the short message is executed. When a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the short message application icon, an instruction for creating a new short message is executed.

The gyro sensor 180B may be used to determine the motion attitude of the electronic device 100. In some embodiments, the angular velocity of electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by gyroscope sensor 180B. The gyro sensor 180B may be used for photographing anti-shake. Illustratively, when the shutter is pressed, the gyro sensor 180B detects a shake angle of the electronic device 100, calculates a distance to be compensated for the lens module according to the shake angle, and allows the lens to counteract the shake of the electronic device 100 through a reverse movement, thereby achieving anti-shake. The gyroscope sensor 180B may also be used for navigation, somatosensory gaming scenes.

The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude, aiding in positioning and navigation, from barometric pressure values measured by barometric pressure sensor 180C.

The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip holster using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip cover by means of the magnetic sensor 180D. Features such as automatic unlocking upon flip opening can then be set according to the detected opening and closing state of the holster or of the flip cover.

The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically three axes). The magnitude and direction of gravity can be detected when the electronic device 100 is stationary. The acceleration sensor 180E can also be used to identify the posture of the electronic device, and is applied to landscape/portrait switching, pedometers and the like.

The distance sensor 180F is used to measure distance. The electronic device 100 may measure distance by infrared or laser. In some embodiments, when shooting a scene, the electronic device 100 may use the distance sensor 180F to measure distance for fast focusing.

The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector, such as a photodiode. The light-emitting diode may be an infrared light-emitting diode. The electronic device 100 emits infrared light to the outside through the light-emitting diode. The electronic device 100 detects infrared light reflected from a nearby object using the photodiode. When sufficient reflected light is detected, the electronic device 100 can determine that there is an object nearby. When insufficient reflected light is detected, the electronic device 100 can determine that there is no object nearby. The electronic device 100 can use the proximity light sensor 180G to detect that the user is holding the electronic device 100 close to the ear during a call, so as to automatically turn off the screen to save power. The proximity light sensor 180G may also be used in a holster mode and a pocket mode to automatically unlock and lock the screen.

The ambient light sensor 180L is used to sense ambient light brightness. Electronic device 100 may adaptively adjust the brightness of display screen 194 based on the perceived ambient light level. The ambient light sensor 180L can also be used to automatically adjust the white balance when taking a picture. The ambient light sensor 180L may also cooperate with the proximity light sensor 180G to detect whether the electronic device 100 is in a pocket to prevent accidental touches.

The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint photographing, fingerprint-based call answering, and so on.

The temperature sensor 180J is used to detect temperature. In some embodiments, electronic device 100 implements a temperature processing strategy using the temperature detected by temperature sensor 180J. For example, when the temperature reported by the temperature sensor 180J exceeds a threshold, the electronic device 100 performs a reduction in performance of a processor located near the temperature sensor 180J, so as to reduce power consumption and implement thermal protection. In other embodiments, the electronic device 100 heats the battery 142 when the temperature is below another threshold to avoid the low temperature causing the electronic device 100 to shut down abnormally. In other embodiments, when the temperature is lower than a further threshold, the electronic device 100 performs boosting on the output voltage of the battery 142 to avoid abnormal shutdown due to low temperature.

The touch sensor 180K is also called a "touch device". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 together form a touch screen, also called a "touch-controlled screen". The touch sensor 180K is used to detect a touch operation applied on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on a surface of the electronic device 100 at a position different from that of the display screen 194.

The bone conduction sensor 180M may acquire a vibration signal. In some embodiments, the bone conduction sensor 180M may acquire the vibration signal of a bone that vibrates with the human vocal part. The bone conduction sensor 180M may also contact the human pulse to receive a blood pressure pulsation signal. In some embodiments, the bone conduction sensor 180M may also be disposed in a headset to form a bone conduction headset. The audio module 170 may parse out a voice signal based on the vibration signal of the vocal-part bone acquired by the bone conduction sensor 180M, so as to implement a voice function. The application processor can parse heart rate information based on the blood pressure pulsation signal acquired by the bone conduction sensor 180M, so as to implement a heart rate detection function.

The keys 190 include a power key, a volume key, and the like. The keys 190 may be mechanical keys or touch keys. The electronic device 100 may receive a key input and generate a key signal input related to user settings and function control of the electronic device 100.

The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration cues, as well as for touch vibration feedback. For example, touch operations applied to different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also respond to different vibration feedback effects for touch operations applied to different areas of the display screen 194. Different application scenes (such as time reminding, receiving information, alarm clock, game and the like) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.

Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.

The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into contact with or separated from the electronic device 100 by being inserted into or pulled out of the SIM card interface 195. The electronic device 100 may support 1 or N SIM card interfaces, N being a positive integer greater than 1. The SIM card interface 195 may support a Nano SIM card, a Micro SIM card, a SIM card, etc. Multiple cards can be inserted into the same SIM card interface 195 at the same time. The types of the plurality of cards may be the same or different. The SIM card interface 195 may also be compatible with different types of SIM cards. The SIM card interface 195 is also compatible with external memory cards. The electronic device 100 interacts with the network through the SIM card to implement functions such as calls and data communication. In some embodiments, the electronic device 100 employs an eSIM, namely an embedded SIM card. The eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.

The speech recognition method provided by the embodiment of the present application will now be described with reference to fig. 2 to 10.

Fig. 2 is a schematic flowchart illustrating an embodiment of a speech recognition method according to an embodiment of the present application, including:

Step 201, obtaining the voice to be recognized.

Specifically, the voice to be recognized may be a voice collected by the electronic device 100, for example, the electronic device 100 may collect the voice through a microphone. It is to be understood that the above manner of collecting voice through a microphone is only an exemplary illustration and does not constitute a limitation to the embodiments of the present application, and in some embodiments, voice may be collected through other manners.

Step 202, inputting the voice to be recognized into a preset voice recognition model for calculation to obtain a general result.

Specifically, the preset speech recognition model may be a general acoustic model, and the acoustic model may be used for recognizing the speech to be recognized to obtain a general recognition result (for convenience of explanation, the "general recognition result" is hereinafter referred to as a "general result"). The general result is a general grammar and words, that is, the general result does not include the identification of a specific vertical class.

Step 203, inputting the general result into a preset WFST model for calculation to obtain a vertical result.

Specifically, the Weighted Finite State Transducer (WFST) model can be obtained by pre-training.

Next, an exemplary WFST model training method will be described.

Fig. 3 is a schematic flowchart of a method for training a WFST model according to an embodiment of the present application, including:

Step 2031, constructing a pronunciation dictionary.

Specifically, the pronunciation dictionary may include a pronunciation dictionary based on initials and finals, a pronunciation dictionary based on pinyin, and a pronunciation dictionary based on Chinese characters. It is understood that these three types of pronunciation dictionary are only illustrative and do not limit the embodiments of the present application, and in some embodiments, other types of pronunciation dictionaries may be included.

It should be noted that the pronunciation dictionary can be constructed according to the preset speech recognition model in step 202. For example, if the general result output by the preset speech recognition model in the step 202 is pinyin, the pronunciation dictionary constructed in this step is a pronunciation dictionary based on pinyin.

Table 1 exemplarily shows a pronunciation dictionary based on initials and finals.

TABLE 1

Character(s)	Initial	Final
Eating food	ch	i1
Xi Huan	x, h	i3, uan5
Grab	zh	ua1
Ham	h, t	uo3, ui3

As shown in table 1, the pronunciation dictionary based on initials and finals is used to annotate characters to obtain initials and finals corresponding to the characters.

Table 2 illustrates an exemplary pinyin-based pronunciation dictionary.

TABLE 2

Character(s)	Pinyin
Eating	chi1
Xi Huan	xi3huan2
Grab	zhua1
Ham	huo3tui3

As shown in table 2, the pronunciation dictionary based on pinyin is used to annotate characters to obtain pinyin corresponding to the characters.

Table 3 exemplarily shows a pronunciation dictionary based on chinese characters.

TABLE 3

Current character(s)	Mapped common character(s)
MI	Rice and its production process
Eating	Eating
Xi Huan	Xi Huan

As shown in table 3, the pronunciation dictionary based on Chinese characters is used to annotate characters to obtain the common characters corresponding to the current characters. For example, if the current character is a common character, the corresponding character is that same common character, and if the current character is an uncommon character, the uncommon character can be mapped to a common character.

Next, the construction of a pronunciation dictionary will be described by taking a pronunciation dictionary based on Chinese characters as an example. It is understood that, in this case, the general result output by the preset speech recognition model in step 202 consists of Chinese characters.

Fig. 4 is a flowchart of a method for constructing a pronunciation dictionary based on chinese characters according to an embodiment of the present application, including:

Step 20311, obtaining the speech corpus and the text corpus.

Specifically, the speech corpus and the text corpus may be samples for training. In a specific implementation, the speech corpus may be collected speech, and the text corpus may be collected text, and the source of the speech corpus and the source of the text corpus are not particularly limited in this embodiment of the present application.

Step 20312, inputting the speech corpus into the speech recognition model to obtain a text output, and obtaining an initial Chinese character set based on the text output and the label text.

Specifically, the initial Chinese character set represents the Chinese character set obtained by recognizing the speech corpus and counting the recognition results. In a specific implementation, the speech corpus may first be input into the speech recognition model to obtain a text output, where the text output includes the text result obtained by recognizing the speech corpus. Then, the text output is compared against the label text and the recognized characters are counted, so that the initial Chinese character set can be obtained. The label text is the true text corresponding to the speech corpus.

Table 4 shows an exemplary set of initial chinese characters.

TABLE 4

Sundries	Combination of|1676179	How|2763	'|1899	Is|1021	Spirit|516	Time|408
For feeding	To|1528185	Heel|3098	You|844	each|760	I|476	Upper|459
Electric power	Electric|1813104	Store|1974	Point|1070	Ding|480	Day|366	|285
Telephone	Word|1572202	Picture|1878	Change|1631	Flower|851	of|611	I|309

As shown in Table 4, the left-most column is the character to be recognized, and columns 2 to 7 are the set of Chinese characters it was recognized as. The number following each character indicates how many times the character to be recognized was recognized as that character.
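The counting in step 20312 can be sketched as follows. This is a minimal illustration rather than the method of the embodiment: it assumes that each recognized text has the same length as its label text, so that characters can be paired position by position (a real system would need an alignment step), and the function name `build_initial_char_set` and the sample pairs are hypothetical.

```python
from collections import Counter, defaultdict


def build_initial_char_set(pairs):
    """Count, for every label character, which characters it was recognized as.

    pairs: iterable of (label_text, recognized_text) strings.
    Assumption: both strings align character by character.
    """
    confusions = defaultdict(Counter)
    for label, recognized in pairs:
        if len(label) != len(recognized):
            continue  # would require a real alignment step; skipped in this sketch
        for label_char, recognized_char in zip(label, recognized):
            confusions[label_char][recognized_char] += 1
    return confusions


# Hypothetical (label text, recognized text) pairs, for illustration only.
samples = [("电话", "电话"), ("电话", "店话"), ("电话", "电画"), ("电", "点")]
initial_set = build_initial_char_set(samples)

for label_char, counts in initial_set.items():
    print(label_char, dict(counts))
```

Each entry of `initial_set` corresponds to one row of a table such as Table 4: the key is the character to be recognized, and the counter holds the recognized characters with their frequencies.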

Step 20312, filter the initial Chinese character set to obtain a filtered Chinese character set.

Specifically, the filtering may be performed with a preset fuzzy sound dictionary. Because the recognized Chinese characters may contain wrong outputs, the filtering removes wrong and completely irrelevant Chinese characters and keeps Chinese characters that are easily confused or have a fuzzy sound.

In a specific implementation, the filtering method may include the following. First, a fuzzy sound dictionary of the vowels is obtained, and a fuzzy sound probability matrix is generated. Then, the final Chinese character set is obtained by filtering according to the fuzzy sound probability matrix. For example, the original wrong-sound distribution model is corrected by the fuzzy sound probability matrix, that is, wrong sounds that do not meet the requirement are filtered out, so that a sparse probability matrix of error-prone characters can be generated. The final filtered Chinese character set includes confusable Chinese characters (same sound and same tone, or same sound and different tone) and fuzzy-sound Chinese characters.
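As a rough illustration of this filtering, the sketch below keeps only candidates whose pronunciation matches the target character either exactly or up to a fuzzy-sound rule. It deliberately replaces the fuzzy sound probability matrix of the embodiment with a simple yes/no rule. The toneless pinyin lookup `PINYIN`, the rule list `FUZZY_INITIALS`, the function names and the counts are all hypothetical.

```python
from collections import Counter

# Hypothetical toneless pinyin lookup (not taken from the embodiment).
PINYIN = {"是": "shi", "事": "shi", "四": "si", "我": "wo"}

# Hypothetical fuzzy-sound rules: a retroflex initial is treated like its flat counterpart.
FUZZY_INITIALS = (("zh", "z"), ("ch", "c"), ("sh", "s"))


def fuzzy_key(syllable):
    """Map a syllable to a key shared by all syllables that count as fuzzy-equal."""
    for retroflex, flat in FUZZY_INITIALS:
        if syllable.startswith(retroflex):
            return flat + syllable[len(retroflex):]
    return syllable


def filter_candidates(target, candidates):
    """Keep candidates that are homophones or fuzzy-sound matches of the target."""
    target_key = fuzzy_key(PINYIN[target])
    kept = Counter()
    for char, count in candidates.items():
        syllable = PINYIN.get(char)
        if syllable is not None and fuzzy_key(syllable) == target_key:
            kept[char] = count
    return kept


# Hypothetical initial set for one target character.
initial = Counter({"是": 1000, "事": 70, "四": 50, "我": 8})
filtered = filter_candidates("是", initial)
print(dict(filtered))
```

A probability matrix, as described above, would allow graded rather than binary decisions, for example by keeping a candidate only if its fuzzy-sound probability exceeds some threshold.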

Taking table 4 as an example, after filtering the initial chinese character set shown in table 4, table 5 can be obtained, and table 5 exemplarily shows the filtered chinese character set.

TABLE 5

To give	To|1528185
Electric power	Electric|1813104	Store|1974	Point|1070	Palace|83	Pad|79
Telephone	Word|1572202	Drawing|1878	Change|1631	Flower|851	Hua|254	Line|137

Step 20313, performing word frequency statistics on the text corpus to obtain the uncommon word dictionary.

Specifically, the word frequency statistics may be used to count the frequency of occurrence of each chinese character in the text corpus, determine frequently used words and rarely used words based on the frequency, and map the rarely used words into frequently used words. For example, the frequency of each chinese character may be compared with a preset frequency threshold, and if the frequency of the chinese character is greater than or equal to the preset frequency threshold, the chinese character may be considered as a frequently used word, and if the frequency of the chinese character is less than the preset frequency threshold, the chinese character may be considered as a rare word. Next, the uncommon word can be mapped to a homophonic common word.
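The thresholding and mapping described above might look like the following sketch. The function name `build_uncommon_char_dict`, the toy corpus, the threshold value and the toneless pinyin table are hypothetical. Choosing the most frequent homophone as the mapping target is also an assumption, since the text only requires a homophonic common word.

```python
from collections import Counter


def build_uncommon_char_dict(text_corpus, threshold, pinyin):
    """Map each uncommon character to a homophonic common character, if one exists."""
    freq = Counter(ch for line in text_corpus for ch in line if not ch.isspace())
    # Common characters, most frequent first (ties broken by code point for determinism).
    common = sorted(
        (ch for ch, n in freq.items() if n >= threshold),
        key=lambda ch: (-freq[ch], ch),
    )
    mapping = {}
    for ch, n in freq.items():
        if n >= threshold or ch not in pinyin:
            continue  # already common, or pronunciation unknown
        for candidate in common:
            if pinyin.get(candidate) == pinyin[ch]:
                mapping[ch] = candidate
                break
    return mapping


# Hypothetical data, for illustration only.
corpus = ["米饭", "大米", "小米", "糜"]
toneless_pinyin = {"米": "mi", "糜": "mi", "饭": "fan", "大": "da", "小": "xiao"}
uncommon_dict = build_uncommon_char_dict(corpus, threshold=2, pinyin=toneless_pinyin)
print(uncommon_dict)
```

Characters below the threshold that have no homophonic common character are simply left unmapped in this sketch.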

Step 20314, obtaining a pronunciation dictionary based on Chinese characters from the filtered Chinese character set and the uncommon word dictionary.

Specifically, after the filtered Chinese character set and the uncommon word dictionary are obtained, they can be combined to generate a pronunciation dictionary based on Chinese characters. It should be noted that, in the pronunciation dictionary based on Chinese characters, each Chinese character may also have a corresponding weight. The weight may be obtained from the frequency of each Chinese character in the filtered Chinese character set. Illustratively, the frequency of each Chinese character can be obtained and normalized to give the weight of that Chinese character.

Table 6 shows an exemplary pronunciation dictionary based on chinese characters.

TABLE 6

Electricity	Electric|0.75	Point|0.15
Telephone	Speech|0.55	Equation|0.30

As shown in table 6, each Chinese character has a corresponding weight in the pronunciation dictionary based on Chinese characters. Taking the Chinese character "electricity" as an example, the weight of it being recognized as "electric" is 0.75, and the weight of it being recognized as "point" is 0.15.
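One simple way to turn frequencies into weights is to divide each frequency by the row total, as sketched below. This sum normalization is an assumption: the embodiment only states that the frequencies are normalized, and the weights in Table 6 need not come from this exact formula. The function name and the counts are hypothetical.

```python
def normalize_weights(frequencies):
    """Convert raw recognition counts for one target character into weights that sum to 1."""
    total = sum(frequencies.values())
    if total == 0:
        return {char: 0.0 for char in frequencies}
    return {char: count / total for char, count in frequencies.items()}


# Hypothetical counts for one target character.
weights = normalize_weights({"electric": 75, "point": 15, "store": 10})
print(weights)
```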

Step 2032, obtaining the training corpus, and performing word segmentation on the training corpus.

Specifically, the training corpus may be training text, and it is understood that the source of the training corpus is not particularly limited in the embodiment of the present application. After the training corpus is obtained, it may be segmented into words. It should be understood that the word segmentation may be performed by a common word segmentation tool, and the embodiment of the present application does not specifically limit the word segmentation manner.

Step 2033, annotating the segmented words with pronunciations to obtain a dictionary WFST.

Specifically, the annotation may be implemented with a pronunciation dictionary (e.g., a pronunciation dictionary based on initials and finals, a pronunciation dictionary based on pinyin, a pronunciation dictionary based on Chinese characters, etc.). For example, the segmented words may be annotated by the pronunciation dictionary to obtain a dictionary WFST. The dictionary WFST may be regarded as a WFST submodel.

Step 2034, training on the segmented words to obtain the language WFST.

Specifically, the training on the segmented words can be implemented with an n-gram statistical language model. The language WFST may include the segmented words, the transitions between segmented words, and the corresponding weights. The weights can be obtained by training the n-gram statistical language model.
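To make the source of the transition weights concrete, the sketch below estimates bigram (n = 2) probabilities from segmented sentences by maximum likelihood. It is only an illustration under stated assumptions: the embodiment does not fix the n-gram order or any smoothing method, and the function name, the sentence markers `<s>` and `</s>`, and the toy sentences are hypothetical.

```python
from collections import Counter


def train_bigram(sentences):
    """Estimate P(current word | previous word) from segmented sentences (no smoothing)."""
    history_counts = Counter()
    bigram_counts = Counter()
    for words in sentences:
        tokens = ["<s>"] + list(words) + ["</s>"]
        for previous, current in zip(tokens, tokens[1:]):
            history_counts[previous] += 1
            bigram_counts[(previous, current)] += 1
    return {
        pair: count / history_counts[pair[0]]
        for pair, count in bigram_counts.items()
    }


# Hypothetical segmented training sentences (English glosses used for readability).
segmented = [["call", "telephone"], ["call", "telephone"], ["call", "taxi"]]
transition_weights = train_bigram(segmented)

for (previous, current), probability in sorted(transition_weights.items()):
    print(f"{previous} -> {current}: {probability:.3f}")
```

In a language WFST, each `(previous, current)` pair would correspond to a transition between segmented words, with the estimated probability, or a quantity derived from it, as the transition weight.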

It is understood that step 2033 and step 2034 may be executed in any order.

Step 2035, generating a WFST model based on the dictionary WFST and the language WFST.

Specifically, after the dictionary WFST and the language WFST are acquired, they may be combined to generate a WFST model. That is, the WFST model includes the dictionary WFST and the language WFST.

After the WFST model is trained, the general result may be input into the WFST model for calculation, thereby obtaining a vertical result, that is, a result based on vertical class recognition. All of the general results may be input into the WFST model, or only the general results with higher probabilities may be selected and input. In addition, when the path search is performed in the WFST model, beam search may be used; for example, the beam may be set to 100. It should be understood that the value 100 is merely exemplary and does not constitute a limitation to the embodiments of the present application; in some embodiments, other values may also be used.
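The pruning idea behind beam search can be sketched as follows. This is not a WFST traversal: it is a simplified illustration that extends paths over per-position candidate lists and keeps only the best `beam` hypotheses at each position. The function name `beam_search` and the scoring by summed log probability are assumptions of this sketch.

```python
import heapq
import math


def beam_search(step_candidates, beam=100):
    """Keep the `beam` best partial paths while extending one position at a time.

    step_candidates holds one list per position; each list contains
    (symbol, probability) pairs. Returns (score, symbols) pairs, best first.
    """
    paths = [(0.0, [])]  # (summed log probability, symbols so far)
    for candidates in step_candidates:
        extended = [
            (score + math.log(prob), symbols + [symbol])
            for score, symbols in paths
            for symbol, prob in candidates
            if prob > 0
        ]
        # Prune to the best `beam` hypotheses before the next position.
        paths = heapq.nlargest(beam, extended, key=lambda item: item[0])
    return paths


steps = [
    [("electricity", 0.7), ("dot", 0.15)],
    [("words", 0.81), ("flowers", 0.01)],
]
best_score, best_path = beam_search(steps, beam=2)[0]
```

A smaller beam reduces the number of paths kept in memory at the cost of possibly discarding the path that would have scored best overall.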

The above-described vertical class recognition will now be described with reference to FIG. 5. As shown in FIG. 5, 500 is a general result, which may be one or more Chinese characters and their corresponding probabilities. Taking two Chinese characters as an example, the recognition result of the first character is: "electricity" with probability 0.7, "dot" with probability 0.15, "shop" with probability 0.09, and "classical" with probability 0.05. The recognition result of the second character is: "words" with probability 0.81, two other candidate characters with probabilities 0.07 and 0.05, and "flowers" with probability 0.01. The general result may then be input into the WFST model 510 for calculation, thereby obtaining the candidate path set 520, that is, the vertical result. The candidate path set 520 may include a plurality of candidate paths, each with a corresponding probability; the paths and their probabilities are calculated by the WFST model 510 with the general result as input. Taking the "telephone" combination as an example, the path that finally recognizes "telephone" includes the sub-paths {"electricity: electricity/0.75", "words: words/0.55"}. Since the probability of the character "electricity" is 0.7, the probability of the sub-path "electricity: electricity/0.75" is 0.75, the probability of the character "words" is 0.81, and the probability of the sub-path "words: words/0.55" is 0.55, the total probability of the "telephone" combined path is 0.7+0.75+0.81+0.55=2.81.
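The arithmetic of the "telephone" path can be reproduced with a short sketch. The helper name `path_score` is hypothetical; the additive combination simply follows the worked example above and is not claimed to be the only possible scoring.

```python
def path_score(general_probs, arc_weights):
    """Add the general-result probabilities and the sub-path weights of one path."""
    return sum(general_probs) + sum(arc_weights)


# 0.7 and 0.81 come from the general result; 0.75 and 0.55 are sub-path weights.
total = path_score([0.7, 0.81], [0.75, 0.55])
print(round(total, 2))  # 2.81
```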

Step 204, fusing the general result and the vertical result to obtain a speech recognition result.

Specifically, the fusion can be realized by clustering and confidence. In specific implementation, after the general result and the vertical result are obtained, they may be clustered and their confidences calculated. A comprehensive decision is then made according to the clustering results and the confidences of the general result and the vertical result to obtain the speech recognition result.

The above-described fusion method is now exemplified with reference to fig. 6-8.

Fig. 6 is a schematic flow chart of a fusion method provided in an embodiment of the present application, including:

Step 2041, obtaining a general result clustering center and a vertical result clustering center.

Specifically, the general result clustering center and the vertical result clustering center may be obtained through training. The general result clustering centers can include a first general result clustering center Gp and a second general result clustering center Gn, wherein Gp represents the clustering center of correctly recognized general results, and Gn represents the clustering center of wrongly recognized general results. The vertical result clustering centers can include a first vertical result clustering center Wp and a second vertical result clustering center Wn, wherein Wp represents the clustering center of correctly recognized vertical results, and Wn represents the clustering center of wrongly recognized vertical results. It is to be understood that a clustering algorithm such as k-means may be used, but this does not constitute a limitation to the embodiments of the present application; in some embodiments, other clustering algorithms may also be used.

Step 2042, obtaining a clustering result based on the general result, the vertical result, the general result clustering center and the vertical result clustering center.

Specifically, the clustering result includes a clustering result C_g of the general result and a clustering result C_w of the vertical result, and each clustering result may be a positive class or a negative class. The positive class characterizes a result assigned to the clustering center of correct recognition, and the negative class characterizes a result assigned to the clustering center of wrong recognition. In specific implementation, C_g=1 indicates that the clustering result of the general result is the positive class, and C_g=0 indicates that it is the negative class. C_w=1 indicates that the clustering result of the vertical result is the positive class, and C_w=0 indicates that it is the negative class. First, the score Pt of the general result and the score Pw of the vertical result can be obtained. The score Pt of the general result may be a sum of log probabilities. Taking the "telephone" combination shown in FIG. 5 as an example, if the probability of the character "electricity" is 0.7 and the probability of the character "words" is 0.81, the score of the general result for the "telephone" combination is Pt=log(0.7)+log(0.81). The score Pw of the vertical result may be a sum of the probabilities of each candidate path, and it is understood that Pw may also be calculated with a log operation.
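The score Pt of the "telephone" example can be evaluated with a short sketch. The helper name `log_score` is hypothetical, and the natural logarithm is assumed because the description does not state the base.

```python
import math


def log_score(probabilities):
    """Sum of natural-log probabilities (the log base is an assumption)."""
    return sum(math.log(p) for p in probabilities)


pt = log_score([0.7, 0.81])
print(round(pt, 4))  # -0.5674
```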

The above clustering result will now be described with reference to FIG. 7. As shown in FIG. 7, the value of the first general result clustering center Gp is -0.44, the value of the second general result clustering center Gn is -1.69, the value of the first vertical result clustering center Wp is -19.03, and the value of the second vertical result clustering center Wn is -38.52. The score Pt of the general result can then be compared with the positions of Gp and Gn. Since, in this example, Pt is closer to the first general result clustering center Gp, it can be determined that the general result belongs to the positive class. Similarly, the score Pw of the vertical result can be compared with the positions of Wp and Wn. Since Pw is closer to the second vertical result clustering center Wn, it can be determined that the vertical result belongs to the negative class.
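The nearest-center assignment can be sketched with the center values of FIG. 7. The function name `cluster_label` and the value chosen for Pw are hypothetical; Pt is taken as log(0.7)+log(0.81) from the "telephone" example, with the natural logarithm assumed.

```python
import math


def cluster_label(score, positive_center, negative_center):
    """Return 1 (positive class) if the score is at least as close to the
    positive center as to the negative center, otherwise 0 (negative class)."""
    if abs(score - positive_center) <= abs(score - negative_center):
        return 1
    return 0


GP, GN = -0.44, -1.69     # general result clustering centers from FIG. 7
WP, WN = -19.03, -38.52   # vertical result clustering centers from FIG. 7

pt = math.log(0.7) + math.log(0.81)  # score of the general result
pw = -35.0                           # hypothetical score of the vertical result

c_g = cluster_label(pt, GP, GN)
c_w = cluster_label(pw, WP, WN)
```

Here `c_g` is 1 because Pt lies closer to Gp, and `c_w` is 0 because the assumed Pw lies closer to Wn.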

Step 2043, obtaining confidences based on the general result, the vertical result, the general result clustering center and the vertical result clustering center.

Specifically, the confidence may include the confidence Conf_g of the general result and the confidence Conf_w of the vertical result. If the clustering result of a sample is the positive class, the closer the sample is to the clustering center of correct recognition, the higher its confidence. In a specific implementation, the confidence may be calculated as follows, where abs denotes the absolute value operation.

If the clustering result of the general result of a sample is C_g=1, that is, the positive class, then abs(Pt-Gp)<=abs(Pt-Gn), meaning the sample is closer to the first general result clustering center Gp. The confidence of the general result of the sample is Conf_g=100%-abs(Pt-Gp)/abs(Gp-Gn).

If the clustering result of the general result of a sample is C_g=0, that is, the negative class, the confidence of the general result of the sample is Conf_g=100%-abs(Pt-Gn)/abs(Gp-Gn).

If the clustering result of the vertical result of a sample is C_w=1, that is, the positive class, then abs(Pw-Wp)<=abs(Pw-Wn), meaning the sample is closer to the first vertical result clustering center Wp. The confidence of the vertical result of the sample is Conf_w=100%-abs(Pw-Wp)/abs(Wp-Wn).

If the clustering result of the vertical result of a sample is C_w=0, that is, the negative class, the confidence of the vertical result of the sample is Conf_w=100%-abs(Pw-Wn)/abs(Wp-Wn).
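The four confidence formulas share one shape, which the following minimal sketch makes explicit. The function name `confidence` is hypothetical, 100% is represented as 1.0, and the sample value of Pw is an assumption; the center values are those of FIG. 7.

```python
import math


def confidence(score, label, positive_center, negative_center):
    """Confidence = 1 - distance to the assigned center / distance between centers.

    label is the clustering result: 1 for the positive class, 0 for the negative class.
    """
    span = abs(positive_center - negative_center)
    if span == 0:
        raise ValueError("the two clustering centers must differ")
    assigned = positive_center if label == 1 else negative_center
    return 1.0 - abs(score - assigned) / span


pt = math.log(0.7) + math.log(0.81)          # general score, positive class
conf_g = confidence(pt, 1, -0.44, -1.69)

pw = -35.0                                   # hypothetical vertical score, negative class
conf_w = confidence(pw, 0, -19.03, -38.52)
```

Note that the value can be negative when a score lies farther from its assigned center than the two centers lie from each other.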

Step 2044, determining a speech recognition result based on the clustering result and the confidence.

Specifically, after the clustering result and the confidence are obtained, the voice recognition result may be determined based on the clustering result and the confidence. For example, the generic result and the vertical result may be fused based on the clustering result and the confidence to obtain a final speech recognition result.

In specific implementation, the general result is determined as the final speech recognition result if any one of the following conditions is met:

the clustering result of the general result is the positive class, for example, C_g=1, and the clustering result of the vertical result is the negative class, for example, C_w=0; or

the clustering result of the general result is the positive class, the clustering result of the vertical result is the positive class, and the confidence of the general result is greater than that of the vertical result, for example, Conf_g>Conf_w; or

the clustering result of the general result is the negative class, the clustering result of the vertical result is the negative class, and the confidence of the general result is less than that of the vertical result, for example, Conf_g<Conf_w.

The vertical result is determined as the final speech recognition result if any one of the following conditions is met:

the clustering result of the general result is the negative class, for example, C_g=0, and the clustering result of the vertical result is the positive class, for example, C_w=1; or

the clustering result of the general result is the positive class, the clustering result of the vertical result is the positive class, and the confidence of the general result is less than that of the vertical result, for example, Conf_g<Conf_w; or

the clustering result of the general result is the negative class, the clustering result of the vertical result is the negative class, and the confidence of the general result is greater than that of the vertical result, for example, Conf_g>Conf_w.
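The decision rules above can be condensed into a short sketch. The function name `fuse` is hypothetical, and the handling of equal confidences (falling back to the vertical result) is an assumption, since the description does not cover ties.

```python
def fuse(general, vertical, c_g, c_w, conf_g, conf_w):
    """Select the final result from the clustering results and confidences."""
    if c_g != c_w:
        # Exactly one result is in the positive class: take that one.
        return general if c_g == 1 else vertical
    if c_g == 1:
        # Both positive: the higher confidence wins (ties go to vertical, assumed).
        return general if conf_g > conf_w else vertical
    # Both negative: the lower confidence wins (ties go to vertical, assumed).
    return general if conf_g < conf_w else vertical


# One positive and one negative class with hypothetical confidences.
print(fuse("general", "vertical", 0, 1, 0.2, 0.9))  # vertical
# Both negative: the result with the lower confidence is kept.
print(fuse("general", "vertical", 0, 0, 0.1, 0.5))  # general
```

For two negative results, a lower confidence means the score lies farther from the clustering center of wrong recognition, which is why the lower value is preferred in that branch.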

The above-described fusion effect will now be described with reference to FIG. 8. As shown in FIG. 8, "connotative yacht vacation hotels" in the interface 800 belongs to the location vertical class. After the speech to be recognized is recognized, the general result (i.e., the output result of the Transform-T model) is "help me navigate to the korean yacht, vacation hotel business building", the clustering result of the general result is the negative class, i.e., C_g=0, and the confidence of the general result is Conf_g=-0.971239; the vertical result (i.e., the output result of the WFST) is "help me navigate to the hotel business building on the park yacht vacation", the clustering result of the vertical result is the positive class, i.e., C_w=1, and the confidence of the vertical result is Conf_w=-0.831955. Because the clustering result of the vertical result is the positive class and the clustering result of the general result is the negative class, the vertical result is determined as the final speech recognition result, that is, the final speech recognition result is "help me navigate to the commercial building of the pleasure boat resort hotel".

In the interface 810, the general result is "how you want to go to rhinoceros to sit on the subway", the clustering result of the general result is the negative class, and the confidence of the general result is 0.136566; the vertical result is "how do i want to go to do what you do with iron", the clustering result of the vertical result is the negative class, and the confidence of the vertical result is 0.516903. Both results are thus clustered to the wrong class, but the confidence of the general result is less than that of the vertical result. In other words, the vertical result lies closer to its clustering center of wrong recognition and is less reliable than the general result. The general result can therefore be determined as the final speech recognition result, that is, the final speech recognition result is "how you want to go to sit in the subway".

Next, the speech recognition effect provided by the embodiment of the present application is exemplarily described with reference to table 7 and table 8.

TABLE 7

Table 7 shows the test results based on 1000 navigation entities. As shown in Table 7, when the WFST model is not used, the character accuracy is 89.25% and the sentence accuracy is 34.1%. When the WFST model is used, if the fuzzy sound quantity of the pronunciation dictionary is 1, the character accuracy can be improved to 92.7%. Here, the fuzzy sound quantity represents the total number of fuzzy sounds corresponding to one Chinese character in the pronunciation dictionary. As the fuzzy sound quantity increases, the character accuracy increases accordingly. However, the size of the WFST model, the memory it occupies, and the amount of computation also increase, so in practical applications a balance can be struck between the fuzzy sound quantity and the accuracy.

TABLE 8

Table 8 shows the test results on 10000 items of non-navigation entity data (out-of-set) and 10000 items of navigation entity data (in-set), where the 10000 in-set items can be extracted from the training corpus. In the test, the text is converted into speech through a Text To Speech (TTS) model, the speech is sent to the speech recognition model to obtain a general result, and the general result is then sent to the WFST model. As shown in Table 8, the average character accuracy of the general result obtained by using only the speech recognition model (e.g., the Transform-T model) is 95.63%, and the average character accuracy of the vertical result obtained by using only the WFST is 95.5%; after the general result and the vertical result are fused, the average character accuracy can be increased to 96.74%. Likewise, the average sentence accuracy can be increased to 80.27%.

It is understood that, in the above embodiments, steps 201 to 204 are optional steps. This application only provides one possible embodiment, which may include more or fewer steps than steps 201 to 204; this is not limited in this application.

FIG. 9 is a flowchart illustrating a speech recognition method according to another embodiment of the present application. As shown in FIG. 9, in this embodiment, after the speech to be recognized is processed by the speech recognition model to obtain a corresponding general result (e.g., Chinese characters), the Chinese characters may be converted into pinyin. The conversion can be realized through a character-to-sound module, which can be a pre-constructed matrix or a neural network model; the embodiment of the present application does not specially limit the specific form of the character-to-sound module. It can be understood that, when the pronunciation dictionary is constructed, the phonetic annotation of a Chinese character can be its corresponding pinyin and confusable pinyin. The pinyin obtained from the conversion can then be input into the WFST model for calculation to obtain a vertical result, and the general result and the vertical result can be fused to obtain the final speech recognition result. Because there are fewer pinyin syllables than Chinese characters, the fuzzy sound quantity is reduced when the WFST model is constructed, which reduces the number of model parameters and, in turn, the memory usage and the amount of computation of the electronic device 100.
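A table-based character-to-sound module can be sketched as follows. The table `CHAR_TO_PINYIN`, the function name `to_pinyin`, the tone-number notation, and the `<unk>` placeholder are all hypothetical; a real module could equally be a pre-constructed matrix or a neural network, as noted above.

```python
# Hypothetical lookup table from Chinese characters to pinyin with tone numbers.
CHAR_TO_PINYIN = {
    "电": "dian4",
    "店": "dian4",
    "话": "hua4",
}


def to_pinyin(text, table=CHAR_TO_PINYIN, unknown="<unk>"):
    """Convert each character of the general result to its pinyin syllable."""
    return [table.get(char, unknown) for char in text]


print(to_pinyin("电话"))  # ['dian4', 'hua4']
```

Because "电" and "店" map to the same syllable in this table, the pinyin inventory is smaller than the character inventory, which is the property the paragraph above relies on.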

FIG. 10 is a schematic diagram illustrating an effect of using the speech recognition method according to the embodiment of the present application. As shown in FIG. 10, for an uncommon word such as the Chinese character sequence "monthly mi", the result output by the general model is often wrong, and the correct Chinese character sequence can be obtained by fusing the general result and the vertical result in the embodiment of the present application. In addition, since the WFST is trained entirely on vertical corpora, its recognition of general speech is poor; for example, the WFST model may produce the erroneous result "i want to buy a cai in tomorrow". By fusing the general result and the vertical result in the embodiment of the present application, misrecognition caused by such errors in vertical recognition can be avoided.

FIG. 11 is a schematic structural diagram of an embodiment of a speech recognition apparatus of the present application. As shown in FIG. 11, the speech recognition apparatus 1100 may include: an obtaining module 1110, a first calculating module 1120, a second calculating module 1130, and a fusing module 1140; wherein:

an obtaining module 1110, configured to obtain a speech to be recognized;

the first calculation module 1120 is configured to calculate the speech to be recognized by using a preset first model to obtain a general result;

the second calculating module 1130 is configured to input the general result into a preset second model for calculation to obtain a vertical result;

and a fusion module 1140, configured to obtain a voice recognition result based on the vertical result and the general result.
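The data flow between the modules listed above can be sketched as a small class. The class name `SpeechRecognizer`, its callable-based interface, and the toy stand-in models are hypothetical; they only illustrate how the general result feeds the second model before fusion.

```python
class SpeechRecognizer:
    """Chains the first model, the second model, and the fusion step."""

    def __init__(self, first_model, second_model, fuse):
        self.first_model = first_model    # speech -> general result
        self.second_model = second_model  # general result -> vertical result
        self.fuse = fuse                  # (general, vertical) -> final result

    def recognize(self, speech):
        general = self.first_model(speech)
        vertical = self.second_model(general)
        return self.fuse(general, vertical)


# Toy stand-ins for the models (assumptions, not real recognizers).
recognizer = SpeechRecognizer(
    first_model=lambda speech: speech.lower(),
    second_model=lambda general: general.replace("hotl", "hotel"),
    fuse=lambda general, vertical: vertical,
)
result = recognizer.recognize("Navigate to HOTL")
```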

In one possible implementation manner, the type of the general result is a chinese character, and the second calculating module 1130 is further configured to convert the general result of the chinese character type into a pinyin-type general result, and input the pinyin-type general result into a preset second model for calculation to obtain a vertical result.

In one possible implementation manner, the fusion module 1140 is further configured to

Conf_g=100%-abs(Pt-Gn)/abs(Gp-Gn);

Conf_w=100%-abs(Pw-Wn)/abs(Wp-Wn).

determining the general result as a final voice recognition result;

the vertical class result is determined as a final voice recognition result.

The embodiment shown in fig. 11 provides a speech recognition apparatus 1100 for implementing the technical solutions of the method embodiments shown in fig. 1-10 of the present application, and the implementation principles and technical effects thereof can be further referred to the related descriptions in the method embodiments.

It should be understood that the division of the modules of the speech recognition apparatus shown in FIG. 11 is merely a logical division; in actual implementation, the modules may be wholly or partially integrated into one physical entity or may be physically separate. These modules may all be implemented as software invoked by a processing element, or all in hardware; alternatively, some modules may be implemented as software invoked by a processing element and others in hardware. For example, the detection module may be a separate processing element, or may be integrated into a chip of the electronic device. Other modules are implemented similarly. In addition, all or some of the modules may be integrated together or implemented independently. In implementation, each step of the above method or each of the above modules may be completed by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as: one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, these modules may be integrated together and implemented in the form of a System-On-a-Chip (SOC).

It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.

It is to be understood that, in order to realize the functions described above, the electronic device 100 and the like include corresponding hardware structures and/or software modules for performing the respective functions. Those of skill in the art will readily appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or as a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.

In the embodiment of the present application, the electronic device 100 and the like may be divided into functional modules according to the method example, for example, each functional module may be divided for each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.

Through the above description of the embodiments, it is clear to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. For the specific working processes of the system, the apparatus and the unit described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not described here again.

Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application, in essence, or the part thereof contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: flash memory, removable hard drive, read only memory, random access memory, magnetic or optical disk, and the like.

The above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method of speech recognition, the method comprising:

acquiring a voice to be recognized;

and obtaining a voice recognition result based on the vertical result and the general result.

2. The method of claim 1, wherein the general result is a Chinese character, and the inputting the general result into a preset second model for calculation to obtain a vertical result comprises:

3. The method according to claim 1 or 2, wherein the predetermined second model is a Weighted Finite State Transducer (WFST) model.

4. The method of claim 3, wherein the WFST model comprises a dictionary WFST and a language WFST, wherein the dictionary WFST is obtained by annotating a segmented word with a predetermined pronunciation dictionary.

5. The method of claim 4, wherein the pronunciation dictionary comprises one of an initial and final based pronunciation dictionary, a pinyin based dictionary, and a chinese character based dictionary.

6. The method according to claim 4, wherein the pronunciation dictionary comprises a rarely-used word dictionary and a filtered Chinese character set, wherein the filtered Chinese character set comprises confusing Chinese characters and fuzzy-phonetic Chinese characters, the confusing Chinese characters and the fuzzy-phonetic Chinese characters have corresponding weights, and the rarely-used word dictionary is obtained by word frequency statistics of text corpus.

7. The method according to any one of claims 1-6, wherein the deriving a speech recognition result based on the vertical class result and the generic result comprises:

8. The method of claim 7, wherein the computing the clustering results of the vertical results and the generic results based on the vertical results and the generic results comprises:

acquiring general clustering centers and vertical clustering centers; the general clustering centers comprise a general correct recognition clustering center and a general wrong recognition clustering center, and the vertical clustering centers comprise a vertical correct recognition clustering center and a vertical wrong recognition clustering center;

if the general result is closer to the general correct recognition clustering center than to the general wrong recognition clustering center, the clustering result of the general result is a positive class, and if the general result is closer to the general wrong recognition clustering center than to the general correct recognition clustering center, the clustering result of the general result is a negative class; and if the vertical result is closer to the vertical correct recognition clustering center than to the vertical wrong recognition clustering center, the clustering result of the vertical result is a positive class, and if the vertical result is closer to the vertical wrong recognition clustering center than to the vertical correct recognition clustering center, the clustering result of the vertical result is a negative class.

9. The method of claim 8, wherein the calculating the confidence levels for the vertical class results and the generic results based on the vertical class results and the generic results comprises:

Conf_g=100%-abs(Pt-Gn)/abs(Gp-Gn);

Conf_w=100%-abs(Pw-Wp)/abs(Wp-Wn); wherein Conf_w is the confidence of the vertical result, Pw is the score of the vertical result, Wp is the vertical correct recognition clustering center, and Wn is the vertical wrong recognition clustering center;

if the clustering result of the vertical result is a negative class, the confidence of the vertical result can be calculated by the following formula:

Conf_w=100%-abs(Pw-Wn)/abs(Wp-Wn).

10. The method of claim 9, wherein determining a speech recognition result based on the clustering results and the confidence levels of the vertical results and the generic results comprises:

The clustering result of the general result is a positive class, the clustering result of the vertical class result is a positive class, and the confidence coefficient of the general result is greater than that of the vertical class result; or

determining the general result as a final voice recognition result;

The clustering result of the general result is a positive class, the clustering result of the vertical class result is a positive class, and the confidence coefficient of the general result is less than that of the vertical class result; or

the vertical class result is determined as a final speech recognition result.

11. An electronic device, comprising: a memory for storing computer program code, the computer program code comprising instructions that, when read from the memory by the electronic device, cause the electronic device to perform the method of any of claims 1-10.

12. A computer-readable storage medium comprising computer instructions that, when executed on an electronic device, cause the electronic device to perform the method of any of claims 1-10.