CN113593566A - Voice recognition processing method and system - Google Patents

Voice recognition processing method and system

Info

Publication number
CN113593566A
Authority
CN
China
Prior art keywords
words
syllables
confidence rates
text
rates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110639813.5A
Other languages
Chinese (zh)
Inventor
马宪泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Two Monkeys Technology Co ltd
Original Assignee
Shenzhen Two Monkeys Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Two Monkeys Technology Co ltd filed Critical Shenzhen Two Monkeys Technology Co ltd
Priority to CN202110639813.5A
Publication of CN113593566A
Pending legal-status Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 - Speech to text systems
    • G10L2015/027 - Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The embodiments of this application provide a speech recognition processing method and system. In the method, a terminal device collects voice data and applies a first speech recognition algorithm to determine n confidence rates of n words corresponding to each pronunciation group in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation group. The technical scheme provided by this application has the advantage of high accuracy.

Description

Voice recognition processing method and system
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition processing method and system.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It differs from speaker recognition and speaker verification, which attempt to identify or verify the speaker rather than the vocabulary content of the speech.
Existing speech recognition technology has low accuracy, especially for polyphone words (different characters sharing the same pronunciation), which degrades the user experience.
Disclosure of Invention
The embodiment of the application discloses a voice recognition processing method, which can improve the accuracy of voice recognition and further improve the experience of a user.
A first aspect of the embodiments of this application provides a speech recognition processing method applied to a terminal device, the method comprising the following steps:
the terminal device collects voice data and applies a first speech recognition algorithm to the voice data to determine n confidence rates of n words corresponding to each pronunciation group in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation group in the voice data;
the terminal device determines, for each pronunciation, the word corresponding to the highest of the n confidence rates as the character information in a first text, and the word corresponding to the highest of the m confidence rates as the character information in a second text;
the terminal device compares, syllable by syllable, the character information in the first text with the character information in the second text: syllables whose character content differs are determined to be polyphone syllables, and syllables whose character content is the same are homophone syllables. If there is one polyphone syllable, the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words are extracted, and the x words corresponding to the top x of the n confidence rates and the x' words corresponding to the top x' of the m confidence rates are obtained. If the x words and the x' words share the same first word, the first word is determined to be the correct word for the polyphone syllable, and the correct word is combined with the words corresponding to the homophone syllables to obtain the correct text of the voice data.
In a second aspect, a speech recognition processing system is provided, the system comprising:
the acquisition unit is used for acquiring voice data;
the processing unit is used for acquiring voice data by the terminal equipment and determining n confidence rates of n words corresponding to each pronunciation group in the voice data by adopting a first voice recognition algorithm; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm; determining characters corresponding to the highest confidence rates in the n confidence rates in each pronunciation as character information in the first text, and determining characters corresponding to the highest confidence rates in the m confidence rates in each pronunciation as character information in the second text; comparing the same syllables in the character information in the first text with the same syllables in the character information in the second text to determine that the syllables with different character contents are polyphone syllables, taking the syllables with the same character contents as the homophone syllables, if the polyphone syllables are one, extracting n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words, obtaining x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates, if the x words and the x' words have the same first words, determining that the first words are correct words of the polyphone syllables, and combining the correct words and the homophone syllables to obtain the correct text of the voice data.
A third aspect of embodiments of the present application provides a terminal device, comprising a processor, a memory, a communication interface, and one or more programs, stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of the first aspect.
A fourth aspect of embodiments of the present application discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method of the first aspect.
A fifth aspect of embodiments of the present application discloses a computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps as described in the first aspect of embodiments of the present application. The computer program product may be a software installation package.
According to the above technical scheme, the terminal device collects voice data, applies a first speech recognition algorithm to determine n confidence rates of n words corresponding to each pronunciation in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation; the terminal device determines, for each pronunciation, the word with the highest of the n confidence rates as the character information in a first text and the word with the highest of the m confidence rates as the character information in a second text; the terminal device then compares the two texts syllable by syllable, determining syllables whose character content differs to be polyphone syllables and syllables whose character content is the same to be homophone syllables; if there is one polyphone syllable, it extracts the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words, obtains the x words with the top x of the n confidence rates and the x' words with the top x' of the m confidence rates, and, if the x words and the x' words share the same first word, determines the first word to be the correct word for the polyphone syllable and combines it with the words corresponding to the homophone syllables to obtain the correct text of the voice data. This improves the accuracy of the characters for polyphone syllables.
Drawings
The drawings used in the embodiments of the present application are described below.
Fig. 1 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a speech recognition processing system according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The term "and/or" in this application is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in this document indicates that the former and latter related objects are in an "or" relationship.
The "plurality" appearing in the embodiments of the present application means two or more. The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application. The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, to implement communication between devices, which is not limited in this embodiment of the present application.
The terminal device in the embodiments of this application is a device having a wireless communication function, and may be referred to as a terminal, user equipment (UE), a mobile station (MS), a mobile terminal (MT), an access terminal device, a vehicle-mounted terminal device, an industrial control terminal device, a UE unit, a UE station, a remote terminal device, a mobile device, a UE terminal device, a wireless communication device, a UE agent, or a UE apparatus. The terminal device may be fixed or mobile, and may support at least one wireless communication technology, such as LTE or New Radio (NR). For example, the terminal device may be a mobile phone, a tablet (pad), a desktop computer, a notebook computer, a kiosk, a vehicle-mounted terminal, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA), a handheld device with a wireless communication function, a computing device or other processing device connected to a wireless modem, a wearable device, or a terminal device in a future mobile communication network or a future evolved public land mobile network (PLMN). In some embodiments of this application, the terminal device may also be an apparatus having a transceiving function, such as a system-on-chip; the chip system may include a chip and may also include other discrete devices.
Referring to fig. 1, which provides a schematic structural diagram of a terminal device, the terminal may specifically include a processor, a memory, a communication unit, an audio acquisition unit, and a bus, where the processor, the memory, the communication unit, and the audio acquisition unit are connected through the bus.
For example, the processor may include a general-purpose processor and an artificial intelligence processor. The specific structure of the artificial intelligence processor is not limited in this application; for example, a Siyuan 270 artificial intelligence processor may be used, and of course other AI processors may also be used.
The audio acquisition unit may be a microphone or other audio acquisition devices.
The communication unit may be used to provide the terminal with the capability to communicate with external devices. The communication unit may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuitry in the communication unit may include radio frequency transceiver circuitry, power amplifier circuitry, low noise amplifiers, switches, filters, and antennas. For example, the wireless communication circuitry in the communication unit may include circuitry for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals; for example, the communication unit may include a near field communication antenna and a near field communication transceiver. The communication unit may also include a cellular telephone transceiver and antenna, wireless local area network transceiver circuitry and antenna, and the like.
The communication unit may support various communication protocols, for example mobile communication protocols such as 5G and LTE, as well as short-range communication protocols such as Wi-Fi and Bluetooth.
The terminal may further include a battery, a power management circuit, and other input-output units. The input-output unit may include buttons, a joystick, a display screen, and the like.
Referring to fig. 2, which provides a speech recognition processing method applied to the terminal device whose structure is shown in fig. 1, the method includes the following steps:
s200, the terminal equipment collects voice data and determines n confidence rates of n words corresponding to each pronunciation group in the voice data by adopting a first voice recognition algorithm; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm;
Each pronunciation group under the first speech recognition algorithm has n confidence rates, and each pronunciation group under the second speech recognition algorithm has m confidence rates. Both n and m are integers greater than or equal to 3, and m and n may be equal or unequal.
Step S201: the terminal device determines, for each pronunciation, the word corresponding to the highest of the n confidence rates as the character information in a first text, and the word corresponding to the highest of the m confidence rates as the character information in a second text.
step S202, the terminal equipment compares the same syllables in the character information in the first text and the character information in the second text to determine that syllables with different character contents are polyphone syllables, the syllables with the same character contents are homophone syllables, if the polyphone syllables are one, n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words are extracted, x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates are obtained, if the x words and the x' words have the same first words, the first words are determined to be correct words of the polyphone syllables, and the correct words and the homophone syllable corresponding words are combined to obtain the correct text of the voice data.
Both x and x' are integers greater than or equal to 2, and they may be equal or unequal.
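A minimal sketch of this selection procedure, written in Python, is given below. It assumes that each recognition algorithm returns, per syllable, a candidate list of (word, confidence rate) pairs sorted by descending confidence; the function names and data layout are illustrative assumptions rather than part of the disclosed method, and the no-overlap branch is only a placeholder for the history-based fallback described further below.

    # Sketch of steps S200-S202: merging the outputs of two recognizers.
    # Assumption: result_a and result_b each hold, per syllable, a list of
    # (word, confidence_rate) pairs sorted by descending confidence.

    def top_words(candidates, k):
        # the words carrying the top-k confidence rates
        return [word for word, _ in candidates[:k]]

    def resolve_text(result_a, result_b, x=2, x_prime=2):
        correct_words = []
        for cands_a, cands_b in zip(result_a, result_b):
            best_a, best_b = cands_a[0][0], cands_b[0][0]
            if best_a == best_b:
                # homophone syllable: both algorithms output the same word
                correct_words.append(best_a)
            else:
                # polyphone syllable: intersect the top-x and top-x' lists
                common = [w for w in top_words(cands_a, x)
                          if w in top_words(cands_b, x_prime)]
                # no common word: the history-based fallback described
                # below applies; best_a here is only a placeholder
                correct_words.append(common[0] if common else best_a)
        return "".join(correct_words)

With x = x' = 2 this reproduces the selection made in the "xiaofang" example below.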
According to the above technical scheme, the terminal device collects voice data, applies a first speech recognition algorithm to determine n confidence rates of n words corresponding to each pronunciation in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation; the terminal device determines, for each pronunciation, the word with the highest of the n confidence rates as the character information in a first text and the word with the highest of the m confidence rates as the character information in a second text; the terminal device then compares the two texts syllable by syllable, determining syllables whose character content differs to be polyphone syllables and syllables whose character content is the same to be homophone syllables; if there is one polyphone syllable, it extracts the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words, obtains the x words with the top x of the n confidence rates and the x' words with the top x' of the m confidence rates, and, if the x words and the x' words share the same first word, determines the first word to be the correct word for the polyphone syllable and combines it with the words corresponding to the homophone syllables to obtain the correct text of the voice data. This improves the accuracy of the characters for polyphone syllables.
The principle of this technical scheme is as follows: existing speech recognition methods, for example an LSTM-based speech recognition algorithm and an RNN-based speech recognition algorithm, achieve good accuracy in practice, but their recognition rate drops sharply on polyphone syllables. The scheme therefore checks whether the top x confidence-ranked candidates for a polyphone syllable produced by two different speech recognition algorithms contain the same word; if they do, that word is correct with high probability, so the accuracy for polyphone syllables is improved.
Note that a polyphone syllable here does not mean a syllable that corresponds to only one character; in this application, a polyphone syllable is a syllable for which the two different recognition methods output different characters as their recognition results.
A practical example follows.
For example, the original speech data is: "search for Xiaofang's songs". The result of the first recognition is "find the songs of 'small aromatic'"; the result of the second recognition is "find the songs of 'small square'" ("small aromatic" and "small square" are two different written forms of the same name, both pronounced "xiaofang"). The syllable group "xiaofang" is therefore determined to be the polyphone syllable. With x and x' both equal to 2, the first algorithm's top x recognized words with their confidence rates are: "small aromatic" (85%) and "xiao aromatic" (20%); the second algorithm's top x' recognized words are: "small square" (50%) and "small aromatic" (40%). Since "small aromatic" appears in both candidate lists, the correct word is determined to be "small aromatic".
For example, the method may further include:
if the x words and the x' words have no word in common, the terminal device determines a first identity for the voice message and computes matching degrees between the history information of the first identity and, respectively, the character information in the first text and the character information in the second text; the character information with the highest matching degree (which may be that of the first text or that of the second text) is determined to be the correct text.
In practical applications there may be several polyphone syllables; they are processed in the same way as a single one, the correct word simply being determined one syllable at a time.
A practical example follows.
For example, the original speech data is again: "search for Xiaofang's songs". The result of the first recognition is "find the songs of 'small aromatic'"; the result of the second recognition is "find the songs of 'small square'", so "xiaofang" is determined to be the polyphone syllable. With x and x' both equal to 2, the first algorithm's top x recognized words with their confidence rates are: "small aromatic" (85%) and "xiao aromatic" (20%); the second algorithm's top x' recognized words are: "small square" (50%) and "school party" (40%). The two candidate lists share no word, so the history information must be consulted: if the history of this speaker's voice data includes the songs of "small aromatic", the character information in the first text is determined to be the correct text; otherwise, if it includes the songs of "small square", the character information in the second text is determined to be the correct text.
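A sketch of this history-based fallback follows, assuming that the history information is simply a list of previously confirmed texts for the identified speaker; the containment-count similarity is an illustrative stand-in, since the embodiment only requires some matching degree between each candidate text and the history.

    # Fallback when the top-x and top-x' candidate sets share no word:
    # choose whichever full text best matches the speaker's history.
    def matching_degree(text, history):
        # crude illustrative score: count history entries that contain
        # the candidate text or are contained in it
        return sum(1 for past in history if text in past or past in text)

    def resolve_by_history(first_text, second_text, history):
        score_first = matching_degree(first_text, history)
        score_second = matching_degree(second_text, history)
        return first_text if score_first >= score_second else second_text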
For example, the calculation formula of the RNN speech recognition method can be as follows:
taking time t as an example:
S_t = X_t × W + S_{t-1} × W
O_t = f(S_t)
where W represents a weight, X_{t-1} represents the input data of the input layer at time t-1, X_t represents the input data of the input layer at time t, S_{t-1} represents the output result of the hidden layer at time t-1, and O_{t-1} represents the output result of the output layer at time t-1;
f represents an activation function including, but not limited to: sigmoid function, tanh function, etc.
sigmoid(x) = 1/(1 + e^(-x)); tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
Of course, in practical applications, other activation functions may be used.
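Purely as an illustration, the recurrence above can be written out in a few lines of NumPy. The single shared weight matrix W follows the formula exactly as given in the description (a conventional RNN would use separate input and recurrent weight matrices), which implies equal input and hidden dimensions here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rnn_step(x_t, s_prev, W, f=np.tanh):
        # S_t = X_t*W + S_{t-1}*W, O_t = f(S_t), as in the description
        s_t = x_t @ W + s_prev @ W
        o_t = f(s_t)  # f may be sigmoid, tanh, or another activation
        return s_t, o_t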
For example, the calculation formula of the LSTM speech recognition method may specifically include:
the LSTM can be divided into a forgetting gate, an input gate, and an output gate, corresponding to three calculations, and the formula of the calculation is as follows:
forget to remember the door ft=σ(ht-1*Xt+bf)。
An input gate:
it=σ(ht-1*Xt+bi)
C’t=tanh(ht-1*Xt+bc);
an output gate:
Ot=σ(ht-1*Xt+bO);
ht=Ot*tanh(Ct)。
wherein, Ct=Ct-1*ft+it*C’t
Above, bfDenotes ftThe offset of the function, the value being constant, and, similarly, bi、bc、boRespectively representing the offsets of the corresponding formulas; ct-1Is shown onCell output value, h, of t-1 at a timet-1Is the output value of the previous moment, XtRepresenting the input data at the current time t. Where σ denotes a sigmod function and tanh denotes a tanh function, which are activation functions. O istIndicating the output result of the output gate.
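The gate equations above can likewise be sketched in NumPy. The code reproduces the simplified elementwise form given in the description (h_{t-1} * X_t plus a bias); standard LSTM formulations instead apply learned weight matrices to h_{t-1} and X_t.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, b_f, b_i, b_c, b_o):
        f_t = sigmoid(h_prev * x_t + b_f)     # forget gate
        i_t = sigmoid(h_prev * x_t + b_i)     # input gate
        c_cand = np.tanh(h_prev * x_t + b_c)  # candidate cell state C'_t
        o_t = sigmoid(h_prev * x_t + b_o)     # output gate
        c_t = c_prev * f_t + i_t * c_cand     # C_t = C_{t-1}*f_t + i_t*C'_t
        h_t = o_t * np.tanh(c_t)              # h_t = O_t * tanh(C_t)
        return h_t, c_t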
For example, the method may further include:
the terminal device generates a control instruction corresponding to the correct text, as sketched below.
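A minimal sketch of such a mapping is shown below; the action names and the trigger phrase are entirely hypothetical and only illustrate the idea of turning the corrected text into a control instruction.

    # Hypothetical mapping from the corrected text to a control instruction;
    # the action names and trigger phrase are illustrative assumptions only.
    def to_control_instruction(correct_text):
        if correct_text.startswith("search for"):
            query = correct_text[len("search for"):].strip()
            return {"action": "search", "query": query}
        return {"action": "none", "text": correct_text}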
Referring to fig. 3, fig. 3 provides a speech recognition processing system comprising:
an acquisition unit 301 for acquiring voice data;
the processing unit 302 is configured to collect voice data by a terminal device, and determine n confidence rates of n words corresponding to each uttering group in the voice data by using a first voice recognition algorithm for the voice data; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm; determining characters corresponding to the highest confidence rates in the n confidence rates in each pronunciation as character information in the first text, and determining characters corresponding to the highest confidence rates in the m confidence rates in each pronunciation as character information in the second text; comparing the same syllables in the character information in the first text with the same syllables in the character information in the second text to determine that the syllables with different character contents are polyphone syllables, taking the syllables with the same character contents as the homophone syllables, if the polyphone syllables are one, extracting n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words, obtaining x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates, if the x words and the x' words have the same first words, determining that the first words are correct words of the polyphone syllables, and combining the correct words and the homophone syllables to obtain the correct text of the voice data.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a network device, the method flow shown in fig. 2 is implemented.
An embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal, the method flow shown in fig. 2 is implemented.
Embodiments of the present application also provide a terminal including a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method of the embodiment shown in fig. 2.
The above description has introduced the solution of the embodiments of this application mainly from the perspective of the method-side implementation process. It will be appreciated that, in order to carry out the functions described above, the electronic device may comprise corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments provided herein can be implemented as hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of this application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that this application is not limited by the order of the acts described, as some steps may occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules referred to are not necessarily required by this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division of logical functions, and other divisions may be used in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable memory. Based on such understanding, the technical solution of this application, in essence or in the part contributing to the prior art, in whole or in part, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, or a magnetic or optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware, the program being stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The embodiments of this application have been described in detail above; specific examples are used herein to illustrate the principles and implementations of this application, and the above description of the embodiments is only intended to help understand the method and core concept of this application. Meanwhile, a person skilled in the art may, according to the idea of this application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

1. A speech recognition processing method applied to a terminal device, wherein the method comprises the following steps:
the terminal device collects voice data and applies a first speech recognition algorithm to the voice data to determine n confidence rates of n words corresponding to each pronunciation group in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation group in the voice data;
the terminal device determines, for each pronunciation, the word corresponding to the highest of the n confidence rates as the character information in a first text, and the word corresponding to the highest of the m confidence rates as the character information in a second text;
the terminal device compares, syllable by syllable, the character information in the first text with the character information in the second text: syllables whose character content differs are determined to be polyphone syllables, and syllables whose character content is the same are homophone syllables; if there is one polyphone syllable, the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words are extracted, and the x words corresponding to the top x of the n confidence rates and the x' words corresponding to the top x' of the m confidence rates are obtained; if the x words and the x' words share the same first word, the first word is determined to be the correct word for the polyphone syllable, and the correct word is combined with the words corresponding to the homophone syllables to obtain the correct text of the voice data.
2. The method of claim 1, further comprising:
if the x words and the x' words have no word in common, the terminal device determines a first identity for the voice information, computes matching degrees between the history information of the first identity and, respectively, the character information in the first text and the character information in the second text, and determines the character information with the highest matching degree to be the correct text.
3. The method of claim 1,
the first speech recognition algorithm is an LSTM speech recognition algorithm, and the second speech recognition algorithm is an RNN speech recognition algorithm.
4. The method of claim 3, wherein the RNN speech recognition algorithm has a computational formula comprising:
S_t = X_t × W + S_{t-1} × W
O_t = f(S_t)
where W represents a weight, X_{t-1} represents the input data of the input layer at time t-1, X_t represents the input data of the input layer at time t, S_{t-1} represents the output result of the hidden layer at time t-1, and O_{t-1} represents the output result of the output layer at time t-1;
f denotes an activation function.
5. The method according to claim 4, characterized in that said activation function comprises in particular:
a sigmoid function or a tanh function, where sigmoid(x) = 1/(1 + e^(-x)) and tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)).
6. The method of claim 3, wherein the calculation formula of the LSTM speech recognition algorithm comprises:
a forget gate: f_t = σ(h_{t-1} * X_t + b_f);
an input gate:
i_t = σ(h_{t-1} * X_t + b_i)
C'_t = tanh(h_{t-1} * X_t + b_c);
an output gate:
O_t = σ(h_{t-1} * X_t + b_O);
h_t = O_t * tanh(C_t),
where C_t = C_{t-1} * f_t + i_t * C'_t;
and wherein b_f denotes the offset of the f_t function, and b_i, b_c, and b_O denote the offsets of the corresponding formulas; C_{t-1} denotes the cell output value at the previous time t-1, h_{t-1} is the output value at the previous time, X_t represents the input data at the current time t, σ denotes the sigmoid function, tanh denotes the tanh function, and O_t denotes the output result of the output gate.
7. A speech recognition processing system, the system comprising:
the acquisition unit is used for acquiring voice data;
the processing unit is used for acquiring voice data by the terminal equipment and determining n confidence rates of n words corresponding to each pronunciation group in the voice data by adopting a first voice recognition algorithm; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm; determining characters corresponding to the highest confidence rates in the n confidence rates in each pronunciation as character information in the first text, and determining characters corresponding to the highest confidence rates in the m confidence rates in each pronunciation as character information in the second text; comparing the same syllables in the character information in the first text with the same syllables in the character information in the second text to determine that the syllables with different character contents are polyphone syllables, taking the syllables with the same character contents as the homophone syllables, if the polyphone syllables are one, extracting n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words, obtaining x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates, if the x words and the x' words have the same first words, determining that the first words are correct words of the polyphone syllables, and combining the correct words and the homophone syllables to obtain the correct text of the voice data.
8. The system of claim 7,
the first speech recognition algorithm is an LSTM speech recognition algorithm, and the second speech recognition algorithm is an RNN speech recognition algorithm.
9. A terminal device, characterized in that the terminal device comprises: a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-6.
10. A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-6.
CN202110639813.5A 2021-06-08 2021-06-08 Voice recognition processing method and system Pending CN113593566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110639813.5A CN113593566A (en) 2021-06-08 2021-06-08 Voice recognition processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110639813.5A CN113593566A (en) 2021-06-08 2021-06-08 Voice recognition processing method and system

Publications (1)

Publication Number Publication Date
CN113593566A true CN113593566A (en) 2021-11-02

Family

ID=78243558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110639813.5A Pending CN113593566A (en) 2021-06-08 2021-06-08 Voice recognition processing method and system

Country Status (1)

Country Link
CN (1) CN113593566A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042556A1 (en) * 2017-08-01 2019-02-07 International Business Machines Corporation Dynamic Homophone/Synonym Identification and Replacement for Natural Language Processing
CN109545203A (en) * 2018-12-14 2019-03-29 深圳壹账通智能科技有限公司 Audio recognition method, device, equipment and storage medium
CN110942763A (en) * 2018-09-20 2020-03-31 阿里巴巴集团控股有限公司 Voice recognition method and device
CN111125310A (en) * 2019-12-24 2020-05-08 北京百度网讯科技有限公司 Voice query method and device
CN111145756A (en) * 2019-12-26 2020-05-12 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN111142681A (en) * 2018-11-06 2020-05-12 北京嘀嘀无限科技发展有限公司 Method, system, device and storage medium for determining pinyin of Chinese characters
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
CN112329926A (en) * 2020-11-30 2021-02-05 珠海采筑电子商务有限公司 Quality improvement method and system for intelligent robot
CN112435671A (en) * 2020-11-11 2021-03-02 深圳市小顺智控科技有限公司 Intelligent voice control method and system for accurately recognizing Chinese

Similar Documents

Publication Publication Date Title
JP5996783B2 (en) Method and terminal for updating voiceprint feature model
CN105719659A (en) Recording file separation method and device based on voiceprint identification
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN104834847A (en) Identity verification method and device
CN104462058B (en) Character string identification method and device
CN107221330A (en) Punctuate adding method and device, the device added for punctuate
CN103366743A (en) Voice-command operation method and device
CN110263131B (en) Reply information generation method, device and storage medium
CN112334978A (en) Electronic device supporting personalized device connection and method thereof
CN113270104B (en) Artificial intelligence processing method and system for voice
CN110931000B (en) Method and device for speech recognition
CN112017669B (en) Voice countercheck sample detection method and device, terminal equipment and storage medium
CN107483069A (en) Voice help-asking method, apparatus, equipment and the storage medium of Wearable
CN112329926A (en) Quality improvement method and system for intelligent robot
CN107371144B (en) Method and device for intelligently sending information
JP2024037831A (en) Voice terminal voice verification and restriction method
CN112528266A (en) Visiting registration method and related device
CN113593566A (en) Voice recognition processing method and system
CN115240638A (en) Audio generation method and device, storage medium and chip
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN111899747B (en) Method and apparatus for synthesizing audio
CN115116458A (en) Voice data conversion method and device, computer equipment and storage medium
CN115019806A (en) Voiceprint recognition method and device
KR20220138669A (en) Electronic device and method for providing personalized audio information
CN103294370A (en) Method and equipment for triggering keystroke operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination