CN113593566A - Voice recognition processing method and system - Google Patents

Voice recognition processing method and system

Info

Publication number
CN113593566A
Authority
CN
China
Prior art keywords
words
syllables
confidence rates
text
rates
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110639813.5A
Other languages
Chinese (zh)
Inventor
马宪泉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Two Monkeys Technology Co ltd
Original Assignee
Shenzhen Two Monkeys Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Two Monkeys Technology Co ltd filed Critical Shenzhen Two Monkeys Technology Co ltd
Priority to CN202110639813.5A
Publication of CN113593566A
Pending legal-status Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/26 - Speech to text systems
    • G10L2015/027 - Syllables being the recognition units

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

The embodiments of this application provide a speech recognition processing method and system. In the method, a terminal device collects voice data and applies a first speech recognition algorithm to determine n confidence rates of n words corresponding to each pronunciation group in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation group. The technical scheme provided by this application has the advantage of high accuracy.

Description

Voice recognition processing method and system
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech recognition processing method and system.
Background
Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the vocabulary content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It differs from speaker recognition and speaker verification, which attempt to identify or verify the speaker rather than the vocabulary content of the speech.
Existing speech recognition technology has low accuracy, especially for polyphone words (different characters sharing the same pronunciation), which degrades the user experience.
Disclosure of Invention
The embodiment of the application discloses a voice recognition processing method, which can improve the accuracy of voice recognition and further improve the experience of a user.
A first aspect of the embodiments of this application provides a speech recognition processing method applied to a terminal device, the method comprising the following steps:
the terminal device collects voice data and applies a first speech recognition algorithm to the voice data to determine n confidence rates of n words corresponding to each pronunciation group in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation group in the voice data;
the terminal device determines, for each pronunciation, the word corresponding to the highest of the n confidence rates as the character information in a first text, and the word corresponding to the highest of the m confidence rates as the character information in a second text;
the terminal device compares, syllable by syllable, the character information in the first text with the character information in the second text: syllables whose character content differs are determined to be polyphone syllables, and syllables whose character content is the same are homophone syllables. If there is one polyphone syllable, the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words are extracted, and the x words corresponding to the top x of the n confidence rates and the x' words corresponding to the top x' of the m confidence rates are obtained. If the x words and the x' words share the same first word, the first word is determined to be the correct word for the polyphone syllable, and the correct word is combined with the words corresponding to the homophone syllables to obtain the correct text of the voice data.
In a second aspect, a speech recognition processing system is provided, the system comprising:
the acquisition unit is used for acquiring voice data;
the processing unit is used for acquiring voice data by the terminal equipment and determining n confidence rates of n words corresponding to each pronunciation group in the voice data by adopting a first voice recognition algorithm; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm; determining characters corresponding to the highest confidence rates in the n confidence rates in each pronunciation as character information in the first text, and determining characters corresponding to the highest confidence rates in the m confidence rates in each pronunciation as character information in the second text; comparing the same syllables in the character information in the first text with the same syllables in the character information in the second text to determine that the syllables with different character contents are polyphone syllables, taking the syllables with the same character contents as the homophone syllables, if the polyphone syllables are one, extracting n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words, obtaining x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates, if the x words and the x' words have the same first words, determining that the first words are correct words of the polyphone syllables, and combining the correct words and the homophone syllables to obtain the correct text of the voice data.
A third aspect of embodiments of the present application provides a terminal device, comprising a processor, a memory, a communication interface, and one or more programs, stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of the first aspect.
A fourth aspect of embodiments of the present application discloses a computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to execute the method of the first aspect.
A fifth aspect of embodiments of the present application discloses a computer program product, wherein the computer program product comprises a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to perform some or all of the steps as described in the first aspect of embodiments of the present application. The computer program product may be a software installation package.
According to the above technical scheme, the terminal device collects voice data, applies a first speech recognition algorithm to determine n confidence rates of n words corresponding to each pronunciation in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation; the terminal device determines, for each pronunciation, the word with the highest of the n confidence rates as the character information in a first text and the word with the highest of the m confidence rates as the character information in a second text; the terminal device then compares the two texts syllable by syllable, determining syllables whose character content differs to be polyphone syllables and syllables whose character content is the same to be homophone syllables; if there is one polyphone syllable, it extracts the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words, obtains the x words with the top x of the n confidence rates and the x' words with the top x' of the m confidence rates, and, if the x words and the x' words share the same first word, determines the first word to be the correct word for the polyphone syllable and combines it with the words corresponding to the homophone syllables to obtain the correct text of the voice data. This improves the accuracy of the characters for polyphone syllables.
Drawings
The drawings used in the embodiments of the present application are described below.
Fig. 1 is a schematic structural diagram of a terminal device provided in an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech recognition processing method according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a speech recognition processing system according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The term "and/or" in this application is only one kind of association relationship describing the associated object, and means that there may be three kinds of relationships, for example, a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" in this document indicates that the former and latter related objects are in an "or" relationship.
The "plurality" appearing in the embodiments of the present application means two or more. The descriptions of the first, second, etc. appearing in the embodiments of the present application are only for illustrating and differentiating the objects, and do not represent the order or the particular limitation of the number of the devices in the embodiments of the present application, and do not constitute any limitation to the embodiments of the present application. The term "connect" in the embodiments of the present application refers to various connection manners, such as direct connection or indirect connection, to implement communication between devices, which is not limited in this embodiment of the present application.
The terminal device in the embodiments of this application is a device having a wireless communication function, and may be referred to as a terminal, user equipment (UE), a mobile station (MS), a mobile terminal (MT), an access terminal device, a vehicle-mounted terminal device, an industrial control terminal device, a UE unit, a UE station, a remote terminal device, a mobile device, a UE terminal device, a wireless communication device, a UE agent, or a UE apparatus. The terminal device may be fixed or mobile, and may support at least one wireless communication technology, such as LTE or New Radio (NR). For example, the terminal device may be a mobile phone, a tablet (pad), a desktop computer, a notebook computer, a kiosk, a vehicle-mounted terminal, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a wireless terminal in self-driving, a wireless terminal in remote surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, a cellular phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a wireless local loop (WLL) station, a personal digital assistant (PDA), a handheld device with a wireless communication function, a computing device or other processing device connected to a wireless modem, a wearable device, or a terminal device in a future mobile communication network or a future evolved public land mobile network (PLMN). In some embodiments of this application, the terminal device may also be an apparatus having a transceiving function, such as a system-on-chip; the chip system may include a chip and may also include other discrete devices.
Referring to fig. 1, which provides a schematic structural diagram of a terminal device, the terminal may specifically include a processor, a memory, a communication unit, an audio acquisition unit, and a bus, where the processor, the memory, the communication unit, and the audio acquisition unit are connected through the bus.
For example, the processor may include a general-purpose processor and an artificial intelligence processor. The specific structure of the artificial intelligence processor is not limited in this application; for example, a Siyuan 270 artificial intelligence processor may be used, and of course other AI processors may also be used.
The audio acquisition unit may be a microphone or other audio acquisition devices.
The communication unit may be used to provide the terminal with the capability to communicate with external devices. The communication unit may include analog and digital input-output interface circuits, and wireless communication circuits based on radio frequency signals and/or optical signals. The wireless communication circuitry in the communication unit may include radio frequency transceiver circuitry, power amplifier circuitry, low noise amplifiers, switches, filters, and antennas. For example, the wireless communication circuitry in the communication unit may include circuitry for supporting near field communication (NFC) by transmitting and receiving near-field coupled electromagnetic signals; for example, the communication unit may include a near field communication antenna and a near field communication transceiver. The communication unit may also include a cellular telephone transceiver and antenna, wireless local area network transceiver circuitry and antenna, and the like.
The communication unit may support various communication protocols, for example mobile communication protocols such as 5G and LTE, as well as short-range communication protocols such as Wi-Fi and Bluetooth.
The terminal may further include a battery, a power management circuit, and other input-output units. The input-output unit may include buttons, a joystick, a display screen, and the like.
Referring to fig. 2, which provides a speech recognition processing method applied to the terminal device whose structure is shown in fig. 1, the method includes the following steps:
s200, the terminal equipment collects voice data and determines n confidence rates of n words corresponding to each pronunciation group in the voice data by adopting a first voice recognition algorithm; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm;
Each pronunciation group under the first speech recognition algorithm has n confidence rates, and each pronunciation group under the second speech recognition algorithm has m confidence rates. Both n and m are integers greater than or equal to 3, and m and n may be equal or unequal.
Step S201: the terminal device determines, for each pronunciation, the word corresponding to the highest of the n confidence rates as the character information in a first text, and the word corresponding to the highest of the m confidence rates as the character information in a second text.
step S202, the terminal equipment compares the same syllables in the character information in the first text and the character information in the second text to determine that syllables with different character contents are polyphone syllables, the syllables with the same character contents are homophone syllables, if the polyphone syllables are one, n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words are extracted, x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates are obtained, if the x words and the x' words have the same first words, the first words are determined to be correct words of the polyphone syllables, and the correct words and the homophone syllable corresponding words are combined to obtain the correct text of the voice data.
Both x and x' are integers greater than or equal to 2, and they may be equal or unequal.
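A minimal sketch of this selection procedure, written in Python, is given below. It assumes that each recognition algorithm returns, per syllable, a candidate list of (word, confidence rate) pairs sorted by descending confidence; the function names and data layout are illustrative assumptions rather than part of the disclosed method, and the no-overlap branch is only a placeholder for the history-based fallback described further below.

    # Sketch of steps S200-S202: merging the outputs of two recognizers.
    # Assumption: result_a and result_b each hold, per syllable, a list of
    # (word, confidence_rate) pairs sorted by descending confidence.

    def top_words(candidates, k):
        # the words carrying the top-k confidence rates
        return [word for word, _ in candidates[:k]]

    def resolve_text(result_a, result_b, x=2, x_prime=2):
        correct_words = []
        for cands_a, cands_b in zip(result_a, result_b):
            best_a, best_b = cands_a[0][0], cands_b[0][0]
            if best_a == best_b:
                # homophone syllable: both algorithms output the same word
                correct_words.append(best_a)
            else:
                # polyphone syllable: intersect the top-x and top-x' lists
                common = [w for w in top_words(cands_a, x)
                          if w in top_words(cands_b, x_prime)]
                # no common word: the history-based fallback described
                # below applies; best_a here is only a placeholder
                correct_words.append(common[0] if common else best_a)
        return "".join(correct_words)

With x = x' = 2 this reproduces the selection made in the "xiaofang" example below.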
According to the above technical scheme, the terminal device collects voice data, applies a first speech recognition algorithm to determine n confidence rates of n words corresponding to each pronunciation in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation; the terminal device determines, for each pronunciation, the word with the highest of the n confidence rates as the character information in a first text and the word with the highest of the m confidence rates as the character information in a second text; the terminal device then compares the two texts syllable by syllable, determining syllables whose character content differs to be polyphone syllables and syllables whose character content is the same to be homophone syllables; if there is one polyphone syllable, it extracts the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words, obtains the x words with the top x of the n confidence rates and the x' words with the top x' of the m confidence rates, and, if the x words and the x' words share the same first word, determines the first word to be the correct word for the polyphone syllable and combines it with the words corresponding to the homophone syllables to obtain the correct text of the voice data. This improves the accuracy of the characters for polyphone syllables.
The principle of this technical scheme is as follows: existing speech recognition methods, for example an LSTM-based speech recognition algorithm and an RNN-based speech recognition algorithm, achieve good accuracy in practice, but their recognition rate drops sharply on polyphone syllables. The scheme therefore checks whether the top x confidence-ranked candidates for a polyphone syllable produced by two different speech recognition algorithms contain the same word; if they do, that word is correct with high probability, so the accuracy for polyphone syllables is improved.
Note that a polyphone syllable here does not mean a syllable that corresponds to only one character; in this application, a polyphone syllable is a syllable for which the two different recognition methods output different characters as their recognition results.
A practical example follows.
For example, the original speech data is: "search for Xiaofang's songs". The result of the first recognition is "find the songs of 'small aromatic'"; the result of the second recognition is "find the songs of 'small square'" ("small aromatic" and "small square" are two different written forms of the same name, both pronounced "xiaofang"). The syllable group "xiaofang" is therefore determined to be the polyphone syllable. With x and x' both equal to 2, the first algorithm's top x recognized words with their confidence rates are: "small aromatic" (85%) and "xiao aromatic" (20%); the second algorithm's top x' recognized words are: "small square" (50%) and "small aromatic" (40%). Since "small aromatic" appears in both candidate lists, the correct word is determined to be "small aromatic".
For example, the method may further include:
if the x words and the x' words have no word in common, the terminal device determines a first identity for the voice message and computes matching degrees between the history information of the first identity and, respectively, the character information in the first text and the character information in the second text; the character information with the highest matching degree (which may be that of the first text or that of the second text) is determined to be the correct text.
In practical applications there may be several polyphone syllables; they are processed in the same way as a single one, the correct word simply being determined one syllable at a time.
A practical example follows.
For example, the original speech data is again: "search for Xiaofang's songs". The result of the first recognition is "find the songs of 'small aromatic'"; the result of the second recognition is "find the songs of 'small square'", so "xiaofang" is determined to be the polyphone syllable. With x and x' both equal to 2, the first algorithm's top x recognized words with their confidence rates are: "small aromatic" (85%) and "xiao aromatic" (20%); the second algorithm's top x' recognized words are: "small square" (50%) and "school party" (40%). The two candidate lists share no word, so the history information must be consulted: if the history of this speaker's voice data includes the songs of "small aromatic", the character information in the first text is determined to be the correct text; otherwise, if it includes the songs of "small square", the character information in the second text is determined to be the correct text.
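A sketch of this history-based fallback follows, assuming that the history information is simply a list of previously confirmed texts for the identified speaker; the containment-count similarity is an illustrative stand-in, since the embodiment only requires some matching degree between each candidate text and the history.

    # Fallback when the top-x and top-x' candidate sets share no word:
    # choose whichever full text best matches the speaker's history.
    def matching_degree(text, history):
        # crude illustrative score: count history entries that contain
        # the candidate text or are contained in it
        return sum(1 for past in history if text in past or past in text)

    def resolve_by_history(first_text, second_text, history):
        score_first = matching_degree(first_text, history)
        score_second = matching_degree(second_text, history)
        return first_text if score_first >= score_second else second_text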
For example, the calculation formula of the RNN speech recognition method can be as follows:
taking time t as an example:
S_t = X_t × W + S_{t-1} × W
O_t = f(S_t)
where W represents a weight, X_{t-1} represents the input data of the input layer at time t-1, X_t represents the input data of the input layer at time t, S_{t-1} represents the output result of the hidden layer at time t-1, and O_{t-1} represents the output result of the output layer at time t-1;
f represents an activation function including, but not limited to: sigmoid function, tanh function, etc.
sigmoid(x) = 1/(1 + e^(-x)); tanh(x) = (e^x - e^(-x))/(e^x + e^(-x))
Of course, in practical applications, other activation functions may be used.
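Purely as an illustration, the recurrence above can be written out in a few lines of NumPy. The single shared weight matrix W follows the formula exactly as given in the description (a conventional RNN would use separate input and recurrent weight matrices), which implies equal input and hidden dimensions here.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def rnn_step(x_t, s_prev, W, f=np.tanh):
        # S_t = X_t*W + S_{t-1}*W, O_t = f(S_t), as in the description
        s_t = x_t @ W + s_prev @ W
        o_t = f(s_t)  # f may be sigmoid, tanh, or another activation
        return s_t, o_t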
For example, the calculation formula of the LSTM speech recognition method may specifically include:
the LSTM can be divided into a forgetting gate, an input gate, and an output gate, corresponding to three calculations, and the formula of the calculation is as follows:
forget to remember the door ft=σ(ht-1*Xt+bf)。
An input gate:
it=σ(ht-1*Xt+bi)
C’t=tanh(ht-1*Xt+bc);
an output gate:
Ot=σ(ht-1*Xt+bO);
ht=Ot*tanh(Ct)。
wherein, Ct=Ct-1*ft+it*C’t
Above, bfDenotes ftThe offset of the function, the value being constant, and, similarly, bi、bc、boRespectively representing the offsets of the corresponding formulas; ct-1Is shown onCell output value, h, of t-1 at a timet-1Is the output value of the previous moment, XtRepresenting the input data at the current time t. Where σ denotes a sigmod function and tanh denotes a tanh function, which are activation functions. O istIndicating the output result of the output gate.
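The gate equations above can likewise be sketched in NumPy. The code reproduces the simplified elementwise form given in the description (h_{t-1} * X_t plus a bias); standard LSTM formulations instead apply learned weight matrices to h_{t-1} and X_t.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, b_f, b_i, b_c, b_o):
        f_t = sigmoid(h_prev * x_t + b_f)     # forget gate
        i_t = sigmoid(h_prev * x_t + b_i)     # input gate
        c_cand = np.tanh(h_prev * x_t + b_c)  # candidate cell state C'_t
        o_t = sigmoid(h_prev * x_t + b_o)     # output gate
        c_t = c_prev * f_t + i_t * c_cand     # C_t = C_{t-1}*f_t + i_t*C'_t
        h_t = o_t * np.tanh(c_t)              # h_t = O_t * tanh(C_t)
        return h_t, c_t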
For example, the method may further include:
the terminal device generates a control instruction corresponding to the correct text, as sketched below.
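A minimal sketch of such a mapping is shown below; the action names and the trigger phrase are entirely hypothetical and only illustrate the idea of turning the corrected text into a control instruction.

    # Hypothetical mapping from the corrected text to a control instruction;
    # the action names and trigger phrase are illustrative assumptions only.
    def to_control_instruction(correct_text):
        if correct_text.startswith("search for"):
            query = correct_text[len("search for"):].strip()
            return {"action": "search", "query": query}
        return {"action": "none", "text": correct_text}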
Referring to fig. 3, fig. 3 provides a speech recognition processing system comprising:
an acquisition unit 301 for acquiring voice data;
the processing unit 302 is configured to collect voice data by a terminal device, and determine n confidence rates of n words corresponding to each uttering group in the voice data by using a first voice recognition algorithm for the voice data; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm; determining characters corresponding to the highest confidence rates in the n confidence rates in each pronunciation as character information in the first text, and determining characters corresponding to the highest confidence rates in the m confidence rates in each pronunciation as character information in the second text; comparing the same syllables in the character information in the first text with the same syllables in the character information in the second text to determine that the syllables with different character contents are polyphone syllables, taking the syllables with the same character contents as the homophone syllables, if the polyphone syllables are one, extracting n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words, obtaining x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates, if the x words and the x' words have the same first words, determining that the first words are correct words of the polyphone syllables, and combining the correct words and the homophone syllables to obtain the correct text of the voice data.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a network device, the method flow shown in fig. 2 is implemented.
An embodiment of the present application further provides a computer program product, and when the computer program product runs on a terminal, the method flow shown in fig. 2 is implemented.
Embodiments of the present application also provide a terminal including a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs including instructions for performing the steps in the method of the embodiment shown in fig. 2.
The above description has introduced the solution of the embodiments of this application mainly from the perspective of the method-side implementation process. It will be appreciated that, in order to carry out the functions described above, the electronic device may comprise corresponding hardware structures and/or software modules for performing the respective functions. Those skilled in the art will readily appreciate that the various illustrative units and algorithm steps described in connection with the embodiments provided herein can be implemented as hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of this application.
In the embodiment of the present application, the electronic device may be divided into the functional units according to the method example, for example, each functional unit may be divided corresponding to each function, or two or more functions may be integrated into one processing unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit. It should be noted that the division of the unit in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
It should be noted that, for simplicity of description, the above method embodiments are described as a series of acts or combinations of acts, but those skilled in the art will recognize that this application is not limited by the order of the acts described, as some steps may occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules referred to are not necessarily required by this application.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a division of logical functions, and other divisions may be used in practice: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable memory. Based on such understanding, the technical solution of this application, in essence or in the part contributing to the prior art, in whole or in part, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods of the embodiments of this application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, or a magnetic or optical disk.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware, the program being stored in a computer-readable memory, which may include a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or the like.
The embodiments of this application have been described in detail above; specific examples are used herein to illustrate the principles and implementations of this application, and the above description of the embodiments is only intended to help understand the method and core concept of this application. Meanwhile, a person skilled in the art may, according to the idea of this application, make changes to the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting this application.

Claims (10)

1. A speech recognition processing method applied to a terminal device, wherein the method comprises the following steps:
the terminal device collects voice data and applies a first speech recognition algorithm to the voice data to determine n confidence rates of n words corresponding to each pronunciation group in the voice data, and applies a second speech recognition algorithm to determine m confidence rates of m words corresponding to each pronunciation group in the voice data;
the terminal device determines, for each pronunciation, the word corresponding to the highest of the n confidence rates as the character information in a first text, and the word corresponding to the highest of the m confidence rates as the character information in a second text;
the terminal device compares, syllable by syllable, the character information in the first text with the character information in the second text: syllables whose character content differs are determined to be polyphone syllables, and syllables whose character content is the same are homophone syllables; if there is one polyphone syllable, the n confidence rates of the n words corresponding to the polyphone syllable and the m confidence rates of the m words are extracted, and the x words corresponding to the top x of the n confidence rates and the x' words corresponding to the top x' of the m confidence rates are obtained; if the x words and the x' words share the same first word, the first word is determined to be the correct word for the polyphone syllable, and the correct word is combined with the words corresponding to the homophone syllables to obtain the correct text of the voice data.
2. The method of claim 1, further comprising:
if the x words and the x' words have no word in common, the terminal device determines a first identity for the voice information, computes matching degrees between the history information of the first identity and, respectively, the character information in the first text and the character information in the second text, and determines the character information with the highest matching degree to be the correct text.
3. The method of claim 1,
the first speech recognition algorithm is an LSTM speech recognition algorithm, and the second speech recognition algorithm is an RNN speech recognition algorithm.
4. The method of claim 3, wherein the RNN speech recognition algorithm has a computational formula comprising:
S_t = X_t × W + S_{t-1} × W
O_t = f(S_t)
where W represents a weight, X_{t-1} represents the input data of the input layer at time t-1, X_t represents the input data of the input layer at time t, S_{t-1} represents the output result of the hidden layer at time t-1, and O_{t-1} represents the output result of the output layer at time t-1;
f denotes an activation function.
5. The method according to claim 4, characterized in that said activation function comprises in particular:
a sigmoid function or a tanh function, where sigmoid(x) = 1/(1 + e^(-x)) and tanh(x) = (e^x - e^(-x))/(e^x + e^(-x)).
6. The method of claim 3, wherein the calculation formula of the LSTM speech recognition algorithm comprises:
a forget gate: f_t = σ(h_{t-1} * X_t + b_f);
an input gate:
i_t = σ(h_{t-1} * X_t + b_i)
C'_t = tanh(h_{t-1} * X_t + b_c);
an output gate:
O_t = σ(h_{t-1} * X_t + b_O);
h_t = O_t * tanh(C_t),
where C_t = C_{t-1} * f_t + i_t * C'_t;
and wherein b_f denotes the offset of the f_t function, and b_i, b_c, and b_O denote the offsets of the corresponding formulas; C_{t-1} denotes the cell output value at the previous time t-1, h_{t-1} is the output value at the previous time, X_t represents the input data at the current time t, σ denotes the sigmoid function, tanh denotes the tanh function, and O_t denotes the output result of the output gate.
7. A speech recognition processing system, the system comprising:
the acquisition unit is used for acquiring voice data;
the processing unit is used for acquiring voice data by the terminal equipment and determining n confidence rates of n words corresponding to each pronunciation group in the voice data by adopting a first voice recognition algorithm; determining m confidence rates of m words corresponding to each pronunciation group in the voice data by adopting a second voice recognition algorithm; determining characters corresponding to the highest confidence rates in the n confidence rates in each pronunciation as character information in the first text, and determining characters corresponding to the highest confidence rates in the m confidence rates in each pronunciation as character information in the second text; comparing the same syllables in the character information in the first text with the same syllables in the character information in the second text to determine that the syllables with different character contents are polyphone syllables, taking the syllables with the same character contents as the homophone syllables, if the polyphone syllables are one, extracting n confidence rates of n words corresponding to the polyphone syllables and m confidence rates of m words, obtaining x words corresponding to the first x confidence rates in the n confidence rates and x 'words corresponding to the first x confidence rates in the m confidence rates, if the x words and the x' words have the same first words, determining that the first words are correct words of the polyphone syllables, and combining the correct words and the homophone syllables to obtain the correct text of the voice data.
8. The system of claim 7,
the first speech recognition algorithm is an LSTM speech recognition algorithm, and the second speech recognition algorithm is an RNN speech recognition algorithm.
9. A terminal device, characterized in that the terminal device comprises: a processor, a memory, a communication interface, and one or more programs stored in the memory and configured to be executed by the processor, the programs comprising instructions for performing the steps of the method of any of claims 1-6.
10. A computer-readable storage medium storing a computer program for electronic data exchange, wherein the computer program causes a computer to perform the method according to any one of claims 1-6.
CN202110639813.5A 2021-06-08 2021-06-08 Voice recognition processing method and system Pending CN113593566A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110639813.5A CN113593566A (en) 2021-06-08 2021-06-08 Voice recognition processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110639813.5A CN113593566A (en) 2021-06-08 2021-06-08 Voice recognition processing method and system

Publications (1)

Publication Number Publication Date
CN113593566A true CN113593566A (en) 2021-11-02

Family

ID=78243558

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110639813.5A Pending CN113593566A (en) 2021-06-08 2021-06-08 Voice recognition processing method and system

Country Status (1)

Country Link
CN (1) CN113593566A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190042556A1 (en) * 2017-08-01 2019-02-07 International Business Machines Corporation Dynamic Homophone/Synonym Identification and Replacement for Natural Language Processing
CN109545203A (en) * 2018-12-14 2019-03-29 深圳壹账通智能科技有限公司 Audio recognition method, device, equipment and storage medium
CN110942763A (en) * 2018-09-20 2020-03-31 阿里巴巴集团控股有限公司 Voice recognition method and device
CN111125310A (en) * 2019-12-24 2020-05-08 北京百度网讯科技有限公司 Voice query method and device
CN111145756A (en) * 2019-12-26 2020-05-12 北京搜狗科技发展有限公司 Voice recognition method and device for voice recognition
CN111142681A (en) * 2018-11-06 2020-05-12 北京嘀嘀无限科技发展有限公司 Method, system, device and storage medium for determining pinyin of Chinese characters
CN111798834A (en) * 2020-07-03 2020-10-20 北京字节跳动网络技术有限公司 Method and device for identifying polyphone, readable medium and electronic equipment
CN112329926A (en) * 2020-11-30 2021-02-05 珠海采筑电子商务有限公司 Quality improvement method and system for intelligent robot
CN112435671A (en) * 2020-11-11 2021-03-02 深圳市小顺智控科技有限公司 Intelligent voice control method and system for accurately recognizing Chinese

Similar Documents

Publication Publication Date Title
JP5996783B2 (en) Method and terminal for updating voiceprint feature model
CN105719659A (en) Recording file separation method and device based on voiceprint identification
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN104834847A (en) Identity verification method and device
CN104462058B (en) Character string identification method and device
CN107221330A (en) Punctuate adding method and device, the device added for punctuate
CN103366743A (en) Voice-command operation method and device
CN110263131B (en) Reply information generation method, device and storage medium
CN112334978A (en) Electronic device supporting personalized device connection and method thereof
CN113270104B (en) Artificial intelligence processing method and system for voice
CN110931000B (en) Method and device for speech recognition
CN112017669B (en) Voice countercheck sample detection method and device, terminal equipment and storage medium
CN107483069A (en) Voice help-asking method, apparatus, equipment and the storage medium of Wearable
CN112329926A (en) Quality improvement method and system for intelligent robot
CN107371144B (en) Method and device for intelligently sending information
JP2024037831A (en) Voice terminal voice verification and restriction method
CN112528266A (en) Visiting registration method and related device
CN113593566A (en) Voice recognition processing method and system
CN115240638A (en) Audio generation method and device, storage medium and chip
CN114708849A (en) Voice processing method and device, computer equipment and computer readable storage medium
CN111899747B (en) Method and apparatus for synthesizing audio
CN115116458A (en) Voice data conversion method and device, computer equipment and storage medium
CN115019806A (en) Voiceprint recognition method and device
KR20220138669A (en) Electronic device and method for providing personalized audio information
CN103294370A (en) Method and equipment for triggering keystroke operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination