CN111128127A - Voice recognition processing method and device - Google Patents

Voice recognition processing method and device Download PDF

Info

Publication number
CN111128127A
CN111128127A (application CN201811197430.1A)
Authority
CN
China
Prior art keywords
voice
habit
user
feature model
habitual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811197430.1A
Other languages
Chinese (zh)
Inventor
张新
王慧君
秦萍
万会
毛跃辉
廖湖锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201811197430.1A priority Critical patent/CN111128127A/en
Publication of CN111128127A publication Critical patent/CN111128127A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a voice recognition processing method and device, wherein the method comprises the following steps: establishing a wireless connection with an intelligent household appliance, and acquiring, through the intelligent household appliance, a voice signal input by a user; performing voice feature extraction on the voice signal; determining, from a pre-stored habitual voice feature model database, a habitual voice feature model matching the voice features; and performing semantic recognition on the voice features according to the habitual voice feature model. The invention solves the problem in the related art that an intelligent household appliance recognizes a user's voice poorly when the user differs greatly from the training voice library; by establishing a habitual voice feature model for each user, it improves the appliance's recognition accuracy for the user's voice and thereby improves the user experience.

Description

Voice recognition processing method and device
Technical Field
The present invention relates to the field of communications, and in particular, to a method and an apparatus for speech recognition processing.
Background
Voice interaction is currently the main mode of human-computer interaction, but it still has many pain points, one of which is pronunciation standards. Owing to differences in dialect, physiology, psychology, lifestyle and the like, each person has his or her own pronunciation habits: speaking speed varies, the pauses between utterances vary, and these habits all affect the spectral characteristics of the speech, so that the performance of the recognition system is degraded.
For a frequent user whose pronunciation has long been non-standard, being misrecognized time after time is a painful experience. The existing solution is to train a speech library on big data and to use artificial intelligence for autonomous learning; however, if the speaker differs greatly from the training speech library, recognition performance deteriorates seriously, and current artificial intelligence is not as capable as imagined, with many unsolved problems in adapting to complex environments. Intelligent household appliances now support voice interaction with users, but poor recognition accuracy for individual users makes the voice interaction between appliance and user poor.
For the problem in the related art that the voice recognition effect of an intelligent household appliance on a user is poor when the user differs greatly from the training voice library, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a voice recognition processing method and a voice recognition processing device, which are used for at least solving the problem that in the related technology, when a user and a training voice library have a large difference, the voice recognition effect of an intelligent household appliance on the user is poor.
According to an embodiment of the present invention, there is provided a speech recognition processing method including:
establishing wireless connection with an intelligent household appliance, and acquiring a voice signal input by a user through the intelligent household appliance;
performing voice feature extraction on the voice signal;
determining a habit voice feature model matched with the voice feature from a pre-stored habit voice feature model database;
and performing semantic recognition on the voice features according to the habit voice feature model.
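The feature-extraction step is not specified further in this disclosure. As a minimal, hedged sketch of what such a step could look like (the frame length and the energy/zero-crossing feature set are illustrative assumptions, not the patent's actual features):

```python
def extract_features(signal, frame_len=160):
    """Split a voice signal into frames and compute two classic
    short-time features per frame: energy and zero-crossing rate.
    A stand-in for whatever feature set (e.g. MFCCs) the actual
    feature-extraction step uses."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        # Average energy of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Fraction of adjacent-sample pairs that change sign.
        zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / frame_len
        features.append((energy, zcr))
    return features
```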
Optionally, before the voice signal input by the user is collected by the smart appliance, the method further includes:
providing a training list for a user through a display interface of the mobile terminal;
acquiring and training voice input by a user according to the training list to obtain a habit voice feature model of the user;
and storing the habit speech feature model into the habit speech feature model database.
Optionally, the determining the habit speech feature model matching the speech feature from the pre-stored habit speech feature model database includes:
comparing the voice features with the habitual voice feature models in the habitual voice feature model database in sequence;
and determining the habitual voice feature model with the highest similarity as the habitual voice feature model matched with the voice features.
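The highest-similarity matching described above can be sketched as follows; the similarity measure is not specified in the disclosure, so cosine similarity is assumed purely for illustration, and the model database is represented as a simple name-to-template mapping:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def match_habit_model(voice_feature, model_database):
    # Compare the extracted feature against each stored habitual
    # voice feature model in sequence and keep the most similar one.
    best_name, best_score = None, -1.0
    for name, template in model_database.items():
        score = cosine_similarity(voice_feature, template)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score
```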
Optionally, after performing semantic recognition on the speech feature according to the habitual speech feature model, the method further includes:
converting the identified semantics corresponding to the voice features into control instructions;
and sending the control instruction to the intelligent household appliance for the intelligent household appliance to execute the operation corresponding to the control instruction.
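A toy illustration of converting recognized semantics into a control instruction; the command phrases and the instruction format below are invented for the example and are not taken from the patent:

```python
# Hypothetical mapping from recognized semantics to appliance commands.
COMMAND_TABLE = {
    "turn on the air conditioner": {"device": "ac", "action": "power_on"},
    "set temperature to 26": {"device": "ac", "action": "set_temp", "value": 26},
}

def semantics_to_instruction(semantics):
    # Normalize the recognized text and look up the matching instruction.
    instruction = COMMAND_TABLE.get(semantics.strip().lower())
    if instruction is None:
        raise ValueError(f"no control instruction for: {semantics!r}")
    return instruction
```

In a full system the instruction would then be sent to the appliance over the established wireless connection.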
Optionally, the habitual voice feature model includes an acoustic model and a voice model, wherein the voice model models full pronunciation variations through a multi-pronunciation dictionary, and the acoustic model models partial pronunciation variations through a context-independent partial-variation phone model.
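The multi-pronunciation dictionary mentioned above can be illustrated with a toy lexicon; the variant pronunciations (flat- vs. retroflex-tongue initials, r/l confusion) are assumed examples of habitual variation, not the patent's actual dictionary:

```python
# Toy multi-pronunciation dictionary: each word maps to its standard
# pinyin plus habitual variants.
MULTI_PRON_DICT = {
    "是": ["shi4", "si4"],   # retroflex "sh" often flattened to "s"
    "吃": ["chi1", "ci1"],
    "人": ["ren2", "len2"],  # r/l confusion in some dialects
}

def words_for_pronunciation(pron):
    # Return every word whose variant list contains the observed pronunciation.
    return [w for w, variants in MULTI_PRON_DICT.items() if pron in variants]
```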
According to another embodiment of the present invention, there is also provided a speech recognition processing apparatus including:
the system comprises an acquisition module, a processing module and a control module, wherein the acquisition module is used for establishing wireless connection with an intelligent household appliance and acquiring a voice signal input by a user through the intelligent household appliance;
the feature extraction module is used for extracting voice features of the voice signals;
the matching module is used for determining a habit voice feature model matched with the voice feature from a prestored habit voice feature model database;
and the recognition module is used for carrying out semantic recognition on the voice features according to the habit voice feature model.
Optionally, the apparatus further comprises:
the system comprises a providing module, a training module and a training module, wherein the providing module is used for providing a training list for a user through a display interface of the mobile terminal;
the training module is used for acquiring and training the voice input by the user according to the training list to obtain a habit voice feature model of the user;
and storing the habit speech feature model into the habit speech feature model database.
Optionally, the matching module comprises:
the comparison unit is used for comparing the voice features with the habitual voice feature models in the habitual voice feature model database in sequence;
and the determining unit is used for determining the habitual voice feature model with the highest similarity as the habitual voice feature model matched with the voice features.
Optionally, the apparatus further comprises:
the conversion module is used for converting the identified semantics corresponding to the voice features into control instructions;
and the sending module is used for sending the control instruction to the intelligent household appliance for the intelligent household appliance to execute the operation corresponding to the control instruction.
According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, as the wireless connection is established with the intelligent household appliance, the voice signal input by the user is collected through the intelligent household appliance; performing voice feature extraction on the voice signal; determining a habit voice feature model matched with the voice feature from a pre-stored habit voice feature model database; according to the habit voice feature model, the voice features are subjected to semantic recognition, so that the problem that in the related technology, when a user and a training voice library have a large difference, the voice recognition effect of the intelligent household appliance on the user is poor can be solved, the habit voice feature model is established for different users, the recognition precision of the intelligent household appliance on the voice of the user is improved, and the effect of improving the user experience is achieved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware configuration of a mobile terminal of a speech recognition processing method according to an embodiment of the present invention;
FIG. 2 is a flow diagram of a speech recognition processing method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of an intelligent home appliance voice recognition system according to an embodiment of the present invention;
fig. 4 is a flowchart of voice recognition of an intelligent appliance according to an embodiment of the present invention;
FIG. 5 is a block diagram of a Mandarin Chinese phonetic structure according to an embodiment of the present invention;
fig. 6 is a block diagram of a speech recognition processing apparatus according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Example 1
The method provided by the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a hardware structure block diagram of a mobile terminal of a speech recognition processing method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in fig. 1; the processors 102 may include, but are not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data; optionally, the mobile terminal may further include a transmission device 106 for a communication function and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in fig. 1, or have a different configuration from that shown in fig. 1.
The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the speech recognition processing method in the embodiment of the present invention; the processor 102 executes various functional applications and data processing, that is, implements the method described above, by running the computer program stored in the memory 104. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Based on the above-mentioned mobile terminal, the present embodiment provides a speech recognition processing method, and fig. 2 is a flowchart of a speech recognition processing method according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
step S202, establishing wireless connection with an intelligent household appliance, and acquiring a voice signal input by a user through the intelligent household appliance;
step S204, extracting voice characteristics of the voice signal;
step S206, determining a habit voice feature model matched with the voice feature from a pre-stored habit voice feature model database;
and S208, performing semantic recognition on the voice features according to the habit voice feature model.
Through the steps, as the wireless connection is established with the intelligent household appliance, the voice signal input by the user is collected through the intelligent household appliance; performing voice feature extraction on the voice signal; determining a habit voice feature model matched with the voice feature from a pre-stored habit voice feature model database; according to the habit voice feature model, the voice features are subjected to semantic recognition, so that the problem that in the related technology, when a user and a training voice library have a large difference, the voice recognition effect of the intelligent household appliance on the user is poor can be solved, the habit voice feature model is established for different users, the recognition precision of the intelligent household appliance on the voice of the user is improved, and the effect of improving the user experience is achieved.
In the embodiment of the invention, an interface of a training model is displayed to a user through a mobile terminal, and a training list is provided to the user through a display interface of the mobile terminal before a voice signal input by the user is acquired through the intelligent household appliance; acquiring and training voice input by a user according to the training list to obtain a habit voice feature model of the user; and storing the habit speech feature model into the habit speech feature model database.
Optionally, the determining, from a pre-stored habitual speech feature model database, a habitual speech feature model matching the speech feature may specifically include: comparing the voice characteristics with the habitual voice characteristic modules in the habitual voice characteristic model database in sequence; and determining the habitual voice feature model with the highest similarity as the habitual voice feature model matched with the voice features.
In the embodiment of the invention, after the mobile terminal recognizes the voice collected by the intelligent household appliance, that is, after it recognizes the semantics of that voice, the semantics are converted into a control instruction and sent to the intelligent household appliance, and the appliance executes the corresponding operation according to the instruction. Specifically, after semantic recognition is carried out on the voice features according to the habitual voice feature model, the recognized semantics corresponding to the voice features are converted into a control instruction; the control instruction is sent to the intelligent household appliance for the appliance to execute the corresponding operation, so that the appliance's recognition accuracy for the voice is improved.
Optionally, the habitual voice feature model includes an acoustic model and a voice model, wherein the voice model models full pronunciation variations through a multi-pronunciation dictionary, and the acoustic model models partial pronunciation variations through a context-independent partial-variation phone model. The habitual speech feature model is trained as follows: the training results are acquired, and signal processing and feature extraction are performed on the voice, covering segment types such as silence, bursts, noise, aspiration, transitions, onsets, nuclei, and syllable beginnings and ends; an acoustic model and a voice model of the user are then obtained, and the user's habitual voice feature model is established, laying the groundwork for better serving different types of users.
The user performs sound-production test training on the APP (for example, simulated dialogue, reading aloud and the like) to obtain a voice library based on the user's own pronunciation-habit models, such as flat-tongue versus retroflex pronunciation, pause habits, accent, light tones and intonation. When the user performs voice interaction, the server performs speaker recognition and semantic correction according to the user's pronunciation habits and the voice library, thereby ensuring the accuracy of the control instruction.
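As a deliberately simplified stand-in for the training step described above, a user's habitual model could be enrolled as the average of per-utterance feature vectors (an assumption made for illustration; the patent's actual training of acoustic and voice models is more involved):

```python
def train_habit_model(utterance_features):
    # Average the per-utterance feature vectors into a single template
    # representing the user's habitual voice features.
    n = len(utterance_features)
    dim = len(utterance_features[0])
    return [sum(vec[i] for vec in utterance_features) / n for i in range(dim)]
```

The resulting template would then be stored in the habitual voice feature model database for later matching.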
The method avoids misrecognition caused by users' non-standard pronunciation habits, avoids the poor effect, high cost and long duration of machine self-learning, and solves the problem that a specific user is difficult to recognize amid noise in a multi-user scenario.
Fig. 3 is a schematic diagram of a voice recognition system for an intelligent home appliance according to an embodiment of the present invention. As shown in fig. 3, the mobile terminal establishes a connection with the intelligent household appliance through an APP to control the appliance; the appliance is provided with a voice recognition system and a communication module, establishes communication with the mobile terminal through the communication module, and transmits the collected voice to the mobile terminal. The mobile terminal stores the habitual voice feature model obtained by training into a server for use by other devices; alternatively, the mobile terminal transmits the collected voice features of the user to the server, which runs the relevant algorithm and trains the user's habitual voice feature model from them.
Fig. 4 is a flowchart of voice recognition of an intelligent home appliance according to an embodiment of the present invention, as shown in fig. 4, including the following steps:
step S402, a user conducts pronunciation habit test training on the APP, and the user habit voice library is customized based on the test result.
Differences in the vocal organs are reflected in complex forms in the waveform of a speaker's voice, so every person's voice carries a strong personal color; speaker recognition can therefore be performed on the voice, and the parsed semantics can be corrected accordingly.
The pronunciation-habit test training content is as follows. Fig. 5 is a frame diagram of the Mandarin Chinese phonetic structure according to an embodiment of the present invention; as shown in fig. 5, a high-dimensional feature vector composed of several characteristic phones is selected as training content according to the syllable-structure framework of Mandarin Chinese, and a training list is generated (which may take the form of machine-to-human conversation, reading aloud, practicing basic control commands, and the like). The training compares the user's habitual pronunciation with the standard pronunciation to obtain the differences. Specifically, the user narrates or reads, in a normal speaking voice, targeted material covering silence, bursts, noise, aspiration, transitions, onsets, nuclei, and syllable beginnings and ends; the user's pronunciation-habit training is then performed on the content the user has read.
For example: a scenario is set up as a dialogue, where A is the content pushed by the APP and B is the content the user needs to read; or the user reads tongue twisters aloud, and so on.
In step S404, the pronunciation-habit test training proceeds as follows: the user speaks the corpus items (each word and each sentence) in the training list in turn in his or her normal manner; a sound-spectrum library for the speaker is established through feature-vector extraction, and the speaker is identified according to a given decision rule.
And step S406, the server identifies the speaker based on the user pronunciation habit voice spectrum model, and if the speaker is judged to be the user, the current voice identification result of the user is corrected.
In the recognition stage, the feature vector of the input voice is compared in turn for similarity with each template in the pronunciation-habit voice library, and the most similar template is taken as the user-identification result (if the similarity threshold is not reached, the voice is considered to belong to no enrolled user, and no semantic correction is performed). Semantic analysis then yields a first semantics, and the semantic result is corrected according to the pronunciation habits in the pronunciation-habit voice library to obtain a second semantics. The server converts the second semantics into a control command and sends it to the corresponding device for execution.
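The two-stage recognition described here (speaker identification with a similarity threshold, then habit-based semantic correction from a first semantics to a second semantics) can be sketched as follows; the cosine measure, the threshold value, and the rule table are illustrative assumptions:

```python
import math

def _cos(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def recognize_and_correct(feature, habit_library, correction_rules,
                          first_semantics, threshold=0.8):
    # Stage 1: speaker identification against the pronunciation-habit library.
    best_user, best_score = None, -1.0
    for user, template in habit_library.items():
        score = _cos(feature, template)
        if score > best_score:
            best_user, best_score = user, score
    if best_score < threshold:
        # Below threshold: treat as an unknown speaker, skip correction.
        return first_semantics
    # Stage 2: correct the first semantics using this user's habit rules
    # to obtain the second semantics.
    rules = correction_rules.get(best_user, {})
    return rules.get(first_semantics, first_semantics)
```

In a full system, the returned second semantics would then be converted into a control command and dispatched to the target device.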
For example: the first semantic result is "play.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
Example 2
In this embodiment, a speech recognition processing apparatus is further provided, and the apparatus is used to implement the foregoing embodiments and preferred embodiments, and the description already made is omitted. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 6 is a block diagram of a speech recognition processing apparatus according to an embodiment of the present invention, as shown in fig. 6, including:
the acquisition module 62 is used for establishing wireless connection with the intelligent household appliance and acquiring a voice signal input by a user through the intelligent household appliance;
a feature extraction module 64, configured to perform voice feature extraction on the voice signal;
a matching module 66, configured to determine a habit speech feature model matching the speech feature from a pre-stored habit speech feature model database;
and the recognition module 68 is used for performing semantic recognition on the voice features according to the habit voice feature model.
Optionally, the apparatus further comprises:
the system comprises a providing module, a training module and a training module, wherein the providing module is used for providing a training list for a user through a display interface of the mobile terminal;
the training module is used for acquiring and training the voice input by the user according to the training list to obtain a habit voice feature model of the user;
and storing the habit speech feature model into the habit speech feature model database.
Optionally, the matching module comprises:
the comparison unit is used for comparing the voice features with the habitual voice feature models in the habitual voice feature model database in sequence;
and the determining unit is used for determining the habitual voice feature model with the highest similarity as the habitual voice feature model matched with the voice features.
Optionally, the apparatus further comprises:
the conversion module is used for converting the identified semantics corresponding to the voice features into control instructions;
and the sending module is used for sending the control instruction to the intelligent household appliance for the intelligent household appliance to execute the operation corresponding to the control instruction.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Example 3
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:
s11, establishing wireless connection with the intelligent household appliance, and acquiring voice signals input by a user through the intelligent household appliance;
s12, extracting voice characteristics of the voice signal;
s13, determining a habit voice feature model matched with the voice feature from a prestored habit voice feature model database;
and S14, performing semantic recognition on the voice features according to the habit voice feature model.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Example 4
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
s11, establishing wireless connection with the intelligent household appliance, and acquiring voice signals input by a user through the intelligent household appliance;
s12, extracting voice characteristics of the voice signal;
s13, determining a habit voice feature model matched with the voice feature from a prestored habit voice feature model database;
and S14, performing semantic recognition on the voice features according to the habit voice feature model.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the principle of the present invention shall fall within its protection scope.

Claims (10)

1. A speech recognition processing method, comprising:
establishing a wireless connection with a smart home appliance, and acquiring a voice signal input by a user through the smart home appliance;
performing voice feature extraction on the voice signal;
determining, from a pre-stored habitual speech feature model database, a habitual speech feature model matching the voice features; and
performing semantic recognition on the voice features according to the habitual speech feature model.
2. The method according to claim 1, wherein before acquiring the voice signal input by the user through the smart home appliance, the method further comprises:
providing a training list to the user through a display interface of a mobile terminal;
acquiring the voice input by the user according to the training list, and training on the acquired voice to obtain a habitual speech feature model of the user; and
storing the habitual speech feature model into the habitual speech feature model database.
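As a hedged sketch of the training flow in claim 2 (the patent does not specify what form the trained model takes), the "habitual speech feature model" below is simply the element-wise mean of the user's per-utterance feature vectors; the function and variable names are hypothetical.

```python
def train_habitual_model(utterance_features):
    """Train a habitual speech feature model from a user's training
    utterances, here modeled as the element-wise mean of their feature
    vectors -- a placeholder for whatever modeling is actually used.
    """
    n = len(utterance_features)
    dim = len(utterance_features[0])
    return [sum(u[i] for u in utterance_features) / n for i in range(dim)]

# Store the trained model into a (hypothetical) habitual model database.
model_db = {}
model_db["user_a"] = train_habitual_model([[0.8, 0.2], [1.0, 0.0]])
```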
3. The method according to claim 1, wherein determining a habitual speech feature model matching the voice features from a pre-stored habitual speech feature model database comprises:
comparing the voice features with the habitual speech feature models in the habitual speech feature model database in sequence; and
determining the habitual speech feature model with the highest similarity as the habitual speech feature model matching the voice features.
4. The method according to claim 1, wherein after performing semantic recognition on the voice features according to the habitual speech feature model, the method further comprises:
converting the recognized semantics corresponding to the voice features into a control instruction; and
sending the control instruction to the smart home appliance, so that the smart home appliance executes the operation corresponding to the control instruction.
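Claim 4's semantics-to-instruction conversion can be illustrated with a simple lookup table. The instruction format, phrases, and function name below are all assumptions; the patent does not specify how control instructions are encoded or transmitted.

```python
# Hypothetical mapping from recognized semantics to appliance control
# instructions; the patent does not fix an instruction format.
SEMANTIC_TO_COMMAND = {
    "turn on the air conditioner": {"device": "ac", "action": "power_on"},
    "set temperature to 26": {"device": "ac", "action": "set_temp", "value": 26},
}

def to_control_instruction(semantics):
    """Convert recognized semantics into a control instruction."""
    command = SEMANTIC_TO_COMMAND.get(semantics)
    if command is None:
        raise ValueError(f"no instruction mapped for: {semantics!r}")
    return command

cmd = to_control_instruction("turn on the air conditioner")
```

The resulting instruction would then be sent to the smart home appliance over the established wireless connection.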
5. The method according to any one of claims 1 to 4, wherein the habitual speech feature model comprises an acoustic model and a language model, wherein the language model models full pronunciation variations through a multi-pronunciation dictionary, and the acoustic model models partial pronunciation variations through a context-independent partial-variation phone model.
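A multi-pronunciation dictionary of the kind claim 5 describes can be sketched as a word-to-variants mapping, where each word lists every pronunciation variant observed for it. The entries and phone symbols below are illustrative only and do not come from the patent.

```python
# Minimal multi-pronunciation dictionary sketch: each word maps to a list
# of pronunciation variants (sequences of hypothetical phone symbols).
multi_pron_dict = {
    "hello": [["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]],
}

def pronunciations(word):
    """Return all known pronunciation variants for a word."""
    return multi_pron_dict.get(word, [])

variants = pronunciations("hello")
```

Modeling full pronunciation variation then amounts to letting the recognizer consider any of a word's listed variants during decoding.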
6. A speech recognition processing apparatus, comprising:
an acquisition module, configured to establish a wireless connection with a smart home appliance and acquire a voice signal input by a user through the smart home appliance;
a feature extraction module, configured to perform voice feature extraction on the voice signal;
a matching module, configured to determine, from a pre-stored habitual speech feature model database, a habitual speech feature model matching the voice features; and
a recognition module, configured to perform semantic recognition on the voice features according to the habitual speech feature model.
7. The apparatus according to claim 6, further comprising:
a providing module, configured to provide a training list to the user through a display interface of a mobile terminal; and
a training module, configured to acquire the voice input by the user according to the training list, train on the acquired voice to obtain a habitual speech feature model of the user, and store the habitual speech feature model into the habitual speech feature model database.
8. The apparatus according to claim 6, wherein the matching module comprises:
a comparison unit, configured to compare the voice features with the habitual speech feature models in the habitual speech feature model database in sequence; and
a determining unit, configured to determine the habitual speech feature model with the highest similarity as the habitual speech feature model matching the voice features.
9. A storage medium in which a computer program is stored, wherein the computer program, when executed, performs the method of any one of claims 1 to 5.
10. An electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the method of any one of claims 1 to 5.
CN201811197430.1A 2018-10-15 2018-10-15 Voice recognition processing method and device Pending CN111128127A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811197430.1A CN111128127A (en) 2018-10-15 2018-10-15 Voice recognition processing method and device

Publications (1)

Publication Number Publication Date
CN111128127A true CN111128127A (en) 2020-05-08

Family

ID=70483899

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811197430.1A Pending CN111128127A (en) 2018-10-15 2018-10-15 Voice recognition processing method and device

Country Status (1)

Country Link
CN (1) CN111128127A (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120084086A1 (en) * 2010-09-30 2012-04-05 At&T Intellectual Property I, L.P. System and method for open speech recognition
CN104185868A (en) * 2012-01-24 2014-12-03 澳尔亚有限公司 Voice authentication and speech recognition system and method
CN106997762A (en) * 2017-03-08 2017-08-01 广东美的制冷设备有限公司 The sound control method and device of household electrical appliance


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349294A (en) * 2020-10-22 2021-02-09 腾讯科技(深圳)有限公司 Voice processing method and device, computer readable medium and electronic equipment
CN112349294B (en) * 2020-10-22 2024-05-24 腾讯科技(深圳)有限公司 Voice processing method and device, computer readable medium and electronic equipment
CN113160822A (en) * 2021-04-30 2021-07-23 北京百度网讯科技有限公司 Speech recognition processing method, speech recognition processing device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US10832686B2 (en) Method and apparatus for pushing information
US10978047B2 (en) Method and apparatus for recognizing speech
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN107644638B (en) Audio recognition method, device, terminal and computer readable storage medium
US10629186B1 (en) Domain and intent name feature identification and processing
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
US8510103B2 (en) System and method for voice recognition
US20170140750A1 (en) Method and device for speech recognition
CN110047481B (en) Method and apparatus for speech recognition
CN109949071A (en) Products Show method, apparatus, equipment and medium based on voice mood analysis
CN110970018B (en) Speech recognition method and device
CN104575504A (en) Method for personalized television voice wake-up by voiceprint and voice identification
CN105895103A (en) Speech recognition method and device
CN107871499B (en) Speech recognition method, system, computer device and computer-readable storage medium
CN102404278A (en) Song request system based on voiceprint recognition and application method thereof
CN110704590B (en) Method and apparatus for augmenting training samples
CN110111778B (en) Voice processing method and device, storage medium and electronic equipment
CN111986675A (en) Voice conversation method, device and computer readable storage medium
CN110782896A (en) Measuring instrument testing system and method based on voice control
CN111343028A (en) Distribution network control method and device
CN108922522B (en) Device control method, device, storage medium, and electronic apparatus
CN109074809B (en) Information processing apparatus, information processing method, and computer-readable storage medium
CN110931018A (en) Intelligent voice interaction method and device and computer readable storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111128127A (en) Voice recognition processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination