CN111508491A - Intelligent voice interaction equipment based on deep learning - Google Patents


Info

Publication number: CN111508491A
Application number: CN202010307735.4A
Authority: CN (China)
Prior art keywords: voice, module, information, target, content
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 罗东华, 鲁娜, 董善志
Current assignee: Shandong Media Vocational College (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Shandong Media Vocational College
Application filed by Shandong Media Vocational College
Priority to CN202010307735.4A
Publication of CN111508491A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223: Execution procedure of a spoken command
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/22: Interactive procedures; Man-machine interfaces

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent voice interaction device based on deep learning that has two states: a wake-up mode and a non-wake-up mode. It comprises: a voice acquisition module for acquiring sound information in real time; a voice preprocessing module, connected to the voice acquisition module, for filtering noise from the sound information to obtain target speech; a voice recognition module for recognizing the target speech in the wake-up mode to obtain target content; a retrieval module, connected to both the voice recognition module and a storage module in which response sentences are pre-stored, for acquiring response content according to the target content in the wake-up mode; and an output module, connected to the retrieval module, for acquiring and outputting the response content in the wake-up mode. The device enters the non-wake-up mode when no content is being output and no target speech is detected within a set time. The invention can pick up sound in real time and accurately capture external valid speech even while voice is being output.

Description

Intelligent voice interaction equipment based on deep learning
Technical Field
The invention relates to the technical field of voice interaction, and in particular to an intelligent voice interaction device based on deep learning.
Background
With the continuous development of artificial intelligence, speech recognition technology has made remarkable progress and has begun to move from the laboratory to the market. It has entered fields such as industry, home appliances, communications, automotive electronics, medical care, home services, and consumer electronics.
In the prior art, smart speakers have steadily entered people's lives, such as the Siri system, the Xiaomi smart speaker, the Nano smart speaker, and various voice-interactive children's toys. Although these systems can complete a basic interactive process, they share a drawback: before speech is input, the system must be put into a specific listening state, either manually or by a specific wake-up word, before the speech can be successfully recognized. This makes the interaction process inconvenient.
Disclosure of Invention
An object of the present invention is to provide an intelligent voice interaction device based on deep learning that remedies these deficiencies in the prior art: the device picks up sound in real time, so external valid speech can be accurately captured even while voice is being output, making the voice interaction process more intelligent.
The invention provides an intelligent voice interaction device based on deep learning, wherein:
the intelligent voice interaction device has two states, a wake-up mode and a non-wake-up mode; the intelligent voice interaction device comprises:
a voice acquisition module for acquiring sound information in real time;
a voice preprocessing module, connected to the voice acquisition module, for acquiring the sound information and filtering noise from it to obtain target speech; in the non-wake-up mode it judges whether the target speech is a set wake-up word, and if so the device enters the wake-up mode, otherwise it remains in the non-wake-up mode;
a voice recognition module for recognizing the target speech in the wake-up mode to obtain target content;
a retrieval module, connected to both the voice recognition module and a storage module in which response sentences are pre-stored, for acquiring response content in the wake-up mode, either from the storage module or from a network, according to the target content;
an output module, connected to the retrieval module, for acquiring and outputting the response content in the wake-up mode;
and the intelligent voice interaction device enters the non-wake-up mode when no content is being output and no sound information is acquired within a set time.
The intelligent voice interaction device based on deep learning as described above optionally further comprises a mode control module electrically connected to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively;
the mode control module is used for obtaining mode information and sending the current mode information to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively;
in the non-wake-up mode, when the voice preprocessing module judges that the target speech is the set wake-up word, the mode control module generates a wake-up state identifier and outputs it to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively;
in the wake-up mode, the mode control module records the time at which response content finishes being output and monitors in real time whether the voice preprocessing module obtains target content; if no target content is obtained within the set time, it generates a non-wake-up state identifier and outputs it to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively.
The intelligent voice interaction device based on deep learning as described above optionally further comprises a voiceprint processing module and an identity marking module;
the voiceprint processing module is electrically connected to the voice preprocessing module and the identity marking module respectively, and the identity marking module is connected to the retrieval module. When target speech whose content matches the preset wake-up word is acquired, the voiceprint processing module extracts its voiceprint information and searches the identity marking module for a corresponding identity file. If one exists, it associates the identity file with the retrieval module so that the output content matches the identity file, and stores content related to the speaker's preferences, gathered during the interaction, into the identity file corresponding to the voiceprint; if none exists, it creates an identity file corresponding to the voiceprint information in the identity marking module.
In the intelligent voice interaction device based on deep learning as described above, optionally, when acquiring sound information, once a pause in the sound reaches a set time the voice acquisition module marks the sound acquired so far as first sound information and sends it to the voice preprocessing module, and the first target speech is obtained from the first sound information; the voice acquisition module continues acquiring the subsequent sound and records it as second sound information;
the voice recognition module recognizes the first target speech to obtain first target content and judges whether it is complete information: if it is complete, the first target content is used directly; if it is incomplete, the first and second target speech are combined into a whole target speech, which is recognized to obtain the whole target content.
In the intelligent voice interaction device based on deep learning as described above, optionally, the voice preprocessing module is also electrically connected to the output module;
when the output module outputs voice:
the voice preprocessing module acquires the sound information from the voice acquisition module and the to-be-filtered sound from the output module, filters the to-be-filtered sound out of the sound information to obtain third sound information, and recognizes the third sound information; it then judges whether the third sound information is valid sound information, and if so, controls the output module to stop the current output and to output the response content obtained on the basis of the third sound information; if not, the current output continues.
In the intelligent voice interaction device based on deep learning as described above, optionally, the response content based on the third sound information is obtained as follows:
the voice recognition module recognizes the third sound information to obtain interruption target content;
and the retrieval module acquires response content from the storage module, or from a network, according to the interruption target content.
The intelligent voice interaction device based on deep learning as described above optionally further comprises a history association module;
the history association module is electrically connected to the retrieval module and the voice recognition module respectively;
the history association module is emptied upon entering the wake-up mode; it acquires the response content retrieved by the retrieval module in the wake-up mode and records that response content;
and the retrieval module obtains from the history association module the historical information related to the target content and acquires the response content according to the historical information and the target content.
In the intelligent voice interaction device based on deep learning as described above, optionally, the history association module also deletes the corresponding response content from its records after the output of that response content is interrupted.
In the intelligent voice interaction device based on deep learning as described above, optionally, the voice acquisition module comprises a microphone and the output module comprises a speaker.
In the intelligent voice interaction device based on deep learning as described above, optionally, the voice recognition module recognizes speech based on a deep neural network.
Compared with the prior art, the voice acquisition module acquires sound information in real time, so the interaction device can recognize valid sound information in both the wake-up and non-wake-up modes. In the wake-up mode, the user does not need to add a specific wake-up word before each sentence, so the interaction is freer and more natural, improving the intelligence of the device. Moreover, because sound is acquired in real time, valid sound information can be accurately recognized even while the device is outputting voice, allowing the device to be interrupted mid-output and making communication more efficient and fluent.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is an overall block diagram of the present invention;
FIG. 2 is a flow chart of the steps in use of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
Referring to fig. 1, the present invention provides an intelligent voice interaction device based on deep learning.
The intelligent voice interaction device has two states: a wake-up mode and a non-wake-up mode. Specifically, in the wake-up mode the device can recognize all kinds of voice information, while in the non-wake-up mode it can recognize only specific wake-up words, such as "small A", "small B", "power on", or "lovely"; these wake-up words can be set as required, and the device in the non-wake-up mode will recognize speech only after recognizing one of them. The specific wake-up words are recognized in both the wake-up and non-wake-up modes, which ensures that once the device is awake, voice communication can proceed directly without the user adding a wake-up word before each utterance.
Specifically, the intelligent voice interaction device comprises:
The voice acquisition module is used for acquiring sound information in real time. In a specific implementation, the voice acquisition module includes a plurality of microphones distributed around the periphery of the voice interaction device. With this arrangement, sound information can be acquired at any time.
The voice preprocessing module is connected to the voice acquisition module and is used for acquiring the sound information and filtering noise from it to obtain target speech. In the non-wake-up mode it judges whether the target speech is a set wake-up word: if so, the device enters the wake-up mode; if not, it remains in the non-wake-up mode. In the wake-up mode, the target speech is output directly to the voice recognition module without judging whether it is the set wake-up word. Judging the wake-up word only in the non-wake-up mode means that, once the device is awake, voice communication can proceed without a wake-up word before each sentence, which makes communication more convenient and efficient and better matches human communication habits.
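As an illustrative sketch (not part of the patent text), the preprocessing module's mode gate can be modeled as follows; the wake-word list and the function name are assumptions chosen for clarity:

```python
# Hypothetical mode gate of the voice preprocessing module: in the
# non-wake-up mode only a set wake word is matched; in the wake-up mode
# all target speech passes straight through to recognition.
WAKE_WORDS = {"small a", "small b", "power on"}  # configurable, per the text

def route_target_speech(text, awake):
    """Return (new_awake_state, text_to_recognize or None)."""
    if awake:
        # Wake-up mode: pass all target speech to the recognition module.
        return True, text
    if text.strip().lower() in WAKE_WORDS:
        # Non-wake-up mode: a set wake word wakes the device but is not
        # itself treated as content to recognize.
        return True, None
    return False, None  # stay in the non-wake-up mode, discard the speech
```

Note how the wake word itself is consumed by the gate rather than forwarded, matching the text's statement that recognition operates only in the wake-up mode.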
The voice recognition module is used for recognizing the target speech in the wake-up mode to obtain target content. That is, the voice recognition module operates only in the wake-up mode and is idle in the non-wake-up mode. Specifically, the module recognizes speech based on a deep learning network and may be deployed either on the intelligent voice interaction device or on a third-party server; it is electrically connected to the voice preprocessing module. Indeed, both the voice preprocessing module and the voice recognition module may be placed on a third-party server; whether a module resides on the interaction device or on a third-party server is acceptable as long as it fulfills its role.
The retrieval module is connected to both the voice recognition module and a storage module in which response sentences are pre-stored, and is used for acquiring response content in the wake-up mode, either from the storage module or from a network, according to the target content. With the retrieval module, massive information can be quickly drawn from the target network or the storage module, and the selected matching content serves as the response content.
The output module is connected to the retrieval module and is used for acquiring and outputting the response content in the wake-up mode. Specifically, the output module further includes a voice conversion module that converts the response content into voice information for output; if the response content is video information, it may also be output directly as needed, for example when a display is present. In a specific implementation the output module includes a speaker, and a display may be added as required.
The intelligent voice interaction device enters the non-wake-up mode when no content is being output and no sound information is acquired within a set time. Of course, the device may also enter the non-wake-up mode after receiving an explicit instruction to switch to it.
In specific use, referring to fig. 2, when the interaction device is in the non-wake-up mode, sound information is first acquired and preprocessed. The device then judges whether to enter the wake-up mode: if so, it enters the wake-up mode; if not, it remains in the non-wake-up mode.
After entering the wake-up mode, when sound information is acquired it is preprocessed, the speech is recognized to obtain target content, response content is acquired according to the target content, and the response content is output. If sound information is acquired again, this process repeats; note that the new sound information need not arrive only after the response content has been completely output.
In the wake-up mode, the user does not need to prefix each sentence with a specific wake-up word, so the interaction is freer and more natural, improving the intelligence of the interaction device.
As a preferred implementation, the interaction device further comprises a mode control module electrically connected to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively. Specifically, when the voice recognition module resides on a third-party server and the device is in the non-wake-up state, the device is controlled not to send data to the voice recognition module over the network.
Specifically, the mode control module obtains the mode information and sends the current mode information to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively. The mode control module thus guarantees that all modules receive uniform mode information, ensuring the whole interaction device is in the same mode at the same time.
In the non-wake-up mode, the mode control module generates a wake-up state identifier when the voice preprocessing module judges that the target speech is the set wake-up word, and outputs the identifier to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively. After receiving the wake-up state identifier, each module enters the wake-up state and stays there until it receives a non-wake-up state identifier.
In the wake-up mode, the mode control module records the time at which response content finishes being output and monitors in real time whether the voice preprocessing module obtains new target content; if no target content is obtained within the set time, it generates a non-wake-up state identifier and outputs it to the voice preprocessing module, the voice recognition module, the retrieval module, and the output module respectively. After receiving the non-wake-up state identifier, each module enters the non-wake-up state until a wake-up state identifier is next received. Additionally, for convenient voice control, the mode control module may, in the wake-up state, generate a non-wake-up state identifier upon receiving an instruction to enter the non-wake-up state, and output it to the same four modules.
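The broadcasting and timeout behavior described above can be sketched as follows. This is a minimal illustration under assumed names (`ModeController`, `set_awake`), not the patent's implementation:

```python
import time

class ModeController:
    """Sketch of the mode control module: broadcasts a single wake/non-wake
    state identifier to every registered module and times out back to the
    non-wake-up mode after a period with no target content."""

    def __init__(self, modules, timeout_s=30.0):
        self.modules = modules          # objects exposing set_awake(bool)
        self.timeout_s = timeout_s      # the "set time" of the patent text
        self.awake = False
        self.last_activity = time.monotonic()

    def _broadcast(self):
        # Distribute the current state identifier to all connected modules,
        # so the whole device is in the same mode at the same time.
        for m in self.modules:
            m.set_awake(self.awake)

    def on_wake_word(self):
        # Non-wake-up mode: the preprocessing module reported the wake word.
        self.awake = True
        self.last_activity = time.monotonic()
        self._broadcast()

    def on_target_content(self):
        # Wake-up mode: new target content keeps the device awake.
        self.last_activity = time.monotonic()

    def tick(self):
        # Called periodically; emits the non-wake identifier on timeout.
        if self.awake and time.monotonic() - self.last_activity > self.timeout_s:
            self.awake = False
            self._broadcast()
```

Routing all state changes through one broadcaster mirrors the text's point that uniform mode information prevents modules from disagreeing about the current mode.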
Existing voice interaction devices cannot tailor the interaction by identifying the speaker. In particular, for a device used in a household, different people may have different needs, and it is difficult to give everyone in the family a satisfactory interaction experience. To solve this problem, the intelligent voice interaction device of the invention further comprises a voiceprint processing module and an identity marking module. Specifically, the voiceprint processing module is electrically connected to the voice preprocessing module and the identity marking module respectively, and the identity marking module is connected to the retrieval module. When target speech whose content matches the preset wake-up word is acquired, the voiceprint processing module extracts its voiceprint information and searches the identity marking module for a corresponding identity file. If one exists, it associates that identity file with the retrieval module so that the output content matches the identity file, and stores content related to the speaker's preferences, gathered during the interaction, into the identity file corresponding to the voiceprint; if none exists, it creates an identity file corresponding to the voiceprint information in the identity marking module. Here, one interaction spans from entering the wake-up mode to entering the non-wake-up mode. With the identity marking module, the preferences and frequently used instructions of different users can be added to their corresponding identity files.
In subsequent use, better-matching content can then be selected according to the records in the identity file, making the whole interaction more efficient and more satisfying to the user. For example, by recording the categories of songs a user plays frequently, the device can select and play a song of the corresponding category the next time it receives that user's instruction to play music. This makes the interaction process more intelligent.
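A minimal sketch of the identity marking module follows. It is an assumption for illustration: the voiceprint is reduced to a hashable key, whereas a real device would compute a speaker embedding from the audio:

```python
class IdentityStore:
    """Sketch of the identity marking module: maps a voiceprint to an
    identity file holding the speaker's preference records."""

    def __init__(self):
        self.files = {}  # voiceprint -> identity file (preference record)

    def lookup_or_create(self, voiceprint):
        """Return the identity file for this voiceprint; create it if absent,
        mirroring the 'if none exists, create one' branch in the text."""
        if voiceprint not in self.files:
            self.files[voiceprint] = {"preferences": []}
        return self.files[voiceprint]

    def record_preference(self, voiceprint, item):
        # Store content related to this speaker's preferences gathered
        # during the interaction.
        self.lookup_or_create(voiceprint)["preferences"].append(item)
```

The retrieval module would then consult the returned identity file when ranking candidate responses, e.g. preferring song categories recorded for that speaker.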
As a preferred implementation, when acquiring sound information, once a pause in the sound reaches a set time the voice acquisition module marks the sound acquired so far as first sound information and sends it to the voice preprocessing module, and the first target speech is obtained from the first sound information; the voice acquisition module continues acquiring the subsequent sound and records it as second sound information. In specific implementations the set time may be, for example, 0.5 seconds, 1 second, or 1.2 seconds. Because the voice acquisition module acquires sound continuously, deciding where an utterance ends becomes an important problem. Splitting the sound information at pauses helps capture the speaker's meaning, but since a speaker may pause mid-sentence, misjudgment is possible during recognition; the following mechanism addresses this. The voice recognition module recognizes the first target speech to obtain first target content and judges whether it is complete information: if complete, the first target content is used directly; if incomplete, the first and second target speech are combined into a whole target speech, which is then recognized to obtain the whole target content. This shortens recognition time on the one hand, and on the other hand avoids recognition failure when the speaker pauses for a long time while speaking. Specifically, completeness may be judged by whether a complete sentence meaning can be recognized.
That is, the first and second target speech are combined when the meaning of the first target content cannot be recognized.
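The completeness check and merge step above can be sketched as follows; the `is_complete` predicate is a stand-in (an assumption) for the recognizer's judgment of whether a complete sentence meaning was recognized:

```python
# Hypothetical sketch of segment merging: use the first pause-delimited
# segment if its meaning is complete, otherwise merge it with the next
# segment and recognize the whole utterance.
def recognize_with_merge(first_content, second_content, is_complete):
    """Return the content to act on, merging segments when needed."""
    if is_complete(first_content):
        return first_content                       # complete: use directly
    return first_content + " " + second_content    # incomplete: combine
```

In practice `is_complete` would be backed by the deep-learning recognizer; here any callable works, which keeps the control flow testable in isolation.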
Because the voice acquisition module acquires sound in real time, while the output module is outputting voice the acquired sound information includes the device's own output. To eliminate the influence of the device's output sound, the voice preprocessing module is also electrically connected to the output module. While the output module outputs voice, the voice preprocessing module acquires the sound information from the voice acquisition module and the to-be-filtered sound from the output module, filters the to-be-filtered sound out of the sound information to obtain third sound information, and passes the third sound information to the voice recognition module. It then judges whether the third sound information is valid sound information: if so, the output module is controlled to stop the current output and instead output the response content obtained on the basis of the third sound information; if not, the current output continues. Validity is judged from the recognition result of the voice recognition module: if no corresponding meaning can be recognized in the third sound information, it is deemed invalid sound information.
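As a deliberately simplified sketch (an assumption, not the patent's method), the filtering of the device's own output can be illustrated as a sample-wise subtraction; real devices would use adaptive acoustic echo cancellation, since playback and microphone signals are never perfectly aligned or equal in gain:

```python
# Toy model of removing the device's own output from the acquired sound:
# subtract the playback signal from the microphone signal, assuming
# perfectly aligned, equal-gain samples. The residual is the "third
# sound information" of the text.
def filter_own_output(mic_samples, playback_samples):
    """Return the microphone signal with the playback signal removed."""
    return [m - p for m, p in zip(mic_samples, playback_samples)]
```

If the residual carries no recognizable speech, the device keeps talking; if it does, the current output is interrupted in favor of a response to the new speech.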
Further, the response content based on the third sound information is obtained as follows:
the voice recognition module recognizes the third sound information to obtain the interruption target content, and the retrieval module acquires response content from the storage module, or from a network, according to the interruption target content.
As a preferred implementation, the interaction device further comprises a history association module. Specifically, the history association module is electrically connected to the retrieval module and the voice recognition module respectively.
The history association module is emptied upon entering the wake-up mode; that is, the previous interaction history it holds is cleared and recording starts from the current interaction. In the wake-up mode it acquires the response content retrieved by the retrieval module and records that response content. The retrieval module obtains from the history association module the historical information related to the target content, and acquires the response content according to both the historical information and the target content. In this way the contents of successive dialogues within one interaction (which may include many dialogues) can be linked, giving the interaction process continuity. Whereas prior-art interaction devices cannot relate consecutive sentences, the invention associates successive dialogues and is thus more intelligent. Further, the history association module deletes the corresponding response content from its records after the output of that response content is interrupted: interrupted response content is usually invalid, so this reduces the storage capacity the history association module needs and lowers the hardware requirements.
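The lifecycle above (clear on wake-up, record responses, drop interrupted ones, relate new queries to history) can be sketched as follows. The word-overlap relevance test is a placeholder assumption; the patent does not specify how historical information is matched to the target content:

```python
class HistoryAssociation:
    """Sketch of the history association module described in the text."""

    def __init__(self):
        self.records = []

    def on_enter_awake(self):
        # Cleared upon entering the wake-up mode: start fresh.
        self.records.clear()

    def record(self, response):
        # Record each response content retrieved in the wake-up mode.
        self.records.append(response)

    def on_interrupted(self, response):
        # Interrupted responses are usually invalid; drop them.
        if response in self.records:
            self.records.remove(response)

    def related(self, target_content):
        # Placeholder relevance test: share any word with the target content.
        words = set(target_content.lower().split())
        return [r for r in self.records if words & set(r.lower().split())]
```

The retrieval module would call `related()` to fetch history for the current query, combining it with the target content when selecting a response.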
In a specific implementation, the voice recognition module recognizes speech based on a deep neural network.
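As a toy illustration of the kind of deep model such a recognition module might apply, the forward pass below maps an acoustic feature vector to label probabilities. The layer sizes, random weights, and feature dimensions are illustrative assumptions; the patent does not disclose a network architecture.

```python
import numpy as np

# Toy feed-forward network: 16-dim acoustic features -> hidden layer -> 4 labels.
# Weights are random (seeded) purely for demonstration; a real recognizer
# would use trained parameters and many more layers.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 8)), np.zeros(8)   # features -> hidden
W2, b2 = rng.normal(size=(8, 4)), np.zeros(4)    # hidden -> candidate labels

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def recognize_frame(features):
    """One forward pass: feature vector -> probability over candidate labels."""
    h = np.maximum(0.0, features @ W1 + b1)  # ReLU hidden layer
    return softmax(h @ W2 + b2)

probs = recognize_frame(rng.normal(size=16))
print(probs.round(3), probs.sum())
```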
Although some specific embodiments of the present invention have been described in detail by way of examples, it should be understood by those skilled in the art that the above examples are for illustrative purposes only and are not intended to limit the scope of the present invention. It will be appreciated by those skilled in the art that modifications may be made to the above embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. An intelligent voice interaction device based on deep learning, characterized in that
the intelligent voice interaction device has two states, a wake-up mode and a non-wake-up mode; the intelligent voice interaction device comprises:
the voice acquisition module is used for acquiring voice information in real time;
the voice preprocessing module is connected with the voice acquisition module and is used for acquiring the voice information and filtering noise from it to obtain a target voice; in the non-wake-up mode, for judging whether the target voice is a set wake-up word, and if so, entering the wake-up mode, otherwise remaining in the non-wake-up mode;
the voice recognition module is used for recognizing the target voice in the wake-up mode to obtain target content;
the retrieval module is respectively connected with the voice recognition module and a storage module in which response statements are pre-stored, and is used for, in the wake-up mode, acquiring response content from the storage module according to the target content or from a network according to the target content;
the output module is connected with the retrieval module and is used for acquiring and outputting the response content in the wake-up mode;
and when no content is being output and no sound information is acquired within a set time, the intelligent voice interaction device enters the non-wake-up mode.
2. The intelligent voice interaction device based on deep learning of claim 1, further comprising a mode control module electrically connected to the voice preprocessing module, the voice recognition module, the retrieval module and the output module, respectively;
the mode control module is used for acquiring mode information and respectively sending the current mode information to the voice preprocessing module, the voice recognition module, the retrieval module and the output module;
in the non-wake-up mode, when the voice preprocessing module judges that the target voice is the set wake-up word, the mode control module generates a wake-up state identifier and outputs it to the voice preprocessing module, the voice recognition module, the retrieval module and the output module respectively;
in the wake-up mode, the mode control module acquires the time node at which output of the response content finishes, and monitors in real time whether the voice preprocessing module acquires target content; if no target content is acquired within a set time, it generates a non-wake-up state identifier and outputs it to the voice preprocessing module, the voice recognition module, the retrieval module and the output module respectively.
3. The intelligent voice interaction device based on deep learning of claim 1, further comprising a voiceprint processing module and an identity tagging module;
the voiceprint processing module is electrically connected with the voice preprocessing module and the identity marking module respectively; the identity marking module is connected with the retrieval module;
the voiceprint processing module is used for: when a target voice whose content is the same as the preset wake-up word is acquired, acquiring the voiceprint information of that target voice and searching the identity marking module for an identity file corresponding to the voiceprint information; if such a file exists, associating the identity file with the retrieval module so that output content matches the identity file, and meanwhile storing content information related to identity preferences acquired during the interaction into the identity file corresponding to the voiceprint; and if no such file exists, generating an identity file corresponding to the voiceprint information in the identity marking module.
4. The intelligent voice interaction device based on deep learning of claim 1, wherein, when voice information is acquired and a pause in the voice information reaches a set duration, the voice acquisition module marks the acquired voice information as first voice information and sends it to the voice preprocessing module, which obtains a first target voice according to the first voice information; the voice acquisition module further acquires subsequent intermittent voice information and records it as second voice information;
the voice recognition module recognizes the first target voice to obtain first target content and judges whether the first target content is complete information; if the first target content is complete, it is taken as the target content; if it is incomplete, the first target voice and the second target voice obtained from the second voice information are combined into an integral target voice, which is recognized to obtain the integral target content.
5. The intelligent voice interaction device based on deep learning of claim 1, wherein the voice preprocessing module is further electrically connected with the output module;
when the output module outputs the voice:
the voice preprocessing module acquires sound information from the voice acquisition module and acquires the sound to be filtered from the output module; it filters the sound to be filtered out of the sound information to obtain third sound information, and identifies the third sound information; it then judges whether the third sound information is valid sound information; if so, it controls the output module to stop the sound being output and to output response content obtained on the basis of the third sound information; if not, the output module continues outputting the current sound.
6. The intelligent voice interaction device based on deep learning of claim 5, wherein the method for obtaining response content based on the third sound information is:
the voice recognition module recognizes the third sound information to obtain interruption target content;
and the retrieval module acquires response content from the storage module according to the interruption target content, or from a network according to the interruption target content.
7. The intelligent voice interaction device based on deep learning of claim 1, further comprising a history association module;
the history correlation module is respectively electrically connected with the retrieval module and the voice recognition module;
the history association module is used for being emptied when entering the wake-up mode, for acquiring the response content retrieved by the retrieval module in the wake-up mode, and for recording the response content in the history association module;
and the retrieval module acquires the historical information related to the target content from the historical association module and acquires the response content according to the historical information and the target content.
8. The intelligent voice interaction device based on deep learning of claim 7, wherein the history association module is further configured to delete the corresponding response content from the history association module after the output of the response content is interrupted.
9. The intelligent voice interaction device based on deep learning of any one of claims 1-8, wherein the voice acquisition module comprises a microphone and the output module comprises a speaker.
10. The intelligent voice interaction device based on deep learning of any one of claims 1-8, wherein the voice recognition module recognizes voice based on a deep neural network.
CN202010307735.4A 2020-04-17 2020-04-17 Intelligent voice interaction equipment based on deep learning Withdrawn CN111508491A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010307735.4A CN111508491A (en) 2020-04-17 2020-04-17 Intelligent voice interaction equipment based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010307735.4A CN111508491A (en) 2020-04-17 2020-04-17 Intelligent voice interaction equipment based on deep learning

Publications (1)

Publication Number Publication Date
CN111508491A true CN111508491A (en) 2020-08-07

Family

ID=71864711

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010307735.4A Withdrawn CN111508491A (en) 2020-04-17 2020-04-17 Intelligent voice interaction equipment based on deep learning

Country Status (1)

Country Link
CN (1) CN111508491A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111951804A (en) * 2020-08-21 2020-11-17 韩山师范学院 Voice interaction equipment based on deep learning
CN112009493A (en) * 2020-09-03 2020-12-01 三一专用汽车有限责任公司 Awakening method of vehicle-mounted control system, vehicle-mounted control system and vehicle
CN112562669A (en) * 2020-12-01 2021-03-26 浙江方正印务有限公司 Intelligent digital newspaper automatic summarization and voice interaction news chat method and system
CN112562669B (en) * 2020-12-01 2024-01-12 浙江方正印务有限公司 Method and system for automatically abstracting intelligent digital newspaper and performing voice interaction chat
CN112712799A (en) * 2020-12-23 2021-04-27 大众问问(北京)信息科技有限公司 Method, device, equipment and storage medium for acquiring false trigger voice information
CN112712799B (en) * 2020-12-23 2024-07-09 大众问问(北京)信息科技有限公司 Acquisition method, device, equipment and storage medium for false triggering voice information
CN112927698A (en) * 2021-02-27 2021-06-08 北京基智科技有限公司 Smart phone voice system based on deep learning

Similar Documents

Publication Publication Date Title
CN111508491A (en) Intelligent voice interaction equipment based on deep learning
JP6613347B2 (en) Method and apparatus for pushing information
US10013977B2 (en) Smart home control method based on emotion recognition and the system thereof
CN105320726B (en) Reduce the demand to manual beginning/end point and triggering phrase
EP3796110A1 (en) Method and apparatus for determining controlled object, and storage medium and electronic device
CN107146612A (en) Voice guide method, device, smart machine and server
WO2017059815A1 (en) Fast identification method and household intelligent robot
JP2020034895A (en) Responding method and device
CN108733209A (en) Man-machine interaction method, device, robot and storage medium
US11107477B2 (en) System and method of providing customized content by using sound
CN107645523A (en) A kind of method and system of mood interaction
CN109166571B (en) Household appliance awakening word training method and device and household appliance
CN110248021A (en) A kind of smart machine method for controlling volume and system
CN111178081B (en) Semantic recognition method, server, electronic device and computer storage medium
CN111161742A (en) Directional person communication method, system, storage medium and intelligent voice device
CN110570847A (en) Man-machine interaction system and method for multi-person scene
CN111339881A (en) Baby growth monitoring method and system based on emotion recognition
KR20200051173A (en) System for providing topics of conversation in real time using intelligence speakers
CN110782886A (en) System, method, television, device and medium for speech processing
CN116737883A (en) Man-machine interaction method, device, equipment and storage medium
CN116825105A (en) Speech recognition method based on artificial intelligence
CN110196900A (en) Exchange method and device for terminal
CN114627859A (en) Method and system for recognizing electronic photo frame in offline semantic manner
WO2018023518A1 (en) Smart terminal for voice interaction and recognition
KR102179220B1 (en) Electronic Bible system using speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200807