CN112562734B - Voice interaction method and device based on voice detection - Google Patents

Voice interaction method and device based on voice detection

Info

Publication number
CN112562734B
CN112562734B (granted publication of application CN202011342535.9A; earlier publication CN112562734A)
Authority
CN
China
Prior art keywords
user
audio
voice
information
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011342535.9A
Other languages
Chinese (zh)
Other versions
CN112562734A (en)
Inventor
缪纯 (Miao Chun)
韩瑞 (Han Rui)
吴鹏程 (Wu Pengcheng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Inspection Enlightenment Beijing Technology Co ltd
Original Assignee
China Inspection Enlightenment Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Inspection Enlightenment Beijing Technology Co ltd filed Critical China Inspection Enlightenment Beijing Technology Co ltd
Priority to CN202011342535.9A
Publication of CN112562734A
Application granted
Publication of CN112562734B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/60: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination, for measuring the quality of voice signals
    • G10L 15/22: Speech recognition; procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00, specially adapted for particular use, for comparison or discrimination, for estimating an emotional state

Abstract

The invention discloses a voice interaction method and device based on voice detection. Speech to be detected is split into several different categories of audio content according to the feature information of the audio content, the categories are fed back to the user one by one, and the user confirms which audio content is his or her input. This eliminates interference from audio produced by other users or by environmental noise and improves the accuracy of subsequent voice interaction. The audio content confirmed by the user is then recognized and the result fed back to the user, who confirms whether it fully expresses his or her true meaning; this avoids interaction failures caused by recognition errors and further improves both the accuracy of voice interaction and the user experience.

Description

Voice interaction method and device based on voice detection
Technical Field
The present application relates to the field of voice interaction technology, and in particular, to a voice interaction method and apparatus based on voice detection.
Background
With the development of communication technology and the popularization of intelligent terminals, network communication tools have become one of the main channels of public communication. Because voice information is convenient to record and transmit, it has become one of the main forms of information carried by these tools. Using them therefore involves converting voice information into text, a process performed by speech recognition technology.
Speech recognition technology enables a machine to convert speech information into corresponding text or commands through recognition and understanding. When deep learning methods are used for speech recognition, the speech information at the current moment must be recognized promptly to determine the recognition result, which places high demands on the efficiency and accuracy of speech recognition.
Disclosure of Invention
To solve the above technical problems, the present application provides a voice interaction method and device based on voice detection. The speech to be detected is split into several different categories of audio content according to the feature information of the audio content, the categories are fed back to the user one by one, and the user determines which audio content is his or her input. This eliminates interference from audio produced by other users or by environmental noise and improves the accuracy of subsequent voice interaction. The audio content confirmed by the user is then recognized and the result fed back to the user, who confirms whether it fully expresses his or her true meaning; this avoids interaction failures caused by recognition errors and further improves both the accuracy of voice interaction and the user experience.
According to an aspect of the present application, a voice interaction method based on voice detection is provided, including: acquiring speech to be detected, the speech to be detected including multiple categories of audio content; splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of the different audio contents; feeding back the plurality of different categories of audio content to the user respectively; acquiring first confirmation information from the user, the first confirmation information confirming which of the plurality of different categories of audio content corresponds to the user's input information; recognizing the audio content corresponding to the first confirmation information to obtain recognized content; feeding back the recognized content to the user; acquiring second confirmation information from the user, the second confirmation information confirming whether the recognized content expresses the user's true meaning; and, when the second confirmation information confirms that it does, determining the interaction information according to the recognized content.
In one embodiment, the feature information includes pitch, timbre, and volume, and splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of the different audio contents includes: splitting the speech to be detected into a plurality of audio contents according to its pitch, timbre, and volume.
In one embodiment, feeding back the plurality of different categories of audio content to the user respectively includes: splitting each audio content into audio segments whose duration is less than or equal to a preset duration; and feeding back at least one audio segment of each audio content to the user respectively.
In an embodiment, before the plurality of different categories of audio content are fed back to the user, the voice interaction method further includes: acquiring a plurality of attribute tags of the user, the attribute tags characterizing different dimensional features of the user.
In an embodiment, acquiring the plurality of attribute tags of the user includes: acquiring a face image of the user and analyzing the face image to obtain the plurality of attribute tags.
In one embodiment, the attribute tags include any one or a combination of the following dimensional features: region, age, gender, interest, and mood.
In one embodiment, feeding back the plurality of different categories of audio content to the user respectively includes: determining the feedback order of the plurality of different categories of audio content according to the plurality of attribute tags of the user.
In an embodiment, determining the feedback order of the plurality of different categories of audio content according to the plurality of attribute tags of the user includes: calculating the similarity between the plurality of attribute tags of the user and the feature information of the different categories of audio content; and feeding back the plurality of different categories of audio content in descending order of similarity.
In an embodiment, calculating the similarity between the plurality of attribute tags of the user and the feature information of the different categories of audio content includes: calculating, for each attribute tag of the user, a single-dimension similarity with the corresponding feature information of the audio content; and weighting the single-dimension similarities to obtain the similarity between the plurality of attribute tags of the user and the feature information of the different categories of audio content.
According to another aspect of the present application, a voice interaction device based on voice detection is provided, including: an acquisition module for acquiring speech to be detected, the speech to be detected including multiple categories of audio content; a splitting module for splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of the different audio contents; a first feedback module for feeding back the plurality of different categories of audio content to the user respectively; a first confirmation module for acquiring first confirmation information from the user, the first confirmation information confirming which of the plurality of different categories of audio content corresponds to the user's input information; a recognition module for recognizing the audio content corresponding to the first confirmation information to obtain recognized content; a second feedback module for feeding back the recognized content to the user; a second confirmation module for acquiring second confirmation information from the user, the second confirmation information confirming whether the recognized content expresses the user's true meaning; and an interaction module for determining the interaction information according to the recognized content when the second confirmation information confirms that it does.
According to another aspect of the present application, there is provided a computer-readable storage medium storing a computer program for performing any of the above-described voice interaction methods.
According to another aspect of the present application, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; the processor is configured to execute any of the voice interaction methods described above.
According to the voice interaction method and device based on voice detection, the speech to be detected is split into several different categories of audio content according to the feature information of the audio content, the categories are fed back to the user one by one, and the user confirms which audio content is his or her input, so that interference from audio produced by other users or by environmental noise can be eliminated and the accuracy of subsequent voice interaction improved. The audio content confirmed by the user is then recognized and the result fed back to the user, who confirms whether it fully expresses his or her true meaning, so that interaction failures caused by recognition errors are avoided and both the accuracy of voice interaction and the user experience are further improved.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a flowchart illustrating a voice interaction method based on voice detection according to an exemplary embodiment of the present application.
Fig. 2 is a flowchart illustrating a method for feeding back audio content according to an exemplary embodiment of the present application.
Fig. 3 is a flowchart illustrating a voice interaction method based on voice detection according to another exemplary embodiment of the present application.
Fig. 4 is a flowchart illustrating a method for feeding back audio content according to an exemplary embodiment of the present application.
Fig. 5 is a schematic structural diagram of a voice interaction apparatus based on voice detection according to an exemplary embodiment of the present application.
Fig. 6 is a schematic structural diagram of a voice interaction apparatus based on voice detection according to another exemplary embodiment of the present application.
Fig. 7 is a block diagram of an electronic device provided in an exemplary embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Exemplary method
Fig. 1 is a flowchart illustrating a voice interaction method based on voice detection according to an exemplary embodiment of the present application. As shown in fig. 1, the voice interaction method includes the following steps:
Step 110: acquiring speech to be detected, wherein the speech to be detected comprises multiple categories of audio content.
With the continuous development of intelligent control, more and more devices support voice interaction, such as the voice interaction terminals in shopping malls and banks and large machine equipment. These devices receive the user's voice information and convert it into corresponding interactive instructions, so that interaction or specific commands can be carried out, improving the user experience. The basis of voice interaction is accurate speech recognition: only when the user's voice information is accurately recognized can voice interaction or instruction execution proceed smoothly. In the embodiment of the present application, the current user's voice can be acquired as the speech to be recognized through a voice acquisition module, such as a recording module. It should be understood that the speech to be recognized may also be provided directly by the user, for example imported into the speech recognition system or apparatus from a device such as a USB drive.
Step 120: splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of the different audio contents.
Voice interaction devices in shopping malls or banks are usually placed in open environments where many sounds coexist, including the voices of different customers, the voices of staff, notification broadcasts, and noise, so it is difficult to pick out the audio content the user intends to express from this mixture. The speech to be detected is therefore split into several different categories of audio content according to the feature information of the different audio contents; that is, the acquired speech is separated into the voices of different users and other sounds (such as noise), so that the corresponding user's voice can be recognized and interacted with, improving interaction accuracy and effect. In one embodiment, the feature information of the audio content may include pitch, timbre, and volume, and the specific implementation of step 120 may be: splitting the speech to be detected into a plurality of audio contents according to its pitch, timbre, and volume. Because each person's pitch, timbre, and volume differ, these features can distinguish the audio contents of different users and hence the audio contents within the acquired speech.
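By way of illustration, the sketch below separates the speech to be detected into per-source audio content by clustering frame-level acoustic features. The application does not prescribe a splitting algorithm; librosa features with k-means clustering, the known number of sources, and the spectral-centroid stand-in for pitch are all assumptions of this sketch, not the claimed method.

```python
# A minimal sketch, NOT the claimed method: split speech into per-source
# audio content by clustering frame-level pitch/timbre/volume features.
import numpy as np
import librosa
from sklearn.cluster import KMeans

def split_by_features(wav_path: str, n_sources: int = 2, sr: int = 16000):
    y, _ = librosa.load(wav_path, sr=sr)
    hop = 512  # shared hop length so all features align frame by frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)    # timbre
    rms = librosa.feature.rms(y=y, hop_length=hop)                        # volume
    cent = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop)  # pitch proxy
    feats = np.vstack([mfcc, rms, cent]).T                                # (frames, 15)
    feats = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

    labels = KMeans(n_clusters=n_sources, n_init=10).fit_predict(feats)

    contents = []
    for k in range(n_sources):
        mask = np.repeat(labels == k, hop)  # frame labels -> sample labels
        mask = np.pad(mask, (0, max(0, len(y) - len(mask))))[: len(y)]
        contents.append(y[mask])            # one "audio content" per source
    return contents
```

A production system would more likely use a trained speaker-diarization model; the clustering above only illustrates the idea of grouping by pitch, timbre, and volume.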
Step 130: feeding back the plurality of different categories of audio content to the user respectively.
After the splitting, in order to improve the accuracy of voice interaction, all the categories of audio content obtained are fed back to the user, and the user determines which one is his or her interactive audio content, further improving the accuracy of voice interaction.
Step 140: acquiring first confirmation information from the user, the first confirmation information confirming which of the plurality of different categories of audio content corresponds to the user's input information.
After the different categories of audio content are fed back to the user, the system waits for the user to confirm which audio content is his or her interactive audio content. Once the user gives the first confirmation information, the feature information of that audio content can be retained, and during the current interaction only audio whose feature information is similar or identical to it is collected, which improves the efficiency and effect of speech recognition. It should be understood that the first confirmation information may take the form of a specific utterance, such as "yes" or "correct", or of a click or check action; this application does not limit its form.
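As a sketch of that session-level filtering, one could retain the feature vector of the confirmed audio content and accept later audio only when it is acoustically close. The cosine measure and the 0.8 threshold below are assumptions of this sketch, not values given in the application.

```python
# A minimal sketch, assuming cosine similarity over whatever feature vector
# (e.g., averaged MFCCs) the splitter produced; 0.8 is an arbitrary threshold.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class SessionFilter:
    """Keeps only audio whose features match the user-confirmed content."""

    def __init__(self, confirmed_features: np.ndarray, threshold: float = 0.8):
        self.reference = confirmed_features  # features of the confirmed content
        self.threshold = threshold

    def accept(self, segment_features: np.ndarray) -> bool:
        return cosine(self.reference, segment_features) >= self.threshold
```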
Step 150: recognizing the audio content corresponding to the first confirmation information to obtain the recognized content.
After the user confirms the interactive audio content, that content is recognized to obtain the recognized content, providing the basis for the subsequent interaction.
Step 160: feeding back the recognized content to the user.
After the recognized content is obtained, it is fed back to the user for confirmation or review. Because of interference from the user's pronunciation and from the environment, the obtained interactive audio content may deviate from what the user actually input; feeding the recognition result back to the user prevents voice-interaction failures caused by erroneous recognition results and improves the efficiency and effect of voice interaction.
Step 170: acquiring second confirmation information from the user, the second confirmation information confirming whether the recognized content expresses the user's true meaning.
Once the user's confirmation signal is obtained, i.e., the user confirms that the recognized content expresses his or her true meaning, the interaction information can be acquired more accurately. In one embodiment, when the user finds that the recognized content deviates from his or her true meaning, the user can actively modify the recognized content to correct it, further improving the effect of voice interaction.
Step 180: when the second confirmation information confirms the expression of the user's true meaning, determining the interaction information according to the recognized content.
When the user confirms that the recognized content (possibly after active modification by the user) expresses his or her true meaning, the interaction information is determined according to it. A specific way of determining the interaction information may be: searching a database for interaction information identical or similar to the recognized content. In an embodiment, when no identical or similar interaction information exists in the database, the recognized content may be split into several keywords, related interaction information retrieved from the database according to those keywords, and the retrieved interaction information displayed to the user for active selection.
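The sketch below illustrates one way such a lookup could work, with fuzzy whole-string matching first and the keyword fallback second. The in-memory "database", the difflib scorer, and the 0.75 threshold are illustrative assumptions; a real system would query an actual database.

```python
# A minimal sketch of the lookup described above.
from difflib import SequenceMatcher

DATABASE = [
    "business hours inquiry",
    "account balance inquiry",
    "open a new savings account",
]

def find_interaction(recognized: str, threshold: float = 0.75):
    # 1) Look for an identical or similar entry as a whole.
    best_score, best_entry = max(
        (SequenceMatcher(None, recognized.lower(), e).ratio(), e) for e in DATABASE
    )
    if best_score >= threshold:
        return [best_entry]
    # 2) Fallback: split into keywords and gather every related entry,
    #    to be displayed for the user's active selection.
    keywords = [w for w in recognized.lower().split() if len(w) > 2]
    return [e for e in DATABASE if any(k in e for k in keywords)]

# Both "account" entries are offered for the user to choose between:
print(find_interaction("check my account balance"))
```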
According to the voice interaction method based on voice detection described above, the speech to be detected is split into several different categories of audio content according to the feature information of the audio content, the categories are fed back to the user one by one, and the user confirms which audio content is his or her input, so that interference from audio produced by other users or by environmental noise can be eliminated and the accuracy of subsequent voice interaction improved. The audio content confirmed by the user is then recognized and the result fed back to the user, who confirms whether it fully expresses his or her true meaning, so that interaction failures caused by recognition errors are avoided and both the accuracy of voice interaction and the user experience are further improved.
Fig. 2 is a flowchart illustrating a method for feeding back audio content according to an exemplary embodiment of the present application. As shown in fig. 2, the step 130 may include the following sub-steps:
step 131: and according to the preset time length, dividing each audio content into audio segments with the time less than or equal to the preset time length.
Step 132: at least one audio segment of each audio content is fed back to the user separately.
Because several audio contents may exist in the speech to be detected, feeding back several long audio contents in full would require lengthy playback. Each audio content is therefore split into short segments, i.e., each audio segment lasts no longer than a preset duration, for example 10 seconds. Since a sentence or even a few words is enough for the user to determine whether an audio content is his or her own, feeding back at least one audio segment of each audio content improves the efficiency of the user's confirmation and thus of the interaction.
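A sketch of that segmentation, assuming audio held as NumPy sample arrays and the 10-second limit used in the example above:

```python
# A minimal sketch: cut each audio content into segments of at most a
# preset duration, then feed back only the first segment of each content.
import numpy as np

def segment_audio(samples: np.ndarray, sr: int, max_seconds: float = 10.0):
    step = int(max_seconds * sr)
    return [samples[i:i + step] for i in range(0, len(samples), step)]

sr = 16000
contents = [np.zeros(sr * 25), np.zeros(sr * 7)]        # dummy 25 s and 7 s contents
previews = [segment_audio(c, sr)[0] for c in contents]  # one short segment each
assert all(len(p) <= 10 * sr for p in previews)
```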
Fig. 3 is a flowchart illustrating a voice interaction method based on voice detection according to another exemplary embodiment of the present application. As shown in fig. 3, before step 130, the above embodiment may further include:
step 190: acquiring a plurality of attribute tags of a user; the plurality of attribute tags characterize respective different dimensional features of the user.
Different users have different attribute tags, i.e., different characteristics. The attribute tags may include any one or a combination of the following dimensional features: region, age, gender, interest, and mood. For example, if the current user is a 30-year-old man interested in science and technology, the user's voice information can be recognized in a more targeted way according to these attribute tags. In an embodiment, a specific way of obtaining the current user's attribute tags may be: analyzing the feature information of the speech to be recognized to derive the attribute tags. Because each person's voice characteristics differ, analyzing the speech to be recognized can yield attribute tags of the user such as gender, accent, and mood. In another embodiment, a specific way of obtaining the current user's attribute tags may be: acquiring a face image of the current user and analyzing it to obtain the attribute tags. A camera module acquires the current user's face image, and image analysis of that face image yields attribute tags such as gender, age, and mood. It should be understood that different ways of obtaining the user's attribute tags may be chosen according to the actual application scenario; the two ways may also be combined, or some or all of the tags may be entered and set manually by the user, in which case, after the speech to be recognized is subsequently acquired, the corresponding user is matched according to that speech to obtain the user's attribute tags. The embodiment of the present application does not limit this.
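As a sketch of this step, the helper below stands in for any face-attribute model. `analyze_face` is hypothetical, since the application names no specific model, and the merge rule for manually set tags is likewise an assumption.

```python
# A minimal sketch; analyze_face() is a HYPOTHETICAL stand-in for a real
# face-attribute model (the application does not prescribe one).
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class UserProfile:
    attribute_tags: Dict[str, str] = field(default_factory=dict)

def analyze_face(image_bytes: bytes) -> Dict[str, str]:
    # A real implementation would estimate gender, age, and mood from the
    # camera frame; fixed values are returned here purely for illustration.
    return {"gender": "male", "age": "30", "mood": "calm"}

def build_profile(image_bytes: bytes,
                  manual_tags: Optional[Dict[str, str]] = None) -> UserProfile:
    tags = analyze_face(image_bytes)
    tags.update(manual_tags or {})  # manually set tags override, as the text allows
    return UserProfile(attribute_tags=tags)

profile = build_profile(b"", manual_tags={"interest": "technology"})
print(profile.attribute_tags)
```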
In an embodiment, as shown in fig. 3, step 130 may specifically include: determining the feedback order of the different categories of audio content according to the user's attribute tags.
Because the feature information of different users' audio content may differ significantly (for example, audio content differs considerably between genders) and different users attend to different content (for example, men's attention to cosmetics is on average lower than women's), the several audio contents can be ranked according to the user's attribute tags, i.e., fed back in order of likelihood, further reducing the user's confirmation time and improving interaction efficiency.
Fig. 4 is a flowchart illustrating a method for feeding back audio content according to an exemplary embodiment of the present application. As shown in fig. 4, the step 130 may include the following sub-steps:
step 133: similarity between a plurality of attribute labels of the user and feature information of different categories of audio content is calculated.
Step 134: feeding back the plurality of different categories of audio content in descending order of similarity.
By calculating the similarity between the user's attribute tags and the feature information of each audio content, the audio content most consistent with the user can be found and fed back first; the user can then usually confirm the interactive audio content after listening to only the first one or the first few, shortening confirmation time and improving interaction efficiency.
In an embodiment, a specific implementation of calculating the similarity between the user's attribute tags and the feature information of the different categories of audio content may be: calculating, for each attribute tag of the user, a single-dimension similarity with the corresponding feature information of the audio content; and weighting the single-dimension similarities to obtain the similarity between the user's attribute tags and the feature information of the different categories of audio content. Because the user's attribute tags and the feature information of the audio content are both multidimensional and correspond to each other, similarity can be computed per corresponding tag-feature pair and the per-dimension similarities then combined into a final similarity, which reflects the degree of fit between the audio content and the user more comprehensively.
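A sketch of this weighting follows. The dimensions, the weights, and the exact-match single-dimension scorer are assumptions of the sketch; the application specifies only that single-dimension similarities are weighted and combined.

```python
# A minimal sketch; dimensions, weights, and the scorer are illustrative.
WEIGHTS = {"gender": 0.5, "age": 0.3, "mood": 0.2}

def dim_similarity(tag_value: str, feature_value: str) -> float:
    # Exact match scores 1.0; a real system would use graded scores.
    return 1.0 if tag_value == feature_value else 0.0

def weighted_similarity(user_tags: dict, content_features: dict) -> float:
    return sum(
        w * dim_similarity(user_tags[d], content_features[d])
        for d, w in WEIGHTS.items()
        if d in user_tags and d in content_features
    )

user = {"gender": "male", "age": "30", "mood": "calm"}
contents = {
    "content_a": {"gender": "male", "age": "30", "mood": "calm"},
    "content_b": {"gender": "female", "age": "45", "mood": "calm"},
}
# Feed back in descending order of similarity (step 134).
order = sorted(contents, key=lambda c: weighted_similarity(user, contents[c]), reverse=True)
print(order)  # ['content_a', 'content_b']
```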
Exemplary devices
Fig. 5 is a schematic structural diagram of a voice interaction apparatus based on voice detection according to an exemplary embodiment of the present application. As shown in fig. 5, the voice interaction apparatus 50 includes: an acquisition module 51 for acquiring the speech to be detected, the speech to be detected including multiple categories of audio content; a splitting module 52 for splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of the different audio contents; a first feedback module 53 for feeding back the plurality of different categories of audio content to the user respectively; a first confirmation module 54 for acquiring first confirmation information from the user, the first confirmation information confirming which of the plurality of different categories of audio content corresponds to the user's input information; a recognition module 55 for recognizing the audio content corresponding to the first confirmation information to obtain the recognized content; a second feedback module 56 for feeding back the recognized content to the user; a second confirmation module 57 for acquiring second confirmation information from the user, the second confirmation information confirming whether the recognized content expresses the user's true meaning; and an interaction module 58 for determining the interaction information according to the recognized content when the second confirmation information confirms that it does.
In the voice interaction device based on voice detection described above, the acquisition module 51 acquires the speech to be detected, and the splitting module 52 splits it into several different categories of audio content according to the feature information of the audio content; the first feedback module 53 then feeds the audio contents back to the user respectively, and the first confirmation module 54 acquires the user's first confirmation information, the user confirming which audio content is his or her input, so that interference from audio produced by other users or by environmental noise can be eliminated and the accuracy of subsequent voice interaction improved. The audio content confirmed by the user is recognized by the recognition module 55, the recognized content is fed back to the user by the second feedback module 56, and the second confirmation module 57 acquires the user's second confirmation information, the user confirming whether the content fully expresses his or her true meaning; the interaction module 58 then determines the interaction information, so that interaction failures caused by recognition errors are avoided and both the accuracy of voice interaction and the user experience are further improved.
In one embodiment, the feature information of the audio content may include pitch, timbre, and volume, and the splitting module 52 may be further configured to split the speech to be detected into a plurality of audio contents according to its pitch, timbre, and volume.
In an embodiment, the second confirmation module 57 may be further configured to: when the user finds that the recognized content deviates from his or her true meaning, allow the user to actively modify the recognized content to correct it.
In an embodiment, the interaction module 58 may be further configured to search a database for interaction information identical or similar to the recognized content. In an embodiment, the interaction module 58 may be further configured to: when no identical or similar interaction information exists in the database, split the recognized content into several keywords, retrieve related interaction information from the database according to those keywords, and display the retrieved interaction information to the user for active selection.
Fig. 6 is a schematic structural diagram of a voice interaction apparatus based on voice detection according to another exemplary embodiment of the present application. As shown in fig. 6, the first feedback module 53 may include: a splitting unit 531 for splitting each audio content into audio segments whose duration is less than or equal to a preset duration; and a segment feedback unit 532 for feeding back at least one audio segment of each audio content to the user respectively.
In one embodiment, as shown in fig. 6, the voice interaction apparatus 50 may further include: an attribute tag obtaining module 59, configured to obtain a plurality of attribute tags of a user; the plurality of attribute tags characterize respective different dimensional features of the user.
In an embodiment, the attribute tag obtaining module 59 may be further configured to analyze the feature information of the speech to be recognized to obtain the current user's attribute tags. In another embodiment, the attribute tag obtaining module 59 may be further configured to acquire a face image of the current user and analyze it to obtain the current user's attribute tags.
In an embodiment, the first feedback module 53 may be further configured to determine the feedback order of the different categories of audio content according to the user's attribute tags.
In one embodiment, as shown in fig. 6, the first feedback module 53 may include: a calculating unit 533 for calculating the similarity between the user's attribute tags and the feature information of the different categories of audio content; and an order feedback unit 534 for feeding back the plurality of different categories of audio content in descending order of similarity.
In an embodiment, the calculating unit 533 may be further configured to: calculate, for each attribute tag of the user, a single-dimension similarity with the corresponding feature information of the audio content; and weight the single-dimension similarities to obtain the similarity between the user's attribute tags and the feature information of the different categories of audio content.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present application is described with reference to fig. 7. The electronic device may be either or both of the first device and the second device, or a stand-alone device separate from them, which stand-alone device may communicate with the first device and the second device to receive the acquired input signals therefrom.
FIG. 7 illustrates a block diagram of an electronic device in accordance with an embodiment of the present application.
As shown in fig. 7, the electronic device 10 includes one or more processors 11 and memory 12.
The processor 11 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
Memory 12 may include one or more computer program products that may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM), cache memory (cache), and/or the like. The non-volatile memory may include, for example, Read Only Memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 11 to implement the voice interaction methods of the various embodiments of the present application described above and/or other desired functions. Various contents such as an input signal, a signal component, a noise component, etc. may also be stored in the computer-readable storage medium.
In one example, the electronic device 10 may further include: an input device 13 and an output device 14, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is a first device or a second device, the input device 13 may be a camera for capturing an input signal of an image. When the electronic device is a stand-alone device, the input means 13 may be a communication network connector for receiving the acquired input signals from the first device and the second device.
The input device 13 may also include, for example, a keyboard, a mouse, and the like.
The output device 14 may output various information including the determined distance information, direction information, and the like to the outside. The output devices 14 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 10 relevant to the present application are shown in fig. 7, and components such as buses, input/output interfaces, and the like are omitted. In addition, the electronic device 10 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage Medium
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to various embodiments of the present application described in the "exemplary methods" section of this specification, supra.
The computer program product may be written in any combination of one or more programming languages, including object-oriented languages such as Java and C++ and conventional procedural languages such as the "C" language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the voice interaction method according to various embodiments of the present application described in the "exemplary methods" section above of this specification.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are only illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown. As will be appreciated by those skilled in the art, these devices, apparatuses, and systems may be connected, arranged, and configured in any manner. Words such as "including", "comprising", and "having" are open-ended terms meaning "including but not limited to" and are used interchangeably with it. The word "or" as used herein means, and is used interchangeably with, "and/or", unless the context clearly dictates otherwise. The phrase "such as" as used herein means, and is used interchangeably with, "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (6)

1. A voice interaction method based on voice detection, characterized by comprising:
acquiring speech to be detected; the speech to be detected comprises multiple categories of audio content;
splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of different audio contents;
acquiring a face image of a user and analyzing the face image to obtain a plurality of attribute tags of the user; the attribute tags characterize different dimensional features of the user;
feeding back the plurality of different categories of audio content to the user respectively;
acquiring first confirmation information from the user; the first confirmation information is used for confirming which of the plurality of different categories of audio content corresponds to the user's input information;
recognizing the audio content corresponding to the first confirmation information to obtain recognized content;
feeding back the recognized content to the user;
acquiring second confirmation information from the user; the second confirmation information is used for confirming whether the recognized content expresses the user's true meaning;
when the second confirmation information confirms that the recognized content expresses the user's true meaning, determining interaction information according to the recognized content; and
when no interaction information identical or similar to the recognized content exists in a database, splitting the recognized content into a plurality of keywords, searching the database for related interaction information according to the keywords, displaying the retrieved interaction information to the user, and having the user actively select from it;
wherein feeding back the plurality of different categories of audio content to the user respectively comprises:
splitting each audio content into audio segments whose duration is less than or equal to a preset duration;
determining a feedback order of the plurality of different categories of audio content according to the plurality of attribute tags of the user; and
feeding back at least one audio segment of each audio content to the user respectively, in the feedback order.
2. The voice interaction method of claim 1, wherein the feature information includes pitch, timbre, and volume, and splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of different audio contents comprises:
splitting the speech to be detected into a plurality of audio contents according to the pitch, timbre, and volume of the speech to be detected.
3. The voice interaction method of claim 1, wherein the attribute tags comprise any one or a combination of the following dimensional features: region, age, gender, interest, and mood.
4. The method of claim 1, wherein determining the feedback order of the plurality of different categories of audio content based on the plurality of attribute tags of the user comprises:
calculating the similarity between the plurality of attribute tags of the user and the feature information of the different categories of audio content; and
feeding back the plurality of different categories of audio content in descending order of similarity.
5. The method of claim 4, wherein calculating the similarity between the plurality of attribute tags of the user and the feature information of the different categories of audio content comprises:
calculating, for each attribute tag of the user, a single-dimension similarity with the corresponding feature information of the audio content; and
weighting the single-dimension similarities to obtain the similarity between the plurality of attribute tags of the user and the feature information of the different categories of audio content.
6. A voice interaction device based on voice detection, characterized by comprising:
an acquisition module for acquiring speech to be detected; the speech to be detected comprises multiple categories of audio content;
a splitting module for splitting the speech to be detected into a plurality of different categories of audio content according to the feature information of different audio contents;
an attribute tag acquisition module for acquiring a face image of a user and analyzing the face image to obtain a plurality of attribute tags of the user; the attribute tags characterize different dimensional features of the user;
a first feedback module for feeding back the plurality of different categories of audio content to the user respectively;
a first confirmation module for acquiring first confirmation information from the user; the first confirmation information is used for confirming which of the plurality of different categories of audio content corresponds to the user's input information;
a recognition module for recognizing the audio content corresponding to the first confirmation information to obtain recognized content;
a second feedback module for feeding back the recognized content to the user;
a second confirmation module for acquiring second confirmation information from the user; the second confirmation information is used for confirming whether the recognized content expresses the user's true meaning; and
an interaction module for determining interaction information according to the recognized content when the second confirmation information confirms that the recognized content expresses the user's true meaning, and, when no interaction information identical or similar to the recognized content exists in a database, splitting the recognized content into a plurality of keywords, searching the database for related interaction information according to the keywords, displaying the retrieved interaction information to the user, and having the user actively select from it;
wherein the first feedback module is further configured to determine a feedback order of the plurality of different categories of audio content according to the plurality of attribute tags of the user, and the first feedback module comprises: a splitting unit for splitting each audio content into audio segments whose duration is less than or equal to a preset duration; and a segment feedback unit for feeding back at least one audio segment of each audio content to the user respectively, in the feedback order.
CN202011342535.9A 2020-11-25 2020-11-25 Voice interaction method and device based on voice detection Active CN112562734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011342535.9A CN112562734B (en) 2020-11-25 2020-11-25 Voice interaction method and device based on voice detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011342535.9A CN112562734B (en) 2020-11-25 2020-11-25 Voice interaction method and device based on voice detection

Publications (2)

Publication Number Publication Date
CN112562734A (en) 2021-03-26
CN112562734B (en) 2021-08-27

Family

ID=75043819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011342535.9A Active CN112562734B (en) 2020-11-25 2020-11-25 Voice interaction method and device based on voice detection

Country Status (1)

Country Link
CN (1) CN112562734B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114758665B (en) * 2022-06-14 2022-09-02 深圳比特微电子科技有限公司 Audio data enhancement method and device, electronic equipment and storage medium


Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104795065A (en) * 2015-04-30 2015-07-22 北京车音网科技有限公司 Method for increasing speech recognition rate and electronic device
US20160329051A1 (en) * 2015-05-06 2016-11-10 Blackfire Research Corporation Multiple microphones for synchronized voice interaction
CN107665707A (en) * 2016-07-28 2018-02-06 嘉兴统捷通讯科技有限公司 A kind of wrist-watch robot voice identification and Problem Confirmation system
CN107424611B (en) * 2017-07-07 2021-10-15 歌尔科技有限公司 Voice interaction method and device
CN110085219A (en) * 2018-01-26 2019-08-02 博西华电器(江苏)有限公司 Household electrical appliance and method and system by the voice control household electrical appliance
CN109584876B (en) * 2018-12-26 2020-07-14 珠海格力电器股份有限公司 Voice data processing method and device and voice air conditioner
CN110189754A (en) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 Voice interactive method, device, electronic equipment and storage medium
CN111369993B (en) * 2020-03-03 2023-06-20 珠海格力电器股份有限公司 Control method, control device, electronic equipment and storage medium

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3144931A1 (en) * 2015-09-17 2017-03-22 Samsung Electronics Co., Ltd. Dialog management apparatus and method
CN105450822A (en) * 2015-11-11 2016-03-30 百度在线网络技术(北京)有限公司 Intelligent voice interaction method and device
US10192569B1 (en) * 2016-10-27 2019-01-29 Intuit Inc. Informing a support agent of a paralinguistic emotion signature of a user
CN107728780A (en) * 2017-09-18 2018-02-23 北京光年无限科技有限公司 A kind of man-machine interaction method and device based on virtual robot
CN109189980A (en) * 2018-09-26 2019-01-11 三星电子(中国)研发中心 The method and electronic equipment of interactive voice are carried out with user
CN109524003A (en) * 2018-12-29 2019-03-26 出门问问信息科技有限公司 The information processing method of smart-interactive terminal and smart-interactive terminal
CN109902158A (en) * 2019-01-24 2019-06-18 平安科技(深圳)有限公司 Voice interactive method, device, computer equipment and storage medium
CN109949807A (en) * 2019-03-13 2019-06-28 常州市贝叶斯智能科技有限公司 A kind of the intelligent robot interactive system and method for body composition detection and analysis
CN111443801A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Man-machine interaction method, device, equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Haonan Song, "Communication Efficiency and User Experience Analysis of Visual and Audio Feedback Cues in Human and Service Robot Voice Interaction Cycle", 2019 WRC Symposium on Advanced Robotics and Automation (WRC SARA), 2019-12-16, full text. *
Lin Yunhan, "Research on 'Human-Machine-Environment' Interaction and Systems of Intelligent Robots", China Doctoral Dissertations Full-text Database, No. 4, 2018-04-15, I140-24. *
Ning Yishuang, "Research on User Intent Understanding and Feedback Generation in Intelligent Voice Interaction", China Doctoral Dissertations Full-text Database, No. 2, 2019-02-15, I140-23. *

Also Published As

Publication number Publication date
CN112562734A (en) 2021-03-26

Similar Documents

Publication Publication Date Title
CN107767869B (en) Method and apparatus for providing voice service
US10229680B1 (en) Contextual entity resolution
CN107481720B (en) Explicit voiceprint recognition method and device
CN110430476B (en) Live broadcast room searching method, system, computer equipment and storage medium
US20190164540A1 (en) Voice recognition system and voice recognition method for analyzing command having multiple intents
KR20190024711A (en) Information verification method and device
US20210034663A1 (en) Systems and methods for managing voice queries using pronunciation information
CN111985249A (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
CN108305618B (en) Voice acquisition and search method, intelligent pen, search terminal and storage medium
CN111210842A (en) Voice quality inspection method, device, terminal and computer readable storage medium
CN113806588B (en) Method and device for searching video
CN110459223B (en) Data tracking processing method, device, storage medium and apparatus
CN113343108B (en) Recommended information processing method, device, equipment and storage medium
EP4322029A1 (en) Method and apparatus for generating video corpus, and related device
CN110737824B (en) Content query method and device
CN108710653B (en) On-demand method, device and system for reading book
US20170024379A1 (en) System and method for speech to speech translation using cores of a natural liquid architecture system
CN112487248A (en) Video file label generation method and device, intelligent terminal and storage medium
CN110874534A (en) Data processing method and data processing device
CN112562734B (en) Voice interaction method and device based on voice detection
CN110347696B (en) Data conversion method, device, computer equipment and storage medium
CN110008314B (en) Intention analysis method and device
CN116629236A (en) Backlog extraction method, device, equipment and storage medium
CN113015002B (en) Processing method and device for anchor video data
CN115273840A (en) Voice interaction device and voice interaction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant