CN109710799B - Voice interaction method, medium, device and computing equipment

Publication number: CN109710799B
Application number: CN201910005993.4A
Authority: CN (China)
Other versions: CN109710799A (original language: Chinese)
Inventors: 肖军军, 张敏, 张汉雁, 魏永振
Assignee: Hangzhou Netease Cloud Music Technology Co Ltd
Legal status: Active (granted)

Classification: Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

An embodiment of the invention provides a voice interaction method comprising the following steps: receiving voice information input by a user and converting the voice information into sentence text; obtaining comment information matched with the sentence text from a preset music comment library; and outputting the comment information as a response to the voice information. By reusing existing music comment information as responses, the embodiments greatly reduce the manpower needed to write response content, can evoke emotional resonance in the user currently providing the voice input, and thereby meet the user's emotional needs. Embodiments of the invention also provide a voice interaction apparatus, a medium, and a computing device.

Description

Voice interaction method, medium, device and computing equipment
Technical Field
Embodiments of the invention relate to the field of computer technology, and in particular to a voice interaction method, medium, apparatus, and computing device.
Background
This section is intended to provide a background or context to the embodiments of the invention recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Voice interaction is in essence human-machine interaction: a user and a machine interact, communicate, and exchange information with voice as the carrier, producing a series of inputs and outputs until a task is completed or a goal is achieved.
In existing voice interaction schemes, a developer writes the machine's response content in advance; when a user inputs voice information, it is converted into text, and response content matching the text is selected as the output. On the one hand, writing the response content requires a large investment of manpower and is inefficient; on the other hand, the pre-written response content is rigid and cannot meet the user's emotional needs.
Disclosure of Invention
As noted above, the existing voice interaction scheme requires a great deal of manpower to write response content, and that content is rigid and stiff, so it cannot meet the user's emotional needs.
An improved voice interaction method is therefore needed to realize human-computer interaction that is more efficient and more emotionally resonant.
In this context, embodiments of the present invention are intended to provide a voice interaction method and apparatus.
In a first aspect of embodiments of the present invention, a voice interaction method is provided, including: receiving voice information input by a user and converting the voice information into sentence text; obtaining comment information matched with the sentence text from a preset music comment library; and outputting the comment information as a response to the voice information.
In one embodiment of the present invention, after outputting the comment information as a response to the voice information, the method further includes: playing music corresponding to the comment information.
In another embodiment of the present invention, before obtaining the comment information matched with the sentence text from the preset music comment library, the method further includes: acquiring a plurality of pieces of comment information about music that satisfy a preset condition, and constructing the preset music comment library from the acquired pieces of comment information; and identifying the focus information and the intention information of each piece of comment information in the preset music comment library. Obtaining the comment information matched with the sentence text from the preset music comment library then includes: obtaining comment information matched with the sentence text based on the focus information and the intention information of each piece of comment information in the preset music comment library.
In still another embodiment of the present invention, acquiring the pieces of comment information about music that satisfy the preset condition includes: obtaining comment information corresponding to the user's personalized music according to the user's historical music interaction behavior data, where the user's personalized music includes at least one of the following: music collected by the user, music created by the user, music liked by the user, or music played by the user; and/or obtaining comment information corresponding to currently promoted music; and/or obtaining comment information whose number of likes exceeds a first threshold.
In another embodiment of the present invention, obtaining comment information matched with the sentence text based on the focus information and the intention information of each piece of comment information in the preset music comment library includes: identifying the focus information and the intention information of the sentence text; matching the focus information of the sentence text against the focus information of each piece of comment information in the preset music comment library, and screening out focus-matched comment information; and matching the intention information of the sentence text against the intention information of the focus-matched comment information, and screening out comment information that is both focus-matched and intention-matched.
In a further embodiment of the present invention, identifying the focus information and the intention information of each piece of comment information in the preset music comment library includes: extracting, from each piece of comment information, tags that characterize the corresponding focus information based on a tag library, and extracting intention sentence patterns that characterize the corresponding intention information based on an intention classification library. Identifying the focus information and the intention information of the sentence text likewise includes: extracting tags that characterize the corresponding focus information from the sentence text based on the tag library, and extracting an intention sentence pattern that characterizes the corresponding intention information from the sentence text based on the intention classification library. Matching the focus information of the sentence text against the focus information of each piece of comment information in the music comment library includes: matching the tags of the sentence text against the tags of each piece of comment information, and determining a piece of comment information to be focus-matched when the matching degree exceeds a second threshold. Matching the intention information of the sentence text against the intention information of the focus-matched comment information includes: matching the intention sentence pattern of the sentence text against the intention sentence patterns of the focus-matched comment information, and determining a piece of comment information to be both focus-matched and intention-matched when the matching degree exceeds a third threshold.
In another embodiment of the present invention, obtaining comment information matched with the sentence text from the music comment library further includes: when multiple pieces of focus-matched and intention-matched comment information are screened out, acquiring the priority of the music corresponding to each piece of comment information; and ranking the comments based on the priority of their music and selecting one piece of comment information based on the ranking result.
In another embodiment of the present invention, acquiring the priority of the music corresponding to each piece of comment information includes: determining a composite score for the music corresponding to each piece of comment information according to the user's historical music interaction behavior data, where the user's historical music interaction behavior data includes at least one of the following: data on the user collecting music, liking music, playing music, commenting on music, sharing music, or creating music.
In a second aspect of the embodiments of the present invention, a voice interaction apparatus is provided, including a receiving module, a matching module, and an output module. The receiving module is configured to receive voice information input by a user and convert the voice information into sentence text. The matching module is configured to obtain comment information matched with the sentence text from a preset music comment library. The output module is configured to output the comment information as a response to the voice information.
In an embodiment of the present invention, the apparatus further includes a playing module configured to play music corresponding to the comment information after the output module outputs the comment information as a response to the voice information.
In another embodiment of the present invention, the apparatus further includes a first preprocessing module and a second preprocessing module. The first preprocessing module is configured to acquire, before the matching module obtains the comment information matched with the sentence text from the preset music comment library, a plurality of pieces of comment information about music that satisfy a preset condition, and to construct the preset music comment library from the acquired pieces of comment information. The second preprocessing module is configured to identify the focus information and the intention information of each piece of comment information in the preset music comment library. The matching module is configured to obtain the comment information matched with the sentence text from the preset music comment library based on the focus information and the intention information of each piece of comment information therein.
In another embodiment of the present invention, the first preprocessing module is specifically configured to obtain comment information corresponding to the user's personalized music according to the user's historical music interaction behavior data, where the user's personalized music includes at least one of the following: music collected by the user, music created by the user, music liked by the user, or music played by the user; and/or to obtain comment information corresponding to currently promoted music; and/or to obtain comment information whose number of likes exceeds a first threshold.
In still another embodiment of the present invention, the matching module includes a recognition submodule, a first matching submodule, and a second matching submodule. The recognition submodule is configured to recognize the focus information and the intention information of the sentence text. The first matching submodule is configured to match the focus information of the sentence text against the focus information of each piece of comment information in the preset music comment library and screen out focus-matched comment information. The second matching submodule is configured to match the intention information of the sentence text against the intention information of the focus-matched comment information and screen out comment information that is both focus-matched and intention-matched.
In yet another embodiment of the present invention, the second preprocessing module is specifically configured to extract, from each piece of comment information, tags that characterize the corresponding focus information based on a tag library, and intention sentence patterns that characterize the corresponding intention information based on an intention classification library. The recognition submodule is specifically configured to extract tags that characterize the corresponding focus information from the sentence text based on the tag library, and to extract an intention sentence pattern that characterizes the corresponding intention information from the sentence text based on the intention classification library. The first matching submodule is specifically configured to match the tags of the sentence text against the tags of each piece of comment information and determine a piece of comment information to be focus-matched when the matching degree exceeds a second threshold. The second matching submodule is specifically configured to match the intention sentence pattern of the sentence text against the intention sentence patterns of the focus-matched comment information and determine a piece of comment information to be both focus-matched and intention-matched when the matching degree exceeds a third threshold.
In a further embodiment of the present invention, the matching module further includes an acquisition submodule and a ranking submodule. The acquisition submodule is configured to acquire, when multiple pieces of focus-matched and intention-matched comment information are screened out, the priority of the category to which the music corresponding to each piece of comment information belongs. The ranking submodule is configured to rank the comments based on the priority of their music and select one piece of comment information based on the ranking result.
In a further embodiment of the present invention, the acquisition submodule is specifically configured to determine, according to the user's historical music interaction behavior data, a composite score for the music corresponding to each piece of comment information, where the user's historical music interaction behavior data includes at least one of the following: data on the user collecting music, liking music, playing music, commenting on music, sharing music, or creating music.
In a third aspect of embodiments of the present invention, a medium is provided that stores computer-executable instructions which, when executed by a processor, implement the voice interaction method of any of the above embodiments.
In a fourth aspect of embodiments of the present invention, a computing device is provided, including a memory, a processor, and executable instructions stored on the memory and executable on the processor, where the processor, when executing the instructions, implements the voice interaction method of any of the above embodiments.
With the voice interaction method and apparatus of embodiments of the present invention, comment information matching the voice information input by the current user is selected as the response from a large body of existing comment information about music, so developers do not need to write response content in advance, greatly reducing the manpower invested in writing responses. Moreover, because comment information is written by real users expressing genuine emotions about the corresponding music, using the matched comment information as the response to the voice information can evoke emotional resonance in the user currently providing the voice input and thereby meet the user's emotional needs.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario of a voice interaction method and apparatus according to an embodiment of the present invention;
FIG. 2 schematically shows a flow chart of a voice interaction method according to one embodiment of the invention;
FIG. 3 schematically shows a flow chart of a voice interaction method according to another embodiment of the invention;
FIG. 4A schematically shows a diagram of a preset music comment library according to one embodiment of the invention;
FIG. 4B schematically shows a diagram of a voice interaction process according to one embodiment of the invention;
FIG. 5A schematically shows a block diagram of a voice interaction apparatus according to one embodiment of the present invention;
FIG. 5B schematically shows a block diagram of a voice interaction apparatus according to another embodiment of the present invention;
FIG. 6 schematically shows a block diagram of a matching module according to one embodiment of the invention;
FIG. 7 schematically shows a schematic diagram of a computer-readable storage medium product according to an embodiment of the invention; and
FIG. 8 schematically shows a block diagram of a computing device according to an embodiment of the present invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present invention will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the invention, and are not intended to limit the scope of the invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to the embodiment of the invention, a voice interaction method, a medium, a device and a computing device are provided.
In this context, it is to be understood that the terms involved include: voice information, sentence text, a preset music comment library, comment information, and the like. Voice information is audio data recorded from sound; its content converted into corresponding text is the sentence text. Comment information refers to music comment information: any user can comment on any piece of music, producing corresponding comment information about that music. The preset music comment library is constructed from many pieces of comment information about music. Moreover, any number of elements in the drawings is given by way of example rather than limitation, and any naming is used only for distinction and carries no limitation.
The principles and spirit of the present invention are explained in detail below with reference to several representative embodiments of the invention.
Summary of the Invention
In carrying out the disclosed concept, the inventors discovered that the existing voice interaction scheme requires a developer to write machine response content in advance; when a user inputs voice information, it is converted into text, and response content matching the text is selected as the output. This approach has the following problems: on the one hand, writing the response content requires a large investment of manpower and is inefficient; on the other hand, the pre-written response content is rigid and cannot meet the user's emotional needs.
Therefore, embodiments of the present invention provide a voice interaction method and apparatus, the method including: receiving voice information input by a user and converting the voice information into sentence text; obtaining comment information matched with the sentence text from a preset music comment library; and outputting the comment information as a response to the voice information. According to the embodiments of the disclosure, comment information matching the voice information input by the current user is selected as the response from a large body of existing comment information about music, so developers do not need to write response content in advance, greatly reducing the manpower invested in writing responses. And because comment information is written by real users expressing genuine emotions about the corresponding music, using the matched comment information as the response to the voice information can evoke emotional resonance in the user currently providing the voice input and meet the user's emotional needs.
Having described the general principles of the invention, various non-limiting embodiments of the invention are described in detail below.
Application scenario overview
First, referring to fig. 1, an application scenario of the voice interaction method and apparatus according to an embodiment of the present invention is described in detail.
Fig. 1 schematically shows an application scenario of a voice interaction method and apparatus according to an embodiment of the present invention. As shown in fig. 1, the application scenario includes an electronic device 110 and a user 120, where the electronic device 110 has a voice interaction function and interacts with the user 120 by voice. In this embodiment, the electronic device 110 is a smart speaker; in other embodiments, the electronic device 110 may be any device with a voice interaction function, such as a smartphone, a computer, a smartwatch, or various smart appliances, which is not limited here.
The electronic device 110 collects the voice information input by the user 120 through a microphone, responds accordingly, and executes the corresponding task. For example, the user 120 inputs the voice message "what's the weather like today", and the electronic device 110 queries the weather and responds "a low of -4 degrees Celsius, a high of 6 degrees Celsius, clear to cloudy" according to the query result; or the user 120 inputs "what time is it now", and the electronic device 110 queries the current time and responds "9:05". In both examples, the voice input has a definite answer, which the electronic device 110 can simply look up and return. In daily life, however, the voice information input by the user 120 usually has no definite answer; for example, when the user 120 says "I'm feeling really down today", the electronic device 110 should respond with the content that best meets the user's current psychological needs.
Exemplary method
A voice interaction method according to an exemplary embodiment of the present invention is described below with reference to figs. 2 to 4B in conjunction with the application scenario of fig. 1. It should be noted that the above application scenario is illustrated merely for ease of understanding the spirit and principles of the present invention, and embodiments of the present invention are not limited in this respect; rather, they may be applied in any applicable scenario.
Fig. 2 schematically shows a flow chart of a voice interaction method according to an embodiment of the invention. As shown in fig. 2, the method includes the following operations:
Operation S201: receive voice information input by a user and convert the voice information into sentence text.
Operation S202: obtain comment information matched with the sentence text from a preset music comment library.
Operation S203: output the comment information as a response to the voice information.
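Read as a pipeline, the three operations compose directly. The following is a minimal Python sketch under stated assumptions: speech_to_text and match_comment are hypothetical stand-ins for the ASR and matching stages, not names taken from the patent.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Comment:
        text: str      # the comment itself
        music_id: str  # identifier of the music it was posted on

    def speech_to_text(audio: bytes) -> str:
        # Hypothetical ASR stand-in; a real system would call an ASR engine.
        raise NotImplementedError

    def match_comment(sentence_text: str, library: List[Comment]) -> Comment:
        # Hypothetical matcher; the matching rules are sketched later.
        raise NotImplementedError

    def handle_voice_input(audio: bytes, library: List[Comment]) -> str:
        sentence_text = speech_to_text(audio)            # S201: voice -> sentence text
        comment = match_comment(sentence_text, library)  # S202: match from the library
        return comment.text                              # S203: comment as the response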
It can be seen that, in the method shown in fig. 2, comment information matching the voice information input by the current user is selected as the response from many existing pieces of comment information about music; therefore, developers need not write response content in advance, greatly reducing the manpower invested in writing responses. And because comment information is written by real users expressing genuine emotion about the corresponding music, using the matched comment information as the response can evoke emotional resonance in the user currently providing the voice input and meet the user's emotional needs.
Fig. 3 schematically shows a flow chart of a voice interaction method according to another embodiment of the invention. As shown in fig. 3, the method includes operations S201 to S204, where operations S201 to S203 are the same as those shown in fig. 2 and are not described again here.
In operation S204, music corresponding to the comment information is played.
In a specific example, the music corresponding to the comment information may be played directly after the comment information is output as the response, or it may be played when a predetermined trigger condition is satisfied within a predetermined time after the comment information is output. The music corresponding to the comment information may be any kind of audio file, such as a song, pure music, speech, a lecture, or a broadcast, which is not limited here.
In the embodiment of the disclosure, any user can comment on any piece of music to produce comment information, so each piece of music corresponds to one or more pieces of users' comment information, and the preset music comment library contains the comment information of one or more pieces of music. After voice information input by a user is received, it is converted into sentence text, and comment information matched with the sentence text is obtained from the preset music comment library; the obtained comment information can express an emotion similar to that of the input voice information, so outputting it as the response naturally evokes emotional resonance in the user currently providing the voice input. Furthermore, the music corresponding to the comment information is played after the comment information is output. Because the emotion expressed by the comment information was evoked by the corresponding music, that music fits the emotion; playing it to the user currently inputting voice information creates an atmosphere matched to the user's mood, making the voice interaction more natural and emotional, rather than the stiff human-machine interaction of the prior art.
In the embodiment of the present disclosure, before operation S202 acquires comment information matching the sentence text from the preset music comment library, the method shown in fig. 2 or fig. 3 may further include some preprocessing: acquiring a plurality of pieces of comment information about music that satisfy a preset condition and constructing the preset music comment library from them; and identifying the focus information and the intention information of each piece of comment information in the library. The preprocessing thus builds a preset music comment library containing comment information that satisfies the preset condition, and each piece of comment information in the library is analyzed so that its focus information and intention information are identified. The focus information of a piece of comment information is the most important information it expresses, the part its author wants readers of the comment to notice; each piece of comment information may contain one or more pieces of focus information, and the focus information may be characterized by one or more tags. The intention information of a piece of comment information is the operation or purpose its author expresses and intends to achieve through the comment.
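One way to picture an entry of such a preprocessed library is as a record carrying the comment text, its music, its focus tags, and its intention pattern. A minimal sketch follows; the field names and example values are assumptions for illustration, not structures defined by the patent.

    from dataclasses import dataclass
    from typing import List, Set

    @dataclass
    class CommentEntry:
        text: str                 # the comment text itself
        music_id: str             # music the comment was posted on
        focus_tags: Set[str]      # focus information, e.g. {"insomnia", "lonely"}
        intent_pattern: Set[str]  # intention information, e.g. {"chat", "comfort"}
        likes: int = 0            # like count, used when building the library

    # The preset music comment library is then simply a collection of such entries.
    preset_music_comment_library: List[CommentEntry] = []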
On this basis, operation S202 of obtaining comment information matched with the sentence text from the preset music comment library includes: obtaining comment information matched with the sentence text based on the focus information and the intention information of each piece of comment information in the library. In this embodiment, comment information is matched against the sentence text of the input voice information based on the focus information and the intention information. Because focus and intention information reflect subjective factors such as a real user's emotions, ideas, and viewpoints, these two kinds of information make it possible to effectively obtain comment information expressing emotions, ideas, and viewpoints similar to those of the input voice information, fitting the current user psychologically to the greatest extent.
Specifically, as an alternative embodiment, acquiring the pieces of comment information about music that satisfy the preset condition includes: obtaining comment information corresponding to the user's personalized music according to the user's historical music interaction behavior data, where the user's personalized music includes at least one of the following: music collected by the user, music created by the user, music liked by the user, or music played by the user; and/or obtaining comment information corresponding to currently promoted music; and/or obtaining comment information whose number of likes exceeds a first threshold.
Under this embodiment, the preset music comment library may include comment information on the personalized music of the user currently engaged in voice interaction. Personalized music reflects the user's musical preferences; obtaining the response from comments on music the user prefers, and then playing the corresponding music, makes resonance more likely. The library may also include comment information corresponding to currently promoted music; answering with a matched comment on promoted music and then playing that music both satisfies the user's voice interaction needs and recommends the promoted music to the user. The library may further include comment information whose number of likes exceeds the first threshold, i.e., popular comments; such comments are representative and resonate with most people, so answering with a matched popular comment and playing the corresponding music likewise makes resonance more likely.
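A minimal sketch of applying the three preset conditions when building the library is given below. The data sources (the user's personalized music, the promotion list) and the default cutoff are assumptions, although the embodiment of fig. 4A does use 500 likes as its threshold.

    def build_comment_library(all_comments, personalized_music_ids,
                              promoted_music_ids, first_threshold=500):
        # A comment enters the library if its music is among the user's
        # personalized music, is currently being promoted, or the comment
        # itself has at least first_threshold likes.
        return [c for c in all_comments
                if c.music_id in personalized_music_ids
                or c.music_id in promoted_music_ids
                or c.likes >= first_threshold]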
In an embodiment of the disclosure, obtaining comment information matched with the sentence text based on the focus information and the intention information of each piece of comment information in the preset music comment library includes: identifying the focus information and the intention information of the sentence text; matching the focus information of the sentence text against the focus information of each piece of comment information in the preset music comment library and screening out focus-matched comment information; and matching the intention information of the sentence text against the intention information of the focus-matched comment information and screening out comment information that is both focus-matched and intention-matched.
The focus information of the sentence text is the most important information it expresses, the part that the originator of the corresponding voice information wants the receiver to notice; each sentence text may contain one or more pieces of focus information, which may be characterized by one or more tags. The intention information of the sentence text is the operation or purpose that the originator of the corresponding voice information expresses and intends to achieve. The process above performs focus matching first, screening out the comment information matched to the focus of the sentence text and discarding a large amount of irrelevant comment information; it then performs intention matching over the survivors, screening out comment information that is both focus-matched and intention-matched, which improves matching efficiency.
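This two-stage screen is straightforward to express in code. A minimal sketch under stated assumptions follows; the predicates focus_matches and intent_matches are placeholders, fleshed out in the threshold sketch a few paragraphs below.

    def screen_comments(sentence_focus, sentence_intent, library,
                        focus_matches, intent_matches):
        # Stage 1: focus matching -- cheaply discards most irrelevant comments.
        focus_hits = [c for c in library if focus_matches(sentence_focus, c)]
        # Stage 2: intent matching -- runs only over the focus-matched survivors.
        return [c for c in focus_hits if intent_matches(sentence_intent, c)]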
Specifically, as an alternative embodiment, identifying the focus information and the intention information of each piece of comment information in the preset music comment library includes: extracting, from each piece of comment information, tags that characterize the corresponding focus information based on a tag library, and intention sentence patterns that characterize the corresponding intention information based on an intention classification library. Identifying the focus information and the intention information of the sentence text likewise includes: extracting tags that characterize the corresponding focus information from the sentence text based on the tag library, and extracting an intention sentence pattern that characterizes the corresponding intention information from the sentence text based on the intention classification library. Matching the focus information of the sentence text against the focus information of each piece of comment information in the music comment library includes: matching the tags of the sentence text against the tags of each piece of comment information and determining a piece of comment information to be focus-matched when the matching degree exceeds a second threshold. Matching the intention information of the sentence text against the intention information of the focus-matched comment information includes: matching the intention sentence pattern of the sentence text against the intention sentence patterns of the focus-matched comment information and determining a piece of comment information to be both focus-matched and intention-matched when the matching degree exceeds a third threshold.
The tag library and the intention classification library used to identify focus and intention information can be preset and continuously updated and expanded during use. The same tag library is used both when identifying the focus information of comment information and when identifying the focus information of sentence text, so the extraction standard for focus information is consistent and subsequent focus matching is accurate. Likewise, the same intention classification library is used both when identifying the intention information of comment information and when identifying the intention information of sentence text, so the extraction standard for intention information is consistent and subsequent intention matching is accurate.
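Concretely, the extraction and the thresholded matching degree might look as follows. The literal-substring tag extraction and the Jaccard overlap are illustrative assumptions; the patent fixes neither the extraction method nor the similarity measure.

    def extract_focus_tags(text, tag_library):
        # Naive stand-in: keep every library tag that literally occurs in the text.
        return {tag for tag in tag_library if tag in text}

    def overlap(a, b):
        # Jaccard overlap of two tag/pattern sets as the "matching degree".
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def focus_matches(sentence_tags, comment, second_threshold=0.5):
        return overlap(sentence_tags, comment.focus_tags) > second_threshold

    def intent_matches(sentence_pattern, comment, third_threshold=0.5):
        return overlap(sentence_pattern, comment.intent_pattern) > third_threshold

These predicates plug directly into the screen_comments sketch above; the 0.5 values are placeholders, since the patent only requires that the matching degree exceed the second and third thresholds.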
In another embodiment of the present disclosure, the focus information and intention information of the comment information in the preset music comment library may instead not be identified in advance. In that case, after the voice information input by the user is converted into sentence text, operation S202 of obtaining comment information matched with the sentence text from the preset music comment library may include: identifying the focus information and the intention information of the sentence text; matching the focus information of the sentence text against each piece of comment information in the preset music comment library and screening out focus-matched comment information; and matching the intention information of the sentence text against the intention information of the focus-matched comment information and screening out comment information that is both focus-matched and intention-matched.
When a single piece of focus-matched and intention-matched comment information is screened out, it is output directly as the response. When multiple pieces of focus-matched and intention-matched comment information are screened out, as an optional embodiment, operation S202 of obtaining comment information matched with the sentence text from the preset music comment library further includes: acquiring the priority of the music corresponding to each piece of comment information, ranking the comments based on the priority of their music, and selecting one piece of comment information based on the ranking result.
Optionally, acquiring the priority of the music corresponding to each piece of comment information includes: determining a composite score for the music corresponding to each piece of comment information according to the user's historical music interaction behavior data, which includes at least one of the following: data on the user collecting music, liking music, playing music, commenting on music, sharing music, or creating music.
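As an illustration of how such a composite score might be computed from behavior data (the weights and the history format are assumptions; the patent leaves the scoring scheme open):

    # Hypothetical behavior weights; the patent does not prescribe values.
    BEHAVIOR_WEIGHTS = {"collect": 3.0, "create": 3.0, "share": 2.5,
                        "like": 2.0, "comment": 2.0, "play": 1.0}

    def composite_score(music_id, history):
        # history: iterable of (behavior, music_id) pairs from the user's past.
        return sum(BEHAVIOR_WEIGHTS.get(behavior, 0.0)
                   for behavior, mid in history if mid == music_id)

    def pick_comment(candidates, history):
        # Rank the focus- and intention-matched comments by the score of the
        # music each belongs to, and answer with the top-ranked comment.
        return max(candidates, key=lambda c: composite_score(c.music_id, history))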
The method shown in figs. 2 and 3 is described below with reference to figs. 4A and 4B in conjunction with a specific embodiment:
in this embodiment, user A interacts by voice with a smart speaker, and a preset music comment library is constructed before the voice interaction starts.
FIG. 4A schematically shows a diagram of a preset music comment library according to one embodiment of the invention.
As shown in fig. 4A, the preset music comment library includes: comment information corresponding to user A's personalized music, comment information on music promoted within the current preset time, and popular comment information. User A's personalized music is music toward which user A has shown positive interaction behaviors such as collecting, creating, liking, sharing, and playing; the music promoted within the current preset time includes one or more of music currently popular, music whose promotion has been agreed with a partner, and the like; and popular comment information is comment information with 500 or more likes.
Preliminary base tags are screened out from user A's historical voice interaction content, including: "loneliness", "insomnia", "heartbreak", "memories", "worries", "anxiety", and so on; these base tags form a base tag library. Sentence patterns are formed from the base tags in the base tag library and extracted from the comment information in the preset music comment library; the comment information is classified by base tag and summarized into intention classification sentence patterns, forming an intention classification library, and the tag library is expanded according to the extracted sentence patterns. Sentence-pattern extraction is then performed again on each piece of comment information based on the expanded tag library, the tag classification of each piece of comment information is updated and summarized into the intention classification library, and the tag library is expanded again from the newly extracted patterns. This cycle of expansion continues until the final tag library and intention classification library are reached; one or more tags characterizing the focus information of each piece of comment information are then obtained from the final tag library, and an intention sentence pattern characterizing its intention information is obtained from the final intention classification library.
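The bootstrap just described, alternating between extracting sentence patterns with the current tag library and mining new tags from those patterns until nothing new appears, can be sketched as follows; the two mining helpers are hypothetical, since the patent does not specify their internals.

    def expand_libraries(comments, base_tags, extract_patterns, mine_new_tags,
                         max_rounds=10):
        tag_library = set(base_tags)
        intent_library = set()
        for _ in range(max_rounds):
            patterns = extract_patterns(comments, tag_library)
            intent_library |= patterns
            new_tags = mine_new_tags(patterns) - tag_library
            if not new_tags:      # fixed point reached: nothing left to add
                break
            tag_library |= new_tags
        return tag_library, intent_library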
FIG. 4B schematically shows a diagram of a voice interaction process according to one embodiment of the invention.
When the smart speaker receives voice information input by user A, the voice information is converted into sentence text by ASR (Automatic Speech Recognition); in this example, the sentence text corresponding to user A's voice input is "I can't fall asleep". Semantic analysis of the sentence text based on Natural Language Understanding (NLU) yields an intention sentence pattern characterizing its intention information, {I, chat, comfort}, and tags characterizing its focus information, {insomnia, lonely}.
The similarity between the focus information of the sentence text and that of the comment information in the preset music comment library shown in fig. 4A is computed by a preset algorithm, and comment information with similarity above the second threshold is screened out. In this example, an item-based collaborative filtering (item_cf) similarity from recommender systems is used to compare the tags {insomnia, lonely} of the sentence text "I can't fall asleep" with the tags of each piece of comment information, and 5 pieces of comment information shown in fig. 4B are screened out: comment 1, comment 2, comment 3, comment 4, and comment 5. Their intention sentence patterns are {chat} for comment 1, {chat, comfort} for comment 2, {treatment} for comment 3, {movie/television} for comment 4, and {chat, comfort} for comment 5.
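As an illustrative stand-in for that similarity (not the exact item_cf computation): with binary tag vectors, cosine similarity reduces to a simple set formula.

    import math

    def tag_similarity(a, b):
        # Cosine similarity of two binary tag vectors, written over sets.
        if not a or not b:
            return 0.0
        return len(a & b) / math.sqrt(len(a) * len(b))

    # e.g. the query tags of "I can't fall asleep" against one comment's tags:
    print(tag_similarity({"insomnia", "lonely"}, {"insomnia", "midnight"}))  # 0.5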
The intention information of the sentence text is then matched against the intention sentence patterns of the screened comments, and comments whose matching degree exceeds the third threshold are kept. In this example, the intention sentence pattern {I, chat, comfort} of "I can't fall asleep" is matched against the intention sentence patterns of the 5 screened comments, keeping those whose pattern includes {chat, comfort}: comment 2 and comment 5.
A composite score of the music corresponding to each screened comment is determined from user A's historical music interaction behavior data, and the comment whose music has the highest composite score is selected. Characteristics of the music other than user preference may also be considered; for example, the screened comments may be ranked by the priority order of music promoted within the current preset time, music the user likes, and music the user has played more than a fourth threshold number of times, with the highest-priority comment selected. In this example, comments 2 and 5 are ranked by user A's music preference; the music corresponding to comment 5 is music user A likes and thus has the higher priority, so comment 5 is finally selected, and its corresponding music is the song "Midnight Diner".
Using text-to-speech (TTS), the smart speaker converts comment 5 into the voice message "It is said that a person who cannot sleep at night is awake in someone else's dream" and outputs it as the response to user A's "I can't fall asleep". In addition, the music "Midnight Diner" corresponding to comment 5 may be played directly, or played after user A confirms. Because the music is a song user A likes, and comment 5 was written by another real user in a similar sleepless mood, the smart speaker's outputting comment 5 as the response and playing the corresponding song evokes user A's empathy and resonance, making the process a warmer human-machine interaction.
Embodiments of the disclosure reduce the manpower cost required for voice interaction by making full use of comment information already written by real users. Selecting comment information matched with the input sentence text as the response promotes emotional interaction in voice conversation; unlike mechanical question-and-answer dialogue, the response content meets the user's psychological needs, and the matched music that is played helps create the right atmosphere, letting the user feel warmth and resonance.
Exemplary devices
Having described the method of the exemplary embodiment of the present invention, the voice interaction apparatus of the exemplary embodiment of the present invention will be described in detail with reference to fig. 5A to 6.
FIG. 5A schematically shows a block diagram of a voice interaction device, according to an embodiment of the present invention. As shown in fig. 5A, the voice interaction apparatus 500 includes:
the receiving module 510 is configured to receive voice information input by a user, and convert the voice information into a sentence text.
The matching module 520 is configured to obtain comment information matched with the sentence text from a preset music comment library.
The output module 530 is configured to output the comment information as a response to the voice information.
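As a minimal sketch, the three modules can be pictured as pluggable stages of one apparatus; the constructor arguments are hypothetical stand-ins for the internals of modules 510, 520, and 530, which the patent leaves open.

    class VoiceInteractionApparatus:
        def __init__(self, asr, matcher, tts):
            self.asr = asr          # backend of receiving module 510
            self.matcher = matcher  # backend of matching module 520
            self.tts = tts          # backend of output module 530

        def interact(self, audio: bytes) -> bytes:
            sentence_text = self.asr(audio)        # receive voice, convert to text
            comment = self.matcher(sentence_text)  # match a comment from the library
            return self.tts(comment)               # speak the comment as the response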
It can be seen that, for the voice information input by the user, the apparatus shown in fig. 5A selects comment information matching it as the response from many existing pieces of comment information about music; therefore, developers need not write response content in advance, greatly reducing the manpower invested in writing responses. And because comment information is written by real users expressing genuine emotion about the corresponding music, using the matched comment information as the response can evoke emotional resonance in the user currently providing the voice input and meet the user's emotional needs.
FIG. 5B schematically shows a block diagram of a voice interaction device, according to another embodiment of the present invention. As shown in fig. 5B, the voice interaction apparatus 500' includes: a receiving module 510, a matching module 520, an output module 530, a playing module 540, a first preprocessing module 550, and a second preprocessing module 560. The receiving module 510, the matching module 520, and the output module 530 are described above, and repeated descriptions are omitted.
The playing module 540 is configured to play music corresponding to the comment information after the output module outputs the comment information as a response to the voice information.
In an embodiment of the present disclosure, the first preprocessing module 550 is configured to acquire, before the matching module 520 obtains the comment information matched with the sentence text from the preset music comment library, a plurality of pieces of comment information about music that satisfy a preset condition, and to construct the preset music comment library from them. The second preprocessing module 560 is configured to identify the focus information and the intention information of each piece of comment information in the preset music comment library.
On this basis, the matching module 520 obtains the comment information matched with the sentence text from the preset music comment library based on the focus information and the intention information of each piece of comment information in the library.
Specifically, as an optional embodiment, the first preprocessing module 550 is configured to obtain comment information corresponding to the user's personalized music according to the user's historical music interaction behavior data, where the user's personalized music includes at least one of the following: music collected by the user, music created by the user, music liked by the user, or music played by the user; and/or to obtain comment information corresponding to currently promoted music; and/or to obtain comment information whose number of likes exceeds a first threshold.
Fig. 6 schematically shows a block diagram of a matching module according to an embodiment of the invention. As shown in fig. 6, the matching module 520 includes: a recognition sub-module 521, a first matching sub-module 522, a second matching sub-module 523, an acquisition sub-module 524, and a ranking sub-module 525.
In one embodiment of the present disclosure, the recognition sub-module 521 is configured to recognize the focus information and the intention information of the sentence text. The first matching sub-module 522 is configured to match the focus information of the sentence text against the focus information of each piece of comment information in the preset music comment library and screen out focus-matched comment information. The second matching sub-module 523 is configured to match the intention information of the sentence text against the intention information of the focus-matched comment information and screen out comment information that is both focus-matched and intention-matched.
As an optional embodiment, the second preprocessing module 560 is specifically configured to extract, from each piece of comment information, tags characterizing the corresponding focus information based on the tag library, and intention sentence patterns characterizing the corresponding intention information based on the intention classification library. The recognition sub-module 521 is specifically configured to extract tags characterizing the corresponding focus information from the sentence text based on the tag library, and to extract an intention sentence pattern characterizing the corresponding intention information from the sentence text based on the intention classification library. The first matching sub-module 522 is specifically configured to match the tags of the sentence text against the tags of each piece of comment information and determine a piece of comment information to be focus-matched when the matching degree exceeds the second threshold. The second matching sub-module 523 is specifically configured to match the intention sentence pattern of the sentence text against the intention sentence patterns of the focus-matched comment information and determine a piece of comment information to be both focus-matched and intention-matched when the matching degree exceeds the third threshold.
In an embodiment of the present disclosure, the acquisition sub-module 524 is configured to acquire, when multiple pieces of focus-matched and intention-matched comment information are screened out, the priority of the category to which the music corresponding to each piece of comment information belongs. The ranking sub-module 525 is configured to rank the comments based on the priority of their music and select one piece of comment information based on the ranking result.
Optionally, the acquisition sub-module 524 is specifically configured to determine, according to the user's historical music interaction behavior data, a composite score for the music corresponding to each piece of comment information, where the data includes at least one of the following: data on the user collecting music, liking music, playing music, commenting on music, sharing music, or creating music.
It should be noted that the implementations, technical problems solved, functions achieved, and technical effects obtained by the modules/units/subunits in the apparatus embodiments are respectively the same as or similar to those of the corresponding steps in the method embodiments, and are not repeated here.
Exemplary Medium
Having described the method and apparatus of exemplary embodiments of the present invention, a medium implementing a voice interaction method of exemplary embodiments of the present invention will be described.
An embodiment of the present invention provides a medium storing computer-executable instructions which, when executed by a processor, implement the voice interaction method of any one of the above method embodiments.
In some possible embodiments, aspects of the present invention may also be implemented in the form of a program product comprising program code; when the program product runs on a computing device, the program code causes the computing device to perform the steps of the voice interaction method according to the various exemplary embodiments described in the "Exemplary method" section above, for example the operations shown in fig. 2 or fig. 3.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Fig. 7 schematically shows a computer-readable storage medium product according to an embodiment of the present invention. As shown in fig. 7, a program product 70 implementing the voice interaction method according to an embodiment of the present invention may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a computing device such as a personal computer. However, the program product of the present invention is not limited in this regard; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++ and conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the internet using an internet service provider).
Exemplary Computing Device
Having described the method, medium, and apparatus of exemplary embodiments of the present invention, a computing device implementing a voice interaction method according to another exemplary embodiment of the present invention is described next.
An embodiment of the present invention further provides a computing device, including a memory, a processor, and executable instructions stored on the memory and executable on the processor, wherein the processor, when executing the instructions, implements the voice interaction method of any one of the above method embodiments.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method, or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, a computing device implementing the voice interaction method according to the present invention may include at least one processing unit and at least one storage unit. The storage unit stores program code which, when executed by the processing unit, causes the processing unit to perform the operational steps of the voice interaction method according to the various exemplary embodiments of the present invention described in the "exemplary method" section of this specification. For example, the processing unit may perform the operational steps shown in fig. 2 and fig. 3.
A computing device 80 implementing the voice interaction method according to this embodiment of the present invention is described below with reference to fig. 8. The computing device 80 shown in FIG. 8 is only one example and should not be taken to limit the scope of use and functionality of embodiments of the present invention.
As shown in fig. 8, computing device 80 is embodied in the form of a general purpose computing device. Components of computing device 80 may include, but are not limited to: the at least one processing unit 801, the at least one memory unit 802, and a bus 803 that couples various system components including the memory unit 802 and the processing unit 801.
Bus 803 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 802 may include readable media in the form of volatile memory, such as a random access memory (RAM) 8021 and/or a cache memory 8022, and may further include a read-only memory (ROM) 8023.
Storage unit 802 can also include a program/utility 8025 having a set (at least one) of program modules 8024, such program modules 8024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 80 may also communicate with one or more external devices 804 (e.g., a keyboard, a pointing device, a bluetooth device, etc.), with one or more devices that enable a user to interact with computing device 80, and/or with any devices (e.g., a router, a modem, etc.) that enable computing device 80 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 805. Moreover, computing device 80 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the internet) via a network adapter 806. As shown, the network adapter 806 communicates with the other modules of computing device 80 over the bus 803. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 80, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several units/modules or sub-units/modules of the voice interaction apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the invention, the features and functionality of two or more of the units/modules described above may be embodied in a single unit/module. Conversely, the features and functions of one unit/module described above may be further divided into a plurality of units/modules.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the invention have been described with reference to several particular embodiments, it is to be understood that the invention is not limited to the disclosed embodiments, and that the division into aspects is for convenience of presentation only and does not mean that features in those aspects cannot be combined to advantage. The invention is intended to cover the various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (12)

1. A voice interaction method, comprising:
receiving voice information input by a user, and converting the voice information into a sentence text;
obtaining comment information matched with the sentence text from a preset music comment library, wherein the obtaining comprises: obtaining comment information matched with the sentence text based on focus information and intention information of each piece of comment information in the preset music comment library; and
outputting the comment information as a response to the voice information;
playing music corresponding to the comment information;
before obtaining comment information matched with the sentence text from a preset music comment library, the method further comprises the following steps:
acquiring a plurality of pieces of comment information about music that meet preset conditions, and constructing the preset music comment library from the acquired pieces of comment information; and
identifying focus information and intention information of each piece of comment information in the preset music comment library;
the focus information of a piece of comment information refers to the important information expressed by the comment information, namely the part that the initiator of the comment information intends a viewer of the comment information to notice when seeing the comment information; the intention information of a piece of comment information refers to the operation or purpose, expressed by the comment information, that the initiator of the comment information desires to realize;
the obtaining of comment information matched with the sentence text based on the focus information and the intention information of each piece of comment information in the preset music comment library includes:
identifying focus information and intention information of the sentence text;
matching the focus information of the sentence text with the focus information of each piece of comment information in the preset music comment library, and screening out focus-matched comment information; and
matching the intention information of the sentence text with the intention information of the focus-matched comment information, and screening out comment information that is both focus-matched and intention-matched.
2. The method of claim 1, wherein the acquiring of the plurality of pieces of comment information about music that meet the preset conditions includes:
obtaining comment information corresponding to personalized music of the user according to the historical music interaction behavior data of the user, wherein the personalized music of the user comprises at least one of the following: music collected by the user, music created by the user, music liked by the user, or music played by the user; and/or
obtaining comment information corresponding to currently promoted music; and/or
obtaining comment information for which the number of likes exceeds a first threshold value.
3. The method of claim 1, wherein:
the identifying of the focus information and the intention information of each piece of comment information in the preset music comment library comprises: extracting tags for representing corresponding focus information from each piece of comment information based on a tag library, and extracting intention sentence patterns for representing corresponding intention information from each piece of comment information based on an intention classification library;
the identifying of the focus information and the intention information of the sentence text comprises: extracting tags for representing corresponding focus information from the sentence text based on the tag library, and extracting intention sentence patterns for representing corresponding intention information from the sentence text based on the intention classification library;
the matching of the focus information of the sentence text with the focus information of each piece of comment information in the preset music comment library comprises: matching the tags of the sentence text with the tags of each piece of comment information, and determining a piece of comment information as focus-matched when the matching degree exceeds a second threshold value; and
the matching of the intention information of the sentence text with the intention information of the focus-matched comment information comprises: matching the intention sentence pattern of the sentence text with the intention sentence patterns of the focus-matched comment information, and determining a piece of focus-matched comment information as also intention-matched when the matching degree exceeds a third threshold value.
4. The method of claim 1, wherein the obtaining of comment information matched with the sentence text from the preset music comment library further comprises:
when a plurality of pieces of comment information that are focus-matched and intention-matched are screened out, acquiring the priority of the music corresponding to each piece of comment information; and
ranking the comment information based on the priority of the music, and selecting a piece of comment information based on a ranking result.
5. The method of claim 4, wherein the acquiring of the priority of the music corresponding to each piece of comment information includes:
determining a comprehensive score of the music corresponding to each piece of comment information according to the historical music interaction behavior data of the user, wherein the historical music interaction behavior data of the user comprises at least one of the following: behavior data of the user collecting music, behavior data of the user liking music, behavior data of the user playing music, behavior data of the user commenting on music, behavior data of the user sharing music, or behavior data of the user creating music.
6. A voice interaction device, comprising:
the receiving module is used for receiving voice information input by a user and converting the voice information into a sentence text;
the matching module is used for acquiring comment information matched with the sentence text from a preset music comment library; and
an output module for outputting the comment information as a response to the voice information;
a playing module for playing music corresponding to the comment information after the comment information is output as a response to the voice information by the output module;
the first preprocessing module is used for acquiring, before the matching module acquires the comment information matched with the sentence text from the preset music comment library, a plurality of pieces of comment information about music meeting preset conditions, and for constructing the preset music comment library from the acquired pieces of comment information; and
the second preprocessing module is used for identifying the focus information and the intention information of each piece of comment information in the preset music comment library;
the matching module is used for acquiring comment information matched with the sentence text from the preset music comment library based on the focus information and the intention information of each piece of comment information in the preset music comment library;
the focus information of a piece of comment information refers to the important information expressed by the comment information, namely the part that the initiator of the comment information intends a viewer of the comment information to notice when seeing the comment information; the intention information of a piece of comment information refers to the operation or purpose, expressed by the comment information, that the initiator of the comment information desires to realize;
wherein the matching module comprises:
the recognition submodule is used for recognizing the focus information and the intention information of the sentence text;
the first matching submodule is used for matching the focus information of the sentence text with the focus information of each piece of comment information in the preset music comment library and screening out focus-matched comment information; and
the second matching submodule is used for matching the intention information of the sentence text with the intention information of the focus-matched comment information and screening out comment information that is both focus-matched and intention-matched.
7. The apparatus of claim 6, wherein:
the first preprocessing module is configured to acquire comment information corresponding to personalized music of the user according to the historical music interaction behavior data of the user, where the personalized music of the user includes at least one of the following: music collected by the user, music created by the user, music liked by the user, or music played by the user; and/or
acquire comment information corresponding to currently promoted music; and/or
acquire comment information for which the number of likes exceeds a first threshold value.
8. The apparatus of claim 6, wherein:
the second preprocessing module is used for extracting tags for representing corresponding focus information from each piece of comment information based on the tag library and for extracting intention sentence patterns for representing corresponding intention information from each piece of comment information based on the intention classification library;
the identification submodule is used for extracting tags for representing corresponding focus information from the sentence text based on the tag library and for extracting intention sentence patterns for representing corresponding intention information from the sentence text based on the intention classification library;
the first matching sub-module is used for matching the tags of the sentence text with the tags of each piece of comment information, and for determining a piece of comment information as focus-matched when the matching degree exceeds a second threshold value; and
the second matching submodule is used for matching the intention sentence pattern of the sentence text with the intention sentence patterns of the focus-matched comment information, and for determining a piece of focus-matched comment information as also intention-matched when the matching degree exceeds a third threshold value.
9. The apparatus of claim 6, wherein the matching module further comprises:
the obtaining submodule is used for acquiring, when a plurality of pieces of comment information that are focus-matched and intention-matched are screened out, the priority of the category to which the music corresponding to each piece of comment information belongs; and
the sequencing submodule is used for ranking the comment information based on the priority of the music and selecting a piece of comment information based on a ranking result.
10. The apparatus of claim 9, wherein:
the acquisition submodule is used for determining the comprehensive score of the music corresponding to each piece of comment information according to the historical music interaction behavior data of the user,
wherein the historical music interaction behavior data of the user comprises at least one of the following: behavior data of the user collecting music, behavior data of the user liking music, behavior data of the user playing music, behavior data of the user commenting on music, behavior data of the user sharing music, or behavior data of the user creating music.
11. A medium storing computer executable instructions, which when executed by a processor, are operable to implement:
the voice interaction method of any one of claims 1 to 5.
12. A computing device, comprising: a memory, a processor, and executable instructions stored on the memory and executable on the processor, the processor when executing the instructions implementing:
the voice interaction method of any one of claims 1 to 5.
CN201910005993.4A 2019-01-03 2019-01-03 Voice interaction method, medium, device and computing equipment Active CN109710799B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910005993.4A CN109710799B (en) 2019-01-03 2019-01-03 Voice interaction method, medium, device and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910005993.4A CN109710799B (en) 2019-01-03 2019-01-03 Voice interaction method, medium, device and computing equipment

Publications (2)

Publication Number Publication Date
CN109710799A CN109710799A (en) 2019-05-03
CN109710799B true CN109710799B (en) 2021-08-27

Family

ID=66260703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910005993.4A Active CN109710799B (en) 2019-01-03 2019-01-03 Voice interaction method, medium, device and computing equipment

Country Status (1)

Country Link
CN (1) CN109710799B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413834B (en) * 2019-06-14 2022-07-05 北京字节跳动网络技术有限公司 Voice comment modification method, system, medium and electronic device
CN112749260A (en) * 2019-10-31 2021-05-04 阿里巴巴集团控股有限公司 Information interaction method, device, equipment and medium
CN111309903B (en) * 2020-01-20 2023-06-16 北京大米未来科技有限公司 Data processing method and device, storage medium and electronic equipment
CN113744071A (en) * 2021-08-03 2021-12-03 北京搜狗科技发展有限公司 Comment information processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107153641A (en) * 2017-05-08 2017-09-12 北京百度网讯科技有限公司 Comment information determines method, device, server and storage medium
CN107688608A (en) * 2017-07-28 2018-02-13 合肥美的智能科技有限公司 Intelligent sound answering method, device, computer equipment and readable storage medium storing program for executing

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103970791B (en) * 2013-02-01 2018-01-23 华为技术有限公司 A kind of method, apparatus for recommending video from video library
CN103559619A (en) * 2013-11-12 2014-02-05 北京京东尚科信息技术有限公司 Response method and system for garment size information
CN108804456B (en) * 2017-04-28 2023-04-18 微软技术许可有限责任公司 Chat sessions based on object-specific knowledge base
WO2018214163A1 (en) * 2017-05-26 2018-11-29 Microsoft Technology Licensing, Llc Providing product recommendation in automated chatting
WO2018232622A1 (en) * 2017-06-21 2018-12-27 Microsoft Technology Licensing, Llc Media content recommendation through chatbots
CN111459290B (en) * 2018-01-26 2023-09-19 上海智臻智能网络科技股份有限公司 Interactive intention determining method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN109710799A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
CN109710799B (en) Voice interaction method, medium, device and computing equipment
CN110853618B (en) Language identification method, model training method, device and equipment
US9824150B2 (en) Systems and methods for providing information discovery and retrieval
CN107918653B (en) Intelligent playing method and device based on preference feedback
CN110517689B (en) Voice data processing method, device and storage medium
US20200126566A1 (en) Method and apparatus for voice interaction
CN104252861B (en) Video speech conversion method, device and server
CN107832434A (en) Method and apparatus based on interactive voice generation multimedia play list
CN110473546B (en) Media file recommendation method and device
CN107040452B (en) Information processing method and device and computer readable storage medium
CN110189754A (en) Voice interactive method, device, electronic equipment and storage medium
CN107463700B (en) Method, device and equipment for acquiring information
CN110602516A (en) Information interaction method and device based on live video and electronic equipment
WO2007043679A1 (en) Information processing device, and program
CN109582825B (en) Method and apparatus for generating information
CN104252464A (en) Information processing method and information processing device
CN108140030A (en) Conversational system, terminal, the method for control dialogue and the program for making computer performance conversational system function
CN107145509B (en) Information searching method and equipment thereof
CN108710653B (en) On-demand method, device and system for reading book
JP2019091416A5 (en)
CN109460548B (en) Intelligent robot-oriented story data processing method and system
CN108108391A (en) For the processing method and device of the information of data visualization
CN108153875B (en) Corpus processing method and device, intelligent sound box and storage medium
CN109065018B (en) Intelligent robot-oriented story data processing method and system
WO2019228140A1 (en) Instruction execution method and apparatus, storage medium, and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant