WO2022126734A1 - Voice interaction processing method, apparatus, electronic device and storage medium - Google Patents

Voice interaction processing method, apparatus, electronic device and storage medium

Info

Publication number
WO2022126734A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
reply
command
evaluation
duration
Prior art date
Application number
PCT/CN2020/140213
Other languages
English (en)
French (fr)
Inventor
樊思远
Original Assignee
美的集团股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 美的集团股份有限公司
Publication of WO2022126734A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/635 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/63 Querying
    • G06F16/638 Presentation of query results
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of intelligent processing, and in particular, to a voice interaction processing method, apparatus, electronic device and storage medium.
  • Voice User Interface refers to the transmission of information between humans and devices through natural speech.
  • many home appliances represented by smart speakers are equipped with voice interaction modules.
  • the voice interaction module can recognize the user's command voice and respond to it in the form of voice, providing users with a more anthropomorphic human-machine interaction mode.
  • the embodiments of the present application provide a voice interaction processing method, device, electronic device and storage medium, which are used to solve the problem that the reply voice in the automatic voice interaction process cannot match user requirements.
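  • in a first aspect, an embodiment of the present application provides a voice interaction processing method, including:
  • receiving, during playback of a reply voice, the user's evaluation voice for the reply voice;
  • the reply voice is the voice played in response to the command voice sent by the user;
  • the command voice is the voice that issues a command;
  • determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.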
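  • in a second aspect, an embodiment of the present application provides a voice interaction processing method, including:
  • receiving, within a time window after playback of a reply voice ends, the user's evaluation voice for the reply voice;
  • the reply voice is the voice played in response to the command voice sent by the user;
  • the command voice is the voice that issues a command;
  • determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.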
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • adjusting the frequency of use of the reply voice in subsequent responses to the command voice.
  • the reply voice is the reply voice determined by querying the dialogue database based on the command voice sent by the user;
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • querying the evaluation database according to the evaluation voice, determining the feedback information contained in the evaluation voice, and determining the dialogue strategy corresponding to the command voice according to the feedback information;
  • the evaluation database and the dialogue database are set independently, the evaluation database is set on the smart device side, and the content of the evaluation database is less than that of the dialogue database.
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • reducing the playback duration and/or redundancy of the reply voice in response to the command voice.
  • reducing the playback duration and/or redundancy of the reply voice in response to the command voice specifically includes:
  • adjusting the playback duration of the reply voice corresponding to the command voice according to the first duration specifically including one or more of the following:
  • the playback duration of the reply voices corresponding to all or part of the command voices in the same command voice group is controlled to be less than or equal to the first duration.
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • reducing the frequency of use of the reply voice as a response to the command voice, or substituting a new reply voice as the response to the command voice, specifically includes:
  • reducing the frequency of use of the reply voice means that, when responding to the command voice in a subsequent time period, the probability of selecting the reply voice from the reply voice library corresponding to the command voice as the response decreases;
  • reducing the frequency of use of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice means that, when subsequently responding to the command voice, the probability of selecting such a reply voice from the reply voice library corresponding to the command voice as the response is reduced;
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • if the evaluation voice contains keywords with a positive connotation and the keywords relate to maintaining or increasing the playback duration, maintaining or increasing the playback duration and/or redundancy of the reply voice in response to the command voice.
  • maintaining or increasing the playback duration and/or redundancy of the reply voice in response to the command voice specifically includes any one or more of the following:
  • the redundancy of the reply voice refers to the ratio of the voice content in the reply voice that is not necessary for replying to the command voice to the total voice content of the reply voice;
  • a reply voice whose difference in playback duration and/or redundancy from the reply voice is within a preset range is selected for playback.
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • if the evaluation voice contains keywords with a positive connotation and the keywords relate to maintaining or increasing the frequency of use, maintaining or increasing the frequency of use of the reply voice as a response to the command voice.
  • maintaining or increasing the frequency of use of the reply voice as a response to the command voice specifically includes one or more of the following:
  • increasing the frequency of use of the reply voice refers to an increase in the probability of selecting the reply voice from the reply voice library as a response when responding to the command voice in a subsequent time period;
  • increasing the frequency of use of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice means that, when subsequently responding to the command voice, the probability of selecting such a reply voice from the reply voice library corresponding to the command voice as the response increases.
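The frequency adjustments described above can be pictured as reweighting a weighted random choice over the reply voice library. The following is a minimal sketch under stated assumptions: the `ReplyLibrary` class, the initial weight of 1.0, and the scaling factor are hypothetical and do not come from the application.

```python
import random


class ReplyLibrary:
    """Hypothetical reply voice library for a single command voice."""

    def __init__(self, replies):
        # Assumption: every reply starts with the same selection weight.
        self.weights = {reply: 1.0 for reply in replies}

    def scale(self, reply, factor):
        # factor < 1 lowers the frequency of use, factor > 1 raises it.
        self.weights[reply] *= factor

    def probability(self, reply):
        # Probability that this reply is selected as the response.
        return self.weights[reply] / sum(self.weights.values())

    def choose(self, rng=random):
        replies = list(self.weights)
        weights = [self.weights[r] for r in replies]
        return rng.choices(replies, weights=weights, k=1)[0]
```

With two replies, halving the weight of one of them (for instance after a negative evaluation) lowers its selection probability from 1/2 to 1/3 while still leaving it available.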
  • the evaluation voice containing keywords with a negative connotation specifically includes one or more of the following:
  • the evaluation voice carries first information, and the first information refers to information that matches the comment information in the first database; wherein, the first database stores negative comment information;
  • the evaluation voice carries second information, and the second information refers to information having an opposite meaning to the information contained in the reply voice;
  • the loudness corresponding to the evaluation speech is greater than or equal to the first loudness.
  • the evaluation voice containing keywords with a positive connotation specifically includes one or more of the following:
  • the evaluation voice carries third information, and the third information refers to information that matches the comment information in the second database; wherein, the second database stores positive comment information;
  • the evaluation voice carries fourth information, and the fourth information refers to information having the same or similar meaning as the information contained in the reply voice;
  • the voice interaction processing method also includes:
  • a dialogue strategy corresponding to the command voice is determined according to the evaluation voice.
  • before determining the dialogue strategy corresponding to the command voice according to the evaluation voice, the method also includes:
  • determining whether the evaluation voice is a valid evaluation voice specifically includes:
  • determining whether the evaluation voice does not contain a wake-up word, and/or whether the duration of the evaluation voice is less than the first duration, and/or whether the loudness difference between the evaluation voice and the command voice or the reply voice is not greater than the first difference; if so, the evaluation voice is determined to be a valid evaluation voice.
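The validity check above can be sketched as a simple predicate. This is an illustration only: the `Utterance` type, the wake-up word, and the threshold values are hypothetical, and the sketch requires all three conditions at once, whereas the text allows them to be combined with "and/or".

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str           # recognized text of the voice
    duration_s: float   # duration in seconds
    loudness_db: float  # average loudness in decibels


def is_valid_evaluation(evaluation, reference,
                        wake_word="hello speaker",   # hypothetical wake-up word
                        first_duration_s=5.0,        # hypothetical "first duration"
                        first_difference_db=10.0):   # hypothetical "first difference"
    """Return True if the utterance looks like an evaluation of the reply.

    `reference` is the command voice or the reply voice that the loudness
    of the evaluation voice is compared against.
    """
    no_wake_word = wake_word.lower() not in evaluation.text.lower()
    short_enough = evaluation.duration_s < first_duration_s
    similar_loudness = (
        abs(evaluation.loudness_db - reference.loudness_db) <= first_difference_db
    )
    return no_wake_word and short_enough and similar_loudness
```

An utterance that contains the wake-up word is treated as the start of a new command rather than as feedback on the reply that was just played.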
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • the length of the command voice is determined, and the playback duration of the reply voice is adjusted according to the length of the command voice, or the redundancy of the reply voice is adjusted according to the length of the command voice.
  • adjusting the playback duration of the reply voice according to the length of the command voice including:
  • part of the content is intercepted in the unplayed part of the reply voice to continue playing, so that the adjusted total playback duration of the reply voice matches the length of the command voice;
  • the playback speed of the unplayed part of the reply voice is increased according to the length of the command voice, so that the adjusted total playing time of the reply voice matches the length of the command voice.
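For the speed-based variant, the required speed follows from simple arithmetic: if the reply voice lasts D seconds at normal speed, p seconds have already been played, and the target total duration is T seconds, then the unplayed D - p seconds must fit into the remaining T - p seconds, giving a speed factor of (D - p) / (T - p). The sketch below assumes that the target total duration has already been derived from the length of the command voice; how the two are "matched" is not specified in the text, and the function name is hypothetical.

```python
def speedup_for_target(total_s, played_s, target_total_s):
    """Speed factor for the unplayed part of a reply voice.

    total_s:        duration of the whole reply at normal speed
    played_s:       portion already played at normal speed
    target_total_s: desired total playback duration (assumed given)
    """
    remaining_content = total_s - played_s
    remaining_time = target_total_s - played_s
    if remaining_content <= 0:
        return 1.0  # nothing left to play
    if remaining_time <= 0:
        # The target has already been exceeded; speeding up cannot help.
        raise ValueError("target duration already exceeded; truncate instead")
    # Never slow playback down below normal speed.
    return max(1.0, remaining_content / remaining_time)
```

For a 12 s reply with 4 s already played and an 8 s target, the remaining 8 s of content must be played in 4 s, so the factor is 2.0.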
  • adjusting the redundancy of the reply voice according to the length of the command voice including:
  • the redundancy of the reply voice is determined according to the length range interval into which the length of the command voice falls.
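Mapping a command length to a redundancy through range intervals amounts to a table lookup. The interval boundaries and redundancy values below are invented for illustration; the application does not give concrete numbers.

```python
import bisect

# Hypothetical interval boundaries for the command voice length, in seconds:
# [0, 2), [2, 5), [5, infinity)
LENGTH_BOUNDS_S = [2.0, 5.0]
# Hypothetical target redundancy for each interval (share of non-essential content).
TARGET_REDUNDANCY = [0.0, 0.3, 0.6]


def redundancy_for_command(length_s):
    """Return the target redundancy for a command voice of the given length."""
    index = bisect.bisect_right(LENGTH_BOUNDS_S, length_s)
    return TARGET_REDUNDANCY[index]
```

In this sketch a short command such as a one-second "what time is it" maps to a reply with no extra content, while longer commands allow progressively more elaborate replies.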
  • determining the dialogue strategy corresponding to the command voice according to the evaluation voice specifically includes:
  • adjusting the playback duration and/or redundancy of the reply voice in any of the following manners:
  • the time window coincides with at least a part of the playback process of the reply voice, and at least a part of the evaluation voice falls within an interval of the time window that coincides with the playback process of the reply voice.
  • an embodiment of the present application further provides a voice interaction processing device, including:
  • a receiving module for receiving the user's evaluation voice for the reply voice in the playback process of the reply voice; the reply voice is the voice in response to the command voice sent by the user; the command voice is the voice of the command;
  • a processing module configured to determine a dialogue strategy corresponding to the command voice according to the evaluation voice.
  • an embodiment of the present application further provides a voice interaction processing device, including:
  • a receiving module for receiving the evaluation voice of the user for the reply voice in the time window after playing;
  • the reply voice is the voice in response to the command voice sent by the user;
  • the command voice is the voice of the command;
  • a processing module configured to determine a dialogue strategy corresponding to the command voice according to the evaluation voice.
  • an embodiment of the present application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the steps of the voice interaction processing method described in the first aspect or the second aspect.
  • an embodiment of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the voice interaction processing method according to the first aspect or the second aspect are implemented.
  • the voice interaction processing method, device, electronic device and storage medium provided by the present application adjust the dialogue strategy corresponding to the command voice according to the evaluation voice received during playback of the reply voice, or within the time window after its playback, so that the dialogue strategy better matches the user's needs and a better voice interaction service experience can be provided for the user.
  • FIG. 1 is a flowchart of a voice interaction processing method provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a voice interaction process provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of an implementation process interaction of a voice interaction processing method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a module implementation corresponding to a voice interaction processing method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a voice interaction process with evaluation voice provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another implementation process interaction of the voice interaction processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another module implementation corresponding to the voice interaction processing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a voice interaction processing apparatus provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a smart device provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a server provided by an embodiment of the present application.
  • the voice interaction module can recognize the user's command voice and respond to it in the form of voice, providing users with a more anthropomorphic human-computer interaction.
  • the speech design of an excellent voice interaction system must balance the rational and the emotional: it should provide useful help to customers and also be engaging. Therefore, to reduce the so-called "machine feeling" when constructing phrases for a "Skill" of the voice interaction device, designers often provide a variety of reply expressions with similar meaning for the same instruction. However, not all users are satisfied with the discourse strategy set by the designer.
  • the present application provides a voice interaction processing method, device, electronic device and storage medium, which can provide users with targeted reply voices according to user requirements (or information or signals presented by users).
  • the voice interaction processing method, apparatus, electronic device, and storage medium provided by the present application will be described in detail below through specific embodiments.
  • the term "and/or" in this embodiment of the present application describes the association between associated objects and indicates that three relationships are possible; for example, A and/or B may indicate that A exists alone, that A and B exist simultaneously, or that B exists alone.
  • the character “/” generally indicates that the associated objects are an "or” relationship.
  • the term “plurality” in the embodiments of the present application refers to two or more than two, and other quantifiers are similar.
  • FIG. 1 shows a flowchart of a voice interaction processing method provided by an embodiment of the present application.
  • the voice interaction processing method provided by an embodiment of the present application includes:
  • Step 101: Receive the user's evaluation voice for the reply voice during the playback of the reply voice or within the time window after the playback ends; the reply voice is the voice in response to the command voice issued by the user; the command voice is the voice that issues a command;
  • Step 102: Determine a dialogue strategy corresponding to the command voice according to the evaluation voice.
  • the user needs to perform intelligent voice interaction in some scenarios when using a smart device, such as a smart speaker.
  • the smart speaker will respond to the command voice and reply; assume that the reply voice is "it's 5 o'clock in the afternoon, it's sunset time, and the sunset today is beautiful".
  • the command voice is the voice that instructs the smart device to perform the task
  • the reply voice is the voice that responds to the command voice.
  • smart devices may refer to smart home appliances, such as smart speakers, smart TVs, smart humidifiers, and smart refrigerators, or to smart wearable devices, such as smart watches and smart headphones, or to other smart devices, which is not limited in this embodiment.
  • the user first issues a command voice, and the command voice is used to instruct the smart device to perform the corresponding task, and the task content is determined according to the command voice content.
  • for example, when the command voice is "what time is it", the command voice instructs the smart device to perform the task of querying the current time.
  • a complete voice interaction process mainly goes through the following stages: Automatic Speech Recognition (ASR) → Natural Language Processing (NLP) → Dialogue Management (DM) → Speech Synthesis (Text-To-Speech, TTS). As shown in FIG. 2, the smart device performs a series of processing steps after receiving the command voice: it converts the command voice into command text through automatic speech recognition (ASR), performs natural language processing (NLP) on the command text to analyze the user's intent, determines the final reply text through dialogue management (DM), and finally performs speech synthesis (TTS) on the reply text to obtain the reply voice.
  • the conversion of command speech into command text by automatic speech recognition refers to the process of converting speech information into text information by using automatic speech recognition technology.
  • performing natural language processing (NLP) on the command text to analyze the user's intent refers to obtaining the user's intent through NLP analysis of the command text, which specifically includes applying natural-language-based processing to the command text.
  • for example, text features can be extracted by TF-IDF text feature extraction, by a word2vec-based feature extraction model, etc.
  • intent recognition classifies sentences or queries into corresponding intent categories.
  • for example, suppose the voice interaction module on a smart device has only 50 interactive skills. When the user sends a command voice to the device, the smart device needs to assign the user's query to one or several interactive skills through intent recognition and then perform subsequent processing.
  • for intent recognition, a rule matching method based on a domain dictionary can be used, or the user's intent can be determined by an intent classification model.
  • this embodiment does not describe intent recognition in detail; existing intent recognition algorithms in the industry may be referred to.
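As an illustration of rule matching against a domain dictionary, the sketch below counts keyword hits per intent and returns the best-scoring intent. The dictionary contents, the intent names, and the scoring rule are all hypothetical; the application leaves the choice of intent recognition algorithm open.

```python
# Hypothetical domain dictionary: intent name -> keywords that suggest it.
DOMAIN_DICTIONARY = {
    "query_time": ["what time", "the time"],
    "query_weather": ["weather", "rain"],
    "play_music": ["play", "song", "music"],
}


def recognize_intent(command_text):
    """Return the intent with the most keyword hits, or None if nothing matches."""
    text = command_text.lower()
    scores = {
        intent: sum(keyword in text for keyword in keywords)
        for intent, keywords in DOMAIN_DICTIONARY.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Plain substring matching is deliberately naive (for example, "display" would count as a hit for "play"); an intent classification model, the alternative named in the text, avoids this kind of brittleness.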
  • Dialogue Management actually controls the process of human-machine dialogue.
  • Task-driven dialogue management is actually a decision-making process.
  • the next action to be taken is determined according to the current state (such as providing results, asking specific constraints, clarifying or confirming requirements, etc.), so as to most effectively assist users in completing the task of obtaining information or services.
  • the final reply text is determined through dialogue management (DM), and speech synthesis (TTS) is then performed on the reply text to obtain the reply voice.
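The ASR → NLP → DM → TTS chain can be sketched as four composed functions. Every stage below is a placeholder: "audio" is simulated with UTF-8 bytes, and the intent rule and reply table are invented, so this only illustrates the data flow between stages, not real speech processing.

```python
def asr(command_audio: bytes) -> str:
    # Placeholder for automatic speech recognition: audio -> command text.
    return command_audio.decode("utf-8")


def nlp(command_text: str) -> str:
    # Placeholder for natural language processing: command text -> intent.
    return "query_time" if "time" in command_text.lower() else "unknown"


def dialogue_management(intent: str) -> str:
    # Placeholder for dialogue management: intent -> final reply text.
    replies = {"query_time": "It is 3:00 in the morning."}
    return replies.get(intent, "Sorry, I did not understand.")


def tts(reply_text: str) -> bytes:
    # Placeholder for speech synthesis: reply text -> reply voice.
    return reply_text.encode("utf-8")


def handle_command(command_audio: bytes) -> bytes:
    """Run the four stages in order and return the reply voice."""
    return tts(dialogue_management(nlp(asr(command_audio))))
```

Keeping each stage behind its own function makes it easy to see that the dialogue strategy discussed in this application acts in the dialogue management stage, where the reply text is chosen.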
  • for example, when the command voice is "what time is it", the device can directly reply "it is 3:00 in the morning".
  • alternatively, the reply voice may be interspersed with chatty, interesting, or informative remarks.
  • for example, when the command voice is "What time is it", the reply can be "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you, continue to work hard tomorrow!".
  • this embodiment provides a voice interaction processing method in which the user can send the evaluation voice during the playback of the reply voice or within a time window after the reply voice is played, so that the smart device (or a terminal device or a server) determines a dialogue strategy corresponding to the command voice according to the evaluation voice. For example, the frequency of use of the reply voice, or of reply voices related to it, may be adjusted according to the evaluation voice.
  • the playback length or redundancy of the reply voice or the reply voice related to the reply voice may be adjusted according to the evaluation voice.
  • the playback of the reply voice may also be interrupted according to the evaluation voice.
  • the reply voice may be played repeatedly according to the evaluation voice.
  • it may also be to replace a new reply voice according to the evaluation voice, or the like.
  • the evaluation voice refers to the voice in which the user evaluates the reply voice during the playback of the reply voice or within a time window (e.g., 10-60s) after the playback ends.
  • Scheme 1: Receive the user's evaluation voice for the reply voice during the playback of the reply voice; the reply voice is the voice in response to the command voice issued by the user; the command voice is the voice that issues a command; according to the evaluation voice, determine the dialogue strategy corresponding to the command voice.
  • Scheme 2: Receive the user's evaluation voice for the reply voice within the time window after the reply voice playback ends; the reply voice is the voice in response to the command voice issued by the user; the command voice is the voice that issues a command; according to the evaluation voice, determine the dialogue strategy corresponding to the command voice.
  • the evaluation voice can be sent out for the reply voice during the playback of the reply voice, or can be sent out for the reply voice within a time window after the playback ends.
  • the time window refers to a period of time after the playback of the reply voice ends.
  • the time window starts from the moment when playback of the reply voice ends and lasts for a preset time period, for example ending after 5s.
  • the function of the time window is to monitor and receive the evaluation voice sent by the user within this time window. After the time window, evaluation voices sent by the user are no longer monitored or received, which makes evaluation voice reception more targeted and avoids confusing the evaluation voice with the next new command voice.
  • in general, the time window starts from the moment when the reply voice finishes playing; as a special example, however, the time window may coincide with at least a part of the playback process of the reply voice, in which case at least a part of the evaluation voice falls into the interval of the time window that coincides with the playback process of the reply voice.
  • for example, the time window can be 14:02:40-14:03:00, which partially overlaps the playback process of the reply voice; the overlap interval is 14:02:40-14:02:55, and at least a part of the evaluation voice falls into this overlap interval.
  • the advantage of this processing is that it helps ensure that the voice issued by the user is an evaluation voice for the reply voice rather than a new command voice, thereby improving the recognition rate of the evaluation voice.
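The overlap condition in this special example is an interval intersection test. The sketch below expresses times as seconds after 14:02:00; the playback start of 30 s is an assumption (the text only implies that playback ends at 14:02:55), and the function names are hypothetical.

```python
def overlap(a, b):
    """Intersection of two (start, end) intervals, or None if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def falls_in_shared_interval(evaluation, playback, window):
    """True if part of the evaluation voice lies where the time window
    and the playback of the reply voice coincide."""
    shared = overlap(playback, window)
    if shared is None:
        return False
    return overlap(evaluation, shared) is not None


# Seconds after 14:02:00. Playback start is assumed; the rest follows the example.
PLAYBACK = (30, 55)  # reply voice plays 14:02:30-14:02:55
WINDOW = (40, 60)    # time window 14:02:40-14:03:00
```

Here `overlap(PLAYBACK, WINDOW)` gives `(40, 55)`, matching the overlap interval 14:02:40-14:02:55 in the example, and an utterance spoken at 14:02:50-14:02:52 would fall inside it.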
  • since the evaluation database and the dialogue database are set independently, and the content of the evaluation database is less than that of the dialogue database, when the time window overlaps at least a part of the playback process of the reply voice and at least a part of the evaluation voice falls into the overlapping interval, the voice issued by the user can be accurately recognized as an evaluation voice aimed at the reply voice rather than a new command voice. The evaluation database can then be queried in a targeted manner, which effectively improves the recognition rate and recognition efficiency.
  • the evaluation voice may be a positive evaluation voice or a negative evaluation voice.
  • when the user is satisfied with or approves of the current reply voice, or is interested in exploring further, the user tends to give a positive evaluation voice.
  • when the user is not satisfied with the current reply voice or has a clear objection, a negative evaluation voice will be given.
  • the evaluation voice is generally short; for example, a negative evaluation voice may include: bad, dislike, too long, too complicated, disturbed, No, Bad, Stop, etc.
  • for example, if the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you, keep going tomorrow!" and the user doesn't like it, the corresponding evaluation voice may be "not good", "dislike", "too long", "disturbed", "No", "Bad", or "Stop".
  • a positive evaluation voice can generally include: really good, good, very good, like, Yes, Good, Like, etc.
  • for example, if the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you all the time, keep going tomorrow!" and the user likes it, the corresponding evaluation voice may be "Like", "Good", or "Yes".
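A keyword lookup over the example phrases above could be sketched as follows. The two lists stand in for the negative and positive comment databases and contain only the phrases quoted in the text; the word-boundary matching and the negative-first ordering are assumptions made so that "dislike" and "not good" are not mistaken for "like" and "good".

```python
import re

# Stand-ins for the negative and positive comment databases (phrases from the text).
NEGATIVE = ["not good", "bad", "dislike", "too long", "too complicated",
            "disturbed", "no", "stop"]
POSITIVE = ["really good", "very good", "good", "like", "yes"]


def _contains(text, phrase):
    # Match whole words only, so "like" does not match inside "dislike".
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def classify_evaluation(evaluation_text):
    """Return 'negative', 'positive', or 'unknown' for a short evaluation."""
    text = evaluation_text.lower()
    # Check negative phrases first so that "not good" wins over "good".
    if any(_contains(text, phrase) for phrase in NEGATIVE):
        return "negative"
    if any(_contains(text, phrase) for phrase in POSITIVE):
        return "positive"
    return "unknown"
```

This sketch only handles short evaluations of the kind listed above. Negation inside longer sentences, such as "I don't like such a complicated answer", would need proper language understanding rather than keyword matching.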
  • the evaluation voice can also be a longer sentence, which can provide richer feedback information.
  • for example, an evaluation voice could be: "I don't like such a complicated answer, please tell me what time it is".
  • the evaluation voice may also be "please do not bring any redundant information".
  • the evaluation voice can also be "I don't like sports-themed news, please give some hot news about movies", and so on.
  • when the user evaluates the reply voice given in response to the command voice and sends out the evaluation voice, the smart device (or a terminal device or a server) determines the dialogue strategy corresponding to the command voice according to the evaluation voice.
  • the dialogue strategy of the command voice refers to the strategy of responding to the command voice, for example: responding with brief content, responding with rich content, or responding in a different thematic way (such as responding with lively music, with a story, or with a news feed).
  • the user can give feedback through the evaluation voice, so that the smart device (or a terminal device or a server) adjusts the playback duration or redundancy of the reply voice itself, or adjusts the frequency of use of the reply voice, according to the evaluation voice.
  • the currently playing reply voice and/or the next (or subsequent) reply voice may be adjusted according to the evaluation voice.
  • if the evaluation voice is issued after the playback of the reply voice, the next (or subsequent) reply voice can be adjusted according to the evaluation voice.
  • the adjustment of the next (or subsequent) reply voice here may include: adjusting the next (or subsequent) reply voice for the same command voice; adjusting the reply voice for similar command voices subsequently issued by the same user or by different users; adjusting the reply voice for some or all command voices subsequently issued by the same user; or adjusting the reply voice for the same or different command voices issued by the same user or different users within the same time period. This is not limited in this embodiment.
  • adjusting the reply voice according to the evaluation voice may refer to adjusting the playback duration of the reply voice, adjusting the redundancy of the reply voice, or both. It can also refer to substituting a new reply voice, increasing or reducing the frequency of use of the reply voice, or stopping playback of the reply voice, which is not limited in this embodiment.
  • the adjustment of the playback duration or redundancy of the reply voice may be performed in real time on each occasion, or the result of an adjustment may be stored and reused directly afterwards.
  • the content of the reply voice can be shortened, the playback speed of the reply voice can be increased, or both.
  • the user's preferred reply length can also be inferred from how much of the reply voice had been played when the evaluation voice occurred, so that subsequent replies to all or part of the user's command voices use reply voices whose length matches that preference.
  • the voice interaction method provided in this embodiment makes it possible to adjust the reply voice by issuing an evaluation voice during or after the playback of the reply voice, for example by adjusting the reply duration of the current or next reply voice or by replacing the reply voice, so that the duration or content of the reply voice better matches the user's needs and a better voice interaction experience is provided.
  • the evaluation voice may be a positive evaluation voice or a negative evaluation voice.
  • when the evaluation voice is a positive evaluation voice, the current reply voice may be maintained, or optimized in the same or a similar direction, according to its duration, redundancy, or category of extended topic. For example, assuming that the current reply voice has rich content and more extended information (that is, relatively high redundancy), when the evaluation voice for the reply voice is a positive evaluation voice, the current redundancy can be maintained or optimized towards higher redundancy.
  • for another example, assuming that the current reply voice has a relatively long playback duration, when the evaluation voice for the reply voice is a positive evaluation voice, the current playback duration can be maintained or increased.
  • for another example, assuming that the topic of the extended information in the current reply voice is running, when the evaluation voice for the reply voice is a positive evaluation voice, the current running topic can be maintained, or a yoga topic can be added (yoga being an extended topic similar to running).
  • when the evaluation voice is a negative evaluation voice, optimization can be performed in an opposite or different direction according to the duration, redundancy, or category of extended topic of the current reply voice. For example, assuming that the current reply voice has rich content and more extended information (that is, high redundancy), when the evaluation voice for the reply voice is a negative evaluation voice, the redundancy of the reply voice can be reduced. For another example, assuming that the current reply voice has a relatively long playback duration, when the evaluation voice for the reply voice is a negative evaluation voice, the playback duration of the reply voice can be reduced. For another example, assuming that the topic of the extended information in the current reply voice is sports, when the evaluation voice for the reply voice is a negative evaluation voice, the topic of the extended information in the reply voice can be changed to a daily-life topic, and so on.
  • the positive evaluation speech may be speech including positive evaluation words, for example, the positive evaluation words may include: really good, good, very good, like, Yes, Good, Like, and so on.
  • for example, when the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you all the time, keep going tomorrow!", if the user likes the voice, the corresponding evaluation voice may be "Like", "Good", or "Yes".
  • the negative evaluation speech may be speech including negative evaluation words, for example, the negative evaluation words may include: bad, dislike, too long, too complicated, disturbed, No, Bad, Stop, etc.
  • for example, when the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you, keep going tomorrow!", if the user doesn't like the voice, the corresponding evaluation voice may be "not good", "dislike", "too long", "disturbed", "No", "Bad", or "Stop".
  • the positive evaluation voice can also be a voice that restates the reply voice (or part of it); that is, when the user agrees with or likes the reply voice, the user repeats the reply voice (or part of it) to express that feeling.
  • the positive evaluation voice may also be speech whose meaning is the same as or similar to the words in the reply voice; that is, when the user agrees with or likes the reply voice, the user expresses that feeling through words with the same meaning.
  • for example, when the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you all the time, keep going tomorrow!", if the user likes the voice, the corresponding evaluation voice may be "Well, let's work together!", "Work hard together", or "Strive together".
  • the negative evaluation voice may also be speech whose meaning is opposite to the words in the reply voice; that is, when the user does not like the reply voice, the user expresses that dislike through words with the opposite meaning. For example, when the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early, I know you're working hard, I've been blessing you, keep going tomorrow!", if the user doesn't like the voice, the corresponding evaluation voice may be "don't work hard!", "don't want to work hard", "don't want to struggle", and so on.
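  • the keyword-based part of the positive/negative distinction described above could be sketched as follows. This is only a minimal illustration, assuming the evaluation voice has already been transcribed to text; the function name `classify_evaluation` and the word lists are hypothetical (the words are the examples given in the description), and the restatement and opposite-meaning cases are not covered.

```python
import re

# Example words taken from the description; a real evaluation database
# would be larger. Both tuples are illustrative assumptions.
POSITIVE_WORDS = ("really good", "very good", "good", "like", "yes")
NEGATIVE_WORDS = ("not good", "dislike", "too long", "too complicated",
                  "disturbed", "bad", "stop", "no")


def _contains(phrase: str, text: str) -> bool:
    # Whole-word match so that "like" does not fire inside "dislike".
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def classify_evaluation(text: str) -> str:
    """Return 'negative', 'positive', or 'unknown' for a transcribed evaluation."""
    lowered = text.lower()
    # Negative phrases are checked first so "not good" is not read as "good".
    if any(_contains(p, lowered) for p in NEGATIVE_WORDS):
        return "negative"
    if any(_contains(p, lowered) for p in POSITIVE_WORDS):
        return "positive"
    return "unknown"
```

  • checking negative phrases before positive ones is a design choice of this sketch, not something the description prescribes.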
  • the length of subsequent reply voices to the same command voice may be shortened, or the length of subsequent reply voices to all or part of the command voices issued by the user may be shortened.
  • when the evaluation voice carries duration condition information such as "I hope the length of the reply voice is controlled within 5s", the duration condition information can be extracted, and the subsequent reply voices for the same command voice, or for all or part of the command voices issued by the user, can be shortened according to it.
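  • extracting the duration condition from an evaluation voice could look roughly like the sketch below. It assumes the evaluation voice has been transcribed to English text and that the condition is phrased as "within N s/sec/seconds"; the pattern and the name `extract_max_duration_seconds` are hypothetical and would need to be adapted to the actual language and phrasing.

```python
import re
from typing import Optional

# Hypothetical pattern: matches "within 5s", "within 5 sec", "within 5 seconds".
_DURATION_PATTERN = re.compile(
    r"within\s+(\d+(?:\.\d+)?)\s*(?:s|sec|seconds?)\b", re.IGNORECASE
)


def extract_max_duration_seconds(evaluation_text: str) -> Optional[float]:
    """Return the requested maximum reply duration in seconds, if one is stated."""
    match = _DURATION_PATTERN.search(evaluation_text)
    return float(match.group(1)) if match else None
```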
  • when an evaluation voice such as "I don't like this topic" is received, an adjustment can be made according to the evaluation voice, for example by replacing the reply voice with a new one.
  • for example, when the reply voice is "It's already 3:00 in the morning, it's getting late, go to bed early, I know you're working hard, I've been blessing you all the time, continue to work hard tomorrow!" and the evaluation voice is "I don't like this theme", the reply voice can be replaced with a new one, such as: "It's 3:00 in the morning, let me tell you a bedtime story".
  • the evaluation voice can also carry prompt information (for example, that the user likes the theme of football). When the reply voice is replaced, a reply voice that matches the football theme can be selected according to the prompt information carried in the evaluation voice, for example: "It's 3:00 in the morning, and there is a final between Barcelona and Real Madrid at 7:00 in the morning, please remember to pay attention!".
  • the voice interaction processing method provided by the present application adjusts the dialogue strategy of the corresponding command voice according to the evaluation voice received during the playback of the reply voice that responds to the command voice, or within the time window after that playback, so that the dialogue strategy corresponding to the command voice better matches the needs of the user and a better voice interaction experience can be provided.
  • a dialogue strategy corresponding to the command voice is determined, which specifically includes:
  • a dialogue strategy corresponding to the command voice is determined.
  • the feedback information carried in the evaluation voice can be determined first, and then the corresponding dialogue strategy can be determined according to the feedback information. For example, when it is determined that the feedback information carried in the evaluation voice is "too high redundancy", it can be determined that the dialogue strategy corresponding to the command voice is: responding to the command voice in a short and effective manner. For another example, when it is determined that the feedback information carried in the evaluation voice is "I want to add some chatting content", it can be determined that the dialogue strategy corresponding to the command voice is: responding to the command voice in a content-rich manner.
  • this embodiment determines the dialogue strategy corresponding to the command voice according to the feedback information carried by the evaluation voice, so that the adjusted dialogue strategy can better match the user's needs, thereby improving the user's experience of using the smart device.
  • a dialogue strategy corresponding to the command voice is determined, which specifically includes:
  • the frequency of the reply voice in subsequent responses to the command voice is adjusted.
  • the frequency with which the reply voice appears in subsequent responses to the command voice is increased or decreased.
  • the probability (that is, the frequency) with which the reply voice is used as a response to the command voice is subsequently increased.
  • the probability with which the reply voice is used as a response to the command voice is subsequently reduced (that is, the frequency is reduced or the reply voice is no longer used).
  • increasing the frequency of use of the reply voice means increasing the probability that the reply voice is selected as the response from the reply voice library corresponding to the command voice when responding to the command voice in a subsequent time period.
  • reducing the frequency of use of the reply voice means reducing the probability that the reply voice is selected as the response from the reply voice library corresponding to the command voice when responding to the command voice in a subsequent time period.
  • the frequency with which the reply voice appears in subsequent responses to the command voice can be adjusted directly according to the evaluation voice: if the user likes the reply voice, it can appear more often, and if the user does not like it, it can appear less often or no longer appear. This better matches the user's needs and improves the user experience.
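  • one way to picture the frequency adjustment is a reply voice library with a selection weight per reply voice, as in the sketch below. The class `ReplyLibrary`, the multiplicative `factor`, and the use of weighted random selection are all assumptions made for illustration; the description only states that the selection probability is raised or lowered.

```python
import random


class ReplyLibrary:
    """Reply voice library for one command voice, with a weight per reply."""

    def __init__(self, replies):
        # Every reply starts with the same weight (hypothetical default).
        self.weights = {reply: 1.0 for reply in replies}

    def apply_evaluation(self, reply, evaluation, factor=2.0):
        # Positive feedback raises the weight, negative feedback lowers it.
        if evaluation == "positive":
            self.weights[reply] *= factor
        elif evaluation == "negative":
            self.weights[reply] /= factor

    def drop(self, reply):
        # "No longer appears": the reply can never be selected again.
        self.weights[reply] = 0.0

    def selection_probability(self, reply):
        total = sum(self.weights.values())
        return self.weights[reply] / total if total else 0.0

    def choose(self, rng=random):
        # Note: random.choices raises ValueError if every weight is zero.
        replies = list(self.weights)
        return rng.choices(replies, weights=[self.weights[r] for r in replies], k=1)[0]
```

  • with two replies of equal weight, a single negative evaluation of one of them lowers its selection probability from 1/2 to 1/3 under the default factor.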
  • the reply voice is the reply voice determined by querying the dialogue database based on the command voice issued by the user;
  • according to the evaluation voice, determining the dialogue strategy corresponding to the command voice specifically includes:
  • according to the evaluation voice, querying the evaluation database, determining the feedback information contained in the evaluation voice, and determining the dialogue strategy corresponding to the command voice according to the feedback information;
  • the evaluation database and the dialogue database are set independently, the evaluation database is set on the smart device side, and the content of the evaluation database is less than that of the dialogue database.
  • the evaluation database and the dialogue database are set independently, so that the dialogue database for analyzing the command voice and the evaluation database for analyzing the evaluation voice do not interfere with each other, and the content of each database can be set in a more targeted way, which effectively improves the analysis efficiency and accuracy of each.
  • the smart device (such as a smart speaker) is preset to receive and analyze the evaluation voice during the playback of the reply voice or within a time window after the playback ends, so the energy consumption of the smart device can be effectively reduced. At the same time, since the smart device uses a dedicated database for analyzing the evaluation voice, the processing efficiency can be effectively improved and more accurate analysis results can be obtained.
  • the database for analyzing the evaluation voice is located on the smart device side. During the playback of the reply voice, or within the time window after the playback ends, the smart device analyzes the received voice against this database to determine whether the feedback information carried by the evaluation voice is negative or positive. The analysis can thus be completed locally on the smart device (the interaction with a server or terminal is omitted), which reduces the delay, so that the analysis result is obtained quickly and can then be used to adjust the smart device.
  • the current reply voice can be interrupted in time, or the redundancy or playback duration of the current reply voice can be adjusted in time (for the specific adjustment method, please refer to the foregoing embodiments), thereby improving the user experience.
  • a dialogue strategy corresponding to the command voice is determined, which specifically includes:
  • when the feedback information carried by the evaluation voice is negative feedback information, the first dialogue strategy adjustment direction corresponding to the command voice is determined, and the dialogue strategy corresponding to the command voice is adjusted according to the first dialogue strategy adjustment direction.
  • the first dialogue strategy adjustment direction refers to the direction in which the reply voice responding to the command voice is adjusted, according to the negative feedback information carried by the evaluation voice, in order to improve the user experience. For example, if it is determined that the evaluation voice contains keywords with a negative connotation and the keywords are related to reducing the playback duration, the first dialogue strategy adjustment direction is determined to be reducing the playback duration and/or redundancy of the reply voice that responds to the command voice. If it is determined that the evaluation voice contains keywords with a negative connotation and the keywords are related to user preferences, the frequency of using the reply voice as a response to the command voice is reduced, or a new reply voice is substituted as the response to the command voice. A negative connotation here refers to negative information or meaning such as dissatisfaction, dislike, or objection.
  • the first dialogue strategy adjustment direction may also be determined according to the first keyword carried in the feedback information. For example, according to the first keyword, it can be determined whether the first dialogue strategy adjustment direction is to shorten the playback duration (reduce redundancy), to reduce the frequency of use of the relevant reply voice, or another adjustment direction, so that the user's needs can be matched more accurately.
  • the first dialogue strategy adjustment direction corresponding to the command voice is determined, which specifically includes:
  • if the first keyword is a keyword related to reducing the playback duration, it is determined that the first dialogue strategy adjustment direction corresponding to the command voice is to shorten the playback duration or reduce the redundancy.
  • when the first keyword carried in the negative feedback information is a keyword related to reducing the playback duration, the first dialogue strategy adjustment direction corresponding to the command voice is to shorten the playback duration or reduce the redundancy, so as to match the user's needs.
  • keywords related to reducing playback duration may include keywords related to reducing redundancy.
  • keywords related to reducing the playing time may include: “playing time is too long”, “reply content is too long”, “reply content is redundant”, “too long”, “redundant” and so on.
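  • the mapping from the first keyword to the first dialogue strategy adjustment direction could be sketched as below. The duration-related keywords are the examples listed above; the preference-related keywords, the `Direction` enum, and the function name are hypothetical additions for illustration only.

```python
from enum import Enum


class Direction(Enum):
    SHORTEN_OR_REDUCE_REDUNDANCY = "shorten playback / reduce redundancy"
    REDUCE_FREQUENCY_OR_REPLACE = "reduce frequency of use / replace reply"
    UNDETERMINED = "undetermined"


# Keywords related to reducing the playback duration (from the description).
DURATION_KEYWORDS = (
    "playing time is too long",
    "reply content is too long",
    "reply content is redundant",
    "too long",
    "redundant",
)
# Keywords related to user preference (assumed list, not from the source).
PREFERENCE_KEYWORDS = ("dislike", "don't like", "not good")


def first_adjustment_direction(negative_feedback: str) -> Direction:
    """Pick an adjustment direction from the keyword found in negative feedback."""
    lowered = negative_feedback.lower()
    if any(k in lowered for k in DURATION_KEYWORDS):
        return Direction.SHORTEN_OR_REDUCE_REDUNDANCY
    if any(k in lowered for k in PREFERENCE_KEYWORDS):
        return Direction.REDUCE_FREQUENCY_OR_REPLACE
    return Direction.UNDETERMINED
```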
  • Dialogue strategies can include the following processing methods:
  • one processing method is to end the reply voice according to the evaluation voice; that is, the part of the reply voice that has not yet been played when the evaluation voice is received is not played, and the reply voice ends, so that the user is no longer troubled by a long or disliked reply voice. In this way, the playback of the reply voice stops as soon as the user issues the evaluation voice.
  • ending the reply voice here may refer to ending the playback of the reply voice completely, or to suspending the playback temporarily and resuming it after an instruction to resume playback is received, which is not limited in this embodiment.
  • the redundancy of the reply voice refers to the ratio of the voice content in the reply voice that is not necessary for replying to the command voice to the total voice content of the reply voice.
  • when a negative evaluation voice sent by a user is received, it means that the user does not like the reply voice or thinks the reply voice is too long. In this case, the playback duration and/or redundancy of the reply voice can be adjusted: for example, the playback duration of the reply voice can be shortened, the redundancy of the reply voice can be reduced, or both can be done at the same time.
  • the playback duration of the reply voice can be adjusted, for example, the playback duration can be adjusted from 15s to 5s. It is understandable that there are various ways to adjust the playback duration, for example, by increasing the playback speed, by removing part of the reply voice, or by both.
  • the playback speed of the remaining unplayed part can be accelerated, or part of the content can be intercepted in the unplayed part to continue playing.
  • the playback speed of the entire reply voice can be accelerated, or part of the content of the entire reply voice can be intercepted for continuous playback.
  • for example, take the reply voice "It's 11 a.m., you are tired from work, remember to add more water, eat more fruit, and do some stretching, it is good for your health", whose total playback duration is 15s. Assuming that the evaluation voice is received at 3s (when playback has reached "It's 11 a.m., you are tired from work"), the total playback duration can be adjusted to 8s or 6s (or another value), or part of the unplayed content, such as "remember to add more water, eat more fruit", can be extracted and played. It is understandable that the extracted content can be chosen at random or in chronological order.
  • for example, segments can be extracted at random, such as "eat more fruit, doing stretching exercises is good for health", or in chronological order, such as "remember to add more water and eat more fruit".
  • the length of the specific interception can be adjusted according to the needs.
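  • the two ways of shortening the unplayed part (speeding it up, or extracting part of it in chronological order) could be sketched as follows. The per-segment durations, the segment split, and both function names are hypothetical; only the 15s total, the 3s interruption point, and the 8s target come from the example above.

```python
from typing import List, Tuple


def speedup_factor(total_s: float, played_s: float, target_total_s: float) -> float:
    """Playback-rate multiplier that fits the unplayed part into the target."""
    remaining = total_s - played_s
    budget = target_total_s - played_s
    if remaining <= 0:
        return 1.0  # nothing left to play
    if budget <= 0:
        raise ValueError("target duration has already been exceeded")
    return max(1.0, remaining / budget)


def take_segments_in_order(segments: List[Tuple[str, float]], budget_s: float) -> List[str]:
    """Keep unplayed segments in chronological order until the budget is used."""
    chosen, used = [], 0.0
    for text, duration_s in segments:
        if used + duration_s > budget_s:
            break
        chosen.append(text)
        used += duration_s
    return chosen


# Unplayed part of the example reply, with assumed per-segment durations.
UNPLAYED = [
    ("remember to add more water", 3.0),
    ("eat more fruit", 2.0),
    ("do some stretching", 3.0),
    ("it is good for your health", 4.0),
]
```

  • for the example above, fitting the remaining 12s into the 5s left of an 8s target requires a 2.4x playback rate, which suggests why extracting part of the content may be the more listenable option for large reductions.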
  • the redundancy of the reply voice refers to the ratio of the voice content in the reply voice that is not necessary for replying to the command voice to the total voice content of the reply voice. Here, the voice content necessary for replying to the command voice can be understood as the content directly related to the command voice, and the voice content not necessary for replying to the command voice can be understood as content that is not directly related to the command voice but is actively recommended, such as friendly reminders, music sharing, one-liners, advertising, etc.
  • the content of the reply voice may vary in length and redundancy: some reply voices contain only the content directly related to the command voice, and some further contain content actively recommended by the designer, such as friendly reminders, one-liners, and even advertisements.
  • some user groups pursue a more human feel and hope that the entire voice interaction will be more natural, vivid, and varied, while other user groups pursue simplicity and clarity and do not want to receive redundant information that has nothing to do with the command voice. Therefore, after the evaluation voice sent by the user is received, the redundancy of the reply voice can be reduced to match the needs of the user.
  • since the redundancy of the reply voice refers to the ratio of the voice content in the reply voice that is not necessary for replying to the command voice to the total voice content of the reply voice, reducing the redundancy of the reply voice in fact means reducing the voice content in the reply voice that is not necessary for replying to the command voice.
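  • as a rough illustration of the redundancy ratio, the sketch below measures content by word count, which is an assumption: the description does not say whether content is measured in words, characters, or seconds. The `Segment` type and both function names are hypothetical.

```python
from typing import List, Tuple

# (text, is_necessary): True if the text directly answers the command voice.
Segment = Tuple[str, bool]


def redundancy(segments: List[Segment]) -> float:
    """Share of the reply (by word count) that is not needed to answer the command."""
    total = sum(len(text.split()) for text, _ in segments)
    if total == 0:
        return 0.0
    extra = sum(len(text.split()) for text, needed in segments if not needed)
    return extra / total


def strip_redundant(segments: List[Segment]) -> List[Segment]:
    """Reduce redundancy by keeping only the content that answers the command."""
    return [segment for segment in segments if segment[1]]
```

  • for a reply made of a 3-word direct answer followed by a 4-word reminder, this gives a redundancy of 4/7; after `strip_redundant` it is 0.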
  • this processing method is similar to the above-mentioned processing method of "adjusting the playback duration and/or redundancy of the reply voice". The main difference is that this processing method is to adjust the word count and/or redundancy of the reply text corresponding to the reply voice.
  • the playback duration and/or redundancy of the reply voice is adjusted by adjusting the word count and/or redundancy of the reply text corresponding to the reply voice. Because the two methods are substantially similar, no further examples are given here; for specific examples, please refer to the description of the above embodiments.
  • the first user is the user who issued the command voice.
  • when a negative evaluation voice sent by the first user is received during the playback of the reply voice, it means that the first user may think the reply voice is too long; in other words, it can be inferred that the first user does not want to receive redundant information irrelevant to the command voice and prefers short, effective reply voices. Therefore, in this case, to better suit the user's needs, the reply voices for all or part of the command voices issued by the first user can be adjusted to a lower playback duration and/or redundancy, so as to meet the user's interaction requirements.
  • reducing the playback duration and/or redundancy of the reply voice corresponding to all or part of the command voice issued by the first user may include any one or more of the following:
  • a reply voice whose playback duration is less than the preset duration threshold and/or whose redundancy is less than the preset redundancy threshold is selected from the reply voice library corresponding to the command voice.
  • the reply voice can be controlled to stop playing when its playback duration is less than or equal to a predetermined threshold.
  • the playback speed of the reply voice can also be controlled, so that the playback time of the reply voice is shortened.
  • part of the content of the reply voice can also be intercepted and played, so that the playing time of the reply voice is shortened.
  • a reply voice is selected from the reply voice library corresponding to the command voice, and the redundancy of the reply voice is adjusted; for example, some or all of the content in the reply voice that is not directly related to the command voice is removed, thereby reducing redundancy.
  • the corresponding reply voice is adjusted for all or part of the command voices issued by the first user, so that the playback duration of the reply voice is less than the preset duration threshold and/or its redundancy is less than the preset redundancy threshold, and the voice interaction process better matches the user's requirements for the duration and/or redundancy of the reply voice.
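  • selecting reply voices below the preset duration and/or redundancy thresholds could be sketched as follows, assuming each reply voice in the library is annotated with its playback duration and redundancy. The `Reply` dataclass, the function name, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Reply:
    text: str
    duration_s: float   # annotated playback duration
    redundancy: float   # annotated redundancy, between 0.0 and 1.0


def select_concise_replies(
    library: List[Reply],
    max_duration_s: Optional[float] = None,
    max_redundancy: Optional[float] = None,
) -> List[Reply]:
    """Keep replies under the given thresholds; a threshold of None is ignored."""
    return [
        reply
        for reply in library
        if (max_duration_s is None or reply.duration_s < max_duration_s)
        and (max_redundancy is None or reply.redundancy < max_redundancy)
    ]
```

  • passing only one of the two thresholds covers the "and/or" wording: the caller decides whether duration, redundancy, or both are constrained.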
  • the above processing method describes how the reply voice is adjusted for the same command voice; for example, for the command voice "what time is it", it determines how the reply voice is adjusted when the command voice "what time is it" appears again in the future. This processing method, by contrast, is aimed at the first user: the corresponding reply voice is adjusted for all or part of the command voices issued by the first user, so that the voice interaction process better matches the user's requirements for reply duration and/or redundancy. If the reply voices corresponding to some command voices already meet the first user's requirements for duration and/or redundancy, no adjustment is required.
  • This processing method is similar to the above processing method. The main difference is that it emphasizes the word count and/or redundancy of the reply text; that is, it adjusts the duration and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required. For example, part of the text content can be selected from the reply text according to the word count condition, and the selection method can be sequential or random. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, detailed description is omitted here.
  • the emphasis is on adjusting the playback duration and/or redundancy of the reply voices corresponding to all or part of the command voices in the same command voice group.
  • the command voice groups can be divided in various ways, for example according to the topic of the command, according to the length and/or complexity of the command voice, or according to similarity, etc.; the specific division method is not limited.
  • the command voice groups may be divided by command topic, for example into one or more of life commands, work commands, and study commands, giving a life command voice group, a work command voice group, and a study command voice group. For example, command voices such as "what time is it", "today's weather", "tomorrow's weather", "traffic conditions", "restricted number", and "supermarket discount" belong to the life command voice group; command voices such as "the meaning of carving a boat and seeking a sword", "5G mobile phone", and "the origin of the log function" belong to the study command voice group; and command voices such as "how to arrange time reasonably", "precautions for business trips", "how to improve work efficiency", and "what are the artificial intelligence algorithms" belong to the work command voice group.
  • this processing method adjusts the playback duration and/or redundancy of the reply voices corresponding to all or part of the command voices in the same command voice group, so that when the user issues other command voices in the same group, reply voices with a lower playback duration and/or redundancy are also obtained. The user therefore does not have to send negative voice evaluations repeatedly for the reply voices of different command voices in the same command voice group, which improves the user experience.
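  • propagating one negative evaluation to the whole command voice group could be sketched as below. The lookup table uses the example commands from the description; the `GroupPreferences` class and its method names are hypothetical, and a real system would need a classifier rather than a fixed table.

```python
# Example topic-based grouping taken from the description (illustrative only).
COMMAND_GROUPS = {
    "what time is it": "life",
    "today's weather": "life",
    "traffic conditions": "life",
    "the origin of the log function": "study",
    "how to improve work efficiency": "work",
}


class GroupPreferences:
    """Remembers which command voice groups should get shorter replies."""

    def __init__(self):
        self._concise_groups = set()

    def record_negative_evaluation(self, command: str) -> None:
        group = COMMAND_GROUPS.get(command)
        if group is not None:
            # One negative evaluation marks the whole group as "prefer concise".
            self._concise_groups.add(group)

    def prefers_concise(self, command: str) -> bool:
        return COMMAND_GROUPS.get(command) in self._concise_groups
```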
  • the voice interaction processing method needs to distinguish different users.
  • different users can be distinguished by means of timbre recognition, and then the corresponding reply voice can be determined or adjusted according to the command voice of the corresponding user and the voice interaction processing mode corresponding to the user.
  • This processing method is similar to the above processing method. The main difference is that it emphasizes the word count and/or redundancy of the reply text; that is, it adjusts the duration and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required. For example, part of the text content can be selected from the reply text according to the word count condition, and the selection method can be sequential or random. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, detailed description is omitted here.
  • the emphasis is on adjusting the playback duration and/or redundancy of some or all of the reply voices in the reply voice database corresponding to the command voice.
  • the one or more reply voices stored in the reply voice library corresponding to the command voice are all reply voices for that command voice. When the user gives a negative voice evaluation of one of them, it may indicate that the user thinks the playback duration of that reply voice is too long and/or its redundancy is too high; it can also reflect that the user hopes the playback duration of the other reply voices corresponding to the command voice will not be too long and/or their redundancy will not be too high. Therefore, the playback duration and/or redundancy of some or all of the reply voices in the reply voice library corresponding to the command voice are adjusted, so as to meet the user's requirements for the playback duration and/or redundancy of the reply voices to the command voice. For example, when the command voice issued by the user is "What's the weather like today", suppose that the reply voice "It's sunny today, the temperature is 16-21°C, the breeze is gentle, suitable for suburban activities, you can consider going out for an outing" receives a negative evaluation voice. This means that the user only cares about the reply content directly related to the command voice and does not want to be disturbed by a long reply.
  • (1) can be shortened to "It's sunny today, the temperature is 16-21°C, and it is breezy, so it is suitable to wear autumn clothes and coats"; (2) can be shortened to "It's sunny today, the temperature is 16-21°C, and it is breezy, outdoor running is recommended"; (3) can be shortened to "It's sunny today, the temperature is 16-21°C"; (4) can be shortened to "It's sunny today, the temperature is 16-21°C, good morning"; and so on.
  • reducing the playback duration and/or redundancy of the reply voice corresponding to the command voice that is the same as the command voice in the subsequent time period may include two situations:
  • the playback speed can be accelerated when the reply voice corresponding to the command voice that is the same as the command voice is subsequently played, thereby shortening the playback time.
  • part of the voice content can be selected from the reply voice to play, thereby shortening the playing time.
  • for example, the first and last segments can be extracted at random, such as "It's 11:00 a.m., doing stretching exercises is good for health", or segments can be extracted in chronological order, such as "It's 11:00 a.m., you are tired from work".
  • the length of the specific interception can be adjusted according to the needs.
  • the subsequent reply voice corresponding to the command voice that is the same as the command voice can be controlled to stop playing when the playback duration is less than or equal to the first duration;
  • the subsequent reply voice corresponding to the command voice that is the same as the command voice is controlled to stop playing when the playing duration is less than or equal to the predetermined threshold.
  • a random threshold value within a specified interval can also be used to control the subsequent reply voice corresponding to the command voice that is the same as the command voice to stop playing when the playback duration is less than or equal to the random threshold value.
  • the random threshold may lie within a specified interval of 3-6 s; for example, playback may stop at 3 s, at 5 s, or at 6 s, and so on.
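  • As an illustration of the three stop-time options described above (the first duration, a predetermined threshold, or a random threshold within an interval), a minimal sketch follows. The function name `playback_cutoff` and its parameters are hypothetical and are not taken from the application.

```python
import random

def playback_cutoff(total_s, first_duration_s=None, preset_s=None,
                    random_interval=None, rng=random):
    """Return the time in seconds at which playback of a reply voice stops.

    Hypothetical helper: exactly one limit is applied, checked in the order
    first duration, preset threshold, random threshold within an interval.
    """
    if first_duration_s is not None:
        limit = first_duration_s
    elif preset_s is not None:
        limit = preset_s
    elif random_interval is not None:
        low, high = random_interval
        limit = rng.uniform(low, high)  # e.g. somewhere within 3-6 s
    else:
        limit = total_s  # no limit configured: play the whole reply
    # Never report a stop time beyond the end of the reply voice.
    return min(limit, total_s)
```

  • For a 15 s reply voice interrupted after 6 s, `playback_cutoff(15, first_duration_s=6)` gives a stop time of 6 s, while `playback_cutoff(15, random_interval=(3, 6))` gives some value between 3 s and 6 s.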
  • a voice whose playback duration and/or redundancy is lower than that of the current reply voice can be selected from the reply command library as the reply voice. Each reply voice in the reply command library can be marked with its playback duration and redundancy, so that, according to these marks, a voice whose playback duration and/or redundancy is lower than that of the current reply voice can be selected as the reply voice.
  • this processing method is similar to the above processing method; the difference is that it emphasizes the word count and/or redundancy of the reply text, that is, it adjusts the length and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required. For example, partial text content can be selected from the original reply text based on word count criteria.
  • the selection method can be sequential or random. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, detailed description is omitted here.
  • the redundancy of the reply text refers to the ratio of the text content (number of words) in the reply text that is not necessary for replying to the command voice to the total text content (number of words) of the reply text; here, the text content necessary for replying to the command voice can be understood as the content directly related to the command voice, and the text content not necessary for replying to the command voice can be understood as the content not directly related to the command voice.
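  • The redundancy ratio defined above can be sketched as follows. This is only an illustration: the function name `text_redundancy`, the labelling of segments as necessary or not, and counting words by whitespace splitting are all assumptions (for Chinese text a character count would be more natural).

```python
def text_redundancy(segments):
    """Ratio of non-necessary words to all words in a reply text.

    segments: list of (text, is_necessary) pairs, where is_necessary marks
    content directly related to the command voice (hypothetical labelling).
    """
    total = sum(len(text.split()) for text, _ in segments)
    if total == 0:
        return 0.0
    extra = sum(len(text.split()) for text, needed in segments if not needed)
    return extra / total

reply = [
    ("It's sunny today, the temperature is 16-21°C", True),
    ("you can consider going out for an outing", False),
]
redundancy = text_redundancy(reply)  # 8 non-necessary words out of 15
```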
  • adjusting the direction according to the first dialogue strategy, and adjusting the dialogue strategy corresponding to the command voice specifically includes:
  • this embodiment effectively utilizes the information of "the first duration of the reply voice that has been played when the evaluation voice is received", so that when the reply voice of the response command voice is played,
  • the playback duration of the reply voice can be effectively adjusted according to the first duration or the redundancy of the reply voice can be adjusted according to the first ratio of the first duration to the total duration of the reply voice.
  • the playback duration of the subsequent reply voice corresponding to the same command voice as the command voice can be controlled to be less than or equal to the first duration, so as to meet the user's requirement for the playback duration of the reply voice.
  • the complete playback time of a reply voice is 15s
  • 6s can be used as a threshold to control the playback duration of the subsequent reply voice corresponding to the command voice that is the same as the command voice to be less than or equal to 6s.
  • it is also possible to determine the ratio of the first duration of the reply voice that has been played when the evaluation voice is received to the total duration of the reply voice, and to control the redundancy of the reply voices corresponding to all or part of the command voices issued by the first user
  • to be less than or equal to the ratio; or,
  • the reply voice can be adjusted more accurately according to the evaluation voice, so that the reply voice in the human-computer interaction process meets the user's requirements for human-computer interaction, thereby improving the user experience.
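  • The use of the first duration and the first ratio as thresholds can be sketched as below, using the figures from the example in the text (a 15 s reply voice interrupted after 6 s). The function names and the dictionary layout of a reply voice are hypothetical.

```python
def first_ratio(first_duration_s, total_duration_s):
    """Share of the reply voice already played when the evaluation arrived."""
    if total_duration_s <= 0:
        raise ValueError("total duration must be positive")
    return min(first_duration_s / total_duration_s, 1.0)

def meets_duration_limit(reply, first_duration_s):
    """True if the reply voice is no longer than the first duration."""
    return reply["duration_s"] <= first_duration_s

def meets_redundancy_limit(reply, ratio):
    """True if the reply voice's redundancy does not exceed the first ratio."""
    return reply["redundancy"] <= ratio

ratio = first_ratio(6, 15)  # evaluation voice received after 6 s of a 15 s reply
short_reply = {"duration_s": 5.0, "redundancy": 0.2}
long_reply = {"duration_s": 15.0, "redundancy": 0.6}
```

  • Here `short_reply` satisfies both limits and `long_reply` satisfies neither, so only the former would remain a candidate under either criterion.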
  • adjusting the playback duration of the reply voice corresponding to the command voice according to the first duration specifically includes:
  • control the playback duration of the reply voice corresponding to all or part of the command voice issued by the first user to be less than or equal to the first duration; wherein, the first user is the user who issued the command voice;
  • controlling the playback duration of the reply voices corresponding to all or part of the command voices in the same command voice group to be less than or equal to the first duration.
  • three control scenarios are considered, respectively: 1 the adjustment of the subsequent reply voice corresponding to the command voice that is the same as the command voice; 2 the reply voice corresponding to all or part of the command voice issued by the first user 3.
  • control of the playback duration of the subsequent reply voice corresponding to the command voice that is the same as the command voice is less than or equal to the first duration, including:
  • the playback duration of the subsequent reply voice corresponding to the command voice that is the same as the command voice is controlled to be less than or equal to the first duration
  • there are multiple implementations, such as: A. control the subsequent reply voice corresponding to the same command voice to stop playing when its playback duration is less than or equal to the first duration; or, B. control the subsequent reply voice corresponding to the same command voice so that only part of its content is excerpted and played; or, C. from the reply voice library corresponding to the command voice, select a reply voice whose playback duration is less than or equal to the first duration as the subsequent reply voice corresponding to the same command voice; or, D. increase the playback speed of the subsequent reply voice corresponding to the same command voice.
  • the advantage of the above method A is that it is simple and convenient to control, and only needs to stop playing when the playback duration of the reply voice is less than or equal to the first duration.
  • the advantage of the above method B is that it is more flexible, for example, relatively important information in the reply voice can be intercepted and played as needed.
  • the advantage of the above method C is that there is no need to adjust the reply voice in the reply voice library, which is simple and convenient to implement, and the reply voice whose playback duration meets the requirements can be directly selected as the response.
  • the advantage of the above method D is that the information content of the reply voice is not lost, and at the same time, the effect of shortening the playing time can be satisfied.
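  • Methods A to D can be compared in a small sketch. All names below are hypothetical, and two choices are assumptions rather than part of the description: method B takes segments in a given priority order, and method C prefers the longest reply voice that still fits.

```python
from dataclasses import dataclass

@dataclass
class Reply:
    text: str
    duration_s: float

def method_a_stop_time(reply, first_s):
    """A: stop playing once the first duration is reached."""
    return min(reply.duration_s, first_s)

def method_b_excerpt(segments, first_s):
    """B: keep segments (text, duration_s), in priority order, that fit the budget."""
    chosen, used = [], 0.0
    for text, duration_s in segments:
        if used + duration_s <= first_s:
            chosen.append(text)
            used += duration_s
    return chosen

def method_c_pick(library, first_s):
    """C: pick a stored reply voice no longer than the first duration."""
    fitting = [r for r in library if r.duration_s <= first_s]
    return max(fitting, key=lambda r: r.duration_s) if fitting else None

def method_d_speed(reply, first_s):
    """D: playback-speed factor needed to finish within the first duration."""
    return max(1.0, reply.duration_s / first_s)
```

  • With a 15 s reply voice and a first duration of 6 s, method A stops at 6 s, while method D would require a speed factor of 2.5, which illustrates why D preserves content at the cost of intelligibility for long replies.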
  • the response voice of the user interface makes the playback duration of the response voice less than or equal to the first duration, so that the voice interaction process is more in line with the user's requirements for the duration and/or redundancy of the response voice.
  • the previous processing method describes how the reply voice is adjusted for the same command voice, whereas this processing method is aimed at the first user; that is, the reply voices corresponding to all or part of the command voices issued by the first user are adjusted,
  • so that the voice interaction process better matches the user's requirements for the duration and/or redundancy of the reply voice, and the first user is also spared the trouble of issuing evaluation voices for the reply voices of different command voices.
  • adjusting the redundancy of the reply speech corresponding to all or part of the command speech issued by the first user according to the evaluation speech including:
  • the main difference is that this embodiment emphasizes the redundancy of the reply voice.
  • the threshold for the redundancy is the ratio of the first duration of the reply voice that has been played when the evaluation voice occurs to the total duration of the reply voice. The specific principle of the redundancy adjustment has been described in detail in other embodiments, so it is not repeated here.
  • adjusting the playback duration of the reply voices corresponding to all or part of the command voices in the same command voice group according to the evaluation voice including:
  • the command voice group may be divided according to the subject of the command, for example, according to one or more of life commands, work commands, and study commands. Accordingly, a life command voice group, a work command voice group, and a study command voice group are obtained.
  • command voices such as "today's restricted plate numbers", "weather forecast", and "seven-step hand washing method" belong to the life command voice group.
  • command voices such as "the origin of the English word pop" and "the story of the zodiac" belong to the study command voice group.
  • command voices such as "how to become a reliable workplace person" and "how to make a good work plan" belong to the work command voice group.
  • the smart device can use reply voices with similar playback duration and/or redundancy to reply to the command voices belonging to the same command voice group,
  • thereby sparing users the trouble of issuing evaluation voices to adjust some or all of the reply voices for the command voices in the same command voice group.
  • to avoid the user issuing evaluation voices multiple times for the reply voices of command voices, this processing method makes the playback duration of the reply voices corresponding to all or part of the command voices in the same command voice group less than or equal to the first duration, so that when the user issues other command voices in the same command voice group, reply voices with a shorter playback duration and/or lower redundancy are also obtained. The user therefore does not need to issue evaluation voices repeatedly for the reply voices of different command voices in the same command voice group, which improves the user experience.
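  • Propagating a duration limit to the other command voices of the same command voice group might look like the sketch below. The group table, the function names, and the `limits` dictionary are hypothetical; the group members are taken from the examples in the text.

```python
COMMAND_GROUPS = {
    "life": {"weather forecast", "seven-step hand washing method"},
    "study": {"the story of the zodiac"},
    "work": {"how to make a good work plan"},
}

def group_of(command):
    """Return the name of the group containing the command, or None."""
    for name, commands in COMMAND_GROUPS.items():
        if command in commands:
            return name
    return None

def apply_group_limit(limits, command, first_duration_s):
    """Cap the reply duration for every command in the same group."""
    group = group_of(command)
    targets = COMMAND_GROUPS[group] if group else {command}
    for target in targets:
        # Keep the tighter of any existing limit and the new one.
        limits[target] = min(limits.get(target, float("inf")), first_duration_s)
    return limits

limits = apply_group_limit({}, "weather forecast", 6)
```

  • After this call both life commands carry a 6 s limit, while commands in the study and work groups are unaffected.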
  • adjusting the redundancy of the reply voices corresponding to all or part of the command voices in the same command voice group with the command voices according to the evaluation voice including:
  • the main difference is that this embodiment emphasizes the redundancy of the reply voice.
  • the threshold used when controlling the redundancy is the ratio of the first duration of the reply voice that has been played when the evaluation voice occurs to the total duration of the reply voice. In addition, since the specific principle of the redundancy adjustment of the reply voice has been introduced in detail in other embodiments, it is not repeated here.
  • adjusting the redundancy of the reply speech corresponding to the command speech according to the first ratio specifically includes:
  • control the redundancy of the reply voices corresponding to all or part of the command voices in the same command voice group to be less than or equal to the ratio.
  • the first dialogue strategy adjustment direction corresponding to the command voice specifically including:
  • the first keyword is a keyword related to the preference, and then it is determined that the adjustment direction of the first dialogue strategy corresponding to the command voice is the direction of reducing the frequency of use of the reply voice or replacing a new reply voice.
  • it can be determined that the first dialogue strategy adjustment direction corresponding to the command voice
  • is the direction of reducing the usage frequency of the reply voice or replacing it with a new reply voice.
  • the keywords related to preference include: "dislike", "don't like", "change", "don't appear in the future", "change to another one", and so on.
  • when it is determined according to the negative feedback information that the first keyword is a keyword related to preference, it means that the user does not like the current reply voice.
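  • A minimal sketch of mapping a preference keyword in the evaluation voice to the first dialogue strategy adjustment direction is given below. It assumes the evaluation voice has already been transcribed to text; the function name, the returned label, and the simple substring matching are hypothetical simplifications.

```python
# Keywords taken from the examples in the text.
PREFERENCE_KEYWORDS = (
    "dislike",
    "don't like",
    "change",
    "don't appear in the future",
)

def first_adjustment_direction(evaluation_text):
    """Return the adjustment direction implied by a negative evaluation.

    Returns None when no preference keyword is found, in which case other
    keyword families (for example duration-related ones) would be checked.
    """
    text = evaluation_text.lower()
    if any(keyword in text for keyword in PREFERENCE_KEYWORDS):
        return "reduce_frequency_or_replace"
    return None
```

  • For example, `first_adjustment_direction("I don't like this one")` yields the reduce-or-replace direction, whereas an unrelated utterance yields `None`.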
  • the dialogue strategy there are various ways to adjust the dialogue strategy:
  • other reply voices are selected on the criterion that their timbre differs from the timbre of the current reply voice (for example, a male voice is changed to a female voice, or a female voice to a male voice, or an adult voice to a child's voice, or a child's voice to an adult voice, etc.).
  • 5 Select the reply voice based on the prompt information carried in the evaluation voice (for example, if the prompt information is a football theme, then when switching to a new reply voice, a reply voice that matches the football theme can be selected according to that prompt information).
  • a smart device with a voice interaction function generally has a preset number of interaction skills.
  • the smart device assigns the user's command voice to one or several interaction skills through intent recognition before proceeding with subsequent processing.
  • each interaction skill corresponds to at least one reply voice library. After the intent of the command voice is identified by means of intent recognition, the command voice can be assigned to one or several interaction skills; since each interaction skill corresponds to at least one reply voice library, one or more reply voice libraries corresponding to the command voice can be determined.
  • one or more reply voices are stored in the one or more reply voice libraries corresponding to the command voice; these may be reply voices with different durations, reply voices with different extension themes, or reply voices with different timbres, which is not limited in this embodiment.
  • the one or more reply voices stored in the one or more reply voice libraries corresponding to the command voice can all be used as replies to the command voice; they differ only in form or content, such as duration, extension theme, and timbre.
  • the reply voice database corresponding to the command voice stores reply voices of different durations, which are respectively 1s, 3s, 5s, 10s, 15s, 20s, 25s, 30s, and 50s of reply voices.
  • the reply voice database corresponding to the command voice stores reply voices of different extension themes.
  • the extension themes include, but are not limited to: informational (conveying information only, for example, "It is 3:00 p.m."); interesting ("It is 3:00 p.m., do you want to listen to a joke to lighten the mood? The joke is: ..."); knowledge-based ("It is 3:00 p.m. and the weather is fine; 3:00 p.m. is a period when the brain's neurons are more active, so you can choose some memory-intensive work to handle", etc.); story ("It is 3:00 p.m.; here is what happened in history at 3:00 p.m.", etc.); music ("It is 3:00 p.m., please enjoy an old song by singer A"); sports ("It is 3:00 p.m.; the CBA game between Beijing and Guangzhou starts at 3:50, please don't miss it"); and dialogue ("It is 3:00 p.m., do you want to play a word-guessing game?", etc.).
  • the reply voice database corresponding to the command voice stores reply voices of different timbres. For example, the same reply voice can be recorded separately by male, female, adult, and child speakers to obtain reply voices of different timbres.
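  • One possible way to represent a reply voice library whose entries differ in duration, extension theme, and timbre, and to pick a replacement from it, is sketched below. The `ReplyVoice` fields, the `pick_replacement` function, and the order of preference (requested theme first, then a different timbre) are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplyVoice:
    text: str
    duration_s: float
    theme: str    # e.g. "informational", "music", "sports"
    timbre: str   # e.g. "adult male", "adult female", "child"

def pick_replacement(library, current, theme=None):
    """Pick a reply voice different from the current one.

    If the evaluation voice carried a theme hint, prefer a matching theme;
    otherwise prefer a reply voice with a different timbre.
    """
    candidates = [r for r in library if r != current]
    if theme is not None:
        themed = [r for r in candidates if r.theme == theme]
        if themed:
            return themed[0]
    different_timbre = [r for r in candidates if r.timbre != current.timbre]
    if different_timbre:
        return different_timbre[0]
    return candidates[0] if candidates else None

library = [
    ReplyVoice("It is 3:00 p.m.", 1.0, "informational", "adult female"),
    ReplyVoice("It is 3:00 p.m.", 1.0, "informational", "adult male"),
    ReplyVoice("It is 3:00 p.m., please enjoy an old song", 5.0, "music", "adult female"),
]
```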
  • after a reply voice different from the current reply voice has been selected from the reply voice library corresponding to the command voice according to the evaluation voice and played, it is also possible to further determine whether the replacement reply voice receives a negative evaluation voice. If not, the replacement reply voice can be selected as the reply voice when subsequently responding to the command voice; if it does receive a negative evaluation voice, replacement with a new reply voice can continue until no negative evaluation voice is received from the user.
  • the current time period can also be recorded, and when it is determined that there is no negative evaluation voice in the modified reply voice, the updated reply voice is selected as the response of the command voice to improve user satisfaction.
  • reducing the frequency of use of the reply voice means that, when responding to the command voice in a subsequent time period, the probability of selecting the reply voice from the reply voice library corresponding to the command voice
  • as the response is reduced; the details are as follows:
  • the key point is that when a certain reply voice receives a negative evaluation voice, its frequency of use as a reply to the command voice is subsequently reduced;
  • that is, the probability of selecting the reply voice as a response from the reply voice library corresponding to the command voice is reduced.
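  • Reducing the selection probability of a negatively evaluated reply voice can be sketched with weighted random selection. The weight table, the halving factor, and the function names are hypothetical; the description does not specify how the probability is reduced.

```python
import random

def penalize(weights, reply_id, factor=0.5):
    """Return a copy of the weights with one reply voice's weight scaled down."""
    updated = dict(weights)
    updated[reply_id] = updated[reply_id] * factor
    return updated

def selection_probabilities(weights):
    """Normalise the weights into selection probabilities."""
    total = sum(weights.values())
    return {reply_id: w / total for reply_id, w in weights.items()}

def choose_reply(weights, rng=random):
    """Draw one reply voice id according to the current weights."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]

weights = {"reply_a": 1.0, "reply_b": 1.0}
weights_after_negative = penalize(weights, "reply_a")
probabilities = selection_probabilities(weights_after_negative)
```

  • With two equally weighted reply voices, halving the weight of `reply_a` lowers its selection probability from 1/2 to 1/3 without removing it from the library.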
  • reply commands of different lengths in the reply voice library corresponding to a certain command voice.
  • a reply command with a short length and low redundancy it can be determined which one or several reply voices will be subsequently selected for the user as the reply of the response command voice according to the feedback information of the different reply voices from the user. voice.
  • This processing method is similar to the above processing method; the main difference is that it reduces the frequency of use of those reply voices in the reply voice library corresponding to the command voice whose playback length and/or redundancy is greater than or equal to that of the reply voice.
  • when the reply voice receives a negative evaluation voice, it indicates that the user does not like reply voices whose playback duration and/or redundancy is greater than or equal to that of the reply voice. Therefore, the probability that a reply voice whose playback length and/or
  • redundancy is greater than or equal to that of the reply voice is used as the response can subsequently be reduced, so as to better suit user needs. Since the processing method of this embodiment is similar to that of the above-mentioned embodiment, it is not repeated here.
  • when adjusting the dialogue strategy, according to the theme the user wishes to switch to, a reply voice
  • that matches the theme can be selected from the reply voice library corresponding to the command voice and played, so that the user's needs can be accurately matched.
  • adjusting the direction according to the first dialogue strategy, and adjusting the dialogue strategy corresponding to the command voice specifically includes:
  • reducing the use frequency of the reply voice refers to selecting the reply voice from the reply voice library corresponding to the command voice when responding to the command voice in a subsequent time period The probability of being a response decreases;
  • reducing the usage frequency of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice means that, when subsequently responding to the command voice, the probability of selecting, from the reply voice library corresponding to the command voice, a reply voice whose playback length and/or redundancy is greater than or equal to that of the reply voice as the response is reduced;
  • a reply voice matching the theme is selected from the reply voice library corresponding to the command voice and played.
  • a dialogue strategy corresponding to the command voice is determined, which specifically includes:
  • the feedback information carried by the evaluation voice is positive feedback information
  • according to the second keyword carried in the positive feedback information, the second dialogue strategy adjustment direction corresponding to the command voice is determined, and the dialogue strategy corresponding to the command voice is adjusted according to the second dialogue strategy adjustment direction.
  • the second dialogue strategy adjustment direction refers to the direction in which the reply voice in response to the command voice is adjusted, according to the positive feedback information carried by the evaluation voice, to further maintain or enhance the user experience. For example, if it is determined that the evaluation voice contains a keyword with a positive connotation and the keyword is related to maintaining or increasing the playback duration, the playback duration and/or redundancy of the reply voice in response to the command voice is maintained or increased. If it is determined that the evaluation voice contains a keyword with a positive connotation and the keyword is related to maintaining or increasing the frequency of use, the frequency of use of the reply voice as a response to the command voice is maintained or increased.
  • A positive connotation here refers to information or meaning that conveys positive feedback such as liking, approval, and support.
  • the adjustment direction of the second dialogue strategy can also be determined according to the second keyword carried in the feedback information. For example, according to the second keyword carried in the feedback information, it can be determined whether the adjustment direction of the second dialogue strategy is the adjustment direction of maintaining or increasing the playback duration (maintaining or increasing the redundancy), or the adjustment direction of maintaining or increasing the frequency of using the relevant reply voice. , or other adjustment directions, etc., so that it can more accurately match user needs.
  • the second dialogue strategy adjustment direction corresponding to the command voice is determined, which specifically includes:
  • the second keyword is a keyword related to maintaining or increasing the playback duration
  • determine that the adjustment direction of the second dialogue strategy corresponding to the command voice is the direction of maintaining or increasing the playback duration, or maintaining or increasing the redundancy direction.
  • the keywords related to maintaining or increasing the playback duration may be: the duration is just right, the duration can be appropriately increased next time, the reply voice of this length is liked very much, the content is rich and the time is right, etc.
  • the second keyword is a keyword related to maintaining or increasing the playback duration
  • the adjustment direction of the second dialogue strategy corresponding to the command voice is the direction of maintaining or increasing the playback duration, or the direction of maintaining or increasing the redundancy.
  • the adjustment direction of the second dialogue strategy corresponding to the command voice is the direction of maintaining or increasing the playback duration, or the direction of maintaining or increasing the redundancy.
  • the redundancy of the reply speech refers to the ratio of the speech content necessary for the non-replying command speech in the reply speech to the total speech content of the reply speech.
  • when the reply voice receives a positive evaluation voice, it indicates that the user may relatively approve or accept the playback duration and/or redundancy of the reply voice. Therefore, in an implementation manner, the playback duration and/or redundancy of the reply voice can be maintained.
  • when a reply voice whose playback duration is longer than a preset threshold is positively evaluated, it indicates that the user may approve of or wish to receive a reply voice with a longer playback duration or higher redundancy. Therefore, in one implementation, it is also possible to increase the playback duration and/or redundancy of the reply voice. It can be seen that, in this embodiment, the reply voice can be adjusted according to the user's evaluation voice, so that the reply voice is more suitable for the user's habits or needs.
  • the reply voice when it receives a positive evaluation voice, it indicates that the user may relatively approve or accept the playback time and/or redundancy of the reply voice.
  • when subsequently responding to the command voice, a reply voice whose playback duration and/or redundancy differs from that of the reply voice by an amount within a preset range is selected from the reply voice library corresponding to the command voice and played; that is,
  • a reply voice that is close to the reply voice in playback duration and/or redundancy is selected from the reply voice library and played, so as to satisfy the user's requirement for the playback duration and/or redundancy of the reply voice.
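  • Selecting reply voices that are close to a positively evaluated reply voice can be sketched as a simple filter. The function name, the dictionary layout, and the default gaps are hypothetical; this sketch requires both the duration and the redundancy to be within range, although the description also allows either criterion alone.

```python
def similar_replies(library, liked, max_duration_gap_s=2.0, max_redundancy_gap=0.1):
    """Return reply voices whose duration and redundancy are close to the liked one."""
    return [
        reply for reply in library
        if reply is not liked
        and abs(reply["duration_s"] - liked["duration_s"]) <= max_duration_gap_s
        and abs(reply["redundancy"] - liked["redundancy"]) <= max_redundancy_gap
    ]

liked = {"id": "b", "duration_s": 5.0, "redundancy": 0.40}
library = [
    {"id": "a", "duration_s": 1.0, "redundancy": 0.00},
    liked,
    {"id": "c", "duration_s": 6.0, "redundancy": 0.45},
]
close_ids = [reply["id"] for reply in similar_replies(library, liked)]
```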
  • This processing method is similar to the above processing method; the main difference is that it emphasizes the word count and/or redundancy of the reply text, that is, it adjusts the length and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, a specific description is omitted here.
  • the emphasis is on adjusting the playback duration and/or redundancy of some or all of the reply voices in the reply voice database corresponding to the command voice.
  • this processing method has an opposite relationship with the reduction of the playback duration and/or redundancy of some or all of the reply voices in the reply voice library corresponding to the command voice introduced in the foregoing embodiment. Therefore, for specific principles, reference can be made to the introduction of the foregoing embodiments according to the opposite logic, and details are not repeated here.
  • when the reply voice receives a positive evaluation voice, it indicates that the user may relatively approve or accept the playback duration and/or redundancy of the reply voice. Therefore, in an implementation manner,
  • the playback duration and/or redundancy of the reply voice corresponding to the same command voice in the subsequent time period can be maintained, so as to meet the user's requirements for the playback duration and/or redundancy of the reply voice.
  • when a reply voice whose playback duration is longer than a preset threshold is positively evaluated, it indicates that the user may approve of or wish to receive a reply voice with a longer playback duration or higher redundancy.
  • the reply voice can be adjusted according to the user's evaluation voice, so that the reply voice is more suitable for the user's habits or needs.
  • This processing method is similar to the above processing method; the main difference is that it emphasizes the word count and/or redundancy of the reply text, that is, it adjusts the length and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, detailed description is omitted here.
  • the first user is the user who issued the command voice.
  • when the reply voice is a reply voice whose playback duration is longer than the preset threshold
  • and it receives a positive evaluation voice from the first user, it indicates that the first user may approve or accept reply voices with a longer playback duration and/or higher redundancy. Therefore, in an implementation manner, when subsequently responding to all or part of the command voices issued by the first user, the playback duration and/or redundancy of the reply voice can be maintained or increased, so as to meet the user's requirements for the playback duration and/or redundancy of the reply voice.
  • This processing method is similar to the above processing method; the main difference is that it emphasizes the word count and/or redundancy of the reply text, that is, it adjusts the length and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required.
  • the emphasis is on adjusting the playback duration and/or redundancy of the reply voices corresponding to all or part of the command voices in the same command voice group.
  • the command voice group can be divided in various ways; for example, it can be divided according to the subject of the command, according to the length and/or complexity of the command voice, or according to similarity, and so on. There is no limitation on the specific division method.
  • the command voice group may be divided according to command topics, for example, according to one or more of life commands, work commands, and study commands. Accordingly, a life command voice group, a work command voice group, and a study command voice group are obtained. For example, command voices such as "what time is it", "today's weather", "tomorrow's weather", "traffic conditions", "restricted number", and "supermarket discount" belong to the life command voice group, while command voices such as "the meaning of 'carving a boat and seeking a sword'",
  • "5G mobile phone", and "the origin of the log function" belong to the study command voice group, and command voices such as "how to arrange time reasonably", "precautions for business trips", "how to improve work efficiency", and "what are the artificial intelligence algorithms" belong to the work command voice group.
  • the playback duration and/or redundancy of the reply voice is accepted or liked by the user, which also means that the user wants the reply voice to have a longer playback duration and/or higher redundancy.
  • the user presumably also hopes that
  • the reply voices corresponding to the command voice group to which the command voice belongs have a long playback duration and/or high redundancy. Therefore, in order to improve the user experience and spare the user from issuing evaluation voices multiple times for the reply voices of different command voices in the same command voice group,
  • this processing method adjusts the playback duration and/or redundancy of the reply voices corresponding to all or part of the command voices in the same command voice group, so that when the user issues other command voices in the same command voice group,
  • reply voices with a longer playback duration and/or higher redundancy are also obtained. The user thus avoids issuing positive evaluation voices multiple times for the reply voices of different command voices in the same command voice group, which improves the user experience.
  • This processing method is similar to the one above. The main difference is that it emphasizes the word count and/or redundancy of the reply text; that is, it adjusts the length and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the reply text.
  • the word count condition and/or redundancy condition here can be set as required.
  • when the playback duration and/or redundancy of the reply voice is to be increased, this can be done by querying the extended information stored in the database. For example, when a certain reply voice is "It is 3:00 pm", the reply voices obtained after expansion by querying the database may be: (1) "It's 3:00 pm, please get up and have a cup of coffee"; (2) "It's 3:00 pm, let's play a soothing song for you"; (3) "It's 3 o'clock in the afternoon, the interesting thing that happened at 3 o'clock in history is ..."; (4) "It's 3 o'clock in the afternoon, please find a quiet place, close your eyes, and meditate with me"; and so on.
  • adjusting the direction according to the second dialogue strategy, and adjusting the dialogue strategy corresponding to the command voice specifically includes:
  • the redundancy of the reply voice refers to the ratio of the voice content in the reply voice that is not necessary for replying to the command voice to the total voice content of the reply voice;
  • the first user is the user who issued the command voice
  • a reply voice whose difference in playback duration and/or redundancy from the reply voice is within a preset range is selected for playback.
  • the second dialogue strategy adjustment direction corresponding to the command voice specifically including:
  • when the second keyword is a keyword related to maintaining or increasing the frequency of use, it is determined that the second dialogue strategy adjustment direction corresponding to the command voice is to maintain or increase the frequency of use of the reply voice.
  • the keywords related to maintaining or increasing the frequency of use may be: "appear more in the future", "like it very much", "be you in the future", "use it a lot", and so on.
  • the key point is that when a certain reply voice receives a positive evaluation voice, the frequency of use of that reply voice can be increased in the future. Because the reply voice is well received as a reply to the command voice, the possibility of selecting it when responding to the command voice in the future increases; that is, the probability of selecting it as a response from the reply voice library corresponding to the command voice increases.
  • the processing method of this embodiment has the advantage that there is no need to adjust or change the reply voices in the reply voice library; instead, a more suitable or better-matching reply voice is selected as the response to the command voice, which is relatively simple and convenient to implement.
  • the possibility of selecting the reply voice as a response can be increased by increasing the score corresponding to the reply voice, or by applying a special mark to it.
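As a rough illustration of raising a reply voice's score so that it is chosen more often, the sketch below keeps one weight per reply in the reply voice library and selects replies by weighted random choice. The reply identifiers, the `boost_reply` helper, and the multiplicative factor are assumptions made for illustration; the embodiment only states that a score may be increased or a special mark applied.

```python
import random


def boost_reply(weights, reply_id, factor=1.5):
    """Return a copy of the weights with one reply's score multiplied by factor."""
    boosted = dict(weights)
    boosted[reply_id] = boosted[reply_id] * factor
    return boosted


def selection_probability(weights, reply_id):
    """Probability that weighted random selection picks reply_id."""
    return weights[reply_id] / sum(weights.values())


def pick_reply(weights, rng=random):
    """Pick one reply id, with probability proportional to its weight."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]


# Hypothetical reply voice library for the command "what time is it".
library = {"time_only": 1.0, "time_plus_coffee": 1.0, "time_plus_song": 1.0}

# A positive evaluation was received for "time_plus_coffee" (assumed).
after_feedback = boost_reply(library, "time_plus_coffee", factor=2.0)
```

With the assumed factor of 2.0, the selection probability of the praised reply rises from one third to one half, while the library itself is left unchanged.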
  • This processing method is similar to the above-mentioned processing method, except that, in order to enrich the user experience, the frequency of use of the reply voice whose subject is close to the reply voice may be increased. For example, when the user prefers the reply voices on sports topics, they can try to increase the use of reply voices on relatively similar topics such as yoga or meditation.
  • This processing method is similar to the above processing method, the main difference is that this processing method is used to increase the frequency of use of reply voices whose playback length and/or redundancy is greater than or equal to the reply voice in the reply voice library corresponding to the command voice.
  • when the reply voice receives a positive evaluation voice, it indicates that the user may approve of or hope to receive a reply voice with a longer playback duration or higher redundancy. Therefore, in an implementation manner, the probability that a reply voice whose playback length and/or redundancy is greater than or equal to that of the reply voice is subsequently used as a response can be increased, so as to better meet the needs of the user. Since the processing method of this embodiment is similar to that of the above-mentioned embodiment, it is not repeated here.
  • adjusting the direction according to the second dialogue strategy, and adjusting the dialogue strategy corresponding to the command voice specifically includes:
  • increasing the frequency of use of the reply voice refers to an increase in the probability of selecting the reply voice from the reply voice library as a response when responding to the command voice in a subsequent time period;
  • increasing the frequency of use of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice; wherein this means that, when the command voice is subsequently responded to, the probability of selecting, from the reply voice library corresponding to the command voice, a reply voice whose playback length and/or redundancy is greater than or equal to that of the reply voice as the response increases.
  • when the reply voice receives a positive evaluation voice, it indicates that the user likes the reply voice. Therefore, in an implementation manner, the reply voice can be played repeatedly to satisfy the user's wish to listen to it.
  • the reply voice may be played repeatedly this time, or it may be repeated the next time in response to the same command voice.
  • a combination of both is also possible.
  • corresponding to the repeated playback used for a positive evaluation voice, a special processing method after receiving a negative evaluation voice is to end the reply voice, as described below:
  • one processing method is to end the reply voice according to the evaluation voice; that is, the part of the reply voice that has not been played when the evaluation voice is received is not played, and the reply voice ends. The user is thus no longer troubled by a long or disliked reply voice, and playback stops as soon as the user issues the evaluation voice.
  • ending the reply voice here may mean ending the playback completely, or temporarily suspending the playback and resuming it after a resume-playback instruction is received, which is not limited in this embodiment.
  • the feedback information carried by the evaluation voice is negative feedback information, which specifically includes:
  • the evaluation voice carries first information, and the first information refers to information that matches the comment information in the first database; wherein, the first database stores negative comment information;
  • the evaluation voice carries second information
  • the second information refers to information having an opposite meaning to the information contained in the reply voice
  • the intonation corresponding to the evaluation speech matches the intonation information in the first intonation database, where intonations with negative emotions are stored in the first intonation database;
  • the loudness corresponding to the evaluation speech is greater than or equal to the first loudness.
  • the evaluation voice carries first information, and the first information refers to information that matches the comment information in the first database; wherein, the first database stores negative comment information;
  • the negative comment information may include bad, dislike, too long, too complicated, disturbed, No, Bad, Stop, and the like.
  • the evaluation voice carries second information, and the second information refers to information having the opposite meaning to the information contained in the reply voice;
  • the negative evaluation voice may also contain information having an opposite meaning to the information contained in the reply voice, that is, when the user does not like the reply voice, he will express his dislike by expressing the opposite meaning.
  • for example, suppose the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early. I know you're working hard, I've been wishing you well, keep going tomorrow!". If the user doesn't like the reply voice, the corresponding evaluation voice may be "don't work hard!", "don't want to work hard", "don't want to struggle", and so on.
  • when the user dislikes the reply voice, the evaluation voice usually has a negative emotional tone, such as unhappiness, sighing, or resentment. Therefore, by determining whether the intonation corresponding to the evaluation voice matches the intonation information in the first intonation database, it can be determined whether the evaluation voice carries negative feedback information for the reply voice.
  • when the user dislikes the reply voice, the loudness of the evaluation voice is generally relatively high, for example, "So annoying!", "Don't like it!", "Stop!", and so on. Therefore, by determining whether the loudness corresponding to the evaluation voice is greater than or equal to the first loudness (the first loudness can be set as required), it can be determined whether the evaluation voice carries negative feedback information for the reply voice.
  • this embodiment provides different processing methods for determining whether the evaluation voice carries negative feedback information for the reply voice. These methods make the determination from different perspectives, comprehensively and accurately.
  • the feedback information carried by the evaluation voice is positive feedback information, which specifically includes:
  • the evaluation voice carries the third information
  • the third information refers to the information that matches the comment information in the second database; wherein, positive comment information is stored in the second database;
  • the evaluation voice carries fourth information
  • the fourth information refers to information having the same or similar meaning as the information contained in the reply voice
  • the loudness corresponding to the evaluation speech is smaller than the first loudness.
  • any one or more of the following A, B, C, and D may be specifically implemented:
  • the evaluation voice carries third information, and the third information refers to the information that matches the comment information in the second database; wherein, the second database stores positive comment information;
  • the positive comment information may include "good", "like", "Yes", and so on.
  • the evaluation voice carries fourth information, and the fourth information refers to information having the same or similar meaning as the information contained in the reply voice;
  • the positive evaluation voice may also contain information having the same meaning as the information contained in the reply voice, that is, when the user prefers the reply voice, he or she will express the feeling of liking by expressing the same or similar meaning.
  • for example, suppose the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early. I know you're working hard, I've been wishing you well, keep going tomorrow!". If the user likes the reply voice, the corresponding evaluation voice may be "Come on together!", "Strive hard", "I also wish you well", and so on.
  • when the user likes the reply voice, the evaluation voice usually has a positive emotional tone, such as happiness or cheering. Therefore, by determining whether the intonation corresponding to the evaluation voice matches the intonation information in the second intonation database, it can be determined whether the evaluation voice carries positive feedback information for the reply voice.
  • when the user likes the reply voice, the loudness of the evaluation voice is generally relatively low, for example, "sounds good", "like", "good", and so on. Therefore, by determining whether the loudness corresponding to the evaluation voice is smaller than the first loudness (the first loudness can be set as required), it can be determined whether the evaluation voice carries positive feedback information for the reply voice.
  • this embodiment provides different processing methods for determining whether the evaluation voice carries positive feedback information for the reply voice. These methods make the determination from different perspectives, comprehensively and accurately.
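The keyword and loudness criteria described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the evaluation voice is assumed to be already transcribed to text, the word lists stand in for the first and second databases, the 70 dB value for the first loudness is arbitrary, and using loudness only as a fallback when no keyword matches is a design choice of this sketch, not something the embodiment prescribes. Intonation matching is omitted.

```python
import re

# Stand-ins for the first (negative) and second (positive) comment databases.
NEGATIVE_COMMENTS = ("not good", "dislike", "too long", "too complicated",
                     "bad", "no", "stop")
POSITIVE_COMMENTS = ("good", "like", "yes")


def contains_phrase(text, phrases):
    """True if any phrase occurs in text as a whole word or word sequence."""
    lowered = text.lower()
    return any(re.search(r"\b" + re.escape(p) + r"\b", lowered) for p in phrases)


def classify_feedback(text, loudness_db, first_loudness_db=70.0):
    """Return 'negative' or 'positive' for a transcribed evaluation voice."""
    # Negative phrases are checked first so that "not good" is not read as "good".
    if contains_phrase(text, NEGATIVE_COMMENTS):
        return "negative"
    if contains_phrase(text, POSITIVE_COMMENTS):
        return "positive"
    # Fallback (assumption of this sketch): loud means negative, quiet means positive.
    return "negative" if loudness_db >= first_loudness_db else "positive"
```

For instance, `classify_feedback("Stop, this is too long", 50.0)` returns `"negative"` because of the keyword match, regardless of the low loudness.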
  • the database for analyzing the evaluation voice and the database for analyzing the command voice are independent of each other;
  • the received voice is analyzed based on the database used to analyze the evaluation voice, and it is determined whether the feedback information carried by the evaluation voice is negative feedback information or positive feedback information.
  • the database for analyzing the evaluation voice and the database for analyzing the command voice can be set up independently, so that the two databases do not interfere with each other and each database can be more targeted. This makes the analysis more focused, which improves analysis efficiency, accuracy, and speed.
  • the smart device (such as a smart speaker) is preset to receive and analyze the evaluation voice only during the playback of the reply voice or within a time window after the playback ends, so the energy consumption of the smart device can be effectively reduced. At the same time, since the smart device uses a dedicated database for analyzing the evaluation voice, the processing efficiency can be effectively improved and more accurate analysis results can be obtained.
  • the database used for analyzing the evaluation voice is located on the smart device side, and the smart device, during the playback of the reply voice or within the time window after the playback ends, analyzes the received voice based on the database for analyzing the evaluation voice and determines whether the feedback information carried by the evaluation voice is negative feedback information or positive feedback information.
  • because the database for analyzing the evaluation voice is located on the smart device side, the analysis can be completed locally on the smart device (the interaction with the server or the terminal is omitted). This reduces the delay, so that the analysis result is obtained quickly and can then be used to adjust the smart device.
  • the current reply voice can be interrupted in time, or the redundancy or playback duration of the current reply voice can be adjusted in time (for the specific adjustment method, refer to the foregoing embodiment), thereby improving the user experience.
  • the command voice group is divided by the way of command theme, and the command theme includes one or more of life command, work command, and study command.
  • the command voice group may be divided according to the topic of the command, for example, into one or more of life commands, work commands, and study commands. Accordingly, a life command voice group, a work command voice group, and a study command voice group are obtained.
  • command voices such as “what time is it now”, “the weather today”, and “how to wash hands in seven steps” belong to the command voices in the life command voice group.
  • the command voices such as "the meaning of waiting for the rabbit", “the twenty-four solar terms", and "the origin of the ln function" belong to the command voices in the learning command voice group.
  • command voices such as "PPT preparation method” and "how to make a good work plan" belong to the command voices in the work command group.
  • the smart device can reply to multiple command voices belonging to the same command voice group with similar playback duration and/or redundancy, which saves the user the trouble of sending evaluation voices to adjust the reply voices of some or all of the command voices in the same command voice group.
  • to spare the user from sending negative evaluation voices many times for the reply voices of different command voices, this processing method adjusts the playback duration and/or redundancy of the reply voices corresponding to all or some of the command voices in the same command voice group. When the user issues other command voices in the same group, reply voices with a shorter playback duration and/or lower redundancy are then also obtained, so the user avoids sending negative evaluation voices multiple times for the reply voices of different command voices in the same group, which improves the user experience.
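One way to picture the group-wide adjustment is a table that stores a playback duration limit per command voice group rather than per command. The sketch below is illustrative only: the `COMMAND_GROUPS` mapping, the function names, and the choice to use the already-played duration as the new limit are assumptions, and commands not listed in the mapping are treated as their own single-member group.

```python
# Hypothetical mapping from command voice (as text) to its command voice group.
COMMAND_GROUPS = {
    "what time is it now": "life",
    "the weather today": "life",
    "PPT preparation method": "work",
    "how to make a good work plan": "work",
}


def group_of(command):
    """Group of a command; unknown commands form their own group."""
    return COMMAND_GROUPS.get(command, command)


def record_negative_feedback(limits, command, played_s):
    """Return new limits in which the command's whole group is capped at played_s."""
    group = group_of(command)
    updated = dict(limits)
    updated[group] = min(played_s, limits.get(group, float("inf")))
    return updated


def duration_limit(limits, command):
    """Maximum playback duration (seconds) for replies to this command."""
    return limits.get(group_of(command), float("inf"))


# A negative evaluation arrives 6 s into the reply to "what time is it now".
limits = record_negative_feedback({}, "what time is it now", 6.0)
```

After this single evaluation, `duration_limit(limits, "the weather today")` is also 6.0, while commands in the work group remain unrestricted.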
  • adjusting the reply voice according to the evaluation voice includes:
  • the reply voice is adjusted according to the prompt information carried in the evaluation voice; wherein the prompt information is used to prompt an adjustment strategy for the reply voice.
  • the reply voice may be adjusted directly according to the prompt information carried in the evaluation voice.
  • the prompt information can be, for example: "play a reply voice related to a sports theme"; "keep the playing time within 3-6 s"; "shorten the playing time"; "make the playing time more than 10 s"; "make the playing time longer"; "keep the redundancy below 0.5"; or "keep the redundancy above 0.5".
  • the length of subsequent reply voices to the same command voice may be shortened, or the length of subsequent reply voices to all or part of the command voices issued by the user may be shortened.
  • when the evaluation voice carries duration condition information such as "I hope the length of the reply voice is controlled within 5s", the duration condition information can be extracted, and subsequent reply voices for the same command voice can be processed according to it. The length of subsequent reply voices for all or some of the command voices issued by the user can also be shortened accordingly.
  • a new reply voice can be substituted. For example, suppose the reply voice is "It's already 3:00 a.m., it's getting late, go to bed early. I know you're working hard, I've been wishing you well, keep going tomorrow!" and the evaluation voice is "I like football themes". The reply voice can then be replaced with a new one, for example: "It is 3:00 in the morning, and the final between Barcelona and Real Madrid is at 5:00 in the morning, please remember to watch it in time!".
  • the reply voice is adjusted according to the prompt information carried in the evaluation voice, including:
  • when the prompt information is used to prompt reducing or increasing the playback duration and/or redundancy of the reply voice, the playback duration and/or redundancy of the reply voice is reduced or increased according to the prompt information;
  • the new reply voice is replaced according to the prompt information.
  • when the prompt information prompts reducing or increasing the playback duration and/or redundancy of the reply voice, the playback duration and/or redundancy of the reply voice is reduced or increased according to the prompt information.
  • when the prompt information prompts replacing the reply voice with a new one, the new reply voice is substituted according to the prompt information.
  • the prompt information includes target playback duration information and/or target redundancy information, and/or, the prompt information includes target extended theme information;
  • Reducing or improving the playback duration and/or redundancy of the reply voice according to the prompt information includes:
  • according to the target playback duration information and/or target redundancy information carried in the prompt information, the playback duration and/or redundancy of the reply voice is reduced or increased;
  • a new reply voice with the target extended topic information is replaced.
  • the playback duration and/or redundancy of the reply voice is reduced or increased.
  • when the prompt information carries target playback duration information such as "I hope the length of the reply voice is controlled within 5s", the target playback duration information can be extracted, and the length of subsequent reply voices for the same command voice can be determined according to it. The length of subsequent reply voices for all or some of the command voices issued by the user may also be shortened accordingly.
  • the new reply voice with the target extended topic information is replaced according to the target extended topic information carried in the prompt information.
  • for example, suppose the reply voice is "It's already 3:00 in the morning, it's getting late, go to bed early. I know you are working hard, I've been wishing you well, keep going tomorrow!" and the evaluation voice is "I like football themes". The reply voice can then be replaced according to the target extended topic information (football) carried in the prompt information, for example with: "It's 3:00 a.m., and there is a match between Barcelona and Real Madrid at 5:00 a.m., please remember to tune in in time!".
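To make the idea of extracting prompt information more concrete, the sketch below pulls a target playback duration and a target extended topic out of a transcribed evaluation voice using regular expressions. The two patterns only cover the example phrasings quoted above ("within 5s", "I like football themes") and the output keys `max_duration_s` and `topic` are hypothetical names; a real system would need a far more robust language-understanding step.

```python
import re


def parse_prompt_information(evaluation_text):
    """Extract target duration and target topic from a transcribed evaluation."""
    text = evaluation_text.lower()
    prompt = {}

    # Target playback duration, e.g. "... controlled within 5s".
    duration = re.search(r"within\s+(\d+(?:\.\d+)?)\s*s\b", text)
    if duration:
        prompt["max_duration_s"] = float(duration.group(1))

    # Target extended topic, e.g. "I like football themes".
    topic = re.search(r"i like (\w+) themes?", text)
    if topic:
        prompt["topic"] = topic.group(1)

    return prompt
```

An empty dictionary means that the evaluation voice carried no prompt information this sketch recognizes, in which case the other evaluation-based adjustments would apply.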
  • receiving the evaluation voice for the reply voice includes:
  • the evaluation voice for the reply voice is received within the time window after the reply voice is played.
  • the evaluation voice for the reply voice may be received during the playback of the reply voice, or the evaluation voice for the reply voice may be received within a time window after the reply voice playback ends, It may also be both, which is not limited in this embodiment.
  • the time at which the user expresses the evaluation voice is not limited: the user can freely and flexibly express the evaluation voice during the playback of the reply voice or, as required, within a time window after the playback ends (for example, within 5 s or 10 s after the end).
  • the time window may be set as required, which is not limited in this embodiment.
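The reception condition can be expressed as a simple timestamp check. The sketch below assumes all times are in seconds on a common clock and uses a 5 s window as a default, one of the example values mentioned above; the function name and parameters are hypothetical.

```python
def accepts_evaluation(t, playback_start, playback_end, window_s=5.0):
    """True if a voice received at time t counts as an evaluation voice.

    The voice is accepted during playback of the reply voice or within
    window_s seconds after playback ends.
    """
    return playback_start <= t <= playback_end + window_s
```

Outside this interval, received speech would be handled as an ordinary command voice rather than as an evaluation of the reply voice.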
  • the voice or text analysis database corresponding to the command voice is the first voice or text database; the voice or text analysis database corresponding to the evaluation voice is the second voice or text database; the first voice or text database stores voice or text content related to command analysis; the second voice or text database stores voice or text content related to evaluation analysis.
  • the command voices are generally query-type content such as "what time is it now", "what's the weather like tomorrow", "next week's limited number", "why is the sky blue", and "how many legs does a frog have";
  • the evaluation voices are generally evaluation-type content such as "like", "dislike", "Yes", "No", and "Want to switch to basketball theme".
  • when adjusting the playback duration of the reply voice according to the evaluation voice, the playback speed of the unplayed part of the reply voice may be increased, or part of the content of the unplayed part may be intercepted and played.
  • the advantage of increasing the playback speed of the unplayed part of the reply voice is that it takes the user's playback-time requirement into account while retaining the complete reply voice content; the disadvantage is that the user's listening experience may not be good enough.
  • the advantage of intercepting part of the content of the unplayed part and continuing to play it is that it takes the user's playback-time requirement into account and retains the relatively important content of the unplayed part, while the user's listening experience is also better, without the feeling that the voice has been accelerated and compressed.
  • the advantage of speeding up the playback speed is that the information is not reduced, and at the same time, the playback can be completed in a short time.
  • important or critical content can be intercepted and played from the unplayed part, thus avoiding loss of later but more effective information in the reply information.
  • for example, suppose the reply voice is "The weather is sunny, the sun is shining, the temperature is 15-20, the wind is 4-5, it is not suitable for going out to play or mountain climbing", and suppose that the playback of the reply voice is interrupted at "the weather is sunny".
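The two options for the unplayed part, speeding up playback or intercepting its important content, might be sketched as follows. The segment durations, the flags marking which segments count as important, the time budget, and the cap on the speed factor are all assumptions made for illustration; the embodiment does not specify how importance is determined.

```python
def speed_factor(remaining_s, target_s, max_factor=2.0):
    """Playback speed needed to fit remaining_s of audio into target_s.

    The factor is never below 1.0 and is capped at max_factor.
    """
    if target_s <= 0:
        raise ValueError("target_s must be positive")
    return min(max(remaining_s / target_s, 1.0), max_factor)


def keep_key_segments(segments, budget_s):
    """Keep important segments, in order, while they fit into budget_s.

    segments is a list of (text, duration_s, is_key) tuples.
    """
    kept, used = [], 0.0
    for text, duration_s, is_key in segments:
        if is_key and used + duration_s <= budget_s:
            kept.append(text)
            used += duration_s
    return kept


# Unplayed part of the weather reply; durations and flags are assumed.
unplayed = [
    ("the sun is shining", 1.5, False),
    ("the temperature is 15-20", 2.0, True),
    ("the wind is 4-5", 1.5, True),
    ("it is not suitable for going out to play or mountain climbing", 3.5, True),
]
shortened = keep_key_segments(unplayed, budget_s=4.0)
```

Under these assumed values, interception keeps the temperature and wind segments, whereas speeding up would instead play all 8.5 s of remaining audio at the capped factor.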
  • the redundancy of the unplayed part of the reply voice can also be reduced.
  • as in this embodiment, the redundancy of the unplayed part of the reply voice can also be reduced.
  • the redundancy of the reply voice refers to the ratio of the voice content in the reply voice that is not necessary for replying to the command voice to the total voice content of the reply voice; the redundancy of the unplayed part refers to the ratio of the voice content in the unplayed part that is not necessary for replying to the command voice to the voice content of the unplayed part.
  • the voice content necessary for replying to the command voice can be understood as content directly related to the command voice;
  • the voice content not necessary for replying to the command voice can be understood as content that is not directly related to the command voice but is actively pushed, such as warm reminders, music sharing, one-liners, advertisements, etc.
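The redundancy ratio defined above can be computed once each segment of a reply is labelled as necessary or not. In the sketch below, content is measured by word count, which is an assumption (the text does not fix a unit; playback time or character count would work the same way), and the necessary/unnecessary labels are supplied by hand.

```python
def redundancy(segments):
    """Share of a reply that is not necessary for answering the command.

    segments is a list of (text, is_necessary) tuples; content is
    measured in words (an assumption of this sketch).
    """
    total = sum(len(text.split()) for text, _ in segments)
    if total == 0:
        return 0.0
    extra = sum(len(text.split()) for text, necessary in segments if not necessary)
    return extra / total


# Reply to "what time is it": the time itself is necessary, the reminder is not.
reply = [
    ("It is 3:00 pm", True),
    ("please get up and have a cup of coffee", False),
]
```

Here the reply has 13 words, 9 of which belong to the warm reminder, so `redundancy(reply)` is 9/13, or roughly 0.69.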
  • the reply voice can be processed by reducing the redundancy of the unplayed part of the reply voice.
  • the specific method for reducing redundancy is not limited in this embodiment. It may be a method of determining which content to retain by using preset keywords, a method of determining which content to delete by using preset inefficient words, a method of deleting content that expresses repeated semantics, a method of retaining important information, a method of randomly deleting part of the information, or any other method that reduces redundancy.
  • adjusting the word count of the reply text corresponding to the reply voice includes:
  • the word count of the reply text corresponding to the unplayed part of the reply voice is reduced.
  • this processing method emphasizes the number of words in the reply text, that is, this processing method adjusts the length of the reply voice by adjusting the number of words in the reply text.
  • the word count condition here can be set as needed. For example, part of the text content can be selected from the unplayed part of the reply text according to the word count condition, and the selection method can be sequential or random. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, a specific description is omitted here.
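Sequential selection under a word count condition could look like the following sketch, which keeps whole clauses from the unplayed reply text in order until the word budget would be exceeded. Splitting the text into clauses beforehand and counting whitespace-separated words are assumptions of this illustration.

```python
def trim_to_word_count(clauses, max_words):
    """Keep leading clauses, in order, while the total stays within max_words."""
    kept, used = [], 0
    for clause in clauses:
        n = len(clause.split())
        if used + n > max_words:
            break
        kept.append(clause)
        used += n
    return kept


# Unplayed reply text, already split into clauses (assumed).
unplayed_text = [
    "the temperature is 15-20",
    "the wind is 4-5",
    "it is not suitable for going out to play",
]
```

With a budget of 8 words, `trim_to_word_count(unplayed_text, 8)` keeps the first two clauses; a random selection variant would sample clauses instead of taking them in order.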
  • the corresponding adjusted reply text or the original unadjusted reply text can also be displayed, so that the user can view the corresponding text, which improves the user experience.
  • for example, when the user has no time to listen to the reply voice because of answering a phone call, cannot hear it clearly because of noise, or has heard it but forgotten it, the corresponding reply text helps the user learn the content of the reply voice.
  • the benefit of displaying the original unadjusted reply text is that, on the one hand, it does not take up the user's time because it is not played, and on the other hand, it gives the user the opportunity to view the full reply content: if the user wants to know the full content of the reply voice, the relevant information can be read from the displayed reply text.
  • adjusting the redundancy of the reply text corresponding to the reply voice includes:
  • the redundancy of the reply text corresponding to the unplayed part of the reply voice is reduced.
  • this processing method emphasizes the redundancy of the reply text, that is, this processing method adjusts the redundancy of the reply voice by adjusting the redundancy of the reply text.
  • the redundancy condition here can be set as required. For example, part of the text content may be selected from the unplayed part of the reply text according to the redundancy condition, and the selection may be sequential or random. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, a specific description is omitted here.
  • the corresponding adjusted reply text or the original unadjusted reply text can be further displayed for the user to view, which improves the user experience.
  • an implementation manner is to determine the first duration for which the reply voice has been played when the evaluation voice occurs, and to control the playback duration of subsequent reply voices corresponding to command voices identical to the command voice to be less than or equal to the first duration.
  • the redundancy of the reply text corresponding to the unplayed part of the reply voice is maintained or improved.
  • if a positive evaluation voice for the reply voice is received during the playback of the reply voice, it means that the user enjoys the reply voice or prefers reply voices with a longer playback duration or higher redundancy. In this case, the playback speed of the unplayed part of the reply voice may be maintained or reduced; or the redundancy of the unplayed part may be maintained or increased; or the redundancy of the reply text corresponding to the unplayed part may be maintained or increased, so as to meet the user's voice interaction needs.
  • an implementation manner is to determine the first duration for which the reply voice has been played when the evaluation voice occurs, and to control the playback duration of subsequent reply voices corresponding to command voices identical to the command voice to be less than or equal to the first duration. Since the user sends the evaluation voice when the reply voice has been played for the first duration, the first duration is taken to be the maximum length the user can accept, and reply voices exceeding this length are not willingly accepted by the user. With this as the condition, the playback duration of subsequent reply voices corresponding to identical command voices is controlled to be less than or equal to the first duration, so as to satisfy the user's requirement for the playback duration of the reply voice.
  • For example, assuming that the complete playback time of a reply voice is 15s and the user's evaluation voice is received when the reply voice has played for 6s,
  • 6s can be used as a threshold, and the playback duration of the subsequent reply voice corresponding to the same command voice is controlled to be less than or equal to 6s.
  • control of the playback duration of the subsequent reply voice corresponding to the command voice that is the same as the command voice is less than or equal to the first duration, including:
  • when the playback duration of the subsequent reply voice corresponding to the same command voice is controlled to be less than or equal to the first duration,
  • there are multiple implementations, such as: A. controlling the subsequent reply voice corresponding to the same command voice to stop playing when its playback duration is less than or equal to the first duration; or, B. intercepting part of the content of the subsequent reply voice corresponding to the same command voice and playing only that part; or, C. selecting, from the reply voice library corresponding to the command voice, a reply voice whose playback duration is less than or equal to the first duration as the subsequent reply voice corresponding to the same command voice; or, D. increasing the playback speed of the subsequent reply voice corresponding to the same command voice.
  • the advantage of the above method A is that it is simple and convenient to control, and only needs to stop playing when the playback duration of the reply voice is less than or equal to the first duration.
  • the advantage of the above method B is that it is more flexible, for example, relatively important information in the reply voice can be intercepted and played as needed.
  • the advantage of the above method C is that there is no need to adjust the reply voice in the reply voice library, which is simple and convenient to implement, and the reply voice whose playback duration meets the requirements can be directly selected as the response.
  • the advantage of the above method D is that the information content of the reply voice is not lost, and at the same time, the effect of shortening the playing time can be satisfied.
  • adjusting the redundancy of the subsequent reply voice corresponding to the command voice that is the same as the command voice according to the evaluation voice includes:
  • determining the ratio of the first duration, for which the reply voice has been played when the evaluation voice occurs, to the total duration of the reply voice, and controlling the redundancy of the subsequent reply voice corresponding to the same command voice to be less than or equal to the ratio. For example, assuming that the complete playback time of a reply voice is 15s and the user's evaluation voice is received when the reply voice has played for 6s, the first duration accounts for 6/15 = 0.4 of the total duration of the reply voice.
  • the redundancy of the subsequent reply voice corresponding to the same command voice can then be controlled to be less than or equal to this ratio; that is, when the reply voice is subsequently controlled, it can be guaranteed that
  • the proportion of the reply voice that is not directly related to the command voice is less than 0.4.
  • For example, in the reply voice "It is 11 am now, you are tired from work, remember to drink more water, eat more fruit, and stretch; stretching exercises are good for your health",
  • "It is 11 am now" is the content directly related to the command voice, and
  • "you are tired from work, remember to drink more water, eat more fruit, and stretch; stretching exercises are good for your health" is the content not directly related to the command voice.
  • In this case the redundancy of the reply voice is 0.85. Assume that the user's evaluation voice is received when the reply voice has played for 6s, and that the first duration for which the reply voice has been played when the evaluation voice occurs accounts for 0.4 of the total duration of the reply voice.
  • The redundancy of the subsequent reply voice corresponding to the same command voice can then be controlled to be less than or equal to this ratio, so that the proportion of content not directly related to the command voice is less than 0.4; that is, the reply voice can be adjusted to "It is 11 am now, you are tired from work".
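  • The ratio-based redundancy control described above can be sketched as follows. Measuring redundancy by segment duration, the segment durations themselves, and the rule of dropping trailing unrelated segments first are all assumptions for illustration; the application does not fix how redundancy is measured or which content is removed.

```python
# Each segment is (duration in seconds, directly_related_to_command).
Segment = tuple[float, bool]


def redundancy(segments: list[Segment]) -> float:
    """Share of the total duration taken by content not directly related to the command."""
    total = sum(d for d, _ in segments)
    extra = sum(d for d, related in segments if not related)
    return extra / total if total else 0.0


def trim_to_ratio(segments: list[Segment], ratio: float) -> list[Segment]:
    """Drop unrelated segments, last one first, until redundancy <= ratio."""
    kept = list(segments)
    while kept and redundancy(kept) > ratio:
        for i in range(len(kept) - 1, -1, -1):
            if not kept[i][1]:  # last segment that is not directly related
                del kept[i]
                break
    return kept


# Evaluation voice received 6s into a 15s reply: allowed redundancy is 6 / 15.
ratio = 6.0 / 15.0
reply = [(2.0, True), (1.0, False), (4.0, False), (8.0, False)]  # hypothetical durations
trimmed = trim_to_ratio(reply, ratio)
```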
  • adjusting the word count of the reply text of the subsequent reply voice corresponding to the command voice that is the same as the command voice according to the evaluation voice including:
  • this embodiment emphasizes the number of words in the reply text, that is, the processing method adjusts the length of the reply voice by adjusting the number of words in the reply text. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, detailed description is omitted here.
  • adjusting the redundancy of the reply text of the subsequent reply voice corresponding to the command voice that is the same as the command voice according to the evaluation voice includes:
  • this embodiment emphasizes the redundancy of the reply text, that is, this processing method adjusts the redundancy of the reply voice by adjusting the redundancy of the reply text. Since the specific processing manner of this embodiment is similar to that of the above-mentioned embodiment, detailed description is omitted here.
  • adjusting the playback duration of the reply voice corresponding to all or part of the command voice issued by the first user according to the evaluation voice including:
  • the voice interaction processing method further includes:
  • the reply voice is adjusted according to the evaluation voice.
  • the time period information corresponding to the occurrence of the evaluation voice may be determined first, and then, in subsequent time periods corresponding to the time period information, the reply voice may be adjusted according to the evaluation voice.
  • For example, in one time period the user may be more inclined to receive a reply voice with rich content, for example one containing both content directly related to the command voice and content not directly related to the command voice, while in a second time period (such as 8:00-9:00 in the morning) the user is more inclined to receive a short reply voice, for example one containing only content directly related to the command voice. Therefore, even for the same command voice, the user's requirements for the reply voice may differ between time periods.
  • this embodiment first determines the time period information corresponding to when the evaluation voice occurs, and then adjusts the reply voice according to the evaluation voice in a subsequent time period corresponding to the time period information.
  • any one or more adjustment methods in the processing methods 1 to 13 described in the foregoing embodiments may be performed.
  • a day can be divided into multiple time periods, and then the user's adjustment mode for different reply voices in each time period is determined respectively.
  • it can also be divided into 24 time periods in units of 1 hour, and the user's adjustment mode for different reply voices in each time period can be determined respectively, which is not limited in this embodiment.
  • the method before adjusting the reply voice according to the evaluation voice, the method further includes:
  • Determining whether the evaluation voice is a valid evaluation voice specifically includes:
  • determining whether the evaluation voice does not contain a wake-up word, and/or whether the duration of the evaluation voice is less than the first duration, and/or whether the loudness difference between the evaluation voice and the command voice or the reply voice is greater than the first difference; if so, the evaluation voice is determined to be a valid evaluation voice.
  • There are various implementations for determining whether the evaluation voice is a valid evaluation voice. For example, since the evaluation voice is not a command voice, there is no need to wake up the smart device, so the evaluation voice generally does not contain a wake-up word. In this implementation manner, whether the evaluation voice is valid can be determined by checking whether it contains a wake-up word: if the wake-up word is not included, it is a valid evaluation voice; if the wake-up word is included, it is an invalid evaluation voice.
  • whether the evaluation speech is valid may be determined by whether the duration of the evaluation speech is less than the first duration. For example, if it is less than the first duration, it is determined as a valid evaluation voice; otherwise, it is determined as an invalid evaluation voice.
  • the size of the first duration may be set as required, which is not limited in this embodiment.
  • since there is generally a difference in loudness between the evaluation voice and the command voice or the reply voice, in an implementation manner, whether the evaluation voice is valid can be determined by judging whether the loudness difference between the evaluation voice and the command voice or the reply voice is greater than the first difference.
  • For example, if it is greater than the first difference, it is determined as a valid evaluation voice; otherwise it is determined as an invalid evaluation voice.
  • the size of the first difference may be set as required, which is not limited in this embodiment.
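  • The three validity checks above can be combined as in the following sketch. The threshold values, the use of decibels, and the text-based wake-word test are assumptions for illustration; the application leaves the first duration and the first difference to be set as required, and allows the checks to be combined with "and/or".

```python
def is_valid_evaluation(
    text: str,
    duration_s: float,
    loudness_db: float,
    reference_loudness_db: float,  # loudness of the command voice or the reply voice
    wake_words: tuple[str, ...] = ("xiaomei xiaomei",),
    max_duration_s: float = 3.0,        # hypothetical "first duration"
    min_loudness_diff_db: float = 6.0,  # hypothetical "first difference"
    require_all: bool = True,           # "and" vs "or" combination of the checks
) -> bool:
    lowered = text.lower()
    checks = [
        not any(w in lowered for w in wake_words),   # no wake-up word
        duration_s < max_duration_s,                 # short utterance
        abs(loudness_db - reference_loudness_db) > min_loudness_diff_db,
    ]
    return all(checks) if require_all else any(checks)
```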
  • the semantic recognition algorithm corresponding to the command voice is the first semantic recognition algorithm
  • the semantic recognition algorithm corresponding to the evaluation voice is the second semantic recognition algorithm
  • the real-time performance of the second semantic recognition algorithm is lower than that of the first semantic recognition algorithm.
  • since the user is highly sensitive to whether the command voice is responded to in time, the semantic recognition algorithm corresponding to the command voice has a high real-time requirement; since the user is less sensitive to whether the evaluation voice is responded to in time, the real-time requirement of the semantic recognition algorithm corresponding to the evaluation voice is relatively low. Because of this lower real-time requirement, a more accurate and more complex recognition algorithm can be used for the evaluation voice, so that the evaluation meaning contained in the evaluation voice is identified accurately and the reply voice is adjusted more precisely.
  • adjusting the reply voice according to the evaluation voice including:
  • the playback duration and/or redundancy of the reply voice is adjusted according to the length of the command voice.
  • this embodiment does not adjust the reply voice according to the first duration for which the reply voice has been played when the evaluation voice is received, but according to the length of the command voice.
  • For example, when the command voice issued by the user is longer, the playback duration of the corresponding reply voice is also longer; when the command voice issued by the user is shorter, the playback duration of the corresponding reply voice is also shorter.
  • the command voice issued by the user is generally relatively short. Therefore, according to this processing method, the length of the reply voice can be determined relatively simply and effectively.
  • since the length of the command voice is a time value, it can be used directly when adjusting the playback duration; when adjusting the redundancy, an appropriate redundancy can be determined according to a preset relationship between duration and redundancy, and the redundancy is then adjusted accordingly. For example, suppose the preset relationship between duration and redundancy is: when the duration is 2s, the redundancy is 0.1; when the duration is 5s, the redundancy is 0.2; when the duration is 8s, the redundancy is 0.3; and so on.
  • adjusting the playback duration of the reply voice according to the length of the command voice may refer to controlling the playback duration of the reply voice to be less than or equal to the length of the command voice; it may also refer to controlling the absolute value of the difference between the playback duration of the reply voice and the length of the command voice to be within a preset interval.
  • a similar manner may also be adopted, which will not be repeated in this embodiment.
  • the playback duration of the reply voice is adjusted according to the length of the command voice, including:
  • part of the content is intercepted in the unplayed part of the reply voice to continue playing, so that the adjusted total playback duration of the reply voice matches the length of the command voice;
  • the playback speed of the unplayed part of the reply voice is increased according to the length of the command voice, so that the adjusted total playing time of the reply voice matches the length of the command voice.
  • when adjusting the playback duration of the reply voice according to the length of the command voice, there are multiple implementations: for example, ① controlling the reply voice, according to the length of the command voice, to stop playing when its playback duration matches the length of the command voice.
  • the matching includes various situations; for example, the playback duration of the reply voice may be less than or equal to the length of the command voice, or the absolute value of the difference between the playback duration of the reply voice and the length of the command voice may be within a preset range, etc.
  • controlling the reply voice to stop playing when the playback duration matches the length of the command voice has the advantage that the playback duration of the reply voice can be controlled relatively simply and accurately.
  • speeding up the playback speed has the advantage of not reducing the information, and at the same time ensuring that the playback is completed in a short time.
  • intercepting part of the content in the unplayed part of the reply voice and continuing to play it has the advantage that important or key content can be selected from the unplayed part and played, thereby avoiding the loss of information that comes later in the reply but is more useful.
  • the redundancy of the reply speech is adjusted according to the length of the command speech, including:
  • the redundancy of the reply voice is determined according to the length range interval corresponding to the length of the command voice.
  • the redundancy of the reply voice may be determined according to a length range interval corresponding to the length of the command voice.
  • the redundancy of the reply speech is 0.1
  • the length range interval corresponding to the command speech length is (2)
  • the redundancy of the reply voice is 0.2
  • the redundancy of the reply voice is 0.3 and so on.
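  • An interval lookup of this kind can be sketched as follows. The interval boundaries below are hypothetical; only the redundancy values 0.1, 0.2 and 0.3 appear in the text.

```python
from bisect import bisect_left

# Hypothetical upper bounds (in seconds) of the command-length intervals,
# paired with the redundancy assigned to each interval.
UPPER_BOUNDS_S = [2.0, 5.0, 8.0]
REDUNDANCY = [0.1, 0.2, 0.3]


def redundancy_for_command_length(length_s: float) -> float:
    """Look up the redundancy for the interval containing the command length.

    Lengths beyond the last bound reuse the last redundancy value.
    """
    i = bisect_left(UPPER_BOUNDS_S, length_s)
    return REDUNDANCY[min(i, len(REDUNDANCY) - 1)]
```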
  • adjusting the reply voice according to the evaluation voice including:
  • the playback duration and/or redundancy of the reply voice is adjusted.
  • this embodiment adjusts the reply voice neither solely according to the first duration for which the reply voice has been played when the evaluation voice is received, nor solely according to the length of the command voice; instead, the two are combined to adjust the reply voice. For example, the adjustment can be based on the average of the two, or on the minimum of the two. It can be understood that the advantage of combining the two is that it more accurately reflects the user's acceptance of the playback duration of the reply voice, so the playback duration and/or redundancy of the reply voice determined in this way is more in line with user expectations.
  • the playback duration and/or redundancy of the reply voice is adjusted, including any one of the following methods:
  • T represents the target duration
  • T 1 represents the length of the command speech
  • T 2 represents the first duration
  • represents the weight of the command speech
  • represents the weight of the first duration
  • k 1 represents the first duration.
  • specific methods are given for adjusting the playback duration and/or redundancy of the reply voice by combining the length of the command voice and the first duration; for example, the adjustment can be made according to the average of the two.
  • the adjustment can also be made according to the minimum of the two, or according to the sum of the two.
  • the above-mentioned first relationship model or the second relationship model can also be used for adjustment.
  • the advantage of adjusting according to the average of the two is that the average of the length of the command voice issued by the user and the longest playback duration acceptable to the user when the evaluation voice occurs (that is, the first duration) more accurately reflects the user's acceptance of the playback duration of the reply voice. Therefore, the playback duration of the reply voice determined in this way is more in line with the user's expectation.
  • the advantage of adjusting according to the minimum of the two is that determining the playback duration of the reply voice from the minimum makes the reply voice as short and concise as possible, so as to satisfy the user's requirement for a short and concise reply voice.
  • the advantage of adjusting according to the sum of the two is that it can provide the user with as much additional extended information as possible while still basically meeting the user's requirement for the playback duration of the reply voice, so that the reply voice does not seem too monotonous.
  • the advantage of using the above-mentioned first relationship model or second relationship model for adjustment is that different weights can be assigned to the length of the command voice and the first duration as required. For example, if more emphasis is placed on making the playback duration of the reply voice close to the length of the command voice, the weight corresponding to the length of the command voice can be increased; if more emphasis is placed on making the playback duration of the reply voice close to the first duration, the weight corresponding to the first duration can be increased.
  • the above-mentioned first relationship model and second relationship model also set an adjustment coefficient, which is used to adjust the duration appropriately after it has been determined from the length of the command voice and the first duration. For example, when the reply voice should tend to be shorter, the adjustment coefficient can be set to 0.5, and when the reply voice should tend to be longer, the adjustment coefficient can be set to 0.8 or 1, and so on.
  • whether it is the average, the minimum, the sum of the two, or the target duration, these are all time values, which can be used directly when adjusting the playback duration.
  • when adjusting the redundancy, an appropriate redundancy may be determined according to the preset relationship between duration and redundancy, and the redundancy is then adjusted accordingly. For example, suppose the preset relationship between duration and redundancy is: when the duration is 2s, the redundancy is 0.1; when the duration is 5s, the redundancy is 0.2; when the duration is 8s, the redundancy is 0.3; and so on.
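  • The ways of combining the length of the command voice with the first duration can be sketched as below. The average, minimum and sum follow the text directly. The "weighted" branch is only an assumed form, a weighted sum scaled by an adjustment coefficient, because the exact first and second relationship models are not reproduced in this text; the parameter names are likewise hypothetical.

```python
def target_duration(
    command_s: float,        # T1: length of the command voice
    first_s: float,          # T2: first duration
    method: str = "average",
    w_command: float = 0.5,  # assumed weight of the command voice
    w_first: float = 0.5,    # assumed weight of the first duration
    k: float = 1.0,          # assumed adjustment coefficient
) -> float:
    if method == "average":
        return (command_s + first_s) / 2
    if method == "min":
        return min(command_s, first_s)
    if method == "sum":
        return command_s + first_s
    if method == "weighted":
        # Assumed form only, not the model stated in the application.
        return k * (w_command * command_s + w_first * first_s)
    raise ValueError(f"unknown method: {method}")
```

  • For a 2s command voice and a 6s first duration, the average gives 4s, the minimum 2s and the sum 8s.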
  • the command voice includes a wake-up word.
  • if a certain command voice does not contain a wake-up word, it will not be recognized or responded to, thereby reducing interference from irrelevant voices.
  • wake-up words may be designed differently. This embodiment places no requirement on the specific content or length of the wake-up word. Generally speaking, the wake-up word is related to product features or nicknames. In addition, wake-up words should generally not be too long and should be easy to pronounce.
  • the voice interaction processing method provided in this embodiment adjusts the reply voice according to the evaluation voice issued during the playback of the reply voice, so that the adjusted reply voice better matches the user's needs and provides the user with a better voice interaction service experience.
  • Command voice refers to the voice content issued by the user that can trigger the dialogue management (Dialogue Management, DM for short) of the voice interaction device (which can be a smart device, a terminal device, a server, or a combination of several of these). It should be noted that, in a voice interaction device that is woken up by wake-up words, the command voice generally needs to include a wake-up word.
  • Voice interaction device: it can be composed of a smart device, a terminal device and a server.
  • the smart device receives the command voice, the terminal device performs voice recognition, and the server performs dialogue management.
  • the terminal device can also be connected to the smart device, and then the command voice is received by the terminal device, and the server performs voice recognition (which can also be placed in the terminal device), dialog management, and the like.
  • the voice interaction device can also be composed of both a smart device and a server, that is, the smart device receives the command voice, and then the server performs voice recognition and dialogue management.
  • the voice interaction device can also be composed of a smart device alone, that is, the smart device receives the command voice locally and also performs the entire process of voice recognition and dialogue management locally.
  • the voice interaction device may be composed of a smart device and a terminal device, that is, the smart device receives the command voice, and then the terminal device performs processing processes such as voice recognition and dialog management.
  • the voice interaction device may be composed of a terminal device, that is, the terminal device receives the command voice, and then the terminal device performs processing processes such as voice recognition and dialogue management. It can be understood that, the voice interaction device may be composed of one, two, or three of a smart device, a terminal device, and a server, which will not be illustrated one by one in this embodiment.
  • Reply voice refers to the voice played by the voice interaction device in response to a single command voice from the user.
  • the duration of the reply voice refers to the audio length of the reply voice, which is approximately equal to the time required for the reply voice to be played.
  • Evaluation voice refers to the evaluation of the reply voice, for example, use "OK”, “No”, “No”, “Shut up”, “shut up”, etc. to evaluate the reply voice.
  • surveys have found that a voice shorter than a certain threshold is more likely to be an evaluation voice than a command voice.
  • the user's evaluation of the last reply voice can be obtained from the text content (the text database of evaluative elements is much smaller than the dialogue database storing command voices), or from non-content characteristics such as intonation (for example, if the rise and fall of the intonation reaches a certain threshold, the voice is considered to contain evaluative features) or loudness (above a certain threshold, or a loudness difference from the previous sentence greater than a certain threshold).
  • the evaluation voice is not a command voice; that is, it is a voice that cannot directly trigger the "dialogue management" of the voice interaction device.
  • the evaluation voice usually does not include wake-up words (the recognition requirements for evaluation voices are generally lower than those for command voices).
  • the basic principle of the present application is as follows: within a certain time window (for example, 10 seconds) of playing the reply voice, the voice interaction device confirms that the user has fed back the evaluation voice, and then adjusts the reply voice according to the evaluation, such as adjusting the frequency of its occurrence.
  • the voice interaction processing method provided by the present application will be explained and described in detail below with reference to FIG. 3 , FIG. 4 , FIG. 5 , FIG. 6 , and FIG. 7 as well as specific embodiments.
  • the voice interactive system includes a voice interactive terminal (also called a voice interactive device) and a cloud server.
  • the function of the voice interactive terminal is to receive voice information from users.
  • the voice interactive terminal includes smart speakers, smart phones with voice assistant software installed, smart home appliances such as TVs, refrigerators, and air conditioners with voice modules and communication modules, and wearable smart devices such as sports bracelets and smart watches.
  • When the user uses the intelligent voice interaction function, the user first issues a command voice, for example "Xiaomei Xiaomei, what time is it?", where "Xiaomei Xiaomei" is the wake-up word.
  • the voice interactive terminal receives the voice issued by the user through the microphone module and, after preliminary audio processing such as noise reduction and enhancement, determines whether the header of the voice audio data contains a preset wake-up word (for example, whether the header matches the audio waveform corresponding to "Xiaomei Xiaomei"). If it does, the processed voice audio data is uploaded to the cloud server; otherwise, the data is discarded.
  • the voice audio data uploaded to the cloud server passes through the automatic speech recognition module (audio to text) and the natural language processing module (text analysis) in turn, and then enters the dialogue management module, which decides on and feeds back the corresponding reply voice and/or device operation command.
  • the voice interaction terminal receives the reply voice sent from the cloud server and plays it through the speaker module.
  • if the voice interactive terminal continues to record a non-command voice from the user (that is, a voice not intended to command the voice interactive system to perform a function; for example, it can be pure emotional venting, usually does not include a wake-up word, and does not actively wake up the device), the voice is uploaded to the evaluation feature extraction module of the cloud server for evaluation analysis. If the evaluation feature extraction module determines from the text content that the voice is not a command voice but contains the user's evaluation (emotion) of the last reply voice, it outputs the evaluation to the dialogue management module, where it is used to adjust the frequency of occurrence of the previous reply voice.
  • a time window for example, 5 seconds
  • the evaluation feature extraction module is connected to a second text database (the evaluation database of Figure 4) that is different from the database of the dialogue management module.
  • the characteristic elements of the non-text content for example, the duration of the non-command voice, the difference in loudness between the non-command voice and the command voice or the reply voice, etc.
  • the text content can be identified after certain conditions are met.
  • the user is less sensitive to the real-time performance and accuracy of evaluation feature extraction, so it is preferable to use different processing strategies for command voices and non-command voices (for example, different databases can be used, a more complex recognition mode can be adopted for non-command voices, and the real-time requirements can be appropriately relaxed).
  • the implementation subject may be a server or a terminal voice device (in this case, relevant processing such as voice recognition and dialogue management is performed locally).
  • the user issues a command voice, such as "Xiaomei Xiaomei (wake-up word), what time is it", and the reply voice is "It is now... continue to cheer tomorrow!"
  • a voice input of "don't cheer up” from the user is detected. It can be seen from this that the user is not satisfied with the attitude of the reply voice, and then the frequency of occurrence of the reply voice can be reduced in the future.
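  • Adjusting how often a reply occurs can be sketched as a weighted choice among candidate replies, with the weight of a reply lowered after a negative evaluation and raised after a positive one. The class name, the step size and the weight floor are assumptions for illustration; the application only states that the frequency of occurrence is adjusted.

```python
import random


class ReplySelector:
    """Chooses among candidate replies for one command, weighted by past evaluations."""

    def __init__(self, candidates: list[str]):
        self.weights = {c: 1.0 for c in candidates}

    def apply_evaluation(self, reply: str, positive: bool,
                         step: float = 0.5, floor: float = 0.05) -> None:
        w = self.weights[reply]
        # Raise the weight on positive feedback, lower it (down to a floor) on negative feedback.
        self.weights[reply] = w * (1 + step) if positive else max(floor, w * (1 - step))

    def choose(self, rng=random) -> str:
        replies = list(self.weights)
        return rng.choices(replies, weights=[self.weights[r] for r in replies], k=1)[0]
```

  • After a negative evaluation such as the one above, the cheering variant of the reply would be selected less often for the same command voice.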
  • the evaluation feature extraction module is located on the voice interactive terminal instead of the server.
  • the evaluation feature extraction module may extract text content as the judgment criteria for the output evaluation, or may only extract several non-text dimensions such as intonation and loudness as the judgment criteria for the output evaluation.
  • the hardware requirements of the terminal can be reduced.
  • the embodiment of the present application can adjust the strategy of the speech technique according to the user's evaluation voice feedback on the reply speech, so that the adjusted reply speech technique is more in line with the user's habits or needs.
  • the voice interaction processing apparatus includes: a receiving module 21 and a processing module 22, wherein:
  • the receiving module 21 is used to receive the user's evaluation voice for the reply voice during the playback of the reply voice or within a time window after the playback ends; the reply voice is a voice in response to the command voice issued by the user; the command voice is a voice that issues a command;
  • the processing module 22 is configured to determine a dialogue strategy corresponding to the command voice according to the evaluation voice.
  • Scheme 1: receive the user's evaluation voice for the reply voice during the playback of the reply voice; the reply voice is a voice in response to the command voice issued by the user; the command voice is a voice that issues a command; according to the evaluation voice, determine the dialogue strategy corresponding to the command voice.
  • Scheme 2: receive the user's evaluation voice for the reply voice within the time window after the playback of the reply voice ends; the reply voice is a voice in response to the command voice issued by the user; the command voice is a voice that issues a command; according to the evaluation voice, determine the dialogue strategy corresponding to the command voice.
  • since the voice interaction processing apparatus provided in this embodiment can be used to execute the voice interaction processing method described in the foregoing embodiments, its working principles and beneficial effects are similar and are not described in detail here.
  • another embodiment of the present application provides a smart device, where the smart device includes the voice interaction processing apparatus described in the above embodiments.
  • this embodiment provides a smart device including the above-mentioned voice interaction processing apparatus, thereby realizing the above-mentioned voice interaction processing.
  • the smart device may be various smart appliances, such as smart speakers, smart refrigerators, smart rice cookers, smart water heaters, smart TVs, smart washing machines, etc., which are not limited in this embodiment.
  • the intelligent device provided in this embodiment includes the voice interaction processing apparatus described in the above embodiment, its working principle and beneficial effects are similar, so it will not be described in detail here, and the specific content can be referred to the introduction of the above embodiment.
  • another embodiment of the present application provides a terminal device, where the terminal device includes the voice interaction processing apparatus described in the above embodiments.
  • this embodiment provides a terminal device including the above-mentioned voice interaction processing apparatus, thereby realizing the above-mentioned voice interaction processing.
  • the terminal device may be various devices, such as a mobile phone, a pad, a smart watch, a notebook, etc., which is not limited in this embodiment.
  • the terminal device provided in this embodiment includes the voice interaction processing apparatus described in the above embodiment, its working principle and beneficial effects are similar, so it will not be described in detail here. For details, refer to the introduction of the above embodiment.
  • another embodiment of the present application provides a server, where the server includes the voice interaction processing apparatus described in the above embodiments.
  • this embodiment provides a server including the above-mentioned voice interaction processing apparatus, thereby realizing the above-mentioned voice interaction processing.
  • the server may be a cloud server or another server, which is not limited in this embodiment. When it is a cloud server, it has the advantages of fast processing speed and high security.
  • the server provided in this embodiment includes the voice interaction processing device described in the above embodiment, its working principle and beneficial effects are similar, so it will not be described in detail here, and the specific content can be referred to the introduction of the above embodiment.
  • the smart device specifically includes the following: a processor 301, a memory 302, a communication interface 303, and a communication bus 304;
  • the processor 301, the memory 302, and the communication interface 303 communicate with one another through the communication bus 304; the communication interface 303 is used to realize information transmission between various modeling software and related equipment such as an intelligent manufacturing equipment module library;
  • the processor 301 is configured to call a computer program in the memory 302, and the processor implements all steps of the above voice interaction processing method when it executes the computer program.
  • the smart device may be various smart appliances, such as smart speakers, smart refrigerators, smart rice cookers, smart water heaters, smart TVs, smart washing machines, etc., which are not limited in this embodiment.
  • the terminal device specifically includes the following: a processor 401, a memory 402, a communication interface 403, and a communication bus 404;
  • the processor 401, the memory 402, and the communication interface 403 communicate with one another through the communication bus 404; the communication interface 403 is used to realize information transmission between various modeling software and related equipment such as an intelligent manufacturing equipment module library;
  • the processor 401 is configured to invoke a computer program in the memory 402, and the processor implements all steps of the above-mentioned voice interaction processing method when it executes the computer program.
  • the terminal device may be various devices, such as a mobile phone, a pad, a smart watch, a notebook, etc., which is not limited in this embodiment.
  • the server specifically includes the following: a processor 501, a memory 502, a communication interface 503, and a communication bus 504;
  • the processor 501, the memory 502, and the communication interface 503 communicate with one another through the communication bus 504;
  • the processor 501 is configured to call a computer program in the memory 502, and the processor implements all steps of the above-mentioned voice interaction processing method when it executes the computer program.
  • the server may be a cloud server or another server, which is not limited in this embodiment.
  • when it is a cloud server, it has the advantages of fast processing speed and high security.
  • another embodiment of the present application provides a non-transitory computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, all steps of the above-mentioned voice interaction processing method are implemented. For example, when the processor executes the computer program, the following steps are implemented: receiving the user's evaluation voice for the reply voice during the playback of the reply voice or in the time window after the playback ends; the reply voice is the voice in response to the command voice issued by the user; the command voice is the voice that issues the command; according to the evaluation voice, a dialogue strategy corresponding to the command voice is determined.
  • the above-mentioned logic instructions in the memory can be implemented in the form of software functional units and can be stored in a computer-readable storage medium when sold or used as an independent product.
  • the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.
  • the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application. Those of ordinary skill in the art can understand and implement it without creative effort.
  • each embodiment can be implemented by means of software plus a necessary general hardware platform, and certainly can also be implemented by hardware.
  • the above-mentioned technical solutions, in essence, or the parts that contribute to the prior art, can be embodied in the form of software products. The computer software products can be stored in computer-readable storage media, such as ROM/RAM, a magnetic disk, an optical disc, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the voice interaction processing method described in each embodiment or in some part of an embodiment.
  • the terms "installed", "coupled" and "connected" should be understood in a broad sense: a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection, an indirect connection through an intermediate medium, or an internal connection between two components.
  • relational terms such as first and second, etc. are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply the existence between these entities or operations any such actual relationship or sequence.
  • the terms "comprising", "including" or any other variation thereof are intended to cover non-exclusive inclusion, such that a process, method, article or device comprising a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device.
  • an element qualified by the phrase "comprising a ..." does not preclude the presence of additional identical elements in a process, method, article or apparatus that includes the element.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • User Interface Of Digital Computer (AREA)
  • Telephonic Communication Services (AREA)

Abstract

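A voice interaction processing method and apparatus, an electronic device, and a storage medium, relating to the technical field of intelligent processing. The voice interaction processing method includes: receiving an evaluation voice that a user gives for a reply voice during playback of the reply voice or within a time window after playback ends, where the reply voice is a voice responding to a command voice issued by the user and the command voice is a voice that issues a command (101); and determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice (102). Based on the evaluation voice received during playback of the reply voice or within the time window after playback ends, the method adjusts the dialogue strategy for the corresponding command voice, so that the strategy better matches the user's needs and the user gets a better voice interaction experience.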

Description

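Voice interaction processing method, apparatus, electronic device and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202011474827.8, filed on December 14, 2020 and entitled "Voice interaction processing method, apparatus, electronic device and storage medium", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the technical field of intelligent processing, and in particular to a voice interaction processing method, apparatus, electronic device and storage medium.
Background
Voice interaction (Voice User Interface, VUI) refers to the exchange of information between a person and a device through natural speech. At present, many household appliances, smart speakers being a typical example, are equipped with a voice interaction module. The module recognizes the user's command voice and responds to it in spoken form, giving the user a more human-like way of interacting with the machine.
In general, the script design of a good voice interaction system must balance the rational and the emotional: it should give the user useful help and also be reasonably entertaining. For this reason, when designers build scripts for a "Skill" of a voice interaction device, they often provide several differently worded replies with similar meaning for the same command, in order to reduce the so-called "machine feel" and make the device more approachable. However, not every user is satisfied with the script strategy the designer has chosen.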
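Summary
In view of the problems in the prior art, embodiments of this application provide a voice interaction processing method, apparatus, electronic device and storage medium, to solve the problem that reply voices in automatic voice interaction fail to match the user's needs.
To solve the problems in the prior art, the embodiments of this application provide the following technical solutions:
In a first aspect, an embodiment of this application provides a voice interaction processing method, including:
receiving an evaluation voice that a user gives for a reply voice during playback of the reply voice; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command;
determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
In a second aspect, an embodiment of this application provides a voice interaction processing method, including:
receiving an evaluation voice that a user gives for a reply voice within a time window after playback of the reply voice ends; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command;
determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
adjusting, according to the evaluation voice, the frequency with which the reply voice appears in subsequent responses to the command voice.
Further, the reply voice is a reply voice determined by querying a dialogue database based on the command voice issued by the user;
correspondingly, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
querying an evaluation database according to the evaluation voice, determining the feedback information contained in the evaluation voice, and determining the dialogue strategy corresponding to the command voice according to the feedback information;
where the evaluation database and the dialogue database are set up independently, the evaluation database is located on the smart device side, and the evaluation database holds less content than the dialogue database.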
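Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
if it is determined that the evaluation voice contains a keyword with a negative connotation and the keyword relates to reducing the playback duration, reducing the playback duration and/or redundancy of the reply voice that responds to the command voice.
Further, reducing the playback duration and/or redundancy of the reply voice that responds to the command voice specifically includes:
determining a first duration for which the reply voice had been played when the evaluation voice was received, and adjusting the playback duration of the reply voice corresponding to the command voice according to the first duration;
and/or,
determining a first ratio of the first duration, for which the reply voice had been played when the evaluation voice was received, to the total duration of the reply voice, and adjusting the redundancy of the reply voice corresponding to the command voice according to the first ratio.
Further, adjusting the playback duration of the reply voice corresponding to the command voice according to the first duration specifically includes one or more of the following:
controlling the playback duration of the reply voice for subsequent command voices identical to the command voice to be less than or equal to the first duration;
controlling the playback duration of the reply voices for all or some of the command voices issued by a first user to be less than or equal to the first duration, where the first user is the user who issued the command voice;
controlling the playback duration of the reply voices for all or some of the command voices in the same command voice group as the command voice to be less than or equal to the first duration.
Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
if it is determined that the evaluation voice contains a keyword with a negative connotation and the keyword relates to user preference, reducing the frequency with which the reply voice is used as the response to the command voice, or substituting a new reply voice as the response to the command voice.
Further, reducing the frequency with which the reply voice is used as the response to the command voice, or substituting a new reply voice as the response to the command voice, specifically includes:
reducing the usage frequency of the reply voice, which means that when the command voice is responded to in a subsequent period, the probability of selecting the reply voice as the response from the reply voice library corresponding to the command voice is lowered;
or, reducing the usage frequency of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice, which means that in subsequent responses to the command voice, the probability of selecting, from the reply voice library corresponding to the command voice, a reply voice whose playback length and/or redundancy is greater than or equal to that of the reply voice is lowered;
or, selecting from the reply voice library corresponding to the command voice a reply voice different from the reply voice for playback;
or, according to the topic that the user wishes to switch to, as carried in the negative feedback information, selecting from the reply voice library corresponding to the command voice a reply voice that matches the topic for playback.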
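Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
if it is determined that the evaluation voice contains a keyword with a positive connotation and the keyword relates to keeping or increasing the playback duration, keeping or increasing the playback duration and/or redundancy of the reply voice that responds to the command voice.
Further, keeping or increasing the playback duration and/or redundancy of the reply voice that responds to the command voice specifically includes any one or more of the following:
keeping or increasing the playback duration and/or redundancy of the reply voice, where the redundancy of a reply voice is the ratio of the voice content that is not necessary for answering the command voice to the entire voice content of the reply voice;
keeping or increasing the playback duration and/or redundancy of some or all of the reply voices in the reply voice library corresponding to the command voice;
keeping or increasing the playback duration and/or redundancy of the reply voices for all or some of the command voices issued by a first user, where the first user is the user who issued the command voice;
keeping or increasing the playback duration and/or redundancy of the reply voices for all or some of the command voices in the same command voice group as the command voice;
selecting for playback, from the reply voice library corresponding to the command voice, a reply voice whose playback duration and/or redundancy differs from that of the reply voice by an amount within a preset range.
Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
if it is determined that the evaluation voice contains a keyword with a positive connotation and the keyword relates to keeping or increasing the usage frequency, keeping or increasing the frequency with which the reply voice is used as the response to the command voice.
Further, keeping or increasing the frequency with which the reply voice is used as the response to the command voice specifically includes one or more of the following:
increasing the usage frequency of the reply voice, which means that when the command voice is responded to in a subsequent period, the probability of selecting the reply voice as the response from the reply voice library is raised;
increasing the usage frequency of reply voices whose topic is close to that of the reply voice;
increasing the usage frequency of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice, which means that in subsequent responses to the command voice, the probability of selecting, from the reply voice library corresponding to the command voice, a reply voice whose playback length and/or redundancy is greater than or equal to that of the reply voice is raised.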
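Further, determining that the evaluation voice contains a keyword with a negative connotation specifically includes one or more of the following:
determining that the evaluation voice carries first information, the first information being information that matches comment information in a first database, where the first database stores negative comment information;
determining that the evaluation voice carries second information, the second information being information whose meaning is opposite to that of information contained in the reply voice;
determining that the intonation of the evaluation voice matches intonation information in a first intonation library, where the first intonation library stores intonations carrying negative emotion;
determining that the loudness of the evaluation voice is greater than or equal to a first loudness.
Further, determining that the evaluation voice contains a keyword with a positive connotation specifically includes one or more of the following:
determining that the evaluation voice carries third information, the third information being information that matches comment information in a second database, where the second database stores positive comment information;
determining that the evaluation voice carries fourth information, the fourth information being information whose meaning is the same as or similar to that of information contained in the reply voice;
determining that the intonation of the evaluation voice matches intonation information in a second intonation library, where the second intonation library stores intonations carrying positive emotion;
determining that the loudness of the evaluation voice is less than the first loudness.
Further, the voice interaction processing method also includes:
determining the time period information corresponding to the moment at which the evaluation voice is received;
correspondingly, in subsequent time periods corresponding to the time period information, determining the dialogue strategy corresponding to the command voice according to the evaluation voice.
Further, before determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice, the method also includes:
determining whether the evaluation voice is a valid evaluation voice, which specifically includes:
determining whether the evaluation voice contains no wake-up word, and/or determining whether the duration of the evaluation voice is less than a first duration, and/or determining whether the loudness difference between the evaluation voice and the command voice or the reply voice is greater than a first difference; if so, determining that the evaluation voice is a valid evaluation voice.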
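The validity check above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Utterance` type, the wake-up word, both thresholds, and the choice to require all three conditions at once (the text allows "and/or") are assumptions, as is treating the loudness difference as an absolute value.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    text: str           # recognized text of the utterance
    duration_s: float   # length of the utterance in seconds
    loudness_db: float  # measured loudness (the unit is an assumption)


def is_valid_evaluation(
    evaluation: Utterance,
    reference_loudness_db: float,
    wake_word: str = "xiaomei xiaomei",  # hypothetical wake-up word
    max_duration_s: float = 5.0,         # hypothetical "first duration"
    min_loudness_gap_db: float = 3.0,    # hypothetical "first difference"
) -> bool:
    """Return True when the utterance looks like feedback, not a new command."""
    # Condition 1: a new command would start with the wake-up word.
    no_wake_word = wake_word not in evaluation.text.lower()
    # Condition 2: evaluation voices are assumed to be short.
    short_enough = evaluation.duration_s < max_duration_s
    # Condition 3: loudness differs noticeably from the command/reply voice.
    loudness_gap = abs(evaluation.loudness_db - reference_loudness_db)
    return no_wake_word and short_enough and loudness_gap > min_loudness_gap_db
```

A deployment that only needs one or two of the conditions would drop the corresponding terms from the final conjunction.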
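Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
determining the length of the command voice, and adjusting the playback duration of the reply voice according to the length of the command voice, or adjusting the redundancy of the reply voice according to the length of the command voice.
Further, adjusting the playback duration of the reply voice according to the length of the command voice includes:
controlling, according to the length of the command voice, the reply voice to stop playing once its playback duration matches the length of the command voice;
or,
selecting, according to the length of the command voice, part of the unplayed portion of the reply voice to continue playing, so that the total playback duration of the adjusted reply voice matches the length of the command voice;
or,
increasing, according to the length of the command voice, the playback speed of the unplayed portion of the reply voice, so that the total playback duration of the adjusted reply voice matches the length of the command voice.
Further, adjusting the redundancy of the reply voice according to the length of the command voice includes:
determining the redundancy of the reply voice according to the length range into which the length of the command voice falls.
Further, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
determining the length of the command voice, and adjusting the playback duration and/or redundancy of the reply voice according to the length of the command voice and the first duration for which the reply voice had been played when the evaluation voice was received.
Further, adjusting the playback duration and/or redundancy of the reply voice according to the length of the command voice and the first duration for which the reply voice had been played when the evaluation voice was received includes any one of the following:
adjusting the playback duration and/or redundancy of the reply voice according to the average of the length of the command voice and the first duration;
adjusting the playback duration and/or redundancy of the reply voice according to the smaller of the length of the command voice and the first duration;
adjusting the playback duration and/or redundancy of the reply voice according to the sum of the length of the command voice and the first duration;
determining a target duration of the reply voice from the length of the command voice and the first duration using a first relation model or a second relation model, and adjusting the playback duration and/or redundancy of the reply voice according to the target duration; where the first relation model is $T = k_1(\alpha T_1 + \beta T_2)$, in which $T$ is the target duration, $T_1$ is the length of the command voice, $T_2$ is the first duration, $\alpha$ is the weight of the command voice, $\beta$ is the weight of the first duration, and $k_1$ is a first adjustment coefficient;
the second relation model is $T_0 = k_2(\alpha \ln T_1 + \beta \ln T_2)$, in which $T_0$ is the target duration, $T_1$ is the length of the command voice, $T_2$ is the first duration, $\alpha$ is the weight of the command voice, $\beta$ is the weight of the first duration, and $k_2$ is a second adjustment coefficient.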
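The two relation models can be evaluated directly. The sketch below mirrors the formulas; the default weights and adjustment coefficients are placeholder assumptions, since the text does not give values for α, β, k₁ or k₂.

```python
import math


def target_duration_linear(t_cmd: float, t_played: float,
                           alpha: float = 0.5, beta: float = 0.5,
                           k1: float = 1.0) -> float:
    """First relation model: T = k1 * (alpha * T1 + beta * T2)."""
    return k1 * (alpha * t_cmd + beta * t_played)


def target_duration_log(t_cmd: float, t_played: float,
                        alpha: float = 0.5, beta: float = 0.5,
                        k2: float = 1.0) -> float:
    """Second relation model: T0 = k2 * (alpha * ln T1 + beta * ln T2).

    math.log raises ValueError for non-positive inputs, and the result is
    only positive when the weighted logarithms sum to a positive value.
    """
    return k2 * (alpha * math.log(t_cmd) + beta * math.log(t_played))
```

With equal weights and k₁ = 1, the linear model reduces to the plain average of the command length and the first duration, which is one of the simpler options listed above.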
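Further, the time window coincides with at least part of the playback of the reply voice, and at least part of the evaluation voice falls within the interval of the time window that coincides with the playback of the reply voice.
In a third aspect, an embodiment of this application also provides a voice interaction processing apparatus, including:
a receiving module, configured to receive an evaluation voice that a user gives for a reply voice during playback of the reply voice; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command;
a processing module, configured to determine, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
In a fourth aspect, an embodiment of this application also provides a voice interaction processing apparatus, including:
a receiving module, configured to receive an evaluation voice that a user gives for a reply voice within a time window after playback ends; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command;
a processing module, configured to determine, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
In a fifth aspect, an embodiment of this application provides an electronic device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the program, implements the steps of the voice interaction processing method of the first aspect or the second aspect.
In a sixth aspect, an embodiment of this application provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the voice interaction processing method of the first aspect or the second aspect.
As the above technical solutions show, the voice interaction processing method, apparatus, electronic device and storage medium provided by this application adjust the dialogue strategy for a command voice according to the evaluation voice received during playback of the reply voice that responds to the command voice, or within the time window after playback ends, so that the dialogue strategy for the command voice better matches the user's needs and the user gets a better voice interaction experience.
It should be noted that additional aspects and advantages of this application will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of this application.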
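Brief description of the drawings
To explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a voice interaction processing method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a voice interaction flow provided by an embodiment of this application;
FIG. 3 is an interaction diagram of an implementation process of the voice interaction processing method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of the module implementation principle corresponding to the voice interaction processing method provided by an embodiment of this application;
FIG. 5 is a schematic diagram of a voice interaction flow with an evaluation voice provided by an embodiment of this application;
FIG. 6 is an interaction diagram of another implementation process of the voice interaction processing method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of another module implementation principle corresponding to the voice interaction processing method provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a voice interaction processing apparatus provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of a smart device provided by an embodiment of this application;
FIG. 10 is a schematic structural diagram of a terminal device provided by an embodiment of this application;
FIG. 11 is a schematic structural diagram of a server provided by an embodiment of this application.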
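Detailed description
To make the purposes, technical solutions and advantages of the embodiments of this application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of this application without creative effort fall within the scope of protection of this application.
At present, many household appliances, smart speakers being a typical example, are equipped with a voice interaction module. The module recognizes the user's command voice and responds to it in spoken form, giving the user a more human-like way of interacting with the machine.
In general, the script design of a good voice interaction system must balance the rational and the emotional: it should give the user useful help and also be reasonably entertaining. For this reason, when designers build scripts for a "Skill" of a voice interaction device, they often provide several differently worded replies with similar meaning for the same command in order to reduce the so-called "machine feel". However, not every user is satisfied with the script strategy the designer has chosen. This application therefore provides a voice interaction processing method, apparatus, electronic device and storage medium that can give the user targeted reply voices according to the user's needs (or the information or signals the user shows). The method, apparatus, electronic device and storage medium are described in detail below through specific embodiments.
It should be noted that the term "and/or" in the embodiments of this application describes an association between associated objects and covers three cases. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it. In addition, the term "multiple" in the embodiments of this application means two or more, and other quantifiers are similar.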
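FIG. 1 shows a flowchart of the voice interaction processing method provided by an embodiment of this application. Referring to FIG. 1, the method includes:
Step 101: receiving an evaluation voice that a user gives for a reply voice during playback of the reply voice or within a time window after playback ends; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command;
Step 102: determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
In this embodiment, it should be noted that when a user uses a smart device such as a smart speaker, intelligent voice interaction is needed in some scenarios. For example, if the user issues the command voice "What time is it now", the smart speaker replies to it, say with the reply voice "It is 5 p.m. now, the time when the sun sets, and today's sunset is beautiful". The command voice is thus a voice that instructs the smart device to perform a task, and the reply voice is a voice that responds to the command voice.
In this embodiment, the smart device may be a smart household appliance such as a smart speaker, smart TV, smart humidifier or smart refrigerator, a smart wearable device such as a smart watch or smart earphones, or another smart device; this embodiment does not limit this.
It can be understood that in voice interaction between a user and a smart device, the user first issues a command voice that instructs the smart device to perform a corresponding task, and the task is determined by the content of the command voice. For example, the command voice "What time is it now" instructs the smart device to perform the task of looking up the current time.
As shown in the voice interaction flow of FIG. 2, one complete voice interaction mainly goes through the following stages: Automatic Speech Recognition (ASR) → Natural Language Processing (NLP) → Dialog Management (DM) → speech synthesis (Text-To-Speech, TTS). After receiving the command voice, the smart device performs a series of processing steps: ASR converts the command voice into command text; NLP analyzes the command text to obtain the user's intent; DM determines the final reply text; and TTS synthesizes the reply text into the reply voice. Here, converting the command voice into command text by ASR means using automatic speech recognition technology to turn speech into text; because mature speech recognition algorithms can be used for this, this embodiment does not elaborate on it. Analyzing the command text by NLP to obtain the user's intent specifically includes word segmentation of the command text based on natural language processing, then text feature extraction (for example TF-IDF features, or features from a word2vec word-vector model), and then intent classification based on the extracted features.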
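The four-stage flow (ASR → NLP → DM → TTS) can be pictured as a chain of functions. The sketch below only shows how the stages connect: every stage is a stand-in written for this illustration (the "audio" is just UTF-8 text, intent classification is a two-keyword rule, and the reply table is invented), so none of it reflects a real speech engine.

```python
def asr(audio: bytes) -> str:
    """Stand-in ASR: pretend the audio bytes already hold the spoken text."""
    return audio.decode("utf-8")


def nlp(command_text: str) -> str:
    """Stand-in intent classification using a tiny keyword dictionary."""
    text = command_text.lower()
    if "time" in text:
        return "query_time"
    if "weather" in text:
        return "query_weather"
    return "unknown"


def dm(intent: str) -> str:
    """Stand-in dialog management: map an intent to a reply text."""
    replies = {
        "query_time": "It is 3 a.m.",
        "query_weather": "It is sunny today.",
    }
    return replies.get(intent, "Sorry, I did not understand.")


def tts(reply_text: str) -> bytes:
    """Stand-in TTS: return the reply text as bytes instead of audio."""
    return reply_text.encode("utf-8")


def handle_command(audio: bytes) -> bytes:
    """Run one command voice through ASR -> NLP -> DM -> TTS."""
    return tts(dm(nlp(asr(audio))))
```

Keeping each stage behind its own function makes it straightforward to let dialog management consult a per-user strategy later without touching recognition or synthesis.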
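It can be understood that intent recognition assigns a sentence or query to an intent category by classification. For example, if the voice interaction module on a smart device has only 50 interaction skills, then when a user issues a command voice, the smart device uses intent recognition to assign the user's query to one or several of those skills before further processing. Intent recognition can use rule matching based on a domain dictionary, or an intent classification model. This embodiment does not describe this part further; existing or state-of-the-art intent recognition algorithms can be consulted.
Next, dialog management (DM). Dialog management controls the course of the human-machine dialogue. Task-driven dialog management is in effect a decision process: during the dialogue it decides, from the current state, which action to take next (such as providing a result, asking about a specific constraint, or clarifying or confirming a need), so as to help the user obtain information or services as effectively as possible. In this embodiment, after the user's intent is determined, dialog management determines the final reply text, and the reply text is then synthesized by TTS into the reply voice.
For example, for the command voice "What time is it now", after automatic speech recognition, NLP-based intent analysis and dialog management, the reply text finally determined may be "It is exactly 3 a.m." or "It is already 3 a.m. Can't sleep? Shall I sing you a lullaby" or "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!". The reply text is then synthesized into the reply voice.
In this embodiment, it should be noted that when the command voice is "What time is it now", the device can reply directly with "It is exactly 3 a.m.". To make the interaction more engaging and approachable, chit-chat, humorous or informative scripts are sometimes woven into the reply voice. For example, the reply may be "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!". Such a reply is friendlier and more interactive, but some users dislike such elaborate replies and prefer concise ones, for example "It is exactly 3 a.m." or "It is exactly 3 a.m., it is getting late, go to sleep soon". This embodiment therefore provides a voice interaction processing method in which the user can send an evaluation voice during playback of the reply voice or within a time window after playback ends, so that the smart device (or a terminal device, or a server) determines the dialogue strategy corresponding to the command voice according to the evaluation voice. For example, the usage frequency of the reply voice, or of reply voices related to it, can be adjusted according to the evaluation voice. The playback length or redundancy of the reply voice, or of reply voices related to it, can also be adjusted. The playback of the reply voice can be interrupted, the reply voice can be replayed, or a new reply voice can be substituted, all according to the evaluation voice.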
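In this embodiment, the evaluation voice is a voice in which the user evaluates the reply voice during its playback or within a time window (for example 10-60 s) after playback ends.
It can be understood that this embodiment covers two parallel schemes:
Scheme 1: receiving an evaluation voice that a user gives for a reply voice during playback of the reply voice; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command; determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
Scheme 2: receiving an evaluation voice that a user gives for a reply voice within a time window after playback of the reply voice ends; the reply voice is a voice responding to a command voice issued by the user; the command voice is a voice that issues a command; determining, according to the evaluation voice, a dialogue strategy corresponding to the command voice.
The evaluation voice can therefore be given for the reply voice either during its playback or within the time window after playback ends.
Here, the time window is a period after playback of the reply voice ends; for example, it starts at the moment the reply voice finishes playing and lasts for a preset duration such as 5 s. The user's evaluation voice is monitored and received within this window; once the window has passed, evaluation voices are no longer monitored or received. This makes the reception of evaluation voices more targeted and avoids confusing an evaluation voice with the next new command voice.
In general, the time window starts at the moment the reply voice finishes playing. As a special case, however, the time window may coincide with at least part of the playback of the reply voice, with at least part of the evaluation voice falling within the interval of the time window that coincides with the playback. For example, if the reply voice plays from 14:02:00 to 14:02:55, the time window may be 14:02:40 to 14:03:00. The time window and the playback then overlap in the interval 14:02:40 to 14:02:55, and at least part of the evaluation voice falls within this overlap. The advantage is that the user's utterance can be reliably taken as an evaluation of this reply voice rather than a new command voice, which improves the recognition rate of evaluation voices. In this embodiment, the evaluation database and the dialogue database are set up independently and the evaluation database holds less content than the dialogue database. When at least part of the evaluation voice falls within the overlap, the utterance can be identified as an evaluation of the reply voice and analyzed against the evaluation database specifically, which improves both the recognition rate and the recognition efficiency.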
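The window logic can be illustrated with plain interval arithmetic. In this sketch, times are seconds measured from the start of playback, and the function names and the three labels are assumptions made for the illustration; the numbers reproduce the 14:02:00 to 14:02:55 playback and 14:02:40 to 14:03:00 window from the example.

```python
from typing import Optional, Tuple

Interval = Tuple[float, float]  # (start_s, end_s), with end > start


def interval_overlap(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlapping part of two intervals, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def classify_utterance(utterance: Interval, playback: Interval,
                       window: Interval) -> str:
    """Decide how an utterance relates to the evaluation time window."""
    if interval_overlap(utterance, window) is None:
        return "outside_window"          # treat as a possible new command
    shared = interval_overlap(playback, window)
    if shared is not None and interval_overlap(utterance, shared) is not None:
        return "in_overlap"              # spoken while the reply still plays
    return "in_window_only"              # spoken after playback has ended


# Playback 14:02:00-14:02:55 and window 14:02:40-14:03:00, as offsets in s.
PLAYBACK: Interval = (0.0, 55.0)
WINDOW: Interval = (40.0, 60.0)
```

An utterance labelled `in_overlap` would be routed to the smaller evaluation database first, which is where the efficiency gain described above comes from.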
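It should be noted that an evaluation voice can be positive or negative. In general, a user who is satisfied with the current reply voice, approves of it, or wants to explore further gives a positive evaluation voice. A user who is dissatisfied with the current reply voice or has a definite objection gives a negative one.
Evaluation voices are usually short. Negative evaluation voices may include: not good, don't like it, too long, too complicated, disturbing, No, Bad, Stop, and so on. For example, when the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the user dislikes it, the evaluation voice may be "not good", "don't like it", "too long", "disturbing", "No", "Bad" or "Stop".
Positive evaluation voices may include: great, not bad, pretty good, like it, Yes, Good, Like, and so on. For example, when the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the user likes it, the evaluation voice may be "like it", "Good" or "Yes".
In some cases the evaluation voice can also be a longer sentence that provides richer feedback. For example: "I don't like such a complicated answer, just tell me what time it is", or "Please don't include any redundant information", or "I don't like sports news, please give me some trending news about films".
In this embodiment, when the user evaluates the reply voice that responds to a command voice and issues an evaluation voice, the smart device (or a terminal device, or a server) determines the dialogue strategy corresponding to the command voice according to the evaluation voice. Here, the dialogue strategy is the strategy for replying or responding to the command voice, for example: responding with brief content, responding with rich content, or responding with different themes (such as with light and lively music, with a story, or with a news feed).
In this embodiment, it can be understood that during or after playback of the reply voice, a user who is satisfied or dissatisfied can give feedback through an evaluation voice, so that the smart device (or a terminal device, or a server) adjusts the playback duration or redundancy of the reply voice itself, or how often the reply voice is used, according to the evaluation voice.
It can be understood that when the evaluation voice is given during playback of the reply voice, the reply voice currently playing and/or the next (or subsequent) reply voices can be adjusted according to it. When the evaluation voice is given after playback ends, the next (or subsequent) reply voices can be adjusted according to it.
Adjusting the next (or subsequent) reply voices here may include adjusting the reply to the same command voice, adjusting the reply to similar command voices issued by the same user or by different users, adjusting the replies to some or all of the command voices issued by the same user, or adjusting the replies to the same or different command voices issued by the same user or different users in the same time period; this embodiment does not limit this.
In addition, in this embodiment, adjusting the reply voice according to the evaluation voice may mean adjusting its playback duration, adjusting its redundancy, or both. It may also mean substituting a new reply voice, raising or lowering the usage frequency of the reply voice, or stopping playback of the reply voice; this embodiment does not limit this.
It can also be understood that the playback duration or redundancy of a reply voice can be adjusted in real time on each occasion, or adjusted once and stored for direct use afterwards.
There are also several ways to adjust the playback duration: shortening the content of the reply voice, speeding up its playback, or both. The duration already played when the evaluation voice occurs can also be used to infer the user's requirement for reply length, so that all or some of that user's later command voices are answered with reply voices chosen to match that length requirement.
The voice interaction method provided by this embodiment thus allows the reply voice to be adjusted by sending an evaluation voice during or after its playback, for example by adjusting the duration of the current or next reply voice or by substituting a different reply voice, so that the duration or content of the reply voice better matches the user's needs and the user gets a better voice interaction experience.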
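In this embodiment, it should be noted that an evaluation voice can be positive or negative. When it is positive, the current reply voice can be kept, or optimization can proceed in the same or a similar direction with respect to the duration, redundancy, or extended-topic category of the current reply voice. For example, if the current reply voice is rich in content with a lot of extended information (that is, high redundancy) and the evaluation voice is positive, the current redundancy can be kept or increased. If the current reply voice has a long playback duration and the evaluation voice is positive, the current playback duration can be kept or increased. If the extended information in the current reply voice is on a running topic and the evaluation voice is positive, the running topic can be kept, or a similar extended topic such as yoga can be added.
In this embodiment, it should be noted that when the evaluation voice is negative, optimization can proceed in the opposite or a different direction with respect to the duration, redundancy, or extended-topic category of the current reply voice. For example, if the current reply voice is rich in content with a lot of extended information (that is, high redundancy) and the evaluation voice is negative, the redundancy of the reply voice can be reduced. If the current reply voice has a long playback duration and the evaluation voice is negative, the playback duration can be reduced. If the extended information in the current reply voice is on a sports topic and the evaluation voice is negative, the extended topic can be changed to, for example, a lifestyle topic.
As stated above, a positive evaluation voice may be a voice containing positive evaluation words such as: great, not bad, pretty good, like it, Yes, Good, Like. For example, when the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the user likes it, the evaluation voice may be "like it", "Good" or "Yes".
As stated above, a negative evaluation voice may be a voice containing negative evaluation words such as: not good, don't like it, too long, too complicated, disturbing, No, Bad, Stop. For example, when the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the user dislikes it, the evaluation voice may be "not good", "don't like it", "too long", "disturbing", "No", "Bad" or "Stop".
A positive evaluation voice may also be a voice that repeats the reply voice (or part of it); that is, a user who agrees with or likes the reply voice repeats it (or part of it) to express that liking.
A positive evaluation voice may also be a voice containing words whose meaning is the same as, similar to, or close to words in the reply voice; that is, a user who agrees with or likes the reply voice expresses that liking with words of the same meaning. For example, when the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the user likes it, the evaluation voice may be "Yes, let's keep it up together!", "Let's work hard together" or "Let's strive together".
A negative evaluation voice may also be a voice containing words whose meaning is opposite to words in the reply voice; that is, a user who dislikes the reply voice expresses that dislike with words of opposite meaning. For example, when the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the user dislikes it, the evaluation voice may be "No keeping it up!", "I don't want to work hard" or "I don't want to strive".
For example, when an evaluation voice such as "too long" is received, an adjustment can be made accordingly. The length of later reply voices for the same command voice can be shortened, or the length of later reply voices for all or some of the command voices issued by this user can be shortened. If the evaluation voice carries a duration condition such as "I want reply voices to be kept within 5 s", that condition can be extracted and used to adjust the length of later reply voices for the same command voice, or of later reply voices for all or some of the command voices issued by this user.
As another example, when an evaluation voice such as "I don't like this topic" is received, a new reply voice can be substituted. Suppose the reply voice is "It is already 3 a.m., it is getting late, go to sleep soon. I know you work hard and I am always wishing you well. Keep it up tomorrow!" and the evaluation voice is "I don't like this topic"; the reply can then be replaced with a new one such as "It is exactly 3 a.m. Let me tell you a bedtime story". The evaluation voice may also carry a hint (for example, that the user likes football), in which case a reply voice matching the football topic can be chosen, such as "It is exactly 3 a.m. Barcelona play Real Madrid at 7 this morning, remember to watch!".
As the above technical solutions show, the voice interaction processing method provided by this application adjusts the dialogue strategy for a command voice according to the evaluation voice received during playback of the reply voice that responds to the command voice, or within the time window after playback ends, so that the dialogue strategy better matches the user's needs and the user gets a better voice interaction experience.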
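Based on the above embodiments, in this embodiment, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
determining the dialogue strategy corresponding to the command voice according to the feedback information carried in the evaluation voice.
In this embodiment, when determining the dialogue strategy according to the evaluation voice, the feedback information carried in the evaluation voice can be determined first, and the dialogue strategy then determined from that feedback. For example, when the feedback information is "redundancy is too high", the dialogue strategy for the command voice can be set to: respond with brief, to-the-point content. When the feedback information is "I would like some chat-style content", the dialogue strategy can be set to: respond with rich content.
This embodiment thus determines the dialogue strategy for the command voice from the feedback information carried in the evaluation voice, so that the adjusted strategy better matches the user's needs and improves the experience of using the smart device.
Based on the above embodiments, in this embodiment, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
adjusting, according to the evaluation voice, the frequency with which the reply voice appears in subsequent responses to the command voice.
In this embodiment, the feedback information carried in the evaluation voice can be used to decide whether the frequency with which the reply voice appears in subsequent responses to the command voice should increase or decrease.
For example, when the evaluation voice contains positive feedback, the probability of using the reply voice as the response to the command voice can be raised afterwards (that is, its frequency increases). When the evaluation voice contains negative feedback, that probability can be lowered afterwards (that is, its frequency decreases, or the reply voice is no longer used).
In one implementation, increasing the usage frequency of the reply voice means that when the command voice is responded to in a subsequent period, the probability of selecting the reply voice as the response from the reply voice library corresponding to the command voice is raised.
In one implementation, reducing the usage frequency of the reply voice means that when the command voice is responded to in a subsequent period, the probability of selecting the reply voice as the response from the reply voice library corresponding to the command voice is lowered.
In this embodiment, the frequency with which the reply voice appears in subsequent responses to the command voice can thus be adjusted directly from the evaluation voice: what the user likes appears more often, and what the user dislikes appears less often or not at all, which better matches the user's needs and improves the user experience.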
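One simple way to realize "raise or lower the probability of selecting a reply" is weighted random selection over the reply voice library. The sketch below is an assumption-laden illustration: the `ReplyLibrary` class, the multiplicative update, and the factor of 2 are all choices made for this example, not details from the application.

```python
import random


class ReplyLibrary:
    """Reply voice library for one command voice, with selection weights."""

    def __init__(self, replies):
        # Every reply starts with the same weight, i.e. equal probability.
        self.weights = {reply: 1.0 for reply in replies}

    def feedback(self, reply: str, positive: bool, factor: float = 2.0) -> None:
        """Scale a reply's weight up on positive, down on negative feedback."""
        if positive:
            self.weights[reply] *= factor
        else:
            self.weights[reply] /= factor

    def probability(self, reply: str) -> float:
        """Probability that `reply` is chosen for the next response."""
        return self.weights[reply] / sum(self.weights.values())

    def choose(self, rng=random) -> str:
        """Draw one reply according to the current weights."""
        replies = list(self.weights)
        return rng.choices(replies, weights=[self.weights[r] for r in replies])[0]
```

Setting a weight to zero instead of dividing it would correspond to the "no longer used" case mentioned above, provided at least one other reply keeps a positive weight.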
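Based on the above embodiments, in this embodiment, the reply voice is a reply voice determined by querying a dialogue database based on the command voice issued by the user;
correspondingly, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
querying an evaluation database according to the evaluation voice, determining the feedback information contained in the evaluation voice, and determining the dialogue strategy corresponding to the command voice according to the feedback information;
where the evaluation database and the dialogue database are set up independently, the evaluation database is located on the smart device side, and the evaluation database holds less content than the dialogue database.
In this embodiment, to improve processing efficiency, the evaluation database and the dialogue database are set up independently, so that the dialogue database used to analyze command voices and the evaluation database used to analyze evaluation voices do not interfere with each other. The content of each database can then be more targeted, which improves the efficiency and accuracy of each analysis.
In this embodiment, it can be understood that the smart device (such as a smart speaker) is preset to receive and analyze evaluation voices only during playback of the reply voice or within the time window after playback ends, which reduces the energy consumption of the smart device. Because the smart device analyzes evaluation voices with a dedicated database, processing is more efficient and the analysis results are more accurate.
In this embodiment, the database used to analyze evaluation voices is located on the smart device side. During playback of the reply voice or within the time window after playback ends, the smart device analyzes the received voice against this database and determines whether the feedback information carried in the evaluation voice is negative or positive. The analysis is therefore completed locally on the smart device (no interaction with a server or terminal is needed), which reduces latency, so that the analysis result is available quickly and can be used to adjust the smart device. For example, when negative feedback in the user's evaluation voice is detected promptly, the current reply voice can be interrupted promptly, or its redundancy or playback duration adjusted promptly (the specific adjustments are described in the foregoing embodiments), which improves the user experience.
Based on the above embodiments, in this embodiment, determining, according to the evaluation voice, the dialogue strategy corresponding to the command voice specifically includes:
if it is determined that the feedback information carried in the evaluation voice is negative feedback information, determining a first dialogue-strategy adjustment direction for the command voice according to a first keyword carried in the negative feedback information, and adjusting the dialogue strategy corresponding to the command voice according to the first dialogue-strategy adjustment direction.
In this embodiment, the first dialogue-strategy adjustment direction is the direction in which the reply voice that responds to the command voice is adjusted, based on the negative feedback information carried in the evaluation voice, to improve the user experience. For example, if the evaluation voice contains a keyword with a negative connotation and the keyword relates to reducing the playback duration, the first dialogue-strategy adjustment direction is to reduce the playback duration and/or redundancy of the reply voice that responds to the command voice. If the evaluation voice contains a keyword with a negative connotation and the keyword relates to user preference, the frequency with which the reply voice is used as the response to the command voice is reduced, or a new reply voice is substituted as the response. A negative connotation here means carrying negative information or meaning such as dissatisfaction, dislike, or objection.
In this embodiment, it can be understood that when the evaluation voice carries negative feedback information, the first dialogue-strategy adjustment direction can be determined from the first keyword carried in the feedback in order to adjust the dialogue strategy precisely. For example, the first keyword can indicate whether the adjustment direction is to shorten the playback duration (reduce redundancy), to reduce the usage frequency of the related reply voice, or some other direction, so that the user's needs are matched more precisely.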
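Mapping the first keyword to an adjustment direction can be done with a small on-device lookup. The keyword lists and direction labels below are illustrative assumptions (English stand-ins for entries that an evaluation database might hold), and plain substring matching is a simplification of whatever matching the device actually performs.

```python
# Hypothetical entries of a small on-device evaluation database.
DURATION_KEYWORDS = ("too long", "redundant", "too complicated")
PREFERENCE_KEYWORDS = ("don't like", "not good", "bad")


def adjustment_direction(evaluation_text: str) -> str:
    """Map the keyword found in a negative evaluation to a direction."""
    text = evaluation_text.lower()
    # Duration-related keywords take priority: shorten or trim the reply.
    if any(keyword in text for keyword in DURATION_KEYWORDS):
        return "shorten_or_reduce_redundancy"
    # Preference-related keywords: use this reply less often or replace it.
    if any(keyword in text for keyword in PREFERENCE_KEYWORDS):
        return "lower_frequency_or_replace"
    return "no_change"
```

Because the lists are short, this check is cheap enough to run while the reply voice is still playing, which fits the local, low-latency analysis described above.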
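Based on the above embodiments, in this embodiment, determining the first dialogue-strategy adjustment direction for the command voice according to the first keyword carried in the negative feedback information specifically includes:
if the first keyword is determined to be a keyword related to reducing the playback duration, determining that the first dialogue-strategy adjustment direction for the command voice is to shorten the playback duration or reduce the redundancy.
In this embodiment, when the first keyword carried in the negative feedback information relates to reducing the playback duration, the user wants shorter or less redundant reply voices, so the first dialogue-strategy adjustment direction for the command voice can be set to shortening the playback duration or reducing the redundancy, to match the user's needs.
In this embodiment, it should be noted that keywords related to reducing the playback duration may include keywords related to reducing redundancy. For example, keywords related to reducing the playback duration may include: "playback takes too long", "the reply is too long", "the reply is redundant", "too long", "redundant", and so on.
In this embodiment, when the first dialogue-strategy adjustment direction for the command voice is determined to be shortening the playback duration or reducing the redundancy, adjusting the dialogue strategy for the command voice according to that direction may include the following processing options:
A. Ending the reply voice, as described below:
It can be understood that when a negative evaluation voice from the user is received during playback of the reply voice, the user dislikes the reply voice or finds it too long. One option is to end the reply voice according to the evaluation voice; that is, the part of the reply voice not yet played when the evaluation voice is received is not played. The user is then no longer troubled by an overly long or disliked reply voice, and the reply voice stops as soon as the evaluation voice is issued. Ending the reply voice here can mean ending playback for good, or pausing playback and resuming it after a restart instruction is received; this embodiment does not limit this.
B. Reducing the playback duration and/or redundancy of the reply voice, as described below:
It can be understood that the redundancy of a reply voice is the ratio of the voice content that is not necessary for answering the command voice to the entire voice content of the reply voice.
It can be understood that when a negative evaluation voice from the user is received, the user dislikes the reply voice or finds it too long. One option is to adjust the playback duration and/or redundancy of the reply voice: shortening the playback duration, reducing the redundancy, or both.
For example, suppose the playback duration of the reply voice is initially 15 s. After a negative evaluation voice for this reply voice is received, the playback duration can be adjusted, for example from 15 s to 5 s. There are many ways to adjust the playback duration: speeding up playback, removing part of the reply voice, or both. When adjusting the reply voice currently playing, the remaining unplayed part can be sped up, or part of the unplayed content can be selected to continue playing. When adjusting the next reply voice, the whole reply voice can be sped up, or part of the whole reply voice can be selected for playback.
For example, take the reply voice "It is 11 a.m. now. Tired from work? Remember to drink more water and eat more fruit. Have a stretch; some stretching exercises are good for your health", whose total playback duration is 15 s. Suppose the evaluation voice is received after 3 s of playback (when "It is 11 a.m. now. Tired from work?" has been played). The playback duration can then be adjusted to 8 s or 6 s (or another value) by speeding up the unplayed part, or part of the unplayed content, such as "Remember to drink more water and eat more fruit", can be selected for playback. The selected content can be chosen at random or in chronological order. For example, an earlier piece and a later piece can be chosen at random, such as "eat more fruit; some stretching exercises are good for your health", or "Remember to drink more water and eat more fruit" can be chosen in chronological order. The length selected can be adjusted as required.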
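The two ways of shortening the reply that is currently playing (speeding up the unplayed part, or keeping only some unplayed segments) come down to simple arithmetic. The sketch below assumes the reply is already split into timed segments; the function names, the segment durations and the in-order selection rule are illustrative assumptions.

```python
from typing import List, Tuple


def speedup_for_target(total_s: float, played_s: float,
                       target_total_s: float) -> float:
    """Playback-rate multiplier for the unplayed part of a reply.

    The unplayed content (total_s - played_s) must fit into the time that
    is left before the target total duration is reached.
    """
    remaining_budget = target_total_s - played_s
    if remaining_budget <= 0:
        raise ValueError("target duration already reached; stop playback")
    return max(1.0, (total_s - played_s) / remaining_budget)


def keep_segments_in_order(unplayed: List[Tuple[str, float]],
                           played_s: float,
                           target_total_s: float) -> List[str]:
    """Keep unplayed segments in order until the remaining time is used up."""
    budget = target_total_s - played_s
    kept = []
    for text, duration_s in unplayed:
        if duration_s > budget:
            break
        kept.append(text)
        budget -= duration_s
    return kept
```

For the 15 s reply interrupted after 3 s, a target of 8 s gives a rate of 12 / 5 = 2.4 for the remainder; with a 6 s target the rate would be 4.0, which may be too fast to be intelligible, so cutting segments is probably the better option there.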
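In this embodiment, it should be noted that the redundancy of a reply voice is the ratio of the voice content that is not necessary for answering the command voice to the entire voice content of the reply voice. The voice content necessary for answering the command voice can be understood as content directly related to the command voice; the content that is not necessary can be understood as content that is not directly related to the command voice but is proactively offered, such as friendly reminders, music sharing, witty remarks, advertisements, and so on.
In this embodiment, it can be understood that reply voices can differ in length and redundancy. Some contain only content directly related to the command voice, while others also contain content the designer proactively offers, such as friendly reminders, witty remarks or even advertisements. Different user groups have different needs: some want a human touch and expect the whole interaction to be natural, lively and varied, while others want brevity and do not wish to receive redundant information unrelated to the command voice. Therefore, after an evaluation voice from the user is received, the redundancy of the reply voice can be reduced to match the user's needs.
In this embodiment, it should be noted that because the redundancy of a reply voice is the ratio of the voice content not necessary for answering the command voice to the entire voice content, reducing the redundancy in practice means reducing the content that is not necessary for answering the command voice.
For example, the reply voice "It is 11 a.m. now. Tired from work? Remember to drink more water and eat more fruit. Have a stretch; some stretching exercises are good for your health" can, by reducing redundancy, be adjusted to "It is 11 a.m. now. Tired from work? Remember to drink more water and eat more fruit", or to "It is 11 a.m. now. Tired from work? Remember to drink more water", or to "It is 11 a.m. now".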
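Redundancy as defined here is a ratio, so it can be computed once the reply is split into essential and non-essential segments. In this sketch the segmentation and the essential flags are given by hand, and content is measured in characters; both are assumptions, since the text does not say how content is measured (duration would work the same way).

```python
from typing import List, Tuple

Segment = Tuple[str, bool]  # (text, is_essential_for_answering)


def redundancy(segments: List[Segment]) -> float:
    """Share of non-essential content in the whole reply, by characters."""
    total = sum(len(text) for text, _ in segments)
    extra = sum(len(text) for text, essential in segments if not essential)
    return extra / total if total else 0.0


def reduce_redundancy(segments: List[Segment], max_ratio: float) -> List[Segment]:
    """Drop trailing non-essential segments until redundancy <= max_ratio."""
    max_ratio = max(0.0, max_ratio)
    kept = list(segments)
    while redundancy(kept) > max_ratio:
        # A positive redundancy guarantees a non-essential segment exists.
        last_extra = max(i for i, (_, essential) in enumerate(kept)
                         if not essential)
        del kept[last_extra]
    return kept


REPLY: List[Segment] = [
    ("It is 11 a.m.", True),
    ("Tired from work?", False),
    ("Drink more water.", False),
]
```

Calling `reduce_redundancy(REPLY, 0.0)` leaves only the time announcement, matching the shortest variant in the example above; a looser limit keeps some of the friendly extras.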
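C. Reducing the word count and/or redundancy of the reply text corresponding to the reply voice, as described below:
It can be understood that when a negative evaluation voice from the user is received during playback of the reply voice, the user dislikes the reply voice or finds it too long. One option is to adjust the word count and/or redundancy of the reply text corresponding to the reply voice. This option is similar in approach to the option of adjusting the playback duration and/or redundancy of the reply voice described above; the main difference is that here the word count and/or redundancy of the reply text is what is adjusted.
It can be understood that this embodiment adjusts the playback duration and/or redundancy of the reply voice by adjusting the word count and/or redundancy of the corresponding reply text. Because the two are essentially similar, no further example is given here; see the examples in the embodiments above.
D. Reducing the playback duration and/or redundancy of the reply voices for all or some of the command voices issued by the first user, as described below:
In this option, the first user is the user who issued the command voice.
It can be understood that when a negative evaluation voice from the first user is received during playback of the reply voice, the first user may find the reply voice too long. It can be inferred that the first user does not wish to receive redundant information unrelated to the command voice, that is, the first user prefers brief, to-the-point reply voices. In this case, to fit the user's needs more closely, the reply voices for all or some of the first user's command voices can be adjusted to a lower playback duration and/or redundancy, so as to meet this user's interaction needs.
In this option, reducing the playback duration and/or redundancy of the reply voices for all or some of the command voices issued by the first user may include any one or more of the following:
after detecting a command voice issued by the first user, selecting from the reply voice library corresponding to the command voice a reply voice whose playback duration is less than a preset duration threshold and/or whose redundancy is less than a preset redundancy threshold.
after detecting a command voice issued by the first user, selecting a reply voice from the reply voice library corresponding to the command voice and adjusting its playback duration; for example, the reply voice can be controlled to stop playing once its playback duration reaches a predetermined threshold. The playback speed of the reply voice can also be controlled so that its playback duration is shortened, or part of the reply voice can be selected for playback so that its playback duration is shortened.
after detecting a command voice issued by the first user, selecting a reply voice from the reply voice library corresponding to the command voice and adjusting its redundancy, for example by removing some or all of the content not directly related to the command voice, thereby reducing the redundancy.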
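The first of these options, picking a reply that is already short and lean enough, is a filter over the reply voice library. The `Reply` record, the thresholds and the fallback to the shortest reply when nothing qualifies are assumptions added for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Reply:
    text: str
    duration_s: float  # playback duration of the synthesized reply
    redundancy: float  # share of content not needed to answer the command


def select_reply(library: List[Reply], max_duration_s: float,
                 max_redundancy: float) -> Reply:
    """Pick a reply under both thresholds, else fall back to the shortest."""
    candidates = [
        reply for reply in library
        if reply.duration_s < max_duration_s and reply.redundancy < max_redundancy
    ]
    if candidates:
        return candidates[0]
    # Assumed fallback: no reply qualifies, so use the shortest available one.
    return min(library, key=lambda reply: reply.duration_s)


TIME_REPLIES = [
    Reply("It is 7 p.m. Would you like a relaxing tune or a comedy sketch?", 6.0, 0.8),
    Reply("It is 7 p.m.", 1.5, 0.0),
]
```

For a user who has given negative feedback, `select_reply(TIME_REPLIES, 3.0, 0.2)` returns the bare time announcement; the other two options in the list would instead take the longer reply and trim or speed it up.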
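It can be understood that the ways of reducing playback duration and redundancy are described in the preceding and following options. What this option emphasizes is that once the first user is found to have sent a negative evaluation voice during playback of some reply voice, the reply voices for all or some of the command voices the first user issues afterwards are adjusted so that their playback duration is below the preset duration threshold and/or their redundancy is below the preset redundancy threshold, so that the interaction better fits the user's requirements for reply duration and/or redundancy. For example, suppose the first user's command voice in some interaction is "What time is it now" and the reply voice is "It is 7 p.m. now. Would you like to hear a relaxing tune or a comedy sketch?", and the first user gives a negative evaluation voice for it. This indicates that the first user does not like receiving redundant information unrelated to the command voice. For all or some of the command voices this user issues afterwards, whether the same "What time is it now" or others such as "How is the weather today" or "How is the traffic from location A to location B", the corresponding reply voices are adjusted so that their playback duration is below the preset duration threshold and/or their redundancy is below the preset redundancy threshold.
It can be understood that the earlier options adjust the reply voice for the same command voice, for example how the reply is adjusted when "What time is it now" is issued again. This option instead targets the first user: the reply voices for all or some of the command voices issued by the first user are adjusted so that the interaction better fits the user's requirements for reply duration and/or redundancy. Of course, when the reply voice for some command voice already meets the first user's requirements without adjustment, no adjustment is needed.
E. Reducing the word count and/or redundancy of the reply texts for all or some of the command voices issued by the first user, as described below:
This option is similar to the one above. The main difference is that it focuses on the word count and/or redundancy of the reply text; that is, the length and/or redundancy of the reply voice is adjusted by adjusting the word count and/or redundancy of the reply text. The word-count condition and/or redundancy condition can be set as required. For example, part of the reply text can be selected according to the word-count condition, either in order or at random. Because the processing is similar to that of the embodiment above, it is not described in detail here.
F. Reducing the playback duration and/or redundancy of the reply voices for all or some of the command voices in the same command voice group as the command voice, as described below:
This option focuses on adjusting the playback duration and/or redundancy of the reply voices for all or some of the command voices in the same command voice group as the command voice.
In this option, command voice groups can be formed in many ways, for example by command topic, by the length and/or complexity of the command voice, or by similarity; the specific grouping is not limited.
For example, command voice groups can be formed by command topic, such as one or more of life commands, work commands and study commands, giving a life command voice group, a work command voice group and a study command voice group. Command voices such as "What time is it now", "Today's weather", "Tomorrow's weather", "Traffic conditions", "Licence plate restrictions" and "Supermarket discounts" belong to the life command voice group; "The meaning of 'marking the boat to find the sword'", "What is a 5G phone" and "The origin of the log function" belong to the study command voice group; and "How to plan time sensibly", "Things to note on a business trip", "How to improve work efficiency" and "What artificial intelligence algorithms are there" belong to the work command voice group.
It can be understood that some users care more about replies to life commands and want them to be colourful, humorous and fun; such users include homemakers and retirees. Some care more about replies to study commands and want them to explain in some detail the stories and principles behind the knowledge; such users include students, scholars and full-time mothers. Others care more about replies to work commands and want them to answer work questions in some detail; such users include working professionals.
It can be understood that different users have different requirements for the playback duration and/or redundancy of the reply voices of different command voice groups. For example, working professionals want detailed replies for the work command group and brief replies for the life command group. When a user wants the reply to "What time is it now" to be brief and to the point, the same user also wants brief replies to other commands in the same group, such as "How is the weather today", "Restricted plate numbers" and "Is a certain route congested".
In this option, when a user's negative evaluation voice shows that the reply voice for a command voice is unwelcome, the user probably wants the reply to that command voice to be brief, without much redundant information. From the analysis of command voice groups above, the user also wants the replies for the whole group containing that command voice to carry little redundant information. To improve the user experience and spare the user from sending repeated negative evaluations for the replies to different command voices in the same group, this option adjusts the playback duration and/or redundancy of the reply voices for all or some of the command voices in the same group, so that the user also receives replies with lower playback duration and/or redundancy when issuing other command voices in that group.
It should be noted that when a smart device has only one user, the voice interaction processing does not need to distinguish users. When a smart device is shared by several users, it does: users can be told apart by voice timbre recognition, and the reply voice is then determined, or adjusted, according to the user's command voice and the processing that applies to that user. For example, suppose user A, a retiree, and user B, a working professional, share one smart device. When they issue the same command voice "What time is it now", their needs differ: user A wants a richer reply with higher redundancy, and user B wants a brief reply with lower redundancy. When several users share a smart device, they can be distinguished by voice timbre, by having the user say a designated phrase (such as a name, nickname or passphrase) before the command voice, or by a specific key press or gesture trigger; this embodiment does not limit this.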
G、降低与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复文本的字数和/或冗余度,具体介绍如下:
在本处理方式中,与上述处理方式类似,主要区别在于本处理方式强调的是回复文本的字数和/或冗余度,也即本处理方式是通过调整回复文本的字数和/或冗余度的方式来调整回复语音的长度和/或冗余度。这里的字数条件和/或冗余度条件可以根据需要进行设定。例如,可以根据字数条件从回复文本中选择部分文本内容,选取的方式可以是顺序的,也可以是随机的。由于本实施例的具体处理方式与上述实施例类似,因此此处不再做具体介绍。
H、降低与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度,具体介绍如下:
在本处理方式中,侧重点在于强调调整与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度。可以理解的是,与指令语音对应的回复语音库中存储的一个或多个回复语音都是与该指令语音对应的回复语音,当用户对其中一个回复语音发出负面语音评价时可能表明用户认为该回复语音的播放时长过长和/或冗余度过高,同时,在一定情况下,也可以反映该用户希望与该指令语音对应的其他回复语音的播放时长也不要过长和/或冗余度也不要过高。为此,在本处理方式中,当接收到用户针对某一指令语音的回复语音的负面语音评价时,调整与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度,从而满足用户对于该指令语音的回复语音播放时长和/或冗余度的需求。举例来说,当用户发出的指令语音是“今天天气如何”时,假设在播放回复语音“今天天气晴朗,温度16-21℃,微风,适合郊外活动,可以考虑外出踏青哦”时收到负面评价语音,则说明该用户只关心与指令语音直接相关的回复内容,不希望被过长的语音干扰。
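Suppose the other reply voices remaining in the reply voice library corresponding to the command voice “How is the weather today” are: ① “The weather is sunny today, 16-21℃, light breeze, the dressing index is 1, long underwear and a jacket are suitable; the air is dry, so remember to drink enough water and eat more fruit”; ② “The weather is sunny today, 16-21℃, light breeze; outdoor running is recommended, and remember to stretch before running to avoid injury”; ③ “The weather is sunny today, 16-21℃; on such a fine and breezy day, follow your heart and read a book or take a spur-of-the-moment trip”; ④ “The weather is sunny today, 16-21℃. Good morning, here is the morning news for you…”.
From the analysis above, a negative voice evaluation indicates that the user only cares about reply content directly related to the command voice and does not want to be disturbed by an overly long voice. In this processing mode, therefore, the playback duration and/or redundancy of all or some of the reply voices in the reply voice library corresponding to the command voice “How is the weather today” can be lowered according to the negative voice evaluation, so as to meet the user's needs. For example, ① can be shortened to “The weather is sunny today, 16-21℃, light breeze, long underwear and a jacket are suitable”; ② to “The weather is sunny today, 16-21℃, light breeze; outdoor running is recommended”; ③ to “The weather is sunny today, 16-21℃”; and ④ to “The weather is sunny today, 16-21℃. Good morning”, and so on.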
I、降低后续时间段内与所述指令语音相同的指令语音对应的回复语音的播放时长和/或冗余度,具体介绍如下:
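It can be understood that a negative voice evaluation from the user indicates that the user dislikes the reply voice or considers it too long. In this case, the playback duration and/or redundancy of the reply voice for command voices identical to the command voice can be reduced in the subsequent time period.
In this processing mode, reducing the playback duration and/or redundancy of the reply voice for command voices identical to the command voice in the subsequent time period may cover two cases: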
①降低后续与所述指令语音相同的指令语音对应的回复语音的播放时长和/或冗余度;
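② when a command voice identical to the command voice is subsequently encountered, selecting from the reply voice library a voice whose playback duration and/or redundancy is lower than that of the current reply voice as the reply voice;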
可以理解的是,对于第①种处理方式,可以在后续播放与所述指令语音相同的指令语音对应的回复语音时可以加快播放速度,进而缩短播放时长。此外,对于第①种处理方式,可以在后续播放与所述指令语音相同的指令语音对应的回复语音时从回复语音中选择部分语音内容进行播放,进而缩短播放时长。举例来说,对于回复语音:“现在是上午11点,工作累了吧,记得多补充水分,多吃水果哦,伸下懒腰,做下伸展运动有利于健康呀”,它的播放时长为15s,通过加快播放速度的方式将播放时长调整为8s或6s(或其他时间),也可以在回复语音中截取部分内容“现在是上午 11点,工作累了吧,记得多补充水分,多吃水果哦”进行播放,可以理解的是,截取的部分内容可以是随机的,也可以是按照时间顺序截取的。比如可以随机截取最前面的一段和最后面的一段,如“现在是上午11点,做下伸展运动有利于健康呀”,也可以是按照时间顺序截取的“现在是上午11点,工作累了吧”。具体截取的长度可以根据需求进行调整。此外,对于第①种处理方式,还可以确定所述负面语音评价发生时所述回复语音已播放的第一时长(这种情况对于负面评价语音是在回复语音播放完成后发出的情况不适用),并控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长。例如,可以控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述第一时长时停止播放;此外,对于第①种处理方式,还可以以预定阈值的方式,控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述预定阈值时停止播放。此外,对于第①种处理方式,还可以以指定区间内的随机阈值的方式,控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述随机阈值时停止播放。例如,所述随机阈值可以位于指定区间3-6s内,例如可以是随机播放到3s时停止,也可以是随机播放到5s时停止,也可以是随机播放到6s时停止等等。此外,对于第①种处理方式,还可以确定所述负面语音评价发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的冗余度小于或等于所述比值。
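For case ②, a voice whose playback duration and/or redundancy is lower than that of the current reply voice can be selected from the reply voice library as the reply voice. In a concrete implementation, every reply voice in the reply voice library can be labeled with its playback duration and redundancy, so that the selection can be made directly from these labels.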
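The duration controls described for case ① (faster playback, stopping at the first duration, stopping at a preset threshold, or stopping at a random threshold inside a specified interval such as 3-6 s) can be illustrated with the sketch below. The function and parameter names are hypothetical, and combining several limits by taking the smallest one is an assumption made here for illustration.

```python
import random
from typing import Optional, Tuple


def speed_factor(total_s: float, target_s: float) -> float:
    """Playback rate needed for a clip of total_s seconds to finish within target_s."""
    if total_s <= 0 or target_s <= 0:
        raise ValueError("durations must be positive")
    # Never slow the clip down: a rate below 1.0 would lengthen playback.
    return max(1.0, total_s / target_s)


def stop_point(total_s: float,
               first_duration_s: Optional[float] = None,
               preset_threshold_s: Optional[float] = None,
               random_interval_s: Optional[Tuple[float, float]] = None,
               rng: Optional[random.Random] = None) -> float:
    """Second at which playback should stop, given whichever limits are supplied."""
    limits = [total_s]
    if first_duration_s is not None:
        limits.append(first_duration_s)
    if preset_threshold_s is not None:
        limits.append(preset_threshold_s)
    if random_interval_s is not None:
        low, high = random_interval_s
        source = rng if rng is not None else random
        limits.append(source.uniform(low, high))
    return min(limits)
```

For the 15 s reply in the example, `speed_factor(15.0, 8.0)` gives the rate needed to finish in 8 s, while `stop_point(15.0, first_duration_s=6.0)` simply cuts playback off at the point where the user previously interrupted.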
J、降低后续时间段内与所述指令语音相同的指令语音对应的回复语音的回复文本的字数和/或冗余度,具体介绍如下:
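It can be understood that this processing mode is similar to the preceding one. The difference is that it focuses on the word count and/or redundancy of the reply text; that is, the length and/or redundancy of the reply voice is adjusted by adjusting the word count and/or redundancy of the reply text. The word-count condition and/or redundancy condition can be set as needed. For example, part of the original reply text can be selected according to the word-count condition, either sequentially or at random. Because the concrete processing is similar to the embodiments above, it is not described again here. The redundancy of the reply text is defined analogously to the redundancy of the reply voice: it is the ratio of the text content (word count) that is not necessary for answering the command voice to the entire text content (word count) of the reply. Here, the text content necessary for answering the command voice can be understood as content directly related to the command voice, while the non-necessary content is content not directly related to it but proactively recommended, such as friendly reminders, music sharing, witty remarks, advertisements, and so on.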
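The redundancy definition above is a simple ratio and can be written out directly. The sketch below assumes the reply text has already been split into segments, each marked as necessary or not for answering the command voice; how that marking is obtained is not specified in the description, and counting characters with `len` (punctuation excluded by the caller) is likewise an assumption.

```python
from typing import List, Tuple


def text_redundancy(segments: List[Tuple[str, bool]]) -> float:
    """Ratio of non-essential characters to all characters.

    segments: (text, is_essential) pairs, where is_essential marks content
    directly related to the command voice.
    """
    total = sum(len(text) for text, _ in segments)
    if total == 0:
        return 0.0
    extra = sum(len(text) for text, essential in segments if not essential)
    return extra / total
```

A reply consisting only of directly related content has redundancy 0, and a reply made up entirely of proactive recommendations has redundancy 1.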
此外,基于上述实施例的内容,在本实施例中,给出一种更为具体和有效的对话策略调整方案,具体说明如下:
在本实施例中,根据所述第一对话策略调整方向,调整对应所述指令语音的对话策略,具体包括:
确定接收所述评价语音时所述回复语音已播放的第一时长,根据所述第一时长调整对应所述指令语音的回复语音的播放时长;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的第一比值,根据所述第一比值调整对应所述指令语音的回复语音的冗余度。
在本实施例中,需要说明的是,本实施例有效利用了“接收所述评价语音时所述回复语音已播放的第一时长”这一信息,使得在对响应指令语音的回复语音的播放时长或冗余度进行调整时,能够有效依据第一时长进行回复语音播放时长的调整或根据第一时长占回复语音总时长的第一比值进行回复语音冗余度的调整。
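It can be understood that, because the user issued the evaluation voice when the reply voice had played for the first duration, the first duration is likely to be the maximum length the user can accept, and reply voices longer than this are unwelcome. This can therefore be used as a condition: the playback duration of the reply voice for subsequent command voices identical to the command voice is controlled to be less than or equal to the first duration, thereby meeting the user's needs regarding the playback duration of the reply voice.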
举例来说,假设一个回复语音完整的播放时长是15s,当在该回复语音播放至6s时接收到了用户的评价语音,则表明该用户针对该指令语音的回复语音的播放时长的需求是在6s或6s以下,因此,可以将6s作为阈值,控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于6s。
类似地,还可以确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与第一用户发出的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值;或,
确定接收所述评价语音时所述回复语音已播放的第一时长,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;或,确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值。由于这部分的原理和前面的原理类似,故此处不再一一赘述。
需要说明的是,通过本实施例的技术方案能够更为准确地根据评价语音对回复语音进行调整,从而使得人机交互过程中的回复语音能够满足用户对于人机交互的需求,从而可以提高用户体验。
基于上述实施例的内容,在本实施例中,根据所述第一时长调整对应所述指令语音的回复语音的播放时长,具体包括:
控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,控制与第一用户发出的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;其中,所述第一用户为发出所述指令语音的用户;
或,控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长。
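This embodiment considers three control scenarios: ① adjusting the reply voice for subsequent command voices identical to the command voice; ② adjusting the reply voices for all or some of the command voices issued by the first user; ③ adjusting the reply voices for all or some of the command voices in the same command voice group as the command voice.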
在本实施例中,所述控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长,包括:
控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述第一时长时停止播放;
或,控制后续与所述指令语音相同的指令语音对应的回复语音在播放时截取部分内容进行继续播放;
或,从与所述指令语音对应的回复语音库中选择播放时长小于或等于所述第一时长的回复语音作为后续与所述指令语音相同的指令语音对应的回复语音;
或,
调高后续与所述指令语音相同的指令语音对应的回复语音的播放速度。
在本实施例中,在控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长时,有多种实现方式,例如可以是:A、控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述第一时长时停止播放;或,B、控制后续与所述指令语音相同的指令语音对应的回复语音在播放时截取部分内容进行播放;或,C、从与所述指令语音对应的回复语音库中选择播放时长小于或等于所述第一时长的回复语音作为后续与所述指令语音相同的指令语音对应的回复语音;或,D、调高后续与所述指令语音相同的指令语音对应的回复语音的播放速度。
由此可见,本实施例给出了多种实现方式,上述方式A的优势在于,控制起来简单方便,只需在回复语音的播放时长小于或等于所述第一时长时停止播放即可。上述方式B的优势在于,比较灵活,例如可以根据需要截取回复语音中相对比较重要的信息进行播放。上述方式C的优势在于,不用对回复语音库中的回复语音进行调整,实现起来简单方便,可以直接选择播放时长满足要求的回复语音作为响应。上述方式D的优势在于,不损失回复语音的信息内容,同时能够满足缩短播放时长的效果。
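As a rough illustration of how modes C and A above might be combined, the sketch below first looks for a stored reply that already fits within the first duration (mode C) and otherwise falls back to cutting the default reply off at the first duration (mode A). The fallback order and all names are assumptions for illustration; the description presents the four modes only as alternatives.

```python
from typing import List, Tuple


def plan_reply(durations_s: List[float],
               default_index: int,
               first_duration_s: float) -> Tuple[int, float]:
    """Return (index of the reply to play, second at which playback stops)."""
    fitting = [i for i, d in enumerate(durations_s) if d <= first_duration_s]
    if fitting:
        # Mode C: play the longest stored reply that still fits, in full.
        best = max(fitting, key=lambda i: durations_s[i])
        return best, durations_s[best]
    # Mode A: keep the default reply but stop it at the first duration.
    return default_index, first_duration_s
```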
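This embodiment emphasizes that once the first user is detected to have issued an evaluation voice during the playback of some reply voice, the reply voices for all or some of the command voices subsequently issued by the first user are adjusted so that their playback duration is less than or equal to the first duration, making the voice interaction better match the user's needs regarding reply voice duration and/or redundancy.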
举例来说,当第一用户在某一次语音交互过程中发出的语音指令为“现在是几点”,回复语音为“现在是晚上7点,您要不要听首放松的曲子或者一段相声”,假设在该回复语音的播放过程中,第一用户在2s(也即播放现在是晚上7点时)发出了评价语音,这表示该第一用户不喜欢接收与指令语音无关的冗余信息,那么后续针对该第一用户发出的所有或部分指令语音,例如可以是上述“现在是几点”的指令语音,也可以是其他指令语音,例如“天气预报”、“洗车指数”等,均会控制其对应的回复语音使得回复语音的播放时长小于或等于2s,从而使得语音交互过程更加符合用户对回复语音时长的需求。
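It can be understood that the preceding processing mode describes how the reply voice is adjusted for one and the same command voice, whereas the present processing mode targets the first user: the reply voices for all or some of the command voices issued by the first user are adjusted, so that the voice interaction better matches the user's needs regarding reply voice duration and/or redundancy, while also sparing the first user the trouble of issuing an evaluation voice for the reply voice of each different command voice.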
在本实施例中,根据所述评价语音调整与第一用户发出的所有或部分指令语音对应的回复语音的冗余度,包括:
确定所述评价语音发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与第一用户发出的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值。
在本实施例中,与上述实施例“控制与第一用户发出的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长”类似,区别主要在于本实施例强调的是回复语音的冗余度,在本实施例中,关于冗余度的阈值为评价语音发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,此外,由于关于冗余度相关的具体原理在其他实施例中已经有较为详细的介绍,因此此处不再赘述。
在本实施例中,根据所述评价语音调整与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长,包括:
确定所述评价语音发生时所述回复语音已播放的第一时长,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长。
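In this embodiment, as described above, the command voice groups can be divided by command topic, for example into one or more of life commands, work commands and study commands. Correspondingly, a life command voice group, a work command voice group and a study command voice group are obtained.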
举例来说,“今天限号号码”、“天气预报”、“七步洗手法”等指令语音属于生活指令语音组中的指令语音。举例来说,“英文单词pop的由来”、“十二生肖的故事”等指令语音属于学习指令语音组中的指令语音。举例来说,“如何成为靠谱的职场人”、“如何做好工作计划”等指令语音属于工作指令组中的指令语音。
可以理解的是,有些用户对于生活指令的回复语音比较重视,希望回复语音较为丰富多彩,内容幽默有趣。这类用户包括小孩、自由职业、全职主妇或老人等,而有些用户对于学习指令的回复语音比较重视,希望回复语音能够较为详细地阐述知识背后的典故、原理等,这类用户包括学生、业务学习爱好者等,此外,还有用户对于工作指令的回复语音比较重视,希望回复语音能够较为详细地阐述针对工作问题的答复,这类用户包括上班人士等。
可以理解的是,由于用户对同一指令语音组中各个指令语音具有相同的播放长度和/或冗余度诉求,因此,将语音指令按照指令语音组的方式进行划分后,则对属于同一指令语音组中的多个语音,智能设备(或终端设备或服务器)可以采用类似播放时长和/或冗余度的回复语音对属于同一指令语音组中的指令语音进行回复,从而省去了用户对于同一指令语音组的部分或全部语音指令的回复语音均发出评价语音进行调整的麻烦。
在本处理方式中,当根据用户发出的评价语音确定某一指令语音对应的回复语音被打断时,表示用户希望该指令语音的回复语音是简短有效的,无需太多冗余信息。根据上面指令语音组的分析可知,用户也希望该指令 语音所在的指令语音组对应的回复语音均没有太多冗余信息,因此,为提高用户使用体验,避免用户针对同一指令语音组中的不同指令语音的回复语音多次发送评价语音,本处理方式使得与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长,使得该用户在发出同一指令语音组中的其他指令语音时,也可以得到播放时长和/冗余度较低的回复语音,从而可以避免用户针对同一指令语音组中的不同指令语音的回复语音多次发送评价语音,从而可以提高用户使用体验。
在本实施例中,根据所述评价语音调整与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的冗余度,包括:
确定所述评价语音发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值。
在本实施例中,与上述实施例“与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长”的原理类似,区别主要在于本实施例强调的是回复语音的冗余度,在本实施例中,控制冗余度时利用的阈值是所述评价语音发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,此外,由于关于回复语音的冗余度调整的具体原理在其他实施例中已经有较为详细的介绍,因此此处不再赘述。
基于上述实施例的内容,在本实施例中,根据所述第一比值调整对应所述指令语音的回复语音的冗余度,具体包括:
控制后续与所述指令语音相同的指令语音对应的回复语音的冗余度小于或等于所述第一比值;
或,控制与第一用户发出的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值;
或,控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值。
基于上述实施例的内容,在本实施例中,根据所述负面反馈信息中携带的第一关键词,确定对应所述指令语音的第一对话策略调整方向,具体 包括:
确定所述第一关键词为与偏好有关的关键词,则确定对应所述指令语音的第一对话策略调整方向为降低所述回复语音的使用频率或更换新的回复语音的方向。
在本实施例中,当根据负面反馈信息确定第一关键词为与偏好有关的关键词时,说明用户对于当前的回复语音不喜欢,因此,可以确定对应所述指令语音的第一对话策略调整方向为降低所述回复语音的使用频率或更换新的回复语音的方向。
在本实施例中,与偏好有关的关键词包括:不喜欢、Don’t like、更换、以后不要出现、换一个等等。
在本实施例中,当根据负面反馈信息确定第一关键词为与偏好有关的关键词时,说明用户对于当前的回复语音不喜欢,此时有多种调整对话策略的方式:
A、从与所述指令语音对应的回复语音库中选择与所述回复语音不同的回复语音进行播放(可以是本次也可以是下次),具体介绍如下:
可以理解的是,当收到用户发送的负面评价语音时,表示用户不喜欢该回复语音或认为该回复语音的长度过长,此时一种处理方式可以是从与所述指令语音对应的回复语音库中选择与所述回复语音不同的回复语音进行播放,也即在收到评价语音时,表示接收到了用户不喜欢该回复语音或嫌该回复语音过长的信息,此时,可以从与指令语音对应的回复语音库中选择其他回复语音替换当前回复语音进行播放。可以理解的是,在从回复语音库中选择其他回复语音时,遵循的原则可以包括但不限于下面几种中的任意一种或多种(多种组合不矛盾的前提下):①以随机的方式选择其他回复语音;②以语音长度小于当前回复语音的标准选择其他回复语音;③以语音内容对应的主题与当前回复语音对应的扩展主题不同为标准选择其他回复语音;④以语音内容对应的声色与当前回复语音对应的声色不同为标准选择其他回复语音(例如,男声变换为女声,或,女声变换为男声,或,成人变换为儿童,或儿童变换为成人等)。⑤以所述评价语音中携带的提示信息为依据选择回复语音(例如提示信息为喜欢足球主题,则在更换新的回复语音时,可以根据评价语音中携带的提示信息选择与足球 主题匹配的回复语音)。
可以理解的是,对于具备语音交互功能的智能设备,其一般具有预设数量的交互技能,当用户向智能设备发出一个指令语音时,智能设备会通过意图识别将用户的指令语音划分到某一个或几个交互技能上,然后再进行后续的处理。需要说明的是,一般情况下,每个交互技能都至少对应有一个回复语音库,当通过意图识别的方式识别出该指令语音的意图后,即可将该指令语音划分到一个或几个交互技能上,由于每个交互技能都至少对应有一个回复语音库,因此可以确定与指令语音对应的一个或多个回复语音库。
可以理解的是,与指令语音对应的一个或多个回复语音库中存储有一个或多个回复语音,这些回复语音可以为语音长短不同的回复语音,也可以为扩展主题不同的回复语音,也可以为声色不同的回复语音,本实施例对此不作限定。
可以理解的是,与指令语音对应的一个或多个回复语音库中存储的一个或多个回复语音属于均能够作为指令语音的回复语音,只是在时间长短、扩展主题、声色等形式或内容上呈现出不同而已。
举例来说,与指令语音对应的回复语音库中存储有不同时长的回复语音,分别为1s,3s,5s,10s,15s,20s,25s,30s,50s的回复语音。
举例来说,与指令语音对应的回复语音库中存储有不同扩展主题的回复语音,扩展主题包括但不限于为信息类的(仅传达信息例如现在是下午3点)、有趣类的(现在是下午3点,要不要听个笑话缓解下心情,笑话内容为:…)、知识类的(现在是下午3点,天气晴朗,下午3点属于大脑神经元比较活跃的时段,可以选择一些记忆类的工作进行处理等等)、故事类的(现在是下午3点,历史上的今天下午3点发生过什么重大事情等)、音乐类的(现在是下午3点,欢迎收听歌手A的一首老歌)、体育类的(现在是下午3点,3点50分CBA北京VS广州队半决赛开始,请不要错过)、对话类的(现在是下午3点,要不要做个猜字谜的游戏等)。
举例来说,与指令语音对应的回复语音库中存储有不同声色的回复语音,例如,对于同一回复语音,可以采用男生、女生、成人和儿童分别进行录制,得到不同声色的回复语音。
可以理解的是,对于上面描述的不同时长、不同扩展主题以及不同声色可以根据需要进行组合,本实施例对此不作限定。
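It can be understood that, in this processing mode, after a reply voice different from the original reply voice has been selected from the reply voice library corresponding to the command voice and played according to the evaluation voice, it can further be determined whether the changed reply voice receives a negative evaluation voice. If it does not, the changed reply voice can be used as the reply voice for subsequent responses to the command voice; if the changed reply voice does receive a negative evaluation voice, a new reply voice can be substituted and played again, until no further negative evaluation voice is received from the user.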
此外,为进一步完善方案,还可以记录当前的时间段,并在确定更改后的回复语音不存在反面评价语音时,选用更新后的回复语音作为所述指令语音的响应,以提高用户满意度。
B、减少所述回复语音的使用频率;其中,减少所述回复语音的使用频率是指在后续时间段内响应所述指令语音时,从与所述指令语音对应的回复语音库中选择所述回复语音作为响应的概率降低,具体介绍如下:
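The focus of this processing mode is that when a reply voice receives a negative evaluation voice, its usage frequency is subsequently reduced. Because the reply voice was not welcome as a reply to the command voice, the likelihood of selecting it in later responses to the command voice is lowered; that is, the probability of selecting it from the reply voice library corresponding to the command voice as the response decreases. One benefit of this processing mode is that the reply voices in the reply voice library do not need to be adjusted or changed; a more suitable or better matching reply voice is simply selected as the response to the command voice, which is simple and convenient to implement.
For example, the reply voice library corresponding to a command voice contains reply voices of different lengths. Some users want reply voices with a longer playback duration and higher redundancy, while others want reply voices with a shorter playback duration and lower redundancy. In this case, the user's feedback on different reply voices can be used to determine which reply voice or voices are subsequently selected for that user in response to the command voice. As described above, a negative evaluation voice for a reply voice indicates that it was not welcome as a reply to the command voice, so its usage frequency is subsequently reduced; that is, the probability of selecting it from the reply voice library corresponding to the command voice as the response decreases.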
C、减少播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率;其中,减少播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率是指在后续响应所述指令语音时,从与所述指令语音对应的回复语音库中选择播放长度和/或冗余度大于或等于所述回复语音的回复语音作为响应的概率降低;
本处理方式与上述处理方式类似,主要区别在于,本处理方式用于减少与所述指令语音对应的回复语音库中播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率,可以理解的是,若回复语音收到负面评价语音,则表明用户不喜欢播放时长和/或冗余度大于或等于所述回复语音的回复语音,因此,后续可以降低播放长度和/或冗余度大于或等于所述回复语音的回复语音作为响应的概率,从而可以更加贴合用户需求。由于本实施例的处理方式与上述实施例类似,故此处不再赘述。
D、根据所述负面反馈信息中携带的用户希望更换的主题,从与所述指令语音对应的回复语音库中选择与所述主题匹配的回复语音进行播放。
例如,假设用户在评价信息中携带了希望更换的主题为“健康主题”,则在进行对话策略调整时,可以根据用户希望更换的主题,从与所述指令语音对应的回复语音库中选择与所述主题匹配的回复语音进行播放,从而可以准确匹配用户需求。
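Options B and C above both amount to lowering the probability with which certain reply voices are selected. One common way to realize this, shown here only as an assumed sketch, is to keep a selection weight per reply voice and scale it down after negative feedback; the weight table, the scaling factor of 0.5 and the function names are all hypothetical. The sampling itself uses `random.choices` from the Python standard library.

```python
import random
from typing import Dict


def scale_weight(weights: Dict[str, float], reply_id: str, factor: float) -> Dict[str, float]:
    """Return a copy of weights with one reply's weight multiplied by factor."""
    if factor < 0:
        raise ValueError("factor must be non-negative")
    updated = dict(weights)
    updated[reply_id] = updated[reply_id] * factor
    return updated


def probabilities(weights: Dict[str, float]) -> Dict[str, float]:
    """Normalize weights into selection probabilities."""
    total = sum(weights.values())
    return {reply_id: w / total for reply_id, w in weights.items()}


def choose_reply(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one reply id at random, proportionally to its weight."""
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=1)[0]
```

With two equally weighted replies, halving the weight of the disliked one lowers its selection probability from 1/2 to 1/3 without touching the stored reply voices themselves. For option C, the same scaling would be applied to every reply whose playback length and/or redundancy is at least that of the disliked reply.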
基于上述实施例的内容,在本实施例中,根据所述第一对话策略调整方向,调整对应所述指令语音的对话策略,具体包括:
降低所述回复语音的使用频率;其中,降低所述回复语音的使用频率是指在后续时间段内响应所述指令语音时,从与所述指令语音对应的回复语音库中选择所述回复语音作为响应的概率降低;
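or, reducing the usage frequency of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice; wherein reducing the usage frequency of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice means that, in subsequent responses to the command voice, the probability of selecting such a reply voice from the reply voice library corresponding to the command voice as the response decreases;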
或,从与所述指令语音对应的回复语音库中选择与所述回复语音不同 的回复语音进行播放;
或,根据所述负面反馈信息中携带的用户希望更换的主题,从与所述指令语音对应的回复语音库中选择与所述主题匹配的回复语音进行播放。
关于该实施例各部分的详细介绍,上述实施例已经给出,具体内容和效果可参见上述实施例的相关内容,此处不再赘述。
基于上述实施例的内容,在本实施例中,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
确定所述评价语音携带的反馈信息为正面反馈信息,则根据所述正面反馈信息中携带的第二关键词,确定对应所述指令语音的第二对话策略调整方向,并根据所述第二对话策略调整方向,调整对应所述指令语音的对话策略。
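In this embodiment, the second dialogue strategy adjustment direction refers to the direction in which the reply voice responding to the command voice is adjusted, according to the positive feedback information carried by the evaluation voice, so as to further maintain or enhance the user experience. For example, if the evaluation voice is determined to contain a keyword with a positive tone and the keyword relates to maintaining or increasing the playback duration, the playback duration and/or redundancy of the reply voice responding to the command voice is maintained or increased. If the evaluation voice is determined to contain a keyword with a positive tone and the keyword relates to maintaining or increasing the usage frequency, the usage frequency of the reply voice as the response to the command voice is maintained or increased. A positive tone here means information or meaning that carries positive feedback such as liking, approval or support.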
在本实施例中,可以理解的是,当所述评价语音携带正面反馈信息时,为实现对话策略的精准调整,还可以根据反馈信息中携带的第二关键词确定第二对话策略调整方向。例如,可以根据反馈信息中携带的第二关键词确定第二对话策略调整方向是保持或提高播放时长(保持或提高冗余度)的调整方向,还是保持或提高相关回复语音使用频率的调整方向,或者是其他调整方向等等,从而可以更加精准匹配用户需求。
基于上述实施例的内容,在本实施例中,根据所述正面反馈信息中携带的第二关键词,确定对应所述指令语音的第二对话策略调整方向,具体包括:
确定所述第二关键词为与保持或提高播放时长有关的关键词,则确定对应所述指令语音的第二对话策略调整方向为保持或提高播放时长的方 向,或,保持或提高冗余度的方向。
在本实施例中,与保持或提高播放时长有关的关键词可以为:时长正好、下次时长可以再适当增加、很喜欢这个长短的回复语音、内容丰富且时间正好等。
可以理解的是,若确定第二关键词为与保持或提高播放时长有关的关键词,则确定对应所述指令语音的第二对话策略调整方向为保持或提高播放时长的方向,或,保持或提高冗余度的方向。
在本实施例,若确定对应所述指令语音的第二对话策略调整方向为保持或提高播放时长的方向,或,保持或提高冗余度的方向,则可以有如下几种实现方式:
A、保持或提高所述回复语音的播放时长和/或冗余度,具体介绍如下:
在本处理方式中,回复语音的冗余度是指回复语音中非回复指令语音所必需的语音内容与回复语音全部语音内容的比值。
可以理解的是,当回复语音收到正面评价语音时,表明用户可能比较认可或比较能接受该回复语音的播放时长和/或冗余度,因此,在一种实现方式中,可以保持所述回复语音的播放时长和/或冗余度。此外,当某一播放时长大于预设阈值的回复语音得到正面评价语音时,表明用户可能比较认可或希望接收较长播放时长或较高冗余度的回复语音,因此在一种实现方式中,还可以提高所述回复语音的播放时长和/或冗余度。由此可见,本实施例能够根据用户的评价语音对回复语音进行调整,从而使得回复语音更加贴合用户习惯或需求。
B、从与所述指令语音对应的回复语音库中选择与所述回复语音的播放时长和/或冗余度的差值在预设范围内的回复语音进行播放,具体介绍如下:
可以理解的是,当回复语音收到正面评价语音时,表明用户可能比较认可或比较能接受该回复语音的播放时长和/或冗余度,因此,在一种实现方式中,可以在后续响应相同的指令语音时,从与所述指令语音对应的回复语音库中选择与所述回复语音的播放时长和/或冗余度的差值在预设范围内的回复语音进行播放,也即从与所述指令语音对应的回复语音库中选择与所述回复语音的播放时长和/或冗余度接近的回复语音进行播放,从而 能够满足用户对回复语音的播放时长和/或冗余度的要求。
C、保持或提高所述回复语音对应的回复文本的字数和/或冗余度,具体介绍如下:
在本处理方式中,与上述处理方式类似,主要区别在于本处理方式强调的是回复文本的字数和/或冗余度,也即本处理方式是通过调整回复文本的字数和/或冗余度的方式来调整回复语音的长度和/或冗余度。这里的字数条件和/或冗余度条件可以根据需要进行设定。由于本实施例的具体处理方式与上述实施例类似,因此此处不再做具体介绍。
D、保持或提高与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度,具体介绍如下:
在本处理方式中,侧重点在于强调调整与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度。
需要说明的是,本处理方式与前述实施例中介绍的降低与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度正好是相反的关系,因此,关于具体原理可按照相反逻辑参见前面实施例的介绍,此处不再赘述。
E、保持或提高后续时间段内与所述指令语音相同的指令语音对应的回复语音的播放时长和/或冗余度,具体介绍如下:
可以理解的是,当回复语音收到正面评价语音时,表明用户可能比较认可或比较能接受该回复语音的播放时长和/或冗余度,因此,在一种实现方式中,可以保持后续时间段内与所述指令语音相同的指令语音对应的回复语音的播放时长和/或冗余度,从而能够满足用户对回复语音的播放时长和/或冗余度的要求。此外,当某一播放时长大于预设阈值的回复语音得到正面评价语音时,表明用户可能比较认可或希望接收较长播放时长或较高冗余度的回复语音,因此在一种实现方式中,还可以提高后续时间段内与所述指令语音相同的指令语音对应的回复语音的播放时长和/或冗余度。由此可见,本实施例能够根据用户的评价语音对回复语音进行调整,从而使得回复语音更加贴合用户习惯或需求。
F、保持或提高后续时间段内与所述指令语音相同的指令语音对应的回复语音的回复文本的字数和/或冗余度,具体介绍如下:
在本处理方式中,与上述处理方式类似,主要区别在于本处理方式强调的是回复文本的字数和/或冗余度,也即本处理方式是通过调整回复文本的字数和/或冗余度的方式来调整回复语音的长度和/或冗余度。这里的字数条件和/或冗余度条件可以根据需要进行设定。由于本实施例的具体处理方式与上述实施例类似,因此此处不再做具体介绍。
G、保持或提高与第一用户发出的所有或部分指令语音对应的回复语音的播放时长和/或冗余度;其中,所述第一用户为发出所述指令语音的用户,具体介绍如下:
在本处理方式中,所述第一用户为发出所述指令语音的用户。
可以理解的是,若回复语音为播放时长大于预设阈值的回复语音,则当回复语音收到第一用户发送的正面语音评价时,表明第一用户可能比较认可或比较能接受较长播放时长和/或较高冗余度的回复语音,因此,在一种实现方式中,可以在后续响应第一用户发出的所有或部分指令语音时,保持或提高回复语音的播放时长和/或冗余度,从而能够满足用户对回复语音的播放时长和/或冗余度的要求。
H、保持或提高与第一用户发出的所有或部分指令语音对应的回复文本的字数和/或冗余度,具体介绍如下:
在本处理方式中,与上述处理方式类似,主要区别在于本处理方式强调的是回复文本的字数和/或冗余度,也即本处理方式是通过调整回复文本的字数和/或冗余度的方式来调整回复语音的长度和/或冗余度。这里的字数条件和/或冗余度条件可以根据需要进行设定。
I、保持或提高与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长和/或冗余度,具体介绍如下:
在本处理方式中,侧重在于调整与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长和/或冗余度。
在本处理方式中,指令语音组的划分方式可以是多种多样的,比如可以按照指令主题进行划分,也可以按照指令语音的长短和/或复杂度进行划分,还可以按照相似度进行划分等等,对于具体划分方式,不作限定。
举例来说,所述指令语音组可以以指令主题的方式进行划分,例如,可以按照生活指令、工作指令、学习指令中的一种或多种进行划分。相应 地,得到生活指令语音组、工作指令语音组和学习指令语音组。例如,“现在是几点”、“今天天气”、“明天天气”、“交通状况”、“限号号码”、“超市打折”等指令语音属于生活指令语音组中的指令语音,而“刻舟求剑的含义”、“5G手机是什么手机”、“log函数的由来”等指令语音属于学习指令语音组中的指令语音,又如“如何合理安排时间”、“出差注意事项”、“如何提高工作效率”、“人工智能算法都有哪些”等指令语音属于工作指令组中的指令语音。
可以理解的是,有些用户对于生活指令的回复语音比较重视,希望回复语音较为丰富多彩,内容幽默有趣。这类用户包括家庭主妇、退休老人等,而有些用户对于学习指令的回复语音比较重视,希望回复语音能够较为详细地阐述知识背后的典故、原理等,这类用户包括学生、学者等。
可以理解的是,不同的用户对于不同指令语音组对应的回复语音的播放时长和/或冗余度的要求是不同的,例如,职业人士希望针对工作指令组的回复语音较为详实,而希望对于生活指令组的回复语音较为简短。例如,当用户对指令语音“现在是几点”的回复语音的需求是简短有效时,那么对于与“现在是几点”位于同一指令语音组的其他语音指令,如“今天天气如何”、“限行尾号”、“某路线是否堵车”的回复语音的需求也是简单有效。
在本处理方式中,当某一播放时长和/或冗余度大于预设播放时长阈值和/或冗余度阈值的回复语音收到正面评价语音时,说明该回复语音的播放时长和/或冗余度能够被用户接受或喜欢,也侧面说明该用户希望回复语音具有较长的播放时长和/或较高的冗余度,根据上面指令语音组的分析可知,用户应该也希望该指令语音所在的指令语音组对应的回复语音均具有较长的播放时长和/或较高的冗余度,因此,为提高用户使用体验,避免用户针对同一指令语音组中的不同指令语音的回复语音多次发送正面评价语音,本处理方式调整与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长和/或冗余度,使得该用户在发出同一指令语音组中的其他指令语音时,也可以得到播放时长和/冗余度较高的回复语音,从而可以避免用户针对同一指令语音组中的不同指令语音的回复语音多次发送正面评价语音,从而可以提高用户使用体验。
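J. Maintain or increase the word count and/or redundancy of the reply text corresponding to all or some of the command voices in the same command voice group as the command voice, as detailed below: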
在本处理方式中,与上述处理方式类似,主要区别在于本处理方式强调的是回复文本的字数和/或冗余度,也即本处理方式是通过调整回复文本的字数和/或冗余度的方式来调整回复语音的长度和/或冗余度。这里的字数条件和/或冗余度条件可以根据需要进行设定。
可以理解的是,在提高回复语音的播放时长和/或冗余度时,可以通过查询数据库中存储的扩展信息的方式进行完成。例如,当某个回复语音是“现在是下午3点”时,若想提高该回复语音的播放时长和/或冗余度,则可以通过查询数据库中存储的各种扩展信息的方式完成,例如,通过查询数据库进行扩展后,得到的回复语音分别为①“现在是下午3点,请起身喝杯咖啡吧”;②“现在是下午3点,给您放一首舒缓的歌曲吧”;③“现在是下午3点,历史上的下午3点发生的趣事是…”;④“现在是下午3点,请找个安静的地方,合上双眼,跟我一起做冥想吧”等等。
基于上述实施例的内容,在本实施例中,根据所述第二对话策略调整方向,调整对应所述指令语音的对话策略,具体包括:
保持或提高所述回复语音的播放时长和/或冗余度;其中,回复语音的冗余度是指回复语音中非回复指令语音所必需的语音内容与回复语音全部语音内容的比值;
或,保持或提高与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度;
或,保持或提高与第一用户发出的所有或部分指令语音对应的回复语音的播放时长和/或冗余度;其中,所述第一用户为发出所述指令语音的用户;
或,保持或提高与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长和/或冗余度;
或,从与所述指令语音对应的回复语音库中选择与所述回复语音的播放时长和/或冗余度的差值在预设范围内的回复语音进行播放。
关于该实施例各部分的详细介绍,上述实施例已经给出,具体内容和效果可参见上述实施例的相关内容,此处不再赘述。
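Based on the content of the above embodiments, in this embodiment, determining the second dialogue strategy adjustment direction corresponding to the command voice according to the second keyword carried in the positive feedback information specifically includes:
determining that the second keyword is a keyword related to maintaining or increasing the usage frequency, and then determining that the second dialogue strategy adjustment direction corresponding to the command voice is to maintain or increase the usage frequency of the reply voice.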
在本实施例中,与保持或提高使用频率有关的关键词可以为:以后多多出现、非常喜欢、以后就是你了、多多使用等等。
在本实施例中,当确定对应所述指令语音的第二对话策略调整方向为与保持或提高所述回复语音的使用频率时,具体有如下几种实现方式:
A、增加所述回复语音的使用频率;其中,增加所述回复语音的使用频率是指在后续时间段内响应所述指令语音时,从回复语音库中选择所述回复语音作为响应的概率增加,具体介绍如下:
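The focus of this processing mode is that when a reply voice receives a positive evaluation voice, its usage frequency can subsequently be increased. Because the reply voice was well received as a reply to the command voice, the likelihood of selecting it in later responses to the command voice is raised; that is, the probability of selecting it from the reply voice library corresponding to the command voice as the response increases. One benefit of this processing mode is that the reply voices in the reply voice library do not need to be adjusted or changed; a more suitable or better matching reply voice is simply selected as the response to the command voice, which is simple and convenient to implement.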
可以理解的是,为增加选择所述回复语音作为响应的概率,可以通过增加与所述回复语音对应的分值的方式,也可以通过特殊标记的方式来提高选择所述回复语音作为响应的可能性。
B、增加主题与所述回复语音接近的回复语音的使用频率;
在本处理方式中,与上述处理方式类似,不同之处在于,为丰富用户体验,可以增加主题与所述回复语音接近的回复语音的使用频率。举例来说,当用户对体育主题的回复语音比较喜欢时,可以尝试增加关于瑜伽或冥想等比较接近的主题的回复语音的使用。
C、增加播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率;其中,增加播放长度和/或冗余度大于或等于所述回复语音的回复 语音使用频率是指在后续响应所述指令语音时,从与所述指令语音对应的回复语音库中选择播放长度和/或冗余度大于或等于所述回复语音的回复语音作为响应的概率增加,具体介绍如下:
本处理方式与上述处理方式类似,主要区别在于,本处理方式用于增加与所述指令语音对应的回复语音库中播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率,可以理解的是,若回复语音收到正面评价语音,则表明用户可能比较认可或希望接收较长播放时长或较高冗余度的回复语音,因此在一种实现方式中,后续可以增加播放长度和/或冗余度大于或等于所述回复语音的回复语音作为响应的概率,从而可以更加贴合用户需求。由于本实施例的处理方式与上述实施例类似,故此处不再赘述。
基于上述实施例的内容,在本实施例中,根据所述第二对话策略调整方向,调整对应所述指令语音的对话策略,具体包括:
增加所述回复语音的使用频率;其中,增加所述回复语音的使用频率是指在后续时间段内响应所述指令语音时,从回复语音库中选择所述回复语音作为响应的概率增加;
或,增加主题与所述回复语音接近的回复语音的使用频率;
或,增加播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率;其中,增加播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率是指在后续响应所述指令语音时,从与所述指令语音对应的回复语音库中选择播放长度和/或冗余度大于或等于所述回复语音的回复语音作为响应的概率增加。
关于该实施例各部分的详细介绍,上述实施例已经给出,具体内容和效果可参见上述实施例的相关内容,此处不再赘述。
此外,在本实施例中,需要补充说明的是,当用户比较喜欢某一回复语音,进而给出正面评价时,还可以存在一种实现方式为:重复播放所述回复语音,具体介绍如下:
在本处理方式中,当回复语音收到正面评价语音时,表明用户比较喜欢该回复语音,因此,在一种实现方式中,可以重复播放所述回复语音,以满足用户想要回听所述回复语音的需求。此外,需要说明的是,当在回 复语音播放过程中或播放结束后收到正面评价语音时,可以是本次重复播放所述回复语音,也可以是下次为响应相同指令语音时重复播放所述回复语音,也可以是两者兼具。
与收到正面评价语音进行重复播放相对应的收到反面评价语音后的一种特殊处理方式是:结束所述回复语音,具体介绍如下:
可以理解的是,当在回复语音的播放过程中收到用户发送的负面评价语音时,表示用户不喜欢该回复语音或认为该回复语音的长度过长,此时一种处理方式可以是根据评价语音结束该回复语音,也即在收到评价语音时未播放的回复语音不再继续播放,结束该回复语音,这样可以使得用户不再受到过长或不喜欢的回复语音的困扰,使得能够在评价语音发出的同时实现回复语音停止播放的效果。可以理解的是,这里的结束所述回复语音可以指彻底结束回复语音的播放,也可以指暂时中止回复语音的播放,待接收到重启播放指令后再接着播放等,本实施例对此不作限定。
基于上述实施例的内容,在本实施例中,确定所述评价语音携带的反馈信息为负面反馈信息,具体包括:
确定所述评价语音中携带有第一信息,所述第一信息是指与第一数据库中的评语信息相匹配的信息;其中,所述第一数据库中存储有负面评语信息;
或,确定所述评价语音中携带有第二信息,所述第二信息是指与所述回复语音中包含的信息具有相反含义的信息;
或,确定所述评价语音对应的语调与第一语调库中的语调信息相匹配,所述第一语调库中存储有带有负面情绪的语调;
或,确定所述评价语音对应的响度大于或等于第一响度。
在本实施例中,在确定评价语音中是否携带有针对所述回复语音的负面反馈信息时,至少有如下A、B、C、D四种实现方式,具体说明如下:
A、确定所述评价语音中携带有第一信息,所述第一信息是指与第一数据库中的评语信息相匹配的信息;其中,所述第一数据库中存储有负面评语信息;
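Here, the negative comment information may include “not good”, “don't like”, “too long”, “too complicated”, “disturbing”, “No”, “Bad”, “Stop”, and the like.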
B、确定所述评价语音中携带有第二信息,所述第二信息是指与所述回复语音中包含的信息具有相反含义的信息;
可以理解的是,负面的评价语音还可以为包含与回复语音中包含的信息具有相反含义的信息,也即当用户比较不喜欢回复语音时,会通过表达相反含义来表达不喜欢的感情。
举例来说,当回复语音为“现在已经是凌晨3点整,天已不早,早点入睡,知道你很辛苦,我也一直在祝福,明天继续加油!”时,若用户不喜欢该语音,则对应的评价语音可能是“不加油!”或“不想努力”或“不想奋斗”等。
C、确定所述评价语音对应的语调与第一语调库中的语调信息相匹配,所述第一语调库中存储有带有负面情绪的语调;
可以理解的是,当用户不喜欢回复语音时,发出的评价语音会带有负面情绪的语调,如不开心、如叹气、如怨声等。因此,通过确定评价语音对应的语调是否与第一语调库中的语调信息相匹配进而可以确定评价语音中是否携带有针对所述回复语音的负面反馈信息。
D、确定所述评价语音对应的响度大于或等于第一响度。
可以理解的是,当用户不喜欢回复语音时,发出的评价语音的响度一般会比较高,例如,发出讨厌!不喜欢!Stop!等。因此,通过确定评价语音对应的响度是否大于或等于第一响度(第一响度可以根据需要进行设定)进而可以确定评价语音中是否携带有针对所述回复语音的负面反馈信息。
由此可见,本实施例给出了确定所述评价语音是否携带针对所述回复语音的负面反馈信息的不同处理方式,这些处理方式从不同角度出发,能够较为全面且准确地确定所述评价语音是否携带针对所述回复语音的负面反馈信息。
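Three of the four checks above (A: matching against stored negative comments, C: matching a negative tone, D: loudness at or above the first loudness) lend themselves to a compact sketch. Everything below is illustrative: the phrase list reuses the original-language examples from the description, while the tone labels, the decibel unit and the 70 dB default are assumptions, and check B (opposite meaning to the reply) is omitted because it would need a semantic model that the description does not specify.

```python
import re
from typing import FrozenSet, Iterable, Optional

# Example entries for the "first database" of negative comments.
NEGATIVE_PHRASES = ("不好", "不喜欢", "太长", "太复杂", "no", "bad", "stop")
# Hypothetical labels produced by some upstream tone classifier.
NEGATIVE_TONES = frozenset({"unhappy", "sigh", "complaint"})


def contains_phrase(transcript: str, phrases: Iterable[str]) -> bool:
    """True if any phrase occurs; ASCII phrases must match a whole word."""
    text = transcript.lower()
    words = set(re.findall(r"[a-z']+", text))
    for phrase in phrases:
        if phrase.isascii():
            if phrase in words:
                return True
        elif phrase in text:
            return True
    return False


def is_negative(transcript: str,
                loudness_db: float,
                tone_label: Optional[str] = None,
                first_loudness_db: float = 70.0,
                negative_tones: FrozenSet[str] = NEGATIVE_TONES) -> bool:
    """Combine checks A, C and D; any single hit counts as negative feedback."""
    if contains_phrase(transcript, NEGATIVE_PHRASES):       # check A
        return True
    if tone_label is not None and tone_label in negative_tones:  # check C
        return True
    return loudness_db >= first_loudness_db                 # check D
```

Whole-word matching for the English entries avoids treating an utterance such as “I know” as containing “No”.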
基于上述实施例的内容,在本实施例中,确定所述评价语音携带的反馈信息为正面反馈信息,具体包括:
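determining that the evaluation voice carries third information, the third information being information that matches comment information in a second database; wherein positive comment information is stored in the second database;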
或,确定所述评价语音中携带有第四信息,所述第四信息是指与所述回复语音中包含的信息具有相同或类似含义的信息;
或,确定所述评价语音对应的语调与第二语调库中的语调信息相匹配,所述第二语调库中存储有带有正面情绪的语调;
或,确定所述评价语音对应的响度小于第一响度。
在本实施例中,在确定所述评价语音中是否携带有针对所述回复语音的正面反馈信息的方式时,具体可采用如下A、B、C和D中的任意一种或多种实现:
A、确定所述评价语音中携带有第三信息,所述第三信息是指与第二数据库中的评语信息相匹配的信息;其中,所述第二数据库中存储有正面评语信息;
Here, the positive comment information may include “good”, “like it”, “pretty good”, “not bad”, “Yes”, and the like.
B、确定所述评价语音中携带有第四信息,所述第四信息是指与所述回复语音中包含的信息具有相同或类似含义的信息;
可以理解的是,正面的评价语音还可以为包含与回复语音中包含的信息具有相同含义的信息,也即当用户比较喜欢回复语音时,会通过表达相同或类似含义来表达喜欢的感情。
举例来说,当回复语音为“现在已经是凌晨3点整,天已不早,早点入睡,知道你很辛苦,我也一直在祝福,明天继续加油!”时,若用户喜欢该语音,则对应的评价语音可能是“一起加油!”或“努力奋斗”或“我也祝福你”等。
C、确定所述评价语音对应的语调与第二语调库中的语调信息相匹配,所述第二语调库中存储有带有正面情绪的语调;
可以理解的是,当用户喜欢回复语音时,发出的评价语音会带有正面情绪的语调,如开心、欢呼、高兴等。因此,通过确定评价语音对应的语调是否与第二语调库中的语调信息相匹配进而可以确定评价语音中是否携带有针对所述回复语音的正面反馈信息。
D、确定所述评价语音对应的响度小于第一响度。
可以理解的是,当用户喜欢回复语音时,发出的评价语音的响度一般会比较小,例如,发出挺好的,喜欢,不错等。因此,通过确定评价语音对应的响度是否小于第一响度(第一响度可以根据需要进行设定)进而可以确定评价语音中是否携带有针对所述回复语音的正面反馈信息。
由此可见,本实施例给出了确定所述评价语音是否携带针对所述回复语音的正面反馈信息的不同处理方式,这些处理方式从不同角度出发,能够较为全面且准确地确定所述评价语音是否携带针对所述回复语音的正面反馈信息。
基于上述实施例的内容,在本实施例中,用于对评价语音进行分析的数据库和用于对指令语音进行分析的数据库相互独立;
相应地,在回复语音的播放过程中或播放结束后的时间窗口内,对接收到的语音基于用于对评价语音进行分析的数据库进行分析,确定所述评价语音携带的反馈信息为负面反馈信息或正面反馈信息。
在本实施例中,为提高处理效率,可以独立地设置用于对评价语音进行分析的数据库和用于对指令语音进行分析的数据库,这样两个数据库互不干扰,且每个数据库可以更有针对性,因而可以有效提高分析时的针对性,进而提高分析效率,同时提高分析准确度和分析速度。
在本实施例中,可以理解的是,智能设备(如智能音箱)被预先设定好在回复语音的播放过程中或播放结束后的时间窗口内执行接收评价语音以及针对评价语音的分析工作,从而能有效降低智能设备的能耗,同时,由于智能设备利用专门的用于对评价语音进行分析的数据库进行分析,从而能有效提高处理效率,且能够得到较为准确的分析结果。
基于上述实施例的内容,在本实施例中,所述用于对评价语音进行分析的数据库位于智能设备侧,由智能设备在播放回复语音过程中或播放结束后的时间窗口内,对接收到的语音基于用于对评价语音进行分析的数据库进行分析,确定所述评价语音携带的反馈信息为负面反馈信息或正面反馈信息。
在本实施例中,将用于对评价语音进行分析的数据库位于智能设备侧,由智能设备在播放回复语音过程中或播放结束后的时间窗口内,对接收到的语音基于用于对评价语音进行分析的数据库进行分析,确定所述评价语 音携带的反馈信息为负面反馈信息或正面反馈信息,从而可以在智能设备本地完成分析(省去了与服务器或终端交互的交互过程),从而可以降低时延,使得可以迅速得到分析结果进而可以利用分析结果对智能设备进行调整。例如当可以及时分析出用户的评价语音中包含负面反馈信息时,可以及时中断当前回复语音或及时调整当前回复语音的冗余度或播放时长等(具体调整方式可以参见前述实施例的介绍),从而提高用户体验。
基于上述实施例的内容,在本实施例中,所述指令语音组以指令主题的方式进行划分,所述指令主题包括:生活指令、工作指令、学习指令中的一种或多种。
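In this embodiment, as described above, the command voice groups can be divided by command topic, for example into one or more of life commands, work commands and study commands. Correspondingly, a life command voice group, a work command voice group and a study command voice group are obtained.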
举例来说,“现在是几点”、“今天天气”、“七步洗手的方式”等指令语音属于生活指令语音组中的指令语音。
举例来说,“守株待兔的含义”、“二十四节气”、“ln函数的由来”等指令语音属于学习指令语音组中的指令语音。
举例来说,“PPT制备方法”、“如何做好工作计划”等指令语音属于工作指令组中的指令语音。
可以理解的是,有些用户对于生活指令的回复语音比较重视,希望回复语音较为丰富多彩,内容幽默有趣。这类用户包括家庭主妇、退休老人等,而有些用户对于学习指令的回复语音比较重视,希望回复语音能够较为详细地阐述知识背后的典故、原理等,这类用户包括学生、学者等,此外,还有用户对于工作指令的回复语音比较重视,希望回复语音能够较为详细地阐述针对工作问题的答复,这类用户包括职场人士等。
可以理解的是,将语音指令按照指令语音组的方式进行划分后,则对属于同一指令语音组中的多个语音,智能设备(或终端设备或服务器)可以采用类似播放时长和/或冗余度的回复语音对属于同一指令语音组中的指令语音进行回复,从而省去了用户对于同一指令语音组的部分或全部语音指令的回复语音均发出评价语音进行调整的麻烦。
在本处理方式中,当根据用户发出的负面评价语音确定某一指令语音对应的回复语音不被用户喜欢时,表示用户希望该指令语音的回复语音是简短有效的,无需太多冗余信息。根据上面指令语音组的分析可知,用户也希望该指令语音所在的指令语音组对应的回复语音均没有太多冗余信息,因此,为提高用户使用体验,避免用户针对同一指令语音组中的不同指令语音的回复语音多次发送负面评价语音,本处理方式调整与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长和/或冗余度,使得该用户在发出同一指令语音组中的其他指令语音时,也可以得到播放时长和/冗余度较低的回复语音,从而可以避免用户针对同一指令语音组中的不同指令语音的回复语音多次发送负面评价语音,从而可以提高用户使用体验。
基于上述实施例的内容,在本实施例中,根据所述评价语音对所述回复语音进行调整,包括:
根据所述评价语音中携带的提示信息对所述回复语音进行调整;其中,所述提示信息用于提示针对所述回复语音的调整策略。
在本实施例中,当所述评价语音中携带有提示信息时,可以直接根据所述评价语音中携带的提示信息对所述回复语音进行调整。
举例来说,所述提示信息可以为:播放与体育主题相关的回复语音;也可以是:播放时长控制在3-6s;还可以是:播放时长缩短一些;还可以是:播放时长在10s以上;还可以是播放时长加长一些;还可以是冗余度控制在0.5以下;还可以是:冗余度在0.5以上等。
可以理解的是,举例来说,当接收到类似“播放时长缩短一些”这类评价语音时,可以根据评价语音进行调整。例如,可以缩短后续针对相同指令语音的回复语音的长度,或者,可以缩短后续针对该用户发出的所有或部分指令语音的回复语音的长度。此外,假设评价语音中携带有“我希望回复语音的长度控制在5s内”这类的时长条件信息,则可以提取该时长条件信息,并根据该时长条件信息对后续针对相同指令语音的回复语音的长度,或者,可以缩短后续针对该用户发出的所有或部分指令语音的回复语音的长度进行调整。
又比如,当接收到类似“不喜欢这个主题”这类评价语音时,可以根 据评价语音进行调整。例如,可以更换新的回复语音。假设当回复语音为“现在已经是凌晨3点整,天已不早,早点入睡,知道你很辛苦,我也一直在祝福,明天继续加油!”时,假设评价语音是“喜欢足球主题”,则可以更换新的回复语音,例如,更换为新的回复语音:“现在是凌晨3点整,早上5点有巴萨与皇马的对决赛,请记得及时收看!”。
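Extracting a duration condition such as “3-6s”, “10s以上” (10 s or more) or “5s内” (within 5 s) from the recognized text of an evaluation voice could be done with a small pattern matcher. The sketch below is an assumption about one possible realization: it works on the original-language phrasings quoted in the description, returns a `(min_s, max_s)` pair, and treats a bare value without a suffix as an upper bound.

```python
import re
from typing import Optional, Tuple

_RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)\s*s")
_SINGLE = re.compile(r"(\d+(?:\.\d+)?)\s*s\s*(以上|以内|以下|内)?")


def parse_duration_hint(text: str) -> Optional[Tuple[Optional[float], Optional[float]]]:
    """Return (min_s, max_s) extracted from the hint, or None if no duration is found."""
    match = _RANGE.search(text)
    if match:
        return float(match.group(1)), float(match.group(2))
    match = _SINGLE.search(text)
    if not match:
        return None
    value = float(match.group(1))
    if match.group(2) == "以上":        # "or more": lower bound only
        return value, None
    return None, value                  # "within" / "or less" / no suffix: upper bound
```

Hints without a number, such as “播放时长缩短一些” (make it a bit shorter) or a topic request, yield `None` and would have to be handled by the keyword-based strategies described earlier.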
基于上述实施例的内容,在本实施例中,根据所述评价语音中携带的提示信息对所述回复语音进行调整,包括:
若所述提示信息用于提示降低或提高所述回复语音的播放时长和/或冗余度,则根据所述提示信息降低或提高所述回复语音的播放时长和/或冗余度;
和/或,
若所述提示信息用于提示更换新的回复语音,则根据所述提示信息更换新的回复语音。
在本实施例中,当提示信息是用于提示降低或提高所述回复语音的播放时长和/或冗余度的提示信息,则根据所述提示信息降低或提高所述回复语音的播放时长和/或冗余度;当提示信息是用于提示更换新的回复语音的提示信息,则根据所述提示信息更换新的回复语音。关于这部分内容的举例可参见上述实施例的介绍,此处不再赘述。
基于上述实施例的内容,在本实施例中,所述提示信息中包含有目标播放时长信息和/或目标冗余度信息,和/或,所述提示信息中包含目标扩展主题信息;
根据所述提示信息降低或提高所述回复语音的播放时长和/或冗余度,包括:
根据所述提示信息中携带的目标播放时长信息和/或目标冗余度信息,降低或提高所述回复语音的播放时长和/或冗余度;
和/或,
根据所述提示信息更换新的回复语音,包括:
根据所述提示信息中携带的目标扩展主题信息,更换带有所述目标扩展主题信息的新的回复语音。
在本实施例中,根据所述提示信息中携带的目标播放时长信息和/或目 标冗余度信息,降低或提高所述回复语音的播放时长和/或冗余度,例如,假设评价语音中携带有“我希望回复语音的长度控制在5s内”这类的目标播放时长信息,则可以提取该目标播放时长信息,并根据该目标播放时长信息对后续针对相同指令语音的回复语音的长度,或者,可以缩短后续针对该用户发出的所有或部分指令语音的回复语音的长度进行调整。
在本实施例中,根据所述提示信息中携带的目标扩展主题信息,更换带有所述目标扩展主题信息的新的回复语音。例如,当回复语音为“现在已经是凌晨3点整,天已不早,早点入睡,知道你很辛苦,我也一直在祝福,明天继续加油!”时,假设评价语音是“喜欢足球主题”,则可以提示信息中携带的目标扩展主题信息(足球)更换新的回复语音,例如,更换为新的回复语音:“现在是凌晨3点整,早上5点有巴萨与皇马的对决赛,请记得及时收看!”。
基于上述实施例的内容,在本实施例中,接收针对回复语音的评价语音,包括:
在回复语音的播放过程中接收针对回复语音的评价语音;
和/或,
在回复语音播放结束后的时间窗口内接收针对回复语音的评价语音。
在本实施例中,可以理解的是,既可以是在回复语音的播放过程中接收针对回复语音的评价语音,也可以是在回复语音播放结束后的时间窗口内接收针对回复语音的评价语音,也可以是两者兼具,本实施例对此不作限定。由此可见,对于本实施例提供的回复语音调整方案来说,不限定用户发表评价语音的时间,用户可以自由灵活地根据需要在回复语音的播放过程中发表评价语音,或在回复语音播放结束后的时间窗口内(如结束后的5s、10s内)发表评价语音。关于时间窗口,可以根据需要进行设定,本实施例对此不作限定。
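The two reception cases (during playback, or within a time window after playback ends) reduce to a timestamp check. The following sketch assumes all times are given in seconds on a common clock and uses a 5 s window as the default, matching one of the example values above; the function names are hypothetical.

```python
def in_evaluation_window(t_s: float,
                         playback_start_s: float,
                         playback_end_s: float,
                         window_s: float = 5.0) -> bool:
    """True if speech arriving at t_s may be treated as an evaluation voice."""
    return playback_start_s <= t_s <= playback_end_s + window_s


def first_duration(t_s: float, playback_start_s: float, playback_end_s: float) -> float:
    """Seconds of the reply voice already played when the evaluation arrived."""
    return min(t_s, playback_end_s) - playback_start_s
```

For an evaluation that arrives after playback has ended, `first_duration` equals the full length of the reply voice.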
基于上述实施例的内容,在本实施例中,与所述指令语音对应的语音或文本分析数据库为第一语音或文本数据库;与所述评价语音对应的语音或文本分析数据库为第二语音或文本数据库;所述第一语音或文本数据库中存储有与指令分析相关的语音或文本内容;所述第二语音或文本数据库中存储有与评价分析相关的语音或文本内容。
在本实施例中,需要说明的是,指令语音一般为:“现在是几点”、“明天天气如何”、“下周限号号码”、“天为什么是蓝的”、“青蛙有几条腿”等询问类的指令内容,而评价语音一般为:“喜欢”、“不喜欢”、“Yes”、“No”、“希望切换至篮球主题”等评价类的指令内容,由此可知,由于指令语音的内容和评价语音的内容相差较大,因此,对指令语音进行语音或文本分析的数据库与对评价语音进行语音或文本分析的数据库可以不同,以方便各自进行更为专业的分析,从而可以提高分析效率。
基于上述实施例的内容,在本实施例中,若在回复语音的播放过程中接收针对回复语音的评价语音,且所述评价语音中携带有针对所述回复语音的负面反馈信息,则:
调高所述回复语音的未播放部分的播放速度;
或,
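extracting part of the unplayed portion of the reply voice and continuing playback with it;
or,
reducing the redundancy of the unplayed portion of the reply voice;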
或,
减少所述回复语音的未播放部分对应的回复文本的字数;
或,
降低所述回复语音的未播放部分对应的回复文本的冗余度;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长,并控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的冗余度小于或等于所述比值;
或,
确定接收所述评价语音时所述回复语音已播放部分对应的第一字数,并控制后续与所述指令语音相同的指令语音对应的回复语音的回复文本的字数小于或等于所述第一字数;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的回复文本的冗余度小于或等于所述比值;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长,并控制与第一用户发出的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与第一用户发出的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值。
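In this embodiment, when the playback duration of the reply voice is adjusted according to the evaluation voice, the playback speed of the unplayed portion of the reply voice can be increased, or part of the unplayed portion can be extracted and played. It can be understood that the advantage of increasing the playback speed of the unplayed portion is that it both satisfies the user's requirement on playback duration and preserves the complete content of the reply voice; its drawback is that the listening experience may be less pleasant for the user.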
而在所述回复语音的未播放部分中截取部分内容进行继续播放的优势是:既可以兼顾用户对播放时长的要求又能够保留未播放部分中相对比较重要的内容,同时用户在听觉上的体验也比较好,不会有语音被加速压缩的感觉。
可以理解的是,加快播放速度的方式优势是不缩减信息,同时能够保证较短时间播放完。而在所述回复语音的未播放部分中截取部分内容进行继续播放的方式,可以从未播放部分中截取重要或关键的内容进行播放,因而可以避免损失回复信息中位于后面但是比较有效的信息。举例来说,当问今天天气如何时,假设回复语音为:“天气晴朗,阳光灿烂,温度15-20,大风4-5级,不适合外出游玩或爬山”,对于这种情况,假设在该回复语音播放至“天气晴朗”时被中断,此时为降低播放时长,可以选取未播放部分中比较重要的信息如“大风4-5级,不适合外出游玩或爬山”进行播放。
在本实施例中,还可以降低所述回复语音的未播放部分的冗余度。
在本实施例中,除了上面实施例所介绍的调整所述回复语音的播放时长以外,还可以像本实施例这样,降低所述回复语音的未播放部分的冗余度。
在本实施例中,需要说明的是,回复语音的冗余度是指回复语音中非回复指令语音所必需的语音内容与回复语音全部语音内容的比值;同理,回复语音的未播放部分的冗余度是指回复语音的未播放部分中非回复指令语音所必需的语音内容与未播放部分语音内容的比值。
在本实施例中,可以理解的是,回复指令语音所必需的语音内容可以理解成是与指令语音直接相关的内容,非回复指令语音所必需的语音内容可以理解成是与指令语音不是直接相关的内容,而是属于主动推介的内容,如温馨提示、音乐分享、俏皮话、广告等等。
举例来说,对于回复语音:“现在是上午11点,工作累了吧,记得多补充水分,多吃水果哦,伸下懒腰,做下伸展运动有利于健康呀”来说,“现在是上午11点”为与指令语音直接相关的内容,而“工作累了吧,记得多补充水分,多吃水果哦,伸下懒腰,做下伸展运动有利于健康呀”为与指 令语音不是直接相关的内容。
假设在上述回复语音播放至“现在是上午11点,工作累了吧”时收到了用户发送的评价语音,此时可以通过降低所述回复语音的未播放部分的冗余度的方式对回复语音进行调整,例如,可以将“记得多补充水分,多吃水果哦,伸下懒腰,做下伸展运动有利于健康呀”这句话的冗余度降低变为“记得多补充水分,做下伸展运动有利于健康”。可以理解的是,具体冗余度降低的方式,本实施例不作限定,可以是利用预设关键词确定哪些内容进行保留的方式,也可以是利用预设低效词确定哪些内容进行删除的方式,可以是将表达重复语义的内容进行删除的方式,也可以是保留重要信息的方式,也可以是随机删除部分信息的方式,也可以是其他降低冗余度的方式,本实施例对此不作限定。
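Of the redundancy-reduction approaches listed above, the one based on preset keywords is the easiest to illustrate. The sketch below splits the unplayed text into clauses at commas and keeps only clauses containing a preset keyword; the keyword list and the clause-splitting rule are assumptions, and the example string is the original-language text of the reply discussed above.

```python
import re
from typing import Iterable


def keep_key_clauses(unplayed_text: str, keywords: Iterable[str]) -> str:
    """Keep only the clauses of the unplayed text that contain a preset keyword."""
    # Split on either a full-width or an ASCII comma and drop empty pieces.
    clauses = [c for c in re.split(r"[,,]", unplayed_text) if c]
    wanted = list(keywords)
    kept = [c for c in clauses if any(k in c for k in wanted)]
    return ",".join(kept)


remaining = "记得多补充水分,多吃水果哦,伸下懒腰,做下伸展运动有利于健康呀"
shortened = keep_key_clauses(remaining, ["水分", "健康"])
```

With the keywords “水分” (water) and “健康” (health), only the clauses about drinking water and about stretching being good for health survive, which corresponds to the shortened sentence given in the example above.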
在本实施例中,调整与所述回复语音对应的回复文本的字数,包括:
减少所述回复语音的未播放部分对应的回复文本的字数。
在本实施例中,跟前述实施例类似,主要区别在于本处理方式强调的是回复文本的字数,也即本处理方式是通过调整回复文本的字数的方式来调整回复语音的长度。这里的字数条件可以根据需要进行设定。例如,可以根据字数条件从回复文本中的未播放部分选择部分文本内容,选取的方式可以是顺序的,也可以是随机的。由于本实施例的具体处理方式与上述实施例类似,因此此处不再做具体介绍。
此外,可以理解的是,在通过调整回复文本的字数的方式来调整回复语音的长度并播放调整后的回复语音的同时,还可以进一步展示对应的经过调整后的回复文本或原始未经过调整的回复文本,以供用户查看相应的文本,提高用户体验。
比如,在有些场景下,当用户因接听电话没来得及听回复语音,或因为噪声等导致回复语音未听清楚,又或是因为刚听完却忘记,此时有对应的回复文本可以帮忙用户获知回复语音的内容信息。此外,显示原始未经过调整的回复文本的好处是,一方面因为不会播放,因此不会占用用户的时间,另一方面,为用户提供了查看完整回复内容的机会,若用户想要了解完整回复语音的内容,则可以通过展示的回复文本获知相关信息。
在本实施例中,调整与所述回复语音对应的回复文本的冗余度,包括:
降低所述回复语音的未播放部分对应的回复文本的冗余度。
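This embodiment is similar to the preceding ones. The main difference is that it focuses on the redundancy of the reply text; that is, the redundancy of the reply voice is adjusted by adjusting the redundancy of the reply text. The redundancy condition can be set as needed. For example, part of the text can be selected from the unplayed portion of the reply text according to the redundancy condition, either sequentially or at random. Because the concrete processing is similar to the embodiments above, it is not described again here.
In addition, it can be understood that while the redundancy of the reply voice is adjusted by adjusting the redundancy of the reply text and the adjusted reply voice is played, the corresponding adjusted reply text, or the original unadjusted reply text, can also be displayed for the user to read, improving the user experience.
In this embodiment, when the playback duration of the reply voice for subsequent command voices identical to the command voice is adjusted according to the evaluation voice, one implementation is to determine the first duration that the reply voice had played when the evaluation voice occurred, and to control the playback duration of the reply voice for subsequent command voices identical to the command voice to be less than or equal to the first duration.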
基于上述实施例的内容,在本实施例中,若在回复语音的播放过程中接收针对回复语音的评价语音,且所述评价语音中携带有针对所述回复语音的正面反馈信息,则
保持或调低所述回复语音的未播放部分的播放速度;
或,
保持或提高所述回复语音的未播放部分的冗余度;
或,
保持或提高所述回复语音的未播放部分对应的回复文本的冗余度。
在本实施例中,若在回复语音的播放过程中接收针对回复语音的正面评价语音,则说明用户继续欣赏该回复语音或喜欢较长播放时长或较高冗余度的回复语音,因此,可以保持或调低所述回复语音的未播放部分的播放速度;或,保持或提高所述回复语音的未播放部分的冗余度;或,保持或提高所述回复语音的未播放部分对应的回复文本的冗余度,从而满足用户的语音交互需求。
基于上述实施例的内容,在本实施例中,若在回复语音播放结束后的 时间窗口内接收针对回复语音的评价语音,且所述评价语音中携带有针对所述回复语音的负面反馈信息,则:
确定接收所述评价语音时所述回复语音已播放的第一时长,并控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,
确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的冗余度小于或等于所述比值;
或,
确定接收所述评价语音时所述回复语音已播放部分对应的第一字数,并控制后续与所述指令语音相同的指令语音对应的回复语音的回复文本的字数小于或等于所述第一字数;
或,确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的回复文本的冗余度小于或等于所述比值;
或,确定接收所述评价语音时所述回复语音已播放的第一时长,并控制与第一用户发出的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与第一用户发出的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值;
或,确定接收所述评价语音时所述回复语音已播放的第一时长,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;
或,确定接收所述评价语音时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的冗余度小于或等于所述比值。
在本实施例中,在根据所述评价语音调整后续与所述指令语音相同的指令语音对应的回复语音的播放时长时,一种实现方式是确定所述评价语 音发生时所述回复语音已播放的第一时长,并控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长。由于所述回复语音在播放至第一时长时用户发出评价语音,因此,表明第一时长这个长度是用户能够接受的最大长度,超过这个长度的回复语音是用户所不愿意接受的,因此,可以以此为条件,控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长,从而满足用户对回复语音播放时长的需求。
举例来说,假设一个回复语音完整的播放时长是15s,当在该回复语音播放至6s时接收到了用户的评价语音,则表明该用户针对该指令语音的回复语音的播放时长的需求是在6s或6s以下,因此,可以将6s作为阈值,控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于6s。
在本实施例中,所述控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长,包括:
控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述第一时长时停止播放;
或,
控制后续与所述指令语音相同的指令语音对应的回复语音在播放时截取部分内容进行播放;
或,
从与所述指令语音对应的回复语音库中选择播放时长小于或等于所述第一时长的回复语音作为后续与所述指令语音相同的指令语音对应的回复语音;
或,
调高后续与所述指令语音相同的指令语音对应的回复语音的播放速度。
在本实施例中,在控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长时,有多种实现方式,例如可以是:A、控制后续与所述指令语音相同的指令语音对应的回复语音在播放时长小于或等于所述第一时长时停止播放;或,B、控制后续与所述指 令语音相同的指令语音对应的回复语音在播放时截取部分内容进行播放;或,C、从与所述指令语音对应的回复语音库中选择播放时长小于或等于所述第一时长的回复语音作为后续与所述指令语音相同的指令语音对应的回复语音;或,D、调高后续与所述指令语音相同的指令语音对应的回复语音的播放速度。
由此可见,本实施例给出了多种实现方式,上述方式A的优势在于,控制起来简单方便,只需在回复语音的播放时长小于或等于所述第一时长时停止播放即可。上述方式B的优势在于,比较灵活,例如可以根据需要截取回复语音中相对比较重要的信息进行播放。上述方式C的优势在于,不用对回复语音库中的回复语音进行调整,实现起来简单方便,可以直接选择播放时长满足要求的回复语音作为响应。上述方式D的优势在于,不损失回复语音的信息内容,同时能够满足缩短播放时长的效果。
在本实施例中,根据所述评价语音调整后续与所述指令语音相同的指令语音对应的回复语音的冗余度,包括:
确定所述评价语音发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的冗余度小于或等于所述比值。
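In this embodiment, when the redundancy of the reply voice for subsequent command voices identical to the command voice is adjusted according to the evaluation voice, the ratio of the first duration that the reply voice had played when the evaluation voice occurred to the total duration of the reply voice can be determined, and the redundancy of the reply voice for subsequent command voices identical to the command voice can be controlled to be less than or equal to that ratio. For example, suppose the full playback duration of a reply voice is 15 s and the user's evaluation voice is received when the reply voice has played for 6 s. The ratio of the first duration to the total duration is then 6/15 = 0.4, so the redundancy of the reply voice for subsequent identical command voices is kept at or below 0.4; that is, the part of the reply voice that has no direct relation to the command voice must not exceed 0.4 of the whole reply voice.
As a further example, for the reply voice “It is 11 a.m. now; tired from work? Remember to drink more water and eat more fruit, have a stretch, stretching exercises are good for your health”, “It is 11 a.m. now” is the content directly related to the command voice, while “tired from work? Remember to drink more water and eat more fruit, have a stretch, stretching exercises are good for your health” is content not directly related to the command voice. The current redundancy of this reply voice is 0.85. Suppose the user's evaluation voice is received when the reply voice has played for 6 s, so that, with the same 15 s total duration as above, the ratio of the first duration to the total duration is 0.4. The redundancy of the reply voice for subsequent identical command voices is then kept at or below this ratio; that is, the part of the reply voice that has no direct relation to the command voice must not exceed 0.4 of the whole reply voice, and the reply voice can be adjusted to “It is 11 a.m. now; tired from work?”.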
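The arithmetic of this example can be reproduced with a short sketch. It measures redundancy by character count on the original-language text (punctuation excluded) and appends non-essential clauses in order until the limit would be exceeded; both choices are assumptions made here, since the description leaves the exact measure and trimming order open.

```python
from typing import List, Tuple


def trim_to_redundancy(essential: str,
                       extras: List[str],
                       max_ratio: float) -> Tuple[str, List[str]]:
    """Keep leading non-essential clauses while redundancy stays <= max_ratio."""
    kept: List[str] = []
    extra_len = 0
    for clause in extras:
        new_extra = extra_len + len(clause)
        # Redundancy = non-essential characters / all characters.
        if new_extra / (len(essential) + new_extra) > max_ratio:
            break
        kept.append(clause)
        extra_len = new_extra
    return essential, kept


ratio = 6 / 15  # first duration 6 s over total duration 15 s
core, kept_extras = trim_to_redundancy(
    "现在是上午11点",
    ["工作累了吧", "记得多补充水分", "多吃水果哦", "伸下懒腰", "做下伸展运动有利于健康呀"],
    ratio,
)
```

Under these assumptions only the first non-essential clause is retained (5 of 13 characters, about 0.38), which matches the adjusted reply given in the example; adding the next clause would raise the redundancy to 12/20 = 0.6.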
在本实施例中,根据所述评价语音调整后续与所述指令语音相同的指令语音对应的回复语音的回复文本的字数,包括:
确定所述评价语音发生时所述回复语音已播放部分对应的第一字数,并控制后续与所述指令语音相同的指令语音对应的回复语音的回复文本的字数小于或等于所述第一字数。
在本实施例中,跟前述实施例类似,主要区别在于本实施例强调的是回复文本的字数,也即本处理方式是通过调整回复文本的字数的方式来调整回复语音的长度。由于本实施例的具体处理方式与上述实施例类似,因此此处不再做具体介绍。
在本实施例中,根据所述评价语音调整后续与所述指令语音相同的指令语音对应的回复语音的回复文本的冗余度,包括:
确定所述评价语音发生时所述回复语音已播放的第一时长占所述回复语音总时长的比值,并控制后续与所述指令语音相同的指令语音对应的回复语音的回复文本的冗余度小于或等于所述比值。
在本实施例中,跟前述实施例类似,主要区别在于本实施例强调的是回复文本的冗余度,也即本处理方式是通过调整回复文本的冗余度的方式来调整回复语音的冗余度。由于本实施例的具体处理方式与上述实施例类似,因此此处不再做具体介绍。
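In this embodiment, adjusting, according to the evaluation voice, the playback duration of the reply voices corresponding to all or some of the command voices issued by the first user comprises:
determining the first duration for which the reply voice had been played when the evaluation voice occurred, and controlling the playback duration of the reply voices corresponding to all or some of the command voices issued by the first user to be less than or equal to the first duration.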
基于上述实施例的内容,在本实施例中,所述的语音交互处理方法,还包括:
确定接收所述评价语音时对应的时间段信息;
相应地,在后续与所述时间段信息相对应的时间段,根据所述评价语音对所述回复语音进行调整。
在本实施例中,为进一步进行精细化控制,可以先确定所述评价语音发生时对应的时间段信息,然后在后续与所述时间段信息相对应的时间段,根据所述评价语音对所述回复语音进行调整。
可以理解的是,用户可能在不同时间段对于回复语音的播放长度和/或冗余度有不同的要求,例如在第一时间段(如下午16:00-17:00),更倾向于接收内容丰富的回复语音,例如,包含与指令语音直接相关以及与指令语音不直接相关的内容,而在第二时间段(如早上8:00-9:00),更倾向于接收内容简短的回复语音,例如,包含与指令语音直接相关的内容。因此,即便对于同一指令语音,可能因其所处的时间段不同,用户对该指令语音的回复语音要求也是不同的。为解决该问题,本实施例先确定所述评价语音发生时对应的时间段信息,然后在后续与所述时间段信息相对应的时间段,根据所述评价语音对所述回复语音进行调整。
例如,可以在后续与所述时间段信息相对应的时间段,执行前面实施例所述的处理方式1至处理方式13中的任意一种或多种调整方式。
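It can be understood that a day can be divided into multiple time periods, and the way each reply voice is adjusted can then be determined separately for each period. For example, the day can be split into 24 one-hour periods, with the adjustment determined per period; this embodiment places no limitation on this.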
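A per-period adjustment of this kind could be kept in a small lookup keyed by time slot. The sketch below is an assumption-laden illustration: the class name `SlotCaps`, the choice of storing only a duration cap per slot, and the rule of keeping the smallest cap seen in a slot are not taken from this application.

```python
from datetime import datetime


class SlotCaps:
    """Remembers a reply-duration cap separately for each time slot of the day."""

    def __init__(self, slot_hours=1):
        if slot_hours <= 0 or 24 % slot_hours:
            raise ValueError("slot_hours must be a positive divisor of 24")
        self.slot_hours = slot_hours
        self.caps = {}

    def slot(self, when):
        # Index of the time slot that contains `when`.
        return when.hour // self.slot_hours

    def record(self, when, first_duration):
        # Keep the strictest cap observed in this slot (assumed policy).
        key = self.slot(when)
        self.caps[key] = min(first_duration, self.caps.get(key, first_duration))

    def cap_for(self, when):
        # None means no evaluation voice was recorded for this slot yet.
        return self.caps.get(self.slot(when))


caps = SlotCaps(slot_hours=1)
caps.record(datetime(2020, 12, 14, 8, 30), 6.0)     # evaluation voice at 08:30
morning = caps.cap_for(datetime(2020, 12, 15, 8, 5))
afternoon = caps.cap_for(datetime(2020, 12, 15, 16, 30))
```

Here a cap learned at 08:30 applies to a request at 08:05 the next day, while the 16:00-17:00 slot remains unrestricted.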
基于上述实施例的内容,在本实施例中,在根据所述评价语音对所述回复语音进行调整之前,所述方法还包括:
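determining whether the evaluation voice is a valid evaluation voice, which specifically comprises:
determining whether the evaluation voice contains no wake word, and/or whether the duration of the evaluation voice is less than a first duration, and/or whether the loudness difference between the evaluation voice and the command voice or the reply voice is greater than a first difference; if so, the evaluation voice is determined to be a valid evaluation voice.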
在本实施例中,在根据所述评价语音对所述回复语音进行调整之前,可以先判断所述评价语音是否为有效的评价语音,若不是,则可以直接丢弃,不用对其进行分析,从而可以节省资源。
在确定所述评价语音是否为有效的评价语音时,有多种实现方式,举例来说,由于评价语音不是指令语音,无需唤醒智能设备,因此,评价语音一般不包含唤醒词,因此,在一种实现方式中,可以通过确定评价语音中是否包含唤醒词来确定是否为有效的评价语音。例如,当确定不包含唤醒词时,为有效的评价语音。当确定包含唤醒词时,为无效的评价语音。
此外,由于评价语音一般都是简短的“好”、“不好”、“Yes”、“No”、“Shut up”等,也即评价语音都是一些长度较短的语句,因此,在一种实现方式中,可以通过评价语音的时长是否小于第一时长来确定是否为有效的评价语音。例如,若小于第一时长,则确定为有效的评价语音,否则确定为无效的评价语音。其中,第一时长的大小可以根据需要进行设定,本实施例不作限定。
此外,由于评价语音与指令语音或回复语音一般存在响度差,因此,在一种实现方式中,可以通过判断评价语音与指令语音或回复语音的响度差是否大于第一差值来确定是否为有效的评价语音。例如,若大于第一差值,则确定为有效的评价语音,否则确定为无效的评价语音。其中,第一差值的大小可以根据需要进行设定,本实施例不作限定。
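The three checks described above can be combined as in the following sketch. The embodiment allows them to be used individually or together ("and/or"); requiring all three here is just one possible choice. The threshold values, the use of decibels, the absolute loudness difference, and the function name are hypothetical.

```python
def is_valid_evaluation(text, seconds, loudness_db, reference_db,
                        wake_words=("小美小美",),
                        max_seconds=2.0,
                        min_loudness_gap_db=6.0):
    """Return True if an utterance looks like a valid evaluation voice.

    text          recognized text of the utterance
    seconds       duration of the utterance
    loudness_db   loudness of the utterance
    reference_db  loudness of the command voice or the reply voice
    All thresholds are illustrative defaults, not values from the source.
    """
    no_wake_word = not any(w in text for w in wake_words)
    short_enough = seconds < max_seconds
    loud_gap = abs(loudness_db - reference_db) > min_loudness_gap_db
    return no_wake_word and short_enough and loud_gap


ok = is_valid_evaluation("shut up", 0.8, 72.0, 60.0)
has_wake_word = is_valid_evaluation("小美小美,现在几点了", 1.5, 72.0, 60.0)
too_long = is_valid_evaluation("please tell me more about that", 5.0, 72.0, 60.0)
```

An utterance that fails the check can be discarded before any semantic analysis, which is where the resource saving mentioned above comes from.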
基于上述实施例的内容,在本实施例中,与所述指令语音对应的语义识别算法为第一语义识别算法,与所述评价语音对应的语义识别算法为第二语义识别算法,所述第二语义识别算法的实时性低于所述第一语义识别算法的实时性。
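In this embodiment, users are highly sensitive to whether a command voice is answered promptly, so the semantic recognition algorithm for command voices has a strict real-time requirement. Users are less sensitive to whether an evaluation voice is handled promptly, so the real-time requirement on the semantic recognition algorithm for evaluation voices is lower. Because of that lower requirement, a more accurate and more complex recognition algorithm can be used to identify the meaning carried by the evaluation voice, which in turn allows the reply voice to be adjusted more precisely.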
基于上述实施例的内容,在本实施例中,若所述评价语音中携带有针对所述回复语音的负面反馈信息,则根据所述评价语音对所述回复语音进行调整,包括:
确定所述指令语音的长度;
根据所述指令语音的长度对所述回复语音的播放时长和/或冗余度进行调整。
在本实施例中,采用了与前述实施例不同的方式,也即本实施例不是根据接收所述评价语音时所述回复语音已播放的第一时长对回复语音进行调整,而是根据指令语音的长度对回复语音进行调整。例如,当用户发出的指令语音较长时,则对应的回复语音的播放时长也较长;当用户发出的指令语音较短时,则对应的回复语音的播放时长也较短。
可以理解的是,当用户是位希望接收简短有效的回复语音的用户时,其所发出的指令语音一般也较为简短,因此,根据该处理方式,可以较为简单有效地确定回复语音的长度。
此外,可以理解的是,由于指令语音的长度是时间值,因此在对播放时长进行调整时,可以直接利用,而在对冗余度进行调整时,可以按照预先设定的时长与冗余度的关系,确定合适的冗余度,进而对冗余度进行调整。例如,假设预先设定的时长与冗余度的关系是:当时长为2s时,冗余度为0.1,当时长为5s时,冗余度为0.2,当时长为8s时,冗余度为0.3等等。
在本实施例中,根据所述指令语音的长度对所述回复语音的播放时长进行调整可以指:控制所述回复语音的播放时长小于或等于所述指令语音的长度;也可以指:控制所述回复语音的播放时长与所述指令语音的长度的差值的绝对值位于预设区间内。此外,对于冗余度的调整,也可以采用类似的方式,本实施例不再赘述。
基于上述实施例的内容,在本实施例中,根据所述指令语音的长度对所述回复语音的播放时长进行调整,包括:
根据所述指令语音的长度控制所述回复语音在播放时长与所述指令语音的长度匹配时停止播放;
或,
根据所述指令语音的长度在所述回复语音的未播放部分中截取部分内容进行继续播放,使得调整后的回复语音的总播放时长与所述指令语音的长度匹配;
或,
根据所述指令语音的长度调高所述回复语音的未播放部分的播放速度,使得调整后的回复语音的总播放时长与所述指令语音的长度匹配。
在本实施例中,在根据所述指令语音的长度对所述回复语音的播放时长进行调整时,有多种实现方式:比如,①根据所述指令语音的长度控制所述回复语音在播放时长与所述指令语音的长度匹配时停止播放。这里的匹配包括多种情况,例如可以包括所述回复语音的播放时长小于或等于所述指令语音的长度,或所述回复语音的播放时长与所述指令语音的长度的差值的绝对值位于预设区间内等。
此外,还可以有②根据所述指令语音的长度在所述回复语音的未播放部分中截取部分内容进行继续播放,使得调整后的回复语音的总播放时长与所述指令语音的长度匹配。此外,还可以有③根据所述指令语音的长度调高所述回复语音的未播放部分的播放速度,使得调整后的回复语音的总播放时长与所述指令语音的长度匹配。
由此可见,本实施例给出了多种不同实现方式,具体实施时,可以根据需要选择合适的方式。
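It can be understood that approach ① (stopping playback when the playback duration matches the length of the command voice) has the advantage of controlling the playback duration of the reply voice simply and accurately. Approach ② (continuing with an excerpt taken from the unplayed portion of the reply voice) makes it possible to pick important or key content from the unplayed portion, so that useful information located near the end of the reply is not lost. Approach ③ (raising the playback speed of the unplayed portion) loses no information while still finishing playback within a short time. For example, when the user asks what the weather is like today, suppose the reply voice is: "Sunny and bright, temperature 15-20, strong wind of force 4-5, not suitable for outings or hiking". With approach ①, playback stops once the playback duration matches the length of the command voice, and the useful information at the end ("strong wind of force 4-5, not suitable for outings or hiking") may be missed. Approach ② avoids this.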
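The excerpt and speed-up approaches both work on the unplayed portion of the reply and can be sketched as follows. The segment durations, the `important` flag, and the greedy selection rule are assumptions for illustration; the application does not specify how important content is identified.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    seconds: float    # hypothetical playback time
    important: bool   # hypothetical marker for key content


def catch_up_speed(reply_seconds, played_seconds, target_seconds):
    """Speed for the unplayed part so that the total matches the target.

    Returns None when the target is already used up, in which case
    playback would simply stop.
    """
    budget = target_seconds - played_seconds
    if budget <= 0:
        return None
    return max(1.0, (reply_seconds - played_seconds) / budget)


def excerpt(unplayed, budget):
    """Pick unplayed segments that fit the remaining time budget.

    Important segments are considered first; the result keeps the
    original order of the reply.
    """
    chosen = set()
    for want_important in (True, False):
        for i, seg in enumerate(unplayed):
            if seg.important == want_important and seg.seconds <= budget:
                chosen.add(i)
                budget -= seg.seconds
    return [unplayed[i] for i in sorted(chosen)]


unplayed = [
    Segment("sunny and bright", 1.0, False),
    Segment("temperature 15-20", 1.5, False),
    Segment("strong wind of force 4-5", 1.5, True),
    Segment("not suitable for outings or hiking", 2.5, True),
]
kept = excerpt(unplayed, budget=4.0)
speed = catch_up_speed(reply_seconds=10.0, played_seconds=2.0, target_seconds=6.0)
```

With a 4 s budget, the excerpt keeps the wind warning and the advice against outings, which is exactly the information that plain truncation would lose.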
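Based on the above embodiments, in this embodiment, adjusting the redundancy of the reply voice according to the length of the command voice comprises:
determining the redundancy of the reply voice according to the length range interval into which the length of the command voice falls.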
在本实施例中,可以理解的是,由于指令语音的长度是时间值,因此在对冗余度进行调整时,没有办法直接利用,需要转换为对应的冗余度相关信息。在本实施例中,在将指令语音的长度信息转换为冗余度相关信息,可以根据所述指令语音的长度对应的长度范围区间,确定所述回复语音的冗余度。例如,假设当所述指令语音的长度对应的长度范围区间为(0-2]s时,所述回复语音的冗余度为0.1,当所述指令语音的长度对应的长度范围区间为(2-5]s时,所述回复语音的冗余度为0.2,当所述指令语音的长度对应的长度范围区间为(5-10]s时,所述回复语音的冗余度为0.3等。
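The interval lookup in this example can be written with a sorted list of upper bounds. The table values are the ones given above; the function name and the choice to return `None` for lengths outside (0, 10] s, which the example does not cover, are assumptions.

```python
from bisect import bisect_left

# Upper bounds (inclusive) of the intervals (0, 2], (2, 5], (5, 10] in seconds
UPPER_BOUNDS = [2.0, 5.0, 10.0]
REDUNDANCY = [0.1, 0.2, 0.3]


def redundancy_for_command(length_seconds):
    """Map the command-voice length to a target reply redundancy."""
    if length_seconds <= 0:
        return None
    # bisect_left makes each upper bound belong to its own interval.
    i = bisect_left(UPPER_BOUNDS, length_seconds)
    return REDUNDANCY[i] if i < len(REDUNDANCY) else None
```

For instance, a 2 s command maps to 0.1, a 2.5 s command to 0.2, and a 7 s command to 0.3.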
基于上述实施例的内容,在本实施例中,若所述评价语音中携带有针对所述回复语音的负面反馈信息,则根据所述评价语音对所述回复语音进行调整,包括:
确定所述指令语音的长度;
根据所述指令语音的长度和接收所述评价语音时所述回复语音已播放的第一时长,对所述回复语音的播放时长和/或冗余度进行调整。
在本实施例中,采用了与前述实施例不同的方式,也即本实施例不仅仅根据接收所述评价语音时所述回复语音已播放的第一时长对回复语音进行调整,也不仅仅根据指令语音的长度对回复语音进行调整,而是综合两者对回复语音进行调整。例如可以根据两者的平均值进行调整,也可以根据两者中的最小值进行调整等。可以理解的是,综合两者对回复语音进行调整的优势在于:可以更加准确反映用户对于回复语音的播放时长的接受度,因此,采用这种方式确定的回复语音的播放时长和/或冗余度比较符合用户预期。
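Based on the above embodiments, in this embodiment, adjusting the playback duration and/or redundancy of the reply voice according to the length of the command voice and the first duration for which the reply voice had been played when the evaluation voice was received comprises any one of the following:
adjusting the playback duration and/or redundancy of the reply voice according to the average of the length of the command voice and the first duration;
adjusting the playback duration and/or redundancy of the reply voice according to the minimum of the length of the command voice and the first duration;
adjusting the playback duration and/or redundancy of the reply voice according to the sum of the length of the command voice and the first duration;
determining a target duration of the reply voice from the length of the command voice and the first duration using a first relation model or a second relation model, and adjusting the playback duration and/or redundancy of the reply voice according to the target duration; wherein the first relation model comprises: $T = k_1(\alpha T_1 + \beta T_2)$, where $T$ denotes the target duration, $T_1$ the length of the command voice, $T_2$ the first duration, $\alpha$ the weight of the command voice, $\beta$ the weight of the first duration, and $k_1$ a first adjustment coefficient;
the second relation model comprises: $T_0 = k_2(\alpha \ln T_1 + \beta \ln T_2)$, where $T_0$ denotes the target duration, $T_1$ the length of the command voice, $T_2$ the first duration, $\alpha$ the weight of the command voice, $\beta$ the weight of the first duration, and $k_2$ a second adjustment coefficient.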
在本实施例中,给出了综合所述指令语音的长度和第一时长,对所述回复语音的播放时长和/或冗余度进行调整的具体方式,例如可以根据两者的平均值进行调整,也可以根据两者中的最小值进行调整,还可以根据两者之和进行调整,此外,还可以采用上述第一关系模型或第二关系模型进行调整。
可以理解的是,根据两者的平均值进行调整的优势在于:用户发出指令语音的长度以及发生评价语音时用户所能接受的最长播放时长(也即第一时长)这两者的平均值比较能准确反映用户对于回复语音的播放时长的接受度,因此,采用这种方式确定的回复语音的播放时长比较符合用户预期。
可以理解的是,根据两者中的最小值进行调整的优势在于:根据两者的最小值确定回复语音的播放时长能够最大程度地使得回复语音简短精炼有效,从而可以满足用户对于回复语音简短精炼的要求。
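It can be understood that the advantage of adjusting according to the sum of the two is that, while the user's requirement on the playback duration of the reply voice is still basically met, as much additional information as possible can be provided so that the reply voice does not sound too monotonous.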
可以理解的是,采用上述第一关系模型或第二关系模型进行调整的优势在于:可以根据需求分别为指令语音的长度以及第一时长赋予不同的权重,比如更侧重于使得回复语音的播放时长偏向于与指令语音的时长接近,则可以使得与指令语音的时长对应权重增加,比如更侧重于使得回复语音的播放时长偏向于与第一时长接近,则可以使得与第一时长对应的权重增加,最后上述第一关系模型和第二关系模型还设置了调节系数,用于在最后根据指令语音的时长和第一时长共同确定出时长后,对该时长进行适当调节,比如,在倾向于更短的回复语音时,可以设置调节系数为0.5,在倾向于较长的回复语音时,可以设置调节系数为0.8或1等等。
此外,可以理解的是,不管是根据平均值,还是根据最小值,还是根据两者之和,还是根据目标时长,这些都是时间值,对播放时长进行调整时,可以直接利用,而在对冗余度进行调整时,可以按照预先设定的时长与冗余度的关系,确定合适的冗余度,进而对冗余度进行调整。例如,假设预先设定的时长与冗余度的关系是:当时长为2s时,冗余度为0.1,当时长为5s时,冗余度为0.2,当时长为8s时,冗余度为0.3等等。
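The combination rules above are simple enough to state directly in code. The sketch below follows the two relation models as written; the function names and the sample weights and coefficients are illustrative choices. Note that the logarithmic model requires positive inputs and can yield a non-positive value when both durations are under 1 s, a case the application does not discuss.

```python
import math


def average(t1, t2):
    return (t1 + t2) / 2.0


def minimum(t1, t2):
    return min(t1, t2)


def total(t1, t2):
    return t1 + t2


def first_model(t1, t2, alpha, beta, k1):
    """T = k1 * (alpha * T1 + beta * T2)"""
    return k1 * (alpha * t1 + beta * t2)


def second_model(t1, t2, alpha, beta, k2):
    """T0 = k2 * (alpha * ln T1 + beta * ln T2); needs t1 > 0 and t2 > 0."""
    return k2 * (alpha * math.log(t1) + beta * math.log(t2))


# T1: length of the command voice, T2: first duration (seconds)
t1, t2 = 3.0, 6.0
targets = {
    "average": average(t1, t2),
    "minimum": minimum(t1, t2),
    "sum": total(t1, t2),
    "first_model": first_model(t1, t2, alpha=0.5, beta=0.5, k1=0.8),
    "second_model": second_model(t1, t2, alpha=0.5, beta=0.5, k2=0.8),
}
```

Raising `alpha` pulls the target toward the command length, raising `beta` pulls it toward the first duration, and the adjustment coefficient scales the result as described above.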
基于上述实施例的内容,在本实施例中,对于具备唤醒词的智能设备,所述指令语音包括唤醒词。
在本实施例中,对于具备唤醒词的智能设备,所述指令语音包括唤醒词,相应地,当某个指令语音中不包含唤醒词时,将不会被识别以及响应,从而可以减少无关语音的干扰。
需要说明的是,对于唤醒词来说,不同的智能设备会有不同的设计,本实施例对唤醒词的具体内容设置和长短设置不作要求,一般来说,唤醒词跟产品特点或昵称有关,此外,唤醒词一般不宜过长,且需要比较容易发音。
根据上述技术方案可知,本实施例提供的语音交互处理方法,通过在回复语音的播放过程中发送评价语音的方式对回复语音进行调整,从而使得调整后的回复语音更加匹配用户需求,从而可以为用户提供更好的语音交互服务体验。
在本实施例中,给出关于上述出现的一些名词的更为详细的解释:
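Command voice: voice content issued by the user that can trigger dialog management (Dialog Management, DM for short) in the voice interaction device (which may be a smart device, a terminal device, a server, or a combination of them). Note that in voice interaction devices woken by a wake word, the command voice generally needs to include the wake word.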
语音交互设备:可以为由智能设备、终端设备和服务器三者组成,例如,由智能设备接收指令语音,由终端设备进行语音识别,由服务器进行对话管理等。此外,还可以是令终端设备与智能设备连接,然后借由终端设备接收指令语音,并由服务器进行语音识别(也可以放到终端设备)、对话管理等。此外,语音交互设备也可以由智能设备和服务器两者组成,也即由智能设备接收指令语音,然后由服务器进行语音识别和对话管理等。此外,语音交互设备还可以由智能设备组成,也即在智能设备本地执行接收指令语音,同时也在本地进行语音识别和对话管理等的全过程。此外,语音交互设备可以由智能设备和终端设备组成,也即由智能设备接收指令语音,然后由终端设备进行语音识别和对话管理等的处理过程。此外,语音交互设备可以由终端设备组成,也即由终端设备接收指令语音,然后由终端设备进行语音识别和对话管理等的处理过程。可以理解的是,语音交互设备可以由智能设备、终端设备和服务器中的一个、两个或三个组成,本实施例对此不再一一举例说明。
回复语音:是指响应用户一次指令语音而由语音交互设备所播放的语音。
回复语音的时长:是指回复语音的音频长度,约等于播放完回复语音所需的时间。
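Evaluation voice: an evaluation directed at the reply voice, for example evaluating the reply voice with "好" (good), "不" (no), "否" (no), "住嘴" (shut up) or "shut up". Investigation has shown that an utterance shorter than a certain threshold is more likely to be evaluative than a command. In addition, the user's evaluation of the previous reply voice can be obtained from evaluative text elements (the text database for these is far smaller than the dialog database that stores command voices) or from non-content features such as intonation (for example, if the rise or fall in pitch reaches a certain threshold, the voice is considered to carry evaluative features) or loudness (above a certain threshold, or differing from the previous utterance by more than a certain threshold). An evaluation voice is not a command voice, i.e. it cannot directly trigger a "dialog management" reply voice from the voice interaction device. In voice interaction devices woken by a wake word, the evaluation voice usually does not include the wake word (the recognition requirements for evaluation voices are generally lower than those for command voices).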
本申请的基本原理为:语音交互设备在播放回复语音的一定时间窗口内(例如10秒),确认用户反馈了评价语音,则根据该评价对回复语音进行调整,如调整其出现的频率。下面结合附图3、图4、图5、图6和图7以及具体实施例对本申请提供的语音交互处理方法进行详细解释和说明。
实施例一
如图3-5所示,语音交互系统包括语音交互终端(也称语音交互设备)和云服务器,语音交互终端的作用是接收来自用户的语音信息,示例性的,语音交互终端包括智能音箱,安装有语音助手类软件的智能手机,具有语音模块和通信模块的电视、冰箱、空调等智能家电以及运动手环、智能手表等可穿戴型智能设备。
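When using the intelligent voice interaction function, the user first issues a command voice, for example "小美小美,现在几点了?" ("Xiaomei Xiaomei, what time is it?"), where "小美小美" is the wake word. The voice interaction terminal receives the user's voice through its microphone module and, after preliminary audio processing such as noise reduction and enhancement, checks whether the beginning of the audio data contains the preset wake word (for example, whether it matches the audio waveform of "小美小美"). If it does, the processed audio data is uploaded to the cloud server; otherwise it is discarded. The uploaded audio data passes through the automatic speech recognition module (audio to text) and the natural language processing module (text analysis) in turn and then enters the dialog management module, which decides on the corresponding reply voice and/or device operation command to return. The voice interaction terminal receives the reply voice sent by the cloud server and plays it through its speaker module.
It should be noted that, while the reply voice is being played or within a time window after playback ends (for example, 5 seconds), the voice interaction terminal keeps recording. If it records a non-command voice from the user (an utterance not intended to make the voice interaction system perform a function, for example a pure emotional outburst; it usually contains no wake word and does not actively wake the device), it uploads that voice to the evaluation feature extraction module of the cloud server for analysis. From the text content, the module determines that the voice is not a command voice but carries the user's evaluation (emotion) of the previous reply voice, and outputs the evaluation to the dialog management module, which uses it to adjust how often the previous reply voice appears.
It can be understood that, although this analysis is also based on text content, non-command voices are not commands and differ greatly from command voices in content, so the evaluation feature extraction module is connected to a second text database (the evaluation database in FIG. 4) that is separate from the one used by the dialog management module. During evaluation analysis, non-text features (for example, the duration of the non-command voice, or its loudness difference from the command voice or the reply voice) can be checked first, and the text content examined only after certain conditions are met. Compared with the reply voice, users are much less sensitive to the timeliness and accuracy of evaluation feature extraction, so command voices and non-command voices are preferably handled with different strategies (for example, different databases can be used, a more complex recognition mode can be applied to non-command voices, and the real-time requirement can be relaxed).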
在本实施例中,可以理解的是,实施主体可以为服务器,也可以为终端语音设备(此时在本地执行语音识别和对话管理等相关处理)。
本实施例提供的语音交互方法的处理过程可参见图4和图5所示:用户发出指令语音,例如“小美小美(唤醒词),现在是几点”,在播放回复语音“现在已经是……明天继续加油!”期间或者播放完毕后5秒的时间窗口内,检测到来自用户的评价语音输入“不加油”。由此可见,该用户对于该回复语音的态度是不满意,则后续可以降低该回复语音的出现频率。
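Lowering the frequency of a reply voice can be modeled as lowering its selection weight in the reply voice library for that command. The following is a minimal sketch under stated assumptions: the `ReplyLibrary` class, the multiplicative update, and the factor of 2 are hypothetical and are not prescribed by this application.

```python
import random


class ReplyLibrary:
    """Candidate replies for one command voice, each with a selection weight."""

    def __init__(self, replies):
        self.weights = {reply: 1.0 for reply in replies}

    def probability(self, reply):
        # Chance that `reply` is selected for the next identical command.
        return self.weights[reply] / sum(self.weights.values())

    def feedback(self, reply, positive, factor=2.0):
        # Negative evaluation divides the weight, positive multiplies it.
        self.weights[reply] *= factor if positive else 1.0 / factor

    def choose(self, rng=random):
        replies = list(self.weights)
        return rng.choices(replies, weights=[self.weights[r] for r in replies], k=1)[0]


library = ReplyLibrary([
    "It is 6 p.m. Keep it up tomorrow!",
    "It is 6 p.m.",
])
before = library.probability("It is 6 p.m. Keep it up tomorrow!")
library.feedback("It is 6 p.m. Keep it up tomorrow!", positive=False)
after = library.probability("It is 6 p.m. Keep it up tomorrow!")
next_reply = library.choose()
```

After one negative evaluation the disliked reply is still available, but it is selected less often than before, which mirrors the "reduce the frequency" behavior described in this embodiment.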
实施例二
参见图7所示,本实施例二与实施例一的主要区别在于,评价特征提取模块在语音交互终端上,而非在服务器上。
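In this embodiment, the evaluation feature extraction module may extract the text content as the basis for the output evaluation, or it may extract only a few non-text dimensions such as intonation and loudness. Using only non-text dimensions such as intonation and loudness lowers the hardware requirements on the terminal.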
根据上面的技术方案可知,本申请实施例可以根据用户针对回复语音的评价语音反馈,调整话术的策略,从而使得调整后的回复话术更符合用户习惯或需求。
基于相同的发明构思,本申请另一实施例提供了一种语音交互处理装置,参见图8,本实施例提供的语音交互处理装置,包括:接收模块21 和处理模块22,其中:
接收模块21,用于接收在回复语音的播放过程中或播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;
处理模块22,用于根据所述评价语音,确定对应所述指令语音的对话策略。
可以理解的是,本实施例包括两个并列的方案:
方案1:接收在回复语音的播放过程中用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;根据所述评价语音,确定对应所述指令语音的对话策略
方案2:接收在回复语音播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;根据所述评价语音,确定对应所述指令语音的对话策略。
由于本实施例提供的语音交互处理装置可以用于执行上述实施例所述的语音交互处理方法,其工作原理和有益效果类似,故此处不再详述,具体内容可参见上述实施例的介绍。
基于相同的发明构思,本申请另一实施例提供了一种智能设备,该智能设备包括如上面实施例所述的语音交互处理装置。
在本实施例中,可以理解的是,由于上述语音交互处理装置的处理过程可以在智能设备上实现,因此,本实施例提供了一种包含所述语音交互处理装置的智能设备,进而实现上述语音交互处理过程。可以理解的是,智能设备可以是各种智能电器,如智能音箱、智能电冰箱、智能电饭煲、智能热水器、智能电视、智能洗衣机等等,本实施例对此不做限定。
由于本实施例提供的智能设备包括上面实施例所述的语音交互处理装置,因此其工作原理和有益效果类似,故此处不再详述,具体内容可参见上述实施例的介绍。
基于相同的发明构思,本申请另一实施例提供了一种终端设备,该终端设备包括如上面实施例所述的语音交互处理装置。
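In this embodiment, it can be understood that, because the processing of the above voice interaction processing apparatus can be carried out on a terminal device, this embodiment provides a terminal device that includes the voice interaction processing apparatus and thereby implements the voice interaction processing described above. The terminal device may be any of various devices, such as a mobile phone, a pad, a smart watch or a notebook computer; this embodiment places no limitation on this.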
由于本实施例提供的终端设备包括上面实施例所述的语音交互处理装置,因此其工作原理和有益效果类似,故此处不再详述,具体内容可参见上述实施例的介绍。
基于相同的发明构思,本申请另一实施例提供了一种服务器,该服务器包括如上面实施例所述的语音交互处理装置。
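In this embodiment, it can be understood that, because the processing of the above voice interaction processing apparatus can be carried out on a server, this embodiment provides a server that includes the voice interaction processing apparatus and thereby implements the voice interaction processing described above. The server may be a cloud server or another kind of server; this embodiment places no limitation on this. A cloud server has advantages such as fast processing and high security.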
由于本实施例提供的服务器包括上面实施例所述的语音交互处理装置,因此其工作原理和有益效果类似,故此处不再详述,具体内容可参见上述实施例的介绍。
基于相同的发明构思,本申请又一实施例提供了一种智能设备,参见图9,所述智能设备具体包括如下内容:处理器301、存储器302、通信接口303和通信总线304;
其中,所述处理器301、存储器302、通信接口303通过所述通信总线304完成相互间的通信;所述通信接口303用于实现各建模软件及智能制造装备模块库等相关设备之间的传输;
所述处理器301用于调用所述存储器302中的计算机程序,所述处理器执行所述计算机程序时实现上述语音交互处理方法的全部步骤,例如,所述处理器执行所述计算机程序时实现下述步骤:接收在回复语音的播放过程中或播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;根据所述评价语音,确定对应所述指令语音的对话策略。
It can be understood that the refined and extended functions that the computer program can perform are described in the embodiments above.
可以理解的是,智能设备可以是各种智能电器,如智能音箱、智能电冰箱、智能电饭煲、智能热水器、智能电视、智能洗衣机等等,本实施例对此不做限定。
基于相同的发明构思,本申请又一实施例提供了一种终端设备,参见图10,所述终端设备具体包括如下内容:处理器401、存储器402、通信接口403和通信总线404;
其中,所述处理器401、存储器402、通信接口403通过所述通信总线404完成相互间的通信;所述通信接口403用于实现各建模软件及智能制造装备模块库等相关设备之间的传输;
所述处理器401用于调用所述存储器402中的计算机程序,所述处理器执行所述计算机程序时实现上述语音交互处理方法的全部步骤,例如,所述处理器执行所述计算机程序时实现下述步骤:接收在回复语音的播放过程中或播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;根据所述评价语音,确定对应所述指令语音的对话策略。
可以理解的是,所述计算机程序可以执行的细化功能和扩展功能可参照上面实施例的描述。
可以理解的是,终端设备可以是各种设备,如手机、pad、智能手表、笔记本等等,本实施例对此不做限定。
基于相同的发明构思,本申请又一实施例提供了一种服务器,参见图11,所述服务器具体包括如下内容:处理器501、存储器502、通信接口503和通信总线504;
其中,所述处理器501、存储器502、通信接口503通过所述通信总线504完成相互间的通信;所述通信接口503用于实现各建模软件及智能制造装备模块库等相关设备之间的传输;
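The processor 501 is configured to call the computer program in the memory 502. When executing the computer program, the processor implements all the steps of the above voice interaction processing method; for example, it implements the following steps: receiving an evaluation voice that the user gives for a reply voice during playback of the reply voice or within a time window after playback ends, the reply voice being a voice that responds to a command voice issued by the user, and the command voice being a voice that issues a command; and determining, according to the evaluation voice, the dialog strategy corresponding to the command voice.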
可以理解的是,所述计算机程序可以执行的细化功能和扩展功能可参照上面实施例的描述。
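In this embodiment, the server may be a cloud server or another kind of server; this embodiment places no limitation on this. A cloud server has advantages such as fast processing and high security.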
基于相同的发明构思,本申请又一实施例提供了一种非暂态计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器执行时实现上述语音交互处理方法的全部步骤,例如,所述处理器执行所述计算机程序时实现下述步骤:接收在回复语音的播放过程中或播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;根据所述评价语音,确定对应所述指令语音的对话策略。
可以理解的是,所述计算机程序可以执行的细化功能和扩展功能可参照上面实施例的描述。
此外,上述的存储器中的逻辑指令可以通过软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本申请实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的语音交互处理方法。
在本申请的描述中,需要说明的是,术语“上”、“下”等指示的方位或位置关系为基于附图所示的方位或位置关系,仅是为了便于描述本申请和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本申请的限制。除非另有明确的规定和限定,术语“安装”、“相连”、“连接”应做广义理解,例如,可以是固定连接,也可以是可拆卸连接,或一体地连接;可以是机械连接,也可以是电连接;可以是直接相连,也可以通过中间媒介间接相连,可以是两个元件内部的连通。对于本领域的普通技术人员而言,可以根据具体情况理解上述术语在本申请中的具体含义。
此外,在本申请中,诸如“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括至少一个该特征。在本申请的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有明确具体的限定。
此外,在本申请中,诸如第一和第二等之类的关系术语仅仅用来将一个实体或者操作与另一个实体或操作区分开来,而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
此外,在本说明书的描述中,参考术语“一个实施例”、“一些实施例”、“示例”、“具体示例”、或“一些示例”等的描述意指结合该实施例或示例描述的具体特征、结构、材料或者特点包含于本申请的至少一个实施例或示例中。在本说明书中,对上述术语的示意性表述不必须针对的是相同的实施例或示例。而且,描述的具体特征、结构、材料或者特点可以在任一个或多个实施例或示例中以合适的方式结合。此外,在不相互矛盾的情况下,本领域的技术人员可以将本说明书中描述的不同实施例或示例以及不同实施例或示例的特征进行结合和组合。
最后应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (27)

  1. 一种语音交互处理方法,其特征在于,包括:
    接收在回复语音的播放过程中用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;
    根据所述评价语音,确定对应所述指令语音的对话策略。
  2. 一种语音交互处理方法,其特征在于,包括:
    接收在回复语音播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;
    根据所述评价语音,确定对应所述指令语音的对话策略。
  3. 根据权利要求1或2所述的语音交互处理方法,其特征在于,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    根据所述评价语音,调整后续响应所述指令语音时所述回复语音出现的频率。
  4. 根据权利要求1或2所述的语音交互处理方法,其特征在于,所述回复语音为基于用户发出的指令语音通过查询对话数据库确定的回复语音;
    相应地,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    根据所述评价语音,查询评价数据库,确定所述评价语音中包含的反馈信息,并根据所述反馈信息,确定对应所述指令语音的对话策略;
    其中,所述评价数据库和所述对话数据库独立设置,所述评价数据库设置在智能设备侧,且所述评价数据库的内容少于所述对话数据库。
  5. 根据权利要求1或2所述的语音交互处理方法,其特征在于,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    确定所述评价语音中包含带有负面色彩的关键词且所述关键词与降低播放时长有关,则降低响应所述指令语音的回复语音的播放时长和/或冗余度。
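  6. The voice interaction processing method according to claim 5, wherein reducing the playback duration and/or redundancy of the reply voice responding to the command voice specifically comprises:
    determining a first duration for which the reply voice had been played when the evaluation voice was received, and adjusting, according to the first duration, the playback duration of the reply voice corresponding to the command voice;
    and/or,
    determining a first ratio of the first duration, for which the reply voice had been played when the evaluation voice was received, to the total duration of the reply voice, and adjusting, according to the first ratio, the redundancy of the reply voice corresponding to the command voice.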
  7. 根据权利要求6所述的语音交互处理方法,其特征在于,根据所述第一时长调整对应所述指令语音的回复语音的播放时长,具体包括下述内容中的一项或多项:
    控制后续与所述指令语音相同的指令语音对应的回复语音的播放时长小于或等于所述第一时长;
    控制与第一用户发出的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长;其中,所述第一用户为发出所述指令语音的用户;
    控制与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长小于或等于所述第一时长。
  8. 根据权利要求1或2所述的语音交互处理方法,其特征在于,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    确定所述评价语音中包含带有负面色彩的关键词且所述关键词与用户偏好有关,则降低所述回复语音作为所述指令语音的响应的使用频率或更换新的回复语音作为所述指令语音的响应。
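  9. The voice interaction processing method according to claim 8, wherein reducing the frequency with which the reply voice is used as the response to the command voice, or replacing it with a new reply voice as the response to the command voice, specifically comprises:
    reducing the usage frequency of the reply voice; wherein reducing the usage frequency of the reply voice means that, when the command voice is responded to in a subsequent time period, the probability of selecting the reply voice as the response from the reply voice library corresponding to the command voice is reduced;
    or, reducing the usage frequency of reply voices whose playback length and/or redundancy is greater than or equal to that of the reply voice; wherein this means that, when the command voice is subsequently responded to, the probability of selecting, from the reply voice library corresponding to the command voice, a reply voice whose playback length and/or redundancy is greater than or equal to that of the reply voice is reduced;
    or, selecting, from the reply voice library corresponding to the command voice, a reply voice different from the reply voice for playback;
    or, according to the topic that the user wishes to switch to as carried in the negative feedback information, selecting, from the reply voice library corresponding to the command voice, a reply voice matching the topic for playback.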
  10. 根据权利要求1或2所述的语音交互处理方法,其特征在于,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    确定所述评价语音中包含带有正面色彩的关键词且所述关键词与保持或提高播放时长有关,则保持或提高响应所述指令语音的回复语音的播放时长和/或冗余度。
  11. 根据权利要求10所述的语音交互处理方法,其特征在于,保持或提高响应所述指令语音的回复语音的播放时长和/或冗余度,具体包括下述中的任意一项或多项:
    保持或提高所述回复语音的播放时长和/或冗余度;其中,回复语音的冗余度是指回复语音中非回复指令语音所必需的语音内容与回复语音全部语音内容的比值;
    保持或提高与所述指令语音对应的回复语音库中的部分或所有回复语音的播放时长和/或冗余度;
    保持或提高与第一用户发出的所有或部分指令语音对应的回复语音的播放时长和/或冗余度;其中,所述第一用户为发出所述指令语音的用户;
    保持或提高与所述指令语音在同一指令语音组中的所有或部分指令语音对应的回复语音的播放时长和/或冗余度;
    从与所述指令语音对应的回复语音库中选择与所述回复语音的播放时长和/或冗余度的差值在预设范围内的回复语音进行播放。
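  12. The voice interaction processing method according to claim 10, wherein determining, according to the evaluation voice, the dialog strategy corresponding to the command voice specifically comprises:
    determining that the evaluation voice contains a keyword with a positive connotation and that the keyword relates to maintaining or increasing the usage frequency, and then maintaining or increasing the frequency with which the reply voice is used as the response to the command voice.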
  13. 根据权利要求12所述的语音交互处理方法,其特征在于,保持或提高所述回复语音作为所述指令语音的响应的使用频率,具体包括下述内容中的一项或多项:
    增加所述回复语音的使用频率;其中,增加所述回复语音的使用频率是指在后续时间段内响应所述指令语音时,从回复语音库中选择所述回复语音作为响应的概率增加;
    增加主题与所述回复语音接近的回复语音的使用频率;
    增加播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率;其中,增加播放长度和/或冗余度大于或等于所述回复语音的回复语音使用频率是指在后续响应所述指令语音时,从与所述指令语音对应的回复语音库中选择播放长度和/或冗余度大于或等于所述回复语音的回复语音作为响应的概率增加。
  14. 根据权利要求5或8所述的语音交互处理方法,其特征在于,确定所述评价语音中包含带有负面色彩的关键词,具体包括下述内容中的一项或多项:
    确定所述评价语音中携带有第一信息,所述第一信息是指与第一数据库中的评语信息相匹配的信息;其中,所述第一数据库中存储有负面评语信息;
    确定所述评价语音中携带有第二信息,所述第二信息是指与所述回复语音中包含的信息具有相反含义的信息;
    确定所述评价语音对应的语调与第一语调库中的语调信息相匹配,所述第一语调库中存储有带有负面情绪的语调;
    确定所述评价语音对应的响度大于或等于第一响度。
  15. 根据权利要求10或12所述的语音交互处理方法,其特征在于,确定所述评价语音中包含带有正面色彩的关键词,具体包括下述内容中的一项或多项:
    确定所述评价语音中携带有第三信息,所述第三信息是指与第二数据库中的评语信息相匹配的信息;其中,所述第二数据库中存储有正面评语信息;
    确定所述评价语音中携带有第四信息,所述第四信息是指与所述回复语音中包含的信息具有相同或类似含义的信息;
    确定所述评价语音对应的语调与第二语调库中的语调信息相匹配,所述第二语调库中存储有带有正面情绪的语调;
    确定所述评价语音对应的响度小于第一响度。
  16. 根据权利要求1~13任一项所述的语音交互处理方法,其特征在于,还包括:
    确定接收所述评价语音时对应的时间段信息;
    相应地,在后续与所述时间段信息相对应的时间段,根据所述评价语音,确定对应所述指令语音的对话策略。
  17. 根据权利要求1~13任一项所述的语音交互处理方法,其特征在于,在根据所述评价语音,确定对应所述指令语音的对话策略之前,所述方法还包括:
    确定所述评价语音是否为有效的评价语音,具体包括:
    确定所述评价语音是否不包含唤醒词,和/或,确定所述评价语音的时长是否小于第一时长,和/或,所述评价语音与所述指令语音或所述回复语音的响度差是否大于第一差值,若是,则确定所述评价语音为有效的评价语音。
  18. 根据权利要求1或2所述的语音交互处理方法,其特征在于,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    确定所述指令语音的长度,根据所述指令语音的长度对所述回复语音的播放时长进行调整,或,根据所述指令语音的长度对所述回复语音的冗余度进行调整。
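  19. The voice interaction processing method according to claim 18, wherein adjusting the playback duration of the reply voice according to the length of the command voice comprises:
    controlling, according to the length of the command voice, the reply voice to stop playing when its playback duration matches the length of the command voice;
    or,
    taking, according to the length of the command voice, an excerpt from the unplayed portion of the reply voice and continuing playback with it, so that the total playback duration of the adjusted reply voice matches the length of the command voice;
    or,
    increasing, according to the length of the command voice, the playback speed of the unplayed portion of the reply voice, so that the total playback duration of the adjusted reply voice matches the length of the command voice.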
  20. 根据权利要求18所述的语音交互处理方法,其特征在于,根据所述指令语音的长度对所述回复语音的冗余度进行调整,包括:
    根据所述指令语音的长度对应的长度范围区间,确定所述回复语音的冗余度对应的冗余度。
  21. 根据权利要求1或2所述的语音交互处理方法,其特征在于,根据所述评价语音,确定对应所述指令语音的对话策略,具体包括:
    确定所述指令语音的长度,根据所述指令语音的长度和接收所述评价语音时所述回复语音已播放的第一时长,对所述回复语音的播放时长和/或冗余度进行调整。
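  22. The voice interaction processing method according to claim 21, wherein adjusting the playback duration and/or redundancy of the reply voice according to the length of the command voice and the first duration for which the reply voice had been played when the evaluation voice was received comprises any one of the following:
    adjusting the playback duration and/or redundancy of the reply voice according to the average of the length of the command voice and the first duration;
    adjusting the playback duration and/or redundancy of the reply voice according to the minimum of the length of the command voice and the first duration;
    adjusting the playback duration and/or redundancy of the reply voice according to the sum of the length of the command voice and the first duration;
    determining a target duration of the reply voice from the length of the command voice and the first duration using a first relation model or a second relation model, and adjusting the playback duration and/or redundancy of the reply voice according to the target duration; wherein the first relation model comprises: $T = k_1(\alpha T_1 + \beta T_2)$, where $T$ denotes the target duration, $T_1$ the length of the command voice, $T_2$ the first duration, $\alpha$ the weight of the command voice, $\beta$ the weight of the first duration, and $k_1$ a first adjustment coefficient;
    the second relation model comprises: $T_0 = k_2(\alpha \ln T_1 + \beta \ln T_2)$, where $T_0$ denotes the target duration, $T_1$ the length of the command voice, $T_2$ the first duration, $\alpha$ the weight of the command voice, $\beta$ the weight of the first duration, and $k_2$ a second adjustment coefficient.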
  23. 根据权利要求2所述的语音交互处理方法,其特征在于,所述时间窗口与所述回复语音的播放过程的至少一部分重合,所述评价语音的至少一部分落入所述时间窗口中与所述回复语音的播放过程相重合的区间。
  24. 一种语音交互处理装置,其特征在于,包括:
    接收模块,用于接收在回复语音的播放过程中用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;
    处理模块,用于根据所述评价语音,确定对应所述指令语音的对话策略。
  25. 一种语音交互处理装置,其特征在于,包括:
    接收模块,用于接收在播放结束后的时间窗口内用户针对回复语音的评价语音;所述回复语音为响应于用户发出的指令语音的语音;所述指令语音为下发指令的语音;
    处理模块,用于根据所述评价语音,确定对应所述指令语音的对话策略。
  26. 一种电子设备,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,其特征在于,所述处理器执行所述程序时实现如权利要求1至23任一项所述语音交互处理方法的步骤。
  27. 一种非暂态计算机可读存储介质,其上存储有计算机程序,其特征在于,该计算机程序被处理器执行时实现如权利要求1至23任一项所述语音交互处理方法的步骤。
PCT/CN2020/140213 2020-12-14 2020-12-28 语音交互处理方法、装置、电子设备及存储介质 WO2022126734A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011474827.8 2020-12-14
CN202011474827.8A CN112463108B (zh) 2020-12-14 2020-12-14 语音交互处理方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022126734A1 true WO2022126734A1 (zh) 2022-06-23

Family

ID=74804210

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/140213 WO2022126734A1 (zh) 2020-12-14 2020-12-28 语音交互处理方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN112463108B (zh)
WO (1) WO2022126734A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115268324A (zh) * 2022-07-25 2022-11-01 青岛海尔科技有限公司 指令的修正方法及装置、存储介质及电子装置

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2008233305A (ja) * 2007-03-19 2008-10-02 Toyota Central R&D Labs Inc 音声対話装置、音声対話方法及びプログラム
CN108053826A (zh) * 2017-12-04 2018-05-18 泰康保险集团股份有限公司 用于人机交互的方法、装置、电子设备及存储介质
CN108388926A (zh) * 2018-03-15 2018-08-10 百度在线网络技术(北京)有限公司 语音交互满意度的确定方法及设备
CN108536802A (zh) * 2018-03-30 2018-09-14 百度在线网络技术(北京)有限公司 基于儿童情绪的交互方法及装置
CN109036388A (zh) * 2018-07-25 2018-12-18 李智彤 一种基于对话设备的智能语音交互方法
CN109712618A (zh) * 2018-12-06 2019-05-03 珠海格力电器股份有限公司 一种语音服务的控制方法、装置、存储介质及空调
CN110637339A (zh) * 2017-05-15 2019-12-31 苹果公司 使用隐式反馈优化数字助理的对话策略决策
CN111488435A (zh) * 2019-01-28 2020-08-04 宝马股份公司 人工智能对话方法和装置、聊天机器人和存储介质
CN111881254A (zh) * 2020-06-10 2020-11-03 百度在线网络技术(北京)有限公司 话术生成方法、装置、电子设备及存储介质

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9928011D0 (en) * 1999-11-27 2000-01-26 Ibm Voice processing system
JP2008250992A (ja) * 2007-03-07 2008-10-16 Sanyo Electric Co Ltd 音データ処理装置
US20080221876A1 (en) * 2007-03-08 2008-09-11 Universitat Fur Musik Und Darstellende Kunst Method for processing audio data into a condensed version
CN101075435B (zh) * 2007-04-19 2011-05-18 深圳先进技术研究院 一种智能聊天系统及其实现方法
JP6400445B2 (ja) * 2014-11-27 2018-10-03 Kddi株式会社 会話分析装置、会話分析システム、会話分析方法及び会話分析プログラム
CN105334743B (zh) * 2015-11-18 2018-10-26 深圳创维-Rgb电子有限公司 一种基于情感识别的智能家居控制方法及其系统
CN106601257B (zh) * 2016-12-31 2020-05-26 联想(北京)有限公司 一种声音识别方法、设备和第一电子设备
CN106992012A (zh) * 2017-03-24 2017-07-28 联想(北京)有限公司 语音处理方法及电子设备
CN107918653B (zh) * 2017-11-16 2022-02-22 百度在线网络技术(北京)有限公司 一种基于喜好反馈的智能播放方法和装置
CN108257597A (zh) * 2017-12-28 2018-07-06 合肥凯捷技术有限公司 一种基于语音识别的音频数据检索系统
CN110308660B (zh) * 2019-06-06 2020-12-22 美的集团股份有限公司 智能设备控制方法及装置
CN110288990B (zh) * 2019-06-12 2021-07-20 深圳康佳电子科技有限公司 一种语音控制优化方法、存储介质及智能终端
CN111429899A (zh) * 2020-02-27 2020-07-17 深圳壹账通智能科技有限公司 基于人工智能的语音响应处理方法、装置、设备及介质

Also Published As

Publication number Publication date
CN112463108A (zh) 2021-03-09
CN112463108B (zh) 2023-03-31

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 14.11.2023)