CN117253478A - Voice interaction method and related device - Google Patents


Info

Publication number
CN117253478A
Authority
CN
China
Prior art keywords
text
voice
reply
voice data
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311037799.7A
Other languages
Chinese (zh)
Inventor
高微
俞焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311037799.7A priority Critical patent/CN117253478A/en
Publication of CN117253478A publication Critical patent/CN117253478A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00: Speech recognition
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 2015/225: Feedback of the input speech

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

After a voice reply request is triggered based on the acquired voice to be replied, the target reply text is segmented, following the order of the reply characters it contains, into n text fragments to be processed, and each time the i-th text fragment to be processed is obtained, it is converted into the i-th voice data fragment. While the voice data fragments are being produced, audio files are generated from them one after another, in the arrangement order of the n voice data fragments, and played in sequence. To achieve sequential playback, the audio file corresponding to the (j+1)-th voice data fragment is played only after the audio file corresponding to the j-th voice data fragment has finished playing. In this way, the reply speed of voice interaction is improved, the user's waiting time is shortened, and the user experience is improved.

Description

Voice interaction method and related device
Technical Field
The present disclosure relates to the field of computers, and in particular, to a voice interaction method and related apparatus.
Background
Against the broad background of artificial intelligence, human-computer interaction has always been a hot topic. Voice interaction is an important branch of human-computer interaction with wide application value, and it is already used in many scenarios, such as intelligent interactive devices, intelligent customer service, and telemarketing.
During voice interaction, the voice input by the user is recognized and replied to. The usual reply mode is to generate a reply audio file from the voice input by the user and play it to the user, thereby realizing voice interaction with the user.
However, with the current voice interaction approach, especially when the reply text represented by the audio file is long, the reply to the user is slow, the reply delay is long, the user waits a long time, and the user experience is poor.
Disclosure of Invention
To solve the above technical problems, the present application provides a voice interaction method and a related apparatus, which can improve the reply speed of voice interaction, shorten the user's waiting time, and improve the user experience.
The embodiments of the present application disclose the following technical solutions:
in one aspect, an embodiment of the present application provides a voice interaction method, where the method includes:
acquiring a voice to be replied;
triggering a voice reply request based on the voice to be replied, the voice reply request including a voice text corresponding to the voice to be replied;
segmenting a target reply text, according to the character order of the reply characters included in the target reply text, into n text fragments to be processed, and, when the i-th text fragment to be processed is obtained by segmentation, converting the i-th text fragment to be processed into an i-th voice data fragment, wherein the target reply text is a text generated based on the voice text for replying to the voice to be replied, i is a positive integer greater than 0 and less than or equal to n, and the arrangement order of the n voice data fragments obtained by conversion is the same as that of the n text fragments to be processed;
in the process of obtaining the voice data fragments from the text fragments to be processed, generating audio files from the obtained voice data fragments one after another according to the arrangement order of the n voice data fragments, and playing the generated audio files, wherein, while the generated audio files are being played, the audio file corresponding to the (j+1)-th voice data fragment is played after the audio file corresponding to the j-th voice data fragment has finished playing, j being a positive integer greater than 0 and less than n.
In one aspect, an embodiment of the present application provides a voice interaction apparatus, which includes an acquisition unit, a request unit, a conversion unit, a generation unit, and a playing unit:
the acquisition unit is configured to acquire a voice to be replied;
the request unit is configured to trigger a voice reply request based on the voice to be replied, the voice reply request including a voice text corresponding to the voice to be replied;
the conversion unit is configured to segment a target reply text, according to the character order of the reply characters included in the target reply text, into n text fragments to be processed, and, when the i-th text fragment to be processed is obtained by segmentation, convert the i-th text fragment to be processed into an i-th voice data fragment, wherein the target reply text is a text generated based on the voice text for replying to the voice to be replied, i is a positive integer greater than 0 and less than or equal to n, and the arrangement order of the n voice data fragments obtained by conversion is the same as that of the n text fragments to be processed;
the generation unit is configured to generate, in the process of obtaining the voice data fragments from the text fragments to be processed, audio files from the obtained voice data fragments one after another according to the arrangement order of the n voice data fragments;
the playing unit is configured to play the generated audio files, wherein, while the generated audio files are being played, the audio file corresponding to the (j+1)-th voice data fragment is played after the audio file corresponding to the j-th voice data fragment has finished playing, j being a positive integer greater than 0 and less than n.
In one aspect, embodiments of the present application provide a computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of the preceding aspects according to instructions in the computer program.
In one aspect, embodiments of the present application provide a computer-readable storage medium for storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of the preceding aspects.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding aspects.
According to the above technical solution, after a voice reply request is triggered based on the acquired voice to be replied, the target reply text is segmented, according to the character order of the reply characters it contains, into n text fragments to be processed, and each time the i-th text fragment to be processed is obtained by segmentation, it is converted into the i-th voice data fragment. The voice reply request includes a voice text corresponding to the voice to be replied; the target reply text is a text generated based on the voice text for replying to the voice to be replied; i is a positive integer greater than 0 and less than or equal to n; and the arrangement order of the n voice data fragments obtained by conversion is the same as that of the n text fragments to be processed. Segmenting the target reply text and converting it into voice data fragments one after another takes markedly less time per fragment than converting the complete target reply text into voice data. Therefore, in the process of obtaining voice data fragments from the text fragments to be processed, audio files can be generated from the obtained fragments one after another, in the arrangement order of the n voice data fragments, and played in sequence, so that the reply of the voice interaction device is obtained without waiting for the voice data corresponding to the complete target reply text to be generated, which improves the reply speed of voice interaction. While the generated audio files are being played, to achieve sequential playback, the audio file corresponding to the (j+1)-th voice data fragment is played after the audio file corresponding to the j-th voice data fragment has finished playing, where j is a positive integer greater than 0 and less than n. Thus, with the technical solution provided by the present application, the reply speed of voice interaction is improved, the user's waiting time is shortened, and the user experience is improved.
Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without inventive effort.
Fig. 1 is an application scenario architecture diagram of a voice interaction method provided in an embodiment of the present application;
fig. 2 is a flowchart of a voice interaction method provided in an embodiment of the present application;
fig. 3 is a diagram illustrating a hardware architecture of a voice interaction method according to an embodiment of the present application;
FIG. 4 is an exemplary diagram of a first audio data queue provided in an embodiment of the present application;
FIG. 5 is an exemplary diagram of a second audio data queue provided in an embodiment of the present application;
fig. 6 is a process example diagram for obtaining a first audio data queue based on a second audio data queue according to an embodiment of the present application;
FIG. 7 is an exemplary diagram of a generative dialogue model returning the target reply text in streaming fashion according to an embodiment of the present application;
FIG. 8 is another exemplary diagram of a generative dialogue model returning the target reply text in streaming fashion according to an embodiment of the present application;
FIG. 9 is a diagram of an exemplary magic mirror according to an embodiment of the present application;
FIG. 10 is a flowchart of another voice interaction method according to an embodiment of the present application;
FIG. 11 is a block diagram of a voice interaction device according to an embodiment of the present application;
fig. 12 is a block diagram of a terminal according to an embodiment of the present application;
fig. 13 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
During voice interaction, the voice input by the user is recognized and replied to. The usual reply mode is to generate a reply audio file from the voice input by the user and play it to the user, thereby realizing voice interaction with the user.
However, in the current voice interaction mode, playback can only start after the complete audio file has been generated. When the reply text represented by the audio file is long, the total time spent waiting for the audio file to be generated and played is long, so the reply to the user is slow, the user waits a long time, and the user experience is poor.
To solve the above technical problem, an embodiment of the present application provides a voice interaction method that adopts segmented processing: a long target reply text is segmented into several text fragments to be processed, one after another. During segmentation, each time a text fragment to be processed is obtained, it is converted into a voice data fragment, so that audio files can be generated from the obtained voice data fragments one after another, in the arrangement order of the n voice data fragments, and played in sequence. Playback therefore does not have to wait until the audio file corresponding to the complete target reply text has been generated.
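For illustration only (not part of the claimed solution), the overall idea can be sketched as follows. This is a minimal sketch: the names split_reply, text_to_speech, and play_audio are hypothetical placeholders, and the two stub functions stand in for a real text-to-speech engine and a real sound-card interface.

```python
import re

def text_to_speech(fragment: str) -> bytes:
    # Stand-in for a real text-to-speech call (assumption: any engine
    # that maps a text fragment to audio bytes fits this role).
    return fragment.encode("utf-8")

def play_audio(audio: bytes) -> None:
    # Stand-in for handing an audio buffer to the sound card.
    print(f"playing {len(audio)} bytes")

def split_reply(target_reply_text: str) -> list[str]:
    # Segment at semantic separators (punctuation); the refined rules,
    # including a minimum fragment length, are described under S203.
    parts = re.split(r"(?<=[,.!?])", target_reply_text)
    return [p.strip() for p in parts if p.strip()]

def reply(target_reply_text: str) -> None:
    # Convert each fragment as soon as it is split off and play it,
    # instead of synthesizing audio for the whole text first.
    for fragment in split_reply(target_reply_text):
        play_audio(text_to_speech(fragment))

reply("I am the magic mirror, this is an answer. Do you need my help?")
```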
It should be noted that the voice interaction method provided in the embodiments of the present application can be applied to voice interaction scenarios, for example intelligent interactive devices, intelligent customer service, telemarketing, and the like, which are not limited by the embodiments of the present application. The voice interaction scenario may be realized in various languages, for example Chinese, English, or Korean, which is likewise not limited by the embodiments of the present application.
The voice interaction method provided by the embodiments of the present application can be executed by a computer device. The computer device may be the voice interaction device in any of the voice interaction scenarios, and the voice interaction device may be a terminal, including but not limited to a smart phone, a tablet computer, a notebook computer, a smart home appliance, a smart mirror, a vehicle-mounted terminal, a conversation robot, and the like.
Fig. 1 shows an application scenario architecture diagram of the voice interaction method; the scenario is described taking a smart phone as the voice interaction device.
The application scenario may include the smart phone 100 and the server 200. The smart phone 100 has an intelligent dialogue function, which may be implemented through voice recognition, intelligent reply, and the like, and the user can interact with the smart phone 100 by voice. The server 200 serves the intelligent dialogue function of the smart phone 100. The smart phone 100 and the server 200 may be connected directly or indirectly through wired or wireless communication, which is not limited here.
When the user needs to interact with the smart phone 100 by voice, the user can provide voice input through the smart phone 100, which acquires the voice input by the user. The voice acquired by the smart phone 100 may be called the voice to be replied, i.e., a voice input by the user that needs to be replied to.
The smart phone 100 can trigger a voice reply request based on the voice to be replied. The voice reply request is used to request a reply text for the voice to be replied, so that the user's voice can be replied to based on that text. Typically, the voice reply request is sent from the smart phone 100 to the server 200. The voice reply request includes a voice text corresponding to the voice to be replied, and the voice text may be a text that reflects the dialogue intention of the voice to be replied, so the server 200 can generate the target reply text based on the voice text and return it to the smart phone 100.
The smart phone 100 may segment the target reply text, according to the character order of the reply characters it contains, into n text fragments to be processed, and, when the i-th text fragment to be processed is obtained by segmentation, convert it into the i-th voice data fragment, where i is a positive integer greater than 0 and less than or equal to n, and the arrangement order of the n voice data fragments obtained by conversion is the same as that of the n text fragments to be processed. Taking n = 3 as an example, the n text fragments obtained by segmenting the target reply text are text fragment 1, text fragment 2, and text fragment 3. Each time a text fragment to be processed is obtained, it can be converted into a voice data fragment: text fragment 1 into voice data fragment 1, text fragment 2 into voice data fragment 2, and text fragment 3 into voice data fragment 3. The text fragments are arranged, from front to back, as text fragment 1, text fragment 2, text fragment 3, and the voice data fragments are correspondingly arranged as voice data fragment 1, voice data fragment 2, voice data fragment 3.
Segmenting the target reply text and converting it fragment by fragment takes markedly less time per voice data fragment than converting the complete target reply text into voice data. Therefore, once a voice data fragment is obtained, the smart phone 100 can generate an audio file from it, following the arrangement order of the n voice data fragments, and play the generated audio files in sequence. The reply of the voice interaction device is thus obtained without waiting for the voice data corresponding to the complete target reply text to be generated, which improves the reply speed of voice interaction.
While the generated audio files are being played, to achieve sequential playback, the audio file corresponding to the (j+1)-th voice data fragment is played after the audio file corresponding to the j-th voice data fragment has finished playing, where j is a positive integer greater than 0 and less than n. Thus, during voice interaction, the technical solution provided by the present application improves the reply speed, shortens the user's waiting time, and improves the user experience.
For example, with the above three voice data fragments and their arrangement order, after the conversion of voice data fragment 1 is completed, audio file 1 can be generated from voice data fragment 1 and played. After the conversion of voice data fragment 2 is completed, audio file 2 can then be generated from voice data fragment 2, and once audio file 1 of voice data fragment 1 has finished playing, audio file 2 of voice data fragment 2 is played next. This continues until the audio files of all voice data fragments have been played.
In the corresponding embodiment of fig. 1, the description is given by taking the example that the server determines the target reply text. In some possible implementations, the target reply text may also be determined by the voice interaction device itself, that is, the method provided by the embodiment of the application may be separately executed by the terminal serving as the voice interaction device.
It should be noted that, in the specific embodiment of the present application, relevant data such as user information may be involved in the whole process, and when the above embodiments of the present application are applied to specific products or technologies, individual consent or individual permission of the user needs to be obtained, and the collection, use and processing of relevant data need to comply with relevant laws and regulations and standards of relevant countries and regions.
It should be noted that the method provided in the embodiments of the present application may involve artificial intelligence technology, automatically interacting with the user by voice based on that technology to reply to the user's voice. Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
It can be appreciated that the voice interaction method provided by the embodiments of the present application may involve speech processing technology. The key technologies of Speech Technology are automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the best modes of human-computer interaction.
Natural language processing technology may also be involved in the reply process. Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph technologies. The target reply text may be segmented, for example, by text processing.
In addition, the target reply text may be generated based on a neural network model in the reply process, and machine learning may be involved in training the neural network model. Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration. In the embodiments of the present application, machine learning may be used to train the neural network model.
Next, the voice interaction method provided in the embodiments of the present application will be described with reference to the accompanying drawings, taking an example that the computer device is a voice interaction device. Referring to fig. 2, fig. 2 shows a flowchart of a voice interaction method, the method comprising:
s201, obtaining the voice to be replied.
The voice interaction device may be a device that interacts with the user by voice and implements an intelligent dialogue function, for example a question-answering function (answering various questions posed by the user, including common questions, knowledge questions, entertainment questions, etc.), a chat function (carrying out various kinds of small talk covering weather, news, sports, culture, entertainment, etc.), an intelligent recommendation function (recommending related products, services, articles, etc. according to the user's interests and needs), an emotion analysis function (analyzing the user's emotional state and responding and interacting accordingly), and the like.
When the user needs to interact with the voice interaction device by voice, the user can provide voice input through the device, for example by speaking the voice to be replied into its microphone. The voice interaction device thus acquires, from the user, the voice to be replied that requires a reply. To do so, the device may detect whether the user is providing voice input; when voice input from the user is detected, the detection ends and the voice to be replied is acquired.
It can be appreciated that the user does not interact with the voice interaction device all the time. To prevent the device from detecting continuously, wasting processing resources and consuming excessive power, the voice interaction device is usually in a dormant state. When the user needs to interact with it by voice, the device can be woken up with a wake-up word, which puts it into detection mode. Taking a smart mirror, which may be called a magic mirror, as an example of the voice interaction device, the wake-up word may be, for example, "hello magic mirror" or "magic mirror", and when the magic mirror recognizes the wake-up word, it enters detection mode.
In one possible implementation, prompt feedback may be given after the wake-up word is hit, such as: "I am here" or "How can I help you". After the device enters detection mode, its user interface (UI) can show the response change with a recording animation. If, after a period of time, no voice to be replied is detected from the user, detection ends and the UI returns to the default mode. If a voice to be replied is detected, the subsequent steps continue.
It should be noted that, based on the foregoing description, in one possible implementation the hardware architecture used by the method provided in the embodiments of the present application may be as shown in fig. 3, comprising a voice input detection module 301, a wake-up word recognition module 302, a processor 303, a screen peripheral 304, and a power supply module 305. The voice input detection module 301 detects whether the user is providing voice input; the wake-up word recognition module 302 recognizes the wake-up word to determine whether to wake up the voice interaction device; the processor 303 performs steps S201-S204, thereby implementing speech recognition, intelligent reply, and so on; the screen peripheral 304 provides the UI for various visual displays to the user; and the power supply module 305 supplies power to the voice input detection module 301, the wake-up word recognition module 302, the processor 303, and the screen peripheral 304 to ensure their normal operation.
S202, triggering a voice reply request based on the voice to be replied.
After acquiring the voice to be replied, the voice interaction device can trigger a voice reply request based on it. The voice reply request is used to request a reply text for the voice to be replied, so that the user's voice can be replied to based on that text.
To determine how to reply, the voice reply request generally includes a voice text corresponding to the voice to be replied. The voice text is a text that can reflect the dialogue intention of the voice to be replied, from which the reply text (i.e., the target reply text) for replying to the voice to be replied can be obtained.
The voice text corresponding to the voice to be replied may include the text of the voice to be replied itself, i.e., the text obtained by converting the voice through speech-to-text technology. In another possible implementation, the voice to be replied may be a follow-up question that continues an existing dialogue. In that case, in addition to the text of the voice to be replied itself, the voice text may include the text of the dialogue context of the voice to be replied, so that the target reply text is determined with richer information and the reply is more accurate.
It should be noted that, in the embodiments of the present application, the intelligent dialogue function may be served either by the voice interaction device itself or by a server. When the voice interaction device itself serves the intelligent dialogue function, the triggered voice reply request is processed by the device, the voice interaction method provided by the embodiments of the present application is executed by the voice interaction device alone, and the hardware architecture of fig. 3 can serve as the hardware of the intelligent voice interaction device. When a server serves the intelligent dialogue function of the voice interaction device, the triggered voice reply request is sent to the server and processed there; in that case the voice interaction method is executed cooperatively by the terminal acting as the voice interaction device and the server, and the hardware architecture of fig. 3 can serve as the overall hardware architecture for realizing voice interaction.
It should also be noted that, because the voice interaction device needs a certain amount of time to reply to the voice to be replied, after triggering the voice reply request the screen peripheral of the device can display a waiting UI, thereby maintaining the interaction with the user.
S203, segmenting the target reply text, according to the character order of the reply characters included in the target reply text, into n text fragments to be processed, and, when the i-th text fragment to be processed is obtained by segmentation, converting the i-th text fragment to be processed into the i-th voice data fragment.
The target reply text can be obtained based on the voice text in the voice reply request; it is generated based on the voice text and used for replying to the voice to be replied. The target reply text may be, for example, "I am the magic mirror, this is an answer. Do you need my help?".
It can be understood that the target reply text may be predicted from the voice text by a neural network model, and the embodiments of the present application do not limit the network structure of that model. In one possible implementation, the neural network model may be a generative dialogue model, i.e., the target reply text is generated by a generative dialogue model based on the voice text. The generative dialogue model may be a large-scale language model that achieves the ability to converse with humans by means of deep learning.
Generating the target reply text with a generative dialogue model enables a more intelligent chat function: a wider range of questions can be answered, and the answers are more logically coherent, so the user feels as if talking with a real person.
Each character included in the target reply text can be regarded as a reply character, and the reply characters, ordered according to the semantics to be expressed, form the target reply text. The voice interaction device can segment the target reply text, according to the character order of its reply characters, into n text fragments to be processed, and after each text fragment to be processed is obtained by segmentation, convert it into a voice data fragment, thereby realizing segmented text-to-speech streaming processing. For example, when the i-th text fragment to be processed is obtained by segmentation, it can be converted into the i-th voice data fragment, where i is a positive integer greater than 0 and less than or equal to n, and the arrangement order of the n voice data fragments obtained by conversion is the same as that of the n text fragments to be processed.
In the embodiments of the present application, converting the i-th text fragment to be processed into the i-th voice data fragment uses text-to-speech technology. To perform the conversion, the voice interaction device may call an Application Programming Interface (API) of a text-to-speech service and pass the i-th text fragment to that API as a parameter, so that the i-th text fragment to be processed is converted into the i-th voice data fragment.
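The patent does not name a particular text-to-speech service, so the following minimal sketch assumes a generic HTTP endpoint; the URL, request schema, and function name are hypothetical, and any engine that accepts a text fragment as a parameter and returns audio bytes fits the description above.

```python
import requests

TTS_API_URL = "https://tts.example.com/synthesize"  # hypothetical endpoint

def synthesize_fragment(fragment: str) -> bytes:
    # Pass one text fragment as a parameter to a text-to-speech API
    # and return the synthesized voice data fragment (raw audio bytes).
    resp = requests.post(TTS_API_URL, json={"text": fragment})
    resp.raise_for_status()
    return resp.content
```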
It should be noted that the target reply text is generally segmented according to some criterion. In the embodiments of the present application, several criteria are possible, and the way the target reply text is segmented differs according to the criterion adopted.
In some cases, users tend to prefer hearing a smooth, fluent reply, which requires that each voice data fragment played by the voice interaction device not be too short. In this case, the criterion may be a fragment length threshold. Accordingly, in one possible implementation, segmenting the target reply text into n text fragments to be processed according to the character order of its reply characters may be done on the basis of the fragment length threshold, so that the length of each of the n text fragments obtained is not less than the threshold. The fragment length threshold may be preset and may differ in different situations. It may be expressed as a number of reply characters; for example, with a threshold of 6 reply characters, every text fragment to be processed obtained by segmentation has a length of at least 6 reply characters.
Segmenting the target reply text on the basis of the fragment length threshold yields text fragments whose lengths meet the requirement, so that the user hears a smoother reply when the voice data fragments are later played, improving the user experience. At the same time, it avoids producing too many text fragments because the fragments are too short, which reduces the number of times text fragments must be processed and the consumption of processing resources.
In some cases, people normally speak sentence by sentence, possibly with a slight pause between two sentences. To make the speech replied by the voice interaction device conform better to the way a person normally speaks, so that it sounds more natural to the user, in one possible implementation the criterion may be semantic separators. Accordingly, segmenting the target reply text into n text fragments to be processed, according to the character order of its reply characters, may be done on the basis of semantic separators. A semantic separator is a symbol that divides different sentences based on semantics; in general, semantic separators may be punctuation marks such as commas, periods, question marks, and exclamation marks.
For example, if the target reply text is "I am the magic mirror, this is an answer. Do you need my help?", segmenting it according to the semantic separators yields the text fragments to be processed "I am the magic mirror", "this is an answer", and "Do you need my help".
Segmenting the target reply text on the basis of semantic separators produces text fragments that better match normal speaking habits, so that the replies of the voice interaction device conform better to the way people speak and sound more natural to the user, improving the user experience.
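A minimal sketch of separator-based segmentation follows, assuming punctuation marks as the semantic separators; the exact separator set and the function name are assumptions, since the patent only gives commas, periods, question marks, and exclamation marks as examples.

```python
import re

# Assumed separator set: ASCII and full-width comma, period,
# question mark, and exclamation mark.
SEPARATORS = r"[,.!?，。！？]"

def split_on_separators(target_reply_text: str) -> list[str]:
    # Cut at every semantic separator and drop the separators
    # themselves, keeping the character order of the reply characters.
    fragments = re.split(SEPARATORS, target_reply_text)
    return [f.strip() for f in fragments if f.strip()]

print(split_on_separators(
    "I am the magic mirror, this is an answer. Do you need my help?"))
# ['I am the magic mirror', 'this is an answer', 'Do you need my help']
```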
In other possible implementations, the criteria may be semantic separators together with a fragment length threshold. Accordingly, segmenting the target reply text into n text fragments to be processed according to the character order of its reply characters may be done on the basis of both the semantic separators and the fragment length threshold, thereby obtaining text fragments that satisfy the length threshold and pause where normal semantics pause.
For example, if the target reply text is "I am the magic mirror, this is an answer. Do you need my help?" and the fragment length threshold is 6 reply characters, segmenting it according to the semantic separators and the length threshold yields the text fragments to be processed "I am the magic mirror, this is an answer" and "Do you need my help".
Segmenting the target reply text on the basis of both the semantic separators and the fragment length threshold yields text fragments that satisfy the length threshold and pause in accordance with normal semantics. While keeping the reply close to human speaking habits, this avoids producing too many text fragments because the fragments are too short, which reduces the number of times text fragments must be processed and the consumption of processing resources.
In one possible implementation, segmenting the target reply text into n text fragments to be processed on the basis of the semantic separators and the fragment length threshold, according to the character order, may proceed as follows: traverse the reply characters of the target reply text in character order, and upon reaching a semantic separator, cut off a candidate text fragment at that separator. If the length of the candidate text fragment is greater than or equal to the fragment length threshold, the candidate is taken as a text fragment to be processed, and traversal continues until n text fragments to be processed are obtained. If the length of the candidate is smaller than the threshold, traversal of the reply characters continues; upon reaching the next semantic separator, it is checked whether the candidate text fragment obtained at that separator is greater than or equal to the threshold, and if so, the newly obtained candidate is taken as a text fragment to be processed; and so on, traversal continues until n text fragments to be processed are obtained.
For example, suppose the target reply text is "I am the magic mirror, this is an answer. Do you need my help?" and the fragment length threshold is 6 reply characters (the character counts below follow the original Chinese text). Traversing the reply characters in character order, upon reaching the comma after "I am the magic mirror", "I am the magic mirror" becomes a candidate text fragment. Its length of 4 reply characters is below the threshold, so traversal continues. Upon reaching the period after "this is an answer", "I am the magic mirror, this is an answer" becomes the new candidate text fragment; its length of 10 reply characters is above the threshold, so "I am the magic mirror, this is an answer" is cut from the target reply text as a text fragment to be processed. Traversal then continues, and upon reaching the question mark after "Do you need my help", "Do you need my help" becomes a candidate text fragment whose length of 10 is above the threshold, so "Do you need my help" is cut out as a text fragment to be processed.
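A minimal sketch of this traversal follows, under the same assumptions as the sketch above (the separator set and the name split_with_threshold are hypothetical; the threshold is counted in reply characters, matching the example, so the demonstration uses the original Chinese sentence).

```python
SEPARATORS = set("，。！？,.!?")  # assumed semantic separator set

def split_with_threshold(text: str, min_len: int = 6) -> list[str]:
    # Traverse the reply characters in order; at each semantic
    # separator, check whether the candidate fragment accumulated so
    # far has reached the fragment length threshold. If it has, emit
    # it as a text fragment to be processed; otherwise keep extending
    # it up to the next separator.
    fragments = []
    candidate = ""
    length = 0  # reply characters in the candidate, separators excluded
    for ch in text:
        candidate += ch
        if ch in SEPARATORS:
            if length >= min_len:
                fragments.append(candidate)
                candidate, length = "", 0
        else:
            length += 1
    if candidate.strip():
        fragments.append(candidate)  # trailing remainder, if any
    return fragments

print(split_with_threshold("我是魔镜，这是一段回答。有什么需要我帮助的吗？"))
# ['我是魔镜，这是一段回答。', '有什么需要我帮助的吗？']
```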
In one possible implementation, to facilitate managing the voice data fragments generated in streaming fashion, including storing and fetching them, the fragments obtained by streaming may be stored in a queue, which may be called the first audio data queue. The first audio data queue stores the voice data fragments whose conversion is complete. A queue is a common data storage structure that generally processes data on a first-in, first-out basis; insertion is usually performed at the tail of the queue and removal at the head. Accordingly, after the conversion of the i-th voice data fragment is completed, it can be pushed onto the tail of the first audio data queue. That is, each time the conversion of a voice data fragment is completed, the fragments are pushed into the first audio data queue from the tail, in the arrangement order of the n voice data fragments, which is the same as the arrangement order of the n text fragments to be processed.
In one possible implementation, not only the fragments whose conversion is complete but also the fragments still being converted may be stored in a queue. The latter queue may be called the second audio data queue; it stores the voice data fragments that are in the process of conversion. Accordingly, while the i-th text fragment to be processed is being converted into the i-th voice data fragment, the i-th voice data fragment is pushed onto the tail of the second audio data queue. The completion status must be recorded while the voice data fragments are stored; the completion status can be embodied by a status identifier, so each stored voice data fragment has a corresponding status identifier.
Whether the conversion of the i-th voice data fragment is complete can therefore be determined from its status identifier: if the status identifier is the completion identifier, the conversion is complete; if it is the incomplete identifier, the conversion is not yet complete. Once the conversion of the i-th voice data fragment is determined to be complete, it can be taken out from the head of the second audio data queue and inserted at the tail of the first audio data queue. These stored voice data fragments may be buffer data.
The status identifier marks the finish status of a voice data fragment and may be represented by a number, a symbol, a word, and so on. In one possible implementation it is represented by a word, for example the completion identifier as true and the incomplete identifier as false.
Storing the voice data fragments in queues makes them easy to manage. Meanwhile, recording the completion status of each fragment through the status identifier makes it easy to check whether a fragment has finished conversion, so the fragments can be played more accurately and their continuous playback is guaranteed.
The above process may be called segmented text-to-speech streaming processing: the returned voice data fragments are all streaming returns, and the completion status is recorded while the fragments are stored. The first audio data queue is shown in fig. 4 and contains the voice data fragments "I am the magic mirror" and "this is an answer". The second audio data queue is shown in fig. 5; it contains the voice data fragments "I am the magic mirror" and "this is an answer", each with a corresponding status identifier, the identifier of "I am the magic mirror" being true and that of "this is an answer" being false. In this case, the fragment "I am the magic mirror" can be taken out from the head of the second audio data queue; the process of obtaining the first audio data queue from the second audio data queue is shown in fig. 6. Likewise, once the status identifier of "this is an answer" becomes true, that fragment can also be taken out from the head of the second audio data queue. During this process, if conversion of a new voice data fragment begins, for example "Do you need my help" is being converted into a voice data fragment, that fragment is inserted at the tail of the second audio data queue with its status identifier set to false, and so on.
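A minimal sketch of this two-queue bookkeeping follows, assuming the conversion runs concurrently (for example in a background thread); the class and field names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
import threading

@dataclass
class SpeechFragment:
    text: str
    audio: bytes = b""
    finished: bool = False  # status identifier: true / false

second_queue: deque[SpeechFragment] = deque()  # fragments being converted
first_queue: deque[SpeechFragment] = deque()   # fragments fully converted
lock = threading.Lock()

def start_conversion(fragment_text: str) -> SpeechFragment:
    # Push the fragment onto the tail of the second queue as soon as
    # its conversion starts, with the incomplete identifier (false).
    frag = SpeechFragment(fragment_text)
    with lock:
        second_queue.append(frag)
    return frag

def drain_completed() -> None:
    # Only the fragment at the head of the second queue may move: this
    # preserves the arrangement order of the n voice data fragments.
    with lock:
        while second_queue and second_queue[0].finished:
            first_queue.append(second_queue.popleft())
```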
Step S203 provided by the embodiments of the present application realizes segmented text-to-speech streaming processing: each time a text fragment to be processed is obtained by segmentation, it can be converted into voice fragment data, so playback can begin directly from the obtained voice data fragments without waiting for the complete voice data corresponding to the target reply text, greatly reducing the user's waiting time. Beyond this, because the target reply text itself has a certain length, the voice interaction device may need a certain amount of time to acquire the complete target reply text. To further reduce the user's waiting time, before the target reply text is segmented into n text fragments to be processed according to the character order of its reply characters, a plurality of reply text pieces may be obtained one after another, according to the piece order of the plurality of reply text pieces included in the target reply text, with each reply text piece including at least one reply character; that is, streaming return of the reply is processed in segments. The length of a reply text piece can be set according to actual needs; it may, for example, be 1 reply character, i.e., the streaming return yields one reply character at a time.
For example, for the target reply text "I am the magic mirror, this is an answer. Do you need my help?", one reply character of the target reply text is returned at a time.
On this basis, segmenting the target reply text into n text fragments to be processed according to the character order of its reply characters may be done as follows: while the plurality of reply text pieces are being obtained one after another, segment the obtained reply characters in character order until the segmentation of the target reply text is complete, yielding the n text fragments to be processed.
Returning the target reply text in streaming fashion further increases the reply speed, reduces the user's waiting time, and improves the user experience.
In one possible implementation, to facilitate managing the reply characters obtained in streaming fashion, including storing and fetching them, the reply text pieces obtained by streaming may be stored in a queue, which may be called the character queue. Accordingly, obtaining the plurality of reply text pieces one after another, according to their piece order in the target reply text, may be done by pushing each obtained reply text piece onto the tail of the character queue, so that the order of the obtained reply characters in the character queue matches the character order. Then, while the plurality of reply text pieces are being obtained, the obtained reply characters are segmented from the head of the character queue, in character order, until the segmentation of the target reply text is complete and the n text fragments to be processed are obtained.
Take as an example a voice interaction device connected to a generative dialogue model that generates the target reply text and can return it in streaming fashion, as shown in fig. 7. For example, the target reply text is "I am the magic mirror, this is an answer. Do you need my help? ...". Following the character order of the reply characters described above, one reply character of the target reply text is returned at a time, and each returned character is pushed onto the tail of the character queue, as shown on the right side of fig. 7. While characters are pushed in at the tail, the obtained reply characters are segmented from the head of the character queue: for example, the reply characters are traversed from the head of the queue, and a sentence is recognized when a punctuation mark is reached. For instance, upon traversing to the comma, the first sentence "I am the magic mirror" is obtained, so the first sentence is popped from the head of the character queue; after popping, the text-to-speech API is called with the first sentence passed to it as a parameter. After this process, the character queue changes to the state shown in fig. 8: the characters "I am the magic mirror" have been deleted from the queue, and further reply characters have been pushed in from the tail. Traversal then continues until the segmentation of the target reply text is complete.
Storing the reply text in the form of a text queue facilitates management of the reply text, and performing segment division on the obtained reply text according to the text ordering ensures the accuracy of the division.
S204, in the process of obtaining the voice data segments based on the text segments to be processed, audio files are generated sequentially from the obtained voice data segments according to the arrangement order of the n voice data segments, and the generated audio files are played; in the process of playing the generated audio files, the audio file corresponding to the j+1th voice data segment is played after the audio file corresponding to the jth voice data segment has been played, where j is a positive integer greater than 0 and less than n.
In the process of generating the voice data segments, each time a voice data segment conforming to the arrangement order of the n voice data segments is generated, the voice interaction device can generate an audio file from the obtained voice data segment and play the generated audio file. In the process of playing the generated audio files, the audio file corresponding to the j+1th voice data segment is played after the audio file corresponding to the jth voice data segment has been played, where j is a positive integer greater than 0 and less than n. That is, in the process of playing the audio files, to achieve the effect of sequential playing, it is necessary to wait for the previous audio file to finish playing before playing the next one.
After an audio file is generated for a voice data segment, the voice interaction device can play it through the sound card. Because the lengths of the voice data segments differ, the time required to synthesize each segment may also differ. To guarantee the playing continuity of the voice data segments, the voice data segment at the head of the second audio data queue is continuously checked, and only a segment whose conversion status is completed (i.e., whose status identifier is true) can be pushed into the first audio data queue, after which generation of its audio file begins. In this case, in the process of obtaining the voice data segments based on the text segments to be processed, the manner of generating audio files sequentially from the obtained voice data segments according to the arrangement order of the n voice data segments may be: obtaining voice data segments from the head of the first audio data queue, and generating a corresponding audio file from each obtained voice data segment.
To achieve the effect of sequential playing, after the audio file of the previous voice data segment has finished playing, the next voice data segment in the first audio data queue is played in turn; at this moment its audio file is first generated and then played. A segment whose audio file is to be played is dequeued from the head of the first audio data queue, so that the next voice data segment is acquired in the correct order.
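The two-queue hand-off described above can be sketched as follows. This is a simplified illustration under stated assumptions: `write_audio_file` and `play_file_blocking` are hypothetical helpers, and the status check is shown as polling, whereas a production system would block on an event or condition variable instead.

```python
import queue
import time
from dataclasses import dataclass, field

@dataclass
class VoicePiece:
    index: int                                  # position among the n voice data segments
    audio: bytearray = field(default_factory=bytearray)
    done: bool = False                          # status identifier; True is the completion identifier

pending = queue.Queue()   # second audio data queue: segments still being converted
ready = queue.Queue()     # first audio data queue: segments whose conversion is complete

def monitor_pending():
    # move the head segment of the pending queue into the ready queue
    # only once its status identifier indicates completion
    while True:
        piece = pending.get()                   # head of the second audio data queue
        while not piece.done:
            time.sleep(0.01)                    # polling sketch; real code would wait on an event
        ready.put(piece)                        # order of arrival preserves the arrangement order

def play_in_order(write_audio_file, play_file_blocking):
    # generate an audio file for each ready segment and play it; playback of
    # segment j+1 starts only after segment j has finished playing
    while True:
        piece = ready.get()                     # dequeue from the head of the first audio data queue
        path = write_audio_file(piece)          # hypothetical audio file generation helper
        play_file_blocking(path)                # returns only when playback has completed
```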
The above covers the processing of generating audio files from the voice data and reading the audio files in the order of the reply text. In this way, the reply of the voice interaction device can be obtained without waiting for the voice data corresponding to the complete target reply text to be generated, which improves the reply speed of voice interaction.
In one possible implementation manner, to facilitate the user viewing and understanding the device's reply to the speech to be replied, the voice interaction device may, in the process of playing a generated audio file, display the text segment to be processed corresponding to the voice data segment of that audio file. Based on the hardware architecture introduced in fig. 3, the text segment to be processed can be displayed on the screen peripheral, thereby presenting the target reply text on the screen peripheral in a streaming manner.
In one possible implementation manner, the voice interaction device may also have an environmental information display function, where the environmental information may include information such as temperature, humidity, time, and weather. Based on this, the voice interaction device can acquire the environmental information of the space in which it is located and display that environmental information. In this way, the user can learn about the surrounding environment at any time, making the user experience more convenient and comfortable.
According to the technical solution described above, after a voice reply request is triggered based on the acquired speech to be replied, the target reply text is divided according to the text ordering of the reply text included in the target reply text to obtain n text segments to be processed, and each time the ith text segment to be processed is obtained by division, the ith text segment to be processed is converted into the ith voice data segment. The voice reply request includes the voice text corresponding to the speech to be replied; the target reply text is a text generated based on the voice text for replying to the speech to be replied; i is a positive integer greater than 0 and less than or equal to n; and the arrangement order of the n voice data segments obtained by conversion is the same as that of the n text segments to be processed. By dividing the target reply text and converting the segments into voice data segments in turn, the time consumed to obtain a voice data segment is significantly reduced compared with converting the complete target reply text into voice data. Therefore, in the process of obtaining the voice data segments based on the text segments to be processed, once a voice data segment is obtained, audio files can be generated sequentially from the obtained voice data segments according to the arrangement order of the n voice data segments and played in turn, so that the reply of the voice interaction device can be obtained without waiting for the voice data corresponding to the complete target reply text to be generated, improving the reply speed of voice interaction. In the process of playing the generated audio files, to achieve the effect of sequential playing, the audio file corresponding to the j+1th voice data segment is played after the audio file corresponding to the jth voice data segment has been played, where j is a positive integer greater than 0 and less than n. Therefore, in the voice interaction process, the technical solution provided by this application improves the reply speed of voice interaction, shortens the waiting time of the user, and improves the user experience.
Next, the voice interaction method provided in the embodiments of the present application will be described with reference to an actual application scenario. In this application scenario, the voice interaction device may be a smart mirror, and the smart mirror can be used in various settings such as homes and hotels. A smart mirror deployed in an esports hotel may hereinafter be referred to as a magic mirror. The magic mirror may have an intelligent dialogue function and an environmental information display function. In addition, since the magic mirror is a wall mirror of an esports hotel, and the guests of an esports hotel are usually esports players participating in competitions who may wish to check the match schedule at any time, the magic mirror may also have a match schedule display function. The magic mirror can be shown in fig. 9, where 901 may be a wall surface of the esports hotel and 902 may be the magic mirror.
The screen peripheral of the magic mirror can be divided into different areas according to function. Taking a magic mirror with the intelligent dialogue function, the environmental information display function, and the match schedule display function as an example, 9021 may be a first display area corresponding to the intelligent dialogue function, used for displaying related content generated in the process of realizing the intelligent dialogue function, such as prompt feedback, the recording animation, and the text segments to be processed. 9022 may be a second display area corresponding to the environmental information display function, used for displaying environmental information, which may include information such as temperature, humidity, time, and weather. 9023 may be a third display area corresponding to the match schedule display function, used for displaying the match schedule, which may include, for example, "Team A vs. Team B, May 15" and "Team C vs. Team D, May 16".
The magic mirror is connected to the generative dialogue model, which enhances interactivity with users checking in at the esports hotel. The combination of multiple functions makes the magic mirror a practical voice interaction device that can bring users a more convenient and comfortable experience.
Based on the above-described voice interaction device, the voice interaction method provided in the embodiment of the present application may be shown in fig. 10, where the method includes:
S1001, the magic mirror acquires a wake-up word input by the user and is woken up based on the wake-up word.
The wake-up word may be, for example, "hello magic mirror" or "magic mirror". The user inputs the wake-up word, so that the magic mirror is woken up by the wake-up word and enters a detection mode. After the wake-up word is hit, the magic mirror can also give prompt feedback, such as "I'm here, how can I help you?".
S1002, the magic mirror displays a user interface waiting for a user to input a voice to be replied.
After entering the detection mode, the UI of the magic mirror changes in response and an animation indicating that recording is in progress appears. If, after a period of time, the magic mirror does not detect the user inputting the speech to be replied, the detection ends and the UI returns to the default mode. If the magic mirror detects the user inputting the speech to be replied, the following steps continue to be executed.
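For illustration, the wake-up and detection-mode flow of S1001 to S1002 could look like the following sketch; the wake phrases, the timeout value, and the `ui` helper methods are all assumptions made for the example, not features confirmed by this embodiment.

```python
WAKE_WORDS = {"hello magic mirror", "magic mirror"}  # assumed wake phrases

def handle_wake(transcript, ui):
    # enter detection mode on a wake-word hit; fall back to the default UI
    # if no speech to be replied arrives within the recording window
    if transcript.strip().lower() not in WAKE_WORDS:
        return None
    ui.show_prompt("I'm here, how can I help you?")   # prompt feedback after the hit
    speech = ui.record_with_timeout(seconds=8)        # hypothetical recording helper
    if speech is None:
        ui.show_default()                             # detection ended with no input
    return speech
```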
S1003, the magic mirror acquires the voice to be replied input by the user.
S1004, the magic mirror calls an interface of the generative dialogue model.
After the magic mirror detects the speech to be replied input by the user, the speech to be replied can be converted into text, and the service of the generative dialogue model is requested by calling the interface of the generative dialogue model, so as to generate the target reply text based on the generative dialogue model.
S1005, the magic mirror waits for the target reply text to be returned in a streaming manner, and plays the voice data corresponding to the target reply text in a streaming manner.
It should be noted that the voice interaction method provided in the embodiments of the present application mainly includes four parts: streamed return and segmentation of the reply, segmented text-to-speech streaming data processing, generation of audio files from voice data, and reading of the audio files in the order of the reply text. After the target reply text is obtained, it can be returned in a streaming manner, that is, reply text fragments are acquired sequentially (streamed return and segmentation of the reply). In the process of sequentially acquiring the reply text fragments, segment division can be performed on the target reply text based on the acquired reply text, and the text segments to be processed are obtained in sequence. Each time a text segment to be processed is obtained, it can be converted into a corresponding voice data segment (segmented text-to-speech streaming data processing, that is, splitting the voice data corresponding to the target reply text into a plurality of voice data segments). In the process of generating the voice data segments, audio files can be generated from the voice data segments conforming to the arrangement order of the voice data segments (generation of audio files from voice data), and the audio files can be played sequentially (reading of the audio files in the order of the reply text).
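Read end to end, the four parts can be summarized in the deliberately serial sketch below. Unlike the queued embodiment described earlier, it does not overlap synthesis with playback, and every callable (`model_stream`, `synthesize`, `write_audio_file`, `play_blocking`) is an assumed stand-in rather than a real API.

```python
def reply_pipeline(voice_text, model_stream, synthesize, write_audio_file, play_blocking):
    # stage 1: consume the streamed target reply text fragment by fragment
    buffer = ""
    for fragment in model_stream(voice_text):
        buffer += fragment
        sentence, buffer = split_first_sentence(buffer)
        if sentence:
            piece = synthesize(sentence)        # stage 2: text segment -> voice data segment
            path = write_audio_file(piece)      # stage 3: voice data -> audio file
            play_blocking(path)                 # stage 4: ordered playback
    if buffer:                                  # flush trailing text with no separator
        play_blocking(write_audio_file(synthesize(buffer)))

def split_first_sentence(text, separators="，。！？,.!?"):
    # return (first complete sentence, remainder), or (None, text) if none yet
    for i, ch in enumerate(text):
        if ch in separators:
            return text[: i + 1], text[i + 1:]
    return None, text
```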
S1006, it is judged whether the speech to be replied is a question that continues the contextual dialogue. If yes, S1007 is executed; if no, S1008 is executed.
S1007, the text of the contextual dialogue is carried in the request.
If a contextual dialogue exists before the speech to be replied, the text carrying the contextual dialogue may be included when requesting the service of the generative dialogue model.
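A sketch of carrying the contextual dialogue might look like the following; the message-list payload shape is an assumption made for illustration and is not taken from the actual interface of the generative dialogue model.

```python
def build_model_request(voice_text, history):
    # history: list of (role, text) pairs from the earlier contextual dialogue
    messages = [{"role": role, "content": text} for role, text in history]
    messages.append({"role": "user", "content": voice_text})
    return {"messages": messages, "stream": True}  # also request streamed return
```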
S1008, the magic mirror enters a default display state.
While the reply to the speech to be replied is being requested, a waiting UI is displayed on the screen peripheral of the magic mirror. After the target reply text begins to be returned, it is displayed on the screen peripheral in a streaming manner, and at the same time the voice data corresponding to the target reply text starts to be played in a streaming manner. After the reply to the speech to be replied is completed, the magic mirror enters a default display state, which may be, for example, waiting for the user to continue inputting speech to be replied, or a black screen, etc.; this is not limited in the embodiments of the present application.
The embodiments of the present application address the reply speed of a magic mirror with the intelligent dialogue function. The magic mirror returns the target reply text in a streaming manner, processes the text segments to be processed in sequence to generate voice data segments, generates an audio file for each segment once its conversion is completed, and plays the audio files sequentially, thereby solving the problem of slow reply speed in the voice interaction process.
It should be noted that the implementation manners provided in the above aspects may be further combined to provide still further implementation manners.
Based on the voice interaction method provided in the corresponding embodiment of fig. 2, the embodiment of the present application further provides a voice interaction device 1100. Referring to fig. 11, the voice interaction apparatus 1100 includes an acquisition unit 1101, a request unit 1102, a conversion unit 1103, a generation unit 1104, and a playback unit 1105:
the obtaining unit 1101 is configured to obtain a voice to be replied;
the request unit 1102 is configured to trigger a voice reply request based on the voice to be replied, where the voice reply request includes a voice text corresponding to the voice to be replied;
the converting unit 1103 is configured to divide the target reply text according to the text ordering of the reply text included in the target reply text to obtain n text segments to be processed, and, when the ith text segment to be processed is obtained by division, convert the ith text segment to be processed into the ith voice data segment, where the target reply text is a text generated based on the voice text for replying to the voice to be replied, i is a positive integer greater than 0 and less than or equal to n, and the arrangement order of the n voice data segments obtained by conversion is the same as that of the n text segments to be processed;
The generating unit 1104 is configured to generate, in a process of obtaining a speech data segment based on a text segment to be processed, an audio file according to the obtained speech data segments in sequence according to the arrangement sequence of the n speech data segments;
the playing unit 1105 is configured to play the generated audio file, and play the audio file corresponding to the j+1th voice data segment after the audio file corresponding to the j voice data segment is played in the process of playing the generated audio file, where j is a positive integer greater than 0 and less than n.
In a possible implementation manner, the converting unit 1103 is configured to:
according to the text ordering, dividing the target reply text into segments based on semantic separators to obtain n text segments to be processed;
or, according to the text ordering, dividing the target reply text into segments based on a segment length threshold value to obtain the n text segments to be processed;
or according to the text ordering, dividing the target reply text into n text fragments to be processed based on semantic separators and a fragment length threshold value.
In a possible implementation manner, the converting unit 1103 is configured to:
Traversing the reply text included in the target reply text according to the text sequence, and dividing the fragments based on the semantic separator when traversing to the semantic separator to obtain candidate text fragments;
if the segment length of the candidate text segment is greater than or equal to the segment length threshold, determining the candidate text segment as a text segment to be processed, and continuing traversing until the n text segments to be processed are obtained.
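A minimal sketch of this traversal follows, combining the semantic separators with the segment length threshold; the separator set and the threshold value are assumptions for illustration. A candidate shorter than the threshold simply keeps accumulating until the next separator.

```python
def segment_reply(text, separators="，。！？,.!?", min_len=6):
    # cut a candidate segment at each semantic separator, but accept it as a
    # text segment to be processed only once it reaches the length threshold
    segments, candidate = [], ""
    for ch in text:
        candidate += ch
        if ch in separators and len(candidate) >= min_len:
            segments.append(candidate)
            candidate = ""
    if candidate:                 # trailing text with no separator
        segments.append(candidate)
    return segments
```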
In a possible implementation manner, the obtaining unit 1101 is further configured to:
sequentially obtaining a plurality of reply text fragments according to the fragment sequences of the plurality of reply text fragments included in the target reply text, wherein each reply text fragment comprises at least one reply text;
the converting unit 1103 is configured to:
and in the process of sequentially acquiring the plurality of reply text fragments, according to the text ordering, performing segment division based on the acquired reply text until the segment division of the target reply text is completed, to acquire the n text segments to be processed.
In a possible implementation manner, the obtaining unit 1101 is configured to:
when the reply text fragments are obtained according to the fragment sequencing, pushing the obtained reply text fragments into the tail of a text queue, wherein the arrangement sequence of the obtained reply text in the text queue is the same as the text sequencing;
The converting unit 1103 is configured to:
and in the process of sequentially acquiring a plurality of reply text fragments, dividing the acquired reply text fragments from the head of the text queue until the fragment division of the target reply text is completed, and acquiring the n text fragments to be processed.
In a possible implementation, the device further comprises a pushing unit:
the pushing unit is used for pushing the ith voice data segment into the tail of the first audio data queue after completing the conversion of the ith voice data segment;
the generating unit 1104 is configured to:
and in the process of obtaining the voice data fragments based on the text fragments to be processed, obtaining the voice data fragments from the head of the first audio data queue, and generating corresponding audio files according to the obtained voice data fragments.
In a possible implementation, the pushing unit is further configured to:
pushing the ith voice data segment in the conversion process into the tail of a second audio data queue in the process of converting the ith text segment to be processed into the ith voice data segment, wherein the ith voice data segment has a state identifier;
Determining a manner of completing the conversion of the ith voice data fragment includes:
and if the state identifier is a completion identifier, determining that the conversion of the ith voice data fragment is completed.
In a possible implementation manner, the device further includes a display unit:
the display unit is used for displaying the corresponding text segment to be processed according to the voice data segment corresponding to the played audio file in the process of playing the generated audio file.
In one possible implementation, the target reply text is generated by a generative dialog model based on the speech text.
In a possible implementation manner, the obtaining unit 1101 is further configured to:
acquiring environment information of a space where the voice interaction equipment is located;
the display unit is used for displaying the environment information.
According to the technical solution described above, after a voice reply request is triggered based on the acquired speech to be replied, the target reply text is divided according to the text ordering of the reply text included in the target reply text to obtain n text segments to be processed, and each time the ith text segment to be processed is obtained by division, the ith text segment to be processed is converted into the ith voice data segment. The voice reply request includes the voice text corresponding to the speech to be replied; the target reply text is a text generated based on the voice text for replying to the speech to be replied; i is a positive integer greater than 0 and less than or equal to n; and the arrangement order of the n voice data segments obtained by conversion is the same as that of the n text segments to be processed. By dividing the target reply text and converting the segments into voice data segments in turn, the time consumed to obtain a voice data segment is significantly reduced compared with converting the complete target reply text into voice data. Therefore, in the process of obtaining the voice data segments based on the text segments to be processed, once a voice data segment is obtained, audio files can be generated sequentially from the obtained voice data segments according to the arrangement order of the n voice data segments and played in turn, so that the reply of the voice interaction device can be obtained without waiting for the voice data corresponding to the complete target reply text to be generated, improving the reply speed of voice interaction. In the process of playing the generated audio files, to achieve the effect of sequential playing, the audio file corresponding to the j+1th voice data segment is played after the audio file corresponding to the jth voice data segment has been played, where j is a positive integer greater than 0 and less than n. Therefore, in the voice interaction process, the technical solution provided by this application improves the reply speed of voice interaction, shortens the waiting time of the user, and improves the user experience.
The embodiments of the present application further provide a computer device that can execute the above voice interaction method. The computer device may be a terminal; the following takes the terminal being a smartphone as an example:
fig. 12 is a block diagram illustrating a part of a structure of a smart phone according to an embodiment of the present application. Referring to fig. 12, the smart phone includes: radio Frequency (RF) circuit 1210, memory 1220, input unit 1230, display unit 1240, sensor 1250, audio circuit 1260, wireless fidelity (WiFi) module 1270, processor 1280, and power supply 1290. The input unit 1230 may include a touch panel 1231 and other input devices 1232, the display unit 1240 may include a display panel 1241, and the audio circuit 1260 may include a speaker 1261 and a microphone 1262. It will be appreciated that the smartphone structure shown in fig. 12 is not limiting of the smartphone, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
Memory 1220 may be used to store software programs and modules, and processor 1280 may perform various functional applications and data processing for the smartphone by executing the software programs and modules stored in memory 1220. The memory 1220 may mainly include a storage program area that may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and a storage data area; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 1220 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
Processor 1280 is a control center of the smartphone, connects various parts of the entire smartphone using various interfaces and lines, performs various functions of the smartphone and processes data by running or executing software programs and/or modules stored in memory 1220, and invoking data stored in memory 1220. In the alternative, processor 1280 may include one or more processing units; preferably, the processor 1280 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, application programs, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 1280.
In this embodiment, the processor 1280 in the smart phone may perform the following steps:
acquiring voice to be replied;
triggering a voice reply request based on the voice to be replied, wherein the voice reply request comprises a voice text corresponding to the voice to be replied;
dividing a target reply text according to the text sequence of reply text included in the target reply text to obtain n pieces of text to be processed, and converting the ith piece of text to be processed into an ith voice data piece when the ith piece of text to be processed is obtained by dividing, wherein the target reply text is a text which is generated based on the voice text and is used for replying the voice to be replied, i is a positive integer which is more than 0 and less than or equal to n, and the arrangement sequence of the n pieces of voice data obtained by conversion is the same as that of the n pieces of text to be processed;
In the process of obtaining voice data fragments based on the text fragments to be processed, generating audio files according to the obtained voice data fragments in sequence according to the arrangement sequence of the n voice data fragments, and playing the generated audio files, in the process of playing the generated audio files, playing the audio files corresponding to the j+1th voice data fragment after the audio files corresponding to the j voice data fragment are played, wherein j is a positive integer greater than 0 and less than n.
The computer device provided in the embodiment of the present application may also be a server, as shown in fig. 13, fig. 13 is a block diagram of a server 1300 provided in the embodiment of the present application, where the server 1300 may have a relatively large difference due to different configurations or performances, and may include one or more processors, such as a central processing unit (Central Processing Units, abbreviated as CPU) 1322, a memory 1332, one or more storage media 1330 (such as one or more mass storage devices) storing application programs 1342 or data 1344. Wherein the memory 1332 and storage medium 1330 may be transitory or persistent. The program stored on the storage medium 1330 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Further, the central processor 1322 may be configured to communicate with the storage medium 1330, and execute a series of instruction operations in the storage medium 1330 on the server 1300.
The server 1300 may also include one or more power supplies 1326, one or more wired or wireless network interfaces 1350, one or more input/output interfaces 1358, and/or one or more operating systems 1341, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In this embodiment, the steps required to be executed by the CPU 1322 in the server 1300 may be implemented based on the structure shown in fig. 13.
According to an aspect of the present application, there is provided a computer-readable storage medium for storing a computer program for executing the voice interaction method according to the foregoing embodiments.
According to one aspect of the present application, a computer program product is provided, the computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The description of each process or structure corresponding to the drawings has its own emphasis; for the parts of a certain process or structure that are not described in detail, reference may be made to the descriptions of other processes or structures.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and are not necessarily used for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application described herein can, for example, be implemented in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing a computer program, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the corresponding technical solutions.

Claims (14)

1. A method of voice interaction, the method comprising:
acquiring voice to be replied;
triggering a voice reply request based on the voice to be replied, wherein the voice reply request comprises a voice text corresponding to the voice to be replied;
dividing a target reply text according to the text sequence of reply text included in the target reply text to obtain n pieces of text to be processed, and converting the ith piece of text to be processed into an ith voice data piece when the ith piece of text to be processed is obtained by dividing, wherein the target reply text is a text which is generated based on the voice text and is used for replying the voice to be replied, i is a positive integer which is more than 0 and less than or equal to n, and the arrangement sequence of the n pieces of voice data obtained by conversion is the same as that of the n pieces of text to be processed;
In the process of obtaining voice data fragments based on the text fragments to be processed, generating audio files according to the obtained voice data fragments in sequence according to the arrangement sequence of the n voice data fragments, and playing the generated audio files, in the process of playing the generated audio files, playing the audio files corresponding to the j+1th voice data fragment after the audio files corresponding to the j voice data fragment are played, wherein j is a positive integer greater than 0 and less than n.
2. The method according to claim 1, wherein the dividing the target reply text into n pieces of text to be processed according to the text sequence of the reply text included in the target reply text includes:
according to the text ordering, dividing the target reply text into segments based on semantic separators to obtain n text segments to be processed;
or, according to the text ordering, dividing the target reply text into segments based on a segment length threshold value to obtain the n text segments to be processed;
or according to the text ordering, dividing the target reply text into n text fragments to be processed based on semantic separators and a fragment length threshold value.
3. The method of claim 2, wherein the segmenting the target reply text based on semantic separators and segment length thresholds according to the text ordering to obtain the n text segments to be processed comprises:
traversing the reply text included in the target reply text according to the text sequence, and dividing the fragments based on the semantic separator when traversing to the semantic separator to obtain candidate text fragments;
if the segment length of the candidate text segment is greater than or equal to the segment length threshold, determining the candidate text segment as a text segment to be processed, and continuing traversing until the n text segments to be processed are obtained.
4. The method of claim 1, wherein prior to the segmenting the target reply text into n pieces of text to be processed according to the text ordering of the reply text included in the target reply text, the method further comprises:
sequentially obtaining a plurality of reply text fragments according to the fragment sequences of the plurality of reply text fragments included in the target reply text, wherein each reply text fragment comprises at least one reply text;
The step of dividing the target reply text into n text segments to be processed according to the text ordering of the reply text included in the target reply text comprises the following steps:
and in the process of sequentially acquiring the plurality of reply text fragments, according to the text ordering, performing segment division based on the acquired reply text until the segment division of the target reply text is completed, to acquire the n text segments to be processed.
5. The method of claim 4, wherein sequentially obtaining the plurality of reply text segments according to the segment order of the plurality of reply text segments included in the target reply text comprises:
when the reply text fragments are obtained according to the fragment sequencing, pushing the obtained reply text fragments into the tail of a text queue, wherein the arrangement sequence of the obtained reply text in the text queue is the same as the text sequencing;
in the process of sequentially obtaining a plurality of reply text fragments, according to the text sequence, performing fragment division based on the obtained reply text until the fragment division of the target reply text is completed, obtaining the n text fragments to be processed, including:
And in the process of sequentially acquiring a plurality of reply text fragments, dividing the acquired reply text fragments from the head of the text queue until the fragment division of the target reply text is completed, and acquiring the n text fragments to be processed.
6. The method according to any one of claims 1-5, further comprising:
pushing the ith voice data segment into the tail of a first audio data queue after completing the conversion of the ith voice data segment;
in the process of obtaining the voice data fragments based on the text fragments to be processed, generating an audio file according to the obtained voice data fragments in sequence according to the arrangement sequence of the n voice data fragments, wherein the audio file comprises the following steps:
and in the process of obtaining the voice data fragments based on the text fragments to be processed, obtaining the voice data fragments from the head of the first audio data queue, and generating corresponding audio files according to the obtained voice data fragments.
7. The method of claim 6, wherein the method further comprises:
pushing the ith voice data segment in the conversion process into the tail of a second audio data queue in the process of converting the ith text segment to be processed into the ith voice data segment, wherein the ith voice data segment has a state identifier;
Determining a manner of completing the conversion of the ith voice data fragment includes:
and if the state identifier is a completion identifier, determining that the conversion of the ith voice data fragment is completed.
8. The method according to any one of claims 1-5, further comprising:
and in the process of playing the generated audio file, displaying the corresponding text segment to be processed according to the voice data segment corresponding to the played audio file.
9. The method of any of claims 1-5, wherein the target reply text is generated by a generative dialog model based on the phonetic text.
10. The method of any of claims 1-5, wherein the method is performed by a voice interaction device, the method further comprising:
acquiring environment information of a space where the voice interaction equipment is located;
and displaying the environment information.
11. The voice interaction device is characterized by comprising an acquisition unit, a request unit, a conversion unit, a generation unit and a play unit:
the acquisition unit is used for acquiring the voice to be replied;
the request unit is used for triggering a voice reply request based on the voice to be replied, and the voice reply request comprises a voice text corresponding to the voice to be replied;
The conversion unit is used for dividing the target reply text into n pieces of text to be processed according to the text sequence of the reply text included in the target reply text, converting the ith piece of text to be processed into an ith piece of voice data when the ith piece of text to be processed is obtained by dividing, wherein the target reply text is a text which is generated based on the voice text and is used for replying the voice to be replied, i is a positive integer which is more than 0 and less than or equal to n, and the arrangement sequence of the n pieces of voice data obtained by conversion is the same as that of the n pieces of text to be processed;
the generating unit is used for sequentially generating audio files according to the obtained voice data fragments according to the arrangement sequence of the n voice data fragments in the process of obtaining the voice data fragments based on the text fragments to be processed;
the playing unit is used for playing the generated audio file, and playing the audio file corresponding to the j+1th voice data fragment after the audio file corresponding to the j voice data fragment is played in the process of playing the generated audio file, wherein j is a positive integer greater than 0 and less than n.
12. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is configured to perform the method of any of claims 1-10 according to instructions in the computer program.
13. A computer readable storage medium for storing a computer program which, when executed by a processor, causes the processor to perform the method of any one of claims 1-10.
14. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-10.
CN202311037799.7A 2023-08-16 2023-08-16 Voice interaction method and related device Pending CN117253478A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311037799.7A CN117253478A (en) 2023-08-16 2023-08-16 Voice interaction method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311037799.7A CN117253478A (en) 2023-08-16 2023-08-16 Voice interaction method and related device

Publications (1)

Publication Number Publication Date
CN117253478A true CN117253478A (en) 2023-12-19

Family

ID=89125508

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311037799.7A Pending CN117253478A (en) 2023-08-16 2023-08-16 Voice interaction method and related device

Country Status (1)

Country Link
CN (1) CN117253478A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117909600A (en) * 2024-03-13 2024-04-19 苏州元脑智能科技有限公司 Method and device for recommending interaction objects, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication