WO2023024455A1 - Voice interaction method and electronic device - Google Patents

Voice interaction method and electronic device

Info

Publication number
WO2023024455A1
WO2023024455A1 PCT/CN2022/077091 CN2022077091W WO2023024455A1 WO 2023024455 A1 WO2023024455 A1 WO 2023024455A1 CN 2022077091 W CN2022077091 W CN 2022077091W WO 2023024455 A1 WO2023024455 A1 WO 2023024455A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
target
wake
collected
manipulation information
Prior art date
Application number
PCT/CN2022/077091
Other languages
French (fr)
Chinese (zh)
Inventor
程益君
徐昕媚
Original Assignee
北京达佳互联信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京达佳互联信息技术有限公司 filed Critical 北京达佳互联信息技术有限公司
Publication of WO2023024455A1 publication Critical patent/WO2023024455A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/41 Structure of client; Structure of client peripherals
    • H04N21/422 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]
    • H04N21/42203 Input-only peripherals, i.e. input devices connected to specially adapted client devices, e.g. global positioning system [GPS]; sound input device, e.g. microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/2187 Live feed

Definitions

  • the present disclosure relates to the technical field of the Internet, and in particular to a voice interaction method and electronic equipment.
  • the disclosure provides a voice interaction method and electronic equipment.
  • the disclosed technical scheme is as follows:
  • a voice interaction method including:
  • when the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the play page corresponding to the target video, where the preset prompt information indicates that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
  • a voice interaction method including:
  • a second collected voice and a second played voice are acquired, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • a first target interaction operation is performed.
  • a voice interaction device including:
  • the first target collected voice acquisition module is configured to acquire the first target collected voice during the playback of the target video;
  • the first wake-up recognition module is configured to perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
  • the preset prompt information display module is configured to display preset prompt information on the play page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, where the preset prompt information is used to prompt that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
  • a voice interaction device including:
  • the second voice acquisition module is configured to acquire a second collected voice and a second played voice when the target voice assistant is successfully awakened during the playback of the target video, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • the second acoustic echo cancellation processing module is configured to perform echo cancellation on the second collected speech based on the second played speech, to obtain a second target collected speech;
  • a first manipulation information acquisition request sending module configured to send a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collection voice;
  • the second manipulation information receiving module is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
  • the second target interactive operation execution module is configured to execute the first target interactive operation based on the first manipulation information.
  • an electronic device including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
  • when the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the play page corresponding to the target video, where the preset prompt information is used to prompt that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
  • an electronic device including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
  • a second collected voice and a second played voice are acquired, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • a first target interaction operation is performed.
  • a computer-readable storage medium, where, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:
  • when the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the play page corresponding to the target video, where the preset prompt information is used to prompt that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
  • a computer-readable storage medium, where, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:
  • a second collected voice and a second played voice are acquired, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • a first target interaction operation is performed.
  • a computer program product including a computer program, where the computer program, when executed by a processor, implements the following steps:
  • when the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the play page corresponding to the target video, where the preset prompt information is used to prompt that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
  • a computer program product containing instructions and including a computer program, where the computer program, when executed by a processor, implements the following steps:
  • a second collected voice and a second played voice are acquired, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • a first target interaction operation is performed.
  • during the playback of the target video, performing wake-up recognition on the first target collected voice avoids falsely triggered voice interaction and improves the accuracy of voice interaction; in addition, when the first wake-up recognition result is to wake up the target voice assistant, displaying preset prompt information, which indicates that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice, enables voice-based interaction with the target video and improves the convenience and efficiency of interaction, which in turn can improve the interaction between users and anchors in live broadcast and other scenarios.
  • Fig. 1 is a schematic diagram showing an application environment according to an exemplary embodiment
  • Fig. 2 is a flowchart of a voice interaction method according to an exemplary embodiment
  • Fig. 3 is a flow chart showing a wake-up recognition of a first target collected voice to obtain a first wake-up recognition result according to an exemplary embodiment
  • Fig. 4 is a schematic diagram of a playing page showing preset prompt information according to an exemplary embodiment
  • Fig. 5 is a flow chart showing preset prompt information on the play page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant according to an exemplary embodiment
  • Fig. 6 is a flowchart showing a corresponding interactive operation based on collected voice according to an exemplary embodiment
  • Fig. 7 is another flow chart showing corresponding interactive operations based on collected voice according to an exemplary embodiment
  • Fig. 8 is a flow chart showing another voice interaction method according to an exemplary embodiment
  • Fig. 9 is a block diagram of a voice interaction device according to an exemplary embodiment
  • Fig. 10 is a block diagram of a voice interaction device according to an exemplary embodiment
  • Fig. 11 is a block diagram showing an electronic device for voice interaction according to an exemplary embodiment.
  • the user information including but not limited to user equipment information, user personal information, etc.
  • data including but not limited to data for display, data for analysis, etc.
  • FIG. 1 is a schematic diagram showing an application environment according to an exemplary embodiment. As shown in FIG. 1 , the application environment includes a terminal 100 and a server 200 .
  • the terminal 100 is used to provide live broadcast service and voice assistant service to any user.
  • the terminal 100 includes, but is not limited to, electronic devices such as smartphones, desktop computers, tablet computers, notebook computers, smart speakers, digital assistants, augmented reality (AR)/virtual reality (VR) devices, and smart wearable devices.
  • the software running on the above-mentioned electronic devices is used to provide live broadcast services and voice assistant services, such as application programs and the like.
  • the operating system running on the electronic device includes, but is not limited to, Android, iOS, Linux, and Windows.
  • the server 200 provides background services for the terminal 100 .
  • the server 200 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
  • FIG. 1 is only an application environment provided by the present disclosure, and in actual application, other application environments are also included, for example, other application environments include a server and multiple terminals.
  • the terminal 100 and the server 200 are connected directly or indirectly through wired or wireless communication, which is not limited in this disclosure.
  • Fig. 2 is a flow chart of a voice interaction method according to an exemplary embodiment. As shown in Fig. 2 , the voice interaction method is executed by an electronic device such as a terminal, and includes the following steps S201 to S205.
  • step S201 during the playing of the target video, the first target collected voice is acquired.
  • the playback of the target video includes playing the target video on the corresponding playback page while the application corresponding to the target video runs in the foreground, or playing the target video while the corresponding application runs in the background.
  • the target video includes but is not limited to live video, pre-recorded video (movies, short videos, etc.).
  • the acquisition of the first target collected voice includes:
  • a first collected voice and a first played voice are acquired, where the first played voice is the voice played in the target video when the first collected voice is collected;
  • based on the first played voice, echo cancellation is performed on the first collected voice to obtain the first target collected voice.
  • the terminal is often provided with a voice collection device capable of collecting voice, such as a microphone, and the voice is collected based on the microphone on the terminal.
  • the first collected voice is the voice information collected based on the voice collection device during the playback of the target video.
  • the first playing voice is the voice information played in the target video when the first collected voice is collected.
  • the target video is played based on the player, and correspondingly, the first playing voice is acquired based on the player.
  • because the target video is being played while the voice is collected, the first collected voice also captures the voice played in the target video.
  • acoustic echo cancellation is performed on the first collected voice to obtain the first target collected voice, in which the first played voice has been cancelled, thereby ensuring the accuracy of subsequent wake-up recognition.
  • the terminal is provided with a voice processing component, and the voice processing component is used for collecting voice and performing acoustic echo cancellation processing.
  • performing acoustic echo cancellation on the first collected voice ensures the validity of the first target collected voice used for voice assistant wake-up recognition, and in turn improves the accuracy of subsequent voice wake-up recognition.
  • alternatively, the first collected voice is used directly as the first target collected voice.
  • this applies when the volume of the voice in the target video being played is low: if that volume is less than a volume threshold, the voice information uttered by the user in the first collected voice is clear enough, so there is no need to perform echo cancellation on the first collected voice, and the first collected voice can be used as the first target collected voice.
  • the volume threshold is an arbitrary value.
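  • as an illustrative sketch of the acquisition logic above (not the disclosed implementation): the function names, the default threshold value, and the single-coefficient canceller are assumptions made for this example; a production acoustic echo canceller would typically use an adaptive multi-tap filter rather than one least-squares gain.

```python
from typing import List, Sequence


def cancel_echo(collected: Sequence[float], played: Sequence[float]) -> List[float]:
    """Toy echo canceller: fit one least-squares gain and subtract the scaled playback."""
    energy = sum(p * p for p in played)
    if energy == 0.0:
        return list(collected)  # nothing was played, so there is no echo to remove
    gain = sum(c * p for c, p in zip(collected, played)) / energy
    return [c - gain * p for c, p in zip(collected, played)]


def first_target_collected_voice(
    collected: Sequence[float],
    played: Sequence[float],
    playback_volume: float,
    volume_threshold: float = 0.1,  # hypothetical value; the text leaves it open
) -> List[float]:
    """Skip cancellation when the video is quiet, otherwise cancel the played voice."""
    if playback_volume < volume_threshold:
        return list(collected)
    return cancel_echo(collected, played)
```

  • the bypass branch mirrors the low-volume case described above: when the played voice is quieter than the threshold, the collected samples are passed through unchanged.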
  • step S203 wake-up recognition is performed on the collected voice of the first target to obtain a first wake-up recognition result.
  • performing wake-up recognition on the first target collected voice means judging whether to wake up the target voice assistant based on the first target collected voice, and the first wake-up recognition result is used to indicate whether to wake up the target voice assistant.
  • the terminal performs wake-up recognition of the voice assistant locally.
  • performing wake-up recognition on the first target collected voice and obtaining the first wake-up recognition result may include:
  • based on a preset wake-up voice, wake-up recognition is performed on the first target collected voice to obtain a first wake-up recognition result.
  • the preset wake-up voice is a voice used to trigger the wake-up of the target voice assistant.
  • the preset wake-up voice is preset in combination with actual application scenarios.
  • the wake-up recognition of the first target collected voice based on the preset wake-up voice includes: matching the preset wake-up voice against the first target collected voice; when the first target collected voice includes the preset wake-up voice, the first wake-up recognition result is to wake up the target voice assistant; when the first target collected voice does not include the preset wake-up voice, the first wake-up recognition result is not to wake up the target voice assistant.
  • the terminal is provided with a local voice wake-up component, and the local voice wake-up component is used for local wake-up recognition.
  • the wake-up recognition is performed on the first target collected voice, which can avoid false triggering of voice interaction and improve the accuracy of voice interaction.
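  • a minimal sketch of the local matching rule described above, under the assumption that a hypothetical on-device recognizer has already turned the first target collected voice into a transcript; the function name and the text-based comparison are illustrative choices, since the disclosure describes the match at the voice level.

```python
def local_wake_recognition(transcript: str, preset_wake_text: str = "little k") -> bool:
    """Return True (wake up the assistant) when the transcript contains the wake phrase."""
    # Lower-case and collapse whitespace so that spacing or casing differences
    # do not prevent a match.
    normalized = " ".join(transcript.lower().split())
    return preset_wake_text.lower() in normalized
```

  • a plain substring match like this can still trigger on phrases that merely contain the wake text, which is one motivation for a stricter follow-up check.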
  • a secondary wake-up recognition is performed in conjunction with the server; correspondingly, the process of performing wake-up recognition on the first target collected voice to obtain the first wake-up recognition result includes the following steps:
  • step S301 a preset wake-up voice is acquired.
  • step S303 wake-up recognition is performed on the first target collected voice based on the preset wake-up voice to obtain a third wake-up recognition result.
  • step S305 if the third wake-up recognition result is to wake up the target voice assistant, send the first target collected voice to the server.
  • step S307 the first wake-up identification result sent by the server is received.
  • for the details of step S301 and step S303, refer to the relevant description above; they are not repeated here.
  • the above-mentioned first wake-up recognition result is obtained by the server performing wake-up recognition processing, based on a preset wake-up recognition model, on the text corresponding to the first target collected voice.
  • the preset wake-up recognition model is obtained by training a preset deep learning model based on the sample voice and the wake-up label information corresponding to the sample voice.
  • the sample speech includes a positive sample speech and a negative sample speech; the wake-up marking information corresponding to the positive sample voice is to wake up the target voice assistant, and the wake-up marking information corresponding to the negative sample voice is not to wake up the target voice assistant.
  • after receiving the first target collected voice, the server converts it into text information and inputs the text information into the preset wake-up recognition model for wake-up recognition processing to obtain the first wake-up recognition result.
  • when the third wake-up recognition result is not to wake up the target voice assistant, the first target collected voice is not sent to the server, thereby reducing the pressure on the server.
  • the secondary wake-up recognition is performed in combination with the server, which improves the accuracy of wake-up recognition and avoids falsely triggered voice interaction.
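  • steps S303 to S307 can be sketched as a two-stage gate; the callables `local_check` and `server_check` are hypothetical stand-ins for the local voice wake-up component and the server-side wake-up recognition model, and are not APIs named by the disclosure.

```python
from typing import Callable


def two_stage_wake_recognition(
    local_check: Callable[[bytes], bool],
    server_check: Callable[[bytes], bool],
    first_target_voice: bytes,
) -> bool:
    """Return the first wake-up recognition result for the given voice."""
    # S303: local recognition yields the third wake-up recognition result.
    third_result = local_check(first_target_voice)
    if not third_result:
        # The voice is never uploaded, so the server does no work for it.
        return False
    # S305 / S307: upload the voice and use the server's decision as the
    # first wake-up recognition result.
    return server_check(first_target_voice)
```

  • only voices that pass the local check reach the server, which is the load-reduction property described above.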
  • step S205 if the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the play page corresponding to the target video.
  • the preset prompt information is used to prompt that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
  • the target voice assistant is a voice assistant that controls the interactive operation associated with the target video based on voice. After the target voice assistant is successfully awakened, the user can control the interactive operation associated with the target video based on voice.
  • the information format of the preset prompt information includes but is not limited to text, voice, image, etc., and can be set according to actual application requirements.
  • the interactive operations associated with the target video differ depending on the type of the target video.
  • when the target video is a live video, the interactive operations associated with the target video include but are not limited to commenting, following the corresponding anchor, and giving virtual resources.
  • when the target video is a pre-recorded video such as a film or television drama, the interactive operations associated with the target video include but are not limited to posting bullet comments (barrage), selecting episodes, and adjusting the resolution.
  • when the target video is a pre-recorded video such as a short video, the interactive operations associated with the target video include but are not limited to liking and following.
  • voice collection continues, and when a new voice is collected during the playback of the target video, the voice interaction flow is performed on the new voice according to the above steps S201 to S205.
  • FIG. 4 is a schematic diagram of a playback page showing preset prompt information according to an exemplary embodiment, and the information corresponding to 400 in FIG. 4 is preset prompt information.
  • displaying preset prompt information on the play page corresponding to the target video includes:
  • step S2051 if the first wake-up identification result is to wake up the target voice assistant, a prompt information acquisition request is sent to the server, and the prompt information acquisition request includes the first target collected voice.
  • step S2053 the preset prompt information sent by the server is received, and the preset prompt information is generated based on the collected voice of the first target.
  • step S2055 preset prompt information is displayed on the play page.
  • before sending a voice to the server, the terminal converts the voice into a format that the server can recognize, and then sends the format-converted voice to the server.
  • the voice format of the first target collected voice before the format conversion is PCM (Pulse Code Modulation), and the voice format recognizable by the server is Opus (a lossy audio coding format).
  • sending the first target collected voice to the server includes sending the format-converted voice to the server; that is, the voice format of the first target collected voice sent to the server is Opus.
  • the terminal is provided with a local format conversion component, and the format conversion component is used for voice format conversion.
  • the function of voice format conversion is integrated in the above-mentioned local voice wake-up component.
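  • a hedged sketch of the conversion step before upload: the PCM stream is cut into fixed-duration frames and each frame is handed to an encoder. The 16 kHz sample rate, 16-bit mono layout, 20 ms frame length, and the `encode_frame` callable are all assumptions for this example; an actual Opus encoder binding would take the place of `encode_frame`.

```python
from typing import Callable, List

SAMPLE_RATE_HZ = 16000  # assumed capture rate
FRAME_MS = 20           # assumed frame duration
BYTES_PER_SAMPLE = 2    # assumed 16-bit mono PCM


def pcm_frames(pcm: bytes, sample_rate: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> List[bytes]:
    """Split raw PCM into fixed-size frames, zero-padding the final frame."""
    frame_bytes = (sample_rate * frame_ms // 1000) * BYTES_PER_SAMPLE
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        frames.append(frame.ljust(frame_bytes, b"\x00"))
    return frames


def convert_for_upload(pcm: bytes, encode_frame: Callable[[bytes], bytes]) -> List[bytes]:
    """Encode every PCM frame with the supplied (hypothetical) encoder callable."""
    return [encode_frame(frame) for frame in pcm_frames(pcm)]
```

  • keeping the encoder behind a callable lets the framing logic be exercised without committing to a particular codec library.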
  • when the first target collected voice includes a manipulation voice, the above method further includes:
  • based on third manipulation information that the server returns for the manipulation voice, a third target interaction operation is performed.
  • in addition to the preset wake-up voice, the first target collected voice also includes voice information indicating the execution of an interactive operation associated with the target video.
  • by performing semantic analysis on the first target collected voice, the server can determine the third manipulation information while determining the preset prompt information, so that the terminal can subsequently perform the third target interaction operation based on the third manipulation information.
  • for example, the third manipulation information is an instruction to follow the anchor.
  • after receiving the third manipulation information, the terminal automatically triggers the interactive operation of following the anchor (the third target interaction operation).
  • when the first wake-up recognition result is to wake up the target voice assistant, carrying the first target collected voice in the prompt information acquisition request makes it possible to obtain from the server, together with the preset prompt information, the third manipulation information corresponding to the manipulation voice in the first target collected voice, so that the interactive operation is executed automatically, which improves the convenience and efficiency of the interaction.
  • displaying, on the playback page corresponding to the target video, the preset prompt information indicating that the target voice assistant has been successfully awakened and that the interactive operation associated with the target video is controlled based on voice enables voice-based interaction with the target video, improves the convenience and efficiency of interaction, and can also improve the interaction between users and anchors in live broadcast and other scenarios.
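  • the prompt flow of steps S2051 to S2055, including the optional manipulation information returned with the prompt, might look like the sketch below; `PromptResponse`, `request_prompt`, `show_prompt`, and `perform` are hypothetical names introduced for this example only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PromptResponse:
    prompt_text: str                    # preset prompt information
    manipulation: Optional[str] = None  # third manipulation information, if any


def on_wake_success(
    first_target_voice: bytes,
    request_prompt: Callable[[bytes], PromptResponse],
    show_prompt: Callable[[str], None],
    perform: Callable[[str], None],
) -> None:
    """Fetch the prompt for a wake-up voice and run any operation bundled with it."""
    response = request_prompt(first_target_voice)  # S2051 and S2053
    show_prompt(response.prompt_text)              # S2055
    if response.manipulation is not None:
        # The wake-up voice also carried a manipulation voice, so the
        # third target interaction operation runs without a second request.
        perform(response.manipulation)
```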
  • the above method further includes:
  • step S601 a second collected voice and a second played voice are acquired, and the second played voice is the voice played in the target video when the second collected voice is collected.
  • step S603 based on the second playing voice, echo cancellation is performed on the second collected voice to obtain a second target collected voice.
  • step S605 a first manipulation information acquisition request is sent to the server, where the first manipulation information acquisition request includes the second target collected voice.
  • step S607 the first manipulation information sent by the server is received, where the first manipulation information corresponds to the second target collected voice.
  • step S609 based on the first manipulation information, a first target interaction operation is performed.
  • the first target interactive operation is an operation corresponding to the second collected voice, and is also an operation associated with the target video.
  • for the implementation of step S601 and step S603, refer to the description of step S201 above; details are not repeated here.
  • the second target collected voice is a voice obtained after the target voice assistant is awakened, and the second target collected voice is a control voice.
  • a first manipulation information acquisition request carrying the second target collected voice is sent to the server.
  • the server determines the first manipulation information by performing semantic analysis on the second target collected voice and returns it to the terminal, so that the terminal can execute the first target interactive operation based on the first manipulation information.
  • for example, if the text corresponding to the preset wake-up voice is "little k" and the text corresponding to the second target collected voice is "I want to follow the anchor", the first manipulation information is an instruction to follow the anchor.
  • after receiving the first manipulation information, the terminal automatically triggers the interactive operation of following the anchor (the first target interactive operation).
  • the second collected voice is used as the second target collected voice.
  • the volume threshold is an arbitrary value.
  • performing acoustic echo cancellation on the second collected voice ensures the validity of the control voice (the second target collected voice) and the accuracy of the manipulation information obtained from it, so that the accuracy of voice interaction is improved on top of the gains in convenience and efficiency.
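  • steps S603 to S609 can be sketched as a small pipeline that ends in a dispatch table; the callables and the `"follow_anchor"` key are hypothetical, standing in for the voice processing component, the server request, and whatever encoding the server actually uses for manipulation information.

```python
from typing import Callable, Dict, Sequence


def handle_control_voice(
    collected: Sequence[float],
    played: Sequence[float],
    cancel_echo: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
    request_manipulation: Callable[[Sequence[float]], str],
    handlers: Dict[str, Callable[[], None]],
) -> bool:
    """Run one control utterance through the pipeline; return True if an operation ran."""
    target_voice = cancel_echo(collected, played)      # S603
    manipulation = request_manipulation(target_voice)  # S605 and S607
    handler = handlers.get(manipulation)
    if handler is None:
        return False  # unknown manipulation information: do nothing
    handler()         # S609: first target interactive operation
    return True
```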
  • the above method further includes:
  • when the first target collected voice includes a target interaction instruction voice, the service mode of the target voice assistant is updated from a first state to a second state;
  • the target interaction instruction voice indicates that multiple rounds of interaction are to be performed.
  • the service mode in the first state (which may be referred to as the single-round interaction mode) indicates that one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is awake; that is, after the target voice assistant is woken up and one voice-controlled interactive operation associated with the target video has been performed, the target voice assistant is turned off.
  • the service mode in the second state (which may be referred to as the multi-round interaction mode) indicates that at least one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is awake; that is, after the target voice assistant is woken up, one or more voice-controlled interactive operations associated with the target video can be performed.
  • the target interaction instruction voice indicates multiple rounds of interaction, that is, the target interaction instruction voice indicates to enable the multi-round interaction mode.
  • the target interaction instruction voice is a preset specific voice for starting the multi-round interaction mode. For example, if the specific voice is "open multiple rounds of interaction mode" and it is recognized in the first target collected voice, the first target collected voice is determined to include the target interaction instruction voice.
  • alternatively, the target interaction instruction voice is voice information whose semantics require multiple interactions. For example, the target interaction instruction voice is "I want to send a gift".
  • based on a preset interaction recognition model, interaction recognition is performed on the first target collected voice to determine whether it includes the target interaction instruction voice.
  • the preset interaction recognition model is obtained by training the preset deep learning model based on the sample speech and the interaction annotation information corresponding to the sample speech.
  • the sample speech corresponding to the preset interaction recognition model includes positive sample speech and negative sample speech
  • the interaction annotation information corresponding to the positive sample speech is the target interaction instruction speech
  • the interaction annotation information corresponding to the negative sample speech is the target interaction
  • Other interaction indication voices other than the indication voice indicate that multiple rounds of interaction are not to be performed.
  • when the server receives the first target collected voice for the first time, it converts the first target collected voice into text information and inputs the text information into a preset interaction recognition model for interaction recognition, so as to determine whether the first target collected voice includes the target interaction indication voice.
  • the service mode of the target voice assistant is updated from the first state to the second state.
  • the first target voice collection includes the target interaction instruction voice
  • by updating the service mode of the target voice assistant from the first state to the second state, at least one voice-controlled interactive operation associated with the target video can be performed while the target voice assistant is awake, which improves the convenience and efficiency of voice interaction and also increases the diversity of voice interactive operations.
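The mode update described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the names `ServiceMode`, `PRESET_MULTI_ROUND_PHRASES`, `includes_target_interaction_indication`, and `update_service_mode` are hypothetical, and a simple phrase match stands in for the preset interaction recognition model, which the text describes as a trained deep learning model operating on the recognized text.

```python
from enum import Enum


class ServiceMode(Enum):
    SINGLE_ROUND = 1  # first state: one interactive operation per wake-up
    MULTI_ROUND = 2   # second state: at least one interactive operation per wake-up


# Hypothetical phrase list built from the two examples given in the text.
PRESET_MULTI_ROUND_PHRASES = (
    "open multiple rounds of interaction mode",
    "i want to send a gift",
)


def includes_target_interaction_indication(text: str) -> bool:
    """Stand-in for the preset interaction recognition model (assumption)."""
    normalized = text.lower()
    return any(phrase in normalized for phrase in PRESET_MULTI_ROUND_PHRASES)


def update_service_mode(current: ServiceMode, recognized_text: str) -> ServiceMode:
    """Switch from the first state to the second state when the recognized
    text contains a target interaction indication; otherwise keep the mode."""
    if (current is ServiceMode.SINGLE_ROUND
            and includes_target_interaction_indication(recognized_text)):
        return ServiceMode.MULTI_ROUND
    return current
```

In this sketch the decision is made on text, mirroring the description in which the server first converts the collected voice into text before interaction recognition.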
  • the above method further includes:
  • step S701 a third collected voice and a third played voice are obtained, and the third played voice is the voice played in the target video when the third collected voice is collected.
  • step S703 based on the third playing voice, echo cancellation is performed on the third collected voice to obtain the third target collected voice;
  • step S705 perform wake-up recognition on the collected voice of the third target, and obtain a second wake-up recognition result
  • step S707 if the second wake-up recognition result is not to wake up the target voice assistant, send a second manipulation information acquisition request to the server, where the second manipulation information acquisition request includes the third target voice collection;
  • step S709 the second manipulation information sent by the server is received, and the second manipulation information corresponds to the third target collected voice;
  • step S711 based on the second manipulation information, a second target interaction operation is performed.
  • the second target interactive operation is an operation corresponding to the third collected voice, and also an operation associated with the target video.
  • the second wake-up recognition result being not to wake up the target voice assistant means that the third target collected voice is only a voice for controlling an operation related to the target video, that is, the third target collected voice is only a control voice obtained while the target voice assistant is in the multi-round interaction mode.
  • steps S701 to S711 are implemented in the same way as steps S601 to S609 and step S203 above, and are not repeated here.
  • the third collected voice is used as the third target collected voice.
  • if the volume of the voice in the target video being played is less than the volume threshold, the third collected voice is clear enough, that is, the voice information uttered by the user in the third collected voice is clear enough. Echo cancellation therefore need not be performed on the third collected voice, and the third collected voice can be used directly as the third target collected voice.
  • the volume threshold is an arbitrary value.
  • performing acoustic echo cancellation on the newly acquired third collected voice ensures the validity of the control voice (the third target collected voice), which improves the accuracy of voice interaction in addition to the convenience and efficiency of interaction.
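Steps S701 to S709, together with the volume-threshold shortcut, can be sketched as below. This is an illustrative control flow only: `handle_multi_round_turn`, `TurnResult`, and the three injected callables are hypothetical names, and measuring playback volume as an RMS value is an assumption, since the text does not say how volume is measured.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Samples = Sequence[float]


@dataclass
class TurnResult:
    woke_assistant: bool              # second wake-up recognition result
    manipulation_info: Optional[str]  # second manipulation information, if any


def rms(samples: Samples) -> float:
    """Root-mean-square level, used here as the playback volume (assumption)."""
    if not samples:
        return 0.0
    return (sum(s * s for s in samples) / len(samples)) ** 0.5


def handle_multi_round_turn(
    collected: Samples,
    played: Samples,
    volume_threshold: float,
    cancel_echo: Callable[[Samples, Samples], List[float]],
    is_wake_word: Callable[[Samples], bool],
    request_manipulation_info: Callable[[Samples], str],
) -> TurnResult:
    # S701/S703: skip echo cancellation when the playback is quiet enough.
    if rms(played) < volume_threshold:
        target = list(collected)
    else:
        target = cancel_echo(collected, played)
    # S705: wake-up recognition on the third target collected voice.
    if is_wake_word(target):
        return TurnResult(woke_assistant=True, manipulation_info=None)
    # S707/S709: otherwise ask the server for manipulation information.
    return TurnResult(False, request_manipulation_info(target))
```

The caller would then perform the second target interaction operation (S711) from `manipulation_info`; that step is left out because it depends on the playback page.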
  • the above method also includes:
  • the second wake-up recognition result is to wake up the target voice assistant
  • the terminal in order to support the service mode of the second state, creates two instances of recognition engines at the same time, wherein one recognition engine is used for wake-up recognition, and the other recognition engine is used for semantic recognition of multiple rounds of interactions.
  • when the recognition engine used for wake-up recognition recognizes that the preset wake-up voice has been collected again, that is, when the second wake-up recognition result is to wake up the target voice assistant, the multi-round interaction mode of the target voice assistant is interrupted and the target voice assistant re-enters the service mode of the first state.
  • the multi-round interaction mode in response to re-awakening the target voice assistant, the multi-round interaction mode is interrupted, and the single-round interaction mode is re-entered to realize flexible switching between the two interaction modes.
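The two-engine arrangement can be illustrated with the sketch below, in which the wake-up engine is consulted before the semantic engine on every utterance. `MultiRoundSession` and both engine callables are hypothetical, and the engines work on text here purely to keep the example short.

```python
from enum import Enum
from typing import Callable


class ServiceMode(Enum):
    SINGLE_ROUND = "first state"
    MULTI_ROUND = "second state"


class MultiRoundSession:
    """Holds one wake-up engine and one semantic engine side by side."""

    def __init__(self,
                 wake_engine: Callable[[str], bool],
                 semantic_engine: Callable[[str], str]) -> None:
        self.mode = ServiceMode.MULTI_ROUND
        self._wake_engine = wake_engine
        self._semantic_engine = semantic_engine

    def on_target_voice(self, text: str) -> str:
        # The wake-up engine has priority: hearing the preset wake-up voice
        # again interrupts multi-round mode and returns to the first state.
        if self._wake_engine(text):
            self.mode = ServiceMode.SINGLE_ROUND
            return "re-awakened"
        # Otherwise the utterance is treated as a control voice.
        return self._semantic_engine(text)
```

For example, a session built with `lambda t: "little k" in t.lower()` as the wake-up engine stays in multi-round mode for "send a gift" and drops back to single-round mode on "Little K".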
  • the above method also includes:
  • the response voice corresponds to the first target collection voice
  • the target voice assistant After the target voice assistant is awakened, it obtains the corresponding response voice from the server.
  • the response voice prompts the user that the target voice assistant has been awakened in the form of voice, and the content of the response voice is preset in combination with the actual application.
  • the text corresponding to the preset wake-up voice is "little k”
  • the first target collected voice is "little k”
  • the text corresponding to the response voice is "I'm here".
  • the text corresponding to the preset wake-up voice is "Little K”
  • the first target collected voice is "Little K, I want a gift"
  • the text corresponding to the response voice is "Yes, please say”.
  • the interactivity with the user can be improved, thereby improving the user experience.
  • the above method further includes:
  • the preset prompt information displayed on the playback page is updated to the closing prompt information of the target voice assistant.
  • the newly collected voice is the voice collected after the target voice assistant is woken up, or is the voice after the acoustic echo cancellation process is performed on the collected voice when the target voice assistant is woken up.
  • the interaction waiting time is set in advance. Once the interaction waiting time is exceeded, the target voice assistant will be turned off, and the target voice assistant needs to be awakened again.
  • the waiting time for interaction is an arbitrary time set in advance, and the waiting time for interaction is the upper limit time for waiting for the newly collected voice from the time when the target voice assistant is woken up.
  • the preset time period is determined by combining the preset interaction waiting time and the time when the target voice assistant is woken up. In some embodiments, if the target voice assistant is woken up and the newly collected voice is not acquired within the interaction waiting time, it is determined that the target voice assistant is closed due to timeout, and the preset prompt information displayed on the playback page is updated to The closing prompt information of the target voice assistant. Wherein, the waiting time for interaction after the target voice assistant is awakened is the preset time period.
  • updating the preset prompt information displayed on the playback page to the closing prompt information of the target voice assistant avoids prolonged idle standby and reduces device resource consumption; displaying the closing prompt information also reminds the user that the target voice assistant has been closed, which improves the user experience.
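The timeout behaviour can be sketched as a small timer object. The class name `InteractionTimeout`, the function `prompt_to_display`, and the injectable `clock` parameter are assumptions made for illustration; the text only states that the interaction waiting time is preset and counted from the moment the target voice assistant is woken up.

```python
import time
from typing import Callable, Optional


class InteractionTimeout:
    """Tracks whether newly collected voice arrived within the waiting time."""

    def __init__(self, wait_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._wait = wait_seconds
        self._clock = clock
        self._woken_at: Optional[float] = None

    def on_wake(self) -> None:
        # Start counting from the moment the assistant is woken up.
        self._woken_at = self._clock()

    def on_new_voice(self) -> None:
        # Newly collected voice arrived in time, so stop waiting.
        self._woken_at = None

    def expired(self) -> bool:
        return (self._woken_at is not None
                and self._clock() - self._woken_at > self._wait)


def prompt_to_display(timer: InteractionTimeout,
                      preset_prompt: str, closing_prompt: str) -> str:
    """Choose which prompt the playback page should show."""
    return closing_prompt if timer.expired() else preset_prompt
```

Passing the clock in as a parameter keeps the sketch testable without real waiting; a production timer would more likely schedule a callback.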
  • Fig. 8 is a flow chart of another voice interaction method shown according to an exemplary embodiment. As shown in Fig. 8, the voice interaction method is executed by an electronic device such as a terminal, and includes the following steps:
  • step S801 in the process of playing the target video and when the target voice assistant has been woken up successfully, the second collected voice and the second played voice are obtained, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • step S803 based on the second playing voice, echo cancellation is performed on the second collected voice to obtain the second target collected voice;
  • step S805 sending a first manipulation information acquisition request to the server, where the first manipulation information acquisition request includes the second target collection voice;
  • step S807 the first manipulation information sent by the server is received, and the first manipulation information corresponds to the second target collected voice;
  • step S809 based on the first manipulation information, a first target interaction operation is performed.
  • steps S801 to S809 are the same as steps S601 to S609 above, and are not repeated here.
  • performing acoustic echo cancellation on the second collected voice ensures the validity of the control voice (the second target collected voice) and therefore the accuracy of the first manipulation information obtained from the server. This improves the accuracy of voice interaction in addition to the convenience and efficiency of interaction, and also enables voice-based interaction with the target video.
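The disclosure does not name a specific echo cancellation algorithm. As one common possibility, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter, with the played voice as the reference signal and the collected voice as the microphone signal. The function name `nlms_echo_cancel` and the parameters `taps`, `step`, and `eps` are assumptions chosen for illustration, not values from the source.

```python
from typing import List, Sequence


def nlms_echo_cancel(collected: Sequence[float], played: Sequence[float],
                     taps: int = 8, step: float = 0.5,
                     eps: float = 1e-8) -> List[float]:
    """Return the collected signal with the estimated playback echo removed."""
    weights = [0.0] * taps   # adaptive estimate of the echo path
    history = [0.0] * taps   # most recent playback samples, newest first
    output: List[float] = []
    for mic, ref in zip(collected, played):
        history = [ref] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - echo_estimate  # residual: estimate of the user's voice
        norm = sum(x * x for x in history) + eps
        # NLMS update: move the weights along the reference, scaled by the error.
        weights = [w + step * error * x / norm
                   for w, x in zip(weights, history)]
        output.append(error)
    return output
```

When the collected signal contains only a delayed, attenuated copy of the playback, the residual shrinks toward zero as the filter adapts. Real devices usually rely on a platform or library echo canceller rather than a hand-written filter like this one.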
  • Fig. 9 is a block diagram of a voice interaction device according to an exemplary embodiment.
  • the device includes:
  • the first target collection voice acquisition module 910 is configured to acquire the first target collection voice during the playback of the target video
  • the first wake-up recognition module 920 is configured to perform voice assistant wake-up recognition on the first target collected voice to obtain a first wake-up recognition result
  • the preset prompt information display module 930 is configured to display preset prompt information on the play page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, where the preset prompt information is used to prompt that the target voice assistant has been woken up successfully and that the interactive operation associated with the target video is controlled based on voice.
  • the first target acquisition voice acquisition module 910 includes:
  • the first voice acquisition unit is configured to acquire the first collected voice and the first played voice during the playback of the target video, where the first played voice is the voice played in the target video when the first collected voice is collected;
  • the first acoustic echo cancellation processing unit is configured to perform echo cancellation on the first collected speech based on the first played speech to obtain the first target collected speech.
  • the above-mentioned device also includes:
  • the second voice acquisition module is configured to acquire the second collected voice and the second played voice
  • the second acoustic echo cancellation processing module is configured to perform acoustic echo cancellation on the second collected voice based on the second played voice to obtain the second target collected voice, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • the first manipulation information acquisition request sending module is configured to send a first manipulation information acquisition request to the server, where the first manipulation information acquisition request includes the second target collection voice;
  • the second manipulation information receiving module is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the voice collected by the second target;
  • the second target interactive operation execution module is configured to execute the first target interactive operation based on the first manipulation information.
  • the above-mentioned device also includes:
  • the first service mode update module is configured to update the service mode of the target voice assistant from the first state to the second state when the first target collected voice includes the target interaction indication voice, where the target interaction indication voice indicates multiple rounds of interaction, the service mode in the first state indicates that one voice-controlled interactive operation associated with the target video is performed during the wake-up of the target voice assistant, and the service mode in the second state indicates that at least one voice-controlled interactive operation associated with the target video is performed during the wake-up of the target voice assistant.
  • the above-mentioned device also includes:
  • the third voice acquisition module is configured to acquire the third collection voice and the third playback voice, the third playback voice is the voice played in the target video when collecting the third collection voice;
  • the third acoustic echo cancellation processing module is configured to perform echo cancellation on the third collected speech based on the third playback speech, to obtain a third target collected speech;
  • the second wake-up identification module is configured to perform wake-up identification on the voice collected by the third target to obtain a second wake-up identification result
  • the second manipulation information acquisition request sending module is configured to send a second manipulation information acquisition request to the server when the second wake-up recognition result is that the target voice assistant is not awakened, and the second manipulation information acquisition request includes the third target collected voice;
  • the third manipulation information receiving module is configured to receive the second manipulation information sent by the server, and the second manipulation information corresponds to the voice collected by the third target;
  • the third target interactive operation execution module is configured to execute the second target interactive operation based on the second manipulation information.
  • the above-mentioned device also includes:
  • the second service mode update module is configured to update the service mode of the target voice assistant from the second state to the first state when the second wake-up recognition result is to wake up the target voice assistant.
  • the preset reminder information display module 930 includes:
  • the first prompt information acquisition request sending unit is configured to send a prompt information acquisition request to the server when the first wake-up recognition result is to wake up the target voice assistant, and the prompt information acquisition request includes the first target collected voice;
  • the preset prompt information receiving unit is configured to receive the preset prompt information sent by the server, and the preset prompt information is generated based on the voice collected by the first target;
  • the preset prompt information display unit is configured to display preset prompt information on the playback page.
  • the first target collection voice includes manipulation voice
  • the above-mentioned device also includes:
  • the first manipulation information receiving module is configured to receive the third manipulation information sent by the server, the third manipulation information corresponds to the manipulation voice, and the manipulation voice instructs to execute the third target interactive operation associated with the target video;
  • the first manipulation information execution module is configured to execute a third target interaction operation based on the third manipulation information.
  • the first wake-up identification module 920 includes:
  • the first preset wake-up voice acquisition unit is configured to acquire a preset wake-up voice
  • the first wake-up identification unit is configured to perform wake-up identification on the first target collected voice based on the preset wake-up voice, and obtain a first wake-up identification result.
  • the first wake-up identification module 920 includes:
  • the second preset wake-up voice acquisition unit is configured to acquire a preset wake-up voice
  • the second wake-up recognition unit is configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, and obtain a third wake-up recognition result;
  • the first target collection voice sending unit is configured to send the first target collection voice to the server when the third wake-up recognition result is to wake up the target voice assistant;
  • the first wake-up recognition result receiving unit is configured to receive the first wake-up recognition result sent by the server.
  • the first wake-up recognition result is obtained by performing wake-up recognition processing on text corresponding to the first target collected voice based on a preset wake-up recognition model.
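The two-stage wake-up recognition in this variant of module 920 can be sketched as follows: a local check against the preset wake-up voice produces the third wake-up recognition result, and only if it passes is the voice sent to the server, whose answer is the first wake-up recognition result. The function `recognize_wake` and both callables are hypothetical stand-ins for the local recognizer and the server request.

```python
from typing import Callable


def recognize_wake(target_voice: bytes,
                   local_wake: Callable[[bytes], bool],
                   server_wake: Callable[[bytes], bool]) -> bool:
    """Return the first wake-up recognition result."""
    # Local stage: cheap preliminary check (third wake-up recognition result).
    if not local_wake(target_voice):
        return False  # nothing is sent to the server
    # Server stage: confirmation by the preset wake-up recognition model.
    return server_wake(target_voice)
```

The local stage acts as a filter, so the server is contacted only for utterances that already look like the wake-up voice.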
  • the above-mentioned device also includes:
  • the voice response request sending module is configured to send a voice response request to the server, and the voice response request includes the first target collection voice;
  • the response voice receiving module is configured to receive the response voice sent by the server, and the response voice corresponds to the first target collection voice
  • the response voice playing module is configured to play the response voice.
  • the above-mentioned device also includes:
  • the closing prompt module is configured to update the preset prompt information displayed on the playback page to the close prompt information of the target voice assistant when no newly collected voice is obtained within a preset time period.
  • Fig. 10 is a block diagram of another voice interaction device according to an exemplary embodiment.
  • the device includes:
  • the second voice acquisition module 1010 is configured to acquire the second collected voice and the second played voice during the playing of the target video and when the target voice assistant has been woken up successfully, where the second played voice is the voice played in the target video when the second collected voice is collected;
  • the second acoustic echo cancellation processing module 1020 is configured to perform echo cancellation on the second collected speech based on the second playback speech, to obtain the second target collected speech;
  • the first manipulation information acquisition request sending module 1030 is configured to send a first manipulation information acquisition request to the server, where the first manipulation information acquisition request includes the second target collection voice;
  • the second manipulation information receiving module 1040 is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
  • the second target interactive operation execution module 1050 is configured to execute the first target interactive operation based on the first manipulation information.
  • Fig. 11 is a block diagram of an electronic device for voice interaction according to an exemplary embodiment.
  • the electronic device may be a terminal, and its internal structure may be as shown in Fig. 11 .
  • the electronic device includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. Among them, the processor of the electronic device is used to provide calculation and control capabilities.
  • the memory of the electronic device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system and computer programs.
  • the internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium.
  • the network interface of the electronic device is used to communicate with an external terminal through a network connection.
  • the display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen
  • the input device of the electronic device may be a touch layer covering the display screen, a button, trackball, or touch pad provided on the housing of the electronic device, or an external keyboard, touchpad, or mouse.
  • FIG. 11 is only a block diagram of a partial structure related to the disclosed solution, and does not constitute a limitation on the electronic device to which the disclosed solution is applied.
  • a specific electronic device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • an electronic device is provided, including: a processor; a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions so as to implement the voice interaction method in the embodiments of the present disclosure.
  • a computer-readable storage medium is also provided, and when instructions in the storage medium are executed by a processor of the electronic device, the electronic device can execute the voice interaction method in the embodiments of the present disclosure.
  • a computer program product including a computer program, and when the computer program is executed by a processor, the voice interaction method in the embodiment of the present disclosure is implemented.
  • Nonvolatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

A voice interaction method and apparatus, an electronic device, and a storage medium, relating to the technical field of the Internet. The method comprises: during playback of a target video, acquiring a first target collected voice (S201); performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result (S203); and in the case that the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on a playback page corresponding to the target video (S205), the preset prompt information being used for prompting that the target voice assistant has been woken up successfully and that an interaction operation associated with the target video is controlled on the basis of voice.

Description

Voice interaction method and electronic device
This disclosure is based on, and claims priority to, Chinese patent application No. 202110973383.0, filed on August 24, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The present disclosure relates to the technical field of the Internet, and in particular to a voice interaction method and an electronic device.
Background
With the development of Internet technology and the popularization of mobile devices, using mobile devices to watch videos such as films, television dramas, and live streams has become part of daily life. At present, during video playback, users often perform interactive operations such as commenting on the video and sending bullet comments.
Summary
The present disclosure provides a voice interaction method and an electronic device. The technical solutions of the present disclosure are as follows:
According to one aspect of the embodiments of the present disclosure, a voice interaction method is provided, including:
during playback of a target video, acquiring a first target collected voice;
performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
in the case that the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on the playback page corresponding to the target video, where the preset prompt information is used to indicate that the target voice assistant has been woken up successfully and that an interactive operation associated with the target video is controlled based on voice.
According to another aspect of the embodiments of the present disclosure, a voice interaction method is provided, including:
during playback of a target video and in the case that a target voice assistant has been woken up successfully, acquiring a second collected voice and a second played voice, where the second played voice is the voice played in the target video when the second collected voice is collected;
performing echo cancellation on the second collected voice based on the second played voice to obtain a second target collected voice;
sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;
receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
performing a first target interactive operation based on the first manipulation information.
According to another aspect of the embodiments of the present disclosure, a voice interaction apparatus is provided, including:
a first target collected voice acquisition module, configured to acquire a first target collected voice during playback of a target video;
a first wake-up recognition module, configured to perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
a preset prompt information display module, configured to display preset prompt information on the playback page corresponding to the target video in the case that the first wake-up recognition result is to wake up a target voice assistant, where the preset prompt information is used to prompt that the target voice assistant has been woken up successfully and that an interactive operation associated with the target video is controlled based on voice.
According to another aspect of the embodiments of the present disclosure, a voice interaction apparatus is provided, including:
a second voice acquisition module, configured to acquire a second collected voice and a second played voice during playback of a target video and in the case that a target voice assistant has been woken up successfully, where the second played voice is the voice played in the target video when the second collected voice is collected;
a second acoustic echo cancellation processing module, configured to perform echo cancellation on the second collected voice based on the second played voice to obtain a second target collected voice;
a first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;
a second manipulation information receiving module, configured to receive first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
a second target interactive operation execution module, configured to perform a first target interactive operation based on the first manipulation information.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
during playback of a target video, acquiring a first target collected voice;
performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
in the case that the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on the playback page corresponding to the target video, where the preset prompt information is used to prompt that the target voice assistant has been woken up successfully and that an interactive operation associated with the target video is controlled based on voice.
According to another aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:
during playback of a target video and in the case that a target voice assistant has been woken up successfully, acquiring a second collected voice and a second played voice, where the second played voice is the voice played in the target video when the second collected voice is collected;
performing echo cancellation on the second collected voice based on the second played voice to obtain a second target collected voice;
sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;
receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
performing a first target interactive operation based on the first manipulation information.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:
during playback of a target video, acquiring a first target collected voice;
performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
in the case that the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on the playback page corresponding to the target video, where the preset prompt information is used to prompt that the target voice assistant has been woken up successfully and that an interactive operation associated with the target video is controlled based on voice.
According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided; when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:
during playback of a target video and when a target voice assistant has been woken up successfully, acquiring a second collected voice and a second played voice, where the second played voice is the voice played in the target video while the second collected voice is being collected;
performing echo cancellation on the second collected voice based on the second played voice to obtain a second target collected voice;
sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;
receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
performing a first target interactive operation based on the first manipulation information.
According to another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the following steps:
during playback of a target video, acquiring a first target collected voice;
performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
when the first wake-up recognition result indicates waking up the target voice assistant, displaying preset prompt information on the playback page corresponding to the target video, where the preset prompt information is used to indicate that the target voice assistant has been woken up successfully and that interactive operations associated with the target video can be controlled by voice.
According to another aspect of the embodiments of the present disclosure, a computer program product containing instructions is provided, including a computer program that, when executed by a processor, implements the following steps:
during playback of a target video and when a target voice assistant has been woken up successfully, acquiring a second collected voice and a second played voice, where the second played voice is the voice played in the target video while the second collected voice is being collected;
performing echo cancellation on the second collected voice based on the second played voice to obtain a second target collected voice;
sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;
receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;
performing a first target interactive operation based on the first manipulation information.
In the technical solutions provided by the embodiments of the present disclosure, wake-up recognition is performed on the first target collected voice during playback of the target video, which avoids falsely triggered voice interaction and improves the accuracy of voice interaction. In addition, when the target voice assistant is woken up, preset prompt information is displayed to indicate that the target voice assistant has been woken up successfully and that interactive operations associated with the target video can be controlled by voice. This enables voice-based interaction with the target video, improves the convenience and efficiency of interaction, and can also improve interactivity between users and streamers in scenarios such as live streaming.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of an application environment according to an exemplary embodiment;
Fig. 2 is a flowchart of a voice interaction method according to an exemplary embodiment;
Fig. 3 is a flowchart of performing wake-up recognition on a first target collected voice to obtain a first wake-up recognition result according to an exemplary embodiment;
Fig. 4 is a schematic diagram of a playback page displaying preset prompt information according to an exemplary embodiment;
Fig. 5 is a flowchart of displaying preset prompt information on the playback page corresponding to the target video when the first wake-up recognition result indicates waking up the target voice assistant, according to an exemplary embodiment;
Fig. 6 is a flowchart of performing a corresponding interactive operation based on collected voice according to an exemplary embodiment;
Fig. 7 is another flowchart of performing a corresponding interactive operation based on collected voice according to an exemplary embodiment;
Fig. 8 is a flowchart of another voice interaction method according to an exemplary embodiment;
Fig. 9 is a block diagram of a voice interaction apparatus according to an exemplary embodiment;
Fig. 10 is a block diagram of a voice interaction apparatus according to an exemplary embodiment;
Fig. 11 is a block diagram of an electronic device for voice interaction according to an exemplary embodiment.
Detailed Description
It should be noted that the user information (including but not limited to user device information and user personal information) and data (including but not limited to data for display and data for analysis) involved in the present disclosure are information and data authorized by the user or fully authorized by all parties.
Referring to Fig. 1, which is a schematic diagram of an application environment according to an exemplary embodiment, the application environment includes a terminal 100 and a server 200.
The terminal 100 is configured to provide a live streaming service and a voice assistant service to any user. In some embodiments, the terminal 100 includes but is not limited to electronic devices such as smartphones, desktop computers, tablet computers, notebook computers, smart speakers, digital assistants, augmented reality (AR)/virtual reality (VR) devices, and smart wearable devices. In some embodiments, software running on the electronic device, such as an application, provides the live streaming service and the voice assistant service. In some embodiments, the operating system running on the electronic device includes but is not limited to Android, iOS, Linux, and Windows.
In some embodiments, the server 200 provides background services for the terminal 100. In some embodiments, the server 200 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
In addition, it should be noted that Fig. 1 shows only one application environment provided by the present disclosure. In practical applications, other application environments are possible, for example, one that includes a server and multiple terminals.
In the embodiments of this specification, the terminal 100 and the server 200 are connected directly or indirectly through wired or wireless communication, which is not limited in the present disclosure.
Fig. 2 is a flowchart of a voice interaction method according to an exemplary embodiment. As shown in Fig. 2, the voice interaction method is performed by an electronic device such as a terminal and includes the following steps S201 to S205.
In step S201, during playback of the target video, a first target collected voice is acquired.
In some embodiments, playback of the target video includes the process in which the target video is played on the corresponding playback page while the application corresponding to the target video runs in the foreground; or the process in which the target video is played on a floating pop-up playback page while the corresponding application runs in the background.
In some embodiments, the target video includes but is not limited to live video and pre-recorded video (films and TV series, short videos, and so on).
In some embodiments, acquiring the first target collected voice includes:
acquiring a first collected voice and a first played voice, where the first played voice is the voice played in the target video while the first collected voice is being collected;
performing echo cancellation on the first collected voice based on the first played voice to obtain the first target collected voice.
In some embodiments, the terminal is typically equipped with a voice collection device, such as a microphone, and voice is collected through it. Accordingly, the first collected voice is the voice information collected by the voice collection device during playback of the target video. In some embodiments, the first played voice is the voice information played in the target video while the first collected voice is being collected. In some embodiments, the target video is played by a player, and accordingly the first played voice is obtained from the player.
In practical applications, because the target video is playing while the first collected voice is being collected, the first collected voice contains not only the voice information uttered by the user but also the voice information emitted by the playback of the target video. To accurately extract the voice information uttered by the user, acoustic echo cancellation is performed on the first collected voice based on the first played voice, yielding the first target collected voice with the first played voice cancelled out, which in turn ensures the accuracy of subsequent wake-up recognition.
In some embodiments, the terminal is provided with a voice processing component, which is used to collect voice and perform acoustic echo cancellation.
In the above embodiment, acoustic echo cancellation is performed on the first collected voice using the first played voice that was playing in the target video at collection time. This ensures the validity of the first target collected voice used for voice assistant wake-up recognition and thereby improves the accuracy of subsequent wake-up recognition.
In some embodiments, when the voice played by the target video has little influence on the collected voice during playback, the first collected voice is used directly as the first target collected voice.
For example, if the volume of the voice in the target video is low while the first collected voice is being collected, such as when it is below a volume threshold, the first collected voice is clear enough, that is, the voice information uttered by the user in the first collected voice is clear enough. In that case no echo cancellation is needed, and the first collected voice is used as the first target collected voice. The volume threshold can be set to any value.
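The acquisition of the first target collected voice described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the disclosure does not name an echo cancellation algorithm, so a normalized least-mean-squares (NLMS) adaptive filter is assumed here, and the function names, the RMS-based volume measure, and the threshold value are all hypothetical.

```python
import math
import random


def nlms_echo_cancel(captured, played, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the played audio from the captured audio.

    NLMS is an assumed choice; the source only says "echo cancellation".
    """
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent played samples, newest first
    residual = []
    for mic, ref in zip(captured, played):
        history = [ref] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - echo_estimate  # what remains after removing the echo
        norm = sum(x * x for x in history) + eps
        weights = [w + mu * error * x / norm for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def rms(samples):
    """Root-mean-square level, used here as a simple volume measure."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def first_target_voice(captured, played, volume_threshold=0.01):
    """Use the captured voice as-is when playback is quiet, else cancel echo."""
    if rms(played) < volume_threshold:  # hypothetical threshold value
        return list(captured)
    return nlms_echo_cancel(captured, played)


# Demo: playback leaks into the microphone as a delayed, attenuated copy.
rng = random.Random(0)
played = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
captured = [0.0, 0.0] + [0.6 * p for p in played[:-2]]
cleaned = first_target_voice(captured, played)
```

In the demo the microphone signal contains only echo, so the tail of `cleaned` should be much quieter than the tail of `captured` once the filter has adapted. With user speech mixed in, the residual would instead approximate that speech.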
In step S203, wake-up recognition is performed on the first target collected voice to obtain a first wake-up recognition result.
Performing wake-up recognition on the first target collected voice means using the first target collected voice to determine whether to wake up the target voice assistant; the first wake-up recognition result indicates whether the target voice assistant is to be woken up.
In some embodiments, the terminal performs voice assistant wake-up recognition locally. Accordingly, performing wake-up recognition on the first target collected voice to obtain the first wake-up recognition result may include:
acquiring a preset wake-up voice;
performing wake-up recognition on the first target collected voice based on the preset wake-up voice to obtain the first wake-up recognition result.
In some embodiments, the preset wake-up voice is a voice used to trigger waking up the target voice assistant. The preset wake-up voice is set in advance according to the actual application scenario.
In some embodiments, performing wake-up recognition on the first target collected voice based on the preset wake-up voice includes matching the preset wake-up voice against the first target collected voice. When the first target collected voice contains the preset wake-up voice, the first wake-up recognition result is to wake up the target voice assistant; when the first target collected voice does not contain the preset wake-up voice, the first wake-up recognition result is not to wake up the target voice assistant.
In some embodiments, the terminal is provided with a local voice wake-up component, which performs the local wake-up recognition.
In the above embodiment, performing wake-up recognition on the first target collected voice against the preset wake-up voice avoids falsely triggered voice interaction and improves the accuracy of voice interaction.
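The matching step above can be illustrated with a deliberately naive sketch. The disclosure does not specify how the preset wake-up voice is matched against the collected voice; sliding normalized cross-correlation over raw samples is an assumption made here purely for illustration, and the function name and threshold are hypothetical. Practical wake-word detectors generally rely on acoustic models rather than raw waveform comparison.

```python
import math


def contains_wake_voice(collected, template, threshold=0.9):
    """Return True if some window of `collected` closely matches `template`.

    Uses normalized cross-correlation, so the match is insensitive to gain.
    """
    n = len(template)
    if n == 0 or len(collected) < n:
        return False
    template_norm = math.sqrt(sum(x * x for x in template))
    if template_norm == 0.0:
        return False
    for start in range(len(collected) - n + 1):
        window = collected[start:start + n]
        window_norm = math.sqrt(sum(x * x for x in window))
        if window_norm == 0.0:
            continue  # silent window cannot match
        score = sum(a * b for a, b in zip(window, template))
        score /= window_norm * template_norm
        if score >= threshold:
            return True  # wake up the target voice assistant
    return False  # do not wake up the target voice assistant


# Demo: the template embedded at half volume inside silence, and an unrelated tone.
template = [math.sin(0.3 * i) for i in range(50)]
with_wake = [0.0] * 20 + [0.5 * x for x in template] + [0.0] * 20
without_wake = [math.sin(1.7 * i) for i in range(90)]
```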
In some embodiments, in addition to the local voice assistant wake-up recognition on the terminal, a second wake-up recognition is performed by the server. Accordingly, as shown in Fig. 3, performing voice assistant wake-up recognition based on the first target collected voice to obtain the first wake-up recognition result includes the following steps:
In step S301, a preset wake-up voice is acquired.
In step S303, wake-up recognition is performed on the first target collected voice based on the preset wake-up voice to obtain a third wake-up recognition result.
In step S305, when the third wake-up recognition result is to wake up the target voice assistant, the first target collected voice is sent to the server.
In step S307, the first wake-up recognition result sent by the server is received.
In some embodiments, for steps S301 and S303, refer to the related description above; details are not repeated here.
In some embodiments, the first wake-up recognition result is obtained by the server performing wake-up recognition, based on a preset wake-up recognition model, on the text corresponding to the first target collected voice. In some embodiments, the preset wake-up recognition model is obtained by training a preset deep learning model on sample voices and the wake-up annotation information corresponding to the sample voices. In some embodiments, the sample voices include positive sample voices and negative sample voices; the wake-up annotation for a positive sample voice is to wake up the target voice assistant, and the wake-up annotation for a negative sample voice is not to wake up the target voice assistant.
In some embodiments, after receiving the first target collected voice, the server converts it into text information and inputs the text information into the preset wake-up recognition model for wake-up recognition, obtaining the first wake-up recognition result.
In some embodiments, when the third wake-up recognition result is not to wake up the target voice assistant, the first target collected voice is not sent to the server, which reduces the load on the server.
In the above embodiment, when the terminal locally recognizes that the target voice assistant is to be woken up, a second wake-up recognition is performed by the server, which improves the accuracy of wake-up recognition and avoids falsely triggered voice interaction.
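The control flow of steps S303 to S307 can be sketched as below. The local recognizer and the server call are passed in as plain callables because the disclosure does not define their interfaces; the function name, the callables, and the choice to return `False` when the local check fails are all assumptions for illustration. In the demo, strings stand in for voice data.

```python
from typing import Callable, List, TypeVar

Voice = TypeVar("Voice")


def two_stage_wake_up(
    voice: Voice,
    local_check: Callable[[Voice], bool],
    server_check: Callable[[Voice], bool],
) -> bool:
    """Local wake-up recognition first; consult the server only on a local hit."""
    third_result = local_check(voice)  # S303: third wake-up recognition result
    if not third_result:
        # The voice is not uploaded, which keeps load off the server.
        return False
    # S305/S307: upload the voice and take the server's answer as the
    # first wake-up recognition result.
    return server_check(voice)


# Demo with hypothetical stand-ins for the two recognizers.
uploaded: List[str] = []


def demo_local(voice: str) -> bool:
    return "xiao k" in voice


def demo_server(voice: str) -> bool:
    uploaded.append(voice)  # record what actually reached the "server"
    return voice.startswith("xiao k")


result_hit = two_stage_wake_up("xiao k, follow the streamer", demo_local, demo_server)
result_miss = two_stage_wake_up("turn it up", demo_local, demo_server)
```

Only the first utterance reaches `demo_server`; the second is rejected locally and never uploaded.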
In step S205, when the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the playback page corresponding to the target video.
The preset prompt information is used to indicate that the target voice assistant has been woken up successfully and that interactive operations associated with the target video can be controlled by voice. The target voice assistant is a voice assistant that controls, by voice, interactive operations associated with the target video. After the target voice assistant has been woken up successfully, the user can control interactive operations associated with the target video by voice.
In some embodiments, the preset prompt information may take forms including but not limited to text, voice, and image, and can be configured according to actual application requirements.
In some embodiments, the interactive operations associated with the target video differ across application scenarios. For example, when the target video is a live video, the associated interactive operations include but are not limited to commenting, following the corresponding streamer, and gifting virtual resources. As another example, when the target video is a pre-recorded video such as a film or TV series, the associated interactive operations include but are not limited to posting bullet comments, selecting an episode, and adjusting the resolution. As a further example, when the target video is a pre-recorded video such as a short video, the associated interactive operations include but are not limited to liking and following.
In some embodiments, when the first wake-up recognition result is not to wake up the target voice assistant, voice collection continues; when new voice is collected during playback of the target video, the voice interaction flow is performed on the new voice according to steps S201 to S205 above.
In some embodiments, Fig. 4 is a schematic diagram of a playback page displaying preset prompt information according to an exemplary embodiment; the information labeled 400 in Fig. 4 is the preset prompt information.
In some embodiments, as shown in Fig. 5, displaying the preset prompt information on the playback page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant includes:
In step S2051, when the first wake-up recognition result is to wake up the target voice assistant, a prompt information acquisition request is sent to the server, where the prompt information acquisition request includes the first target collected voice.
In step S2053, the preset prompt information sent by the server is received, where the preset prompt information is generated based on the first target collected voice.
In step S2055, the preset prompt information is displayed on the playback page.
In some embodiments, before sending voice to the server, the terminal first converts the voice format so that the converted voice can be recognized by the server, and then sends the converted voice to the server. For example, the first target collected voice is in PCM (Pulse Code Modulation) format before conversion, and the format recognizable by the server is Opus (a lossy audio coding format); sending the first target collected voice to the server then means sending the format-converted voice, that is, the first target collected voice sent to the server is in Opus format.
In some embodiments, the terminal is provided with a local format conversion component, which performs the voice format conversion. In some embodiments, the voice format conversion function is integrated into the local voice wake-up component described above.
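The conversion step can be sketched up to the point where an encoder takes over. Opus encoders consume fixed-duration frames (20 ms is a common choice), so the sketch below splits floating-point samples into 20 ms frames of 16-bit little-endian PCM and hands each frame to an encoder supplied by the caller. The disclosure does not describe the conversion component's interface: the function names, the 16 kHz sample rate, and the `encode_frame` callable are hypothetical, and no real Opus library is invoked here.

```python
import struct
from typing import Callable, List, Sequence


def pcm16_frames(
    samples: Sequence[float], sample_rate: int = 16000, frame_ms: int = 20
) -> List[bytes]:
    """Split float samples in [-1, 1] into fixed-size 16-bit PCM frames."""
    frame_len = sample_rate * frame_ms // 1000
    frames = []
    for start in range(0, len(samples), frame_len):
        chunk = list(samples[start:start + frame_len])
        chunk += [0.0] * (frame_len - len(chunk))  # zero-pad the final frame
        ints = [max(-32768, min(32767, int(round(s * 32767)))) for s in chunk]
        frames.append(struct.pack("<%dh" % frame_len, *ints))
    return frames


def convert_for_server(
    samples: Sequence[float],
    encode_frame: Callable[[bytes], bytes],
    sample_rate: int = 16000,
) -> List[bytes]:
    """Encode every PCM frame with the supplied (hypothetical) encoder."""
    return [encode_frame(frame) for frame in pcm16_frames(samples, sample_rate)]


# Demo: 400 samples at 16 kHz become two 320-sample frames (the second padded).
demo_frames = pcm16_frames([0.0] * 400)
passthrough = convert_for_server([1.0, -1.0], encode_frame=lambda frame: frame)
```

Passing an identity function as `encode_frame` leaves the PCM untouched, which is convenient for checking the framing before a real encoder is plugged in.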
In some embodiments, the first target collected voice includes a manipulation voice, and after the first target collected voice is sent to the server, the method further includes:
receiving third manipulation information sent by the server, where the manipulation voice instructs execution of a third target interactive operation associated with the target video;
performing the third target interactive operation based on the third manipulation information.
In some embodiments, in addition to the preset wake-up voice, the first target collected voice also includes voice information instructing execution of an interactive operation associated with the target video. Because the prompt information acquisition request carries the first target collected voice, the server can perform semantic analysis on it and determine the third manipulation information at the same time as it determines the preset prompt information, so that the terminal can subsequently perform the third target interactive operation based on the third manipulation information.
In some embodiments, taking a live streaming scenario as an example, assume the text corresponding to the preset wake-up voice is "小k" and the text corresponding to the first target collected voice is "小k, I want to follow the streamer"; the third manipulation information is then an instruction to follow the streamer. In some embodiments, after receiving the third manipulation information, the terminal automatically triggers the interactive operation of following the streamer (the third target interactive operation).
In the above embodiment, when the first wake-up recognition result is to wake up the target voice assistant, carrying the first target collected voice in the prompt information acquisition request allows the terminal to obtain, together with the preset prompt information, the third manipulation information corresponding to the manipulation voice in the first target collected voice from the server. The interactive operation is then executed automatically, which improves the convenience and efficiency of interaction.
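The server-side handling of a prompt information acquisition request whose voice carries both the wake word and a command can be sketched as follows. The disclosure only says that the server performs semantic analysis; plain keyword matching on an already transcribed text is substituted here as an assumption, and the wake word spelling, the command table, the instruction names, and the response fields are all hypothetical.

```python
from typing import Dict

WAKE_WORD = "xiao k"  # romanized stand-in for the wake word in the example

# Hypothetical phrase-to-instruction table standing in for semantic analysis.
COMMANDS = {
    "follow the streamer": "FOLLOW_STREAMER",
    "send a comment": "POST_COMMENT",
}


def build_prompt_response(transcript: str) -> Dict[str, str]:
    """Return prompt info, plus third manipulation info if a command follows."""
    text = transcript.lower()
    response = {
        "prompt": "Assistant is awake. Say a command to interact with the video."
    }
    # Only the part after the wake word is treated as the manipulation voice.
    remainder = text.split(WAKE_WORD, 1)[1] if WAKE_WORD in text else ""
    for phrase, instruction in COMMANDS.items():
        if phrase in remainder:
            response["third_manipulation_info"] = instruction
            break
    return response


with_command = build_prompt_response("Xiao K, I want to follow the streamer")
wake_only = build_prompt_response("Xiao K")
```

When the utterance contains only the wake word, the response carries the prompt alone, and the terminal would wait for a separate command utterance.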
As can be seen from the technical solutions provided by the above embodiments of the present disclosure, performing voice assistant wake-up recognition on the first target collected voice during playback of the target video avoids falsely triggered voice interaction and improves the accuracy of voice interaction. In addition, when the target voice assistant is woken up, the playback page corresponding to the target video displays preset prompt information indicating that the target voice assistant has been woken up successfully and that interactive operations associated with the target video can be controlled by voice. This enables voice-based interaction with the target video, improves the convenience and efficiency of interaction, and can also improve interactivity between users and streamers in scenarios such as live streaming.
In some embodiments, after the preset prompt information is displayed on the playback page corresponding to the target video, a corresponding interactive operation can also be performed based on collected voice. Accordingly, as shown in Fig. 6, the method further includes:
In step S601, a second collected voice and a second played voice are acquired, where the second played voice is the voice played in the target video while the second collected voice is being collected.
In step S603, echo cancellation is performed on the second collected voice based on the second played voice to obtain a second target collected voice.
In step S605, a first manipulation information acquisition request is sent to the server, where the first manipulation information acquisition request includes the second target collected voice.
In step S607, first manipulation information sent by the server is received, where the first manipulation information corresponds to the second target collected voice.
In step S609, a first target interactive operation is performed based on the first manipulation information.
The first target interactive operation is the operation corresponding to the second collected voice and is also an operation associated with the target video.
In some embodiments, steps S601 and S603 are analogous to step S201 above and are not described again here.
In some embodiments, the second target collected voice is a voice obtained after the target voice assistant has been woken up and is a manipulation voice. After the second target collected voice is obtained, a first manipulation information acquisition request carrying the second target collected voice is sent to the server. After receiving the first manipulation information acquisition request, the server performs semantic analysis on the second target collected voice to determine the first manipulation information and returns it to the terminal, so that the terminal can perform the first target interactive operation based on the first manipulation information.
In some embodiments, taking a live streaming scenario as an example, the text corresponding to the preset wake-up voice is "小k", the text corresponding to the second target collected voice is "I want to follow the streamer", and the first manipulation information is an instruction to follow the streamer. In some embodiments, after receiving the first manipulation information, the terminal automatically triggers the interactive operation of following the streamer (the first target interactive operation).
In some embodiments, during playback of the target video, when the voice played in the target video has little influence on the collected voice, the second collected voice is used directly as the second target collected voice.
For example, if the volume of the voice in the target video is low while the second collected voice is being collected, such as when that volume is below a volume threshold, the second collected voice is clear enough; that is, the speech uttered by the user in the second collected voice is clear enough. Echo cancellation therefore does not need to be performed on the second collected voice, and the second collected voice can be used as the second target collected voice. The volume threshold may be any value.
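The volume-threshold bypass described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the RMS level as the volume measure, the names `rms`, `select_target_voice` and `subtract_reference`, and the toy sample-wise canceller are all assumptions introduced here.

```python
import math
from typing import Callable, List, Sequence

Samples = Sequence[float]


def rms(samples: Samples) -> float:
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_target_voice(
    collected: Samples,
    played: Samples,
    volume_threshold: float,
    cancel_echo: Callable[[Samples, Samples], Samples],
) -> List[float]:
    """Return the target collected voice.

    If the played voice is quieter than the threshold, the collected voice is
    used as-is; otherwise echo cancellation is applied first.
    """
    if rms(played) < volume_threshold:
        return list(collected)
    return list(cancel_echo(collected, played))


def subtract_reference(collected: Samples, played: Samples) -> List[float]:
    """Toy canceller (assumption): subtract a time-aligned playback reference.

    A real acoustic echo canceller would use an adaptive filter instead.
    """
    return [c - p for c, p in zip(collected, played)]
```

In this sketch the canceller is passed in as a callable, so the bypass decision stays independent of whichever echo cancellation algorithm the terminal actually uses.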
In the above embodiment, after the target voice assistant has been successfully woken up, acoustic echo cancellation is performed on the second collected voice using the second played voice. This ensures the validity of the manipulation voice (the second target collected voice) and the accuracy of the first manipulation information obtained from the server, and thus improves the precision of voice interaction in addition to the convenience and efficiency of interaction.
In some embodiments, after the preset prompt information is displayed on the playback page corresponding to the target video, the method further includes:
updating the service mode of the target voice assistant from a first state to a second state when the first target collected voice includes a target interaction instruction voice.
The target interaction instruction voice indicates multi-round interaction. The service mode in the first state (referred to as the single-round interaction mode) indicates that, while the target voice assistant is awake, a voice-controlled interactive operation associated with the target video is performed once; that is, after the target voice assistant is woken up, it is closed once a single voice-controlled interactive operation associated with the target video has been performed.
The service mode in the second state (referred to as the multi-round interaction mode) indicates that, while the target voice assistant is awake, a voice-controlled interactive operation associated with the target video is performed at least once; that is, after the target voice assistant is woken up, one or more voice-controlled interactive operations associated with the target video can be performed.
The target interaction instruction voice indicates multi-round interaction; in other words, it indicates that the multi-round interaction mode is to be turned on. In some embodiments, the target interaction instruction voice is a preset specific voice for turning on the multi-round interaction mode. For example, the specific voice is "turn on multi-round interaction mode"; when the specific voice is recognized in the first target collected voice, it is determined that the first target collected voice includes the target interaction instruction voice.
In some embodiments, the target interaction instruction voice is voice information whose semantics require multiple interactions, for example "I want to send a gift". In some embodiments, interaction recognition is performed on the first target collected voice based on a preset interaction recognition model, to determine whether the first target collected voice includes the target interaction instruction voice.
In some embodiments, the preset interaction recognition model is obtained by training a preset deep learning model on sample voices and the interaction annotation information corresponding to the sample voices. In some embodiments, the sample voices include positive sample voices and negative sample voices. The interaction annotation information of a positive sample voice is the target interaction instruction voice; the interaction annotation information of a negative sample voice is an interaction instruction voice other than the target interaction instruction voice, which indicates that multi-round interaction is not to be performed.
In some embodiments, when the server receives the first target collected voice for the first time, it converts the first target collected voice into text information and inputs the text information into the preset interaction recognition model for interaction recognition, to determine whether the first target collected voice includes the target interaction instruction voice.
In some embodiments, when the second target collected voice includes the target interaction instruction voice, the service mode of the target voice assistant is updated from the first state to the second state.
In the above embodiment, when the first target collected voice includes the target interaction instruction voice, the service mode of the target voice assistant is updated from the first state to the second state, so that at least one voice-controlled interactive operation associated with the target video can be performed while the target voice assistant is awake. This improves the convenience and efficiency of voice interaction, and also increases its diversity.
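The state update described above can be illustrated with a small sketch covering only the preset-phrase embodiment. The enum `ServiceMode`, the phrase list, and the assumption that the voice has already been converted to text are illustrative choices made here; the model-based embodiment would replace the phrase lookup with a trained classifier.

```python
from enum import Enum


class ServiceMode(Enum):
    SINGLE_ROUND = 1  # "first state": one interactive operation per wake-up
    MULTI_ROUND = 2   # "second state": one or more operations per wake-up


# Hypothetical preset phrases that turn on the multi-round interaction mode.
MULTI_ROUND_PHRASES = ("turn on multi-round interaction mode",)


def includes_interaction_instruction(recognized_text: str) -> bool:
    """True if the recognized text contains a preset multi-round phrase."""
    text = recognized_text.lower()
    return any(phrase in text for phrase in MULTI_ROUND_PHRASES)


def update_service_mode(mode: ServiceMode, recognized_text: str) -> ServiceMode:
    """Switch from the first state to the second state when instructed."""
    if mode is ServiceMode.SINGLE_ROUND and includes_interaction_instruction(
        recognized_text
    ):
        return ServiceMode.MULTI_ROUND
    return mode
```

For example, `update_service_mode(ServiceMode.SINGLE_ROUND, "Little K, turn on multi-round interaction mode")` returns `ServiceMode.MULTI_ROUND`, while a bare wake-up phrase leaves the mode unchanged.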
In some embodiments, after the service mode in the second state is turned on, a corresponding interactive operation is performed based on the collected voice. Accordingly, as shown in FIG. 7, the method further includes:
In step S701, a third collected voice and a third played voice are obtained, the third played voice being the voice played in the target video when the third collected voice is collected;
In step S703, echo cancellation is performed on the third collected voice based on the third played voice, to obtain a third target collected voice;
In step S705, wake-up recognition is performed on the third target collected voice, to obtain a second wake-up recognition result;
In step S707, when the second wake-up recognition result is not to wake up the target voice assistant, a second manipulation information acquisition request is sent to the server, the second manipulation information acquisition request including the third target collected voice;
In step S709, second manipulation information sent by the server is received, the second manipulation information corresponding to the third target collected voice;
In step S711, a second target interactive operation is performed based on the second manipulation information.
The second target interactive operation is an operation corresponding to the third collected voice, and is also an operation associated with the target video.
A second wake-up recognition result of not waking up the target voice assistant indicates that the third target collected voice is only a voice for controlling an operation related to the target video; that is, the third target collected voice is only a manipulation voice obtained while the target voice assistant is in the multi-round interaction mode.
Steps S701 to S711 are analogous to steps S601 to S609 and step S203 above, and are not described again here.
In some embodiments, during playback of the target video, when the voice played in the target video has little influence on the collected voice, the third collected voice is used directly as the third target collected voice.
For example, if the volume of the voice in the target video is low while the third collected voice is being collected, such as when that volume is below a volume threshold, the third collected voice is clear enough; that is, the speech uttered by the user in the third collected voice is clear enough. Echo cancellation therefore does not need to be performed on the third collected voice, and the third collected voice can be used as the third target collected voice. The volume threshold may be any value.
In the above embodiment, after the multi-round interaction mode is turned on, acoustic echo cancellation is performed on the newly collected third collected voice using the third played voice. This ensures the validity of the manipulation voice (the third target collected voice), and improves the convenience, efficiency and precision of voice interaction.
In some embodiments, the method further includes:
updating the service mode of the target voice assistant from the second state to the first state when the second wake-up recognition result is to wake up the target voice assistant.
In some embodiments, to support the service mode in the second state, the terminal creates two recognition engine instances at the same time: one for wake-up recognition and the other for semantic recognition in multi-round interaction. While the target voice assistant is in the service mode of the second state, if the wake-up recognition engine recognizes that the preset wake-up voice has been collected again, that is, the second wake-up recognition result is to wake up the target voice assistant, the multi-round interaction mode of the target voice assistant is interrupted and the target voice assistant re-enters the service mode of the first state.
In the above embodiment, in the multi-round interaction mode, re-waking the target voice assistant interrupts the multi-round interaction mode and re-enters the single-round interaction mode, allowing flexible switching between the two interaction modes.
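The branch between steps S705 to S711 and the fallback to the first state can be sketched as below. This is only an illustration under stated assumptions: voices are represented as already-recognized text, and `MultiRoundHandler`, `is_wake_word` and `fetch_manipulation` are hypothetical stand-ins for the two recognition engine instances and the server round trip.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class ServiceMode(Enum):
    SINGLE_ROUND = 1  # "first state"
    MULTI_ROUND = 2   # "second state"


@dataclass
class MultiRoundHandler:
    # Stand-in for the wake-up recognition engine instance.
    is_wake_word: Callable[[str], bool]
    # Stand-in for the semantic engine and the server request/response.
    fetch_manipulation: Callable[[str], str]
    mode: ServiceMode = ServiceMode.MULTI_ROUND

    def handle(self, third_target_voice: str) -> Optional[str]:
        """Process one echo-cancelled voice while in multi-round mode.

        Returns the manipulation information to execute, or None when the
        wake-up phrase was heard again and the mode fell back to single-round.
        """
        if self.is_wake_word(third_target_voice):
            self.mode = ServiceMode.SINGLE_ROUND
            return None
        return self.fetch_manipulation(third_target_voice)
```

Running wake-up recognition before the semantic request means a repeated wake-up phrase is never forwarded to the server as a manipulation voice.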
In some embodiments, the method further includes:
sending a voice response request to the server, the voice response request including the first target collected voice;
receiving a response voice sent by the server, the response voice corresponding to the first target collected voice;
playing the response voice.
To improve the user experience, a corresponding response voice is obtained from the server after the target voice assistant is woken up. In some embodiments, the response voice informs the user, in spoken form, that the target voice assistant has been woken up; the content of the response voice is preset according to the actual application.
For example, the text corresponding to the preset wake-up voice is "Little K", the first target collected voice is "Little K", and the text corresponding to the response voice is "I'm here".
As another example, the text corresponding to the preset wake-up voice is "Little K", the first target collected voice is "Little K, I want a gift", and the text corresponding to the response voice is "I'm here, go ahead".
In the above embodiment, playing the response voice corresponding to the first target collected voice improves interactivity with the user and thereby the user experience.
In some embodiments, after the preset prompt information is displayed on the playback page corresponding to the target video, the method further includes:
updating the preset prompt information displayed on the playback page to closing prompt information of the target voice assistant when no newly collected voice is obtained within a preset time period.
In some embodiments, the newly collected voice is a voice collected after the target voice assistant is woken up, or a voice obtained by performing acoustic echo cancellation on a voice collected while the target voice assistant is awake.
To prevent the voice assistant from idling on standby for a long time, an interaction waiting duration is set in advance. Once the interaction waiting duration is exceeded, the target voice assistant is closed and must be woken up again. In some embodiments, the interaction waiting duration is any preset duration; it is the maximum time, counted from the moment the target voice assistant is woken up, to wait for a newly collected voice.
In some embodiments, the preset time period is determined from the preset interaction waiting duration and the time at which the target voice assistant is woken up. In some embodiments, when no newly collected voice is obtained within the interaction waiting duration after the target voice assistant is woken up, it is determined that the target voice assistant has been closed due to timeout, and the preset prompt information displayed on the playback page is updated to the closing prompt information of the target voice assistant. The interaction waiting duration after the target voice assistant is woken up is the preset time period.
In the above embodiment, after the target voice assistant is woken up, when no newly collected voice is obtained within the preset time period, the preset prompt information displayed on the playback page is updated to the closing prompt information of the target voice assistant. This avoids long periods of idle standby and reduces device resource consumption; in addition, displaying the closing prompt information reminds the user that the target voice assistant has been closed, which improves the user experience.
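The timeout behaviour described above can be sketched as follows. The class `AssistantSession`, its field names and the prompt strings are hypothetical; timestamps are passed in explicitly (for instance from `time.monotonic()`) so that the logic does not depend on a real clock.

```python
from dataclasses import dataclass

CLOSING_PROMPT = "Voice assistant closed"  # hypothetical closing prompt text


@dataclass
class AssistantSession:
    woken_at: float        # time at which the assistant was woken up
    wait_seconds: float    # preset interaction waiting duration
    prompt: str = "Voice assistant is listening"  # hypothetical preset prompt
    received_voice: bool = False
    closed: bool = False

    def on_new_voice(self, now: float) -> None:
        """Record a newly collected voice if it arrives inside the window."""
        if not self.closed and now - self.woken_at <= self.wait_seconds:
            self.received_voice = True

    def check_timeout(self, now: float) -> str:
        """Close the assistant and swap the prompt once the window expires."""
        expired = now - self.woken_at > self.wait_seconds
        if expired and not self.received_voice and not self.closed:
            self.closed = True
            self.prompt = CLOSING_PROMPT
        return self.prompt
```

Here the preset time period is the interval from `woken_at` to `woken_at + wait_seconds`, matching the description that it is derived from the wake-up time and the interaction waiting duration.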
FIG. 8 is a flowchart of another voice interaction method according to an exemplary embodiment. As shown in FIG. 8, the voice interaction method is performed by an electronic device such as a terminal, and includes the following steps:
In step S801, during playback of a target video and when a target voice assistant has been successfully woken up, a second collected voice and a second played voice are obtained, the second played voice being the voice played in the target video when the second collected voice is collected;
In step S803, echo cancellation is performed on the second collected voice based on the second played voice, to obtain a second target collected voice;
In step S805, a first manipulation information acquisition request is sent to the server, the first manipulation information acquisition request including the second target collected voice;
In step S807, first manipulation information sent by the server is received, the first manipulation information corresponding to the second target collected voice;
In step S809, a first target interactive operation is performed based on the first manipulation information.
In the embodiments of the present disclosure, steps S801 to S809 are analogous to steps S601 to S609 above, and are not described again here.
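Steps S803 to S809 form a simple linear pipeline, which can be sketched as below. The function name `run_manipulation_round` and its three callable parameters are hypothetical; they stand in for the echo canceller, the server request/response pair, and the player-side executor, none of which are specified at this level in the disclosure.

```python
from typing import Callable, List, Sequence

Samples = Sequence[float]


def run_manipulation_round(
    collected: Samples,
    played: Samples,
    cancel_echo: Callable[[Samples, Samples], Samples],
    request_manipulation: Callable[[Samples], str],
    execute: Callable[[str], str],
) -> str:
    """Run one round: echo cancellation, server request, then execution."""
    # S803: remove the played voice from the collected voice.
    target_voice = cancel_echo(collected, played)
    # S805 and S807: send the request and receive the manipulation information.
    manipulation = request_manipulation(target_voice)
    # S809: perform the interactive operation associated with the target video.
    return execute(manipulation)


def subtract_reference(collected: Samples, played: Samples) -> List[float]:
    """Toy canceller (assumption): subtract a time-aligned playback reference."""
    return [c - p for c, p in zip(collected, played)]
```

Passing the three stages in as callables keeps the sketch independent of any particular transport or playback API.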
In the method provided by the embodiments of the present disclosure, after the target voice assistant has been successfully woken up during playback of the target video, acoustic echo cancellation is performed on the second collected voice using the second played voice. This ensures the validity of the manipulation voice (the second target collected voice) and the accuracy of the first manipulation information obtained from the server, and thus improves the precision of voice interaction. It also enables voice-based interaction with the target video, improving the convenience and efficiency of interaction.
FIG. 9 is a block diagram of a voice interaction apparatus according to an exemplary embodiment. Referring to FIG. 9, the apparatus includes:
a first target collected voice acquisition module 910, configured to obtain a first target collected voice during playback of a target video;
a first wake-up recognition module 920, configured to perform voice assistant wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
a preset prompt information display module 930, configured to display preset prompt information on the playback page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, the preset prompt information being used to indicate that the target voice assistant has been successfully woken up and that interactive operations associated with the target video can be controlled by voice.
In some embodiments, the first target collected voice acquisition module 910 includes:
a first voice acquisition unit, configured to obtain a first collected voice and a first played voice during playback of the target video, the first played voice being the voice played in the target video when the first collected voice is collected;
a first acoustic echo cancellation processing unit, configured to perform echo cancellation on the first collected voice based on the first played voice, to obtain the first target collected voice.
In some embodiments, the apparatus further includes:
a second voice acquisition module, configured to obtain a second collected voice and a second played voice;
a second acoustic echo cancellation processing module, configured to perform acoustic echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice, the second played voice being the voice played in the target video when the second collected voice is collected;
a first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to the server, the first manipulation information acquisition request including the second target collected voice;
a second manipulation information receiving module, configured to receive first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
a second target interactive operation execution module, configured to perform a first target interactive operation based on the first manipulation information.
In some embodiments, the apparatus further includes:
a first service mode update module, configured to update the service mode of the target voice assistant from a first state to a second state when the first target collected voice includes a target interaction instruction voice, wherein the target interaction instruction voice indicates multi-round interaction, the service mode in the first state indicates that a voice-controlled interactive operation associated with the target video is performed once while the target voice assistant is awake, and the service mode in the second state indicates that a voice-controlled interactive operation associated with the target video is performed at least once while the target voice assistant is awake.
In some embodiments, the apparatus further includes:
a third voice acquisition module, configured to obtain a third collected voice and a third played voice, the third played voice being the voice played in the target video when the third collected voice is collected;
a third acoustic echo cancellation processing module, configured to perform echo cancellation on the third collected voice based on the third played voice, to obtain a third target collected voice;
a second wake-up recognition module, configured to perform wake-up recognition on the third target collected voice to obtain a second wake-up recognition result;
a second manipulation information acquisition request sending module, configured to send a second manipulation information acquisition request to the server when the second wake-up recognition result is not to wake up the target voice assistant, the second manipulation information acquisition request including the third target collected voice;
a third manipulation information receiving module, configured to receive second manipulation information sent by the server, the second manipulation information corresponding to the third target collected voice;
a third target interactive operation execution module, configured to perform a second target interactive operation based on the second manipulation information.
In some embodiments, the apparatus further includes:
a second service mode update module, configured to update the service mode of the target voice assistant from the second state to the first state when the second wake-up recognition result is to wake up the target voice assistant.
In some embodiments, the preset prompt information display module 930 includes:
a first prompt information acquisition request sending unit, configured to send a prompt information acquisition request to the server when the first wake-up recognition result is to wake up the target voice assistant, the prompt information acquisition request including the first target collected voice;
a preset prompt information receiving unit, configured to receive the preset prompt information sent by the server, the preset prompt information being generated based on the first target collected voice;
a preset prompt information display unit, configured to display the preset prompt information on the playback page.
In some embodiments, the first target collected voice includes a manipulation voice, and the apparatus further includes:
a first manipulation information receiving module, configured to receive third manipulation information sent by the server, the third manipulation information corresponding to the manipulation voice, and the manipulation voice instructing that a third target interactive operation associated with the target video be performed;
a first manipulation information execution module, configured to perform the third target interactive operation based on the third manipulation information.
In some embodiments, the first wake-up recognition module 920 includes:
a first preset wake-up voice acquisition unit, configured to obtain a preset wake-up voice;
a first wake-up recognition unit, configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain the first wake-up recognition result.
In some embodiments, the first wake-up recognition module 920 includes:
a second preset wake-up voice acquisition unit, configured to obtain a preset wake-up voice;
a second wake-up recognition unit, configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain a third wake-up recognition result;
a first target collected voice sending unit, configured to send the first target collected voice to the server when the third wake-up recognition result is to wake up the target voice assistant;
a first wake-up recognition result receiving unit, configured to receive the first wake-up recognition result sent by the server, the first wake-up recognition result being obtained by performing wake-up recognition on the text corresponding to the first target collected voice based on a preset wake-up recognition model.
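The two-stage wake-up recognition performed by these units (a local check followed by server-side confirmation) can be sketched as below. The names `recognize_wake`, `local_wake` and `server_wake` are hypothetical, voices are represented as text for brevity, and treating a negative local result as "do not wake" without contacting the server is an assumption made for this sketch.

```python
from typing import Callable


def recognize_wake(
    first_target_voice: str,
    local_wake: Callable[[str], bool],
    server_wake: Callable[[str], bool],
) -> bool:
    """Return the first wake-up recognition result.

    The local check yields the "third wake-up recognition result"; only when
    it says "wake" is the voice sent to the server for confirmation.
    """
    third_result = local_wake(first_target_voice)
    if not third_result:
        # Assumption: a negative local result ends recognition locally.
        return False
    return server_wake(first_target_voice)
```

One consequence of this ordering is that the server is contacted only for voices that already passed the on-device check, which limits network traffic during video playback.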
In some embodiments, the apparatus further includes:
a voice response request sending module, configured to send a voice response request to the server, the voice response request including the first target collected voice;
a response voice receiving module, configured to receive a response voice sent by the server, the response voice corresponding to the first target collected voice;
a response voice playing module, configured to play the response voice.
In some embodiments, the apparatus further includes:
a closing prompt module, configured to update the preset prompt information displayed on the playback page to closing prompt information of the target voice assistant when no newly collected voice is obtained within a preset time period.
Fig. 10 is a block diagram of another voice interaction device according to an exemplary embodiment. Referring to Fig. 10, the device includes:
A second voice acquisition module 1010, configured to acquire a second collected voice and a second played voice while the target video is playing and the target voice assistant has been successfully woken up, the second played voice being the voice played in the target video when the second collected voice is collected;
A second acoustic echo cancellation processing module 1020, configured to perform echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
A first manipulation information acquisition request sending module 1030, configured to send a first manipulation information acquisition request to the server, the first manipulation information acquisition request including the second target collected voice;
A second manipulation information receiving module 1040, configured to receive first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
A second target interactive operation execution module 1050, configured to execute a first target interactive operation based on the first manipulation information.
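The data flow through modules 1010 to 1050 can be sketched as a single function that chains three injected steps. Everything named here is hypothetical: `handle_command_voice`, the sample-wise subtraction used as a stand-in echo canceller, the `fake_server` rule, and the "like" operation are assumptions for illustration only, and a real client would call an actual echo canceller and a network API.

```python
from typing import Callable, Dict, List

Audio = List[float]


def handle_command_voice(
    collected: Audio,
    played: Audio,
    cancel_echo: Callable[[Audio, Audio], Audio],
    request_manipulation_info: Callable[[Audio], Dict[str, str]],
    execute_operation: Callable[[Dict[str, str]], str],
) -> str:
    """Echo-cancel, ask the server for manipulation info, then execute it."""
    target_voice = cancel_echo(collected, played)                 # module 1020
    manipulation_info = request_manipulation_info(target_voice)  # 1030 + 1040
    return execute_operation(manipulation_info)                   # module 1050


# Stand-ins so the sketch runs without audio hardware or a network.
def subtract_playback(collected: Audio, played: Audio) -> Audio:
    # Naive placeholder, not a real acoustic echo canceller.
    return [c - p for c, p in zip(collected, played)]


def fake_server(target_voice: Audio) -> Dict[str, str]:
    # Pretend any residual speech energy maps to a "like" operation.
    heard_speech = any(abs(s) > 0.1 for s in target_voice)
    return {"operation": "like" if heard_speech else "none"}


def run_operation(info: Dict[str, str]) -> str:
    return f"executed:{info['operation']}"


result = handle_command_voice(
    [0.5, 0.9, 0.2], [0.5, 0.4, 0.2],
    subtract_playback, fake_server, run_operation,
)
```

Injecting the three steps as callables mirrors the module split in Fig. 10 and lets each step be replaced independently.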
Fig. 11 is a block diagram of an electronic device for voice interaction according to an exemplary embodiment. The electronic device may be a terminal, and its internal structure may be as shown in Fig. 11. The electronic device includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. The processor of the electronic device provides computing and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the electronic device is used to communicate with an external terminal over a network connection. The computer program, when executed by the processor, implements a voice interaction method. The display screen of the electronic device may be a liquid crystal display or an electronic ink display. The input device of the electronic device may be a touch layer covering the display screen, a button, trackball or touchpad provided on the housing of the electronic device, or an external keyboard, touchpad or mouse.
Those skilled in the art can understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure related to the solution of the present disclosure and does not limit the electronic device to which the solution is applied. A specific electronic device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In an exemplary embodiment, an electronic device is also provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the voice interaction method in the embodiments of the present disclosure.
In an exemplary embodiment, a computer-readable storage medium is also provided. When instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to execute the voice interaction method in the embodiments of the present disclosure.
In an exemplary embodiment, a computer program product is also provided, including a computer program that, when executed by a processor, implements the voice interaction method in the embodiments of the present disclosure.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
All embodiments of the present disclosure may be implemented alone or in combination with other embodiments, and all such implementations are regarded as falling within the scope of protection claimed by the present disclosure.

Claims (43)

  1. A voice interaction method, comprising:
    acquiring a first target collected voice while a target video is playing;
    performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
    when the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on a play page corresponding to the target video, the preset prompt information being used to indicate that the target voice assistant has been successfully woken up and that interactive operations associated with the target video can be controlled by voice.
  2. The voice interaction method according to claim 1, wherein acquiring the first target collected voice comprises:
    acquiring a first collected voice and a first played voice, the first played voice being the voice played in the target video when the first collected voice is collected;
    performing echo cancellation on the first collected voice based on the first played voice, to obtain the first target collected voice.
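The claims do not name an echo cancellation algorithm. As one possible illustration, the sketch below uses a normalized least mean squares (NLMS) adaptive filter, a common textbook choice that is swapped in here as an assumption rather than taken from the disclosure. The function name `nlms_echo_cancel` and the parameters `taps`, `mu` and `eps` are hypothetical.

```python
from typing import List, Sequence


def nlms_echo_cancel(
    collected: Sequence[float],
    played: Sequence[float],
    taps: int = 8,
    mu: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Remove the echo of `played` from `collected` with an NLMS filter.

    `collected` is the microphone signal (the collected voice) and
    `played` is the reference signal (the played voice), sample-aligned.
    Returns the residual, i.e. the target collected voice.
    """
    weights = [0.0] * taps   # running estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual: List[float] = []
    for mic, ref in zip(collected, played):
        history = [ref] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - echo_estimate          # what is left after cancelling
        power = sum(x * x for x in history) + eps
        # Normalized update: step size scaled by reference signal power.
        weights = [w + mu * error * x / power
                   for w, x in zip(weights, history)]
        residual.append(error)
    return residual
```

In this sketch the filter needs some samples to converge, so the first part of the residual still contains echo. Production systems usually rely on a tuned library or platform echo canceller instead of a hand-written filter.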
  3. The voice interaction method according to claim 1, wherein the method further comprises:
    acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;
    performing echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    sending a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    receiving first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    executing a first target interactive operation based on the first manipulation information.
  4. The voice interaction method according to any one of claims 1 to 3, wherein the method further comprises:
    when the first target collected voice includes a target interaction instruction voice, updating a service mode of the target voice assistant from a first state to a second state, wherein the target interaction instruction voice indicates multiple rounds of interaction, the service mode in the first state indicates that one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is woken up, and the service mode in the second state indicates that at least one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is woken up.
  5. The voice interaction method according to claim 4, wherein the method further comprises:
    acquiring a third collected voice and a third played voice, the third played voice being the voice played in the target video when the third collected voice is collected;
    performing echo cancellation on the third collected voice based on the third played voice, to obtain a third target collected voice;
    performing wake-up recognition on the third target collected voice to obtain a second wake-up recognition result;
    when the second wake-up recognition result is not to wake up the target voice assistant, sending a second manipulation information acquisition request to a server, the second manipulation information acquisition request including the third target collected voice;
    receiving second manipulation information sent by the server, the second manipulation information corresponding to the third target collected voice;
    executing a second target interactive operation based on the second manipulation information.
  6. The voice interaction method according to claim 5, wherein the method further comprises:
    when the second wake-up recognition result is to wake up the target voice assistant, updating the service mode of the target voice assistant from the second state to the first state.
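The service mode transitions in claims 4 to 6 can be read as a two-state machine. The sketch below is one hedged interpretation: the names `ServiceMode` and `AssistantSession`, the use of text strings as stand-ins for voice, and the three injected callables are all assumptions for illustration and are not defined in the claims.

```python
from enum import Enum
from typing import Callable, Optional


class ServiceMode(Enum):
    FIRST_STATE = "one interaction per wake-up"
    SECOND_STATE = "multi-round interaction"


class AssistantSession:
    """Tracks the service mode of a woken-up voice assistant."""

    def __init__(
        self,
        is_wake_voice: Callable[[str], bool],
        is_multi_round_instruction: Callable[[str], bool],
        request_manipulation_info: Callable[[str], str],
    ) -> None:
        self.is_wake_voice = is_wake_voice
        self.is_multi_round_instruction = is_multi_round_instruction
        self.request_manipulation_info = request_manipulation_info
        self.mode = ServiceMode.FIRST_STATE

    def on_first_target_voice(self, voice: str) -> None:
        # Claim 4: an instruction asking for multi-round interaction
        # moves the service mode from the first to the second state.
        if self.is_multi_round_instruction(voice):
            self.mode = ServiceMode.SECOND_STATE

    def on_third_target_voice(self, voice: str) -> Optional[str]:
        """Handle a follow-up voice; intended for the second state."""
        if self.is_wake_voice(voice):
            # Claim 6: a new wake-up result resets to the first state.
            self.mode = ServiceMode.FIRST_STATE
            return None
        # Claim 5: no wake word, so treat the voice as a command.
        return self.request_manipulation_info(voice)
```

A usage example under the same assumptions: construct the session with simple predicates such as `lambda v: v.startswith("hey assistant")`, call `on_first_target_voice` once after wake-up, then feed each later utterance to `on_third_target_voice`.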
  7. The voice interaction method according to any one of claims 1 to 3, wherein displaying the preset prompt information on the play page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant comprises:
    when the first wake-up recognition result is to wake up the target voice assistant, sending a prompt information acquisition request to a server, the prompt information acquisition request including the first target collected voice;
    receiving the preset prompt information sent by the server, the preset prompt information being generated based on the first target collected voice;
    displaying the preset prompt information on the play page.
  8. The voice interaction method according to claim 7, wherein the first target collected voice includes a manipulation voice, and the method further comprises:
    receiving third manipulation information sent by the server, the third manipulation information corresponding to the manipulation voice, and the manipulation voice indicating execution of a third target interactive operation associated with the target video;
    executing the third target interactive operation based on the third manipulation information.
  9. The voice interaction method according to any one of claims 1 to 3, wherein performing wake-up recognition on the first target collected voice to obtain the first wake-up recognition result comprises:
    acquiring a preset wake-up voice;
    performing wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain the first wake-up recognition result.
  10. The voice interaction method according to any one of claims 1 to 3, wherein performing wake-up recognition on the first target collected voice to obtain the first wake-up recognition result comprises:
    acquiring a preset wake-up voice;
    performing wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain a third wake-up recognition result;
    when the third wake-up recognition result is to wake up the target voice assistant, sending the first target collected voice to a server;
    receiving the first wake-up recognition result sent by the server, the first wake-up recognition result being obtained by performing wake-up recognition, based on a preset wake-up recognition model, on text corresponding to the first target collected voice.
  11. The voice interaction method according to any one of claims 1 to 3, wherein the method further comprises:
    sending a voice response request to a server, the voice response request including the first target collected voice;
    receiving a response voice sent by the server, the response voice corresponding to the first target collected voice;
    playing the response voice.
  12. The voice interaction method according to any one of claims 1 to 3, wherein the method further comprises:
    when no newly collected voice is acquired within a preset time period, updating the preset prompt information displayed on the play page to closing prompt information of the target voice assistant.
  13. A voice interaction method, comprising:
    acquiring a second collected voice and a second played voice while a target video is playing and a target voice assistant has been successfully woken up, the second played voice being the voice played in the target video when the second collected voice is collected;
    performing echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    sending a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    receiving first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    executing a first target interactive operation based on the first manipulation information.
  14. A voice interaction device, comprising:
    a first target collected voice acquisition module, configured to acquire a first target collected voice while a target video is playing;
    a first wake-up recognition module, configured to perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
    a preset prompt information display module, configured to display preset prompt information on a play page corresponding to the target video when the first wake-up recognition result is to wake up a target voice assistant, the preset prompt information being used to indicate that the target voice assistant has been successfully woken up and that interactive operations associated with the target video can be controlled by voice.
  15. The voice interaction device according to claim 14, wherein the first target collected voice acquisition module comprises:
    a first voice acquisition unit, configured to acquire a first collected voice and a first played voice while the target video is playing, the first played voice being the voice played in the target video when the first collected voice is collected;
    a first acoustic echo cancellation processing unit, configured to perform echo cancellation on the first collected voice based on the first played voice, to obtain the first target collected voice.
  16. The voice interaction device according to claim 14, wherein the device further comprises:
    a second voice acquisition module, configured to acquire a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;
    a second acoustic echo cancellation processing module, configured to perform echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    a first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    a second manipulation information receiving module, configured to receive first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    a second target interactive operation execution module, configured to execute a first target interactive operation based on the first manipulation information.
  17. The voice interaction device according to any one of claims 14 to 16, wherein the device further comprises:
    a first service mode update module, configured to update a service mode of the target voice assistant from a first state to a second state when the first target collected voice includes a target interaction instruction voice, wherein the target interaction instruction voice indicates multiple rounds of interaction, the service mode in the first state indicates that one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is woken up, and the service mode in the second state indicates that at least one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is woken up.
  18. The voice interaction device according to claim 17, wherein the device further comprises:
    a third voice acquisition module, configured to acquire a third collected voice and a third played voice, the third played voice being the voice played in the target video when the third collected voice is collected;
    a third acoustic echo cancellation processing module, configured to perform echo cancellation on the third collected voice based on the third played voice, to obtain a third target collected voice;
    a second wake-up recognition module, configured to perform wake-up recognition on the third target collected voice to obtain a second wake-up recognition result;
    a second manipulation information acquisition request sending module, configured to send a second manipulation information acquisition request to a server when the second wake-up recognition result is not to wake up the target voice assistant, the second manipulation information acquisition request including the third target collected voice;
    a third manipulation information receiving module, configured to receive second manipulation information sent by the server, the second manipulation information corresponding to the third target collected voice;
    a third target interactive operation execution module, configured to execute a second target interactive operation based on the second manipulation information.
  19. The voice interaction device according to claim 18, wherein the device further comprises:
    a second service mode update module, configured to update the service mode of the target voice assistant from the second state to the first state when the second wake-up recognition result is to wake up the target voice assistant.
  20. The voice interaction device according to any one of claims 14 to 16, wherein the preset prompt information display module comprises:
    a first prompt information acquisition request sending unit, configured to send a prompt information acquisition request to a server when the first wake-up recognition result is to wake up the target voice assistant, the prompt information acquisition request including the first target collected voice;
    a preset prompt information receiving unit, configured to receive the preset prompt information sent by the server, the preset prompt information being generated based on the first target collected voice;
    a preset prompt information display unit, configured to display the preset prompt information on the play page.
  21. The voice interaction device according to claim 20, wherein the first target collected voice includes a manipulation voice, and the device further comprises:
    a first manipulation information receiving module, configured to receive third manipulation information sent by the server, the third manipulation information corresponding to the manipulation voice, and the manipulation voice indicating execution of a third target interactive operation associated with the target video;
    a first manipulation information execution module, configured to execute the third target interactive operation based on the third manipulation information.
  22. The voice interaction device according to any one of claims 14 to 16, wherein the first wake-up recognition module comprises:
    a first preset wake-up voice acquisition unit, configured to acquire a preset wake-up voice;
    a first wake-up recognition unit, configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain the first wake-up recognition result.
  23. The voice interaction device according to any one of claims 14 to 16, wherein the first wake-up recognition module comprises:
    a second preset wake-up voice acquisition unit, configured to acquire a preset wake-up voice;
    a second wake-up recognition unit, configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain a third wake-up recognition result;
    a first target collected voice sending unit, configured to send the first target collected voice to a server when the third wake-up recognition result is to wake up the target voice assistant;
    a first wake-up recognition result receiving unit, configured to receive the first wake-up recognition result sent by the server, the first wake-up recognition result being obtained by performing wake-up recognition, based on a preset wake-up recognition model, on text corresponding to the first target collected voice.
  24. The voice interaction device according to any one of claims 14 to 16, wherein the device further comprises:
    a voice response request sending module, configured to send a voice response request to a server, the voice response request including the first target collected voice;
    a response voice receiving module, configured to receive a response voice sent by the server, the response voice corresponding to the first target collected voice;
    a response voice playing module, configured to play the response voice.
  25. The voice interaction device according to any one of claims 14 to 16, wherein the device further comprises:
    a closing prompt module, configured to update the preset prompt information displayed on the play page to closing prompt information of the target voice assistant when no newly collected voice is acquired within a preset time period.
  26. A voice interaction device, comprising:
    a second voice acquisition module, configured to acquire a second collected voice and a second played voice while a target video is playing and a target voice assistant has been successfully woken up, the second played voice being the voice played in the target video when the second collected voice is collected;
    a second acoustic echo cancellation processing module, configured to perform echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    a first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    a second manipulation information receiving module, configured to receive first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    a second target interactive operation execution module, configured to execute a first target interactive operation based on the first manipulation information.
  27. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a first target collected voice while a target video is playing;
    performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;
    when the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on a play page corresponding to the target video, the preset prompt information being used to indicate that the target voice assistant has been successfully woken up and that interactive operations associated with the target video can be controlled by voice.
  28. The electronic device according to claim 27, wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a first collected voice and a first played voice, the first played voice being the voice played in the target video when the first collected voice is collected;
    performing echo cancellation on the first collected voice based on the first played voice, to obtain the first target collected voice.
  29. The electronic device according to claim 27, wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;
    performing echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    sending a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    receiving first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    executing a first target interactive operation based on the first manipulation information.
  30. The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:
    in a case where the first target collected voice includes a target interaction indication voice, updating the service mode of the target voice assistant from a first state to a second state, the target interaction indication voice indicating multi-round interaction, the service mode in the first state indicating that, while the target voice assistant is awake, one voice-controlled interactive operation associated with the target video is performed, and the service mode in the second state indicating that, while the target voice assistant is awake, at least one voice-controlled interactive operation associated with the target video is performed.
  31. The electronic device according to claim 30, wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a third collected voice and a third played voice, the third played voice being the voice played in the target video when the third collected voice is collected;
    performing echo cancellation on the third collected voice based on the third played voice, to obtain a third target collected voice;
    performing wake-up recognition on the third target collected voice, to obtain a second wake-up recognition result;
    in a case where the second wake-up recognition result is not to wake up the target voice assistant, sending a second manipulation information acquisition request to a server, the second manipulation information acquisition request including the third target collected voice;
    receiving second manipulation information sent by the server, the second manipulation information corresponding to the third target collected voice;
    performing a second target interactive operation based on the second manipulation information.
  32. The electronic device according to claim 31, wherein the processor is configured to execute the instructions to implement the following steps:
    in a case where the second wake-up recognition result is to wake up the target voice assistant, updating the service mode of the target voice assistant from the second state to the first state.
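The service-mode transitions in claims 30 to 32 can be read as a small state machine. The sketch below is one possible interpretation, not the applicant's implementation: recognized voice is stood in for by plain text, `WAKE_PHRASE` and `MULTI_ROUND_PHRASE` are hypothetical phrases, the `requests` list stands in for manipulation information acquisition requests sent to the server, and putting the assistant back to sleep after one operation in the first state is an assumption about what "one interactive operation" implies.

```python
from enum import Enum

WAKE_PHRASE = "hello assistant"        # hypothetical preset wake-up phrase
MULTI_ROUND_PHRASE = "keep listening"  # hypothetical multi-round indication


class ServiceMode(Enum):
    SINGLE_ROUND = 1  # "first state": one operation per wake-up
    MULTI_ROUND = 2   # "second state": at least one operation per wake-up


class AssistantSession:
    def __init__(self):
        self.awake = False
        self.mode = ServiceMode.SINGLE_ROUND
        self.requests = []  # stand-in for requests sent to the server

    def handle(self, utterance):
        text = utterance.lower()
        if WAKE_PHRASE in text:
            # Claim 30: a wake-up voice carrying the indication enters the second state.
            # Claim 32: a plain wake-up voice returns the mode to the first state.
            self.awake = True
            if MULTI_ROUND_PHRASE in text:
                self.mode = ServiceMode.MULTI_ROUND
            else:
                self.mode = ServiceMode.SINGLE_ROUND
            return
        if not self.awake:
            return  # not awake: voice is ignored
        # Claim 31: a non-wake voice is forwarded to obtain manipulation information.
        self.requests.append(utterance)
        if self.mode is ServiceMode.SINGLE_ROUND:
            self.awake = False  # assumption: single-round mode ends after one operation
```

For example, after `handle("hello assistant keep listening")` the session stays awake across several commands, while a later `handle("hello assistant")` drops it back to one command per wake-up.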
  33. The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:
    in a case where the first wake-up recognition result is to wake up the target voice assistant, sending a prompt information acquisition request to a server, the prompt information acquisition request including the first target collected voice;
    receiving the preset prompt information sent by the server, the preset prompt information being generated based on the first target collected voice;
    displaying the preset prompt information on the playback page.
  34. The electronic device according to claim 33, wherein the processor is configured to execute the instructions to implement the following steps:
    receiving third manipulation information sent by the server, the third manipulation information corresponding to the manipulation voice, the manipulation voice instructing execution of a third target interactive operation associated with the target video;
    performing the third target interactive operation based on the third manipulation information.
  35. The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a preset wake-up voice;
    performing wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain the first wake-up recognition result.
  36. The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:
    acquiring a preset wake-up voice;
    performing wake-up recognition on the first target collected voice based on the preset wake-up voice, to obtain a third wake-up recognition result;
    in a case where the third wake-up recognition result is to wake up the target voice assistant, sending the first target collected voice to a server;
    receiving the first wake-up recognition result sent by the server, the first wake-up recognition result being obtained by performing wake-up recognition, based on a preset wake-up recognition model, on text corresponding to the first target collected voice.
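Claim 36 describes a two-stage check: a local comparison against the preset wake-up voice, followed by a server-side check on the recognized text. The sketch below illustrates only that control flow. Everything concrete in it is an assumption: voices are reduced to toy feature vectors, the local stage uses cosine similarity with a hypothetical threshold, and `server_wake_check` is a local stub standing in for the server's preset wake-up recognition model.

```python
import math
from dataclasses import dataclass


@dataclass
class Utterance:
    features: list    # toy acoustic features (assumption)
    transcript: str   # text the server-side recognizer would produce (assumption)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def local_wake_check(utterance, preset_features, threshold=0.8):
    """First stage, on the device: compare against the preset wake-up voice."""
    return cosine(utterance.features, preset_features) >= threshold


def server_wake_check(transcript, wake_text="hello assistant"):
    """Second stage, stubbed here: text-based wake-up recognition."""
    return transcript.strip().lower() == wake_text


def recognize_wake(utterance, preset_features):
    # Only voices that pass the local stage are sent on for the text-based check.
    if not local_wake_check(utterance, preset_features):
        return False
    return server_wake_check(utterance.transcript)
```

The point of the second stage in this sketch is that a voice which merely sounds like the wake-up phrase passes the cheap local check but is rejected once its text is examined.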
  37. The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:
    sending a voice response request to a server, the voice response request including the first target collected voice;
    receiving a response voice sent by the server, the response voice corresponding to the first target collected voice;
    playing the response voice.
  38. The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:
    in a case where no newly collected voice is acquired within a preset time period, updating the preset prompt information displayed on the playback page to closing prompt information of the target voice assistant.
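The timeout behavior in claim 38 can be sketched with an injectable clock. This is a minimal illustration under stated assumptions: the class name `PromptController`, the prompt strings, and the 8-second default are hypothetical, and times are passed in explicitly so the logic can be exercised without waiting.

```python
LISTENING_PROMPT = "Assistant is listening"  # hypothetical preset prompt text
CLOSING_PROMPT = "Assistant closed"          # hypothetical closing prompt text


class PromptController:
    def __init__(self, timeout_s=8.0):
        self.timeout_s = timeout_s  # the "preset time period" (value assumed)
        self.prompt = None
        self.last_voice_at = None

    def on_wake(self, now):
        """Show the preset prompt once the assistant has been woken up."""
        self.prompt = LISTENING_PROMPT
        self.last_voice_at = now

    def on_voice(self, now):
        """Record that a newly collected voice arrived."""
        if self.prompt == LISTENING_PROMPT:
            self.last_voice_at = now

    def tick(self, now):
        """Swap in the closing prompt if no new voice arrived within the timeout."""
        if self.prompt == LISTENING_PROMPT and now - self.last_voice_at >= self.timeout_s:
            self.prompt = CLOSING_PROMPT
        return self.prompt
```

A caller would invoke `tick` periodically with the current time, for instance from the playback page's refresh loop.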
  39. An electronic device, comprising:
    a processor;
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the instructions to implement the following steps:
    during playback of a target video, and in a case where a target voice assistant has been successfully woken up, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;
    performing echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    sending a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    receiving first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    performing a first target interactive operation based on the first manipulation information.
  40. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the following steps:
    during playback of a target video, acquiring a first target collected voice;
    performing wake-up recognition on the first target collected voice, to obtain a first wake-up recognition result;
    in a case where the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on a playback page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant has been successfully woken up and that interactive operations associated with the target video can be controlled by voice.
  41. A computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the following steps:
    during playback of a target video, and in a case where a target voice assistant has been successfully woken up, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;
    performing echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    sending a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    receiving first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    performing a first target interactive operation based on the first manipulation information.
  42. A computer program product, comprising a computer program which, when executed by a processor, performs the following steps:
    during playback of a target video, acquiring a first target collected voice;
    performing wake-up recognition on the first target collected voice, to obtain a first wake-up recognition result;
    in a case where the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on a playback page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant has been successfully woken up and that interactive operations associated with the target video can be controlled by voice.
  43. A computer program product, comprising a computer program which, when executed by a processor, performs the following steps:
    during playback of a target video, and in a case where a target voice assistant has been successfully woken up, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;
    performing echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;
    sending a first manipulation information acquisition request to a server, the first manipulation information acquisition request including the second target collected voice;
    receiving first manipulation information sent by the server, the first manipulation information corresponding to the second target collected voice;
    performing a first target interactive operation based on the first manipulation information.
PCT/CN2022/077091 2021-08-24 2022-02-21 Voice interaction method and electronic device WO2023024455A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110973383.0 2021-08-24
CN202110973383.0A CN113628622A (en) 2021-08-24 2021-08-24 Voice interaction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2023024455A1 (en) 2023-03-02

Family

ID=78387377

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/077091 WO2023024455A1 (en) 2021-08-24 2022-02-21 Voice interaction method and electronic device

Country Status (2)

Country Link
CN (1) CN113628622A (en)
WO (1) WO2023024455A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628622A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Voice interaction method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140006825A1 (en) * 2012-06-30 2014-01-02 David Shenhav Systems and methods to wake up a device from a power conservation state
CN109348275A (en) * 2018-10-30 2019-02-15 百度在线网络技术(北京)有限公司 Method for processing video frequency and device
CN110545475A (en) * 2019-08-26 2019-12-06 北京奇艺世纪科技有限公司 video playing method and device and electronic equipment
CN111916068A (en) * 2019-05-07 2020-11-10 北京地平线机器人技术研发有限公司 Audio detection method and device
CN112530419A (en) * 2019-09-19 2021-03-19 百度在线网络技术(北京)有限公司 Voice recognition control method and device, electronic equipment and readable storage medium
CN112634897A (en) * 2020-12-31 2021-04-09 青岛海尔科技有限公司 Equipment awakening method and device, storage medium and electronic device
WO2021072914A1 (en) * 2019-10-14 2021-04-22 苏州思必驰信息科技有限公司 Human-machine conversation processing method
CN113628622A (en) * 2021-08-24 2021-11-09 北京达佳互联信息技术有限公司 Voice interaction method and device, electronic equipment and storage medium

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100221871B1 (en) * 1997-04-29 1999-09-15 전주범 Recording method and device of vtr
CN104363517B (en) * 2014-11-12 2018-05-11 科大讯飞股份有限公司 Method for switching languages and system based on tv scene and voice assistant
CN105892902A (en) * 2015-12-11 2016-08-24 乐视网信息技术(北京)股份有限公司 Operation method of mobile equipment live application, and mobile client
CN106101796A (en) * 2016-06-29 2016-11-09 乐视控股(北京)有限公司 Method and device is appreciated in beating of a kind of net cast
CN106303658B (en) * 2016-08-19 2018-11-30 百度在线网络技术(北京)有限公司 Exchange method and device applied to net cast
CN106375864B (en) * 2016-08-25 2019-04-26 广州华多网络科技有限公司 Virtual objects distribute control method, device and mobile terminal
WO2018083511A1 (en) * 2016-11-03 2018-05-11 北京金锐德路科技有限公司 Audio playing apparatus and method
US10405064B2 (en) * 2017-10-17 2019-09-03 Kuma LLC Systems and methods for prompting and incorporating unscripted user content into live broadcast programming
CN108335696A (en) * 2018-02-09 2018-07-27 百度在线网络技术(北京)有限公司 Voice awakening method and device
KR102093030B1 (en) * 2018-07-27 2020-03-24 (주)휴맥스 Smart projector and method for controlling thereof
US10971160B2 (en) * 2018-11-13 2021-04-06 Comcast Cable Communications, Llc Methods and systems for determining a wake word
CN112346695A (en) * 2019-08-09 2021-02-09 华为技术有限公司 Method for controlling equipment through voice and electronic equipment
CN110610699B (en) * 2019-09-03 2023-03-24 北京达佳互联信息技术有限公司 Voice signal processing method, device, terminal, server and storage medium
CN110706703A (en) * 2019-10-16 2020-01-17 珠海格力电器股份有限公司 Voice wake-up method, device, medium and equipment
CN111653276B (en) * 2020-06-22 2022-04-12 四川长虹电器股份有限公司 Voice awakening system and method
CN112311635B (en) * 2020-11-05 2022-05-17 深圳市奥谷奇技术有限公司 Voice interruption awakening method and device and computer readable storage medium
CN112911324B (en) * 2021-01-29 2022-10-28 北京达佳互联信息技术有限公司 Content display method and device for live broadcast room, server and storage medium

Also Published As

Publication number Publication date
CN113628622A (en) 2021-11-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 22859814

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE