WO2023024455A1

WO2023024455A1 - Voice interaction method and electronic device

Info

Publication number: WO2023024455A1
Application number: PCT/CN2022/077091
Authority: WO
Inventors: 程益君; 徐昕媚
Original assignee: 北京达佳互联信息技术有限公司
Priority date: 2021-08-24
Filing date: 2022-02-21
Publication date: 2023-03-02
Also published as: CN113628622A

Abstract

A voice interaction method and apparatus, an electronic device, and a storage medium, relating to the technical field of the Internet. The method comprises: during playback of a target video, acquiring a first target collected voice (S201); performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result (S203); and, in the case that the first wake-up recognition result is to wake up a target voice assistant, displaying preset prompt information on a playback page corresponding to the target video (S205), the preset prompt information being used to prompt that the target voice assistant has been woken up successfully and that interactive operations associated with the target video can be controlled by voice.

Description

Voice interaction method and electronic device

This disclosure is based on a Chinese patent application with an application date of August 24, 2021 and application number 202110973383.0, and claims the priority of this Chinese patent application. The entire content of this Chinese patent application is hereby incorporated by reference into this disclosure.

Technical field

The present disclosure relates to the technical field of the Internet, and in particular to a voice interaction method and an electronic device.

Background

With the development of Internet technology and the popularization of mobile devices, using mobile devices to watch videos such as film and television dramas and live broadcasts has become part of people's daily life. Currently, during video playback, users often perform interactive operations such as commenting on videos and sending bullet comments.

Summary

The present disclosure provides a voice interaction method and an electronic device. The technical solution of the disclosure is as follows:

According to an aspect of an embodiment of the present disclosure, a voice interaction method is provided, including:

Acquiring a first target collected voice during playback of a target video;

Performing wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

In the case where the first wake-up recognition result is to wake up the target voice assistant, displaying preset prompt information on the playback page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant has been successfully awakened and that interactive operations associated with the target video can be controlled by voice.

According to another aspect of the embodiments of the present disclosure, a voice interaction method is provided, including:

During playback of the target video and in the case that the target voice assistant has been successfully awakened, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

Performing echo cancellation on the second collected voice based on the second played voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

Receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Performing a first target interactive operation based on the first manipulation information.

According to another aspect of the embodiments of the present disclosure, a voice interaction device is provided, including:

The first target collected voice acquisition module is configured to acquire the first target collected voice during playback of the target video;

The first wake-up recognition module is configured to perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

The preset prompt information display module is configured to display preset prompt information on the playback page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, the preset prompt information being used to prompt that the target voice assistant has been successfully awakened and that interactive operations associated with the target video can be controlled by voice.

The second voice acquisition module is configured to acquire a second collected voice and a second played voice when the target voice assistant has been successfully awakened during playback of the target video, the second played voice being the voice played in the target video when the second collected voice is collected;

The second acoustic echo cancellation processing module is configured to perform echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;

A first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

The second manipulation information receiving module is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

The second target interactive operation execution module is configured to execute the first target interactive operation based on the first manipulation information.

According to another aspect of an embodiment of the present disclosure, there is provided an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the following steps:

In the case that the first wake-up recognition result is to wake up the target voice assistant, displaying preset prompt information on the playback page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant has been successfully awakened and that interactive operations associated with the target video can be controlled by voice.

According to another aspect of the embodiments of the present disclosure, a computer-readable storage medium is provided, and when instructions in the storage medium are executed by a processor of an electronic device, the electronic device can perform the following steps:

According to another aspect of the embodiments of the present disclosure, a computer program product is provided, including a computer program, and when the computer program is executed by a processor, the following steps are implemented:

According to another aspect of the embodiments of the present disclosure, there is provided a computer program product containing instructions, including a computer program, and when the computer program is executed by a processor, the following steps are implemented:

In the technical solution provided by the embodiments of the present disclosure, wake-up recognition is performed on the first target collected voice during playback of the target video, which avoids falsely triggered voice interaction and improves the accuracy of voice interaction. In addition, when the target voice assistant is to be woken up, preset prompt information is displayed to prompt that the target voice assistant has been successfully awakened and that interactive operations associated with the target video can be controlled by voice. This enables voice-based interaction with the target video, improves the convenience and efficiency of interaction, and in turn improves the interaction between users and anchors in live broadcast and other scenarios.

Brief description of the drawings

Fig. 1 is a schematic diagram showing an application environment according to an exemplary embodiment;

Fig. 2 is a flowchart of a voice interaction method according to an exemplary embodiment;

Fig. 3 is a flowchart of performing wake-up recognition on a first target collected voice to obtain a first wake-up recognition result according to an exemplary embodiment;

Fig. 4 is a schematic diagram of a playback page displaying preset prompt information according to an exemplary embodiment;

Fig. 5 is a flowchart of displaying preset prompt information on the playback page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, according to an exemplary embodiment;

Fig. 6 is a flowchart of performing a corresponding interactive operation based on collected voice according to an exemplary embodiment;

Fig. 7 is another flowchart of performing a corresponding interactive operation based on collected voice according to an exemplary embodiment;

Fig. 8 is a flow chart showing another voice interaction method according to an exemplary embodiment;

Fig. 9 is a block diagram of a voice interaction device according to an exemplary embodiment;

Fig. 10 is a block diagram of a voice interaction device according to an exemplary embodiment;

Fig. 11 is a block diagram showing an electronic device for voice interaction according to an exemplary embodiment.

Detailed description

It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for display, data for analysis, etc.) involved in this disclosure are information and data authorized by the user or fully authorized by all parties.

Please refer to FIG. 1. FIG. 1 is a schematic diagram showing an application environment according to an exemplary embodiment. As shown in FIG. 1, the application environment includes a terminal 100 and a server 200.

The terminal 100 is used to provide live broadcast services and voice assistant services to any user. In some embodiments, the terminal 100 includes, but is not limited to, electronic devices such as smartphones, desktop computers, tablet computers, notebook computers, smart speakers, digital assistants, augmented reality (AR)/virtual reality (VR) devices, and smart wearable devices. In some embodiments, software running on the electronic device, such as an application program, is used to provide the live broadcast service and the voice assistant service. In some embodiments, the operating system running on the electronic device includes, but is not limited to, Android, iOS, Linux, and Windows.

In some embodiments, the server 200 provides background services for the terminal 100. In some embodiments, the server 200 is an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.

In addition, it should be noted that what is shown in FIG. 1 is only an application environment provided by the present disclosure, and in actual application, other application environments are also included, for example, other application environments include a server and multiple terminals.

In the embodiment of the present specification, the terminal 100 and the server 200 are connected directly or indirectly through wired or wireless communication, which is not limited in this disclosure.

Fig. 2 is a flow chart of a voice interaction method according to an exemplary embodiment. As shown in Fig. 2 , the voice interaction method is executed by an electronic device such as a terminal, and includes the following steps S201 to S205.

In step S201, during the playing of the target video, the first target collected voice is acquired.

In some embodiments, the playback of the target video includes the process of playing the target video on the corresponding playback page while the application corresponding to the target video runs in the foreground, or the process of playing the video on a floating pop-up playback page while the corresponding application runs in the background.

In some embodiments, the target video includes but is not limited to live video, pre-recorded video (movies, short videos, etc.).

In some embodiments, the acquisition of the first target collected voice includes:

Acquiring a first collected voice and a first played voice, the first played voice being the voice played in the target video when the first collected voice is collected;

Performing echo cancellation on the first collected voice based on the first played voice to obtain the first target collected voice.

In some embodiments, the terminal is provided with a voice collection device, such as a microphone, and voice is collected through it. Correspondingly, the first collected voice is the voice information collected by the voice collection device during playback of the target video. In some embodiments, the first played voice is the voice information played in the target video when the first collected voice is collected. In some embodiments, the target video is played by a player, and correspondingly, the first played voice is acquired from the player.

In practical applications, because the target video is playing while the first collected voice is collected, the first collected voice contains not only the voice information uttered by the user but also the voice emitted by the playback of the target video. To accurately extract the voice information uttered by the user, acoustic echo cancellation is performed on the first collected voice based on the first played voice, yielding the first target collected voice with the first played voice cancelled out, which ensures the accuracy of subsequent wake-up recognition.

In some embodiments, the terminal is provided with a voice processing component, and the voice processing component is used for collecting voice and performing acoustic echo cancellation processing.

In the above embodiment, acoustic echo cancellation is performed on the first collected voice using the first played voice that was playing in the target video during voice collection. This ensures the validity of the first target collected voice used for voice assistant wake-up recognition and thereby improves the accuracy of subsequent voice wake-up recognition.
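The echo cancellation described above can be illustrated with a minimal sketch. The disclosure does not name a specific algorithm; the normalized least-mean-squares (NLMS) adaptive filter used here is a common textbook choice and is an assumption, as are the function names, the filter length, the step size, and the made-up three-tap echo path in the demo. The played voice serves as the reference signal, and the residual after subtracting the estimated echo stands in for the first target collected voice.

```python
import math
import random


def nlms_echo_cancel(collected, played, taps=16, mu=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `played` from `collected`.

    collected: microphone samples (user speech plus echo of the played audio)
    played:    reference samples taken from the player, same length
    Returns the residual signal, i.e. the estimate of the user's speech.
    """
    weights = [0.0] * taps      # adaptive estimate of the echo path
    history = [0.0] * taps      # most recent reference samples, newest first
    residual = []
    for mic, ref in zip(collected, played):
        history = [ref] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - echo_estimate             # what is left after removing the echo
        norm = sum(x * x for x in history) + eps
        step = mu * error / norm                # normalized LMS update
        weights = [w + step * x for w, x in zip(weights, history)]
        residual.append(error)
    return residual


def simulate_echo(played, path):
    """Convolve the reference with a short, made-up echo path (demo only)."""
    return [
        sum(h * played[n - k] for k, h in enumerate(path) if n - k >= 0)
        for n in range(len(played))
    ]


def rms(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


# Demo: the microphone hears only the echo of the played audio (no user speech).
random.seed(0)
played = [random.uniform(-1.0, 1.0) for _ in range(2000)]
collected = simulate_echo(played, [0.6, 0.3, 0.1])
cleaned = nlms_echo_cancel(collected, played)
print("echo level before:", rms(collected[-500:]))
print("echo level after: ", rms(cleaned[-500:]))
```

Once the filter has adapted, the residual level should be far below the echo level. A production canceller would also need to handle the user and the video speaking at the same time, which this sketch does not attempt.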

In some embodiments, during playback of the target video, if the voice played in the target video has little influence on the collected voice, the first collected voice is used directly as the first target collected voice.

For example, while the first collected voice is being collected, the volume of the voice played in the target video may be low. If that volume is less than a volume threshold, the first collected voice is clear enough, that is, the voice information uttered by the user in the first collected voice is clear enough, so there is no need to perform echo cancellation on it and the first collected voice can be used as the first target collected voice. The volume threshold may be any value.
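The threshold check above can be sketched as follows. Measuring volume as the root-mean-square level of the played samples is an assumption (the disclosure does not say how volume is measured), and select_target_voice, rms_volume, and the placeholder canceller are hypothetical names introduced only for illustration.

```python
import math


def rms_volume(samples):
    """Root-mean-square level of a block of samples (0.0 for an empty block)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def select_target_voice(collected, played, volume_threshold, echo_canceller):
    """Return the target collected voice for one block of audio.

    If the playback level is below `volume_threshold`, the collected voice is
    used directly; otherwise `echo_canceller(collected, played)` is applied.
    """
    if rms_volume(played) < volume_threshold:
        return collected
    return echo_canceller(collected, played)


def subtract_reference(collected, played):
    """Placeholder canceller for the demo: plain sample-wise subtraction."""
    return [c - p for c, p in zip(collected, played)]


mic = [0.75, -0.25, 0.5, -0.5]
quiet_playback = [0.001, -0.001, 0.001, -0.001]
loud_playback = [0.5, -0.5, 0.5, -0.5]

unchanged = select_target_voice(mic, quiet_playback, 0.05, subtract_reference)
cancelled = select_target_voice(mic, loud_playback, 0.05, subtract_reference)
```

Skipping cancellation for quiet playback saves work on the terminal; the threshold value of 0.05 is arbitrary here, consistent with the text leaving it open.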

In step S203, wake-up recognition is performed on the first target collected voice to obtain a first wake-up recognition result.

Wherein, performing wake-up recognition on the first target collected voice means judging whether to wake up the target voice assistant based on the first target collected voice, and the first wake-up recognition result is used to indicate whether to wake up the target voice assistant.

In some embodiments, the terminal performs wake-up recognition of the voice assistant locally. Correspondingly, performing wake-up recognition on the first target collected voice and obtaining the first wake-up recognition result may include:

Obtain the preset wake-up voice;

Based on the preset wake-up voice, wake-up recognition is performed on the first target collected voice to obtain a first wake-up recognition result.

In some embodiments, the preset wake-up voice is a voice used to trigger the wake-up of the target voice assistant. The preset wake-up voice is preset in combination with actual application scenarios.

In some embodiments, performing wake-up recognition on the first target collected voice based on the preset wake-up voice includes matching the preset wake-up voice against the first target collected voice. When the first target collected voice includes the preset wake-up voice, the first wake-up recognition result is to wake up the target voice assistant; when it does not, the first wake-up recognition result is not to wake up the target voice assistant.

In some embodiments, the terminal is provided with a local voice wake-up component, and the local voice wake-up component is used for local wake-up recognition.

In the above embodiment, wake-up recognition is performed on the first target collected voice based on the preset wake-up voice, which avoids falsely triggered voice interaction and improves the accuracy of voice interaction.
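The matching rule above (wake up only if the collected voice includes the preset wake-up voice) can be sketched on text. This assumes both voices are already available as transcripts, which the disclosure does not state for the local step; a real keyword spotter would typically work on audio features. The function names are hypothetical.

```python
import re


def normalize(text):
    """Lower-case, drop punctuation, and split into word tokens."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def local_wake_recognition(target_voice_text, preset_wake_text):
    """Return True (wake up) if the preset wake phrase occurs in the utterance."""
    utterance = normalize(target_voice_text)
    wake = normalize(preset_wake_text)
    if not wake:
        return False
    n = len(wake)
    # Slide a window of the wake phrase's length over the utterance tokens.
    return any(utterance[i:i + n] == wake for i in range(len(utterance) - n + 1))


woke = local_wake_recognition("Little K, I want to follow the anchor", "little k")
ignored = local_wake_recognition("I like kittens", "little k")
```

Matching whole tokens rather than raw substrings keeps an unrelated word that merely contains the wake phrase from triggering the assistant.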

In some embodiments, in addition to the wake-up recognition performed locally on the terminal, a second wake-up recognition is performed in conjunction with the server. Correspondingly, as shown in FIG. 3, performing wake-up recognition on the first target collected voice to obtain the first wake-up recognition result includes the following steps:

In step S301, a preset wake-up voice is acquired.

In step S303, wake-up recognition is performed on the first target collected voice based on the preset wake-up voice to obtain a third wake-up recognition result.

In step S305, if the third wake-up recognition result is to wake up the target voice assistant, send the first target collected voice to the server.

In step S307, the first wake-up recognition result sent by the server is received.

In some embodiments, refer to the relevant description above for the above step S301 and step S303, and details are not repeated here.

In some embodiments, the first wake-up recognition result is obtained by the server performing wake-up recognition processing, based on a preset wake-up recognition model, on the text corresponding to the first target collected voice. In some embodiments, the preset wake-up recognition model is obtained by training a preset deep learning model on sample voices and the wake-up annotation information corresponding to each sample voice. In some embodiments, the sample voices include positive and negative sample voices: the wake-up annotation information of a positive sample voice is to wake up the target voice assistant, and that of a negative sample voice is not to wake up the target voice assistant.

In some embodiments, after receiving the first target collected voice, the server converts the first target collected voice into text information, and inputs the text information into a preset wake-up recognition model for wake-up recognition processing to obtain a first wake-up recognition result.

In some embodiments, if the third wake-up recognition result is not to wake up the target voice assistant, the first target collected voice is not sent to the server, thereby reducing the pressure on the server.

In the above embodiment, when the terminal's local recognition result is to wake up the target voice assistant, a second wake-up recognition is performed in conjunction with the server, which improves the accuracy of wake-up recognition and avoids falsely triggered voice interaction.
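The control flow of steps S301 to S307 can be sketched as a local gate in front of a server check. The two recognizers are passed in as plain callables; their names and the fake implementations in the demo are hypothetical stand-ins, not the disclosure's components.

```python
from typing import Callable


def two_stage_wake_recognition(
    target_voice: bytes,
    local_recognizer: Callable[[bytes], bool],
    server_recognizer: Callable[[bytes], bool],
) -> bool:
    """Return the first wake-up recognition result.

    The local result (the "third" result in the text) gates the server call:
    the voice is uploaded only when the terminal already believes it contains
    the wake-up voice.
    """
    third_result = local_recognizer(target_voice)   # S301 and S303
    if not third_result:
        return False                                # nothing is sent to the server
    return server_recognizer(target_voice)          # S305 and S307


server_calls = []


def fake_local(voice: bytes) -> bool:
    """Demo stand-in for the local voice wake-up component."""
    return voice.startswith(b"little")


def fake_server(voice: bytes) -> bool:
    """Demo stand-in for the server-side model; records what it receives."""
    server_calls.append(voice)
    return voice == b"little k"


results = [
    two_stage_wake_recognition(v, fake_local, fake_server)
    for v in (b"hello", b"little k", b"little j")
]
```

In the demo, only the two utterances that pass the local gate reach the server, which is how the arrangement reduces server load.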

In step S205, if the first wake-up recognition result is to wake up the target voice assistant, preset prompt information is displayed on the play page corresponding to the target video.

The preset prompt information is used to prompt that the target voice assistant has been successfully awakened and that interactive operations associated with the target video can be controlled by voice. The target voice assistant is a voice assistant that controls, based on voice, the interactive operations associated with the target video. After the target voice assistant is successfully awakened, the user can control the interactive operations associated with the target video by voice.

In some embodiments, the information format of the preset prompt information includes but is not limited to text, voice, image, etc., and can be set according to actual application requirements.

In some embodiments, the interactive operations associated with the target video differ between application scenarios. For example, if the target video is a live video, the interactive operations associated with the target video include but are not limited to commenting, following the corresponding anchor, and gifting virtual resources. For another example, if the target video is a pre-recorded video such as a film or television drama, the interactive operations include but are not limited to posting bullet comments, selecting episodes, and adjusting the resolution. For another example, if the target video is a pre-recorded video such as a short video, the interactive operations include but are not limited to liking and following.

In some embodiments, when the first wake-up recognition result is not to wake up the target voice assistant, voice collection continues, and when a new voice is collected during playback of the target video, the voice interaction flow of steps S201 to S205 above is performed on the new voice.

In some embodiments, as shown in FIG. 4 , FIG. 4 is a schematic diagram of a playback page showing preset prompt information according to an exemplary embodiment, and the information corresponding to 400 in FIG. 4 is preset prompt information.

In some embodiments, as shown in FIG. 5 , when the first wake-up recognition result is to wake up the target voice assistant, displaying preset prompt information on the play page corresponding to the target video includes:

In step S2051, if the first wake-up recognition result is to wake up the target voice assistant, a prompt information acquisition request is sent to the server, the prompt information acquisition request including the first target collected voice.

In step S2053, the preset prompt information sent by the server is received, the preset prompt information being generated based on the first target collected voice.

In step S2055, preset prompt information is displayed on the play page.

In some embodiments, before sending a voice to the server, the terminal converts the voice into a format the server can recognize and then sends the format-converted voice. For example, the first target collected voice is in PCM (Pulse Code Modulation) format before conversion, and the format recognizable by the server is Opus (a lossy audio coding format). Sending the first target collected voice to the server then means sending the converted voice, that is, the first target collected voice sent to the server is in Opus format.

In some embodiments, the terminal is provided with a local format conversion component, and the format conversion component is used for voice format conversion. In some embodiments, the function of voice format conversion is integrated in the above-mentioned local voice wake-up component.
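The terminal-side preparation for such a conversion can be sketched as follows. The Python standard library has no Opus encoder, so the encoder is passed in as a hypothetical callable (encode_frame); only the PCM packing and framing are shown. The 16 kHz sample rate, 16-bit mono samples, and 20 ms frames are assumptions, chosen because Opus encoders consume fixed-duration frames and 20 ms is a common choice.

```python
import struct
from typing import Callable, List


def float_to_pcm16(samples: List[float]) -> bytes:
    """Pack float samples in [-1.0, 1.0] as 16-bit little-endian PCM."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    return struct.pack("<%dh" % len(clipped), *(int(s * 32767) for s in clipped))


def split_frames(pcm: bytes, sample_rate: int, frame_ms: int = 20) -> List[bytes]:
    """Split mono 16-bit PCM into fixed-duration frames, zero-padding the last."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2   # 2 bytes per sample
    frames = []
    for start in range(0, len(pcm), frame_bytes):
        frame = pcm[start:start + frame_bytes]
        frames.append(frame.ljust(frame_bytes, b"\x00"))
    return frames


def convert_for_server(
    pcm: bytes,
    sample_rate: int,
    encode_frame: Callable[[bytes], bytes],
) -> List[bytes]:
    """Encode each PCM frame with the supplied (hypothetical) encoder."""
    return [encode_frame(frame) for frame in split_frames(pcm, sample_rate)]


# Demo with an identity "encoder" standing in for a real Opus binding.
pcm = float_to_pcm16([0.0, 1.0, -1.0] * 200)          # 600 samples, 1200 bytes
packets = convert_for_server(pcm, 16000, lambda frame: frame)
```

At 16 kHz a 20 ms frame holds 320 samples (640 bytes), so the 1200-byte demo buffer yields two frames, the second zero-padded.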

In some embodiments, the first target collected voice includes a manipulation voice, and after the first target collected voice is sent to the server, the above method further includes:

Receiving third manipulation information sent by the server, the third manipulation information corresponding to the manipulation voice, and the manipulation voice indicating execution of a third target interactive operation associated with the target video;

Performing the third target interactive operation based on the third manipulation information.

In some embodiments, in addition to the preset wake-up voice, the first target collected voice also includes voice information indicating execution of an interactive operation associated with the target video. Because the prompt information acquisition request carries the first target collected voice, the server can perform semantic analysis on it and determine the first manipulation information while determining the preset prompt information, so that the terminal can subsequently perform the first target interaction operation based on that manipulation information.

In some embodiments, taking the live broadcast scene as an example, assume that the text corresponding to the preset wake-up voice is "Little K" and the text corresponding to the first target collected voice is "Little K, I want to follow the anchor"; the third manipulation information is then an instruction to follow the anchor. In some embodiments, after receiving the third manipulation information, the terminal automatically triggers the interactive operation of following the anchor (the third target interactive operation).

In the above embodiment, when the first wake-up recognition result is to wake up the target voice assistant, carrying the first target collected voice in the prompt information acquisition request allows the terminal to obtain from the server, together with the preset prompt information, the third manipulation information corresponding to the manipulation voice in the first target collected voice. The interactive operation is then executed automatically, which improves the convenience and efficiency of interaction.
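How a terminal might handle a server reply that carries both the prompt and an optional manipulation instruction can be sketched as follows. The PromptResponse shape, the instruction string "follow_anchor", and the operation table are all hypothetical; the disclosure does not specify a message format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class PromptResponse:
    """Hypothetical shape of the server's reply to a prompt information request."""
    prompt_text: str
    manipulation: Optional[str] = None   # None when no manipulation voice was found


def handle_prompt_response(
    response: PromptResponse,
    show_prompt: Callable[[str], None],
    operations: Dict[str, Callable[[], None]],
) -> bool:
    """Display the prompt, then run the requested operation if there is one.

    Returns True if an interactive operation was executed.
    """
    show_prompt(response.prompt_text)                    # S2055
    if response.manipulation is None:
        return False
    operation = operations.get(response.manipulation)
    if operation is None:
        return False                                     # unknown instruction: ignore
    operation()
    return True


shown = []
performed = []
operations = {"follow_anchor": lambda: performed.append("follow_anchor")}

executed = handle_prompt_response(
    PromptResponse("Little K is listening", "follow_anchor"),
    shown.append,
    operations,
)
```

Keeping the operation table on the terminal means the server only needs to name an instruction; anything the terminal does not recognize is skipped rather than guessed.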

It can be seen from the technical solutions provided by the above embodiments of the present disclosure that performing voice assistant wake-up recognition on the first target collected voice during playback of the target video avoids falsely triggered voice interaction and improves the accuracy of voice interaction. In addition, when the target voice assistant is to be woken up, the playback page corresponding to the target video displays preset prompt information prompting that the target voice assistant has been successfully awakened and that interactive operations associated with the target video can be controlled by voice. This enables voice-based interaction with the target video, improves the convenience and efficiency of interaction, and can also improve the interaction between users and anchors in live broadcast and other scenarios.

In some embodiments, after the preset prompt information is displayed on the playback page corresponding to the target video, corresponding interactive operations can also be performed based on the collected voice. Correspondingly, as shown in FIG. 6 , the above method further includes:

In step S601, a second collected voice and a second played voice are acquired, and the second played voice is the voice played in the target video when the second collected voice is collected.

In step S603, echo cancellation is performed on the second collected voice based on the second played voice to obtain a second target collected voice.

In step S605, a first manipulation information acquisition request is sent to the server, where the first manipulation information acquisition request includes the second target collected voice.

In step S607, the first manipulation information sent by the server is received, and the first manipulation information corresponds to the second target collected voice.

In step S609, based on the first manipulation information, a first target interaction operation is performed.

Wherein, the first target interactive operation is an operation corresponding to the second collected voice, and is also an operation associated with the target video.
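Steps S601 to S609 can be sketched as one round of a pipeline with injected dependencies. Every callable here (capture, cancel_echo, request_manipulation) and the instruction name "follow_anchor" are hypothetical placeholders for the terminal's audio components and the server call.

```python
from typing import Callable, Dict, List, Optional, Tuple

Samples = List[float]


def run_manipulation_round(
    capture: Callable[[], Tuple[Samples, Samples]],
    cancel_echo: Callable[[Samples, Samples], Samples],
    request_manipulation: Callable[[Samples], str],
    operations: Dict[str, Callable[[], None]],
) -> Optional[str]:
    """Run one voice-controlled interaction and return the instruction executed."""
    collected, played = capture()                       # S601
    target_voice = cancel_echo(collected, played)       # S603
    manipulation = request_manipulation(target_voice)   # S605 and S607
    operation = operations.get(manipulation)
    if operation is None:
        return None                                     # nothing the terminal can run
    operation()                                         # S609
    return manipulation


received = []
performed = []


def fake_server(voice: Samples) -> str:
    """Demo stand-in for the server's semantic analysis."""
    received.append(voice)
    return "follow_anchor"


result = run_manipulation_round(
    capture=lambda: ([0.75, 0.5], [0.25, 0.5]),
    cancel_echo=lambda mic, ref: [m - r for m, r in zip(mic, ref)],
    request_manipulation=fake_server,
    operations={"follow_anchor": lambda: performed.append("follow_anchor")},
)
```

Injecting the stages as callables keeps the ordering of the steps visible and lets each stage be replaced or tested separately.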

In some embodiments, steps S601 and S603 are performed in the same way as the acquisition of the first target collected voice in step S201 above, and details are not repeated here.

In some embodiments, the second target collected voice is a voice acquired after the target voice assistant is awakened, and it is a manipulation voice. After the second target collected voice is acquired, a first manipulation information acquisition request carrying the second target collected voice is sent to the server. After receiving the first manipulation information acquisition request, the server determines the second manipulation information by performing semantic analysis on the second target collected voice and returns it to the terminal, so that the terminal can execute the first target interactive operation based on the second manipulation information.

In some embodiments, taking the live broadcast scene as an example, the text corresponding to the preset wake-up voice is "Little K", the text corresponding to the second target collected voice is "I want to follow the anchor", and the second manipulation information is an instruction to follow the anchor. In some embodiments, after receiving the second manipulation information, the terminal automatically triggers the interactive operation of following the anchor (the second target interactive operation).

In some embodiments, during playback of the target video, if the voice played in the target video has little influence on the collected voice, the second collected voice is used directly as the second target collected voice.

For example, while the second collected voice is being collected, the volume of the voice played in the target video may be low. If that volume is less than a volume threshold, the second collected voice is clear enough, that is, the voice information uttered by the user in the second collected voice is clear enough, so there is no need to perform echo cancellation on it and the second collected voice can be used as the second target collected voice. The volume threshold may be any value.

In the above embodiment, after the target voice assistant is successfully awakened, acoustic echo cancellation is performed on the second collected voice using the second played voice. This ensures the validity of the manipulation voice (the second target collected voice) and the accuracy of the second manipulation information obtained, so that the accuracy of voice interaction is improved along with its convenience and efficiency.

In some embodiments, after the preset prompt information is displayed on the play page corresponding to the target video, the above method further includes:

In the case that the first target collected voice includes the target interaction instruction voice, update the service mode of the target voice assistant from the first state to the second state.

Wherein, the target interaction indicates that the voice indicates multiple rounds of interaction, and the service mode in the first state (which may be referred to as a single-round interaction mode) indicates that during the wake-up of the target voice assistant, perform an interactive operation based on voice control associated with the target video; After the target voice assistant is woken up, after performing an interactive operation based on voice control associated with the target video, turn off the target voice assistant.

The service mode in the second state (which may be referred to as the multi-round interaction mode) indicates that, while the target voice assistant is awake, at least one voice-controlled interactive operation associated with the target video is performed. That is, after the target voice assistant is woken up, one or more voice-controlled interactive operations associated with the target video can be performed.

The target interaction instruction voice indicates multiple rounds of interaction, that is, it indicates that the multi-round interaction mode is to be enabled. In some embodiments, the target interaction instruction voice is a preset specific voice for enabling the multi-round interaction mode. For example, the specific voice is "open multiple rounds of interaction mode"; when the specific voice is recognized in the first target collected voice, it is determined that the first target collected voice includes the target interaction instruction voice.
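As a minimal sketch of the preset-phrase case, the check can be run on the text transcribed from the first target collected voice. The phrase list, the normalization rule, and the function names below are illustrative assumptions; the disclosure does not fix how the specific voice is matched.

```python
from typing import Iterable

# Hypothetical preset phrase list; a real deployment would configure its own.
MULTI_ROUND_PHRASES = ("open multiple rounds of interaction mode",)


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def includes_target_interaction_instruction(
    transcript: str, phrases: Iterable[str] = MULTI_ROUND_PHRASES
) -> bool:
    """True if the transcript contains any preset multi-round phrase."""
    normalized = normalize(transcript)
    return any(normalize(phrase) in normalized for phrase in phrases)
```

A substring match of this kind only covers the fixed-phrase embodiment; recognizing utterances whose semantics imply several interactions requires a trained model.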

In some embodiments, the target interaction instruction voice is voice information whose semantics require multiple interactions. For example, the target interaction instruction voice is "I want to send a gift." In some embodiments, interaction recognition is performed on the first target collected voice based on a preset interaction recognition model to determine whether the first target collected voice includes the target interaction instruction voice.

In some embodiments, the preset interaction recognition model is obtained by training a preset deep learning model based on sample voices and the interaction annotation information corresponding to the sample voices. In some embodiments, the sample voices include positive sample voices and negative sample voices. The interaction annotation information corresponding to a positive sample voice is the target interaction instruction voice. The interaction annotation information corresponding to a negative sample voice is an interaction instruction voice other than the target interaction instruction voice, which indicates that multiple rounds of interaction are not to be performed.

In some embodiments, when the server receives the first target collected voice for the first time, it converts the first target collected voice into text information and inputs the text information into the preset interaction recognition model for interaction recognition, so as to determine whether the first target collected voice includes the target interaction instruction voice.

In some embodiments, when the second target collected voice includes the target interaction instruction voice, the service mode of the target voice assistant is updated from the first state to the second state.

In the above embodiment, when the first target collected voice includes the target interaction instruction voice, the service mode of the target voice assistant is updated from the first state to the second state, so that at least one voice-controlled interactive operation associated with the target video can be performed while the target voice assistant is awake. This improves the convenience and efficiency of voice interaction and also increases the diversity of voice interactive operations.
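The relationship between the two service modes can be sketched as a small state holder. The class and method names are hypothetical and only illustrate the transitions described above: waking enters the first state, a target interaction instruction voice switches to the second state, and in the first state the assistant is turned off after one operation.

```python
from enum import Enum


class ServiceMode(Enum):
    FIRST_STATE = "single-round"
    SECOND_STATE = "multi-round"


class VoiceAssistantSession:
    """Tracks whether the assistant is awake and which service mode it is in."""

    def __init__(self) -> None:
        self.awake = False
        self.mode = ServiceMode.FIRST_STATE

    def wake(self) -> None:
        # A successful wake-up starts in the single-round mode.
        self.awake = True
        self.mode = ServiceMode.FIRST_STATE

    def on_first_target_voice(self, includes_instruction: bool) -> None:
        # Switch to multi-round mode only when the instruction voice is present.
        if self.awake and includes_instruction:
            self.mode = ServiceMode.SECOND_STATE

    def on_operation_done(self) -> None:
        # Single-round mode turns the assistant off after one operation.
        if self.mode is ServiceMode.FIRST_STATE:
            self.awake = False
```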

In some embodiments, after the service mode in the second state is turned on, corresponding interactive operations are performed based on the collected voice. Correspondingly, as shown in FIG. 7 , the above method further includes:

In step S701, a third collected voice and a third played voice are obtained, and the third played voice is the voice played in the target video when the third collected voice is collected.

In step S703, based on the third playing voice, echo cancellation is performed on the third collected voice to obtain the third target collected voice;

In step S705, wake-up recognition is performed on the third target collected voice to obtain a second wake-up recognition result;

In step S707, if the second wake-up recognition result is not to wake up the target voice assistant, a second manipulation information acquisition request is sent to the server, where the second manipulation information acquisition request includes the third target collected voice;

In step S709, the second manipulation information sent by the server is received, where the second manipulation information corresponds to the third target collected voice;

In step S711, based on the second manipulation information, a second target interaction operation is performed.

Wherein, the second target interactive operation is an operation corresponding to the third collected voice, and also an operation associated with the target video.

When the second wake-up recognition result is not to wake up the target voice assistant, the third target collected voice is only a voice for controlling an operation associated with the target video, that is, a control voice obtained while the target voice assistant is in the multi-round interaction mode.

Wherein, steps S701 to S711 are the same as steps S601 to S609 and step S203 above, and are not repeated here.
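Steps S701 to S711 can be sketched on the terminal side as one function. Every callable passed in (`cancel_echo`, `is_wake_up`, `request_manipulation`, `execute`) is a hypothetical stand-in for a component the disclosure describes only functionally, and the generic `Voice` and `Manipulation` types are assumptions.

```python
from typing import Callable, TypeVar

Voice = TypeVar("Voice")
Manipulation = TypeVar("Manipulation")


def handle_multi_round_voice(
    collected: Voice,
    played: Voice,
    cancel_echo: Callable[[Voice, Voice], Voice],
    is_wake_up: Callable[[Voice], bool],
    request_manipulation: Callable[[Voice], Manipulation],
    execute: Callable[[Manipulation], None],
) -> bool:
    """Process one utterance in multi-round mode.

    Returns True if an interactive operation was executed, False if the
    utterance was recognized as a wake-up voice instead.
    """
    target = cancel_echo(collected, played)      # S703
    if is_wake_up(target):                       # S705
        return False                             # not a control voice
    manipulation = request_manipulation(target)  # S707 and S709
    execute(manipulation)                        # S711
    return True
```

Returning a flag rather than raising lets the caller decide what to do when the wake-up voice is heard again, for example leaving the multi-round mode.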

In some embodiments, in the process of playing the target video, if the voice played by the target video has little influence on the collected voice, the third collected voice is used as the third target collected voice.

For example, if the volume of the voice in the target video being played while the third collected voice is collected is less than a volume threshold, the third collected voice is clear enough, that is, the voice information uttered by the user in the third collected voice is clear enough. In that case there is no need to perform echo cancellation on the third collected voice, and the third collected voice can be used directly as the third target collected voice. The volume threshold may be set to any value.

In the above embodiment, after the multi-round interaction mode is enabled, acoustic echo cancellation is performed on the newly acquired third collected voice in combination with the third played voice. This ensures the validity of the control voice (the third target collected voice), so that the accuracy of voice interaction is improved in addition to the convenience and efficiency of interaction.

In some embodiments, the above method also includes:

In the case that the second wake-up recognition result is to wake up the target voice assistant, update the service mode of the target voice assistant from the second state to the first state.

In some embodiments, in order to support the service mode of the second state, the terminal creates two recognition engine instances at the same time: one recognition engine is used for wake-up recognition, and the other is used for semantic recognition in multiple rounds of interaction. While the target voice assistant is in the service mode of the second state, if the recognition engine used for wake-up recognition recognizes that the preset wake-up voice has been collected again, that is, if the second wake-up recognition result is to wake up the target voice assistant, the multi-round interaction mode of the target voice assistant is interrupted and the target voice assistant re-enters the service mode of the first state.

In the above embodiment, in the multi-round interaction mode, in response to re-awakening the target voice assistant, the multi-round interaction mode is interrupted, and the single-round interaction mode is re-entered to realize flexible switching between the two interaction modes.
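A minimal sketch of the two-engine arrangement is given below. Both engine classes are hypothetical text-based stand-ins (a substring match for the wake-up engine and a trivial parser for the semantic engine); actual engines would operate on audio.

```python
from typing import Optional, Tuple

FIRST_STATE = "single-round"
SECOND_STATE = "multi-round"


class WakeEngine:
    """Stand-in wake-up recognizer: matches a preset wake word in text."""

    def __init__(self, wake_word: str) -> None:
        self.wake_word = wake_word.lower()

    def is_wake_up(self, text: str) -> bool:
        return self.wake_word in text.lower()


class SemanticEngine:
    """Stand-in semantic recognizer: returns the trimmed command, if any."""

    def parse(self, text: str) -> Optional[str]:
        return text.strip() or None


def route_in_multi_round(
    text: str, wake: WakeEngine, semantic: SemanticEngine
) -> Tuple[str, Optional[str]]:
    """Return (next service mode, recognized command) for one utterance."""
    if wake.is_wake_up(text):
        # Re-awakening interrupts multi-round mode.
        return FIRST_STATE, None
    return SECOND_STATE, semantic.parse(text)
```

Checking the wake-up engine first means a repeated wake word always takes priority over command parsing, which matches the interruption behavior described above.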

In some embodiments, the above method also includes:

Send a voice response request to the server, where the voice response request includes the first target collected voice;

Receive the response voice sent by the server, where the response voice corresponds to the first target collected voice;

Play the response voice.

In order to improve the user experience, after the target voice assistant is awakened, it obtains the corresponding response voice from the server. In some embodiments, the response voice prompts the user that the target voice assistant has been awakened in the form of voice, and the content of the response voice is preset in combination with the actual application.

For example, the text corresponding to the preset wake-up voice is "Little K", the first target collected voice is "Little K", and the text corresponding to the response voice is "I'm here".

For another example, the text corresponding to the preset wake-up voice is "Little K", the first target collected voice is "Little K, I want a gift", and the text corresponding to the response voice is "Yes, please say".

In the above embodiment, by playing the response voice corresponding to the first target collection voice, the interactivity with the user can be improved, thereby improving the user experience.

In the case that the newly collected voice is not obtained within the preset time period, the preset prompt information displayed on the playback page is updated to the closing prompt information of the target voice assistant.

In some embodiments, the newly collected voice is the voice collected after the target voice assistant is woken up, or is the voice after the acoustic echo cancellation process is performed on the collected voice when the target voice assistant is woken up.

In order to avoid long-term invalid standby of the voice assistant, the interaction waiting time is set in advance. Once the interaction waiting time is exceeded, the target voice assistant will be turned off, and the target voice assistant needs to be awakened again. In some embodiments, the waiting time for interaction is an arbitrary time set in advance, and the waiting time for interaction is the upper limit time for waiting for the newly collected voice from the time when the target voice assistant is woken up.

In some embodiments, the preset time period is determined by combining the preset interaction waiting time and the time at which the target voice assistant is woken up. In some embodiments, if no newly collected voice is acquired within the interaction waiting time after the target voice assistant is woken up, it is determined that the target voice assistant is closed due to timeout, and the preset prompt information displayed on the playback page is updated to the closing prompt information of the target voice assistant. Wherein, the interaction waiting time that follows the wake-up of the target voice assistant is the preset time period.

In the above embodiment, after the target voice assistant is woken up, if no newly collected voice is obtained within the preset time period, the preset prompt information displayed on the playback page is updated to the closing prompt information of the target voice assistant. This avoids long-term invalid standby and reduces device resource consumption. In addition, displaying the closing prompt information informs the user that the target voice assistant has been closed, which improves the user experience.
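The timeout rule can be sketched as follows, with the preset time period taken as the interval from the wake-up time to the wake-up time plus the interaction waiting time. The class name, the prompt strings, and the use of caller-supplied timestamps in seconds are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional

WAKE_PROMPT = "assistant awake"    # stand-in for the preset prompt information
CLOSE_PROMPT = "assistant closed"  # stand-in for the closing prompt information


@dataclass
class AssistantPrompt:
    woke_at: float           # time at which the assistant was woken up
    interaction_wait: float  # preset interaction waiting time
    text: str = WAKE_PROMPT
    closed: bool = False

    def deadline(self) -> float:
        """End of the preset time period."""
        return self.woke_at + self.interaction_wait

    def tick(self, now: float, newest_voice_at: Optional[float] = None) -> str:
        """Update and return the prompt shown on the playback page."""
        heard_in_time = (
            newest_voice_at is not None
            and self.woke_at <= newest_voice_at <= self.deadline()
        )
        if not self.closed and not heard_in_time and now > self.deadline():
            self.closed = True
            self.text = CLOSE_PROMPT
        return self.text
```

Passing `now` in explicitly, rather than reading a clock inside the method, keeps the rule easy to test and leaves the choice of clock to the caller.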

Fig. 8 is a flow chart of another voice interaction method shown according to an exemplary embodiment. As shown in Fig. 8, the voice interaction method is executed by an electronic device such as a terminal, and includes the following steps:

In step S801, during playback of the target video and when the target voice assistant has been successfully awakened, a second collected voice and a second played voice are obtained, where the second played voice is the voice played in the target video when the second collected voice is collected;

In step S803, based on the second playing voice, echo cancellation is performed on the second collected voice to obtain the second target collected voice;

In step S805, sending a first manipulation information acquisition request to the server, where the first manipulation information acquisition request includes the second target collection voice;

In step S807, the first manipulation information sent by the server is received, where the first manipulation information corresponds to the second target collected voice;

In step S809, based on the first manipulation information, a first target interaction operation is performed.

In the embodiment of the present disclosure, steps S801 to S809 are the same as the above steps S601 to S609 , and will not be repeated here.

In the method provided by the embodiments of the present disclosure, during playback of the target video and after the target voice assistant is successfully awakened, acoustic echo cancellation is performed on the second collected voice in combination with the second played voice. This ensures the validity of the control voice (the second target collected voice) and the accuracy of the first manipulation information obtained from the server. The accuracy of voice interaction is thereby improved, and because interaction with the target video is carried out by voice, the convenience and efficiency of interaction are improved as well.
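The disclosure does not name a particular echo cancellation algorithm for step S803. As one possible illustration, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter, a common textbook technique that is swapped in here as an assumption: the played voice serves as the reference signal, the filter estimates the echo it causes in the collected voice, and the residual is taken as the target collected voice. The parameter values are arbitrary.

```python
from typing import List, Sequence


def nlms_echo_cancel(
    collected: Sequence[float],
    played: Sequence[float],
    taps: int = 8,
    step: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Subtract an adaptively estimated echo of `played` from `collected`."""
    weights = [0.0] * taps   # adaptive estimate of the echo path
    history = [0.0] * taps   # most recent played samples, newest first
    residual: List[float] = []
    for mic, ref in zip(collected, played):
        history = [ref] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic - echo_estimate  # what remains after removing the echo
        norm = sum(x * x for x in history) + eps
        weights = [
            w + step * error * x / norm for w, x in zip(weights, history)
        ]
        residual.append(error)
    return residual
```

In this sketch the filter keeps adapting on every sample; practical echo cancellers typically also detect when the user is speaking and slow or pause adaptation during those intervals.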

Fig. 9 is a block diagram of a voice interaction device according to an exemplary embodiment. Referring to Figure 9, the device includes:

The first target collection voice acquisition module 910 is configured to acquire the first target collection voice during the playback of the target video;

The first wake-up recognition module 920 is configured to perform voice assistant wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

The preset prompt information display module 930 is configured to display preset prompt information on the play page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, where the preset prompt information is used to prompt that the target voice assistant is successfully awakened and that the interactive operation associated with the target video is controlled based on voice.

In some embodiments, the first target collection voice acquisition module 910 includes:

The first voice acquisition unit is configured to acquire the first collected voice and the first played voice during the playback of the target video, where the first played voice is the voice played in the target video when the first collected voice is collected;

The first acoustic echo cancellation processing unit is configured to perform echo cancellation on the first collected speech based on the first played speech to obtain the first target collected speech.

In some embodiments, the above-mentioned device also includes:

The second voice acquisition module is configured to acquire the second collected voice and the second played voice;

The second acoustic echo cancellation processing module is configured to perform acoustic echo cancellation processing on the second collected voice based on the second played voice to obtain the second target collected voice, where the second played voice is the voice played in the target video when the second collected voice is collected;

The first manipulation information acquisition request sending module is configured to send a first manipulation information acquisition request to the server, where the first manipulation information acquisition request includes the second target collection voice;

In some embodiments, the above-mentioned device also includes:

The first service mode update module is configured to update the service mode of the target voice assistant from the first state to the second state when the first target collected voice includes the target interaction instruction voice, where the target interaction instruction voice indicates multiple rounds of interaction, the service mode in the first state indicates that one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is awake, and the service mode in the second state indicates that at least one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is awake.

In some embodiments, the above-mentioned device also includes:

The third voice acquisition module is configured to acquire the third collection voice and the third playback voice, the third playback voice is the voice played in the target video when collecting the third collection voice;

The third acoustic echo cancellation processing module is configured to perform echo cancellation on the third collected speech based on the third playback speech, to obtain a third target collected speech;

The second wake-up identification module is configured to perform wake-up identification on the voice collected by the third target to obtain a second wake-up identification result;

The second manipulation information acquisition request sending module is configured to send a second manipulation information acquisition request to the server when the second wake-up recognition result is not to wake up the target voice assistant, where the second manipulation information acquisition request includes the third target collected voice;

The third manipulation information receiving module is configured to receive the second manipulation information sent by the server, and the second manipulation information corresponds to the voice collected by the third target;

The third target interactive operation execution module is configured to execute the second target interactive operation based on the second manipulation information.

In some embodiments, the above-mentioned device also includes:

The second service mode update module is configured to update the service mode of the target voice assistant from the second state to the first state when the second wake-up recognition result is to wake up the target voice assistant.

In some embodiments, the preset prompt information display module 930 includes:

The first prompt information acquisition request sending unit is configured to send a prompt information acquisition request to the server when the first wake-up recognition result is to wake up the target voice assistant, and the prompt information acquisition request includes the first target collected voice;

The preset prompt information receiving unit is configured to receive the preset prompt information sent by the server, and the preset prompt information is generated based on the voice collected by the first target;

The preset prompt information display unit is configured to display preset prompt information on the playback page.

In some embodiments, the first target collection voice includes manipulation voice, and the above-mentioned device also includes:

The first manipulation information receiving module is configured to receive the third manipulation information sent by the server, the third manipulation information corresponds to the manipulation voice, and the manipulation voice instructs to execute the third target interactive operation associated with the target video;

The first manipulation information execution module is configured to execute a third target interaction operation based on the third manipulation information.

In some embodiments, the first wake-up identification module 920 includes:

The first preset wake-up voice acquisition unit is configured to acquire a preset wake-up voice;

The first wake-up identification unit is configured to perform wake-up identification on the first target collected voice based on the preset wake-up voice, and obtain a first wake-up identification result.

In some embodiments, the first wake-up identification module 920 includes:

The second preset wake-up voice acquisition unit is configured to acquire a preset wake-up voice;

The second wake-up recognition unit is configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, and obtain a third wake-up recognition result;

The first target collection voice sending unit is configured to send the first target collection voice to the server when the third wake-up recognition result is to wake up the target voice assistant;

The first wake-up recognition result receiving unit is configured to receive the first wake-up recognition result sent by the server. The first wake-up recognition result is obtained by performing wake-up recognition processing on text corresponding to the first target collected voice based on a preset wake-up recognition model.
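The two-stage wake-up recognition performed by these units can be sketched as below: a local check against the preset wake-up voice yields the third wake-up recognition result, and only when that result is to wake up the assistant is the voice sent to the server, whose answer becomes the first wake-up recognition result. Both callables are hypothetical stand-ins for the local recognizer and the server-side model.

```python
from typing import Callable, TypeVar

Voice = TypeVar("Voice")


def two_stage_wake_recognition(
    target_voice: Voice,
    local_match: Callable[[Voice], bool],
    server_verify: Callable[[Voice], bool],
) -> bool:
    """Return the first wake-up recognition result.

    `local_match` produces the third wake-up recognition result on the
    terminal; `server_verify` is consulted only when the local result is
    to wake up the assistant.
    """
    third_result = local_match(target_voice)
    if not third_result:
        # No server round trip when the local check already fails.
        return False
    return server_verify(target_voice)
```

Gating the server request on the local result avoids uploading every collected voice, while the server-side check on the transcribed text can reject false local matches.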

In some embodiments, the above-mentioned device also includes:

The voice response request sending module is configured to send a voice response request to the server, and the voice response request includes the first target collection voice;

The response voice receiving module is configured to receive the response voice sent by the server, and the response voice corresponds to the first target collection voice;

The response voice playing module is configured to play the response voice.

In some embodiments, the above-mentioned device also includes:

The closing prompt module is configured to update the preset prompt information displayed on the playback page to the close prompt information of the target voice assistant when no newly collected voice is obtained within a preset time period.

Fig. 10 is a block diagram of another voice interaction device according to an exemplary embodiment. Referring to Figure 10, the device includes:

The second voice acquisition module 1010 is configured to acquire the second collected voice and the second played voice when the target voice assistant is successfully awakened during the playing of the target video, and the second played voice is when the second collected voice is collected The voice played in the target video;

The second acoustic echo cancellation processing module 1020 is configured to perform echo cancellation on the second collected speech based on the second playback speech, to obtain the second target collected speech;

The first manipulation information acquisition request sending module 1030 is configured to send a first manipulation information acquisition request to the server, where the first manipulation information acquisition request includes the second target collection voice;

The second manipulation information receiving module 1040 is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

The second target interactive operation execution module 1050 is configured to execute the first target interactive operation based on the first manipulation information.

Fig. 11 is a block diagram of an electronic device for voice interaction according to an exemplary embodiment. The electronic device may be a terminal, and its internal structure may be as shown in Fig. 11. The electronic device includes a processor, a memory, a network interface, a display screen and an input device connected through a system bus. The processor of the electronic device is used to provide calculation and control capabilities. The memory of the electronic device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The network interface of the electronic device is used to communicate with an external terminal through a network connection. When the computer program is executed by the processor, a voice interaction method is realized. The display screen of the electronic device may be a liquid crystal display screen or an electronic ink display screen. The input device of the electronic device may be a touch layer covering the display screen, a button, trackball or touch pad provided on the housing of the electronic device, or an external keyboard, touchpad or mouse.

Those skilled in the art can understand that the structure shown in FIG. 11 is only a block diagram of a partial structure related to the disclosed solution, and does not constitute a limitation on the electronic device to which the disclosed solution is applied. A specific electronic device may include more or fewer components than shown in the figure, may combine some components, or may have a different arrangement of components.

In an exemplary embodiment, there is also provided an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions, so as to implement the voice interaction method in the embodiments of the present disclosure.

In an exemplary embodiment, a computer-readable storage medium is also provided, and when instructions in the storage medium are executed by a processor of the electronic device, the electronic device can execute the voice interaction method in the embodiments of the present disclosure.

In an exemplary embodiment, a computer program product is also provided, including a computer program, and when the computer program is executed by a processor, the voice interaction method in the embodiment of the present disclosure is implemented.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be realized by instructing related hardware through a computer program. The computer program can be stored in a non-volatile computer-readable storage medium, and when the computer program is executed, it may include the procedures of the embodiments of the above methods. Wherein, any references to memory, storage, database or other media used in the various embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

All the embodiments of the present disclosure can be implemented independently or in combination with other embodiments, which are all regarded as the scope of protection required by the present disclosure.

Claims

A voice interaction method, comprising:

During the playing of the target video, obtain the first target collection voice;

Perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

In the case that the first wake-up recognition result is to wake up the target voice assistant, display preset prompt information on the play page corresponding to the target video, where the preset prompt information is used to prompt that the target voice assistant is successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
The voice interaction method according to claim 1, wherein said acquiring the first target voice collection comprises:

Obtain the first collected voice and the first played voice, the first played voice is the voice played in the target video when collecting the first collected voice;

Based on the first playing voice, perform echo cancellation on the first collected voice to obtain the first target collected voice.
The voice interaction method according to claim 1, wherein the method further comprises:

Obtain a second collection voice and a second playback voice, the second playback voice is the voice played in the target video when collecting the second collection voice;

Based on the second playback voice, perform echo cancellation on the second collected voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collection voice;

receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Based on the first manipulation information, a first target interaction operation is performed.
The voice interaction method according to any one of claims 1 to 3, wherein the method further comprises:

In the case where the first target collection voice includes a target interaction instruction voice, update the service mode of the target voice assistant from the first state to the second state, where the target interaction instruction voice indicates multiple rounds of interaction, the service mode of the first state indicates that one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is awake, and the service mode of the second state indicates that at least one voice-controlled interactive operation associated with the target video is performed while the target voice assistant is awake.
The voice interaction method according to claim 4, wherein the method further comprises:

Acquiring the third collected voice and the third playing voice, the third playing voice is the voice played in the target video when collecting the third collected voice;

Based on the third playback voice, perform echo cancellation on the third collected voice to obtain a third target collected voice;

Perform wake-up recognition on the third target collected voice to obtain a second wake-up recognition result;

When the second wake-up recognition result is not to wake up the target voice assistant, send a second manipulation information acquisition request to the server, where the second manipulation information acquisition request includes the third target voice collection;

receiving second manipulation information sent by the server, where the second manipulation information corresponds to the third target collected voice;

Based on the second manipulation information, a second target interaction operation is performed.
The voice interaction method according to claim 5, wherein the method further comprises:

If the second wake-up identification result is to wake up the target voice assistant, updating the service mode of the target voice assistant from the second state to the first state.
The voice interaction method according to any one of claims 1 to 3, wherein displaying preset prompt information on the play page corresponding to the target video in the case that the first wake-up recognition result is to wake up the target voice assistant comprises:

When the first wake-up recognition result is to wake up the target voice assistant, send a prompt information acquisition request to the server, where the prompt information acquisition request includes the first target voice collection;

receiving the preset prompt information sent by the server, where the preset prompt information is generated based on the collected voice of the first target;

The preset prompt information is displayed on the playing page.
The voice interaction method according to claim 7, wherein the voice collected by the first target includes manipulation voice, and the method further comprises:

receiving third manipulation information sent by the server, where the third manipulation information corresponds to the manipulation voice, and the manipulation voice instructs to execute a third target interactive operation associated with the target video;

Based on the third manipulation information, execute the third target interaction operation.
The voice interaction method according to any one of claims 1 to 3, wherein performing wake-up recognition on the first target collected voice and obtaining a first wake-up recognition result includes:

Obtain the preset wake-up voice;

Based on the preset wake-up voice, wake-up recognition is performed on the first target collected voice to obtain the first wake-up recognition result.
The voice interaction method according to any one of claims 1 to 3, wherein performing wake-up recognition on the first target collected voice and obtaining a first wake-up recognition result includes:

Obtain the preset wake-up voice;

Based on the preset wake-up voice, perform wake-up recognition on the first target collected voice to obtain a third wake-up recognition result;

In the case that the third wake-up recognition result is to wake up the target voice assistant, sending the first target collected voice to the server;

receiving the first wake-up recognition result sent by the server, where the first wake-up recognition result is obtained by performing, based on a preset wake-up recognition model, wake-up recognition on the text corresponding to the first target collected voice.
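The two-stage wake-up recognition recited above (a local check against the preset wake-up voice, then a server-side check on the recognized text) can be illustrated roughly as follows. This is a sketch under assumptions: `two_stage_wake_up`, `local_check`, and `server_check` are hypothetical names, the stub checkers are placeholders for real recognition, and treating a negative local result as "do not wake up" is an assumption that the claim does not spell out.

```python
from typing import Callable, List


def two_stage_wake_up(
    first_target_voice: bytes,
    local_check: Callable[[bytes], bool],
    server_check: Callable[[bytes], bool],
) -> bool:
    """Return the first wake-up recognition result (True means wake up)."""
    # Stage 1, on the device: compare against the preset wake-up voice.
    third_result = local_check(first_target_voice)
    if not third_result:
        # Assumption: a negative local result ends recognition without a server call.
        return False
    # Stage 2, on the server: model-based recognition on the corresponding text.
    return server_check(first_target_voice)


server_calls: List[bytes] = []


def fake_local(voice: bytes) -> bool:
    # Placeholder for on-device matching against the preset wake-up voice.
    return voice.startswith(b"WAKE")


def fake_server(voice: bytes) -> bool:
    # Placeholder for the server's preset wake-up recognition model.
    server_calls.append(voice)
    return voice == b"WAKE:hello assistant"


accepted = two_stage_wake_up(b"WAKE:hello assistant", fake_local, fake_server)
rejected_by_server = two_stage_wake_up(b"WAKE:hello assistance", fake_local, fake_server)
rejected_locally = two_stage_wake_up(b"background chatter", fake_local, fake_server)
```

One plausible reading of this design is that the cheap local check filters most audio so that only likely wake-up utterances reach the server.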
The voice interaction method according to any one of claims 1 to 3, wherein the method further comprises:

Sending a voice response request to the server, where the voice response request includes the first target collected voice;

receiving a response voice sent by the server, the response voice corresponding to the first target collected voice;

Play the response voice.
The voice interaction method according to any one of claims 1 to 3, wherein the method further comprises:

In the case that no newly collected voice is acquired within a preset time period, updating the preset prompt information displayed on the playing page to closing prompt information of the target voice assistant.
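The idle-timeout behavior of the preceding claim can be sketched as a pure function that decides which prompt the playing page should show. The function name, the prompt strings, and the use of seconds as the time unit are all assumptions made for illustration.

```python
def prompt_after_idle(
    current_prompt: str,
    closing_prompt: str,
    last_voice_at: float,
    now: float,
    timeout_s: float,
) -> str:
    """Pick the prompt to display, given when the last voice was collected."""
    idle = now - last_voice_at
    # No newly collected voice within the preset time period: show the closing prompt.
    return closing_prompt if idle >= timeout_s else current_prompt


shown_early = prompt_after_idle("Listening...", "Assistant closed", 100.0, 103.0, 5.0)
shown_late = prompt_after_idle("Listening...", "Assistant closed", 100.0, 105.0, 5.0)
```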
A voice interaction method, comprising:

During the playback of the target video and in the case that the target voice assistant is successfully awakened, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

Based on the second played voice, perform echo cancellation on the second collected voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Based on the first manipulation information, a first target interaction operation is performed.
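The echo cancellation step above uses the played voice as a reference to remove the video's own audio from the collected voice. The claims do not name a specific algorithm; the sketch below substitutes a normalized least-mean-squares (NLMS) adaptive filter, a common choice for acoustic echo cancellation, purely as an illustration. The function name, filter length, step size, and the simulated signals are assumptions.

```python
import random
from typing import List, Sequence


def nlms_echo_cancel(
    collected: Sequence[float],
    played: Sequence[float],
    taps: int = 8,
    step: float = 0.5,
    eps: float = 1e-8,
) -> List[float]:
    """Subtract an adaptively estimated echo of `played` from `collected`."""
    weights = [0.0] * taps
    residual = []
    for n in range(len(collected)):
        # Most recent `taps` reference samples, newest first (zero before the start).
        window = [played[n - k] if n >= k else 0.0 for k in range(taps)]
        echo_estimate = sum(w * x for w, x in zip(weights, window))
        error = collected[n] - echo_estimate  # echo-cancelled output sample
        power = sum(x * x for x in window) + eps
        # NLMS update: move the weights along the reference, scaled by the error.
        weights = [w + step * error * x / power for w, x in zip(weights, window)]
        residual.append(error)
    return residual


def energy(samples: Sequence[float]) -> float:
    return sum(v * v for v in samples)


rng = random.Random(0)
played = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
# Simulated microphone signal containing only the echo: delayed, attenuated playback.
collected = [0.6 * played[n - 2] if n >= 2 else 0.0 for n in range(2000)]
cleaned = nlms_echo_cancel(collected, played)
```

In this simulation the collected signal contains no user speech, so after the filter adapts the cleaned signal should be close to silence; with real audio, the residual would be the user's voice that is then sent to the server.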
A voice interaction device, comprising:

The first target collected voice acquisition module is configured to acquire the first target collected voice during the playback of the target video;

The first wake-up recognition module is configured to perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

The preset prompt information display module is configured to display preset prompt information on the play page corresponding to the target video when the first wake-up recognition result is to wake up the target voice assistant, the preset prompt information being used for prompting that the target voice assistant is successfully awakened and that an interactive operation associated with the target video is controlled based on voice.
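The three modules of the device claim above can be pictured as three pluggable callables wired into one object. This is a loose sketch: the class name `VoiceInteractionDevice`, the method `handle_playback_tick`, and the prompt text are hypothetical, and the stub lambdas merely stand in for real audio capture, wake-up recognition, and page rendering.

```python
from typing import Callable, List


class VoiceInteractionDevice:
    """Hypothetical composition of the acquisition, recognition, and display modules."""

    def __init__(
        self,
        acquire_voice: Callable[[], str],
        recognize_wake_up: Callable[[str], bool],
        display_prompt: Callable[[str], None],
    ):
        self.acquire_voice = acquire_voice          # first target collected voice acquisition module
        self.recognize_wake_up = recognize_wake_up  # first wake-up recognition module
        self.display_prompt = display_prompt        # preset prompt information display module

    def handle_playback_tick(self) -> bool:
        """Run one acquire/recognize/display pass while the target video plays."""
        voice = self.acquire_voice()
        woke = self.recognize_wake_up(voice)
        if woke:
            self.display_prompt("Voice assistant awake: control this video by voice")
        return woke


shown: List[str] = []
device = VoiceInteractionDevice(
    acquire_voice=lambda: "hello assistant",
    recognize_wake_up=lambda v: v == "hello assistant",
    display_prompt=shown.append,
)
woke = device.handle_playback_tick()
```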
The voice interaction device according to claim 14, wherein the first target collected voice acquisition module includes:

The first voice acquiring unit is configured to acquire a first collected voice and a first played voice during the playback of the target video, and the first played voice is the voice played in the target video when the first collected voice is collected;

The first acoustic echo cancellation processing unit is configured to perform echo cancellation on the first collected voice based on the first played voice to obtain the first target collected voice.
The voice interaction device according to claim 14, wherein the device further comprises:

The second voice acquiring module is configured to acquire a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

The second acoustic echo cancellation processing module is configured to perform echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;

A first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

The second manipulation information receiving module is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

The second target interactive operation execution module is configured to execute the first target interactive operation based on the first manipulation information.
The voice interaction device according to any one of claims 14 to 16, wherein the device further comprises:

The first service mode update module is configured to update the service mode of the target voice assistant from the first state to the second state when the first target collected voice includes a target interaction instruction voice, where the target interaction instruction voice indicates multiple rounds of interaction, the service mode of the first state indicates that, during the wake-up of the target voice assistant, one voice-controlled interactive operation associated with the target video is performed, and the service mode of the second state indicates that, during the wake-up of the target voice assistant, at least one voice-controlled interactive operation associated with the target video is performed.
The voice interaction device according to claim 17, wherein the device further comprises:

The third voice acquisition module is configured to acquire a third collected voice and a third played voice, where the third played voice is the voice played in the target video when the third collected voice is collected;

The third acoustic echo cancellation processing module is configured to perform echo cancellation on the third collected voice based on the third played voice to obtain a third target collected voice;

The second wake-up recognition module is configured to perform wake-up recognition on the third target collected voice to obtain a second wake-up recognition result;

The second manipulation information acquisition request sending module is configured to send a second manipulation information acquisition request to the server when the second wake-up recognition result is not to wake up the target voice assistant, the second manipulation information acquisition request including the third target collected voice;

The third manipulation information receiving module is configured to receive the second manipulation information sent by the server, the second manipulation information corresponding to the third target collected voice;

The third target interactive operation executing module is configured to execute the second target interactive operation based on the second manipulation information.
The voice interaction device according to claim 18, wherein the device further comprises:

The second service mode updating module is configured to update the service mode of the target voice assistant from the second state to the first state when the second wake-up recognition result is to wake up the target voice assistant.
The voice interaction device according to any one of claims 14 to 16, wherein the preset prompt information display module includes:

The first prompt information acquisition request sending unit is configured to send a prompt information acquisition request to the server when the first wake-up recognition result is to wake up the target voice assistant, and the prompt information acquisition request includes the first target collected voice;

The preset prompt information receiving unit is configured to receive the preset prompt information sent by the server, the preset prompt information being generated based on the first target collected voice;

The preset prompt information display unit is configured to display the preset prompt information on the playing page.
The voice interaction device according to claim 20, wherein the first target collected voice includes manipulation voice, and the device further comprises:

The first manipulation information receiving module is configured to receive third manipulation information sent by the server, where the third manipulation information corresponds to the manipulation voice, and the manipulation voice instructs execution of a third target interaction operation associated with the target video;

The first manipulation information execution module is configured to execute the third target interaction operation based on the third manipulation information.
The voice interaction device according to any one of claims 14 to 16, wherein the first wake-up recognition module comprises:

The first preset wake-up voice acquisition unit is configured to acquire a preset wake-up voice;

The first wake-up recognition unit is configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, and obtain the first wake-up recognition result.
The voice interaction device according to any one of claims 14 to 16, wherein the first wake-up recognition module comprises:

The second preset wake-up voice acquisition unit is configured to acquire a preset wake-up voice;

The second wake-up recognition unit is configured to perform wake-up recognition on the first target collected voice based on the preset wake-up voice, and obtain a third wake-up recognition result;

The first target collected voice sending unit is configured to send the first target collected voice to a server when the third wake-up recognition result is to wake up the target voice assistant;

The first wake-up recognition result receiving unit is configured to receive the first wake-up recognition result sent by the server, where the first wake-up recognition result is obtained by performing, based on a preset wake-up recognition model, wake-up recognition on the text corresponding to the first target collected voice.
The voice interaction device according to any one of claims 14 to 16, wherein the device further comprises:

The voice response request sending module is configured to send a voice response request to the server, the voice response request including the first target collected voice;

The response voice receiving module is configured to receive the response voice sent by the server, the response voice corresponding to the first target collected voice;

The response voice playing module is configured to play the response voice.
The voice interaction device according to any one of claims 14 to 16, wherein the device further comprises:

The closing prompt module is configured to update the preset prompt information displayed on the playing page to the closing prompt information of the target voice assistant when no newly collected voice is acquired within a preset time period.
A voice interaction device, comprising:

The second voice acquisition module is configured to acquire a second collected voice and a second played voice when the target voice assistant is successfully awakened during the playback of the target video, the second played voice being the voice played in the target video when the second collected voice is collected;

The second acoustic echo cancellation processing module is configured to perform echo cancellation on the second collected voice based on the second played voice, to obtain a second target collected voice;

A first manipulation information acquisition request sending module, configured to send a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

The second manipulation information receiving module is configured to receive the first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

The second target interactive operation execution module is configured to execute the first target interactive operation based on the first manipulation information.
An electronic device comprising:

a processor;

a memory for storing instructions executable by the processor;

Wherein, the processor is configured to execute the instructions to implement the following steps:

During the playing of the target video, obtain the first target collected voice;

Perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

In the case that the first wake-up recognition result is to wake up the target voice assistant, display preset prompt information on the play page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant is successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
The electronic device according to claim 27, wherein the processor is configured to execute the instructions to implement the following steps:

Obtain the first collected voice and the first played voice, the first played voice is the voice played in the target video when collecting the first collected voice;

Based on the first played voice, perform echo cancellation on the first collected voice to obtain the first target collected voice.
The electronic device according to claim 27, wherein the processor is configured to execute the instructions to implement the following steps:

Obtain a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

Based on the second played voice, perform echo cancellation on the second collected voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Based on the first manipulation information, a first target interaction operation is performed.
The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:

In the case where the first target collected voice includes a target interaction instruction voice, update the service mode of the target voice assistant from the first state to the second state, where the target interaction instruction voice indicates multiple rounds of interaction, the service mode of the first state indicates that, during the wake-up of the target voice assistant, one voice-controlled interactive operation associated with the target video is performed, and the service mode of the second state indicates that, during the wake-up of the target voice assistant, at least one voice-controlled interactive operation associated with the target video is performed.
The electronic device according to claim 30, wherein the processor is configured to execute the instructions to implement the following steps:

Acquiring the third collected voice and the third played voice, the third played voice being the voice played in the target video when the third collected voice is collected;

Based on the third played voice, perform echo cancellation on the third collected voice to obtain a third target collected voice;

Perform wake-up recognition on the third target collected voice to obtain a second wake-up recognition result;

When the second wake-up recognition result is not to wake up the target voice assistant, send a second manipulation information acquisition request to the server, where the second manipulation information acquisition request includes the third target collected voice;

receiving second manipulation information sent by the server, where the second manipulation information corresponds to the third target collected voice;

Based on the second manipulation information, a second target interaction operation is performed.
The electronic device according to claim 31, wherein the processor is configured to execute the instructions to implement the following steps:

If the second wake-up recognition result is to wake up the target voice assistant, updating the service mode of the target voice assistant from the second state to the first state.
The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:

When the first wake-up recognition result is to wake up the target voice assistant, send a prompt information acquisition request to the server, where the prompt information acquisition request includes the first target collected voice;

receiving the preset prompt information sent by the server, where the preset prompt information is generated based on the first target collected voice;

The preset prompt information is displayed on the playing page.
The electronic device according to claim 33, wherein the processor is configured to execute the instructions to implement the following steps:

receiving third manipulation information sent by the server, where the third manipulation information corresponds to the manipulation voice, and the manipulation voice instructs to execute a third target interactive operation associated with the target video;

Based on the third manipulation information, execute the third target interaction operation.
The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:

Obtain the preset wake-up voice;

Based on the preset wake-up voice, wake-up recognition is performed on the first target collected voice to obtain the first wake-up recognition result.
The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:

Obtain the preset wake-up voice;

Based on the preset wake-up voice, perform wake-up recognition on the first target collected voice to obtain a third wake-up recognition result;

In the case that the third wake-up recognition result is to wake up the target voice assistant, sending the first target collected voice to the server;

receiving the first wake-up recognition result sent by the server, where the first wake-up recognition result is obtained by performing, based on a preset wake-up recognition model, wake-up recognition on the text corresponding to the first target collected voice.
The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:

Sending a voice response request to the server, where the voice response request includes the first target collected voice;

receiving a response voice sent by the server, the response voice corresponding to the first target collected voice;

Play the response voice.
The electronic device according to any one of claims 27 to 29, wherein the processor is configured to execute the instructions to implement the following steps:

In the case that no newly collected voice is acquired within a preset time period, updating the preset prompt information displayed on the playing page to closing prompt information of the target voice assistant.
An electronic device comprising:

a processor;

a memory for storing instructions executable by the processor;

Wherein, the processor is configured to execute the instructions to implement the following steps:

During the playback of the target video and in the case that the target voice assistant is successfully awakened, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

Based on the second played voice, perform echo cancellation on the second collected voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Based on the first manipulation information, a first target interaction operation is performed.
A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:

During the playing of the target video, obtain the first target collected voice;

Perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

In the case that the first wake-up recognition result is to wake up the target voice assistant, display preset prompt information on the play page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant is successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
A computer-readable storage medium, wherein, when instructions in the storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the following steps:

During the playback of the target video and in the case that the target voice assistant is successfully awakened, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

Based on the second played voice, perform echo cancellation on the second collected voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Based on the first manipulation information, a first target interaction operation is performed.
A computer program product comprising a computer program which, when executed by a processor, implements the following steps:

During the playing of the target video, obtain the first target collected voice;

Perform wake-up recognition on the first target collected voice to obtain a first wake-up recognition result;

In the case that the first wake-up recognition result is to wake up the target voice assistant, display preset prompt information on the play page corresponding to the target video, the preset prompt information being used to prompt that the target voice assistant is successfully awakened and that the interactive operation associated with the target video is controlled based on voice.
A computer program product comprising a computer program which, when executed by a processor, implements the following steps:

During the playback of the target video and in the case that the target voice assistant is successfully awakened, acquiring a second collected voice and a second played voice, the second played voice being the voice played in the target video when the second collected voice is collected;

Based on the second played voice, perform echo cancellation on the second collected voice to obtain a second target collected voice;

Sending a first manipulation information acquisition request to a server, where the first manipulation information acquisition request includes the second target collected voice;

receiving first manipulation information sent by the server, where the first manipulation information corresponds to the second target collected voice;

Based on the first manipulation information, a first target interaction operation is performed.