CN113409778A - Voice interaction method, system and terminal

Info

Publication number: CN113409778A
Application number: CN202010183403.XA
Authority: CN (China)
Prior art keywords: voice, user, information stream, input, information
Legal status: Pending (the legal status is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventor: 徐贤仲
Current assignee: Alibaba Group Holding Ltd
Original assignee: Alibaba Group Holding Ltd
Priority date / filing date: 2020-03-16 (priority to CN202010183403.XA)

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 2015/225 Feedback of the input speech

Abstract

A voice interaction method, system and terminal are disclosed. The voice interaction method comprises the following steps: presenting a current information stream; acquiring a voice input from a user; and determining the presentation content of a subsequent information stream based on the current information stream and the voice input. The information stream may be an information stream comprising storyline branches, or an information stream comprising a controllable avatar. The invention thus provides a scheme in which the user can actively influence the direction of the content through voice interaction. The user can determine the subsequent direction of the current information stream through voice input, and in particular can determine the storyline branches of a storyline game, which enhances the user's sense of immersion and participation and improves the playability of the game.

Description

Voice interaction method, system and terminal
Technical Field
The present disclosure relates to voice processing technology, and in particular to a voice interaction method, system, and terminal.
Background
With the development of voice interaction technology, smart speakers that support various controls and content acquisition via voice commands have become popular. Content functions such as listening to songs and listening to stories are among the most popular uses of smart speakers. On speakers with a screen, content can be presented through a combination of media such as video, pictures, text, and audio. For story-based content programs, playback is typically completed in one pass once triggered by a user instruction. For example, the user may say to the smart speaker, "XXX, I want to hear a story." The smart speaker then announces a story for the user until the album finishes playing. Although the user may perform operations such as play, pause, and selection, such playback controls cannot actively influence the direction of the content, so the user lacks a sense of immersion and engagement.
Therefore, an interaction scheme is needed in which a user can actively influence the direction of the content.
Disclosure of Invention
One technical problem to be solved by the present disclosure is to provide a scheme in which a user can actively influence the direction of content through voice interaction. The user can determine the subsequent direction of the current information stream through voice input, and in particular can determine the storyline branches of a storyline game, which enhances the user's sense of immersion and participation and improves the playability of the game.
According to a first aspect of the present disclosure, there is provided a voice interaction method, including: presenting a current information stream; acquiring a voice input from a user; and determining the presentation content of a subsequent information stream based on the current information stream and the voice input. The information stream may be an information stream comprising storyline branches or an information stream comprising a controllable avatar.
According to a second aspect of the present disclosure, there is provided a voice interaction method, including: voice-broadcasting a storyline story; voice-broadcasting a plurality of options for triggering different storyline branches; acquiring the user's voice selection of one of the plurality of options; and triggering the storyline branch corresponding to the selected option based on the voice selection.
According to a third aspect of the present disclosure, there is provided a voice interaction system comprising a server and a plurality of terminals, wherein each terminal is configured to: present the information stream obtained from the server; collect voice input from a user; upload the voice input to the server; acquire voice input feedback issued by the server; and present a subsequent information stream based on the voice input feedback; and the server is configured to: issue a current information stream for presentation; acquire the voice input uploaded by the terminal; and generate and issue the voice input feedback based on the voice input.
According to a fourth aspect of the present disclosure, there is provided a voice interaction terminal, comprising: presentation means for presenting a current information stream; input means for acquiring a voice input from a user; processing means for determining the presentation content of a subsequent information stream based on the current information stream and the speech input.
According to a fifth aspect of the present disclosure, there is provided a voice interaction method, comprising: presenting the current information stream; obtaining a plurality of speech inputs from a plurality of users; based on the current information stream and the plurality of speech inputs, determining presentation content for a subsequent information stream.
According to a sixth aspect of the present disclosure, there is provided a voice interaction method, comprising: presenting the current information stream; acquiring multiple rounds of voice input from a user; based on the current information stream and the multiple rounds of voice input, determining presentation content of a subsequent information stream.
According to a seventh aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method as described in the first and second aspects and the fifth and sixth aspects above.
According to an eighth aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having stored thereon executable code which, when executed by a processor of an electronic device, causes the processor to perform the method as described in the first and second aspects and the fifth and sixth aspects above.
The invention can thus achieve the technical effect of influencing the direction of the storyline and broadcasting different audio and video content through voice interaction. Specifically, voice instructions can be used to trigger subsequent content broadcasting, and multidimensional information, such as the recognized text, the execution time, whether the speaker's announcement was interrupted, the time the instruction was issued, and the emotion of the utterance, can serve as the basis for deciding on and generating the subsequent broadcast content.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in greater detail exemplary embodiments thereof with reference to the attached drawings, in which like reference numerals generally represent like parts throughout.
FIG. 1 shows a schematic flow diagram of a voice interaction method according to one embodiment of the present invention.
Fig. 2 shows an example of a scenario branching structure.
Fig. 3 shows an example of triggering a scenario branch by selection.
Fig. 4 shows an example of interaction of a voice broadcast storyline.
FIG. 5 illustrates a schematic diagram of the components of a voice interaction system in which the present invention may be implemented.
Fig. 6 is a schematic diagram illustrating the composition of a voice interactive terminal according to an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
With the development of voice interaction technology, smart speakers that support various controls and content acquisition via voice commands have become popular. Content functions such as listening to songs and listening to stories are among the most popular uses of smart speakers. On speakers with a screen, content can be presented through a combination of media such as video, pictures, text, and audio. For story-based content programs, playback is typically completed in one pass once triggered by a user instruction. For example, the user may say to the smart speaker, "XXX, I want to hear a story." The smart speaker then announces a story for the user until the album finishes playing. Although the user may perform operations such as play, pause, and selection, such playback controls cannot actively influence the direction of the content, so the user lacks a sense of immersion and engagement.
Therefore, the invention provides a scheme in which the user can actively influence the direction of the content through voice interaction. The user can determine the subsequent direction of the current information stream through voice input, and in particular can determine the storyline branches of a storyline game, which enhances the user's sense of immersion and participation and improves the playability of the game.
FIG. 1 shows a schematic flow diagram of a voice interaction method according to one embodiment of the present invention. In some embodiments, the method may be implemented by a voice interaction terminal through interaction with the terminal's user. In other embodiments, the voice interaction terminal implements the scheme with the help of the processing and/or storage capabilities of the cloud.
In step S110, the current information stream is presented. Here, "presenting" means making the content perceptible to the user through the terminal device by various sensory means. In one embodiment, the presented information stream may be an information page displayed on a screen, for example within an app installed on a mobile phone. Alternatively or additionally, the presentation may comprise sound. In that case, a speaker or earphone can play corresponding scene sounds, such as music, voice prompts or narration, or sounds simulating a real scene (e.g., rain or wind). In other embodiments, the information stream may also be presented by other means, such as vibration.
Here, "information flow" refers to information capable of updating presentation content. For example, the smart speaker may story read via an audio stream, at which time the story read may be viewed as an "information stream". In a game scenario, the content that is converted based on the user input may also be regarded as an information stream.
The user can give corresponding voice feedback on the presented information stream. To this end, in step S120, a voice input from the user may be acquired. After the user's voice input is obtained, the presentation content of the subsequent information stream may be determined in step S130 based on the current information stream and the voice input. This allows the user's voice input to actively influence the subsequent presentation of the information stream, enriching the interaction between the user and the terminal device and increasing immersion through voice participation.
As described above, in the present invention "information stream" refers to information whose presented content can be updated. In a preferred implementation, the information stream of the present invention may in particular be an information stream comprising storyline branches, for example a game, an episode (e.g., a television show, movie, or animation), or a novel containing storyline branches. A "storyline branch" is a narrative design that leads to different plots depending on the user's choices. It is one of the most classic and important interactive elements in character adventure games and interactive fiction, and gives the user a strong sense of accomplishment through choice and change.
Fig. 2 shows an example of a storyline branching structure. The storyline may begin with a common opening portion 20 used to set the scene, introduce characters to the player, and so on. At each of the branch points A to G, a decision is needed as to which path the storyline takes, so that the user reaches one of the four possible endpoints W to Z along a feasible storyline path. In contrast to the branch points A to G that require a decision (some of which may be implemented as the interaction points mentioned below), some paths may also merge at nodes H, J and K; that is, different storyline branches may return to the same storyline and proceed naturally forward to one of the endpoints W to Z.
Here, the endpoints W to Z may connect to subsequent storylines; that is, the branching structure shown in Fig. 2 may be part of the overall branching structure of a game or episode. In other embodiments, for example in a simpler game, the endpoints W to Z may correspond to four endings of the storyline. The path 22 in the figure may denote a conventional storyline path, while the double line 24 may denote a storyline path that requires a particular condition to trigger.
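To make the structure concrete, the branching graph of Fig. 2 can be modeled as a small directed graph. The following Python sketch is illustrative only; the node names mirror the figure's labels, and the `condition` field (an assumption of this sketch, not part of the patent) marks paths like the double line 24 that need a special trigger:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class StoryNode:
    """One point in the storyline graph: a branch point, a merge node, or an ending."""
    node_id: str
    content: str                                   # text announced when the node is reached
    choices: dict = field(default_factory=dict)    # option text -> next node_id
    condition: Optional[str] = None                # e.g. "age>=18" gates a double-line path 24

# A fragment of the Fig. 2 structure: branch point A fans out,
# node H merges paths back together, W is one of the endings.
story = {
    "A": StoryNode("A", "You reach a fork in the road.",
                   {"go into the forest": "B", "walk to the river": "C"}),
    "B": StoryNode("B", "The forest is dark and quiet.", {"press on": "H"}),
    "C": StoryNode("C", "The river blocks your way.", {"follow the bank": "H"}),
    "H": StoryNode("H", "Both paths rejoin at an old bridge.", {"cross it": "W"}),
    "W": StoryNode("W", "Ending W: you reach the far shore."),   # endpoint, no choices
}
```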
Different storyline branches may be triggered after the user, or an avatar controlled by the user, meets a certain condition (e.g., reaches a certain level) or completes a certain task, or may be triggered at a storyline interaction point by the user's selection among multiple options. Fig. 3 shows an example of triggering a storyline branch by selection. As shown, in a seafaring game, the user's fleet runs short of food, and at the game interaction point the user may be shown different options for dealing with it. Selecting different options leads to different branches, e.g., the branch in which the voyage mission fails or the one in which it succeeds.
The voice interaction scheme is particularly suitable for information streams with storyline branches: the user can influence the direction of the storyline through voice interaction, achieving an immersive interaction effect. For example, when the current information stream is presented as a voice announcement, the user may be presented with a plurality of options for triggering different storyline branches, e.g., by announcing the options by voice. The user's voice selection of one of the options can then be obtained, and the storyline branch corresponding to the selected option can be triggered based on that voice selection. For example, in a detective-style voice game, a detective story may be voice-announced and the user encouraged to look for clues. The smart voice device might announce, "You have come to a fork: the left path leads to the forest, the right path leads to the riverside. Which way will you go?" The user may reply directly by voice, "go to the forest on the left," to make a selection, and the story proceeds to the corresponding storyline branch based on that selection. In this way, when the information stream is voice-broadcast, introducing the user's voice replies increases the user's participation, keeps the user immersed in the atmosphere created by the voice broadcast, and improves playability.
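As a rough illustration of how such a voice selection might be resolved, the snippet below matches the recognized reply against keywords for each announced option. The keyword lists and the re-prompt convention are assumptions of this sketch, not the patent's implementation:

```python
def match_option(transcript: str, options: dict[str, list[str]]) -> str | None:
    """Return the option whose keywords best match the recognized text.

    options maps an option id to keywords indicating it,
    e.g. {"forest": ["left", "forest"], "river": ["right", "river"]}.
    Returns None when nothing matches, so the caller can re-prompt.
    """
    text = transcript.lower()
    scores = {opt: sum(kw in text for kw in kws) for opt, kws in options.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# "Which way will you go?" -> the user replies "go to the forest on the left"
choice = match_option("go to the forest on the left",
                      {"forest": ["left", "forest"], "river": ["right", "river"]})
assert choice == "forest"
```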
In other embodiments, before the options are presented, the method may further include presenting the user with an interaction point prompt for triggering different storyline branches. A voice instruction from the user in response to the interaction point prompt can then be obtained, and based on that voice instruction, the storyline branch or branch options corresponding to the interaction point are triggered. Here, an interaction point refers to a place where interaction can occur that leads to different storyline branches, for example the on-screen options of Fig. 3 or voice-announced options. When an interaction point is not necessarily passed as the storyline develops, it can be prompted. For example, in the case of voice broadcasting, the announcement may include something like "you pass a gate; do you want to investigate inside?" Here, entering the gate does not itself directly trigger a storyline branch; rather, options that can trigger storyline branches become available once the user chooses by voice to enter. The user may then set off subsequent storyline branches through further selections inside the gate, or by discovering clues there.
For the acquired user voice input, the text of the voice input can first be obtained and converted into an instruction the machine can understand, for example the understanding of "left" and/or "forest" in the example above. This reproduces the effect of selection by clicking in existing game interactions, for example, to determine the presentation content of subsequent information streams. Beyond replacing existing interaction means (e.g., mouse clicks or finger taps on a touch screen), voice input can also provide the terminal with information unique to speech to aid in determining or generating subsequent information streams. To this end, acquiring the voice input from the user may include: acquiring text information of the voice input; and acquiring voice attribute information of the voice input; and determining the presentation content of the subsequent information stream may comprise: determining the presentation content of the subsequent information stream based on the text information and the voice attribute information. Beyond the presentation content, the presentation manner of the subsequent information stream can also be determined based on the text information and the voice attribute information. For example, in a voice detective game, the storyline branch to be announced next, and the manner in which it is presented, may be determined based on the various types of information contained in the user's voice input, e.g., the branch may be announced in a more mysterious or tense tone.
In particular, voice attribute information refers to information associated with the input speech itself, beyond the semantic text content of the voice input.
In one embodiment, the voice attribute information may be the starting time of the voice input relative to the current information stream. As mentioned above, in step S130 the direction of the subsequent information stream is determined based on the progress of the current information stream and the instruction carried by the voice input. In this embodiment, not only what the user said but also when it was said becomes a basis for generating the subsequent information stream. For example, in the case of branch determination based on interaction points and selections as described above, the user's state of mind may be inferred from whether the user interrupts the speaker's announcement, and subsequent presentation content, or a presentation manner, that better matches the user's current mood can be given. In addition, in some embodiments the user may be allowed to make voice inputs at places other than interaction points, for example on paths 22 or 24 in Fig. 2 away from the interaction points A to G, and the content and start time of such inputs may likewise serve as selection criteria for subsequent storyline branches.
Alternatively or additionally, the voice attribute information may be the duration of the voice input. The user's speech rate or mood can be judged from the duration, and this information can also serve as a selection criterion for subsequent storyline branches.
Also alternatively or additionally, the voice attribute information may be emotion information and/or intonation information of the voice input. After the user's voice input is acquired, the speech itself may be analyzed, the user's emotion judged from speaking intensity, word choice and the like, and a storyline branch corresponding to that emotion (for example, a more difficult play mode) given afterwards. Moreover, when expressing the same textual meaning, users may phrase it differently or use different intonations, and these can be used to generate or select subsequent storyline branches or to determine their presentation manner. For example, when the user answers in Sichuan-accented Mandarin, the virtual characters may likewise converse in Sichuan-accented Mandarin during subsequent game interaction. In languages such as Japanese, where people of different status communicate using different sentence patterns, the corresponding sentence pattern may be selected for subsequent information stream presentation based on the current user's phrasing.
In addition, the voice attribute information may further include the user identity corresponding to the voice input. For example, the captured speech may be compared against voiceprints to determine the user's identity, together with previously entered age, gender, preferences, and credit points, and the subsequent information stream presentation can then be decided based on the determined identity. For example, a fighting game may have different versions such as R-13 and R-18; after the user's voiceprint is verified, it may be determined based on the user's age whether to open a storyline path 24 that requires a particular condition to trigger, as shown in Fig. 2 (e.g., a path for adult users).
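Taken together, these attributes can steer both what is played next and how it is announced. The sketch below is a minimal illustration under assumed field names and thresholds; none of the numbers come from the patent:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VoiceAttributes:
    text: str                  # recognized text of the input
    start_offset_s: float      # when the user spoke, relative to the current stream
    duration_s: float          # how long the utterance lasted
    emotion: str               # e.g. "calm" or "tense", from an emotion classifier
    interrupted: bool          # True if the user cut off the announcement
    user_age: Optional[int]    # from a voiceprint-matched profile, if enrolled

def plan_next(v: VoiceAttributes, chosen_branch: str) -> dict:
    """Decide the next branch plus a presentation style from voice attributes."""
    style = "tense" if (v.interrupted or v.emotion == "tense") else "neutral"
    # Gate a special-condition path (double line 24 in Fig. 2) on verified age.
    hidden_ok = v.user_age is not None and v.user_age >= 18
    return {"branch": chosen_branch, "style": style, "hidden_mode": hidden_ok}
```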
Further, the voice interaction method of the present invention may also include obtaining environmental information at the time the voice input is produced, and determining the presentation content of the subsequent information stream based on that environmental information. The environmental information may concern the immediate surroundings in which the user speaks, such as the room temperature or whether other people are present, or a larger scale, such as the time period (holiday, commuting rush hour, late night), weather conditions, or geographic location. This information can likewise be used to determine the generation or selection of subsequent information streams and the manner in which they are presented.
In addition to receiving voice input, the voice interaction method of the present invention may also obtain non-voice input from the user and determine the presentation content of the subsequent information stream based on it. For example, the terminal may obtain the user's motion input (e.g., via a motion sensor) or video input (e.g., via a 3D camera) and judge it comprehensively together with the voice input. At some interaction points, for example, interaction may occur via voice input, while at other input points it may be based on, say, a mouse or screen click.
As an alternative or in addition to information streams with storyline branches, the voice interaction method of the invention may also be applied to information streams that include an avatar; that is, presenting the current information stream may comprise presenting an avatar.
The avatar may comprise a user avatar. For example, in a classic RPG (role-playing game), the user controls the game's protagonist. Here, the user may control the avatar by voice, e.g., "go left" or "leave town", replacing cumbersome mouse clicks or keyboard controls. To this end, acquiring the voice input from the user may include obtaining the user's voice control over the user avatar, and determining the presentation content of the subsequent information stream comprises controlling presentation of the user avatar based on the voice input.
The avatars may also include avatars other than the user avatar. Such other avatars may be virtual characters or other creatures in a single-player game, the avatars of other real users in an online game, or virtual characters or creatures provided by the game itself. In this case, acquiring the voice input from the user may include obtaining the user's voice interactions with the other avatars, and determining the presentation content of the subsequent information stream includes controlling the presentation of the other avatars based on those voice interactions. Controlling the presentation of the other avatars may include obtaining clues that trigger storyline branches and/or obtaining interaction points that trigger storyline branches.
Specifically, the user can hold a voice conversation with other avatars, directly or through the user's own avatar, and thereby obtain a stronger sense of immersion than with existing click operations. The content, duration, and other characteristics of the user's dialog with an avatar may, for example, prompt clue characters in the game to provide clues, or may directly trigger interaction points leading to storyline branches, thereby advancing the game.
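A minimal sketch of such a dialog hook, with assumed thresholds and field names: an on-topic conversation that lasts long enough makes a non-player avatar hand over a clue.

```python
from typing import Optional

def talk_to_npc(npc: dict, transcript: str, talk_time_s: float) -> Optional[dict]:
    """Voice dialog with a non-player avatar. An on-topic conversation that is
    long enough yields a clue or interaction point; otherwise nothing triggers."""
    on_topic = any(kw in transcript.lower() for kw in npc["topics"])
    if on_topic and talk_time_s >= npc["min_talk_s"]:
        return {"type": "clue", "payload": npc["clue"]}
    return None   # keep chatting; no storyline branch triggered yet

guard = {"topics": ["key", "gate"], "clue": "the key is under the mat", "min_talk_s": 2.0}
assert talk_to_npc(guard, "Where is the key to the gate?", 4.0) is not None
```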
The voice interaction scheme according to the present invention, described above in connection with Fig. 1, can be applied to any information stream that the user actively influences through voice interaction. Such streams may be presented in one or more ways, e.g., by images and/or sounds, and may receive various inputs including voice interaction, leading to a better interaction experience.
The present invention is particularly well suited to implementation as an interaction method for voice-broadcasting a storyline story. The terminal can voice-broadcast the storyline story; voice-broadcast a plurality of options for triggering different storyline branches; acquire the user's voice selection of one of those options; and trigger the storyline branch corresponding to the selected option based on the voice selection. Furthermore, voice input inserted by the user at times other than during the voice broadcast of the options can be acquired; a storyline interaction point can be presented based on that voice input; the user's voice interaction at that interaction point can be acquired; and a subsequent storyline branch can be generated or triggered based on that interaction.
Fig. 4 shows an example of interaction in a voice-broadcast storyline story. The story can be, for example, a detective story, and the system can determine the subsequent broadcast content with reference to the user's voice instructions. The main flow is as follows.
the game/program begins and the content begins to be broadcast. Subsequently, receiving of the user instruction may be started. The user command may be received at a specific time point, or may start receiving in any time period, in other words, the user command may start receiving in the current broadcast, or may start receiving in the broadcast process.
A received instruction may be a voice instruction from the user. It can undergo instruction recognition and be converted into structured data the program can understand; the resulting information may include the recognized text, the speech duration, whether the instruction was issued after the announcement finished, the recognized user emotion, and so on. For example, when the multiple options of an interaction point are voice-announced (e.g., "do you go to the forest on the left or the riverside on the right?"), an explicit selection of the subsequent storyline branch can be made according to the text of the user's voice reply (e.g., "go to the forest"). As another example, whether a hidden "horror" mode (e.g., the double-line storyline path 24 shown in Fig. 2) should be opened may be decided based on the user's age as determined by voiceprint (e.g., 18 or older) and the emotional state conveyed by the user's voice input, to deepen the user's immersion in the game.
A received instruction may also be triggered by other, non-voice events, such as the user's mouse or key operations or a change in geographic location, or by a timeout event when the user gives no input at all. For example, on a speaker with a screen, the voice-announced content options may simultaneously be displayed on screen, and the user may complete the selection by tapping the touch screen. As another example, in a more deeply interactive game, the subsequent direction of the game may be decided jointly from the user's body movements and facial expressions captured by a 3D camera together with the user's voice interaction.
The broadcast content is then decided or generated according to the instruction information and various known context information (such as credit points or the user's city). The broadcast content may be prepared in advance as candidates to choose from, or dynamically generated according to the decision result. The latest content then continues to be broadcast until the end condition is met.
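The flow of Fig. 4 reduces to an announce-listen-decide loop. In this sketch, `announce`, `listen`, and `decide` are placeholders for the TTS, input-capture, and decision components (assumptions of the sketch, not APIs from the patent), and `story` holds nodes like the `StoryNode` structure sketched earlier:

```python
def run_story(story: dict, start: str, announce, listen, decide) -> None:
    """Drive the broadcast loop: play a node, gather input, pick the next node."""
    node_id = start
    while node_id is not None:
        node = story[node_id]
        announce(node.content)                # voice-broadcast the current content
        if not node.choices:                  # an ending: the end condition is met
            break
        user_input = listen(timeout_s=10.0)   # voice, click, or a timeout event
        node_id = decide(node, user_input)    # may return node_id again to re-prompt
```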
The invention can thus achieve the technical effect of influencing the direction of the storyline and broadcasting different audio and video content through voice interaction. Specifically, voice instructions can be used to trigger subsequent content broadcasting, and multidimensional information, such as the recognized text, the execution time, whether the speaker's announcement was interrupted, the time the instruction was issued, and the emotion of the utterance, can serve as the basis for deciding on and generating the subsequent broadcast content.
In particular application scenarios, the voice interaction scheme of the present invention may also be implemented as a more complex voice interaction and information stream presentation scheme involving multi-user interaction and/or multiple rounds of interaction.
In one embodiment, the invention may be implemented as a voice interaction method comprising: presenting the current information stream; obtaining a plurality of speech inputs from a plurality of users; based on the current information stream and the plurality of speech inputs, determining presentation content for a subsequent information stream.
Here, the presentation of the information stream and the acquisition of voice input may proceed separately, acquiring one user's input at a time, advancing the storyline, and then acquiring the next user's input, or multiple user inputs may be acquired at once. Thus, obtaining a plurality of speech inputs from a plurality of users comprises at least one of: acquiring voice inputs from different users for different, successively presented current information streams; and acquiring a plurality of voice inputs from different users for one current information stream.
Further, the method may also comprise determining that the plurality of speech inputs come from different users. Determining the presentation content of the subsequent information stream based on the current information stream and the plurality of speech inputs then comprises at least one of: generating sub-streams and determining the presentation content of each sub-stream for the different users; and comprehensively judging the user identities and input contents of the voice inputs to determine the presentation content of the subsequent information stream.
In a role-playing game involving multiple players, for example three players A, B and C, the method may ask players A, B and C separately at different query points, or may ask all three at once. For a simultaneous query, if the three players have voice input devices that do not interfere with one another, e.g., each wears a microphone, they can answer at the same time for the system to capture (e.g., in an online game). If the three players share one voice interaction device, e.g., a smart speaker, it is preferable that they not speak simultaneously so that each player's voice input can be acquired, and the content of the subsequent information stream is then determined from the content of the voice inputs, the order in which they arrive, and so on. In the online-game case, players A, B and C may each be presented (e.g., audibly announced) with their own sub-stream; in the local-game case, presentation can occur in the same stream.
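For the shared-device case, one simple policy is to attribute each utterance to a speaker (e.g., by voiceprint) and keep one vote per player. The sketch below assumes speaker attribution has already happened; the majority rule is an illustrative choice, not the patent's rule:

```python
from collections import Counter

def combine_votes(utterances: list) -> str:
    """utterances: (speaker_id, chosen_option) pairs in arrival order.
    Keep each speaker's latest answer, then pick the majority option;
    ties resolve to the option counted first."""
    latest = {}
    for speaker, option in utterances:    # later answers override earlier ones
        latest[speaker] = option
    return Counter(latest.values()).most_common(1)[0][0]

# Players A and C choose the forest, B chooses the river: the story goes to the forest.
assert combine_votes([("A", "forest"), ("B", "river"), ("C", "forest")]) == "forest"
```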
Alternatively or additionally, in another embodiment, the invention may be implemented as a voice interaction method comprising: presenting the current information stream; acquiring multiple rounds of voice input from a user; and determining presentation content of a subsequent information stream based on the current information stream and the multiple rounds of voice input.
The user can provide multiple rounds of voice input under the guidance of the system. To this end, obtaining multiple rounds of speech input from a user may include: presenting the current round of interactive content according to a predetermined framework; acquiring the current round of voice input produced by the user in response to that content; and presenting the next round of interactive content based on the predetermined framework and/or the current round of voice input. For example, the information stream may include a storyline story, and the storyline framework of that story may be constructed based on the multiple rounds of speech input.
For example, in a storyline-story broadcasting scenario, the system can let the user select or specify the story's setting, such as 19th-century London or a virtual world of the 22nd century, then let the user select the story type, such as spy mystery or comedy, and even let the user define the protagonist's character traits. The user thus participates deeply in creating the story, co-creating it with the system, which further improves participation and interest.
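The round-by-round guidance can be pictured as a fixed list of slots to fill: each round announces one question and records the answer, and the filled slots become the storyline framework. The slot names and prompts below are assumptions for illustration:

```python
# Predetermined framework: each round fills one slot of the story setup.
ROUNDS = [
    ("setting", "Where does the story take place? For example, 19th-century London."),
    ("genre",   "What kind of story? For example, a spy mystery or a comedy."),
    ("hero",    "Describe the protagonist in one sentence."),
]

def build_story_frame(ask) -> dict:
    """ask(prompt) announces a prompt and returns the recognized reply; it stands
    in for one TTS + ASR round trip. The returned dict is the storyline framework."""
    frame = {}
    for slot, prompt in ROUNDS:
        frame[slot] = ask(prompt)    # a later round could also adapt to earlier answers
    return frame

# For a quick console test, build_story_frame(input) stands in for the voice loop.
```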
As described above, the voice interaction method of the present invention can be implemented by a voice interaction terminal through interaction with the terminal's user. In other embodiments, the voice interaction terminal implements the scheme with the help of the processing and/or storage capabilities of the cloud.
To this end, the invention may also be implemented as a voice interaction system. Fig. 5 illustrates a schematic diagram of the components of a voice interaction system in which the present invention may be implemented. As shown, the voice interaction system may include a server 510 and a plurality of terminals 520. The server 510 may include multiple platforms to provide a variety of services for the many terminals 520 involved in the voice interaction of the present invention. As shown, a terminal 520 may be a smart speaker of various kinds, such as a cylindrical smart speaker or a smart speaker with a screen, or a mobile smart terminal such as a mobile phone.
Here, the terminal may be configured to: present the information stream obtained from the server; collect voice input from a user; upload the voice input to the server; acquire voice input feedback issued by the server; and present a subsequent information stream based on the voice input feedback. In one embodiment, the terminal may be a single physical device, such as the smart speaker or mobile terminal shown in the figure, which independently implements information stream presentation (e.g., announcement and display), voice collection, network transmission, and subsequent presentation, and whose processing capabilities may include parts that run locally. In other embodiments, the terminal may comprise several physical devices; for example, the smart speaker may communicate over a short range with a locally installed smart voice sticker and complete voice collection and reporting via that accessory. The invention is not limited in this respect.
Accordingly, the server 510 may be configured to: issuing a current information flow for presentation; acquiring the voice input uploaded by the terminal; and generating and issuing the voice input feedback based on the voice input.
In some embodiments, the terminal 520 may determine the content of the subsequent information stream from the information already in the terminal or directly generate the subsequent content according to the obtained voice input feedback. In other embodiments, the server 510 may be configured to determine and issue the presentation content for the subsequent information flow, that is, the terminal directly obtains the subsequent content issued by the server.
Further, the server 510 may be configured to: acquire text information, voice attribute information, and environment information of the voice input; and determine and issue presentation content for the subsequent information stream based on the text information, the voice attribute information, and the environment information.
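The terminal-server division of labor can be sketched as a small request/response exchange. The message fields and the `pick_branch` helper are assumptions of this sketch, not the patent's wire format:

```python
import json

def handle_upload(session: dict, upload: dict) -> str:
    """Server side: consume one uploaded voice-input record and return the
    feedback the terminal uses to present the subsequent information stream."""
    text = upload["text"]                  # recognized text (ASR may run server-side)
    attrs = upload.get("attributes", {})   # start time, duration, emotion, ...
    env = upload.get("environment", {})    # time of day, location, weather, ...
    nxt = pick_branch(session["current_node"], text, attrs, env)
    session["current_node"] = nxt
    return json.dumps({"feedback": "branch_selected", "next": nxt})

def pick_branch(node: str, text: str, attrs: dict, env: dict) -> str:
    # Stand-in decision: a real server would weigh text, attributes and environment.
    return node + ("/forest" if "forest" in text else "/river")
```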
The terminal 520 may be configured to: present the user with a plurality of options for triggering different storyline branches; collect the user's voice selection of one of those options; and present a subsequent information stream based on the voice selection.
Further, the present invention can also be implemented as a voice interaction terminal for carrying out the voice interaction method described above in conjunction with Figs. 1 and 4. Fig. 6 is a schematic diagram of the composition of a voice interaction terminal according to an embodiment of the present invention. The terminal may perform the voice interaction method described above in conjunction with Figs. 1 and 4, or at least complete its execution with the participation of the cloud. The terminal may correspond to a terminal 520 in the system shown in Fig. 5.
In particular, the terminal 600 may comprise a presentation means 610, an input means 620 and a processing means 630.
The presentation means 610 may be used to present the current information stream. The input device 620 may be used to obtain voice input from a user. The processing means 630 may then be used to perform processing, such as determining the presentation content of a subsequent information stream based on the current information stream and the speech input.
Further, when the terminal 600 needs to interact with the server, relying on the processing capabilities of the cloud platform for voice processing and for determining (including selecting and generating) the subsequent information stream, the terminal 600 may also include a networking device 640 for acquiring information to be presented, uploading the acquired voice input, and obtaining the voice input feedback used to determine the subsequent information stream. Where needed, the networking device 640 may also obtain information streams for presentation, for example downloading game data beforehand or downloading in real time as the information stream plays.
In different embodiments, the presentation device 610 may take different forms. For example, in some embodiments the presentation device may comprise a display device for visually outputting the content to be presented, for example displaying the current information stream and/or a subsequent information stream. Alternatively or additionally, the presentation device may further comprise a voice output device for voice-broadcasting the current information stream and the subsequent information stream.
In addition, the input device 620 may further include an operation control device for acquiring the user's operational control input. For example, the operation control device may include a keyboard, mouse, touch screen, or joystick. The processing device 630 may then be configured to determine the presentation content of the subsequent information stream based on the operational control input.
Further, the terminal 600 may be implemented as a computing device with conventional processing capabilities. The processing device 630 may be implemented as a processor of the computing device, and the computing device may further comprise a memory for storing the data and instructions required for computation.
The processor may be a multi-core processor or may include a plurality of processors. In some embodiments, the processor may include a general-purpose host processor and one or more special coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, the processor may be implemented using custom circuitry, such as an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA).
The memory may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage. The ROM may store static data or instructions required by the processor or other modules of the computer. The permanent storage may be a read-write storage device, and may be non-volatile so that stored instructions and data are not lost when the computer is powered off. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as permanent storage. In other embodiments, the permanent storage may be a removable storage device (e.g., a floppy disk or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random-access memory. The system memory may store the instructions and data that some or all of the processors require at runtime. Furthermore, the memory may comprise any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory) and magnetic and/or optical disks. In some embodiments, the memory may include a readable and/or writable removable storage device, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM or dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card), or a magnetic floppy disk. Computer-readable storage media do not contain carrier waves or transitory electronic signals transmitted wirelessly or over wires.
The memory may also store executable code that, when processed by the processor, causes the processor to perform the voice interaction methods described above.
The voice interaction method, system and terminal according to the present invention have been described in detail above with reference to the accompanying drawings. The invention enables the user to actively influence the presentation of an information stream through voice interaction, and is particularly suitable for improving the user's immersive experience of information streams with storyline branches. Furthermore, the user can hold voice conversations with in-game characters through a virtual avatar, further improving the game's sense of presence and playability.
Furthermore, the method according to the invention may also be implemented as a computer program or computer program product comprising computer program code instructions for carrying out the above-mentioned steps defined in the above-mentioned method of the invention.
Alternatively, the present invention may also be embodied as a non-transitory machine-readable storage medium (or computer-readable storage medium, or machine-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) which, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the present invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (36)

1. A voice interaction method, comprising:
presenting the current information stream;
acquiring a voice input from a user; and
based on the current information stream and the speech input, the presentation content of a subsequent information stream is determined.
2. The method of claim 1, wherein the information stream comprises an information stream comprising storyline branches.
3. The method of claim 2, wherein determining the presentation content of a subsequent information stream based on the current information stream and the speech input comprises:
determining the branch direction of the subsequent information stream based on the current branch of the information stream and the speech input.
4. The method of claim 2, wherein the information stream comprises at least one of:
a game with storyline branches;
an episode with storyline branches; and
a novel with storyline branches.
5. The method of claim 2, wherein presenting the current information stream comprises:
the user is presented with a number of options for triggering different storyline branches,
obtaining speech input from a user includes:
obtaining a user's voice selection of one of the plurality of options, and
determining presentation content of a subsequent information stream based on the current information stream and the speech input comprises:
based on the voice selection, triggering a scenario branch corresponding to the selected option.
6. The method of claim 5, wherein presenting the current information stream comprises:
the user is presented with an interactive point prompt for triggering a different storyline branch,
obtaining speech input from a user includes:
acquiring a voice instruction prompted by the user to the interaction point, and
determining presentation content of a subsequent information stream based on the current information stream and the voice instruction comprises:
and triggering the storyline branch or branch options corresponding to the interaction point based on the voice instruction.
7. The method of claim 2, wherein obtaining speech input from a user comprises:
acquiring text information of the voice input;
acquiring voice attribute information of the voice input, and
determining the presentation content of the subsequent information stream comprises:
and determining the presentation content of the subsequent information flow based on the text information and the voice attribute information.
8. The method of claim 7, further comprising:
and determining the presentation mode of the subsequent information flow based on the text information and the voice attribute information.
9. The method of claim 7, wherein the voice attribute information comprises at least one of:
the voice input corresponds to the starting time of the current information flow;
a duration of the voice input;
emotion information of the voice input;
tone information of the voice input; and
and the user identity corresponding to the voice input.
10. The method of claim 2, further comprising:
obtaining environmental information at the time of the voice input generation, and
determining the presentation content of the subsequent information stream comprises:
based on the environmental information, the presentation content of the subsequent information stream is determined.
11. The method of claim 2, further comprising:
obtain non-speech input of the user, and
determining the presentation content of the subsequent information stream comprises:
based on the non-speech input, the presentation content of the subsequent information stream is determined.
12. The method of claim 2, wherein determining presentation content for a subsequent information stream based on the current information stream and the speech input comprises:
altering a structure of the storyline branches based on the current branch of the information stream and the voice input.
13. The method of claim 1, wherein presenting the current information stream comprises:
a virtual avatar is presented.
14. The method of claim 13, wherein the avatar comprises a user avatar, and
obtaining speech input from a user includes:
obtaining voice control of the user over the user avatar, and
determining the presentation content of the subsequent information stream comprises:
controlling presentation of the user avatar based on the speech input.
15. The method of claim 14, wherein the avatars include avatars other than the user avatar, and
obtaining speech input from a user includes:
obtaining voice interaction of the user with the other avatars, and
determining the presentation content of the subsequent information stream comprises:
controlling presentation of the other avatars based on the voice interaction.
16. The method of claim 15, wherein controlling the presentation of the other avatars comprises at least one of:
obtaining clues for triggering the storyline branches;
and obtaining an interaction point that triggers the storyline branch.
17. The method of claim 1, wherein presenting the current information stream comprises:
voice-broadcasting the current information stream.
18. A voice interaction method, comprising:
broadcasting a storyline story by voice;
voice-broadcasting a plurality of options for triggering different storyline branches;
acquiring voice selection of a user on one option in the multiple options; and
based on the voice selection, triggering a scenario branch corresponding to the selected option.
19. The method as recited in claim 18, further comprising:
acquiring voice input inserted by the user at times other than the voice broadcast of the plurality of options;
presenting a plot interaction point based on the voice input;
acquiring voice interaction of the user aiming at the plot interaction point; and
based on the voice interaction, a subsequent plot branch is generated or triggered.
20. A voice interaction system, comprising a server and a plurality of terminals, wherein,
the terminal is used for:
presenting the information flow obtained from the server;
collecting voice input from a user;
uploading the voice input to the server;
acquiring voice input feedback issued by the server; and
presenting a subsequent information stream based on the voice input feedback,
the server is used for:
issuing a current information flow for presentation;
acquiring the voice input uploaded by the terminal;
and generating and issuing the voice input feedback based on the voice input.
21. The system of claim 20, wherein the server is to:
the presentation content for the subsequent information stream is determined and delivered.
22. The system of claim 21, wherein the server is to:
acquiring text information, voice attribute information and environment information of the voice input; and
determining and issuing presentation content for the subsequent information stream based on the text information, the voice attribute information and the environment information.
23. The system of claim 20, wherein the terminal is configured to:
presenting a plurality of options to a user for triggering different storyline branches;
collecting voice selection of a user on one option in the multiple options; and
based on the voice selection, a subsequent information stream is presented.
24. A voice interaction terminal, comprising:
presentation means for presenting a current information stream;
input means for acquiring a voice input from a user;
processing means for determining the presentation content of a subsequent information stream based on the current information stream and the speech input.
25. The terminal of claim 24, further comprising:
a networking device to:
acquiring information to be presented;
uploading the acquired voice input; and
voice input feedback for determining a subsequent information stream is obtained.
26. The terminal of claim 24, wherein the presentation means comprises:
a voice output device for broadcasting the current information stream and the subsequent information stream by voice.
27. The terminal of claim 26, wherein the presentation means further comprises:
a display device for displaying the current information stream and/or the subsequent information stream.
28. The terminal of claim 26, wherein the input means comprises:
operation control means for acquiring an operation control input of the user, and
the processing means is configured to:
determining the presentation content of the subsequent information stream based on the operation control input.
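A small sketch of claim 28's combination of an operation control input with the current stream; the event names are hypothetical:

    def next_content(current: str, control_event: str) -> str:
        # Map an operation control event onto the subsequent presentation.
        actions = {
            "skip": "skip to the next scene",
            "replay": "replay " + current,
            "pause": "pause the presentation",
        }
        return actions.get(control_event, "continue " + current)

    print(next_content("scene 3", "skip"))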
29. A voice interaction method, comprising:
presenting the current information stream;
obtaining a plurality of speech inputs from a plurality of users; and
based on the current information stream and the plurality of speech inputs, determining presentation content for a subsequent information stream.
30. The method of claim 29, wherein obtaining a plurality of speech inputs from a plurality of users comprises at least one of:
acquiring speech inputs from different users for different current information streams presented in succession; and
acquiring a plurality of speech inputs from different users for a single current information stream.
31. The method of claim 29, further comprising:
determining that the plurality of speech inputs are from different users,
wherein determining presentation content of a subsequent information stream based on the current information stream and the plurality of speech inputs comprises at least one of:
generating a sub information stream for each of the different users and determining the presentation content of each sub information stream; and
evaluating the user identities and input contents of the plurality of speech inputs together to determine the presentation content of the subsequent information stream.
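The two alternatives of claim 31 might be sketched as follows, with speaker identification reduced to a ready-made user label and all names hypothetical:

    from collections import Counter

    def per_user_substreams(inputs: list) -> dict:
        # One sub information stream per distinct user.
        return {user: "substream answering '" + text + "'" for user, text in inputs}

    def joint_decision(inputs: list) -> str:
        # Judge identities and contents together: here, a simple majority vote.
        votes = Counter(text for _user, text in inputs)
        choice, _count = votes.most_common(1)[0]
        return "branch chosen by majority: " + choice

    inputs = [("alice", "go left"), ("bob", "go left"), ("carol", "go right")]
    print(per_user_substreams(inputs))
    print(joint_decision(inputs))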
32. A voice interaction method, comprising:
presenting the current information stream;
acquiring multiple rounds of voice input from a user; and
based on the current information stream and the multiple rounds of voice input, determining presentation content of a subsequent information stream.
33. The method of claim 32, wherein obtaining multiple rounds of voice input from a user comprises:
presenting interactive content of a current round according to a predetermined framework;
acquiring a voice input of the current round generated by the user for the interactive content of the current round; and
presenting interactive content of a next round based on the predetermined framework and/or the voice input of the current round.
34. The method of claim 32, wherein the information stream comprises a storyline story, and determining presentation content for a subsequent information stream based on the current information stream and the multiple rounds of voice input comprises:
constructing a storyline framework of the storyline story based on the multiple rounds of voice input.
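A compact sketch of the multi-round construction in claims 32-34, assuming a hypothetical slot-and-prompt framework and console input standing in for speech recognition:

    FRAMEWORK = [
        ("hero", "Who is the hero of the story?"),
        ("place", "Where does the story happen?"),
        ("goal", "What does the hero want?"),
    ]

    def build_story_framework() -> dict:
        story = {}
        for slot, prompt in FRAMEWORK:       # one interaction round per framework slot
            print("[TTS]", prompt)           # present this round's interactive content
            story[slot] = input("[ASR] > ")  # acquire this round's voice input
        return story                         # the constructed storyline framework

    print(build_story_framework())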
35. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor, causes the processor to perform the method of any of claims 1-17 and 29-34.
36. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1-17 and 29-34.
CN202010183403.XA 2020-03-16 2020-03-16 Voice interaction method, system and terminal Pending CN113409778A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010183403.XA CN113409778A (en) 2020-03-16 2020-03-16 Voice interaction method, system and terminal

Publications (1)

Publication Number Publication Date
CN113409778A 2021-09-17

Family ID: 77676638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010183403.XA Pending CN113409778A (en) 2020-03-16 2020-03-16 Voice interaction method, system and terminal

Country Status (1)

Country Link
CN (1) CN113409778A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102947774A (en) * 2010-06-21 2013-02-27 微软公司 Natural user input for driving interactive stories
US9583106B1 (en) * 2013-09-13 2017-02-28 PBJ Synthetics Corporation Methods, systems, and media for presenting interactive audio content
CN110085221A (en) * 2018-01-26 2019-08-02 上海智臻智能网络科技股份有限公司 Speech emotional exchange method, computer equipment and computer readable storage medium
CN109240564A (en) * 2018-10-12 2019-01-18 武汉辽疆科技有限公司 Artificial intelligence realizes the device and method of interactive more plot animations branch
CN110265021A (en) * 2019-07-22 2019-09-20 深圳前海微众银行股份有限公司 Personalized speech exchange method, robot terminal, device and readable storage medium storing program for executing

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114102628A (en) * 2021-12-04 2022-03-01 广州美术学院 Interaction method and device of picture book and robot
CN114177621A (en) * 2021-12-15 2022-03-15 乐元素科技(北京)股份有限公司 Data processing method and device
CN114177621B (en) * 2021-12-15 2024-03-22 乐元素科技(北京)股份有限公司 Data processing method and device
CN115103237A (en) * 2022-06-13 2022-09-23 咪咕视讯科技有限公司 Video processing method, device, equipment and computer readable storage medium
CN115103237B (en) * 2022-06-13 2023-12-08 咪咕视讯科技有限公司 Video processing method, device, equipment and computer readable storage medium
CN115220608A (en) * 2022-09-20 2022-10-21 深圳市人马互动科技有限公司 Method and device for processing multimedia data in interactive novel
CN115212580A (en) * 2022-09-21 2022-10-21 深圳市人马互动科技有限公司 Method and related device for updating game data based on telephone interaction
CN115212580B (en) * 2022-09-21 2022-11-25 深圳市人马互动科技有限公司 Method and related device for updating game data based on telephone interaction
CN115408510A (en) * 2022-11-02 2022-11-29 深圳市人马互动科技有限公司 Plot interaction node-based skipping method and assembly and dialogue development system
CN115408510B (en) * 2022-11-02 2023-01-17 深圳市人马互动科技有限公司 Plot interaction node-based skipping method and assembly and dialogue development system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40059913
Country of ref document: HK