CN112402952A - Interactive method and terminal based on audio and virtual image

Interactive method and terminal based on audio and virtual image

Info

Publication number: CN112402952A
Authority: CN (China)
Prior art keywords: audio, actions, audio data, action, data
Legal status: Pending
Application number: CN201910785707.0A
Other languages: Chinese (zh)
Inventors: 陈节省, 许荣峰, 李中冬, 刘旺, 林剑宇
Current Assignee: Fujian Kaimi Network Science & Technology Co., Ltd.
Original Assignee: Fujian Kaimi Network Science & Technology Co., Ltd.
Priority date: 2019-08-23
Filing date: 2019-08-23
Publication date: 2021-02-26
Application filed by Fujian Kaimi Network Science & Technology Co., Ltd.
Priority to CN201910785707.0A
Publication of CN112402952A

Classifications

    • A: HUMAN NECESSITIES
    • A63: SPORTS; GAMES; AMUSEMENTS
    • A63F: CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00: Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/40: Processing input control signals of video game devices, e.g. signals generated by the player or derived from the environment
    • A63F13/42: ... by mapping the input signals into game commands, e.g. mapping the displacement of a stylus on a touch screen to the steering angle of a virtual vehicle
    • A63F13/424: ... involving acoustic input signals, e.g. by using the results of pitch or rhythm extraction or voice recognition
    • A63F13/60: Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F2300/00: Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60: Methods for processing data by generating or executing the game program
    • A63F2300/6009: ... for importing or creating game content, e.g. authoring tools during game development, adapting content to different platforms, use of a scripting language to create content
    • A63F2300/6063: ... for sound processing
    • A63F2300/6072: ... for sound processing of an input signal, e.g. pitch and rhythm extraction, voice recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention provides an interactive method and terminal based on audio and an avatar. The method acquires the audio data being played in real time, converts the audio data into beat data, choreographs corresponding actions in real time according to the beat data, and controls the avatar to perform the actions in time with the audio data and to output and display them. Because corresponding actions are choreographed in real time from the beat of audio acquired in real time, no correspondence between audio and avatar actions needs to be configured in advance: any audio can be converted into a dancing avatar, audio can be added or removed at any time, flexibility is high, production cost is low, any audio can be visualized, and user experience is improved.

Description

Interactive method and terminal based on audio and virtual image
Technical Field
The invention relates to the field of audio interaction, and in particular to an interactive method and terminal based on audio and an avatar.
Background
A dancing machine is a common entertainment device: it plays a piece of music while a cartoon character dances on the screen. When a dancing machine configures dance moves for the cartoon character, each segment of audio is bound to a fixed dance move, so the association between audio and dance moves is preset. If new audio is added without such a prior configuration, no corresponding dance move can be obtained. This approach is not only inflexible; because every segment of every audio track must be configured in advance, the production cost for mass production is very high.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an interactive method and terminal based on audio and an avatar that can convert any audio into corresponding actions with high flexibility.
To solve the above technical problem, one technical solution adopted by the invention is:
an interactive method based on audio and an avatar, comprising the steps of:
acquiring the audio data being played in real time;
converting the audio data into beat data;
and choreographing corresponding actions in real time according to the beat data, and controlling the avatar to perform the actions in time with the audio data and to output and display them.
To solve the above technical problem, another technical solution adopted by the invention is:
an interactive terminal based on audio and an avatar, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring the audio data being played in real time;
converting the audio data into beat data;
and choreographing corresponding actions in real time according to the beat data, and controlling the avatar to perform the actions in time with the audio data and to output and display them.
The invention has the following beneficial effects: corresponding actions are choreographed in real time according to the beat of the audio data being played and acquired in real time, and the avatar is controlled to perform the actions in time with the audio data and to output and display them. No correspondence between audio and avatar actions needs to be configured in advance, so any audio can be converted into a dancing avatar, audio can be added or removed at any time, flexibility is high, production cost is low, any audio can be visualized, and user experience is improved.
Drawings
FIG. 1 is a flowchart of the steps of an interactive method based on audio and an avatar according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of an interactive terminal based on audio and an avatar according to an embodiment of the present invention.
Description of reference numerals:
1: interactive terminal based on audio and an avatar; 2: memory; 3: processor.
Detailed Description
To explain the technical content, objects and effects of the present invention in detail, the following description is given with reference to the accompanying drawings and in combination with the embodiments.
Referring to FIG. 1, an interactive method based on audio and an avatar includes the steps of:
acquiring the audio data being played in real time;
converting the audio data into beat data;
and choreographing corresponding actions in real time according to the beat data, and controlling the avatar to perform the actions in time with the audio data and to output and display them.
As can be seen from the above description, the beneficial effects of the present invention are: corresponding actions are choreographed in real time according to the beat of the audio data being played and acquired in real time, and the avatar is controlled to perform the actions in time with the audio data and to output and display them. No correspondence between audio and avatar actions needs to be configured in advance, so any audio can be converted into a dancing avatar, audio can be added or removed at any time, flexibility is high, production cost is low, any audio can be visualized, and user experience is improved.
Further, the method also comprises:
extracting a temporal feature of the audio data;
wherein the choreographing of corresponding actions in real time according to the beat data comprises:
choreographing corresponding actions in real time according to the temporal feature and the beat data.
As can be seen from the above description, by extracting the temporal feature of the audio data and choreographing actions on the basis of both the temporal feature and the beat data, the flexibility of the choreography can be further improved.
Further, the method also comprises:
extracting an emotional feature of the audio data;
wherein the choreographing of corresponding actions in real time according to the temporal feature and the beat data comprises:
combining the emotional feature and the beat data into different audio types;
and choreographing corresponding actions in real time according to the temporal feature and the audio type.
As can be seen from the above description, different audio types are formed by combining the emotional feature and the beat data of the audio data, and corresponding actions are choreographed in real time according to the audio type and the temporal feature, so actions that fit each audio type closely can be matched in real time, improving the degree of matching between the actions and the corresponding audio.
Further, the temporal feature comprises the time period in which the audio data is currently located;
for each audio type, a plurality of actions are designed for each time period of the audio data;
and the choreographing of corresponding actions in real time according to the temporal feature and the audio type comprises:
randomly selecting one action, according to the temporal feature and the audio type, from the plurality of actions corresponding to the time period in which the audio data is currently located.
As can be seen from the above description, a plurality of actions are designed for each audio type in each time period, and during choreography one action is randomly selected from those corresponding to the current time period of the audio data, which ensures the diversity and unpredictability of the actions performed during audio interaction, avoids monotony, and improves user experience.
Further, the method also comprises: extracting a pitch feature of the audio data;
the audio data comprises lyric information;
and the choreographing of corresponding actions in real time according to the temporal feature and the beat data comprises:
determining the lyric position of the currently acquired audio data according to the temporal feature, the beat data and the pitch feature, wherein the position of each sentence of lyrics comprises a starting time point, beat change points and pitch change points;
if the current lyrics are at a starting time point, a pitch change point or a beat change point of the lyric position, scheduling the corresponding action;
if the current lyrics are at a lyric position other than a starting time point, a pitch change point or a beat change point, scheduling a random action from a preset action set.
As can be seen from the above description, by identifying the starting time points, pitch change points and beat change points of the lyrics, corresponding actions are scheduled at those points, and a random action from a preset action set is scheduled between them. The rhythm, pitch and lyric position of the audio being played can thus be read from the actions, which further improves the degree of audio visualization and helps the user sing along, so the actions serve as an effective means of assisting or guiding the user through the song and improve the user experience. Moreover, because the filler action is randomly selected from the preset action set rather than fixed, the choreographed dance differs between plays even for the same lyrics, which avoids monotony and enriches the dance.
Further, the method also comprises:
when no audio data can be acquired, prompting, through the avatar, for audio data to be input.
As can be seen from the above description, when no audio is acquired, the avatar prompts the user to input audio data, which draws the user's attention so that audio data is input promptly.
Further, the audio data comprises a song accompaniment and the user's singing audio;
and the method further comprises:
comparing the closeness between the song accompaniment and the user's singing audio; if the closeness is greater than a preset value, controlling the avatar to make a preset first expression and first action; otherwise, controlling the avatar to make a preset second expression and second action.
As can be seen from the above description, when the method is applied to a karaoke scene, the closeness between the song accompaniment and the audio sung by the user is compared. The comparison measures how close the intonation of the user's singing is to the standard intonation of the song: the closer they are, the more accurately the user is singing. Controlling the avatar to make different expressions and actions according to the comparison result improves the user experience.
Further, when the user's singing audio is not received, the avatar is controlled to shrink or to make actions that guide and encourage the user to sing;
and when the user's singing audio is received, the avatar is enlarged.
As can be seen from the above description, the presentation of the avatar is controlled according to whether the user's singing audio is being input, so the user can tell intuitively whether their singing is being captured, further improving the user experience.
Further, the method also comprises:
receiving a code-scanning request from a user, displaying a user image according to the code-scanning request, and controlling the user image to perform actions following the avatar.
As can be seen from the above description, by scanning a QR code the user can display their own image on the screen and have it dance along with the avatar, which adds interest.
Further, the method also comprises:
receiving a preset third action;
and controlling the user image to perform the preset third action.
As can be seen from the above description, the user can set the actions performed by their own image, so the user image can not only follow the avatar's actions but also perform the user's own actions, which improves flexibility.
Further, the method also comprises:
receiving sensor information sent by a handheld device;
and controlling the avatar to perform the action corresponding to the sensor information.
As can be seen from the above description, the avatar can not only dance by performing actions derived from the audio data, but also dance according to the movements of the user holding the handheld device, so the user can freely and flexibly control the avatar's actions with their own movements, which adds further interest.
Referring to FIG. 2, an interactive terminal based on audio and an avatar comprises a memory, a processor and a computer program stored in the memory and executable on the processor; the processor implements the steps of the above interactive method based on audio and an avatar when executing the computer program.
Example 1
The interactive method based on audio and an avatar of the present invention can be applied to any scene in which actions need to be performed according to audio, such as dancing machines and KTV; it is explained below in combination with a specific application scene:
referring to fig. 1, an interactive method based on audio and avatar includes the steps of:
acquiring played audio data in real time;
the audio data may include audio of a singer, audio of any music, audio of a video file, and the like;
converting the audio data into beat data;
i.e. beat signals in the audio data are identified, for a total of 8 beats: 1/4 beats, 2/4 beats, 3/4 beats, 4/4 beats, 3/8 beats, 6/8 beats, 8/8 beats, 8/16 beats;
if the audio data is a song, determining whether the song is a fast song, a slow song, a common song or the like according to the beat;
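As a rough, non-authoritative sketch (the patent does not prescribe a particular beat-tracking algorithm), the conversion of audio data into beat data and a tempo-based fast/slow classification might look as follows; the tempo thresholds are assumptions, and a real-time system would feed a streaming buffer rather than a file:

```python
# Illustrative sketch only: one way the "audio data -> beat data" step could be
# realized, using librosa's onset-based beat tracker. The tempo thresholds used
# to label fast/slow/ordinary songs are assumptions, not values from the patent.
import librosa

def extract_beat_data(path: str):
    y, sr = librosa.load(path, sr=None, mono=True)           # decode the audio
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)  # beat instants (s)
    tempo = float(tempo)
    if tempo >= 120:
        song_type = "fast"
    elif tempo <= 80:
        song_type = "slow"
    else:
        song_type = "ordinary"
    return tempo, beat_times, song_type
```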
choreographing corresponding actions in real time according to the beat data, and controlling the avatar to perform the actions in time with the audio data and to output and display them;
for example, corresponding dance actions can be choreographed according to the song type determined from the beat data, and the avatar is controlled to perform them in time with the audio data;
preferably, the method further comprises extracting a temporal feature of the audio data;
the temporal feature comprises the period in which the audio data is currently located, such as the opening, an interlude, the ending, a climax, or an ordinary sung verse;
the choreographing of corresponding actions in real time according to the beat data then comprises:
choreographing corresponding actions in real time according to the temporal feature and the beat data;
for each time period, a plurality of corresponding actions are designed for each song type; for example, when the audio data being played is in a climax stage and is a slow song, corresponding actions can be choreographed in real time and the avatar controlled to perform them;
the action choreographed in real time may be randomly selected from the plurality of actions corresponding to the period in which the audio data is currently located;
wherein an action comprises a sequence of sub-actions, each frame of the action containing one sub-action;
each action is formed by combining a plurality of sub-actions, one per frame, and generating actions by combining sub-actions ensures their rich diversity;
in a preferred approach, the first frame and the last frame of each action may be the same sub-action;
making the first and last frames of every action the same sub-action ensures continuity between different actions;
and each action has a unique code, so that during real-time choreography the corresponding action can be called simply by supplying its code and performed by the avatar.
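A minimal sketch of such an action library, assuming a simple in-memory structure; all periods, song types, sub-action names and codes here are illustrative, not values from the patent:

```python
# Minimal sketch of the action library described above. The first and last frame
# of every action reuse the same sub-action so consecutive actions join smoothly.
import random
from dataclasses import dataclass

@dataclass
class Action:
    code: str          # unique code used to call the action
    frames: list[str]  # one sub-action per frame

ACTION_LIBRARY = {     # (period, song_type) -> candidate actions
    ("climax", "slow"): [
        Action("A001", ["idle", "sway_left", "spin", "idle"]),
        Action("A002", ["idle", "wave_arms", "step_back", "idle"]),
    ],
    ("opening", "fast"): [
        Action("B001", ["idle", "jump", "clap", "idle"]),
    ],
}

def pick_action(period: str, song_type: str) -> Action:
    """Randomly select one of the actions designed for this period and type."""
    return random.choice(ACTION_LIBRARY[(period, song_type)])

def call_action(code: str) -> Action:
    """Call an action by its unique code, as described in the text."""
    for actions in ACTION_LIBRARY.values():
        for action in actions:
            if action.code == code:
                return action
    raise KeyError(code)
```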
Example 2
This example differs from Example 1 in that the method further comprises:
extracting an emotional feature of the audio data;
specifically, different emotional types of music can be identified from the audio data, such as excited, angry, sad and relaxed;
the emotional feature of the audio data can be extracted as follows:
extracting sound-signal features and music-score features of audio data to be trained, and training an emotion recognition model;
extracting sound-signal features and music-score features of the audio data to be recognized;
inputting the sound-signal features and music-score features of the audio data to be recognized into the emotion recognition model, and taking the recognized emotion of the audio data as its emotional feature.
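The patent names no concrete features or classifier; the following is a hedged sketch of the train-then-recognize flow, assuming mean MFCCs as the sound-signal features and an SVM as the emotion recognition model, with music-score features omitted for brevity:

```python
# Illustrative sketch of the train-then-recognize flow above. Mean MFCCs and an
# SVM classifier are assumptions here; the patent does not specify either.
import librosa
import numpy as np
from sklearn.svm import SVC

def sound_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return mfcc.mean(axis=1)  # one fixed-length vector per clip

def train_emotion_model(paths: list[str], labels: list[str]) -> SVC:
    X = np.stack([sound_features(p) for p in paths])
    model = SVC()
    model.fit(X, labels)      # labels e.g. "excited", "angry", "sad", "relaxed"
    return model

def recognize_emotion(model: SVC, path: str) -> str:
    return str(model.predict(sound_features(path)[None, :])[0])
```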
the choreographing of corresponding actions in real time according to the temporal feature and the beat data then comprises:
combining the emotional feature and the beat data into different audio types;
specifically, the 8 meters and 4 emotional features can be combined into 32 audio types, such as excited 1/4 time, angry 1/4 time, and so on;
choreographing corresponding actions in real time according to the temporal feature and the audio type;
specifically, the temporal feature comprises the time period in which the audio data is currently located;
for each audio type, a plurality of actions are designed for each time period of the audio;
the choreographing of corresponding actions in real time according to the temporal feature and the audio type comprises:
randomly selecting one action, according to the temporal feature and the audio type, from the plurality of actions corresponding to the time period in which the audio data is currently located;
for example, when music starts playing and the type "excited 1/4 time" is recognized, one of the actions designed as opening dances can be randomly selected, and the avatar then begins to dance.
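A small sketch of combining the 8 meters with the 4 emotional features into 32 audio types and randomly selecting an opening action; the action codes are illustrative:

```python
# Sketch of combining the 8 meters with the 4 emotional features into 32 audio
# types and randomly choosing an opening action (action codes are illustrative).
import random
from itertools import product

METERS = ["1/4", "2/4", "3/4", "4/4", "3/8", "6/8", "8/8", "8/16"]
EMOTIONS = ["excited", "angry", "sad", "relaxed"]
AUDIO_TYPES = [f"{e} {m}" for e, m in product(EMOTIONS, METERS)]  # 32 types

OPENING_ACTIONS = {  # opening dances designed per audio type (illustrative)
    "excited 1/4": ["B001", "B002", "B003"],
}

def opening_action(emotion: str, meter: str) -> str:
    audio_type = f"{emotion} {meter}"
    assert audio_type in AUDIO_TYPES
    return random.choice(OPENING_ACTIONS[audio_type])

# e.g. opening_action("excited", "1/4") returns one of B001/B002/B003
```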
Example 3
This example differs from Example 1 in that the method further comprises:
extracting a pitch feature of the audio data;
the audio data comprises lyric information;
the choreographing of corresponding actions in real time according to the temporal feature and the beat data comprises:
determining the lyric position of the currently acquired audio data according to the temporal feature, the beat data and the pitch feature, wherein the position of each sentence of lyrics comprises a starting time point, beat change points and pitch change points;
if the current lyrics are at a starting time point, a pitch change point or a beat change point of the lyric position, scheduling the corresponding action;
if the current lyrics are at a lyric position other than a starting time point, a pitch change point or a beat change point, scheduling a random action from a preset action set.
For example, if the lyrics in the current audio data are determined to be at the starting time point of a lyric position, action 1 is selected; when the time point at which the pitch of the lyrics changes within the sentence is identified, action 2 is selected; at a beat change point, action 3 is selected; and so on. The remaining parts of the lyrics are filled with a random action from a preset action set containing several preset actions, such as actions 7, 8, 9 and 10. Concretely:
Lyric position:   sentence start | filler   | pitch rise | filler    | pitch fall | filler   | beat change point
Scheduled action: action 1       | action 8 | action 2   | action 10 | action 2   | action 7 | action 3
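A minimal sketch of this scheduling rule, assuming the sentence starts and change points have already been detected; the action names mirror the table above:

```python
# Minimal sketch of the scheduling rule in the table above: mapped actions at
# sentence starts, pitch changes and beat changes, and a random preset action in
# between. Detecting the change points themselves is assumed already done.
import random
from typing import Optional

EVENT_ACTIONS = {
    "sentence_start": "action 1",
    "pitch_change": "action 2",   # both rises and falls map to action 2
    "beat_change": "action 3",
}
PRESET_ACTION_SET = ["action 7", "action 8", "action 9", "action 10"]

def schedule_action(event: Optional[str]) -> str:
    """event is a key of EVENT_ACTIONS, or None between change points."""
    if event in EVENT_ACTIONS:
        return EVENT_ACTIONS[event]
    return random.choice(PRESET_ACTION_SET)  # random filler between points
```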
Example 4
This example differs from the three examples above in that the method further comprises:
when no audio data can be acquired, prompting, through the avatar, for audio data to be input;
specifically, the avatar can perform different dances according to the user's microphone input: when there is no microphone input, the avatar prompts the user to sing, and when there is microphone input, the avatar dances to the music;
preferably, in a KTV scene where the user sings along with the accompaniment, the audio data comprises the song accompaniment and the user's singing audio, i.e. the audio input by the user's singing;
in this case, the closeness between the song accompaniment and the user's singing audio can be compared; if the closeness is greater than a preset value, the avatar is controlled to make a preset first expression and first action; otherwise, the avatar is controlled to make a preset second expression and second action;
specifically, when the audio sung by the user is close to the song accompaniment (the intonation is close), the avatar makes joyful, approving dance poses and expressions to encourage the user; when the user's singing deviates further from the song accompaniment, i.e. the user sings off-key more often, the avatar makes dejected dance poses and expressions;
to hold the user's attention, when the user's singing audio is not received, the avatar is controlled to shrink or to make actions that guide and encourage the user to sing;
when the user's singing audio is received, the avatar is enlarged and controlled to dance along with the user's singing;
specifically, before the user begins singing, the avatar may shrink to a small figure or icon, and once the user begins singing, the avatar enlarges again and begins to dance.
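The patent does not define how the closeness is computed; one hedged sketch is to compare pitch tracks of the accompaniment and the singing and map the average deviation to a score. The pyin pitch range, the threshold, and the avatar.play(...) interface are assumptions:

```python
# Hedged sketch of the closeness comparison. Intonation closeness is
# approximated here by comparing pyin pitch tracks and mapping the mean
# deviation (in semitones) to a 0..1 score; this metric is an assumption.
import librosa
import numpy as np

def pitch_track(y: np.ndarray, sr: int) -> np.ndarray:
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    return f0[voiced]  # keep voiced frames only

def closeness(accompaniment: np.ndarray, singing: np.ndarray, sr: int) -> float:
    ref, sung = pitch_track(accompaniment, sr), pitch_track(singing, sr)
    n = min(len(ref), len(sung))
    if n == 0:
        return 0.0
    dev = np.mean(np.abs(12.0 * np.log2(sung[:n] / ref[:n])))  # semitones
    return float(max(0.0, 1.0 - dev / 12.0))

def react(avatar, score: float, threshold: float = 0.8) -> None:
    # avatar.play(...) is a hypothetical interface, not from the patent
    if score > threshold:
        avatar.play("first_expression", "first_action")    # joyful, approving
    else:
        avatar.play("second_expression", "second_action")  # dejected
```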
Example 5
This example differs from the four examples above in that the method further comprises:
receiving a code-scanning request from a user, displaying a user image according to the code-scanning request, and controlling the user image to perform actions following the avatar;
specifically, while the avatar dances to the audio according to the methods of the examples above, the user can scan a QR code to display their own image on the screen and have it dance along with the avatar;
preferably, the user may preset their own dance actions; that is, the method further comprises:
receiving a preset third action, i.e. a dance action preset by the user;
and controlling the user image to perform the preset third action, producing the effect of two figures dancing independently on the screen, which adds further interest;
preferably, the user can also control the actions of the avatar;
that is, the method further comprises:
receiving sensor information sent by a handheld device;
and controlling the avatar to perform the action corresponding to the sensor information;
specifically, the user dances together with the avatar while holding a mobile phone or another handheld device; the handheld device transmits its sensor information to the central control video terminal, and the avatar then dances according to the user's gestures.
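A sketch of one possible mapping from handheld-device sensor readings to avatar actions; the message format, the thresholds and the action names are all assumptions:

```python
# Sketch of mapping handheld-device sensor readings to avatar actions; the
# message format, thresholds and action names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class SensorInfo:
    accel_x: float  # lateral acceleration, m/s^2
    accel_y: float  # vertical acceleration, m/s^2

def action_for(info: SensorInfo) -> str:
    if info.accel_y > 12.0:       # sharp upward motion of the handset
        return "jump"
    if abs(info.accel_x) > 8.0:   # strong sideways swing
        return "sway_left" if info.accel_x < 0 else "sway_right"
    return "idle"

# The terminal would receive SensorInfo (e.g. over Bluetooth or Wi-Fi) and call
# avatar.play(action_for(info)) for each reading that arrives.
```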
Example 6
Referring to FIG. 2, an interactive terminal 1 based on audio and an avatar comprises a memory 2, a processor 3 and a computer program stored in the memory 2 and executable on the processor 3; the processor 3 implements the steps of the interactive method based on audio and an avatar of the above examples when executing the computer program.
In summary, the interactive method and terminal based on audio and an avatar provided by the present invention visualize audio through the avatar. Corresponding actions can be choreographed in real time according to the temporal feature, pitch feature, emotional feature, beat and lyrics of the audio data acquired in real time, and the avatar is controlled to perform them, without any correspondence between audio and avatar actions being configured in advance. Any audio can therefore be converted into a dancing avatar, audio can be added or removed at any time, flexibility is high, production cost is low, and any audio can be visualized. The method is particularly suited to karaoke scenes, where it can help guide the user's singing and improve the user experience. Moreover, the user can control the avatar's actions, and a second figure representing the user can be shown on the same screen as the avatar; it can either follow the avatar's dance or perform dance actions set by the user, so the two figures dance together, further improving the user experience.
The above description presents only embodiments of the present invention and does not limit the scope of the invention; all equivalent changes made using the contents of this specification and the drawings, whether applied directly or indirectly in related technical fields, are likewise included within the scope of the present invention.

Claims (12)

1. An interactive method based on audio and an avatar, characterized by comprising the steps of:
acquiring the audio data being played in real time;
converting the audio data into beat data;
and choreographing corresponding actions in real time according to the beat data, and controlling the avatar to perform the actions in time with the audio data and to output and display them.
2. The method of claim 1, further comprising:
extracting a temporal feature of the audio data;
wherein the choreographing of corresponding actions in real time according to the beat data comprises:
choreographing corresponding actions in real time according to the temporal feature and the beat data.
3. The method of claim 2, further comprising:
extracting an emotional feature of the audio data;
wherein the choreographing of corresponding actions in real time according to the temporal feature and the beat data comprises:
combining the emotional feature and the beat data into different audio types;
and choreographing corresponding actions in real time according to the temporal feature and the audio type.
4. The method of claim 3, wherein the temporal feature comprises the time period in which the audio data is currently located;
for each audio type, a plurality of actions are designed for each time period of the audio data;
and the choreographing of corresponding actions in real time according to the temporal feature and the audio type comprises:
randomly selecting one action, according to the temporal feature and the audio type, from the plurality of actions corresponding to the time period in which the audio data is currently located.
5. The method of claim 2, further comprising:
extracting a pitch feature of the audio data;
wherein the audio data comprises lyric information;
and the choreographing of corresponding actions in real time according to the temporal feature and the beat data comprises:
determining the lyric position of the currently acquired audio data according to the temporal feature, the beat data and the pitch feature, wherein the position of each sentence of lyrics comprises a starting time point, beat change points and pitch change points;
if the current lyrics are at a starting time point, a pitch change point or a beat change point of the lyric position, scheduling the corresponding action;
if the current lyrics are at a lyric position other than a starting time point, a pitch change point or a beat change point, scheduling a random action from a preset action set.
6. The method of claim 1, further comprising:
when no audio data can be acquired, prompting, through the avatar, for audio data to be input.
7. The method of claim 2, wherein the audio data comprises a song accompaniment and the user's singing audio;
the method further comprising:
comparing the closeness between the song accompaniment and the user's singing audio; if the closeness is greater than a preset value, controlling the avatar to make a preset first expression and first action; otherwise, controlling the avatar to make a preset second expression and second action.
8. The method of claim 7, wherein when the user's singing audio is not received, the avatar is controlled to shrink or to make actions that guide and encourage the user to sing;
and when the user's singing audio is received, the avatar is enlarged.
9. The method of claim 1, further comprising:
receiving a code-scanning request from a user, displaying a user image according to the code-scanning request, and controlling the user image to perform actions following the avatar.
10. The method of claim 9, further comprising:
receiving a preset third action;
and controlling the user image to perform the preset third action.
11. The method of claim 1, further comprising:
receiving sensor information sent by a handheld device;
and controlling the avatar to perform the action corresponding to the sensor information.
12. An interactive terminal based on audio and an avatar, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the interactive method based on audio and an avatar of any one of claims 1 to 11.
CN201910785707.0A (priority and filing date 2019-08-23): Interactive method and terminal based on audio and virtual image. Pending; published as CN112402952A.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910785707.0A | 2019-08-23 | 2019-08-23 | Interactive method and terminal based on audio and virtual image (published as CN112402952A)


Publications (1)

Publication Number | Publication Date
CN112402952A | 2021-02-26

Family

Family ID: 74779422

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN201910785707.0A | Interactive method and terminal based on audio and virtual image | 2019-08-23 | 2019-08-23 | Pending (CN112402952A)

Country Status (1)

Country | Publication
CN | CN112402952A (en)


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101524594A (en) * 2009-04-17 2009-09-09 中国地质大学(武汉) Anthropomorphic robot autonomously dancing along with rhythm
CN101587706A (en) * 2009-07-08 2009-11-25 沈阳蓝火炬软件有限公司 System and method for analyzing streaming-media real-time music beats and controlling dance
CN102340482A (en) * 2010-07-21 2012-02-01 崔信奎 Karaoke room service system based on network and user terminal using same
CN103037945A (en) * 2010-04-30 2013-04-10 方瑞麟 Interactive device with sound-based action synchronization
CN104766044A (en) * 2014-01-07 2015-07-08 富士通株式会社 Evaluation method and evaluation device
JP2015184390A (en) * 2014-03-20 2015-10-22 カシオ計算機株式会社 Chord playing timing designation device, method, and program
CN105701196A (en) * 2016-01-11 2016-06-22 北京光年无限科技有限公司 Intelligent robot oriented audio processing method and intelligent robot
CN107422862A (en) * 2017-08-03 2017-12-01 嗨皮乐镜(北京)科技有限公司 A kind of method that virtual image interacts in virtual reality scenario
CN207343151U (en) * 2017-03-07 2018-05-11 郑州市卧龙游乐设备有限公司 A kind of removable intelligent audio identification music fountain


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113590872A (en) * 2021-07-28 2021-11-02 广州艾美网络科技有限公司 Method, device and equipment for generating dance spectral plane
CN113590872B (en) * 2021-07-28 2023-11-28 广州艾美网络科技有限公司 Method, device and equipment for generating dancing spectrum surface

Similar Documents

Publication | Title
KR100591995B1 (en) Display control device, display control method and recording medium of character
CN111052223B (en) Playback control method, playback control device, and recording medium
US20100022287A1 (en) Dance game machine, method for scoring dance game and computer-readable recording medium
EP2760014A1 (en) Method for making audio file and terminal device
US20060142082A1 (en) Motion analyzing apparatus and method for a portable device
JP2004037575A (en) Performance processor, performance processing program and file generation system
CN105405337B (en) The method and system that a kind of supplementary music is played
KR102495888B1 (en) Electronic device for outputting sound and operating method thereof
JP2014023745A (en) Dance teaching device
JP2001215963A (en) Music playing device, music playing game device, and recording medium
US20210335331A1 (en) Image control system and method for controlling image
CN112402952A (en) Interactive method and terminal based on audio and virtual image
JP6501344B2 (en) Karaoke scoring system considering listener's evaluation
JPH10161673A (en) Karaoke device
JP2020014716A (en) Singing support device for music therapy
JP2014035436A (en) Voice processing device
JP2002006866A (en) Karaoke sing-along machine
KR101572784B1 (en) Control method for music and present outomat
JP2002159741A (en) Game device and information storage medium
JPH10143151A (en) Conductor device
JP6065871B2 (en) Performance information display device and performance information display program
JP6920135B2 (en) Karaoke equipment
Jylhä et al. Design and evaluation of human-computer rhythmic interaction in a tutoring system
JP2017198922A (en) Karaoke device
CN111695777A (en) Teaching method, teaching device, electronic device and storage medium

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2021-02-26