CN108776985A - Voice processing method, apparatus, device and readable storage medium

Voice processing method, apparatus, device and readable storage medium

Info

Publication number
CN108776985A
Authority
CN
China
Prior art keywords
animation
voice
scene
acoustic
characteristic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810568421.2A
Other languages
Chinese (zh)
Inventor
汪守成
卢洁
蔡申
彭元涛
慕壮
王开峻
余飞
吴作鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201810568421.2A
Publication of CN108776985A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

This application discloses a voice processing method, apparatus, device and readable storage medium. Voice data is acquired in a voice input scene or a voice playing scene; the voice data may be input voice data or voice data to be played. An acoustic feature value set of the voice data is then obtained, the set containing feature values of at least one type of acoustic feature. With reference to the feature value of each type of acoustic feature in the set, the element state of each animation element is determined, a scene animation is constructed according to those element states, and the constructed scene animation is displayed. Therefore, a scene animation can be constructed from the acoustic features of the voice data in both voice input and voice playing scenes and displayed in the corresponding scene, so that the user can intuitively learn the acoustic features of the voice data from the scene animation, which enhances the user's understanding of the voice data.

Description

Voice processing method, apparatus, device and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech processing method, apparatus, device, and readable storage medium.
Background
As mobile communication technology matures and artificial intelligence applications spread, instant voice messaging is used by more and more people, and the accuracy of speech-to-text conversion has improved greatly. Because the high recognition rate substantially improves work efficiency, the number of people who use voice to input text is growing rapidly, and the time people spend inputting voice and experiencing voice interaction every day keeps increasing.
Voice input is chosen by more and more users: it improves efficiency, overturns the traditional keyboard typing mode, and is even gradually creating brand new application scenarios in the era of human-computer interaction. However, in existing voice input and playing processes, the interface display is overly simple: only a voice input control or the voice bar to be played is shown on the interface. Taking the voice input process as an example, the user clicks or long-presses a control to input voice. During input, the user cannot intuitively learn detailed information about the input voice, such as its volume, which reduces the user's understanding of the input voice.
Disclosure of Invention
In view of this, the present application provides a voice processing method, apparatus, device and readable storage medium, which are used to solve the problem that the interface display in existing voice interaction is overly simple, resulting in a low degree of user understanding of the voice.
In order to achieve the above object, the following solutions are proposed:
a method of speech processing comprising:
acquiring voice data in a voice input scene or a voice playing scene;
acquiring a set of acoustic feature values of the voice data, wherein the set of acoustic feature values comprises feature values of at least one type of acoustic features;
determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set;
and constructing scene animation according to the element state of each animation element, and displaying the scene animation.
Preferably, the acquiring the set of acoustic feature values of the voice data includes:
and obtaining the characteristic values of any one or more of the loudness characteristic, the tone characteristic and the speech speed characteristic of the voice data to form an acoustic characteristic value set.
Preferably, the determining the element state of each animation element by referring to the feature value of each type of acoustic feature in the acoustic feature value set includes:
determining an animation element corresponding to each type of acoustic features in the acoustic feature value set;
and determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic.
Preferably, the determining the element state of the corresponding animation element according to the feature value of each type of acoustic feature comprises:
determining a characteristic value interval to which a characteristic value of each type of acoustic characteristic belongs;
and determining the element state of the animation element corresponding to the characteristic value interval.
Preferably, the determining the element state of the corresponding animation element according to the feature value of each type of acoustic feature comprises:
and determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic and the value range of the element state of the corresponding animation element.
Preferably, the constructing the scene animation according to the element state of each animation element comprises:
selecting animation element materials corresponding to the element states according to the element states of the animation elements;
and constructing scene animation by using the selected animation element materials.
Preferably, the acquiring voice data in a voice input scene or a voice playing scene includes:
responding to the triggering operation of a user on a voice input control displayed in a voice input interface, and acquiring voice data through a microphone assembly, wherein the voice input interface comprises: a conversation interface, an information input interface and an information retrieval interface;
the displaying the scene animation includes:
and displaying the scene animation on the voice input interface.
Preferably, the acquiring voice data in a voice input scene or a voice playing scene includes:
responding to a triggering operation of a user on a conversation voice control displayed in a voice playing interface, and acquiring received conversation voice data corresponding to the conversation voice control, wherein the voice playing interface comprises: a session interface;
the displaying the scene animation includes:
and displaying the scene animation on the voice playing interface.
Preferably, the element states include any one or more of: element size, element position, element posture, element color, element image and element visible and invisible conditions.
A speech processing apparatus comprising:
the voice acquisition unit is used for acquiring voice data in a voice input scene or a voice playing scene;
the acoustic feature value acquisition unit is used for acquiring an acoustic feature value set of the voice data, wherein the acoustic feature value set comprises feature values of at least one type of acoustic features;
the element state determining unit is used for determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set;
the animation construction unit is used for constructing scene animation according to the element states of all animation elements;
and the animation display unit is used for displaying the scene animation.
Preferably, the acoustic feature value obtaining unit includes:
and the first acoustic characteristic value acquisition subunit is used for acquiring characteristic values of any one or more of loudness characteristics, tone characteristics and speech speed characteristics of the voice data to form an acoustic characteristic value set.
Preferably, the element state determination unit includes:
the animation element determining unit is used for determining an animation element corresponding to each type of acoustic features in the acoustic feature value set;
and the characteristic value corresponding unit is used for determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic.
Preferably, the feature value corresponding unit includes:
the interval determining unit is used for determining a characteristic value interval to which a characteristic value of each type of acoustic characteristic belongs;
and the interval corresponding unit is used for determining the element state of the animation element corresponding to the characteristic value interval.
Preferably, the feature value corresponding unit includes:
and the characteristic value calculating unit is used for determining the element state corresponding to the animation element according to the characteristic value of each type of acoustic characteristic and the value range of the element state corresponding to the animation element.
Preferably, the animation construction unit includes:
the material selecting unit is used for selecting animation element materials corresponding to the element states according to the element states of the animation elements;
and the material combination unit is used for constructing the scene animation by utilizing the selected animation element materials.
Preferably, the voice acquiring unit includes:
the first voice acquiring subunit is configured to respond to a user's trigger operation on a voice input control displayed in a voice input interface, and acquire voice data through a microphone assembly, where the voice input interface includes: a conversation interface, an information input interface and an information retrieval interface;
the animation display unit includes:
and the first animation display subunit is used for displaying the scene animation on the voice input interface.
Preferably, the voice acquiring unit includes:
a second voice obtaining subunit, configured to respond to a user's trigger operation on a session voice control displayed in a voice playing interface, and obtain received session voice data corresponding to the session voice control, where the voice playing interface includes: a session interface;
the animation display unit includes:
and the second animation display subunit is used for displaying the scene animation on the voice playing interface.
A speech processing device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the speech processing method as described above.
A readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the steps of the speech processing method as previously introduced.
It can be seen from the foregoing technical solutions that, in the voice processing method provided in this embodiment of the present application, voice data in a voice input scene or a voice playing scene is obtained; the voice data may be input voice data or voice data to be played. An acoustic feature value set of the voice data is then obtained, the set including feature values of at least one type of acoustic feature. The element state of each animation element is determined with reference to the feature value of each type of acoustic feature in the set, a scene animation is constructed according to the element states of the animation elements, and the constructed scene animation is displayed. Therefore, a scene animation can be constructed from the acoustic features of the voice data in both the voice input scene and the voice playing scene and displayed in the corresponding scene, so that the user can intuitively learn the acoustic features of the voice data from the scene animation, which enhances the user's understanding of the voice data.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIGS. 1-3 illustrate several voice interaction interface diagrams;
FIG. 4 is a flowchart of a speech processing method disclosed in an embodiment of the present application;
FIG. 5 illustrates a diagram of a session interface playing speech;
FIG. 6 is a diagram illustrating correspondence between acoustic features and scene animations;
FIG. 7 is a schematic diagram of a scenario embodiment of the present application;
FIG. 8 is a schematic diagram of another exemplary scenario of the present application;
FIG. 9 is a schematic diagram illustrating a correspondence between acoustic features and animation elements in a scene embodiment of the present application;
FIGS. 10-14 are schematic diagrams of further exemplary embodiments of the present application;
fig. 15 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 16 is a block diagram of a hardware structure of a speech processing device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To solve the problems that, in existing voice input and interaction scenarios, a user cannot intuitively learn detailed information about the input voice during input and the user's understanding of the input voice is therefore reduced, the inventors carried out research.
During this research, the inventors first considered a scheme in which an additional voice indicator image is displayed on the interface, and the change of the image is controlled according to whether voice is present. Figs. 1-3 show three such designs.
In fig. 1, whether a speech signal is received is indicated by a change in the number of stacked bars M1. When no speech signal is received, the number of bars is kept at 3; when a speech signal is received, the number of bars M1 is increased to 4.
In fig. 2, whether a voice signal is received is represented by the expansion of circles M2. When no voice signal is received, the number of circles M2 is kept at 1; when a voice signal is received, the number of circles M2 is increased to 2.
In fig. 3, whether a speech signal is received is represented by a change in the ripple M3. For example, when no voice signal is received, the ripple M3 is kept as a horizontal line; when a voice signal is received, the ripple M3 changes from a straight line to a curve.
With these designs, during voice input and playback the presence or absence of a voice signal can be determined from the image displayed on the interface, and the distribution of the voice signal within the voice data can be intuitively perceived.
However, further research showed that these solutions still offer only a single form of interaction: the user can only determine the presence or absence of a speech signal from the displayed image and cannot learn more detailed features of the speech data, such as pitch, timbre, loudness and speech rate. Moreover, the interaction is monotonous and the actual user experience is poor.
Therefore, the inventors carried out further intensive research and finally arrived at the solution described in the following embodiments. It should be noted that the present application is based on the idea that, when voice data exists in a scene, a scene animation is created from that voice data and played while the voice data is being interacted with, so as to improve the user's comprehension of the voice data.
Referring to fig. 4, fig. 4 is a flowchart of a speech processing method disclosed in the embodiment of the present application. As shown in fig. 4, the method includes:
and S100, acquiring voice data in a voice input scene or a voice playing scene.
Specifically, a voice input scene is a scene in which voice input is performed, for example a scene in which a user inputs chat content by voice while chatting with another person, or a scene in which the user inputs text information by voice during information entry or information retrieval. A voice playing scene is a scene in which voice is played, for example a voice chat scene in which the voice sent by the other party is played, or a scene in which acquired voice data is played in another application.
It should be noted that the voice data in this step may be acquired in real time; for example, while the user is recording voice, the recorded voice data may be acquired in real time. In addition, in a voice playing scene, the played voice data may be acquired in real time during playback, or the complete voice data to be played may be acquired directly.
In this step, the method of acquiring the voice data is not limited.
Step S110, obtaining an acoustic feature value set of the voice data, wherein the acoustic feature value set comprises feature values of at least one type of acoustic features.
Specifically, the speech data may have various types of acoustic features, such as loudness features, pitch features, timbre features, speech rate features, and the like. In this step, feature values of various types of acoustic features can be acquired according to the voice data acquired in the previous step.
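As a purely illustrative sketch (the patent does not prescribe any particular extraction algorithm), the feature values named above could be computed per audio frame roughly as follows, using the root-mean-square amplitude for loudness, the zero-crossing rate for speech rate, and a naive autocorrelation peak for pitch; all function and variable names are assumptions of this sketch:

```python
import numpy as np

def acoustic_feature_values(frame, sample_rate=16000):
    """Rough per-frame acoustic feature values (illustrative only)."""
    frame = np.asarray(frame, dtype=np.float64)

    # Loudness feature value: root-mean-square amplitude of the frame.
    amplitude = float(np.sqrt(np.mean(frame ** 2)))

    # Speech-rate feature value: zero-crossing rate (fraction of sign changes).
    zero_crossing_rate = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

    # Pitch feature value: lag of the strongest autocorrelation peak,
    # searched over a typical voice range of roughly 80-400 Hz.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sample_rate // 400, sample_rate // 80
    lag = lo + int(np.argmax(corr[lo:hi])) if hi < len(corr) else lo
    pitch_hz = sample_rate / lag

    return {"loudness": amplitude, "speech_rate": zero_crossing_rate, "pitch": pitch_hz}
```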
Step S120, determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set.
Specifically, the present application may specify in advance the animation elements included in the scene animation to be constructed, and each animation element may have a plurality of element states. Examples of element states include: element size, element position, element posture, element color, element image, element appearance and disappearance, and the like.
The element state of an animation element is associated with the feature value of an acoustic feature; the element state is a visual reflection of that feature value. In this step, the element state of each animation element may be determined according to the feature value of each type of acoustic feature in the acoustic feature value set.
Step S130, constructing a scene animation according to the element state of each animation element, and displaying the scene animation.
Specifically, in this step, a scene animation may be constructed from the animation elements whose element states have been determined, and the scene animation may be displayed in the corresponding voice input scene or voice playing scene. In this way, the scene animation corresponding to the voice is displayed synchronously during voice input or voice playback, which improves the user experience and enhances the user's understanding of the voice.
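Taken together, steps S100-S130 form a simple per-frame loop. The sketch below shows one way such a loop could be organized; every callable passed in is a hypothetical placeholder rather than an interface defined by this application:

```python
def run_scene_animation(audio_source, extract_features, determine_states,
                        build_scene, show_scene):
    """Drive steps S100-S130 for a stream of voice-data frames (illustrative)."""
    for frame in audio_source:                # S100: acquire voice data
        features = extract_features(frame)    # S110: acoustic feature value set
        states = determine_states(features)   # S120: element states of animation elements
        show_scene(build_scene(states))       # S130: construct and display the scene animation
```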
Optionally, in the embodiment of the present application, animation element materials corresponding to different element states may be obtained in advance. For example, the terminal requests, from the server in advance, the animation element materials corresponding to the element states of the different animation elements and stores them locally. When the element state of each animation element has been determined in the previous step, the corresponding animation element materials are selected locally, and the scene animation is constructed using the selected materials.
Of course, the terminal may also request the corresponding animation element materials from the server on demand once the element state of each animation element has been determined, and construct the scene animation using the requested materials.
Taking the "airplane" animation element as an example, suppose its element states include the two states "color value is red" and "color value is blue". Airplane animation element materials of the two color values can be created in advance. When it is determined in step S120 that the element state of the "airplane" is "color value is red", the corresponding animation element material is selected to construct the scene animation.
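As an illustration of this material-selection step, pre-fetched materials could be cached under a key combining the animation element and its element state and looked up once the states are known; the dictionary layout and file names below are assumptions of the sketch, not a format defined by this application:

```python
# Hypothetical local cache of pre-fetched animation element materials,
# keyed by (animation element, element state).
MATERIALS = {
    ("airplane", "color=blue"): "airplane_blue.png",
    ("airplane", "color=red"): "airplane_red.png",
    ("exhaust", "size=small"): "exhaust_small.png",
    ("exhaust", "size=large"): "exhaust_large.png",
}

def select_materials(element_states):
    """Pick one material per animation element according to its element state."""
    return {element: MATERIALS[(element, state)]
            for element, state in element_states.items()}

# Example: element states determined in step S120 for one animation frame.
frame_materials = select_materials({"airplane": "color=red", "exhaust": "size=large"})
```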
The voice processing method provided by the embodiment of the application obtains voice data in a voice input scene or a voice playing scene; the voice data may be input voice data or voice data to be played. It further obtains an acoustic feature value set of the voice data, the set comprising feature values of at least one type of acoustic feature, determines the element state of each animation element with reference to the feature value of each type of acoustic feature in the set, constructs a scene animation according to the element states of the animation elements, and displays the constructed scene animation. Therefore, a scene animation can be constructed from the acoustic features of the voice data in both the voice input scene and the voice playing scene and displayed in the corresponding scene, so that the user can intuitively learn the acoustic features of the voice data from the scene animation, which enhances the user's understanding of the voice data.
Optionally, the embodiment of the present application illustrates two manners of acquiring voice data, which are respectively as follows:
the first method comprises the following steps:
and responding to the triggering operation of the user on the voice input control displayed in the voice input interface, and acquiring voice data through the microphone assembly.
Optionally, the voice input interface includes, but is not limited to, the following: conversation interface, information input interface, information retrieval interface.
Figs. 1 and 2 exemplify a session interface. In fig. 1, the control labeled "release to end" below the interface is the voice input control. The user inputs voice by long-pressing this control, and the input voice is acquired through the microphone assembly.
Corresponding to the voice input interface, the process of displaying the scene animation in step S130 may include:
and displaying the scene animation on a voice input interface.
That is, the user can display the created scene animation on the interface while inputting the voice through the voice input interface.
The second method comprises the following steps:
and responding to the triggering operation of a user on a conversation voice control displayed in a voice playing interface, and acquiring received conversation voice data corresponding to the conversation voice control.
Optionally, the voice playing interface includes, but is not limited to, a conversation interface, such as fig. 5, which illustrates a schematic diagram of a conversation interface playing voice.
In fig. 5, the speech sent from the chat object is displayed on the interface in the form of a conversation speech control. The user can click the conversation voice control to play voice.
Corresponding to the voice playing interface, the process of displaying the scene animation in step S130 may include:
and displaying the scene animation on a voice playing interface.
That is, the user can display the created scene animation on the interface while playing the voice on the voice playing interface.
The embodiment of the present application introduces a process of determining an element state of each animation element by referring to the feature value of each type of acoustic feature in the acoustic feature value set in step S120.
Specifically, the matching relationship between different types of acoustic features and animation elements can be preset. The acoustic features and the animation elements may be in a one-to-one correspondence relationship, or may be in a one-to-many or many-to-one relationship, and may be specifically set by a user.
Based on the preset relationship, the animation element corresponding to each type of acoustic feature in the acoustic feature value set is determined. Then, according to the feature value of each type of acoustic feature, the element state of the corresponding animation element is determined.
Optionally, for each type of acoustic feature and the animation element matched with the acoustic feature, a matching relationship between a feature value interval of the acoustic feature and an element state of the animation element may be preset.
Based on the matching relationship, firstly, a characteristic value interval to which the characteristic value of each type of acoustic characteristic belongs is determined, and then the element state of the animation element corresponding to the characteristic value interval is determined.
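A minimal sketch of this interval matching, assuming the interval boundaries are stored in ascending order together with one element state per interval (the threshold values below are invented for illustration):

```python
import bisect

def state_from_intervals(feature_value, thresholds, states):
    """Map a feature value to an element state via preset feature value intervals.

    `thresholds` holds the ascending interval boundaries; `states` holds one
    element state per interval, i.e. len(states) == len(thresholds) + 1.
    """
    return states[bisect.bisect_right(thresholds, feature_value)]

# Example: loudness split into three intervals selecting one of three exhaust sizes.
exhaust_state = state_from_intervals(0.42, [0.2, 0.6], ["small", "medium", "large"])
```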
In another optional case, for each type of acoustic feature and the animation element matched with it, a rule for determining the element state of the animation element may be preset, where the rule relates the feature value of the acoustic feature to the value range of the element state of the corresponding animation element. On this basis, the element state of the corresponding animation element can be determined from the feature value of each type of acoustic feature and the value range of the element state of the corresponding animation element. For example, suppose the value range of the element state of the animation element is x_1 ~ x_2, the value range of the feature value of the acoustic feature is y_1 ~ y_2, and the feature value of the acoustic feature determined for the current voice data is y_t. The preset rule can then be expressed by the following formula:
x_t = x_1 + (y_t - y_1) / (y_2 - y_1) × (x_2 - x_1)
where x_t represents the real-time element state of the animation element.
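A sketch of this range-to-range rule as a linear mapping, with clamping at the range boundaries added as an assumption of the sketch:

```python
def map_to_state_range(y_t, y1, y2, x1, x2):
    """Linearly map a feature value y_t in [y1, y2] to an element state in [x1, x2]."""
    y_t = min(max(y_t, y1), y2)        # clamp to the feature value range
    return x1 + (y_t - y1) / (y2 - y1) * (x2 - x1)

# Example: a pitch of 240 Hz in the range 80-400 Hz mapped to a tilt angle
# between 0 and 30 degrees gives 15 degrees.
tilt_deg = map_to_state_range(240.0, 80.0, 400.0, 0.0, 30.0)
```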
Further, an animation element may have one or more animation features, and each animation feature has its own element state. Taking the airplane animation element as an example, it may have the two animation features of tilting and lifting; for tilting, different tilt angles correspond to different element states, and for lifting, different climb heights correspond to different element states. The matching relationship between a feature value interval of an acoustic feature and the element state of an animation element in this embodiment may then include: matching the feature value interval of the acoustic feature with the element state of each animation feature of the animation element.
The relationship between the acoustic features and the animation elements and element states in the scene animation is illustrated by a diagram, as shown in fig. 6:
In the case illustrated in fig. 6, the acoustic features include four types: loudness, pitch, speech rate and timbre. In terms of specific acoustic parameters, loudness corresponds to the amplitude of the sound, pitch to its frequency, speech rate to its zero-crossing rate, and timbre to the feature values of the sound.
Animation elements 1-5 (abbreviated as elements in FIG. 6) are included in the corresponding scene animation. In the case illustrated in fig. 6, each animation element corresponds to at most one animation feature (abbreviated as a feature in fig. 6), but of course, a one-to-many relationship between animation elements and animation features may be set. Fig. 6 also illustrates a case where two different animation elements correspond to the same animation feature, that is, a many-to-one relationship between animation elements and animation features may also be set in this embodiment. For each animation feature, there is a corresponding state value, i.e., animation state.
In the present application, a matching relationship model may be established in advance, and the correspondence between acoustic features and animation elements, as well as the correspondence between the feature value intervals of the acoustic features and the element states of the corresponding animation elements, may be determined through mathematical analysis and transformation.
Next, the present application will be described with reference to a specific example.
Fig. 7 illustrates one frame of a scene animation. The scene animation includes the animation elements: pilot Y1, airplane Y2, exhaust gas Y3, cloud Y4 and ground dashed line Y5.
The pilot Y1 has an animation feature of image type; the airplane Y2 has the animation features of color, lifting and tilting; the exhaust gas Y3 has an animation feature of size; and the cloud Y4 and the ground dashed line Y5 serve as reference objects and have an animation feature of translation. Each animation feature corresponds to a plurality of changing element states.
With further reference to fig. 8, fig. 8 illustrates an effect diagram of the chat interface showing scene animation while the user inputs voice.
In terms of acoustic features, the present application takes four types as an example: loudness, pitch, speech rate and timbre. The correspondence between the acoustic features and the animation elements is established in advance, as shown in fig. 9.
Loudness is determined by the amplitude of the sound, pitch by its frequency, speech rate by its zero-crossing rate, and timbre by the feature values of the sound.
In the correspondence of the example of fig. 9:
the loudness of the user's voice represents the power of the aircraft Y2, the higher the loudness, the larger the area of the aircraft emitting exhaust gas Y3. The tail gas Y3 material of three kinds of different area sizes can be designed in advance to this application, corresponds different loudness interval respectively. As in fig. 10-12.
By contrast, from fig. 10 to fig. 12, the area of the exhaust gas gradually increases, which represents that the loudness of the sound is higher and higher.
Further, in the correspondence relationship illustrated in fig. 9:
the climbing condition of the airplane Y2 is represented by the tone of sound, the higher the sound frequency is, the more obvious the climbing state of the airplane Y2 is, the larger the inclination angle of the airplane Y2 is, and the higher the height of the airplane Y2 is. The method can be used for designing two airplane Y2 materials with different climbing conditions in advance, and the two airplane Y2 materials correspond to different tone intervals, namely frequency intervals. As in fig. 13-14.
In contrast, as shown in fig. 14, the aircraft Y2 has an increased pitch angle and an increased altitude, representing a higher and higher pitch of sound, as compared to fig. 13.
Further, in the correspondence relationship illustrated in fig. 9:
the term "speed" indicates the flying speed of the airplane Y2, and the faster the speech speed, the faster the flying speed of the airplane Y2, and the faster the corresponding reference object is translated backward. This can be set by animation parameters, i.e. by setting the speed of the backward translation of the cloud Y4 and the ground dotted line Y5 as reference objects.
Further, in the correspondence relationship illustrated in fig. 9:
the type of pilot Y1 and the color of airplane Y2 are represented in timbre. The inventor finds that the difference of the timbre can reflect the mood, the character and the sex of a person. The gender of the user is distinguished by timbre in this example, and two types of pilot Y1 and two colors of airplane Y2 are provided for boys and girls, respectively. For example, the tone feature values are set to 1 and 2, respectively, with 1 corresponding to boys and 2 corresponding to girls. Two types of pilot types are preset, namely a timbre characteristic value 1 corresponding to small flying and a timbre characteristic value 2 corresponding to hello kitty. Two colors are preset for the airplane, namely a blue color corresponding to the tone characteristic value 1 and a red color corresponding to the tone characteristic value 2.
Through the setting mode, the user can see the content and the color atmosphere which are consistent with the gender of the voice user in the animation, the male can see the blue airplane and the small flying, the female can see the red airplane and the hello kitty, the personalized experience is improved, and the detailed information of the voice user can be visually known.
Next, a procedure of determining correspondence between feature values of acoustic features and element states of animation elements will be described.
Still take the corresponding relationship illustrated in fig. 9 as an example:
1. Define the amplitude of the sound loudness as L, with the range of L defined as L_min ~ L_max, and let the real-time amplitude of the acquired voice data be l_t. In this embodiment, the amplitude range may be divided into three equally spaced intervals, corresponding respectively to the three preset exhaust gas states (P_1, P_2, P_3). The exhaust gas area size then corresponds to the following formula:
p_t = P_1 if l_t ∈ [L_min, L_min + (L_max - L_min)/3), p_t = P_2 if l_t ∈ [L_min + (L_max - L_min)/3, L_min + 2(L_max - L_min)/3), p_t = P_3 if l_t ∈ [L_min + 2(L_max - L_min)/3, L_max]
where p_t represents the real-time state selection of the exhaust gas.
Of course, the above only illustrates one implementation that divides the amplitude range into thirds; it is understood that the amplitude range may be divided into a different number of intervals, and the division is not limited to equal thirds.
2. Define the frequency of the sound pitch as f, with the range of f defined as f_min ~ f_max, and let the real-time frequency of the acquired voice data be f_t.
<1> Define the airplane tilt angle as φ, namely the angle between the airplane body and the horizontal line, with φ varying over the range φ_min ~ φ_max. The airplane tilt angle then corresponds to the following formula:
φ_t = φ_min + (f_t - f_min) / (f_max - f_min) × (φ_max - φ_min)
where φ_t represents the real-time tilt angle of the flying airplane.
<2> Define the flying height of the airplane in the animation as H, with H varying over the range H_min ~ H_max. The airplane flying height then corresponds to the following formula:
h_t = H_min + (f_t - f_min) / (f_max - f_min) × (H_max - H_min)
where h_t represents the real-time flying height of the airplane.
3. Define the zero-crossing rate of the voice as γ, with a value range of 0 ~ γ_max, and let the real-time zero-crossing rate of the acquired voice data be γ_t. Let the playing speed of the reference objects be V, with a value range of V_min ~ V_max. The playing speed of the reference objects in the animation then corresponds to the following formula:
V_t = V_min + γ_t / γ_max × (V_max - V_min)
where V_t represents the real-time playing speed of the reference objects in the animation.
4. Define the timbre type as x, with value range {1, 2}. The airplane color is a, and a has two states, blue and red, corresponding to a_1 and a_2, respectively; the pilot category is b, and b has two states, "small fly" and Hello Kitty, corresponding to b_1 and b_2, respectively. The airplane color and pilot category then correspond to the following formula:
(a_t, b_t) = (a_1, b_1) if x = 1, and (a_t, b_t) = (a_2, b_2) if x = 2.
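Pulling the four mappings of this example together, the element states of one frame of the airplane scene could be derived roughly as follows; the value ranges, default arguments, state names and the function itself are illustrative assumptions rather than parameters fixed by this application:

```python
def airplane_frame_states(amplitude, frequency, zero_cross_rate, timbre_type,
                          L=(0.0, 1.0), F=(80.0, 400.0), G_MAX=0.5,
                          PHI=(0.0, 30.0), H=(0.0, 100.0), V=(1.0, 5.0)):
    """Derive the airplane-scene element states of one frame from acoustic features.

    Loudness -> exhaust size, pitch -> tilt angle and flying height,
    speech rate -> reference-object speed, timbre type -> airplane color
    and pilot category. All ranges are illustrative.
    """
    # Loudness: the amplitude range trisected into the three exhaust states P1-P3.
    third = (L[1] - L[0]) / 3.0
    exhaust = ("P1", "P2", "P3")[min(2, max(0, int((amplitude - L[0]) // third)))]

    # Pitch: linear mapping of the frequency onto the tilt-angle and height ranges.
    r = min(max((frequency - F[0]) / (F[1] - F[0]), 0.0), 1.0)
    tilt_deg = PHI[0] + r * (PHI[1] - PHI[0])
    height = H[0] + r * (H[1] - H[0])

    # Speech rate: the zero-crossing rate scales the backward translation speed
    # of the reference objects (cloud Y4 and ground dashed line Y5).
    speed = V[0] + min(zero_cross_rate / G_MAX, 1.0) * (V[1] - V[0])

    # Timbre: type 1 -> blue airplane with the "small fly" pilot,
    #         type 2 -> red airplane with the Hello Kitty pilot.
    color, pilot = ("blue", "small fly") if timbre_type == 1 else ("red", "hello kitty")

    return {"exhaust": exhaust, "tilt_deg": tilt_deg, "height": height,
            "reference_speed": speed, "airplane_color": color, "pilot": pilot}
```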
it is understood that the scene animations of fig. 7-14 are only one alternative example, and other scene animations can be designed, as long as the element states of the animation elements in the scene animations can be reasonably related to the acoustic features of the voice data.
With this scheme, the voice interface constructs rich, multi-dimensional animated interaction from the acoustic features of the voice. The user's comprehension of the voice is improved, the interaction experience is engaging and strongly personalized, the user can carry out voice interaction with pleasure, and usage efficiency is improved.
The following describes a voice processing apparatus provided in an embodiment of the present application, and the voice processing apparatus described below and the voice processing method described above may be referred to correspondingly.
As shown in fig. 15, a schematic structural diagram of a speech processing apparatus is disclosed, and the speech processing apparatus may include:
a voice acquiring unit 11, configured to acquire voice data in a voice input scene or a voice playing scene;
an acoustic feature value obtaining unit 12, configured to obtain an acoustic feature value set of the voice data, where the acoustic feature value set includes feature values of at least one type of acoustic features;
an element state determining unit 13, configured to determine an element state of each animation element by referring to a feature value of each type of acoustic feature in the set of acoustic feature values;
an animation construction unit 14 for constructing a scene animation according to the element states of the animation elements;
and an animation display unit 15 for displaying the scene animation.
Optionally, the acoustic feature value obtaining unit may include:
and the first acoustic characteristic value acquisition subunit is used for acquiring characteristic values of any one or more of loudness characteristics, tone characteristics and speech speed characteristics of the voice data to form an acoustic characteristic value set.
Optionally, the element state determining unit may include:
the animation element determining unit is used for determining an animation element corresponding to each type of acoustic features in the acoustic feature value set;
and the characteristic value corresponding unit is used for determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic.
Optionally, the feature value corresponding unit may include:
the interval determining unit is used for determining a characteristic value interval to which a characteristic value of each type of acoustic characteristic belongs;
and the interval corresponding unit is used for determining the element state of the animation element corresponding to the characteristic value interval.
In another alternative, the feature value corresponding unit may include:
and the characteristic value calculating unit is used for determining the element state corresponding to the animation element according to the characteristic value of each type of acoustic characteristic and the value range of the element state corresponding to the animation element.
Optionally, the animation construction unit may include:
the material selecting unit is used for selecting animation element materials corresponding to the element states according to the element states of the animation elements;
and the material combination unit is used for constructing the scene animation by utilizing the selected animation element materials.
Optionally, the voice acquiring unit may include:
the first voice acquiring subunit, configured to respond to a user's trigger operation on a voice input control displayed in a voice input interface, and acquire voice data through a microphone assembly, where the voice input interface includes: a conversation interface, an information input interface and an information retrieval interface. On this basis, the animation display unit may include: a first animation display subunit, configured to display the scene animation on the voice input interface.
optionally, the voice acquiring unit may include:
a second voice obtaining subunit, configured to respond to a user's trigger operation on a session voice control displayed in a voice playing interface, and obtain received session voice data corresponding to the session voice control, where the voice playing interface includes: a session interface. On this basis,
the animation display unit may include:
and the second animation display subunit is used for displaying the scene animation on the voice playing interface.
The voice processing apparatus provided in the embodiment of the present application may be applied to a voice processing device, such as a mobile phone, an iPad, a PC terminal and the like. Optionally, fig. 16 shows a block diagram of the hardware structure of the voice processing device. Referring to fig. 16, the hardware structure of the voice processing device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include a high-speed RAM and may further include a non-volatile memory, such as at least one disk storage;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring voice data in a voice input scene or a voice playing scene;
acquiring a set of acoustic feature values of the voice data, wherein the set of acoustic feature values comprises feature values of at least one type of acoustic features;
determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set;
and constructing scene animation according to the element state of each animation element, and displaying the scene animation.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring voice data in a voice input scene or a voice playing scene;
acquiring a set of acoustic feature values of the voice data, wherein the set of acoustic feature values comprises feature values of at least one type of acoustic features;
determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set;
and constructing scene animation according to the element state of each animation element, and displaying the scene animation.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (19)

1. A method of speech processing, comprising:
acquiring voice data in a voice input scene or a voice playing scene;
acquiring a set of acoustic feature values of the voice data, wherein the set of acoustic feature values comprises feature values of at least one type of acoustic features;
determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set;
and constructing scene animation according to the element state of each animation element, and displaying the scene animation.
2. The method of claim 1, wherein the obtaining the set of acoustic feature values for the speech data comprises:
and obtaining the characteristic values of any one or more of the loudness characteristic, the tone characteristic and the speech speed characteristic of the voice data to form an acoustic characteristic value set.
3. The method of claim 1, wherein said determining the element state of each animation element with reference to the feature value of each type of acoustic feature in the set of acoustic feature values comprises:
determining an animation element corresponding to each type of acoustic features in the acoustic feature value set;
and determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic.
4. The method of claim 3, wherein determining the element state of the corresponding animation element according to the feature value of each type of acoustic feature comprises:
determining a characteristic value interval to which a characteristic value of each type of acoustic characteristic belongs;
and determining the element state of the animation element corresponding to the characteristic value interval.
5. The method of claim 3, wherein determining the element state of the corresponding animation element according to the feature value of each type of acoustic feature comprises:
and determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic and the value range of the element state of the corresponding animation element.
6. The method of claim 1, wherein constructing a scene animation according to the element states of the animation elements comprises:
selecting animation element materials corresponding to the element states according to the element states of the animation elements;
and constructing scene animation by using the selected animation element materials.
7. The method of claim 1, wherein the obtaining voice data in a voice input scenario or a voice playback scenario comprises:
responding to the triggering operation of a user on a voice input control displayed in a voice input interface, and acquiring voice data through a microphone assembly, wherein the voice input interface comprises: a conversation interface, an information input interface and an information retrieval interface;
the displaying the scene animation includes:
and displaying the scene animation on the voice input interface.
8. The method of claim 1, wherein the obtaining voice data in a voice input scenario or a voice playback scenario comprises:
responding to a triggering operation of a user on a conversation voice control displayed in a voice playing interface, and acquiring received conversation voice data corresponding to the conversation voice control, wherein the voice playing interface comprises: a session interface;
the displaying the scene animation includes:
and displaying the scene animation on the voice playing interface.
9. The method of any one of claims 1-8, wherein the element state comprises any one or more of: element size, element position, element posture, element color, element image and element visible and invisible conditions.
10. A speech processing apparatus, comprising:
the voice acquisition unit is used for acquiring voice data in a voice input scene or a voice playing scene;
the acoustic feature value acquisition unit is used for acquiring an acoustic feature value set of the voice data, wherein the acoustic feature value set comprises feature values of at least one type of acoustic features;
the element state determining unit is used for determining the element state of each animation element by referring to the characteristic value of each type of acoustic characteristic in the acoustic characteristic value set;
the animation construction unit is used for constructing scene animation according to the element states of all animation elements;
and the animation display unit is used for displaying the scene animation.
11. The apparatus according to claim 10, wherein the acoustic feature value obtaining unit includes:
and the first acoustic characteristic value acquisition subunit is used for acquiring characteristic values of any one or more of loudness characteristics, tone characteristics and speech speed characteristics of the voice data to form an acoustic characteristic value set.
12. The apparatus of claim 10, wherein the element state determination unit comprises:
the animation element determining unit is used for determining an animation element corresponding to each type of acoustic features in the acoustic feature value set;
and the characteristic value corresponding unit is used for determining the element state of the corresponding animation element according to the characteristic value of each type of acoustic characteristic.
13. The apparatus of claim 12, wherein the feature value corresponding unit comprises:
the interval determining unit is used for determining a characteristic value interval to which a characteristic value of each type of acoustic characteristic belongs;
and the interval corresponding unit is used for determining the element state of the animation element corresponding to the characteristic value interval.
14. The apparatus of claim 12, wherein the feature value corresponding unit comprises:
and the characteristic value calculating unit is used for determining the element state corresponding to the animation element according to the characteristic value of each type of acoustic characteristic and the value range of the element state corresponding to the animation element.
15. The apparatus of claim 10, wherein the animation construction unit comprises:
the material selecting unit is used for selecting animation element materials corresponding to the element states according to the element states of the animation elements;
and the material combination unit is used for constructing the scene animation by utilizing the selected animation element materials.
16. The apparatus of claim 10, wherein the voice obtaining unit comprises:
the first voice acquiring subunit is configured to respond to a user's trigger operation on a voice input control displayed in a voice input interface, and acquire voice data through a microphone assembly, where the voice input interface includes: a conversation interface, an information input interface and an information retrieval interface;
the animation display unit includes:
and the first animation display subunit is used for displaying the scene animation on the voice input interface.
17. The apparatus of claim 10, wherein the voice obtaining unit comprises:
a second voice obtaining subunit, configured to respond to a user's trigger operation on a session voice control displayed in a voice playing interface, and obtain received session voice data corresponding to the session voice control, where the voice playing interface includes: a session interface;
the animation display unit includes:
and the second animation display subunit is used for displaying the scene animation on the voice playing interface.
18. A speech processing device, comprising a memory and a processor, wherein:
the memory is configured to store a program;
and the processor is configured to execute the program to implement the steps of the speech processing method according to any one of claims 1 to 9.
19. A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech processing method according to any one of claims 1 to 9.
CN201810568421.2A 2018-06-05 2018-06-05 A kind of method of speech processing, device, equipment and readable storage medium storing program for executing Pending CN108776985A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810568421.2A CN108776985A (en) 2018-06-05 2018-06-05 A kind of method of speech processing, device, equipment and readable storage medium storing program for executing

Publications (1)

Publication Number Publication Date
CN108776985A true CN108776985A (en) 2018-11-09

Family

ID=64024682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810568421.2A Pending CN108776985A (en) 2018-06-05 2018-06-05 A kind of method of speech processing, device, equipment and readable storage medium storing program for executing

Country Status (1)

Country Link
CN (1) CN108776985A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101577114A (en) * 2009-06-18 2009-11-11 北京中星微电子有限公司 Method and device for implementing audio visualization
US20120130717A1 (en) * 2010-11-19 2012-05-24 Microsoft Corporation Real-time Animation for an Expressive Avatar
CN104574453A (en) * 2013-10-17 2015-04-29 付晓宇 Software for expressing music with images
CN103927175A (en) * 2014-04-18 2014-07-16 深圳市中兴移动通信有限公司 Method with background interface dynamically changing along with audio and terminal equipment
CN106575424A (en) * 2014-07-31 2017-04-19 三星电子株式会社 Method and apparatus for visualizing music information
CN105205304A (en) * 2015-06-30 2015-12-30 胡国生 Color synaesthesia visualizing method for music
CN106383648A (en) * 2015-07-27 2017-02-08 青岛海信电器股份有限公司 Intelligent terminal voice display method and apparatus
CN107798964A (en) * 2017-11-24 2018-03-13 郑军 The sign language intelligent interaction device and its exchange method of a kind of Real time identification gesture

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110134305A (en) * 2019-04-02 2019-08-16 北京搜狗科技发展有限公司 A kind of word speed adjusting method, device and the device adjusted for word speed
CN112286349A (en) * 2020-10-23 2021-01-29 广东科徕尼智能科技有限公司 Sound-based visual interaction control method, intelligent terminal and storage device
CN112286349B (en) * 2020-10-23 2024-10-18 广东好太太智能家居有限公司 Visual interaction control method based on sound, intelligent terminal and storage device
CN113157351A (en) * 2021-03-18 2021-07-23 福建马恒达信息科技有限公司 Voice plug-in construction method for quickly calling form tool
CN113157351B (en) * 2021-03-18 2022-06-07 福建马恒达信息科技有限公司 Voice plug-in construction method for quickly calling form tool
CN113707146A (en) * 2021-08-31 2021-11-26 北京达佳互联信息技术有限公司 Information interaction method and information interaction device

Similar Documents

Publication Publication Date Title
WO2021109650A1 (en) Virtual gift sending method, apparatus and device, and storage medium
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
WO2021109652A1 (en) Method and apparatus for giving character virtual gift, device, and storage medium
CN104184760B (en) Information interacting method, client in communication process and server
TWI650977B (en) Expression information processing method and device in instant messaging process
CN105917404B (en) For realizing the method, apparatus and system of personal digital assistant
CN112379812B (en) Simulation 3D digital human interaction method and device, electronic equipment and storage medium
JP5575645B2 (en) Advanced camera-based input
CN108776985A (en) A kind of method of speech processing, device, equipment and readable storage medium storing program for executing
CN110555507B (en) Interaction method and device for virtual robot, electronic equipment and storage medium
CN107632706B (en) Application data processing method and system of multi-modal virtual human
WO2020108404A1 (en) Parameter configuration method and device for three-dimensional face model, apparatus and storage medium
CN109196464A (en) User agent based on context
KR102491140B1 (en) Method and apparatus for generating virtual avatar
US20200125920A1 (en) Interaction method and apparatus of virtual robot, storage medium and electronic device
CN106453638A (en) Information interaction method and information interaction system in application service
CN111538456A (en) Human-computer interaction method, device, terminal and storage medium based on virtual image
CN109086860B (en) Interaction method and system based on virtual human
US20110029889A1 (en) Selective and on-demand representation in a virtual world
CN111464430B (en) Dynamic expression display method, dynamic expression creation method and device
CN107111427A (en) Change video call data
CN113014471A (en) Session processing method, device, terminal and storage medium
CN111324409B (en) Artificial intelligence-based interaction method and related device
US20200176019A1 (en) Method and system for recognizing emotion during call and utilizing recognized emotion
CN113703585A (en) Interaction method, interaction device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination