CN117238287A - Voice data processing method and device - Google Patents

Voice data processing method and device

Info

Publication number: CN117238287A
Authority: CN (China)
Prior art keywords: value, target, volume, text, sound
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Application number: CN202311209184.8A
Other languages: Chinese (zh)
Inventors: 肖海, 林永吉, 黎清顾
Current assignee: Gree Electric Appliances Inc of Zhuhai; Zhuhai Lianyun Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Gree Electric Appliances Inc of Zhuhai; Zhuhai Lianyun Technology Co Ltd
Application filed by Gree Electric Appliances Inc of Zhuhai and Zhuhai Lianyun Technology Co Ltd
Priority to CN202311209184.8A
Publication of CN117238287A

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The application provides a voice data processing method and device. The method includes: acquiring a question to be replied; determining reply content in text format corresponding to the question to be replied, the reply content including at least one target text; determining a target sound speed value and a target volume value of at least one target text in the reply content; and generating a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and target volume value of each target text in it. With this method, a reply voice whose target sound speed values and target volume values meet the user's requirements can be generated.

Description

Voice data processing method and device
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a voice data processing method and device.
Background
With the development of artificial intelligence technology, intelligent voice is increasingly applied in fields such as smart homes, intelligent vehicle-mounted systems, intelligent customer service, audiobooks, voice prompts, and intelligent shopping guides.
In the related art, to answer a user's question, an artificial intelligence splices a reply together from pre-arranged voice segments and feeds the reply back to the user.
However, a reply composed of pre-arranged voice segments suffers from stiff intonation and a lack of emotional expression, so a reply voice meeting the user's requirements cannot be generated for the user's question.
Disclosure of Invention
The application aims to provide a voice data processing method and device, which at least solve the problem in the prior art that a reply voice meeting the user's requirements cannot be generated for a user's question.
In a first aspect, an embodiment of the present application discloses a voice data processing method, including:
acquiring a question to be replied;
determining reply content in a text format corresponding to the question to be replied, wherein the reply content comprises at least one target text;
determining a target sound speed value and a target volume value of at least one target text in the reply content;
and generating a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and target volume value of the target text in the reply content.
In a second aspect, an embodiment of the present application discloses a voice data processing apparatus, the apparatus including:
the first acquisition module is used for acquiring a question to be replied;
the first determining module is used for determining reply content in a text format corresponding to the question to be replied, the reply content including at least one target text;
the second determining module is used for determining a target sound speed value and a target volume value of at least one target text in the reply content;
and the first generation module is used for generating a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and target volume value of the target text in the reply content.
In a third aspect, an embodiment of the present application further discloses an electronic device, including a processor and a memory, the memory storing a program or instructions executable on the processor, the program or instructions implementing the steps of the method according to the first aspect when executed by the processor.
In a fourth aspect, embodiments of the present application also disclose a readable storage medium having stored thereon a program or instructions which, when executed by a processor, implement the steps of the method as described in the first aspect.
In summary, in the embodiments of the application, for the reply content corresponding to a question to be replied, an initial sound speed value and a target sound speed weight value are obtained for each target text, and the target sound speed value of the target text is obtained from them; likewise, an initial volume value and a target volume weight value are obtained for each target text, and the target volume value of the target text is obtained from them. A reply voice with sound speed and volume characteristics is then generated according to the target sound speed values and target volume values, which solves the problem that a reply voice meeting the user's requirements cannot be generated for a user's question.
Drawings
In the drawings:
FIG. 1 is a flowchart of the steps of a voice data processing method according to an embodiment of the present application;
FIG. 2 is a flowchart of the steps of another voice data processing method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the volume distribution and time distribution of first voice data according to an embodiment of the present application;
FIG. 4 is a flowchart of the steps of yet another voice data processing method according to an embodiment of the present application;
FIG. 5 is a block diagram of a voice data processing apparatus according to an embodiment of the present application;
FIG. 6 is a block diagram of an electronic device according to an embodiment of the present application;
FIG. 7 is a block diagram of an electronic device according to another embodiment of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without creative effort fall within the scope of the application.
The terms "first", "second", and the like in the description and claims are used to distinguish between similar elements and do not necessarily describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, so that the embodiments of the present application can be implemented in sequences other than those illustrated or described herein. Objects identified by "first", "second", etc. are generally of one type, and the number of objects is not limited; for example, the first object may be one or more. Furthermore, in the description and claims, "and/or" means at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
Referring to fig. 1, an embodiment of the present application provides a voice data processing method, which may include the following steps:
and step 101, acquiring a question to be replied.
In one embodiment, the question to be replied is a question in voice format issued by a target user, where the target user is the user who needs to obtain the reply voice corresponding to the question he or she has issued.
In another embodiment, the question to be replied is a question in text format. For example, it may be obtained by converting a voice-format question issued by the target user into a text-format question. As another example, it may be a text-format question obtained in response to a text input operation by the target user.
Step 102, determining the reply content in text format corresponding to the question to be replied.
Wherein the reply content includes at least one target text.
In this step, the reply content is text format content, and the target text is text included in the text format reply content.
In one embodiment, the question to be replied is a question in voice format, and the voice-format question issued by the target user is first converted into a text-format question. The text-format question is then matched against the question data of a plurality of question-answer pairs in a pre-stored database, where each question-answer pair includes question data and the reply data corresponding to it. After the target question data matching the question is found, the reply data in the question-answer pair containing the target question data is taken as the text-format reply content corresponding to the question to be replied.
In another embodiment, the question to be replied is already a question in text format, and it is matched directly against the question data of the question-answer pairs in the pre-stored database. After the target question data matching the question is found, the reply data in the question-answer pair containing the target question data is taken as the text-format reply content corresponding to the question to be replied.
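As a rough illustration of this lookup, a minimal sketch follows; the database layout and the similarity measure are assumptions for illustration, not taken from the patent text.

```python
# Minimal sketch of matching a question against pre-stored question-answer
# pairs. The pair layout and the similarity measure are assumptions.
from difflib import SequenceMatcher

QA_PAIRS = [
    {"question": "what is the weather today", "reply": "It is sunny today."},
    {"question": "thank you for your help", "reply": "You are welcome."},
]

def match_reply(question_text: str) -> str:
    """Return the reply whose stored question best matches the input question."""
    best_pair = max(
        QA_PAIRS,
        key=lambda pair: SequenceMatcher(None, question_text, pair["question"]).ratio(),
    )
    return best_pair["reply"]

print(match_reply("thank you so much for the help"))  # -> "You are welcome."
```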
Step 103, determining a target sound speed value and a target volume value of at least one target text in the reply content.
In this step, the reply content includes at least one target text, and the target sound speed value and target volume value of each target text are acquired.
Specifically, the target sound speed value is the sound speed value generated for the target text, and the target volume value is the volume value generated for the target text.
In one embodiment, the target sound speed value of the target text is calculated according to the initial sound speed value and the target sound speed weight value of the target text, and the target volume value of the target text is calculated according to the initial volume value and the target volume weight value of the target text.
Step 104, generating a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and target volume value of the target text in the reply content.
In this step, for each target text in the reply content, its target sound speed value and target volume value are applied to it, and a reply voice having the corresponding target sound speed values and target volume values is generated.
To sum up, in this embodiment, a question to be replied is acquired, the text-format reply content corresponding to it is determined, and the target sound speed value and target volume value of at least one target text in the reply content are determined. In the prior art, reply voices are spliced from pre-arranged voice segments, so they lack the vocal characteristics of a real person: the intonation is stiff and emotional expression is missing. In contrast, the present method obtains a target sound speed value and a target volume value for each target text in the reply content and generates the reply voice accordingly, so the resulting reply voice carries the target sound speed value and target volume value of each target text and an emotional expression closer to that of a real person, which solves the problem that a reply voice meeting the user's requirements cannot be generated for a user's question.
Fig. 2 is a flowchart of steps of another voice data processing method according to an embodiment of the present application, and referring to fig. 2, the method may include the following steps:
step 201, obtain a question to be replied.
The method of this step is described in the foregoing step 101, and will not be described here again.
Step 202, determining the reply content in text format corresponding to the question to be replied.
Wherein the reply content includes at least one target text.
The method of this step is already described in step 102, and will not be described here again.
Step 203, for each target text, obtaining an initial sound speed value, a target sound speed weight value, an initial volume value and a target volume weight value of the target text.
In one embodiment, the initial sound speed value is a sound speed value obtained according to the second sound speed value of each second text in second voice data, and the initial volume value is a volume value obtained according to the second volume value of each second text in the second voice data.
The target sound speed weight value is a weight value for optimizing the initial sound speed value, and the target volume weight value is a weight value for optimizing the initial volume value.
Step 204, obtaining a target sound speed value of the target text according to the initial sound speed value and the target sound speed weight value.
In one embodiment, the initial sound speed value is multiplied by the target sound speed weight value to obtain the target sound speed value of the target text.
Step 205, obtaining the target volume value of the target text according to the initial volume value and the target volume weight value.
In one embodiment, the initial volume value is multiplied by the target volume weight value to obtain the target volume value of the target text.
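Putting steps 204 and 205 together, each target value is a simple product of the initial value and its weight value. A minimal sketch follows; the function name and the example numbers are illustrative assumptions.

```python
# Sketch of steps 204-205: each target value is the product of the initial
# value and the corresponding weight value.
def target_values(initial_speed: float, speed_weight: float,
                  initial_volume: float, volume_weight: float):
    target_speed = initial_speed * speed_weight      # step 204
    target_volume = initial_volume * volume_weight   # step 205
    return target_speed, target_volume

# e.g. an initial sound speed of 0.2 s per text weighted by 1.1,
# and an initial volume of 60 dB weighted by 1.2
print(target_values(0.2, 1.1, 60.0, 1.2))  # (0.22000000000000003, 72.0)
```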
Step 206, generating a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and the target sound volume value of the target text in the reply content.
The method of this step is described in the foregoing step 104, and will not be described herein.
In summary, in the embodiments of the application, for the reply content corresponding to a question to be replied, an initial sound speed value and a target sound speed weight value are obtained for each target text, and the target sound speed value of the target text is obtained from them; likewise, an initial volume value and a target volume weight value are obtained for each target text, and the target volume value of the target text is obtained from them. A reply voice with sound speed and volume characteristics is then generated according to the target sound speed values and target volume values, which solves the problem that a reply voice meeting the user's requirements cannot be generated for a user's question.
In one embodiment, obtaining the target sound speed weight value and the target volume weight value of the target text in step 203 may include the following sub-steps:
sub-step 2031, at least one first voice data is acquired.
In one embodiment, a part of voice data may be randomly selected from a preset voice database, and the selected part of voice data is used as the first voice data.
In another embodiment, at least one voice data of the preset user may be recorded, and the recorded at least one voice data is used as the first voice data.
Sub-step 2032, for each first voice data, determining a first sound speed value and a first volume value of at least one first text included in the first voice data.
The target text is a text in the first text.
In one embodiment, each first voice data includes at least one first text, and each first text has a corresponding first sound speed value and first volume value, where the at least one first text includes the target text in the reply content.
In one embodiment, decibel data of each first text in the first voice data is collected, and the collected decibel data is used as the first volume value corresponding to that first text.
In one embodiment, the first sound speed value of each first text is obtained according to the time distribution data of each first text in the first voice data.
Specifically, the first text in the first voice data is taken as the time starting point, and the time distribution data of each first text in the first voice data is obtained. For each preceding first text, the time t₂ of the following first text adjacent to and behind it and the time t₁ of the preceding first text are obtained from the time distribution data. The time interval between them is Δt = t₂ − t₁, and the first sound speed value of the preceding first text is v = (t₂ − t₁)/2.
In one embodiment, the text content corresponding to the first voice data is "I really thank it for its existence", and the first volume value distribution and time distribution data of each first text in this content are as shown in fig. 3. Taking the first text "feel" as an example, it is the preceding first text relative to the following first text, and its first volume value is 72 dB.
With continued reference to fig. 3, the time corresponding to the preceding first text "feel" is T₅, and the time corresponding to the adjacent following first text "thank" is T₆, so the first sound speed value v of the preceding first text "feel" is v = (T₆ − T₅)/2.
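A minimal sketch of this per-text sound speed computation follows, under the assumption that the time distribution data is a list of per-text timestamps; following the later description of the second texts, the last text is given the same value as the one before it.

```python
# Sketch of deriving per-text sound speed values from time distribution
# data, v = (t_next - t_current) / 2. Timestamps are in milliseconds here;
# giving the last text the value of the one before it mirrors the handling
# described for the second texts later in the document.
def sound_speed_values(timestamps):
    values = [(t2 - t1) / 2 for t1, t2 in zip(timestamps, timestamps[1:])]
    if values:
        values.append(values[-1])  # last text reuses the preceding value
    return values

# times (ms) at which each of four texts is uttered
print(sound_speed_values([0, 400, 900, 1300]))  # [200.0, 250.0, 200.0, 200.0]
```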
Sub-step 2033, for each first text, obtaining a sound speed weight value of the first text according to the first sound speed value of the first text in the first voice data.
In one embodiment, for each first text, the sound speed weight value of the first text is obtained from the first sound speed value of the first text in the first voice data and the sound speed mean of the first sound speed values of all first texts in the first voice data.
Sub-step 2034, determining a target sound speed weight value of the target text from the sound speed weight values of the first text.
For example, the first text matching the target text is determined from among the first texts, and its sound speed weight value is taken as the target sound speed weight value of the target text.
Sub-step 2035, for each first text, obtaining a volume weight value of the first text according to the first volume value of the first text in the first voice data.
For each first text, the volume weight value of the first text is obtained from the first volume value of the first text in the first voice data and the volume mean of the first volume values of all first texts in the first voice data.
Sub-step 2036, determining the target volume weight value of the target text from the volume weight values of the first text.
For example, the first text matching the target text is determined from among the first texts, and its volume weight value is taken as the target volume weight value of the target text.
In this embodiment, by acquiring the sound speed weight values and volume weight values of the first texts in the first voice data, the target sound speed weight value and target volume weight value of the target text can be rapidly determined from them.
In one embodiment, sub-step 2033 may include the sub-steps of:
sub-step 2037 obtains a mean value of the speed of sound of the first speech data based on the first speed of sound value of the at least one first word included in the first speech data.
In one embodiment, a first sound speed value of each first word in the first voice data is obtained, the first sound speed value of each first word is preprocessed, the preprocessed first sound speed value is obtained, and then a sound speed average value of the first voice data is obtained according to the preprocessed first sound speed value. Further, the first sonic values of all the preprocessed first words are averaged to obtain a sonic average value.
Further, preprocessing the first sound velocity value of each first character includes: and acquiring the standard deviation of the first sound speed value of each first character relative to the first sound speed values of all the first characters, and deleting the first sound speed value of the first character under the condition that the standard deviation of the sound speed exceeds a preset standard deviation threshold value of the sound speed.
Correspondingly, after deleting all the first characters with the preset sound speed standard deviation exceeding the preset sound speed standard deviation threshold, taking the first sound speed value of the remaining first characters as the preprocessed first sound speed value, acquiring the average value of the first sound speed values of the remaining first characters, and taking the average value of the first sound speed values of the remaining first characters as the sound speed average value of the first voice data.
For example, the preset sound speed standard deviation threshold may be set according to the user's demand, for example, may be set to 3.
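A minimal sketch of this outlier-filtered mean follows, under the assumption that "standard deviation relative to all values" means the deviation from the overall mean measured in standard-deviation units (a z-score); the data are illustrative.

```python
# Sketch of the preprocessing in sub-step 2037: drop texts whose sound speed
# value deviates from the overall mean by more than `threshold` standard
# deviations (3 here, per the example), then average the remaining values.
from statistics import mean, pstdev

def preprocessed_mean(values, threshold=3.0):
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return mu  # all values identical; nothing to filter
    kept = [v for v in values if abs(v - mu) / sigma <= threshold]
    return mean(kept)

speeds = [0.20, 0.22, 0.19, 0.21, 0.20, 0.18, 0.22, 0.21, 0.19, 0.20, 1.50]
print(preprocessed_mean(speeds))  # the 1.50 outlier is dropped before averaging
```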
Sub-step 2038, for each first text, obtaining the sound speed relative value of its first sound speed value with respect to the sound speed mean, according to the first sound speed value of the first text in the first voice data and the sound speed mean of the first voice data.
For each first text, the ratio of its first sound speed value to the sound speed mean of all first texts in the first voice data is computed; the resulting ratio is the sound speed relative value of the first sound speed value of the first text with respect to the sound speed mean.
Sub-step 2039, obtaining a sound speed weight value of the first text according to the sound speed relative value.
In one embodiment, the sound speed relative value is taken as the sound speed weight value of the first text.
In another embodiment, the sound speed relative value is multiplied by a first preset weight to obtain the sound speed weight value of the first text.
In one embodiment, where the first voice data is plural, sub-step 2039 may include the sub-steps of:
Sub-step 2040, for each first text, respectively acquiring the sound speed relative value of its first sound speed value with respect to the sound speed mean of each first voice data.
In this step, for each first text, the ratio of its first sound speed value to the sound speed mean of the first voice data containing it is computed; the resulting ratio is the sound speed relative value of that first sound speed value with respect to that sound speed mean.
Sub-step 2041, acquiring a first average of the sound speed relative values according to the sound speed relative value for each first voice data.
Illustratively, the sound speed relative values across all the first voice data are averaged to obtain the first average of the sound speed relative values.
Sub-step 2043, acquiring the sound speed weight value of the first text based on the first average.
In one embodiment, the first average is taken as the sound speed weight value of the first text.
In another embodiment, the product of the first average and a second preset weight is used as the sound speed weight value of the first text.
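A minimal sketch of sub-steps 2040 through 2043 follows; the data layout (parallel lists of per-recording values and per-recording means) is an assumption for illustration.

```python
# Sketch of sub-steps 2040-2043: divide a text's sound speed value in each
# recording by that recording's sound speed mean, then average the ratios
# across recordings; the first average serves directly as the weight value.
from statistics import mean

def sound_speed_weight(per_recording_values, per_recording_means):
    relatives = [v / m for v, m in zip(per_recording_values, per_recording_means)]
    return mean(relatives)  # the first average of the relative values

# the same text observed in three first voice data recordings
print(sound_speed_weight([0.24, 0.20, 0.30], [0.20, 0.20, 0.25]))  # ~1.13
```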
In one embodiment, sub-step 2035 may include the sub-steps of:
Sub-step 2044, obtaining the volume mean of the first voice data according to the first volume value of at least one first text included in the first voice data.
In one embodiment, the first volume value of each first text in the first voice data is obtained and preprocessed, and the volume mean of the first voice data is then obtained from the preprocessed first volume values; that is, the preprocessed first volume values of all first texts are averaged to obtain the volume mean.
Further, preprocessing the first volume value of each first text includes: obtaining, for each first text, the volume standard deviation of its first volume value relative to the first volume values of all first texts, and deleting the first volume value of a first text when its volume standard deviation exceeds a preset volume standard deviation threshold. Correspondingly, after the first volume values of all first texts whose volume standard deviation exceeds the preset threshold are deleted, the first volume values of the remaining first texts are taken as the preprocessed first volume values, their mean is obtained, and this mean is taken as the volume mean of the first voice data.
The preset volume standard deviation threshold may be set according to user requirements; for example, it may be set to 3.
Sub-step 2045, for each first text, obtaining the volume relative value of its first volume value with respect to the volume mean, according to the first volume value of the first text in the first voice data and the volume mean of the first voice data.
In this step, for each first text, the ratio of its first volume value to the volume mean of the first voice data containing it is computed; the resulting ratio is the volume relative value of the first volume value of the first text with respect to the volume mean.
Sub-step 2046, obtaining a volume weight value of the first text according to the volume relative value.
In one embodiment, the volume relative value is taken as the volume weight value of the first text.
In another embodiment, the volume relative value is multiplied by a third preset weight to obtain the volume weight value of the first text.
In one embodiment, where the first speech data is plural, sub-step 2046 may comprise the sub-steps of:
Sub-step 2047, for each first text, respectively obtaining the volume relative value of its first volume value with respect to the volume mean of each first voice data.
In this step, for each first text, the ratio of its first volume value to the volume mean of the first voice data containing it is computed; the resulting ratio is the volume relative value of that first volume value with respect to that volume mean.
Sub-step 2048, obtaining a second average of the volume relative values according to the volume relative value for each first voice data.
Illustratively, the volume relative values across all the first voice data are averaged to obtain the second average of the volume relative values.
Sub-step 2049, obtaining the volume weight value of the first text based on the second average.
In one embodiment, the second average value is used as the volume weight value of the first text.
In another embodiment, the product of the second average value and the fourth preset weight value is used as the volume weight value of the first text.
In one embodiment, obtaining the initial sound speed value and the initial volume value of the target text in step 203 may include the following sub-steps:
sub-step 2050, obtaining second voice data;
in one embodiment, voice data of a target user who sends out a question to be replied is obtained, and the voice data of the target user is used as second voice data.
In another embodiment, the voice data is obtained from a preset voice database, and the voice data is used as second voice data.
Sub-step 2051, obtaining a second sound speed value and a second volume value of at least one second text included in the second voice data.
The target text is a text in the second text.
In one embodiment, decibel data of each second text in the second voice data is collected, and the collected decibel data is used as the second volume value of that second text.
In one embodiment, the time of a preceding second text in the second voice data and the time of the second text adjacent to and behind it are obtained, and the time difference between them is computed; half of this time difference is the second sound speed value of the preceding second text.
For example, the second sound speed value of the last second text in the second voice data may be set equal to the second sound speed value of the second text preceding it.
Sub-step 2052, determining an initial sound speed value of the target text from the second sound speed values of the at least one second text.
For example, the target text is matched against the at least one second text, the target second text matching the target text is determined, and the second sound speed value of the target second text is taken as the initial sound speed value of the target text.
Sub-step 2053 determines an initial volume value for the target text from the second volume values for the at least one second text.
For example, a target second text matching the target text in the at least one second text is determined, and a second volume value of the target second text is determined as an initial volume value of the target text.
Referring to fig. 4, the present application provides another embodiment of a voice data processing method, which includes the following steps:
Step S1, obtaining the first volume value of each first text in the first voice data and the time distribution data corresponding to each first text, and obtaining the first sound speed value of each first text according to the time distribution data.
The methods for obtaining the first volume value of a first text in the first voice data and for obtaining its first sound speed value from the time distribution data are described in sub-step 2032 above and are not repeated here.
Step S2, obtaining the mean of the first sound speed values of all first texts in the first voice data.
For example, the first sound speed values of all first texts in the first voice data are averaged to obtain the sound speed mean.
Step S3, obtaining the mean of the first volume values of all first texts in the first voice data.
For example, the first volume values of all first texts in the first voice data are averaged to obtain the volume mean.
Step S4, acquiring the sound speed weight value and the volume weight value of each first text.
The method for obtaining the sound speed weight value of a first text is described in sub-step 2039 above, and the method for obtaining its volume weight value in sub-step 2046; they are not repeated here.
Step S5, constructing an emotion characteristic value database of the first texts according to their sound speed weight values and volume weight values.
The emotion characteristic value database includes the sound speed weight value and the volume weight value of each first text.
In this step, steps S1 to S4 are repeated for each first text to obtain the sound speed weight values and volume weight values of a plurality of first texts.
For example, when a plurality of first voice data are obtained, for each first text, the sound speed relative value of the first text with respect to each first voice data is obtained; these sound speed relative values are averaged to obtain the first average, which is the average sound speed relative value of the first text and is used as its sound speed weight value.
Similarly, for each first text, the volume relative value of the first text with respect to each first voice data is obtained; these volume relative values are averaged to obtain the second average, which is the average volume relative value of the first text and is used as its volume weight value.
In this embodiment, the sound speed weight value and the volume weight value of each first text are obtained across the plurality of first voice data, and the emotion characteristic value database is constructed from the sound speed weight values and volume weight values of the plurality of first texts.
The number of first texts for which sound speed weight values and volume weight values are obtained is determined according to the number of texts included in all the first voice samples. In one embodiment, this number is 3500.
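A minimal sketch of such a database follows; representing it as an in-memory mapping from text to weight values is an assumption, and the entries are illustrative.

```python
# Hypothetical sketch of the emotion characteristic value database of step S5:
# a mapping from each first text to its sound speed and volume weight values.
emotion_db = {}

def add_text(text, speed_weight, volume_weight):
    emotion_db[text] = {"speed_weight": speed_weight, "volume_weight": volume_weight}

add_text("谢", 0.9, 1.3)  # e.g. "thank" spoken slightly faster and louder
add_text("的", 1.1, 0.8)  # e.g. a particle spoken slower and softer

print(emotion_db["谢"])  # looked up later when this text occurs in reply content
```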
Step S6, collecting the voice data of the target user.
In this step, the target user is the user to whom a reply voice for the question to be replied needs to be fed back. The voice data of the target user corresponds to the second voice data in the foregoing embodiments.
Step S7, using the voice data of the target user as first voice data and returning to step S1.
In this step, using the voice data of the target user as first voice data and returning to step S1 is equivalent to enlarging the sample from which the data in the emotion characteristic value database is acquired, thereby improving the accuracy of the data in the database.
Step S8, obtaining the second sound speed value and the second volume value of each second text in the voice data of the target user.
In this embodiment, the voice data of the target user corresponds to the second voice data in the foregoing embodiments. In practical applications, the second voice data need not be the voice data of the target user; it may instead be the voice data of a user selected from a voice database in which the voice data of a plurality of users is pre-stored.
For example, the second volume value and the time distribution data of each second text in the voice data of the target user are extracted, and the second sound speed value of the second text is obtained according to the time distribution data.
Further, when there are multiple pieces of voice data of the target user, the mean of the second sound speed values of a second text across those voice data is obtained and used as the second target sound speed value of that second text, and the mean of its second volume values across those voice data is obtained and used as its second target volume value.
Step S9, obtaining the question to be replied of the target user and generating reply content for the question to be replied.
In one embodiment, the reply content for the question to be replied is obtained from the target user's question to be replied in combination with a large language model (LLM).
Step S10, acquiring the target sound speed weight value and the target volume weight value of each target text in the reply content according to the sound speed weight values and the volume weight values of the first texts included in the emotion characteristic value database.
For example, the sound speed weight value of the first text matching the target text is taken as the target sound speed weight value of the target text, and the volume weight value of that first text is taken as the target volume weight value of the target text.
Step S11, obtaining the initial sound speed value and the initial volume value of each target text in the reply content according to the second sound speed value and the second volume value of each second text.
For example, the second sound speed value of the second text matching the target text is taken as the initial sound speed value of the target text, and the second volume value of that second text is taken as the initial volume value of the target text.
Step S12, acquiring the target sound speed value of the target text according to the initial sound speed value and the target sound speed weight value, and acquiring the target volume value of the target text according to the initial volume value and the target volume weight value.
For example, the initial sound speed value is multiplied by the target sound speed weight value to obtain the target sound speed value of the target text, and the initial volume value is multiplied by the target volume weight value to obtain the target volume value of the target text.
Step S13, generating the reply voice according to the target sound speed values, the target volume values, and the target texts in the reply content.
Specifically, each target text in the reply content is given the emotional expression of its target sound speed and target volume. For example, if the reply content for the question to be replied "I really thank it for its existence" is "No need to thank", then the target volume value of each target text in "No need to thank" equals the product of its corresponding initial volume value and target volume weight value, and the target sound speed value of each target text equals the product of its corresponding initial sound speed value and target sound speed weight value.
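Putting steps S10 through S13 together, a minimal end-to-end sketch follows under the assumptions of the earlier sketches; `plan_reply`, the default values, and the data layouts are illustrative, and the final synthesis step is only indicated by a comment.

```python
# Sketch of steps S10-S13: look up each reply text's weight values in the
# emotion characteristic value database and its initial values in the target
# user's speech profile, multiply them, and collect a per-text plan that a
# TTS engine could realize. Defaults for unseen texts are assumptions.
def plan_reply(reply_text, emotion_db, user_profile):
    plan = []
    for text in reply_text:
        weights = emotion_db.get(text, {"speed_weight": 1.0, "volume_weight": 1.0})
        initial = user_profile.get(text, {"speed": 0.2, "volume": 60.0})
        plan.append({
            "text": text,
            "speed": initial["speed"] * weights["speed_weight"],     # step S12
            "volume": initial["volume"] * weights["volume_weight"],  # step S12
        })
    return plan  # step S13 would pass this plan to a speech synthesizer

user_profile = {"谢": {"speed": 0.18, "volume": 65.0}}
emotion_db = {"谢": {"speed_weight": 0.9, "volume_weight": 1.3}}
print(plan_reply("谢", emotion_db, user_profile))
```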
Referring to fig. 5, there is shown a voice data processing apparatus 30 according to an embodiment of the present application, where the voice data processing apparatus 30 includes:
a first obtaining module 301, configured to obtain a question to be replied;
a first determining module 302, configured to determine reply content in a text format corresponding to the question to be replied, where the reply content includes at least one target text;
a second determining module 303, configured to determine a target sound speed value and a target volume value of at least one target text in the reply content;
and a first generating module 304, configured to generate a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and target volume value of the target text in the reply content.
In one embodiment, the second determining module 303 may include:
the first acquisition sub-module is used for acquiring an initial sound speed value, a target sound speed weight value, an initial volume value and a target volume weight value of each target text;
the second acquisition sub-module is used for acquiring a target sound speed value of the target text according to the initial sound speed value and the target sound speed weight value;
and the third acquisition sub-module is used for acquiring a target volume value of the target text according to the initial volume value and the target volume weight value.
Optionally, the first obtaining sub-module may include:
a first acquisition unit configured to acquire at least one first voice data;
a first determining unit, configured to determine, for each first voice data, a first sound speed value and a first volume value of at least one first text included in the first voice data, where the target text is a text in the first text;
a second acquisition unit, used for acquiring, for each first text, a sound speed weight value of the first text according to the first sound speed value of the first text in the first voice data;
a second determining unit, used for determining the target sound speed weight value of the target text from the sound speed weight values of the first text;
a third acquisition unit, used for acquiring, for each first text, a volume weight value of the first text according to the first volume value of the first text in the first voice data;
and a fourth determining unit, used for determining the target volume weight value of the target text from the volume weight values of the first text.
Alternatively, the second acquisition unit may include:
the first obtaining subunit is used for obtaining the sound speed mean of the first voice data according to the first sound speed value of at least one first text included in the first voice data;
the second obtaining subunit is used for obtaining, for each first text, the sound speed relative value of its first sound speed value with respect to the sound speed mean, according to the first sound speed value of the first text in the first voice data and the sound speed mean of the first voice data;
and the third obtaining subunit is used for obtaining the sound speed weight value of the first text according to the sound speed relative value.
Alternatively, in the case where the first voice data is plural, the third acquisition subunit may include:
the first acquisition subunit is used for respectively acquiring, for each first text, the sound speed relative value of its first sound speed value with respect to the sound speed mean of each first voice data;
the second acquisition subunit is used for acquiring a first average of the sound speed relative values according to the sound speed relative value for each first voice data;
and the third acquisition subunit is used for acquiring the sound speed weight value of the first text according to the first average.
Optionally, the third obtaining unit includes:
a fourth obtaining subunit, configured to obtain a volume average value of the first voice data according to a first volume value of at least one first text included in the first voice data;
a fifth obtaining subunit, configured to obtain, for each first text, a volume relative value of the first volume value of the first text with respect to the volume average value according to the first volume value of the first text in the first voice data and the volume average value of the first voice data;
And the sixth acquisition subunit is used for acquiring the volume weight value of the first text according to the volume relative value.
Alternatively, in the case where the first voice data is plural, the sixth acquisition subunit may include:
a fourth obtaining subunit, configured to obtain, for each first text, a volume relative value of a first volume value of the first text relative to a volume average value of each first voice data;
a fifth acquisition subunit, configured to acquire a second average value of the volume relative values according to the volume relative value for each of the first voice data;
and a sixth acquisition subunit, configured to acquire a volume weight value of the first text according to the second average value.
Optionally, the first obtaining sub-module may include:
a fourth acquisition unit configured to acquire second voice data;
a fifth obtaining unit, configured to obtain a second sound speed value and a second volume value of at least one second text included in the second voice data, where the target text is a text in the second text;
a fifth determining unit, configured to determine an initial sound speed value of the target text from the second sound speed values of the at least one second text;
and a sixth determining unit, configured to determine an initial volume value of the target text from the second volume values of the at least one second text.
In summary, in the embodiments of the application, for the reply content corresponding to a question to be replied, an initial sound speed value and a target sound speed weight value are obtained for each target text, and the target sound speed value of the target text is obtained from them; likewise, an initial volume value and a target volume weight value are obtained for each target text, and the target volume value of the target text is obtained from them. A reply voice with sound speed and volume characteristics is then generated according to the target sound speed values and target volume values, which solves the problem that a reply voice meeting the user's requirements cannot be generated for a user's question.
Fig. 6 is a block diagram of an electronic device 400. For example, the electronic device 400 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 6, electronic device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input/output (I/O) interface 412, a sensor component 414, and a communication component 416.
The processing component 402 generally controls overall operation of the electronic device 400, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 402 may include one or more processors 420 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 402 can include one or more modules that facilitate interaction between the processing component 402 and other components. For example, the processing component 402 may include a multimedia module to facilitate interaction between the multimedia component 408 and the processing component 402.
Memory 404 is used to store various types of data to support operations at electronic device 400. Examples of such data include instructions for any application or method operating on electronic device 400, contact data, phonebook data, messages, pictures, multimedia, and so forth. The memory 404 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 406 provides power to the various components of the electronic device 400. The power components 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 400.
The multimedia component 408 includes a screen providing an output interface between the electronic device 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, it may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 408 includes a front camera and/or a rear camera. When the electronic device 400 is in an operating mode, such as a photographing mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each front and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 410 is for outputting and/or inputting audio signals. For example, the audio component 410 includes a Microphone (MIC) for receiving external audio signals when the electronic device 400 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 404 or transmitted via the communication component 416. In some embodiments, audio component 410 further includes a speaker for outputting audio signals.
The input/output (I/O) interface 412 provides an interface between the processing component 402 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, or the like. These buttons may include, but are not limited to: a home button, volume buttons, a start button, and a lock button.
The sensor component 414 includes one or more sensors for providing status assessments of various aspects of the electronic device 400. For example, the sensor component 414 may detect the on/off state of the electronic device 400 and the relative positioning of components, such as the display and keypad of the electronic device 400; it may also detect a change in position of the electronic device 400 or one of its components, the presence or absence of user contact with the electronic device 400, the orientation or acceleration/deceleration of the electronic device 400, and a change in its temperature. The sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 414 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 416 is used to facilitate communication between the electronic device 400 and other devices, either wired or wireless. The electronic device 400 may access a wireless network based on a communication standard, such as WiFi, an operator network (e.g., 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 416 further includes a Near Field Communication (NFC) module to facilitate short range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, ultra Wideband (UWB) technology, bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 400 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for implementing a voice data processing method as provided by an embodiment of the application.
In an exemplary embodiment, a non-transitory computer-readable storage medium is also provided, such as memory 404, that includes instructions executable by processor 420 of electronic device 400 to perform the above-described method. For example, the non-transitory storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
Fig. 7 is a block diagram of an electronic device 500 in accordance with another embodiment of the application. For example, electronic device 500 may be provided as a server. Referring to fig. 7, electronic device 500 includes a processing component 522 that further includes one or more processors and memory resources represented by memory 532 for storing instructions, such as applications, executable by processing component 522. The application programs stored in the memory 532 may include one or more modules each corresponding to a set of instructions. In addition, the processing component 522 is configured to execute instructions to perform a voice data processing method provided by an embodiment of the present application.
The electronic device 500 may also include a power component 526 configured to perform power management of the electronic device 500, a wired or wireless network interface 550 configured to connect the electronic device 500 to a network, and an input/output (I/O) interface 558. The electronic device 500 may operate based on an operating system stored in the memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (11)

1. A voice data processing method, comprising:
acquiring a question to be replied;
determining reply content in a text format corresponding to the question to be replied, wherein the reply content comprises at least one target text;
determining a target sound speed value and a target volume value of at least one target text in the reply content;
and generating a reply voice corresponding to the question to be replied according to the reply content, and the target sound speed value and the target volume value of the target text in the reply content.
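By way of illustration, the four steps of claim 1 can be read as the following toy pipeline. All function names, texts, and default values below are hypothetical stand-ins rather than features of the patent, which prescribes no particular reply generator or TTS engine; the per-character combination of value and weight is the assumption discussed after claim 2 below.

```python
def generate_reply_text(question: str) -> str:
    # Stand-in for "determining reply content in a text format".
    return "好的，已打开空调" if "空调" in question else "你好"

def synthesize(prosody_plan: list[tuple[str, float, float]]) -> None:
    # Stand-in for TTS: a real system would drive a speech engine
    # with the per-character sound speed and volume targets.
    for ch, speed, volume in prosody_plan:
        print(f"{ch}: speed={speed:.2f} volume={volume:.2f}")

def reply(question: str,
          speed0: dict[str, float], speed_w: dict[str, float],
          vol0: dict[str, float], vol_w: dict[str, float]) -> None:
    text = generate_reply_text(question)                  # steps 1-2
    plan = [(ch,
             speed0.get(ch, 1.0) * speed_w.get(ch, 1.0),  # step 3 (combination
             vol0.get(ch, 1.0) * vol_w.get(ch, 1.0))      #  assumed multiplicative)
            for ch in text]                               # each character = one target text
    synthesize(plan)                                      # step 4

reply("请打开空调", {}, {}, {}, {})
```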
2. The method of claim 1, wherein the determining a target sound speed value and a target volume value of at least one target text in the reply content comprises:
for each target text, acquiring an initial sound speed value, a target sound speed weight value, an initial volume value, and a target volume weight value of the target text;
acquiring a target sound speed value of the target text according to the initial sound speed value and the target sound speed weight value;
and acquiring a target volume value of the target text according to the initial volume value and the target volume weight value.
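Claim 2 combines each character's initial value with its weight value but leaves the combination rule open. A minimal sketch, assuming the weight scales the initial value multiplicatively:

```python
def apply_weight(initial_value: float, weight_value: float) -> float:
    # The claim does not fix the combination rule; a simple product
    # is assumed here for illustration.
    return initial_value * weight_value

# e.g. an initial sound speed of 4.0 characters/second and a weight of
# 1.25 for a character habitually spoken faster than average:
target_speed = apply_weight(4.0, 1.25)   # 5.0
```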
3. The method of claim 2, wherein the acquiring the target sound speed weight value and the target volume weight value of the target text comprises:
acquiring at least one first voice data;
for each first voice data, determining a first sound speed value and a first volume value of at least one first text included in the first voice data, wherein the target text is one of the first texts;
for each first text, acquiring a sound speed weight value of the first text according to the first sound speed value of the first text in the first voice data;
determining the target sound speed weight value of the target text from the sound speed weight values of the first texts;
for each first text, acquiring a volume weight value of the first text according to the first volume value of the first text in the first voice data;
and determining the target volume weight value of the target text from the volume weight values of the first texts.
4. The method of claim 3, wherein the acquiring, for each first text, a sound speed weight value of the first text according to the first sound speed value of the first text in the first voice data comprises:
acquiring a sound speed average value of the first voice data according to the first sound speed value of at least one first text included in the first voice data;
for each first text, acquiring a sound speed relative value of the first sound speed value of the first text relative to the sound speed average value, according to the first sound speed value of the first text in the first voice data and the sound speed average value of the first voice data;
and acquiring the sound speed weight value of the first text according to the sound speed relative value.
5. The method of claim 4, wherein, in a case where there are a plurality of first voice data, the acquiring the sound speed weight value of the first text according to the sound speed relative value comprises:
for each first text, respectively acquiring a sound speed relative value of the first sound speed value of the first text relative to the sound speed average value of each first voice data;
acquiring a first average value of the sound speed relative values according to the sound speed relative value for each first voice data;
and acquiring the sound speed weight value of the first text according to the first average value.
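Claims 3 to 5 derive a character's sound speed weight by comparing its speed in each sample utterance with that utterance's average and then averaging the resulting relative values across utterances; claims 6 and 7 below recite the same computation over volume values. A sketch under two assumptions the claims leave open: the relative value is a ratio to the utterance average, and the weight equals the averaged relative value.

```python
from collections import defaultdict
from statistics import mean

def character_weights(utterances: list[dict[str, float]]) -> dict[str, float]:
    # utterances: one dict per "first voice data", mapping each
    # "first text" (character) to its measured value in that utterance
    # (sound speed for claims 4-5, volume for claims 6-7).
    relatives = defaultdict(list)
    for utterance in utterances:
        avg = mean(utterance.values())         # per-utterance average value
        for ch, value in utterance.items():
            relatives[ch].append(value / avg)  # relative value, assumed a ratio
    # Average the relative values across utterances; the weight is
    # assumed to equal that average (the "first"/"second average value").
    return {ch: mean(vals) for ch, vals in relatives.items()}

speed_weights = character_weights([
    {"你": 4.2, "好": 5.0, "吗": 3.8},
    {"你": 4.0, "好": 5.4},
])
# speed_weights["好"] > 1: "好" is habitually spoken faster than average
```

The same function applied to per-character volume measurements yields the volume weight values of claims 6 and 7.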
6. The method of claim 3, wherein the acquiring, for each first text, a volume weight value of the first text according to the first volume value of the first text in the first voice data comprises:
acquiring a volume average value of the first voice data according to the first volume value of at least one first text included in the first voice data;
for each first text, acquiring a volume relative value of the first volume value of the first text relative to the volume average value, according to the first volume value of the first text in the first voice data and the volume average value of the first voice data;
and acquiring the volume weight value of the first text according to the volume relative value.
7. The method of claim 6, wherein, in a case where there are a plurality of first voice data, the acquiring the volume weight value of the first text according to the volume relative value comprises:
for each first text, respectively acquiring a volume relative value of the first volume value of the first text relative to the volume average value of each first voice data;
acquiring a second average value of the volume relative values according to the volume relative value for each first voice data;
and acquiring the volume weight value of the first text according to the second average value.
8. The method of claim 2, wherein the acquiring the initial sound speed value and the initial volume value of the target text comprises:
acquiring second voice data;
acquiring a second sound speed value and a second volume value of at least one second text included in the second voice data, wherein the target text is one of the second texts;
determining the initial sound speed value of the target text from the second sound speed values of the at least one second text;
and determining the initial volume value of the target text from the second volume values of the at least one second text.
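Claim 8 draws the initial values from a separate reference recording (the second voice data). How the initial value is "determined from" the per-occurrence values is left open; the sketch below assumes the mean over the target character's occurrences in that recording:

```python
from statistics import mean

def initial_values(reference: list[tuple[str, float, float]],
                   target: str) -> tuple[float, float]:
    # reference: the "second voice data" as (character, speed, volume)
    # measurements; one character may occur several times.
    speeds = [s for ch, s, _ in reference if ch == target]
    volumes = [v for ch, _, v in reference if ch == target]
    # The mean over occurrences of the target character is assumed here.
    return mean(speeds), mean(volumes)

speed0, volume0 = initial_values(
    [("你", 4.1, 0.6), ("好", 5.2, 0.8), ("你", 3.9, 0.7)], "你")
# speed0 = 4.0, volume0 ≈ 0.65
```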
9. A voice data processing apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a question to be replied;
the first determining module is used for determining reply content in a text format corresponding to the question to be replied, wherein the reply content comprises at least one target text;
the second determining module is used for determining a target sound speed value and a target volume value of at least one target text in the reply content;
the first generation module is used for generating a reply voice corresponding to the question to be replied according to the reply content and the target sound speed value and the target volume value of the target text in the reply content.
10. An electronic device comprising a processor and a memory, the memory storing a program or instructions executable on the processor, which, when executed by the processor, implement the steps of the method of any one of claims 1 to 8.
11. A readable storage medium, characterized in that it has stored thereon a program or instructions which, when executed by a processor, implement the steps of the method according to any of claims 1 to 8.
CN202311209184.8A 2023-09-18 2023-09-18 Voice data processing method and device Pending CN117238287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311209184.8A CN117238287A (en) 2023-09-18 2023-09-18 Voice data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311209184.8A CN117238287A (en) 2023-09-18 2023-09-18 Voice data processing method and device

Publications (1)

Publication Number Publication Date
CN117238287A true CN117238287A (en) 2023-12-15

Family

ID=89082038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311209184.8A Pending CN117238287A (en) 2023-09-18 2023-09-18 Voice data processing method and device

Country Status (1)

Country Link
CN (1) CN117238287A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination