CN115132207A - Voice message processing method and device and electronic equipment - Google Patents

Voice message processing method and device and electronic equipment

Info

Publication number
CN115132207A
CN115132207A
Authority
CN
China
Prior art keywords
voice
voice message
message
text
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210744413.5A
Other languages
Chinese (zh)
Inventor
雷夏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vivo Mobile Communication Co Ltd
Original Assignee
Vivo Mobile Communication Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vivo Mobile Communication Co Ltd filed Critical Vivo Mobile Communication Co Ltd
Priority to CN202210744413.5A
Publication of CN115132207A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The application discloses a voice message processing method and device and an electronic device, belonging to the field of computer technology. The method comprises: acquiring a voice message to be processed and a target correction degree corresponding to it; determining a similarity threshold according to the target correction degree; and determining, through a voice correction model, the degree of similarity between the voice message to be processed and a key voice message, then performing personalized voice correction on the key voice message based on the similarity degree and the similarity threshold to obtain a target voice message. The key voice message corresponds to the voice message to be processed. The voice correction model is trained with sample voice of a target object, where the target object is the message recording object corresponding to the voice message to be processed. The target voice message has the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object.

Description

Voice message processing method and device and electronic equipment
Technical Field
The application belongs to the technical field of computers, and particularly relates to a voice message processing method and device and electronic equipment.
Background
With the continuous advancement of technology, people use electronic devices more and more frequently, and they often use the voice message function in applications when communicating. While voice messages bring great convenience, they are also vivid, carry strong user characteristics, and lose little information in transmission.
However, during use, when a user sends several long voice messages, an error in the middle of recording forces the user to repeatedly record new messages. As a result, sending voice messages takes a long time, and the efficiency of obtaining effective information from them is low.
Disclosure of Invention
The embodiments of the application provide a voice message processing method and device and an electronic device, which can solve the problems in the prior art that, when a voice message is being sent and an error occurs in the middle of recording, a new message must be repeatedly recorded, the sending of the voice message takes a long time, and the efficiency of obtaining effective information from the voice message is low.
In a first aspect, an embodiment of the present application provides a method for processing a voice message, where the method includes:
acquiring a voice message to be processed and a target correction degree corresponding to the voice message to be processed;
determining a similarity threshold according to the target correction degree;
determining the similarity degree between the voice message to be processed and a key voice message through a voice correction model, and performing personalized voice correction on the key voice message based on the similarity degree and the similarity degree threshold value to obtain a target voice message; the key voice message corresponds to the voice message to be processed;
the voice correction model is obtained by utilizing sample voice training of a target object; the target object is a message recording object corresponding to the voice message to be processed; the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
In a second aspect, an embodiment of the present application provides a voice message processing apparatus, where the apparatus includes:
an acquisition module, used for acquiring a voice message to be processed and a target correction degree corresponding to the voice message to be processed;
the determining module is used for determining a similarity threshold according to the target correction degree;
the correction module is used for determining the similarity degree between the voice message to be processed and the key voice message through a voice correction model, and performing personalized voice correction on the key voice message based on the similarity degree and the similarity degree threshold value to obtain a target voice message; the key voice message corresponds to the voice message to be processed;
the voice correction model is obtained by utilizing sample voice training of a target object; the target object is a message recording object corresponding to the voice message to be processed; the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor, and when executed by the processor, the program or instructions implement the steps of the method according to the first aspect.
In a fourth aspect, embodiments of the present application provide a readable storage medium, on which a program or instructions are stored, which when executed by a processor implement the steps of the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to execute a program or instructions to implement the method according to the first aspect.
In the embodiments of the application, the voice message to be processed and its corresponding target correction degree are first acquired; a similarity threshold is then determined according to the target correction degree; finally, the degree of similarity between the voice message to be processed and the key voice message is determined through a voice correction model trained with sample voice of the target object, and personalized voice correction is performed on the key voice message based on the similarity degree and the similarity threshold to obtain the target voice message. By determining the similarity threshold from the acquired target correction degree and using the voice correction model to perform personalized voice correction on the key voice message, a target voice message having the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object can be obtained. This avoids repeated recording when the user makes a mistake while recording a voice message and improves the efficiency of obtaining voice messages. At the same time, the final target voice message can share the tone, intonation, speed, and other characteristics of the user's original recording, so that it sounds more like a message spoken by the user, improving the accuracy of voice message processing.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a voice message processing method according to an embodiment of the present application;
fig. 2 is a schematic diagram of acquiring a pending voice message according to an embodiment of the present application;
FIG. 3 is a schematic illustration of selecting a target correction degree provided by an embodiment of the present application;
FIG. 4 is a schematic illustration of another manner of selecting a target correction degree provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a structure of a speech modification model provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a voice network provided by an embodiment of the present application;
FIG. 7 is a simplified schematic diagram of inputs and outputs during speech modification model training provided by an embodiment of the present application;
fig. 8 is a schematic overall flowchart of a voice message processing method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a voice message processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
fig. 11 is a hardware structure diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present disclosure.
The terms "first", "second", and the like in the description and claims of the present application are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the application can operate in sequences other than those illustrated or described herein. In addition, "and/or" in the specification and claims means at least one of the connected objects, and the character "/" generally means that the preceding and succeeding objects are in an "or" relationship.
In some embodiments, when a user sends a long voice message, errors in the middle of recording may force the user to record repeatedly before producing a message that can be sent. This lengthens the time it takes to send the voice message while the amount of effective information in it stays small, so the efficiency of acquiring information from such voice messages is low. To solve these problems, the embodiments of the present application provide a voice message processing method, apparatus, and electronic device. In a scenario where a user sends a voice message, the user may simply continue recording after a mistake instead of re-recording. After recording ends, the method converts the at least one recorded voice message into a target voice message that has the voiceprint characteristics of the user who recorded it and the same voice characteristics as the recorded messages; for example, if the recorded messages were spoken in a cheerful tone, the target voice message is also cheerful. Finally, the target voice message is displayed in the session interface, so the user obtains a voice message with concise content, the voiceprint characteristics of the message recording object, and the same voice characteristics as the original messages. This reduces the time the user spends and improves the efficiency of acquiring voice information.
A voice message processing method, a voice message processing apparatus, and an electronic device provided in the embodiments of the present application are described in detail below with reference to the accompanying drawings through specific embodiments and application scenarios thereof.
Fig. 1 is a schematic flow chart of a voice message processing method according to an embodiment of the present application. The voice message processing method may include the contents shown in S101 to S103.
In S101, a to-be-processed voice message and a target correction degree corresponding to the to-be-processed voice message are acquired.
The voice message to be processed is a voice message recorded by the message recording object, and it comprises at least one voice message. If it comprises a plurality of voice messages, they may be selected one by one, or the plurality of recorded voice messages may be selected by sliding; the specific manner is determined by the practical application and is not limited in the embodiments of the present application. After the voice message to be processed is acquired in this way, the user can enter the voice message processing interface by sliding left, so as to convert the selected voice messages into the target voice message, as shown in fig. 2.
It should be noted that the target correction degree may be selected by the user. As shown in fig. 3, after entering the voice message processing interface, the user can manually select the target correction degree through a voice correction control on the interface. Specifically, the control can be slid up and down: sliding up means the user needs less correction, and sliding to the top means only error correction is needed, without simplifying the text; sliding down means the user needs more correction, simplifying the text and outputting a summary. Alternatively, the system may confirm the degree automatically according to the session scene and the like: the user clicks an automatic button on the interface, as shown in fig. 4, and the system automatically determines the target correction degree.
The value of the target correction degree can be any value between 0 and 1, where 0 represents the largest simplification degree, with only a summary output, and 1 represents no simplification, with only errors corrected.
In S102, a similarity degree threshold is determined according to the target correction degree.
It is to be noted that the target correction degree may be determined by ω = a × S_thd + b × S_type, where ω is the target correction degree, a and b are normalization weight coefficients set as needed, S_thd is the similarity threshold, and S_type is the simplification degree of the text content, which takes values from 0 to 1 in correspondence with the value of ω.
For example, when ω = 0, S_type = 0; when ω ∈ (0, 0.3], S_type = 0.3; when ω ∈ (0.3, 0.6], S_type = 0.6; when ω ∈ (0.6, 0.9], S_type = 0.9; and when ω ∈ (0.9, 1], S_type = 1.
S_thd can be calculated by a corresponding formula, which appears in the original filing only as an image (Figure BDA0003719076380000051) and is not reproduced here.
The larger the value of the similarity threshold, the closer the final target voice message is to the voice message as re-presented by the user who recorded it.
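As an illustration of this step, the piecewise table above and the relation ω = a × S_thd + b × S_type can be turned into a small Python sketch. This is a minimal sketch under stated assumptions: the linear inversion used to recover S_thd is assumed (the patent gives its S_thd formula only as an image), and the weights a and b are placeholder values.

```python
def simplification_degree(omega: float) -> float:
    """Map the target correction degree omega (0..1) to S_type using
    the piecewise table above: 0 -> 0, (0, 0.3] -> 0.3, and so on."""
    if omega <= 0:
        return 0.0
    for upper, s_type in ((0.3, 0.3), (0.6, 0.6), (0.9, 0.9), (1.0, 1.0)):
        if omega <= upper:
            return s_type
    return 1.0


def similarity_threshold(omega: float, a: float = 0.5, b: float = 0.5) -> float:
    """Recover S_thd from omega = a*S_thd + b*S_type; the linear
    inversion and the placeholder weights a = b = 0.5 are assumptions,
    since the patent's own S_thd formula is shown only as an image."""
    s_type = simplification_degree(omega)
    return max(0.0, min(1.0, (omega - b * s_type) / a))


# Example: a low correction degree requests heavy simplification.
omega = 0.2
print(simplification_degree(omega))   # 0.3
print(similarity_threshold(omega))    # 0.1 under the placeholder weights
```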
In S103, determining the similarity between the voice message to be processed and the key voice message through the voice correction model, and performing personalized voice correction on the key voice message based on the similarity and a similarity threshold value to obtain a target voice message; the key voice message corresponds to a pending voice message.
The voice correction model is obtained by utilizing sample voice training of a target object; the target object is a message recording object of the voice message to be processed; the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
The key voice message refers to the voice message obtained after the voice message to be processed has undergone voice editing. This processing may correct erroneous content in the voice message to be processed and, when the voice content is extensive (for example, the messages are long or numerous), may extract the key content of the voice message to be processed, that is, produce a simplified voice message.
It should be noted that the sample voice of the target object refers to the local voice of the target user, i.e., historical voice messages recorded by the target object; this local voice can be stored in a local database and called directly when needed. The voice characteristics may be at least one of tone, intonation, speed, and volume.
Personalized voice correction refers to correcting the key voice message into a target voice message that has the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object. The target voice message finally obtained by the user is thus similar to the voice message to be processed in tone, intonation, voiceprint, and so on. This avoids repeated recording when the user makes a mistake while recording, improves the efficiency of obtaining voice messages, and lets the final target voice message share the tone, intonation, speed, and other characteristics of the user's original recording, so that it sounds more like a message spoken by the user.
In the embodiments of the application, the voice message to be processed and its corresponding target correction degree are first acquired; a similarity threshold is then determined according to the target correction degree; finally, the degree of similarity between the voice message to be processed and the key voice message is determined through a voice correction model trained with sample voice of the target object, and personalized voice correction is performed on the key voice message based on the similarity degree and the similarity threshold to obtain the target voice message. By determining the similarity threshold from the acquired target correction degree and using the voice correction model to perform personalized voice correction on the key voice message, a target voice message having the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object can be obtained. This avoids repeated recording when the user makes a mistake while recording a voice message, improves the efficiency of obtaining voice messages, and lets the final target voice message share the tone, intonation, speed, and other characteristics of the user's original recording, so that it sounds more like a message spoken by the user, improving the accuracy of voice message processing.
In one possible embodiment of the present application, the step of acquiring the key voice message includes: converting the voice message to be processed into a message text through a voice conversion model; extracting key contents from the message text through a text extraction model to obtain a key text; the key text is converted into a key voice message through a text conversion model.
In the embodiments of the application, the voice conversion model is a model that converts a voice message into text; it converts the voice message to be processed into a message text so that the key content can subsequently be extracted more quickly. The text extraction model simplifies a voice message to be processed that is long or consists of many messages by extracting its key content, reducing the time the recipient spends listening to the voice message and improving the efficiency with which the user acquires its content. The text conversion model is a model that converts text into a voice message; it converts the extracted key text into the key voice message, making it convenient for the recipient to obtain the effective content and the emotional information of the message recording object.
The text extraction model may be a BERTSUM model or another model, as long as it can extract the key content from the message text; this is not specifically limited in the embodiments of the present application.
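To illustrate how the three models chain together, a minimal Python sketch of the key-voice-message pipeline follows. The transcribe, summarize, and synthesize interfaces, and the mapping from the content simplification degree to an extraction ratio, are hypothetical; the patent does not fix these APIs.

```python
from dataclasses import dataclass


@dataclass
class KeyMessagePipeline:
    """Speech -> text -> key text -> speech, per the steps above.

    The three attributes stand in for the voice conversion model, the
    text extraction model (e.g. a BERTSUM-style extractive summarizer),
    and the text conversion model; their method names are assumptions.
    """
    asr_model: object    # voice conversion model: speech -> text
    extractor: object    # text extraction model: text -> key text
    tts_model: object    # text conversion model: text -> speech

    def key_voice_message(self, audio: bytes, s_type: float) -> bytes:
        text = self.asr_model.transcribe(audio)
        if s_type >= 1.0:
            key_text = text  # S_type = 1: correct errors only, no simplification
        else:
            # Assumed mapping: keep a fraction of sentences that grows
            # with S_type, so S_type = 0 yields the shortest summary.
            key_text = self.extractor.summarize(text, ratio=max(s_type, 0.1))
        return self.tts_model.synthesize(key_text)
```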
Optionally, before extracting key content from the message text through the text extraction model to obtain the key text, the voice message processing method may further include: determining the content simplification degree according to the target correction degree.
In this embodiment, the content simplification degree corresponds to the target correction degree, so once the target correction degree is determined, the content simplification degree can be determined.
The content simplification degree can be 0, 0.3, 0.6, 0.9, or 1, and it is applied in the text extraction model. When the content simplification degree is 1, the text extraction model is not used (its output equals its input); only the text content needs error correction, no simplification is performed, and the result stays as consistent as possible with the original voice message. When the content simplification degree is 0, the text extraction model reduces the text to the maximum degree to obtain the key text. The degree of reduction the user requires of the target voice message can thus be determined through the content simplification degree.
Correspondingly, extracting the key content from the message text through the text extraction model to obtain the key text may include: extracting the key content from the message text through a text extraction model matched with the content simplification degree to obtain the key text.
That is to say, each content simplification degree has a matching text extraction model, and with that model the key content required by the user can be extracted from the message text to obtain the key text.
Optionally, before extracting key content from the message text through the text extraction model to obtain the key text, the voice message processing method may further include: performing error correction processing on the message text through a text error correction model.
Because the voice message to be processed may contain mispronunciations or slips of the tongue, a text error correction model can be used to correct errors in the converted text. The text error correction model can adopt an existing model structure, such as a traditional natural language error correction model, for example a Chinese N-Gram language model; the model structure does not need to be rebuilt, which saves resources.
Correspondingly, extracting the key content from the message text through the text extraction model to obtain the key text may include: extracting the key content from the error-corrected message text through the text extraction model to obtain the key text.
In the embodiments of the application, after the text information undergoes error correction, the text extraction model is used to extract the key content from the corrected message text, so that the obtained key text is more accurate and expresses more precisely the meaning the message sender originally intended.
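As a concrete illustration of this error-correction step, the sketch below scores candidate replacements with an add-one-smoothed bigram language model, in the spirit of the traditional N-Gram approach mentioned above. The confusion-candidate interface and the toy corpus are assumptions; a practical Chinese error-correction system would be far more elaborate.

```python
import math
from collections import Counter


class BigramCorrector:
    """Pick, for a suspect position, the candidate token that makes the
    sentence most fluent under a bigram language model."""

    def __init__(self, corpus_tokens):
        self.unigrams = Counter(corpus_tokens)
        self.bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))

    def logprob(self, tokens):
        # Add-one smoothed bigram log-probability of the token sequence.
        lp = 0.0
        vocab = len(self.unigrams)
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + vocab
            lp += math.log(num / den)
        return lp

    def correct(self, tokens, position, candidates):
        best = max(candidates, key=lambda c: self.logprob(
            tokens[:position] + [c] + tokens[position + 1:]))
        return tokens[:position] + [best] + tokens[position + 1:]


# Toy corpus and confusion set, purely for illustration.
corrector = BigramCorrector("今天 天气 很 好 今天 心情 很 好".split())
print(corrector.correct("今天 天气 很 号".split(), 3, ["好", "号"]))
```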
The voice conversion model and the text conversion model in the above embodiments may perform conversion according to the text-voice mapping relationship corresponding to the target object, so that the converted text and voice carry the individual characteristics of the message recording object. This alleviates the problems of accent and individualized expression when converting voice into text, and makes text-to-voice conversion more fluent and vivid, as described in the following embodiments.
In one possible embodiment of the present application, converting a pending voice message into a message text through a voice conversion model may include: and converting the voice message to be processed into a message text based on the text voice mapping relation corresponding to the target object through the voice conversion model.
And the text voice mapping relation is obtained by carrying out fine tuning training on the initial voice correction model by utilizing the sample voice of the target object. The text-to-speech mapping relationship comprises a mapping relationship between words and at least one speech segment, wherein the at least one speech segment has a voiceprint characteristic of the target object, and the speech characteristic of each speech segment is different.
It should be noted that the initial speech modification model is obtained by training a speech modification model to be trained by using a universal sample speech, where the universal sample speech may be a public speech obtained from a network or a database, and the speech modification model to be trained may use a model structure of an existing model, such as a joint task learning training model, and does not need to reconstruct the model structure, which may save resources.
In the embodiments of the application, the initial voice correction model is fine-tuned with sample voice of the target object, producing a voice correction model with the characteristics of the target object; the text-voice mapping relationship derived from this model therefore also carries those characteristics. Through the text-voice mapping relationship corresponding to the target object, the voice message to be processed can be quickly segmented into a number of candidate voice segments, the weights of the text segments corresponding to those candidates can be determined, and the corresponding text obtained.
In one possible embodiment of the present application, converting the key text into the key voice message through the text conversion model may include: and converting the key text into a key voice message through a text conversion model based on the text voice mapping relation corresponding to the target object.
And the text voice mapping relation is obtained by carrying out fine tuning training on the initial voice correction model by utilizing the sample voice of the target object. The text-to-speech mapping relation comprises a mapping relation between characters and at least one speech segment, wherein the at least one speech segment has the voiceprint characteristics of the target object, and the speech characteristics of each speech segment are different.
The training of the initial speech modification model is the same as that in the above embodiment, and the process of the fine tuning training is also the same, which is not described again in this embodiment.
In the embodiments of the application, fine-tuning the initial voice correction model with sample voice of the target object yields a voice correction model that better reflects the target object's characteristics, and the text-voice mapping relationship obtained from it does as well. Through the text-voice mapping relationship corresponding to the target object, the edited voice can be generated quickly and accurately.
The text-voice mapping relationship in the above embodiments is obtained while fine-tuning the voice correction model with sample voice of the target object. It is a mapping between one character and a plurality of voice segments, and the mapping table x_{u,i,j} → y_{u,i,j} can be written as a formula (given in the original filing only as an image, Figure BDA0003719076380000101), where I is the dimension of the voice, J is the dimension of the text, and ω_{i,j} is the screening weight of text j corresponding to voice segment i of user u.
In one possible embodiment of the present application, the obtaining step of the text-to-speech mapping relationship may include: the method comprises the steps of obtaining a plurality of voice segments from sample voice of a target object, obtaining characters corresponding to each voice segment through a text network of a voice correction model, and determining a text-to-voice mapping relation according to each voice segment and the characters corresponding to each voice segment.
That is to say, a plurality of voice segments in the sample voice of the target object can be obtained and input into the text network of the voice correction model to obtain characters corresponding to each voice segment, and according to each voice segment and the characters corresponding to each voice segment, the mapping relationship between the characters and the plurality of voice segments, that is, the text-to-voice mapping relationship, can be determined.
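A minimal sketch of such a mapping table follows, assuming the training process yields (user, voice segment, character, weight) tuples; the tuple layout and the dictionary representation are illustrative assumptions, since the patent shows the table itself only as an image.

```python
from collections import defaultdict


def build_text_voice_mapping(tuples):
    """Build a per-user table from characters to weighted voice
    segments, mirroring x_{u,i,j} -> y_{u,i,j} with screening weights
    w_{i,j}; the input layout is an assumption for illustration."""
    mapping = defaultdict(lambda: defaultdict(list))
    for user, segment, char, weight in tuples:
        mapping[user][char].append((segment, weight))
    # Sort candidates by weight so the most likely voice segment for a
    # given character is returned first.
    for user_table in mapping.values():
        for candidates in user_table.values():
            candidates.sort(key=lambda sw: sw[1], reverse=True)
    return mapping


table = build_text_voice_mapping([
    ("u1", "seg_003.wav", "你", 0.9),   # hypothetical segment files
    ("u1", "seg_017.wav", "你", 0.4),
])
print(table["u1"]["你"][0])  # highest-weight segment for this character
```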
In one possible embodiment of the present application, determining the similarity between the voice message to be processed and the key voice message through the voice modification model may include: acquiring the voice characteristics of the key voice message and the voice characteristics of the voice message to be processed through a voice network of the voice correction model; acquiring text voice combination characteristics of the key voice message based on voice characteristics of the key voice message through a text network of the voice correction model, and acquiring text voice combination characteristics of the voice message to be processed based on voice characteristics of the voice message to be processed; and determining the similarity degree between the key voice message and the voice message to be processed based on the text voice combination characteristic of the key voice message and the text voice combination characteristic of the voice message to be processed through the similarity evaluation network of the voice correction model.
The voice correction model comprises three networks: a voice network, a text network, and a similarity evaluation network. The voice network divides an input voice message into a plurality of voice-segment hidden vectors and outputs a voice feature vector that expresses the voice characteristics of the input message, such as tone, intonation, speed, and volume. In other words, the voice network reconstructs the input voice message in terms of these characteristics: it obtains the weights of the message's voice segments on the various voice characteristics and then reconstructs the message according to those weights. The reconstructed voice message is input into the text network, which converts it into combined text-voice features that include both character features and voice features; that is, the text obtained through the text network has both character-level and acoustic characteristics. Finally, the combined text-voice features of the two input voice messages, obtained through the voice network and the text network, are input into the similarity evaluation network to determine the degree of similarity between the two messages.
In the present application, the key voice message and the voice message to be processed are each processed by the three networks of the voice correction model before the similarity degree is determined. The voice network can determine the voice characteristics of the input message, such as tone, intonation, speed, and volume, more accurately, so after passing through the text network the converted text has both character-level and acoustic characteristics and is more accurate, and the evaluation result of the similarity evaluation network is therefore more accurate.
Fig. 5 is a schematic structural diagram of the voice correction model. As can be seen from fig. 5, the model consists of three parts: the voice network, the text network, and the similarity evaluation network. A voice message input into the model is divided by the voice network into a plurality of voice-segment hidden vectors, as shown in fig. 6; the resulting voice feature vector expresses the voice characteristics of the input message, such as tone, intonation, speed, and volume. That is, the voice network reconstructs the input voice message in these respects, obtaining the weights of its voice segments on the various voice characteristics and reconstructing the message accordingly. Unlike the prior-art approach of converting a voice message directly into text, here the text obtained after the message passes through the text network has both character-level and acoustic characteristics, so the converted text is more accurate. The reconstructed voice message is then input into the text network and converted into combined text-voice features that include both character features and voice features; unlike the prior art, which converts a mere text message into a voice message, the present application converts a voice message that retains the original voice characteristics. Finally, the combined text-voice features of the two input voice messages, obtained through the voice network and the text network, are input into the similarity evaluation network to determine the similarity of the two messages.
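For illustration, a minimal PyTorch-flavored sketch of this three-network structure is given below. Every dimension, layer choice, and the cosine-similarity head are assumptions made for the sake of a runnable example, not the patent's actual architecture.

```python
import torch
import torch.nn as nn


class VoiceCorrectionModel(nn.Module):
    def __init__(self, n_mels=80, hid=256, feat=128):
        super().__init__()
        # Voice network: turns the utterance into per-segment hidden
        # vectors carrying tone / intonation / speed / volume cues.
        self.voice_net = nn.GRU(n_mels, hid, batch_first=True)
        # Text network: maps pooled segment vectors to combined
        # text-voice features (character-level plus acoustic).
        self.text_net = nn.Sequential(
            nn.Linear(hid, feat), nn.ReLU(), nn.Linear(feat, feat))
        # Similarity evaluation network: here a simple cosine head.
        self.sim_net = nn.CosineSimilarity(dim=-1)

    def combined_features(self, mel):       # mel: (batch, time, n_mels)
        segments, _ = self.voice_net(mel)   # (batch, time, hid)
        pooled = segments.mean(dim=1)       # utterance-level vector
        return self.text_net(pooled)        # (batch, feat)

    def similarity(self, mel_a, mel_b):
        return self.sim_net(self.combined_features(mel_a),
                            self.combined_features(mel_b))


model = VoiceCorrectionModel()
pending = torch.randn(1, 120, 80)  # to-be-processed voice message (dummy)
key = torch.randn(1, 60, 80)       # key voice message (dummy)
print(model.similarity(pending, key).item())
```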
In a possible embodiment of the present application, performing personalized voice modification on the key voice message based on the similarity degree and the similarity degree threshold to obtain the target voice message may include: and under the condition that the similarity degree is smaller than the similarity degree threshold value, transmitting the similarity degree to at least one of the voice conversion model and the text conversion model so that the at least one of the voice conversion model and the text conversion model adjusts the respective output result until the similarity degree between the key voice message and the voice message to be processed is larger than or equal to the similarity degree threshold value, and obtaining the target voice message.
The larger the value of the similarity threshold, the closer the final target voice message is to the voice message as re-presented by the user who recorded it.
That is, the embodiments of the present application use the PID (proportional-integral-derivative) control principle, in which a control deviation is formed from a given value and an actual output value, and the deviation is combined linearly through proportional, integral, and derivative terms to form a control quantity that controls the controlled object. In this application, the given value is the similarity threshold, the actual output value is the similarity degree, the deviation is the difference between the similarity degree and the similarity threshold, and the controlled object is the voice conversion model and/or the text conversion model. The character and voice weight distribution of the conversion is corrected through this difference, so that the text produced by the voice conversion model is more accurate and the voice produced by the text conversion model is closer to the voice characteristics of the original voice.
The PID control principle in the embodiments of the present application is mainly used for real-time active feedback: during use, the difference between the similarity degree and the similarity threshold is fed back to the voice conversion model and the text conversion model, which modify the character and voice weight distribution of the conversion according to this difference, making the converted text more accurate and the converted voice closer to the voice characteristics of the original voice.
It can be seen from the text-voice mapping relationship that the screening weights of the characters corresponding to one voice segment differ, and weights may be mis-assigned when converting voice to text or text to voice, so an error exists between the output and the input. When the subsequently calculated similarity degree does not meet the similarity threshold, the error value is fed back to at least one of the voice conversion model and the text conversion model, and the output of the voice conversion model and/or the text conversion model is adjusted so as to adjust the final similarity degree and obtain the target voice message.
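A proportional-only feedback loop in this spirit might be sketched as follows. The pipeline.adjust interface and the purely proportional update are assumptions, since the text invokes the PID principle without fixing the control law, and the audio inputs are assumed to already be in whatever feature format the models expect.

```python
def personalized_correction(pending_audio, s_type, pipeline, model, s_thd,
                            max_rounds=10, gain=0.5):
    """Regenerate the key voice message until its similarity to the
    pending message reaches the threshold s_thd (a full PID loop would
    also carry integral and derivative terms)."""
    key_audio = pipeline.key_voice_message(pending_audio, s_type)
    for _ in range(max_rounds):
        sim = float(model.similarity(pending_audio, key_audio))
        if sim >= s_thd:
            break  # target voice message reached
        # Feed the deviation back so the conversion models re-weight
        # their character/voice candidate distributions (hypothetical).
        pipeline.adjust(gain * (s_thd - sim))
        key_audio = pipeline.key_voice_message(pending_audio, s_type)
    return key_audio
```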
In one possible embodiment of the present application, the obtaining of the voice message to be processed and the target correction degree corresponding to the voice message to be processed may include: responding to a first input of a user to a voice message in a conversation interface, and determining a voice message to be processed; and responding to a second input of the user to the voice correction control in the conversation interface, and determining a target correction degree corresponding to the voice message to be processed.
The first input may be a one-by-one selection input or a sliding selection input, and the second input may be a vertical sliding input or a click input; neither is specifically limited in this embodiment, and both are determined according to the practical application.
Specifically, a plurality of recorded voice messages can be selected by sliding, as shown in fig. 2. After entering the voice message processing interface, the user can manually select the target correction degree through the voice correction control on the interface: the control can be slid up and down, where sliding up means less correction is needed, sliding to the top means only error correction without simplifying the text, and sliding down means more correction, simplifying the text and outputting a summary. Alternatively, the user can click the automatic button on the interface, as shown in fig. 4, and the system automatically determines the target correction degree according to the session scene and the like.
In one possible embodiment of the present application, the training step of the speech modification model may include: pre-training a voice correction model to be trained by using a universal sample voice; and carrying out fine tuning training on the pre-trained voice correction model by using the sample voice of the target object until the training is finished, and obtaining the voice correction model.
The general sample voice refers to public voice rather than the voice of a specific person or group, and it may be obtained from a network or a voice database. The sample voice of the target object refers to the target object's own voice; it may be voice messages recorded by the target object before the voice to be processed, or the target object's voice messages in the device's local voice database, with the user's authorization. The voice correction model to be trained can use an existing model structure, such as a joint-task learning training model, so the structure does not need to be rebuilt, which saves resources.
In the embodiments of the application, the voice correction model is obtained by pre-training the model to be trained with general sample voice, i.e., public voice, and then fine-tuning the general model with sample voice of the target object. This yields a voice correction model with the target object's characteristics; when this model is used to determine the similarity between the voice message to be processed and the key voice message and to correct the key voice message, the corrected message carries the target object's characteristics and is closer to a restatement by the target object.
Optionally, pre-training the voice correction model to be trained with general sample voice may include: acquiring, through the voice network of the model to be trained, the voice characteristics of at least two sample voice messages in the general sample voice, where any two sample voice messages have a preset similarity degree; acquiring, through the text network of the model to be trained, the combined text-voice features of each sample voice message based on those voice characteristics; determining, through the similarity evaluation network of the model to be trained, the similarity degree between any two sample voice messages based on their combined text-voice features; and training the voice network and the text network based on the difference between the determined similarity degree of any two sample voice messages and their preset similarity degree, until the determined similarity degree is greater than or equal to the preset one, obtaining the pre-trained voice correction model.
The voice correction model comprises three networks: a voice network, a text network, and a similarity evaluation network. The voice network divides an input voice message into a plurality of voice-segment hidden vectors and outputs a voice feature vector that expresses the voice characteristics of the input message, such as tone, intonation, speed, and volume. In other words, the voice network reconstructs the input voice message in terms of these characteristics: it obtains the weights of the message's voice segments on the various voice characteristics and then reconstructs the message according to those weights. The reconstructed voice message is input into the text network, which converts it into combined text-voice features that include both character features and voice features; that is, the text obtained through the text network has both character-level and acoustic characteristics. Finally, the combined text-voice features of the two input voice messages, obtained through the voice network and the text network, are input into the similarity evaluation network to determine the degree of similarity between the two messages.
In the embodiments of the application, any two sample voices with a preset similarity degree pass through the voice network and the text network to obtain their respective combined text-voice features, and then any two combined features are input into the similarity evaluation network to determine the similarity of the two sample voices. If the determined similarity degree of any two sample voice messages is greater than or equal to their preset similarity degree, the voice correction model is trained; otherwise, the voice network and the text network continue to be trained until the condition is met. Through this process, a general voice correction model can be trained, so that the simplified voice message it corrects is closer to the original voice message.
After pre-training produces the general voice correction model, it can be fine-tuned with sample voice of the target object to obtain a voice correction model that conforms to the target object's voice characteristics. Specifically, the fine-tuning process may include: acquiring, through the voice network of the pre-trained model, the voice characteristics of at least two sample voice messages in the target object's sample voice, where any two of them have a preset similarity degree; acquiring, through the text network of the pre-trained model, the combined text-voice features of each sample voice message based on those voice characteristics; determining, through the similarity evaluation network of the pre-trained model, the similarity degree between any two sample voice messages based on their combined text-voice features; and training the voice network and the text network based on the difference between the determined and preset similarity degrees, until the determined similarity degree is greater than or equal to the preset one, obtaining the trained voice correction model.
Specific descriptions of the trained voice network, the trained text network, and the trained similarity evaluation network of the voice correction model are described in detail in the above embodiments, and are not described in detail in this embodiment.
In the embodiments of the application, any two sample voices of the target object with a preset similarity degree pass through the voice network and the text network to obtain their respective combined text-voice features, and then any two combined features are input into the similarity evaluation network to determine the similarity of the two sample voices. If the determined similarity degree is greater than or equal to the preset similarity degree, the voice correction model is trained; otherwise, the voice network and the text network continue to be trained until the condition is met. Through this process, a voice correction model with the target object's characteristics can be trained, so that the target voice message it corrects is closer to the voice message to be processed.
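The two-stage training could be sketched as the following PyTorch loop, called first with pairs drawn from the general public speech (pre-training) and then with pairs drawn from the target object's own samples (fine-tuning). The mean-squared-error objective between the predicted and preset similarity is an assumption; the text only requires the predicted similarity to reach the preset one.

```python
import torch
import torch.nn.functional as F


def train_similarity(model, pairs, epochs=5, lr=1e-4):
    """pairs: list of (mel_a, mel_b, preset_similarity) triples.

    A fixed epoch count stands in for the patent's stopping rule of
    training until the predicted similarity reaches the preset one.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for mel_a, mel_b, preset in pairs:
            sim = model.similarity(mel_a, mel_b)          # shape (1,)
            loss = F.mse_loss(sim, torch.tensor([preset]))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```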
Fig. 7 is a simplified schematic diagram of the input and output during voice correction model training. The inputs in the figure are the general sample voice and the sample voice of the target object; during training, the similarity between two voice messages, the voice-segment hidden vectors, the text-voice mapping relationship, the similarity threshold, and so on can be obtained.
Fig. 8 is a schematic overall flowchart of the voice message processing method of the present application. Specifically, when the voice message to be processed is obtained, the key voice message is produced through the voice conversion model, the text error correction model, the text extraction model, and the text conversion model; the voice message to be processed and the key voice message are then input into the voice correction model to obtain the target voice message. During this process, the voice correction model feeds the difference between the similarity degree and the similarity threshold back to the voice conversion model and the text conversion model in real time according to the similarity of the two messages, so that those models correct the character and voice weight distribution of the conversion, making the converted text more accurate and the converted voice closer to the voice characteristics of the original voice. The details have been described in the embodiments above and are not repeated here.
It should be noted that the execution subject of the voice message processing method provided in the embodiments of the present application may be a voice message processing apparatus, or a control module in the apparatus used for executing the method. The embodiments of the present application describe the voice message processing apparatus by taking an apparatus executing the method as an example.
Fig. 9 is a schematic diagram of a voice message processing apparatus according to an embodiment of the present application. The voice message processing apparatus may include: an acquisition module 901, a determination module 902 and a correction module 903.
The acquiring module 901 is configured to acquire a voice message to be processed and a target correction degree corresponding to the voice message to be processed; a determining module 902, configured to determine a similarity threshold according to the target correction degree; a correction module 903, configured to determine, through a voice correction model, a degree of similarity between the voice message to be processed and the key voice message, and perform personalized voice correction on the key voice message based on the degree of similarity and a similarity threshold value to obtain a target voice message; the key voice message corresponds to the voice message to be processed; the voice correction model is obtained by utilizing sample voice training of a target object; the target object is a message recording object corresponding to the voice message to be processed; the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
In this embodiment of the application, the acquiring module 901 acquires the voice message to be processed and its corresponding target correction degree, the determining module 902 then determines a similarity threshold according to the target correction degree, and finally the correction module 903 determines, through a voice correction model trained with sample voice of the target object, the degree of similarity between the voice message to be processed and the key voice message, and performs personalized voice correction on the key voice message based on the similarity degree and the similarity threshold to obtain the target voice message. By determining the similarity threshold from the acquired target correction degree and using the voice correction model to perform personalized voice correction on the key voice message, a target voice message having the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object can be obtained. This avoids repeated recording when the user makes a mistake while recording a voice message, improves the efficiency of obtaining voice messages, and lets the final target voice message share the tone, intonation, speed, and other characteristics of the user's original recording, so that it sounds more like a message spoken by the user, improving the accuracy of voice message processing.
Optionally, the determining module 902 may be configured to: converting the voice message to be processed into a message text through a voice conversion model; extracting key contents from the message text through a text extraction model to obtain a key text; the key text is converted into a key voice message through a text conversion model.
Optionally, the determining module 902 may be configured to: determining the content simplification degree according to the target correction degree; and extracting key contents from the message text through a text extraction model matched with the content simplification degree to obtain the key text.
Optionally, the correction module 903 may be configured to: acquire the voice characteristics of the key voice message and the voice characteristics of the voice message to be processed through a voice network of the voice correction model; acquire, through a text network of the voice correction model, the text-voice combination characteristics of the key voice message based on the voice characteristics of the key voice message, and the text-voice combination characteristics of the voice message to be processed based on the voice characteristics of the voice message to be processed; and determine the similarity between the key voice message and the voice message to be processed, based on the two sets of text-voice combination characteristics, through a similarity evaluation network of the voice correction model.
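The three sub-networks can be sketched with simple vector stand-ins, as below. The eight-dimensional features, the cosine-similarity comparison, and the function names are assumptions for illustration only, not the actual network architecture of the embodiment.

```python
import math
import random

def voice_network(audio: bytes) -> list:
    # Stand-in for the voice network: map audio to a voice-feature vector.
    random.seed(len(audio))
    return [random.random() for _ in range(8)]

def text_network(voice_features: list) -> list:
    # Stand-in for the text network: derive text-voice combination
    # features from the voice features.
    return [2.0 * f for f in voice_features]

def similarity_network(a: list, b: list) -> float:
    # Stand-in for the similarity evaluation network: cosine similarity.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity(key_audio: bytes, original_audio: bytes) -> float:
    key_features = text_network(voice_network(key_audio))
    original_features = text_network(voice_network(original_audio))
    return similarity_network(key_features, original_features)
```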
Optionally, the correction module 903 may be configured to: when the similarity is smaller than the similarity threshold, feed the similarity back to at least one of the voice conversion model and the text conversion model, so that the at least one model adjusts its output, until the similarity between the key voice message and the voice message to be processed is greater than or equal to the similarity threshold, at which point the target voice message is obtained.
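This feedback loop can be sketched as follows. The adjust hook and the iteration cap are assumptions: the embodiment only states that the conversion models adjust their output until the threshold is met, so a real implementation would need its own convergence safeguard. The model behaviors are passed in as callables to keep the sketch self-contained.

```python
from typing import Callable

def refine(original_audio: bytes,
           threshold: float,
           make_key: Callable[[bytes], bytes],
           score: Callable[[bytes, bytes], float],
           adjust: Callable[[bytes, float], bytes],
           max_iterations: int = 10) -> bytes:
    """Iterate until the key message is similar enough to the original."""
    key_audio = make_key(original_audio)
    for _ in range(max_iterations):
        sim = score(key_audio, original_audio)
        if sim >= threshold:
            return key_audio              # target voice message obtained
        # Feed the similarity back so the conversion models can adjust
        # their output (details are model-specific).
        key_audio = adjust(key_audio, sim)
    return key_audio                      # best effort after the cap

# Toy demonstration with pass-through stand-ins for the three models.
target = refine(b"original", 0.5,
                make_key=lambda audio: audio,
                score=lambda key, orig: 1.0,
                adjust=lambda key, sim: key)
```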
Optionally, the obtaining module 901 may be configured to: determine the voice message to be processed in response to a first input by the user on a voice message in a conversation interface; and determine the target correction degree corresponding to the voice message to be processed in response to a second input by the user on a voice correction control in the conversation interface.
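In user-interface terms, the two inputs might correspond to two simple event handlers, as in the hypothetical sketch below; the handler names and the slider-style correction control are illustrative assumptions only.

```python
class ConversationController:
    """Hypothetical controller for the conversation interface."""

    def __init__(self):
        self.pending_message = None     # the voice message to be processed
        self.correction_degree = 0.0    # the target correction degree

    def on_message_selected(self, message):
        # First input: the user selects a voice message in the interface.
        self.pending_message = message

    def on_correction_control_changed(self, value: float):
        # Second input: e.g. a slider position in [0, 1] interpreted
        # as the target correction degree.
        self.correction_degree = value
```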
Optionally, the correction module 903 may be configured to: pre-train a voice correction model to be trained using general sample voice; and fine-tune the pre-trained voice correction model using the sample voice of the target object until training is finished, so as to obtain the voice correction model.
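The two-stage scheme (generic pre-training followed by speaker-specific fine-tuning) is commonly realized along the lines of the sketch below. The training-step stub, the learning rates, and the fixed epoch counts are assumptions, since the embodiment does not prescribe them; a typical choice is a smaller learning rate for the fine-tuning stage.

```python
class Model:
    """Stand-in for the voice correction model to be trained."""
    def step(self, sample, lr: float):
        pass  # placeholder parameter update

def train(model: Model, samples: list, lr: float, epochs: int) -> Model:
    # Placeholder training loop; a real system would compute a loss
    # on each sample and update the model's parameters accordingly.
    for _ in range(epochs):
        for sample in samples:
            model.step(sample, lr)
    return model

general_samples = ["voice clips from many speakers"]
target_samples = ["voice clips of the target object"]

model = train(Model(), general_samples, lr=1e-3, epochs=10)  # pre-training
model = train(model, target_samples, lr=1e-4, epochs=3)      # fine-tuning
```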
The voice message processing apparatus in the embodiment of the present application may be a standalone device, or may be a component, an integrated circuit, or a chip in a terminal. The apparatus may be a mobile electronic device or a non-mobile electronic device. By way of example, the mobile electronic device may be a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted electronic device, a wearable device, an ultra-mobile personal computer (UMPC), a netbook, or a Personal Digital Assistant (PDA); the non-mobile electronic device may be a server, a Network Attached Storage (NAS) device, a Personal Computer (PC), a Television (TV), a teller machine, or a self-service machine. The embodiments of the present application are not specifically limited in this respect.
The voice message processing apparatus in the embodiment of the present application may be a device having an operating system. The operating system may be an Android operating system, an iOS operating system, or another possible operating system; the embodiments of the present application are not specifically limited.
The voice message processing apparatus provided in the embodiment of the present application can implement each process implemented by the method embodiments shown in Fig. 1 to Fig. 8; to avoid repetition, the details are not described here again.
Optionally, as shown in Fig. 10, an embodiment of the present application further provides an electronic device 1000, which includes a processor 1001, a memory 1002, and a program or instructions stored in the memory 1002 and executable on the processor 1001. When the program or instructions are executed by the processor 1001, each process of the foregoing voice message processing method embodiment is implemented, with the same technical effect; to avoid repetition, the details are not repeated here.
It should be noted that the electronic device in the embodiment of the present application includes the mobile electronic device and the non-mobile electronic device described above.
Fig. 11 is a schematic diagram of a hardware structure of an electronic device implementing an embodiment of the present application.
The electronic device 1100 includes, but is not limited to: radio frequency unit 1101, network module 1102, audio output unit 1103, input unit 1104, sensor 1105, display unit 1106, user input unit 1107, interface unit 1108, memory 1109, and processor 1110.
Those skilled in the art will appreciate that the electronic device 1100 may further include a power source (e.g., a battery) for supplying power to the various components; the power source may be logically connected to the processor 1110 via a power management system, so that charging, discharging, and power consumption management are handled by the power management system. The electronic device structure shown in Fig. 11 does not constitute a limitation on the electronic device; the electronic device may include more or fewer components than shown, combine some components, or use a different arrangement of components, which is not described further here.
The processor 1110 is configured to: acquire a voice message to be processed and a target correction degree corresponding to the voice message to be processed; determine a similarity threshold according to the target correction degree; and determine, through a voice correction model, the similarity between the voice message to be processed and a key voice message, and perform personalized voice correction on the key voice message based on the similarity and the similarity threshold to obtain a target voice message. The key voice message corresponds to the voice message to be processed. The voice correction model is trained with sample voice of a target object, the target object being the message recording object corresponding to the voice message to be processed. The target voice message has the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object.
In the embodiment of the application, the voice message to be processed and its corresponding target correction degree are obtained first; a similarity threshold is then determined according to the target correction degree; finally, the similarity between the voice message to be processed and the key voice message is determined through the voice correction model trained with sample voice of the target object, and personalized voice correction is performed on the key voice message based on the similarity and the similarity threshold to obtain the target voice message. Because the similarity threshold is determined from the obtained target correction degree and the voice correction model performs personalized correction of the key voice message, the target voice message carries both the voice characteristics of the voice message to be processed and the voiceprint characteristics of the target object. The user therefore does not need to re-record a mistakenly recorded voice message, which improves the efficiency of obtaining the voice message; at the same time, the final target voice message preserves the timbre, intonation, speaking speed, and similar characteristics of the user's original recording, so that it sounds as if the user had recorded it personally, improving the accuracy of voice message processing.
It should be understood that, in the embodiment of the present application, the input unit 1104 may include a Graphics Processing Unit (GPU) 11041 and a microphone 11042; the graphics processor 11041 processes image data of still pictures or video obtained by an image capturing device (such as a camera) in a video capturing mode or an image capturing mode. The display unit 1106 may include a display panel 11061, which may be configured in the form of a liquid crystal display, an organic light emitting diode display, or the like. The user input unit 1107 includes a touch panel 11071, also called a touch screen, and other input devices 11072; the touch panel 11071 may include a touch detection device and a touch controller. The other input devices 11072 may include, but are not limited to, a physical keyboard, function keys (e.g., volume control keys and switch keys), a trackball, a mouse, and a joystick, which are not described in detail here. The memory 1109 may be used to store software programs and various data, including but not limited to application programs and an operating system. The processor 1110 may integrate an application processor, which mainly handles the operating system, user interface, and applications, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1110.
The embodiment of the present application further provides a readable storage medium on which a program or instructions are stored. When the program or instructions are executed by a processor, each process of the foregoing voice message processing method embodiment is implemented, with the same technical effect; to avoid repetition, the details are not repeated here.
The processor is the processor in the electronic device described in the foregoing embodiment. The readable storage medium includes a computer-readable storage medium, such as a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiment of the present application further provides a chip that includes a processor and a communication interface coupled to the processor. The processor is configured to run a program or instructions to implement each process of the foregoing voice message processing method embodiment, with the same technical effect; to avoid repetition, the details are not repeated here.
It should be understood that the chip mentioned in the embodiments of the present application may also be referred to as a system-level chip, a system chip, a chip system, or a system-on-chip.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises that element. Further, it should be noted that the scope of the methods and apparatus of the embodiments of the present application is not limited to performing the functions in the order illustrated or discussed; it may include performing the functions in a substantially simultaneous manner or in a reverse order, depending on the functions involved. For example, the described methods may be performed in an order different from the order described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware alone, although in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application, or the portion that contributes beyond the prior art, may be embodied in the form of a computer software product that is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes instructions for enabling a terminal (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
While the embodiments of the present application have been described with reference to the accompanying drawings, it is to be understood that the application is not limited to the precise embodiments described above, which are meant to be illustrative rather than restrictive; various changes may be made by those skilled in the art without departing from the spirit and scope of the application as defined by the appended claims.

Claims (10)

1. A method for processing a voice message, comprising:
acquiring a voice message to be processed and a target correction degree corresponding to the voice message to be processed;
determining a similarity threshold according to the target correction degree;
determining the similarity degree between the voice message to be processed and a key voice message through a voice correction model, and performing personalized voice correction on the key voice message based on the similarity degree and the similarity degree threshold value to obtain a target voice message; the key voice message corresponds to the voice message to be processed;
the voice correction model is obtained by utilizing sample voice training of a target object; the target object is a message recording object corresponding to the voice message to be processed; the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
2. The method of claim 1, wherein the step of obtaining the key voice message comprises:
converting the voice message to be processed into a message text through a voice conversion model;
extracting key contents from the message text through a text extraction model to obtain a key text;
and converting the key text into the key voice message through a text conversion model.
3. The method of claim 2, wherein before extracting key content from the message text by the text extraction model to obtain the key text, the method further comprises:
determining the content simplification degree according to the target correction degree;
and the extracting key contents from the message text through the text extraction model to obtain the key text comprises:
and extracting key contents from the message text through a text extraction model matched with the content simplification degree to obtain the key text.
4. The method of claim 2, wherein the determining the similarity between the voice message to be processed and the key voice message through the voice modification model comprises:
acquiring the voice characteristics of the key voice message and the voice characteristics of the voice message to be processed through a voice network of a voice correction model;
acquiring text voice combination characteristics of the key voice message based on voice characteristics of the key voice message through a text network of a voice correction model, and acquiring the text voice combination characteristics of the voice message to be processed based on the voice characteristics of the voice message to be processed;
and determining the similarity degree between the key voice message and the voice message to be processed based on the text voice combination characteristic of the key voice message and the text voice combination characteristic of the voice message to be processed through a similarity evaluation network of a voice correction model.
5. The method of claim 2, wherein the performing personalized voice modification on the key voice message based on the similarity degree and the similarity degree threshold to obtain a target voice message comprises:
and if the similarity degree is smaller than the similarity degree threshold value, transmitting the similarity degree to at least one of the voice conversion model and the text conversion model so that the at least one of the voice conversion model and the text conversion model adjusts the respective output result until the similarity degree between the key voice message and the voice message to be processed is larger than or equal to the similarity degree threshold value, and obtaining the target voice message.
6. The method of claim 1, wherein obtaining the voice message to be processed and the target correction degree corresponding to the voice message to be processed comprises:
determining the voice message to be processed in response to a first input by a user on a voice message in a conversation interface;
and determining the target correction degree corresponding to the voice message to be processed in response to a second input by the user on a voice correction control in the conversation interface.
7. The method of claim 1, wherein the step of training the speech modification model comprises:
pre-training a voice correction model to be trained by using a universal sample voice;
and carrying out fine tuning training on the pre-trained voice correction model by using the sample voice of the target object until the training is finished, and obtaining the voice correction model.
8. A voice message processing apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring a voice message to be processed and a target correction degree corresponding to the voice message to be processed;
the determining module is used for determining a similarity threshold according to the target correction degree;
the correction module is used for determining the similarity degree between the voice message to be processed and the key voice message through a voice correction model, and performing personalized voice correction on the key voice message based on the similarity degree and the similarity degree threshold value to obtain a target voice message; the key voice message corresponds to the voice message to be processed;
the voice correction model is obtained by utilizing sample voice training of a target object; the target object is a message recording object corresponding to the voice message to be processed; the target voice message has voice characteristics of the voice message to be processed and voiceprint characteristics of the target object.
9. An electronic device comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, the program or instructions when executed by the processor implementing the steps of the voice message processing method according to any one of claims 1-7.
10. A readable storage medium, characterized in that it stores thereon a program or instructions which, when executed by a processor, implement the steps of the voice message processing method according to any one of claims 1-7.
CN202210744413.5A 2022-06-28 2022-06-28 Voice message processing method and device and electronic equipment Pending CN115132207A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210744413.5A CN115132207A (en) 2022-06-28 2022-06-28 Voice message processing method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210744413.5A CN115132207A (en) 2022-06-28 2022-06-28 Voice message processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115132207A true CN115132207A (en) 2022-09-30

Family

ID=83379628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210744413.5A Pending CN115132207A (en) 2022-06-28 2022-06-28 Voice message processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115132207A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination