CN111986662A - Self-adaptive English voice generation method - Google Patents


Info

Publication number
CN111986662A
Authority
CN
China
Prior art keywords
signal, english, voice, reserved, value
Prior art date
Legal status
Pending
Application number
CN202010891349.4A
Other languages
Chinese (zh)
Inventor
崔炜
Current Assignee
Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Original Assignee
Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd filed Critical Shanghai Squirrel Classroom Artificial Intelligence Technology Co Ltd
Priority to CN202010891349.4A priority Critical patent/CN111986662A/en
Publication of CN111986662A publication Critical patent/CN111986662A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/183: Speech classification or search using natural language modelling using context dependencies, e.g. language models

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a self-adaptive English speech generation method, which comprises the following steps: receiving a triggered English speech generation instruction, and collecting a target speech signal based on the instruction; analyzing and processing the collected target speech signal to obtain a corresponding signal to be retained; performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech; and inputting speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, and obtaining a speech output result to generate the English speech. Speech input is thereby performed according to the defect recognition result, which improves the accuracy and intelligence of English speech output as well as its output efficiency.

Description

Self-adaptive English voice generation method
Technical Field
The invention relates to the technical field of speech processing, and in particular to a self-adaptive English speech generation method.
Background
With the continuous development of artificial intelligence, intelligent speech services are increasingly applied to people's daily work and life. At the same time, to adapt to different application scenarios and meet different requirements, the demand for artificial intelligence that uses English as its output language is also growing.
At present, English speech output in the prior art basically consists of inputting the speech signal directly into a corresponding speech output model and obtaining a speech output result, thereby outputting English speech directly. This approach performs no defect analysis or recognition on the input speech signal, so the output English speech is not accurate enough.
Disclosure of Invention
The invention provides a self-adaptive English speech generation method that analyzes an input speech signal and performs defect recognition, so that the speech signal is input into a corresponding speech output model according to the recognition result, thereby improving the accuracy of English speech output.
The invention provides a self-adaptive English speech generation method, comprising the following steps:
receiving a triggered English speech generation instruction, and collecting a target speech signal based on the English speech generation instruction;
analyzing and processing the collected target speech signal to obtain a corresponding signal to be retained;
performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech;
and inputting speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, and obtaining a speech output result to generate the English speech.
Further, analyzing and processing the collected target speech signal to obtain the corresponding signal to be retained includes:
splitting the collected target speech signal into signal frames to obtain m frames of speech signals;
converting the m frames of speech signals obtained after splitting into corresponding electrical signals;
filtering the electrical signals obtained after conversion to obtain a corresponding signal to be extracted;
and extracting the characteristic information of the electrical signals from the signal to be extracted while filtering out other extraneous information, thereby forming the signal to be retained.
Further, performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech includes:
preprocessing the obtained signal to be retained, and extracting, based on the preprocessing result, n characteristic parameters related to speech prosody from the signal to be retained;
calculating a score for each frame of signal in the signal to be retained according to the n extracted characteristic parameters;
and performing defect recognition on the signal to be retained according to the calculated score of each frame of signal in the signal to be retained.
Further, the n characteristic parameters related to speech prosody include: tone, intonation, and temperament.
Further, calculating the score of each frame of signal in the signal to be retained according to the n extracted characteristic parameters includes:
calculating, according to the n extracted characteristic parameters, a first score value S1 for each frame of signal in the signal to be retained using formula (1):
[Formula (1): image not reproduced]
In formula (1), β_i is the actual characteristic value of the i-th characteristic parameter of each frame of signal in the signal to be retained; it is a preset value with a value range of [0, 1]. χ_i is the weight of the i-th characteristic parameter of each frame of the signal to be retained; it is a preset value with a value range of [0, 1]. β_i' is the standard characteristic value of the i-th characteristic parameter of each frame of signal in the signal to be retained; it is a preset value with a value range of [0, 1].
Further, performing defect recognition on the signal to be retained according to the calculated score of each frame of signal in the signal to be retained includes:
calculating a second score value for the target speech signal mapped by the signal to be retained, according to the calculated first score value of each frame of signal in the signal to be retained;
judging whether the first score value and the second score value both meet a preset English score standard value;
if the first score value and the second score value both meet the preset English score standard value, recognizing that the signal to be retained has no defect;
and if the first score value and the second score value do not both meet the preset English score standard value, recognizing that the signal to be retained has a defect.
Further, calculating the second score value of the target speech signal mapped by the signal to be retained according to the calculated first score value of each frame of signal in the signal to be retained includes:
finding the maximum value S_max among the first score values S1, and calculating the second score value S2 of the target speech signal over the m frames of signals contained in the target speech signal using formula (2):
[Formula (2): image not reproduced]
In formula (2), S_max represents the maximum of all the first score values S1; λ represents the occupation value, within the target speech signal, of the signal to be retained corresponding to the maximum score value S_max; and a3 represents the signal frames contained in the signal to be retained.
Further, inputting the speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result includes:
when the signal to be retained is recognized as having no defect, inputting the target speech signal mapped by the signal to be retained into the English speech output model;
and when the signal to be retained is recognized as having a defect, inputting the defect recognition result information together with the target speech signal into the English speech output model according to the defect recognition result.
Further, when the signal to be retained is recognized as having a defect, inputting the defect recognition result information together with the target speech signal into the English speech output model according to the defect recognition result includes:
calling a pre-stored defect database when the signal to be retained is recognized as having a defect;
performing defect recognition on the target speech signal based on the defect database to obtain defect recognition result information matching the target speech signal;
and inputting the defect recognition result information and the target speech signal into the English speech output model.
Further, receiving a triggered English speech generation instruction and collecting a target speech signal based on the English speech generation instruction includes:
receiving the triggered English speech generation instruction, authenticating the triggered English speech generation instruction, and, when the authentication passes, collecting the corresponding target speech signal according to the English speech generation instruction;
the triggering modes of the English speech generation instruction include:
a manual mode, in which a user triggers the corresponding English speech generation instruction, and an automatic mode, in which the system triggers the instruction itself; when the system detects that the triggering condition of the English speech generation instruction is met, the system automatically triggers the English speech generation instruction.
The self-adaptive English speech generation method comprises: receiving a triggered English speech generation instruction, and collecting a target speech signal based on the instruction; analyzing and processing the collected target speech signal to obtain a corresponding signal to be retained; performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech; and inputting speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, and obtaining a speech output result to generate the English speech. Speech input is thereby performed according to the defect recognition result, which improves the accuracy and intelligence of English speech output as well as its output efficiency.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described below by means of the accompanying drawings and examples.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
Fig. 1 is a schematic workflow diagram of an embodiment of the self-adaptive English speech generation method according to the present invention.
Fig. 2 is a schematic workflow diagram of an embodiment of processing the target speech signal to obtain the signal to be retained in the self-adaptive English speech generation method according to the present invention.
Fig. 3 is a schematic workflow diagram of an embodiment of performing defect recognition on the signal to be retained in the self-adaptive English speech generation method according to the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation.
The invention provides a self-adaptive English speech generation method in which the target speech signal to be fed to an English speech output model is first analyzed and subjected to defect recognition, and the corresponding speech data are then input into the corresponding speech output model according to the recognition result, thereby improving the accuracy of English speech output.
As shown in Fig. 1, which is a schematic workflow diagram of an embodiment of the self-adaptive English speech generation method according to the present invention, the method can be implemented as steps S10-S40 described below.
Step S10: receiving a triggered English speech generation instruction, and collecting a target speech signal based on the English speech generation instruction.
In the embodiment of the invention, the system receives the triggered English speech generation instruction and collects the target speech signal according to that instruction.
The triggering modes of the English speech generation instruction include two modes: manual triggering by a user and automatic triggering by the system. In the manual mode, the user triggers the corresponding English speech generation instruction; in the automatic mode, the system triggers the English speech generation instruction itself as soon as it detects that the triggering condition of the instruction is met.
Further, in an embodiment, when an English speech generation instruction is received, the triggered instruction is authenticated, that is, it is determined whether the triggered instruction is legitimate; only after the authentication passes are the target speech signal collected and the English speech generated.
For example, in a specific application scenario, suppose a speech robot is currently performing speech interpretation in a museum when it receives an English speech generation instruction asking it to enter a man-machine conversation mode and produce speech output. The currently executing museum interpretation event has a higher priority, so the robot's authentication of the newly triggered English speech generation instruction does not pass, and the corresponding target speech signal collection and English speech generation are not performed.
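As a concrete illustration of step S10, the sketch below shows one way such a trigger-and-authenticate flow could be wired up. The class and field names (GenerationInstruction, SpeechGenerationFrontEnd, priority) and the priority-based authentication rule are assumptions made for illustration only; the patent does not prescribe an implementation.

```python
from dataclasses import dataclass

@dataclass
class GenerationInstruction:
    """A triggered English speech generation instruction (fields are illustrative)."""
    source: str    # "user" for manual triggering, "system" for automatic triggering
    priority: int  # priority of the requested speech-generation task

class SpeechGenerationFrontEnd:
    """Receives, authenticates, and acts on English speech generation instructions."""

    def __init__(self, current_task_priority: int = 0):
        # Priority of whatever the device (e.g. a guide robot) is currently doing.
        self.current_task_priority = current_task_priority

    def authenticate(self, instruction: GenerationInstruction) -> bool:
        # Reject instructions that should not interrupt the current task,
        # as in the museum-interpretation example above.
        return instruction.priority >= self.current_task_priority

    def collect_target_signal(self, instruction: GenerationInstruction):
        if not self.authenticate(instruction):
            return None  # authentication failed: no collection, no generation
        # Stand-in for microphone acquisition of the target speech signal.
        return [0.0] * 16000

# A manually triggered instruction that outranks the current task passes authentication.
front_end = SpeechGenerationFrontEnd(current_task_priority=1)
signal = front_end.collect_target_signal(GenerationInstruction(source="user", priority=2))
```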
Step S20: analyzing and processing the collected target speech signal to obtain a corresponding signal to be retained.
In the embodiment of the invention, the system performs signal analysis and signal processing on the collected target speech signal, for example signal filtering and removal of invalid information, to obtain the signal to be retained.
Step S30: performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech.
For the signal to be retained obtained after signal analysis and processing, the system performs defect recognition by referring to a standard speech signal corresponding to English speech; for example, the signal features of the signal to be retained are compared one by one with the standard English speech signal, and it is judged whether defects such as missing content exist, thereby recognizing defects in the target speech signal mapped by the signal to be retained.
Step S40: inputting speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, and obtaining a speech output result to generate the English speech.
According to the defect recognition result for the signal to be retained: if the signal to be retained is recognized as having a defect, the recognition result information corresponding to that defect is input into the corresponding English speech output model together with the target speech signal; if the signal to be retained is recognized as having no defect, the speech data corresponding to the target speech signal mapped by the signal to be retained is input into the English speech output model directly. The corresponding speech output result is then obtained from the output of the English speech output model, yielding the generated English speech.
The English speech output model used in the embodiment of the invention can be an existing English speech output model; for example, standard English pronunciations and standard English expression indexes can be collected to build a corresponding standard English speech sample database, from which the corresponding speech output model is trained.
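The patent treats the English speech output model as an existing component built from a standard English speech sample database. The minimal interface below is only an assumed stand-in so that later steps have something concrete to call; it is not the patent's actual model, and both the class name and its methods are illustrative.

```python
class EnglishSpeechOutputModel:
    """Assumed stand-in for an existing English speech output model built from
    a standard English speech sample database; not the patent's actual model."""

    def __init__(self, standard_sample_db: dict):
        # standard_sample_db: e.g. standard pronunciations and expression indexes;
        # a real implementation would train a synthesis model from these samples.
        self.standard_sample_db = standard_sample_db

    def generate(self, target_signal, defect_info=None):
        """Return a speech output result; defect_info, when supplied, carries the
        defect recognition result so the model can compensate for it (step S40)."""
        # Actual synthesis is out of scope here; return a placeholder output structure.
        return {"speech": target_signal, "compensated_for": defect_info}

model = EnglishSpeechOutputModel(standard_sample_db={"hello": "standard pronunciation"})
```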
The self-adaptive English speech generation method comprises: receiving a triggered English speech generation instruction, and collecting a target speech signal based on the instruction; analyzing and processing the collected target speech signal to obtain a corresponding signal to be retained; performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech; and inputting speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, and obtaining a speech output result to generate the English speech. Speech input is thereby performed according to the defect recognition result, which improves the accuracy and intelligence of English speech output as well as its output efficiency.
Based on the embodiment shown in Fig. 1, Fig. 2 is a schematic workflow diagram of an embodiment of processing the target speech signal to obtain the signal to be retained in the self-adaptive English speech generation method according to the present invention. In the embodiment shown in Fig. 2, step S20 of the embodiment shown in Fig. 1, namely analyzing and processing the collected target speech signal to obtain the corresponding signal to be retained, can be implemented as steps S21-S24 described below.
Step S21: splitting the collected target speech signal into signal frames to obtain m frames of speech signals;
Step S22: converting the m frames of speech signals obtained after splitting into corresponding electrical signals;
Step S23: filtering the electrical signals obtained after conversion to obtain a corresponding signal to be extracted;
Step S24: extracting the characteristic information of the electrical signals from the signal to be extracted while filtering out other extraneous information, thereby forming the signal to be retained.
In the embodiment of the invention, when the collected target speech signal is analyzed and processed, it is first split into m frames of speech signals. Each frame of speech signal is converted to obtain the corresponding electrical signal, and the converted electrical signals are filtered to obtain the corresponding signal to be extracted A1. Corresponding characteristic information is then extracted from each frame of the signal to be extracted, and extraneous information A2 is filtered out, forming the corresponding signal to be retained A3; that is, A3 = A1 - A2. A sketch of this pipeline is given below.
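The following is a minimal numerical sketch of steps S21-S24. The frame length, the moving-average filter, and the energy and zero-crossing features are assumptions made for illustration; the patent only states that the signal is framed, converted, filtered, and reduced to its characteristic information (A3 = A1 - A2), without specifying how.

```python
import numpy as np

def split_into_frames(signal: np.ndarray, frame_len: int = 400) -> list:
    """Step S21: split the target speech signal into m frames."""
    m = len(signal) // frame_len
    return [signal[i * frame_len:(i + 1) * frame_len] for i in range(m)]

def to_electrical_signal(frame: np.ndarray) -> np.ndarray:
    """Step S22: stand-in for the acoustic-to-electrical conversion
    (here the samples are simply treated as the electrical signal)."""
    return frame.astype(np.float64)

def filter_frame(frame: np.ndarray, window: int = 5) -> np.ndarray:
    """Step S23: filtering; a moving average is used as an assumed example."""
    kernel = np.ones(window) / window
    return np.convolve(frame, kernel, mode="same")

def extract_retained_features(frame: np.ndarray) -> dict:
    """Step S24: keep characteristic information (A3) and drop extraneous info (A2).
    Energy and zero-crossing rate are assumed examples of characteristic info."""
    energy = float(np.mean(frame ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    return {"energy": energy, "zero_crossing_rate": zcr}

# Usage: target_signal stands in for the collected target speech signal.
target_signal = np.random.randn(16000)
retained = [extract_retained_features(filter_frame(to_electrical_signal(f)))
            for f in split_into_frames(target_signal)]
```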
In the embodiment of the invention, the collected target speech signal is split into m frames of speech signals; the split speech signals are converted into corresponding electrical signals; the electrical signals obtained after conversion are filtered to obtain the corresponding signal to be extracted; and the characteristic information of the electrical signals is extracted from the signal to be extracted while other extraneous information is filtered out, forming the signal to be retained. This provides important data for the subsequent defect recognition of the target speech signal and improves both the accuracy and the efficiency of signal defect recognition.
Based on the embodiments shown in Fig. 1 and Fig. 2, Fig. 3 is a schematic workflow diagram of an embodiment of performing defect recognition on the signal to be retained in the self-adaptive English speech generation method according to the present invention. In the embodiment shown in Fig. 3, step S30 of the embodiment shown in Fig. 1, namely performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech, can be implemented as steps S31-S33 described below.
Step S31: preprocessing the obtained signal to be retained, and extracting, based on the preprocessing result, n characteristic parameters related to speech prosody from the signal to be retained;
Step S32: calculating a score for each frame of signal in the signal to be retained according to the n extracted characteristic parameters;
Step S33: performing defect recognition on the signal to be retained according to the calculated score of each frame of signal in the signal to be retained.
In the embodiment of the invention, the signal to be retained obtained from the target speech signal is preprocessed, and n characteristic parameters related to speech prosody are extracted from it according to the preprocessing result; these characteristic parameters include, but are not limited to, the tone, intonation, and temperament associated with English speech rhythm. The score of each frame of signal in the signal to be retained is then calculated from the n extracted characteristic parameters, and defect recognition is performed on the signal to be retained according to the calculated scores. One possible per-frame extraction is sketched below.
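The sketch below illustrates extracting n = 3 prosody-related characteristic parameters from one frame. The specific estimators (an autocorrelation pitch estimate for tone, short-time energy as an intonation proxy, and zero-crossing rate as a temperament/rhythm proxy) are assumptions; the patent names the parameters but does not define how they are measured.

```python
import numpy as np

def frame_prosody_features(frame: np.ndarray, sample_rate: int = 16000) -> dict:
    """Return n = 3 prosody-related characteristic parameters for one frame.
    The frame should contain at least sample_rate // 60 samples."""
    frame = frame - np.mean(frame)

    # "Tone" proxy: pitch estimate from the autocorrelation peak in a 60-400 Hz band.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sample_rate // 400, sample_rate // 60
    lag = lo + int(np.argmax(corr[lo:hi]))
    tone = sample_rate / lag

    # "Intonation" proxy: short-time energy of the frame.
    intonation = float(np.mean(frame ** 2))

    # "Temperament" (rhythm) proxy: zero-crossing rate of the frame.
    temperament = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

    return {"tone": tone, "intonation": intonation, "temperament": temperament}

# Example on one 400-sample (25 ms) frame of a synthetic 120 Hz signal.
features = frame_prosody_features(np.sin(2 * np.pi * 120 * np.arange(400) / 16000))
```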
Further, in an embodiment, step S33 in the embodiment shown in Fig. 3, namely performing defect recognition on the signal to be retained according to the calculated score of each frame of signal in the signal to be retained, can be implemented as follows:
a second score value of the target speech signal mapped by the signal to be retained is calculated according to the calculated first score value of each frame of signal in the signal to be retained, and it is judged whether the first score value and the second score value both meet a preset English score standard value.
If the first score value and the second score value both meet the preset English score standard value, the signal to be retained is recognized as having no defect; if they do not both meet the preset English score standard value, the signal to be retained is recognized as having a defect.
Further, in an embodiment, the first score value of each frame of signal in the signal to be retained, calculated from the n extracted characteristic parameters, can be obtained using formula (1).
In this embodiment of the invention, a first score value S1 is calculated for each frame of signal in the signal to be retained from the n extracted characteristic parameters using formula (1):
[Formula (1): image not reproduced]
In formula (1), β_i is the actual characteristic value of the i-th characteristic parameter of each frame of signal in the signal to be retained; it is a preset value with a value range of [0, 1]. χ_i is the weight of the i-th characteristic parameter of each frame of the signal to be retained; it is a preset value with a value range of [0, 1]. β_i' is the standard characteristic value of the i-th characteristic parameter of each frame of signal in the signal to be retained; it is a preset value with a value range of [0, 1].
Further, calculating the second score value of the target speech signal mapped by the signal to be retained according to the calculated first score value S1 of each frame of signal in the signal to be retained can be implemented as follows:
the maximum value S_max among the first score values S1 is found, and the second score value S2 of the target speech signal corresponding to all the first score values S1 is calculated over the m frames of signals contained in the target speech signal using formula (2):
[Formula (2): image not reproduced]
In formula (2), S_max represents the maximum of all the first score values S1; λ represents the occupation value, within the target speech signal, of the signal to be retained corresponding to the maximum score value S_max; and a3 represents the signal frames contained in the signal to be retained.
Further, based on the first score values S1 of the signal to be retained and the second score value S2 of the target speech signal obtained through the above calculation, when both S1 and S2 meet the preset English score standard value, the signal to be retained is recognized as having no defect; if S1 and S2 do not both meet the preset English score standard value, the signal to be retained is recognized as having a defect. A sketch of this scoring and decision step is given below.
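Because formulas (1) and (2) appear only as images in the original publication, the exact expressions are not available here; the sketch below is therefore an assumed, plausible reading rather than the patent's formulas. It treats S1 as a weighted agreement between the actual values β_i and the standard values β_i', and S2 as S_max scaled by the occupation value λ and the share of retained frames (a3 relative to m). The 0.6 threshold for the English score standard value is likewise an assumed placeholder.

```python
def first_score(actual, standard, weights):
    """Assumed reading of formula (1): S1 rewards each of the n parameters for
    being close to its standard value, weighted by chi_i.
    actual, standard, weights hold beta_i, beta_i', chi_i, each in [0, 1]."""
    n = len(actual)
    return sum(w * (1.0 - abs(a - s)) for a, s, w in zip(actual, standard, weights)) / n

def second_score(first_scores, occupation_value, retained_frames, total_frames):
    """Assumed reading of formula (2): S2 scales the best per-frame score S_max
    by the occupation value lambda and the share of retained frames a3 / m."""
    s_max = max(first_scores)
    return s_max * occupation_value * (retained_frames / total_frames)

def has_defect(first_scores, s2, standard_value=0.6):
    """Defect decision: no defect only if every S1 and S2 meet the preset
    English score standard value (0.6 is an assumed threshold)."""
    return not (all(s >= standard_value for s in first_scores) and s2 >= standard_value)

# Example with two frames and n = 3 characteristic parameters (values illustrative).
s1 = [first_score([0.8, 0.7, 0.9], [0.9, 0.8, 0.9], [1.0, 0.8, 0.6]),
      first_score([0.4, 0.5, 0.6], [0.9, 0.8, 0.9], [1.0, 0.8, 0.6])]
s2 = second_score(s1, occupation_value=0.9, retained_frames=2, total_frames=2)
print(has_defect(s1, s2))
```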
In step S40 of the embodiment shown in Fig. 1, the system inputs the speech data containing the target speech signal into the corresponding English speech output model according to the defect recognition result, which can be implemented as follows:
when the system recognizes that the signal to be retained has no defect, it inputs the target speech signal mapped by the signal to be retained into the English speech output model; when the system recognizes that the signal to be retained has a defect, it inputs the defect recognition result information together with the target speech signal into the English speech output model according to the defect recognition result.
Further, when the system recognizes that the signal to be retained has a defect, inputting the defect recognition result information together with the target speech signal into the English speech output model according to the defect recognition result can be implemented as follows:
when the system recognizes that the signal to be retained has a defect, it calls a pre-stored defect database; based on the defect database, it performs defect recognition on the target speech signal to obtain defect recognition result information matching the target speech signal; and it inputs the defect recognition result information and the target speech signal into the English speech output model. This routing is sketched below.
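The routing described above could look like the sketch below. The structure of the defect database and its matching rules are assumptions for illustration, and the model parameter refers to the EnglishSpeechOutputModel stand-in sketched after step S40 above; none of these names come from the patent itself.

```python
# Assumed pre-stored defect database: maps a defect label to a matching rule
# that inspects the per-frame first score values.
DEFECT_DATABASE = {
    "missing_frames": lambda scores: any(s < 0.3 for s in scores),
    "weak_prosody":   lambda scores: sum(scores) / len(scores) < 0.6,
}

def route_to_output_model(model, target_signal, first_scores, defective: bool):
    """Step S40 routing: a defect-free signal goes straight to the model;
    a defective one is first matched against the defect database, and the
    matched defect information is passed in alongside the signal."""
    if not defective:
        return model.generate(target_signal)
    matched = [name for name, rule in DEFECT_DATABASE.items() if rule(first_scores)]
    return model.generate(target_signal, defect_info=matched)

# Usage: route_to_output_model(model, target_signal, s1, has_defect(s1, s2))
```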
In the embodiment of the invention, the target speech signal is collected so that it can be conveniently analyzed and processed: the target speech signal is first split into frames, and characteristic parameters are then extracted from the split frame content. The corresponding first score values are calculated according to formula (1), the second score value corresponding to all the first score values is calculated according to formula (2), and finally, on the basis of this comparative analysis, it is determined whether the target speech signal alone is input into the speech output model or is input together with the defect recognition result, which improves the accuracy of target speech signal recognition.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A self-adaptive English speech generation method, wherein the method comprises:
receiving a triggered English speech generation instruction, and collecting a target speech signal based on the English speech generation instruction;
analyzing and processing the collected target speech signal to obtain a corresponding signal to be retained;
performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech;
and inputting speech data containing the target speech signal into a corresponding English speech output model according to the defect recognition result, and obtaining a speech output result to generate the English speech.
2. The self-adaptive English speech generation method according to claim 1, wherein analyzing and processing the collected target speech signal to obtain the corresponding signal to be retained comprises:
splitting the collected target speech signal into signal frames to obtain m frames of speech signals;
converting the m frames of speech signals obtained after splitting into corresponding electrical signals;
filtering the electrical signals obtained after conversion to obtain a corresponding signal to be extracted;
and extracting the characteristic information of the electrical signals from the signal to be extracted while filtering out other extraneous information, thereby forming the signal to be retained.
3. The self-adaptive English speech generation method according to claim 1, wherein performing defect recognition on the obtained signal to be retained with reference to a standard speech signal corresponding to English speech comprises:
preprocessing the obtained signal to be retained, and extracting, based on the preprocessing result, n characteristic parameters related to speech prosody from the signal to be retained;
calculating a score for each frame of signal in the signal to be retained according to the n extracted characteristic parameters;
and performing defect recognition on the signal to be retained according to the calculated score of each frame of signal in the signal to be retained.
4. The self-adaptive English speech generation method according to claim 3, wherein the n characteristic parameters related to speech prosody include: tone, intonation, and temperament.
5. The self-adaptive English speech generation method according to claim 3, wherein calculating the score of each frame of signal in the signal to be retained according to the n extracted characteristic parameters comprises:
calculating, according to the n extracted characteristic parameters, a first score value S1 for each frame of signal in the signal to be retained using formula (1):
[Formula (1): image not reproduced]
In formula (1), β_i is the actual characteristic value of the i-th characteristic parameter of each frame of signal in the signal to be retained; it is a preset value with a value range of [0, 1]. χ_i is the weight of the i-th characteristic parameter of each frame of the signal to be retained; it is a preset value with a value range of [0, 1]. β_i' is the standard characteristic value of the i-th characteristic parameter of each frame of signal in the signal to be retained; it is a preset value with a value range of [0, 1].
6. The self-adaptive English speech generation method according to claim 5, wherein performing defect recognition on the signal to be retained according to the calculated score of each frame of signal in the signal to be retained comprises:
calculating a second score value for the target speech signal mapped by the signal to be retained, according to the calculated first score value of each frame of signal in the signal to be retained;
judging whether the first score value and the second score value both meet a preset English score standard value;
if the first score value and the second score value both meet the preset English score standard value, recognizing that the signal to be retained has no defect;
and if the first score value and the second score value do not both meet the preset English score standard value, recognizing that the signal to be retained has a defect.
7. The self-adaptive English speech generation method according to claim 6, wherein calculating the second score value of the target speech signal mapped by the signal to be retained according to the calculated first score value of each frame of signal in the signal to be retained comprises:
finding the maximum value S_max among the first score values S1, and calculating the second score value S2 of the target speech signal over the m frames of signals contained in the target speech signal using formula (2):
[Formula (2): image not reproduced]
In formula (2), S_max represents the maximum of all the first score values S1; λ represents the occupation value, within the target speech signal, of the signal to be retained corresponding to the maximum score value S_max; and a3 represents the signal frames contained in the signal to be retained.
8. The self-adaptive English speech generation method according to any one of claims 1 to 7, wherein inputting the speech data containing the target speech signal into the corresponding English speech output model according to the defect recognition result comprises:
when the signal to be retained is recognized as having no defect, inputting the target speech signal mapped by the signal to be retained into the English speech output model;
and when the signal to be retained is recognized as having a defect, inputting the defect recognition result information together with the target speech signal into the English speech output model according to the defect recognition result.
9. The self-adaptive English speech generation method according to claim 8, wherein, when the signal to be retained is recognized as having a defect, inputting the defect recognition result information together with the target speech signal into the English speech output model according to the defect recognition result comprises:
calling a pre-stored defect database when the signal to be retained is recognized as having a defect;
performing defect recognition on the target speech signal based on the defect database to obtain defect recognition result information matching the target speech signal;
and inputting the defect recognition result information and the target speech signal into the English speech output model.
10. The self-adaptive English speech generation method according to any one of claims 1 to 7, wherein receiving a triggered English speech generation instruction and collecting a target speech signal based on the English speech generation instruction comprises:
receiving the triggered English speech generation instruction, authenticating the triggered English speech generation instruction, and, when the authentication passes, collecting the corresponding target speech signal according to the English speech generation instruction;
wherein the triggering modes of the English speech generation instruction include:
a manual mode, in which a user triggers the corresponding English speech generation instruction, and an automatic mode, in which the system triggers the instruction itself; when the system detects that the triggering condition of the English speech generation instruction is met, the system automatically triggers the English speech generation instruction.
CN202010891349.4A 2020-08-30 2020-08-30 Self-adaptive English voice generation method Pending CN111986662A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010891349.4A CN111986662A (en) 2020-08-30 2020-08-30 Self-adaptive English voice generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010891349.4A CN111986662A (en) 2020-08-30 2020-08-30 Self-adaptive English voice generation method

Publications (1)

Publication Number Publication Date
CN111986662A true CN111986662A (en) 2020-11-24

Family

ID=73439908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010891349.4A Pending CN111986662A (en) 2020-08-30 2020-08-30 Self-adaptive English voice generation method

Country Status (1)

Country Link
CN (1) CN111986662A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination