CN112382267A - Method, apparatus, device and storage medium for converting accents
- Publication number: CN112382267A
- Application number: CN202011270379.XA
- Authority
- CN
- China
- Prior art keywords
- target
- accent
- text
- features
- audio
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
Abstract
The application discloses a method, apparatus, device, and storage medium for converting accents, relating to the fields of speech synthesis, natural language processing, computer technology, artificial intelligence, and deep learning. The specific implementation scheme is as follows: acquiring a target spoken text; determining, based on the target spoken text, a first acoustic feature indicating a target accent; determining linguistic features in the first acoustic feature; and determining, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and outputting the audio. By performing speech synthesis, recognition, and conversion on the acquired spoken text data of a user, this implementation can accurately and quickly convert any text input by the user into audio with the user's target accent.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to the fields of speech synthesis, natural language processing, artificial intelligence, and deep learning, and more particularly to a method, an apparatus, a device, and a storage medium for converting accents.
Background
In recent years, with the rapid development of online education and online learning, accent conversion techniques, which aim to convert a user's accent into a target accent, have been widely studied and have attracted attention. Accent conversion technology likewise has great application prospects in entertainment. Existing accent conversion techniques, however, are slow, and their conversion results are often inaccurate.
Disclosure of Invention
The present disclosure provides a method, apparatus, device, and storage medium for converting accents.
According to an aspect of the present disclosure, there is provided a method for converting accents, including: acquiring a target spoken text; determining, based on the target spoken text, a first acoustic feature indicating a target accent; determining linguistic features in the first acoustic feature; and determining, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and outputting the audio.
According to another aspect of the present disclosure, there is provided an apparatus for converting accents, including: an acquisition unit configured to acquire a target spoken text; a first acoustic feature determination unit configured to determine, based on the target spoken text, a first acoustic feature indicating a target accent; a linguistic feature determination unit configured to determine linguistic features in the first acoustic feature; and a conversion unit configured to determine, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and to output the audio.
According to still another aspect of the present disclosure, there is provided an electronic device for converting an accent, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for converting accents as described above.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method for converting an accent as described above.
The technology of the present application solves the problem that accent conversion cannot be performed accurately and quickly: by carrying out speech synthesis, recognition, and conversion on the acquired spoken text data of a user, any text input by the user can be accurately and quickly converted into audio with the user's target accent.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method for converting an accent, according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a method for converting accents according to the present application;
FIG. 4 is a flow diagram of another embodiment of a method for converting an accent in accordance with the present application;
FIG. 5 is a schematic diagram of an embodiment of an apparatus for converting accents according to the present application;
fig. 6 is a block diagram of an electronic device for implementing a method for converting accents according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the present method for converting an accent or apparatus for converting an accent may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a speech synthesis application, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, in-car computers, laptop computers, desktop computers, and the like. When they are software, they may be installed in the electronic devices listed above and implemented as multiple pieces of software or software modules, or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server that provides various services, such as a background server that processes the target spoken text collected by the terminal devices 101, 102, 103. The background server may acquire the target spoken text; determine, based on it, a first acoustic feature indicating a target accent; determine linguistic features in the first acoustic feature; and determine, based on the linguistic features and the target spoken text, audio with the target accent corresponding to the target spoken text, and output the audio.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as a plurality of software or software modules, or as a single software or software module. And is not particularly limited herein.
It should be noted that the method for converting accents provided in the embodiments of the present application is generally performed by the server 105. Accordingly, the means for converting accents is typically located in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method for converting accents in accordance with the present application is shown. The method for converting accents of the embodiment comprises the following steps:
Step 201, acquiring a target spoken text.
In this embodiment, the executing body of the method for converting accents (for example, the server 105 in fig. 1) may obtain, through a wired or wireless connection, a target spoken text collected by a terminal device through recording or scanning. Specifically, the target spoken text may be text converted from speech uttered by any person who wishes to convert an accent, and may consist of any words. For example, if the utterance is "I love my hometown", the target spoken text may be "I love my hometown" expressed in written form.
Step 202, determining, based on the target spoken text, a first acoustic feature indicating a target accent.
After acquiring the target spoken text, the executing body may determine, based on it, a first acoustic feature indicating the target accent. In particular, the first acoustic feature may be a speech feature parameter characterizing the target accent, for example a mel spectrum. In this embodiment, the executing body may perform pre-emphasis, framing, and windowing on the audio corresponding to the target spoken text, then apply a short-time Fourier transform (STFT) to the signal of each frame to obtain a short-time amplitude spectrum, and pass the short-time amplitude spectrum through a mel filter bank to obtain the mel spectrum. The present application does not specifically limit the manner in which the mel spectrum (i.e., the first acoustic feature) is obtained.
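The pipeline just described (pre-emphasis, framing, Hann windowing, short-time amplitude spectrum, mel filter bank) can be sketched in plain NumPy. All parameter values here — the 16 kHz sample rate, 512-sample frames, 160-sample hop, 40 mel bands, the 0.97 pre-emphasis coefficient, and the O'Shaughnessy mel formula — are common conventions assumed for illustration, not values specified by the patent.

```python
import numpy as np

def hz_to_mel(f):
    # O'Shaughnessy mel formula -- a common convention, assumed here
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters centred at points evenly spaced on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=160, n_mels=40):
    # Pre-emphasis, framing and Hann windowing, as in the description
    emphasized = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - n_fft) // hop)
    window = np.hanning(n_fft)
    frames = np.stack([emphasized[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Short-time Fourier transform -> short-time amplitude spectrum
    mag = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    # Pass the amplitude spectrum through the mel filter bank
    return mag @ mel_filterbank(n_mels, n_fft, sr).T

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of 440 Hz
mel = mel_spectrogram(tone)
print(mel.shape)  # (97, 40): frames x mel bands
```

A production system would more likely use an optimized library routine; the sketch only makes the described frame-by-frame flow concrete.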
Having determined the first acoustic feature, the executing body may determine the linguistic features it contains. Linguistic features may include prosodic features, syntax, discourse structure, information structure, and the like. Prosodic features, also called suprasegmental features, belong to the sound system of a language and fall into three main aspects: intonation, temporal distribution, and stress, all realized through suprasegmental characteristics such as pitch, intensity, and duration, carried by a phoneme or group of phonemes. Prosody is a typical feature of natural human language, and many of its features are shared across languages, for example pitch declination, stress, and pauses. Prosodic features are one of the important vehicles of linguistic and emotional expression. The executing body may compare known linguistic features with the features in the first acoustic feature, and determine as linguistic features those whose similarity to a known linguistic feature exceeds a threshold. Specifically, the executing body may convert each first acoustic feature into a first acoustic feature vector and each known linguistic feature into a linguistic feature vector.
Cosine similarity is then computed between each first acoustic feature vector and each linguistic feature vector. The closer the cosine similarity is to 1, the more similar the two vectors are; when the absolute difference between the cosine similarity and 1 is less than a threshold, the first acoustic feature corresponding to that vector is determined to be the linguistic feature corresponding to the matched linguistic feature vector. Traversing all first acoustic features and all known linguistic features in this way yields all the linguistic features contained in the first acoustic feature.
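The matching procedure above — compute cosine similarity between each first acoustic feature vector and each known linguistic feature vector, and accept a match when the similarity is within a threshold of 1 — can be sketched as follows. The vectors and the 0.05 tolerance are illustrative values, not taken from the patent.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_linguistic(acoustic_vecs, known_vecs, tol=0.05):
    # Keep the index of each acoustic vector whose best cosine similarity
    # to any known linguistic vector satisfies |similarity - 1| < tol
    matched = []
    for i, a in enumerate(acoustic_vecs):
        best = max(cosine_similarity(a, k) for k in known_vecs)
        if abs(best - 1.0) < tol:
            matched.append(i)
    return matched

known = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
acoustic = [np.array([0.99, 0.05, 0.0]),  # nearly parallel to the first
            np.array([0.0, 0.0, 1.0])]    # orthogonal to both
print(match_linguistic(acoustic, known))  # [0]
```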
Step 204, determining, based on the linguistic features and the target spoken text, the audio with the target accent corresponding to the target spoken text, and outputting the audio.
After determining the linguistic features, the executing body may determine, based on the linguistic features and the target spoken text, the audio with the target accent corresponding to the target spoken text, and output the audio. Specifically, the executing body may input the linguistic features and the target spoken text into a pre-trained speech model to generate the audio. The pre-trained speech model characterizes the correspondence between linguistic features and target spoken texts on the one hand and audio with the target accent on the other, and may be a pre-trained neural network model. To train the speech model, an initial neural network model is first obtained; a training sample set is then acquired, in which each training sample comprises linguistic features, the target spoken text corresponding to those features, and labelled audio with the target accent corresponding to both; the linguistic features and their corresponding target spoken texts are used as input to the initial neural network model, the corresponding labelled audio with the target accent is used as the expected output, and the model is trained; the trained model is then determined to be the speech model. For example, the target spoken text may be the Mandarin text "I love my hometown", and the corresponding audio with the target accent may be audio of "I love my hometown" in a northeastern accent; the audio may be in MP3 or MP4 form, and the present application does not specifically limit the form of the output audio.
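The training recipe just described — inputs of linguistic features plus the corresponding spoken text, expected output of labelled target-accent audio — can be illustrated on toy data. A single linear layer trained by stochastic gradient descent stands in for the initial neural network model, and random vectors stand in for the encoded text and audio features; none of this is the patent's actual model, only the supervision pattern it describes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set with the structure the description gives: each sample
# pairs linguistic features and an encoded spoken text (the inputs) with
# labelled target-accent audio features (the expected output). A hidden
# linear map stands in for the real text-to-audio relationship.
W_true = rng.normal(size=(3, 8))
samples = []
for _ in range(32):
    linguistic = rng.normal(size=4)       # stand-in linguistic feature vector
    text_vec = rng.normal(size=4)         # stand-in encoded spoken text
    x = np.concatenate([linguistic, text_vec])
    samples.append((x, W_true @ x))       # (model input, expected output)

# Stand-in for training the "initial neural network model": stochastic
# gradient descent on the squared error against the labelled audio.
W = np.zeros((3, 8))
for epoch in range(500):
    for x, y in samples:
        err = W @ x - y
        W -= 0.01 * np.outer(err, x)      # gradient of 0.5 * ||err||^2

x0, y0 = samples[0]
residual = float(np.linalg.norm(W @ x0 - y0))
print(residual)  # near zero once training has converged
```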
With continued reference to fig. 3, a schematic illustration of one application scenario of the method for converting accents according to the present application is shown. In the application scenario of fig. 3, a server 303 obtains targeted spoken text 301 over a network 302. The server 303 determines 304 a first acoustic feature indicative of a target accent based on the target spoken text 301. The server 303 determines linguistic features 305 in the first acoustic features 304. The server 303 determines the audio 306 having the target accent corresponding to the target spoken text 301 based on the linguistic features 305 and the target spoken text 301, and outputs the audio 306. This application is actually the conversion of speech text data to audio with a particular accent. The text content is unchanged.
According to this embodiment, by performing speech synthesis, recognition, and conversion on the acquired spoken text data of a user, any text input by the user can be accurately and quickly converted into audio with the user's target accent.
With continued reference to FIG. 4, a flow 400 of another embodiment of a method for converting accents in accordance with the present application is shown. As shown in fig. 4, the method for converting accents of the present embodiment may include the following steps:
Step 401, acquiring a target spoken text.
Step 402, determining, based on the target spoken text, a first acoustic feature indicating a target accent.
The principle of step 401 to step 402 is similar to that of step 201 to step 202, and is not described herein again.
Specifically, step 402 may be implemented by step 4021:
In this embodiment, the pre-trained speech synthesis model characterizes the correspondence between spoken text and the first acoustic feature. After obtaining the target spoken text, the executing body may determine, from the target spoken text and the pre-trained speech synthesis model, the first acoustic feature indicating the target accent. In particular, the first acoustic feature may be a mel spectrum indicating the target accent, such as a northeastern accent. The first acoustic feature obtained from the speech synthesis model cannot yet be used to output the target accent: further feature recognition and feature conversion are needed to obtain features carrying more information about the target spoken text, after which further speech synthesis is performed on those features. Specifically, the executing body may input the target spoken text into the pre-trained speech synthesis model, which extracts from it the first acoustic feature indicating the target accent.
According to this embodiment, the mel spectrum of audio with the target accent can be obtained from the target spoken text and the pre-trained speech synthesis model, so that the spoken text of any user can be accurately converted into the target accent based on the obtained mel spectrum, yielding audio with the target accent.
In step 403, linguistic features in the first acoustic features are determined.
The principle of step 403 is similar to that of step 203, and is not described in detail here.
Specifically, step 403 may be implemented by step 4031:
The pre-trained recognition model characterizes the correspondence between the first acoustic feature and linguistic features. After obtaining the first acoustic feature, the executing body may extract linguistic features from it using the pre-trained recognition model. Specifically, the linguistic features in the first acoustic feature may include prosodic features such as intonation, temporal distribution, stress, pitch, and pauses. Linguistic features are one of the important vehicles of linguistic and emotional expression.
In some optional implementations of this embodiment, the executing body may further determine, from the first acoustic feature and a pre-trained recognition model, a class identifier corresponding to the first acoustic feature, where the recognition model characterizes the correspondence between the first acoustic feature and class identifiers. A class identifier characterizes the class of each phoneme in the first acoustic feature; for example, phonemes may belong to intonation, temporal distribution, stress, pitch, loudness, or pause classes, each represented by a distinct numeric identifier. The executing body may then determine, from the phonemes whose identifiers match preset identifiers and from the correspondence between phonemes and second acoustic features, a second acoustic feature for generating the audio with the target accent. The second acoustic feature may be the mel spectrum corresponding to each phoneme required to generate the target accent. The executing body may determine, based on the second acoustic feature, the audio with the target accent corresponding to the target spoken text, and output the audio. This implementation enriches the mel spectra required to generate audio with the target accent and improves the accuracy of that audio.
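The per-phoneme class identifiers described above can be illustrated with a simple lookup table. The category names and integer ids below are hypothetical, since the patent does not fix a concrete assignment.

```python
# Hypothetical numeric class identifiers for prosodic phoneme categories,
# following the description's scheme of one integer id per category.
PHONEME_CLASSES = {
    1: "intonation",
    2: "time-domain distribution",
    3: "stress",
    4: "pitch",
    5: "loudness",
    6: "pause",
}

def group_by_class(tagged_phonemes):
    """Group (phoneme, class_id) pairs by their prosodic category name."""
    groups = {}
    for phoneme, class_id in tagged_phonemes:
        groups.setdefault(PHONEME_CLASSES[class_id], []).append(phoneme)
    return groups

tagged = [("a1", 4), ("sil", 6), ("a1", 3)]
print(group_by_class(tagged))  # {'pitch': ['a1'], 'pause': ['sil'], 'stress': ['a1']}
```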
In this embodiment, extracting the linguistic features in the first acoustic feature with the pre-trained recognition model captures the features of linguistic and emotional expression in the target spoken text, completes the features used to generate the target-accent audio, and improves the accuracy of the generated audio.
Step 404, determining, based on the linguistic features and the target spoken text, the audio with the target accent corresponding to the target spoken text, and outputting the audio.
The principle of step 404 is similar to that of step 204, and is not described here again.
Specifically, step 404 can be implemented by steps 4041 to 4042:
In this embodiment, the conversion model characterizes the correspondence between the linguistic features and the target spoken text on the one hand and the second acoustic feature used to generate audio with the target accent on the other. The second acoustic feature may comprise the mel spectrum corresponding to the linguistic features above, i.e., the mel spectrum of each phoneme required to generate the target accent. The executing body may input the linguistic features and the target spoken text into the pre-trained conversion model and obtain as output the second acoustic feature corresponding to the target spoken text.
Having obtained the second acoustic feature, the executing body may determine the audio with the target accent based on it. Specifically, the executing body may input the second acoustic feature into a vocoder, which converts it into audio with the target accent. The vocoder encodes the second acoustic feature at its transmitting end to match the channel, transmits it over the channel to the receiving end, and analyzes the received feature in the frequency domain: it distinguishes unvoiced from voiced sounds, determines the fundamental frequency of voiced sounds, and selects the unvoiced/voiced decision, the fundamental frequency, and the spectral envelope as feature parameters for transmission. The analysis may also be performed in the time domain, periodically extracting some second acoustic features for linear prediction to generate the corresponding audio with the target accent. Vocoders include channel vocoders, formant vocoders, pattern vocoders, linear prediction vocoders, correlation vocoders, and orthogonal function vocoders; the present application does not specifically limit the type of vocoder.
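The frequency-of-analysis step described above — a voiced/unvoiced decision plus a fundamental-frequency estimate for each frame — can be sketched for one audio frame using normalized autocorrelation. The 60-400 Hz search range and the 0.3 voicing threshold are illustrative choices, not values from the patent.

```python
import numpy as np

def analyze_frame(frame, sr=16000, f0_range=(60.0, 400.0)):
    # Classical vocoder-style analysis of one frame: voiced/unvoiced
    # decision and fundamental-frequency estimate via autocorrelation.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                       # normalize so ac[0] == 1
    lo = int(sr / f0_range[1])            # shortest period to consider
    hi = int(sr / f0_range[0])            # longest period to consider
    lag = lo + int(np.argmax(ac[lo:hi]))
    voiced = bool(ac[lag] > 0.3)          # illustrative voicing threshold
    f0 = sr / lag if voiced else 0.0
    return voiced, f0

sr = 16000
t = np.arange(int(0.04 * sr)) / sr        # one 40 ms frame
voiced, f0 = analyze_frame(np.sin(2 * np.pi * 200 * t), sr)
print(voiced, f0)  # True 200.0
```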
According to this embodiment, the second acoustic feature corresponding to the target spoken text is determined from the linguistic features, the target spoken text, and the pre-trained conversion model, and the audio with the target accent is determined from the second acoustic feature, so that audio in the target accent can be obtained accurately for any spoken text of any user.
In some optional implementations of this embodiment, the method for converting accents further comprises the following model training steps, not shown in fig. 4: acquiring an initial neural network model; acquiring a training sample set, in which each training sample comprises linguistic features, the target spoken text corresponding to those features, and the labelled second acoustic feature corresponding to both; using the linguistic features and their corresponding target spoken texts as input to the initial neural network model and the corresponding labelled second acoustic features as the expected output, training the initial neural network model; and determining the trained model as the conversion model.
In this embodiment, the executing body may obtain the initial neural network model through a wired or wireless connection. The initial neural network model may be any of various artificial neural networks (ANNs) containing hidden layers. The executing body may also obtain a pre-stored initial model locally, or obtain it from a communicatively connected electronic device; this is not limited herein.
In this embodiment, the executing body may acquire the training sample set in various ways. The labelled second acoustic features in the training samples may be obtained locally or from a communicatively connected electronic device over a wired or wireless connection, may be labelled manually in real time, or may be labelled automatically and then manually supplemented and corrected; the present application does not specifically limit this. The linguistic features in the training samples may be extracted in real time or obtained locally or from a connected device. The target spoken texts corresponding to the linguistic features may be collected from different users in real time, or likewise obtained locally or from a connected device; this, too, is not specifically limited.
In this embodiment, the initial neural network model is trained on the acquired training sample set, yielding a conversion model that can accurately generate audio with the target accent from linguistic features and a spoken text. The characters of the acquired target spoken text can thereby be converted more accurately into audio with the target accent, improving the quality of the generated audio.
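The training scheme described above (input: linguistic features concatenated with a representation of the corresponding target speaking text; expected output: the labeled second acoustic feature) can be sketched with a toy model. Everything here is illustrative: the featurizations, dimensions, and the single linear layer standing in for the "initial neural network model" are assumptions, not the embodiment's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the training samples: linguistic features
# and a text embedding are concatenated as model input; the labeled
# "second acoustic feature" (e.g., one mel-spectrum frame) is the
# expected output. All dimensions are arbitrary.
n_samples, d_ling, d_text, d_mel = 200, 8, 4, 16
X_ling = rng.normal(size=(n_samples, d_ling))
X_text = rng.normal(size=(n_samples, d_text))
X = np.hstack([X_ling, X_text])

true_W = rng.normal(size=(d_ling + d_text, d_mel))
Y = X @ true_W  # labeled second acoustic features (synthetic)

# "Initial neural network model": here a single linear layer, trained
# with plain gradient descent on mean-squared error between predicted
# and expected second acoustic features.
W = np.zeros((d_ling + d_text, d_mel))
lr = 0.05
for _ in range(500):
    pred = X @ W                       # model output
    grad = X.T @ (pred - Y) / n_samples
    W -= lr * grad                     # gradient step toward expected output

mse = float(np.mean((X @ W - Y) ** 2))
print(f"final MSE: {mse:.6f}")
```

The trained weights `W` play the role of the "conversion model" determined at the end of the training steps; a real embodiment would use a deep network with hidden layers rather than this linear sketch.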
In some optional implementations of the embodiment, the audio with the target accent comprises at least one of: speech audio with a target accent, singing audio with a target accent.
In some optional implementations of the embodiment, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and determining audio having a target accent based on the second acoustic feature, comprising: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
Specifically, after obtaining the mel spectrum corresponding to the target accent, the execution subject may automatically input it into a preset neural network vocoder to synthesize audio having the target accent based on the vocoder and the mel spectrum.
In this implementation, synthesizing audio from the mel spectrum corresponding to the target accent with a preset neural network vocoder makes the synthesized audio with the target accent more accurate.
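As an illustration of this final synthesis step, the following is a crude stand-in for the preset neural network vocoder: it maps a (fake) mel spectrum to a waveform by summing amplitude-modulated sinusoids, one per mel band. A real embodiment would use a trained neural vocoder (e.g., a WaveNet-style model); the band frequencies, hop size, and sample rate below are arbitrary assumptions.

```python
import numpy as np

def toy_vocoder(mel: np.ndarray, hop: int = 256, sr: int = 16000) -> np.ndarray:
    """Sinusoidal stand-in for a neural vocoder: each mel band drives
    one sinusoid whose amplitude follows that band's energy over time."""
    n_mels, n_frames = mel.shape
    n_samples = n_frames * hop
    t = np.arange(n_samples) / sr
    # hypothetical center frequencies on a log-like scale (not true mel)
    freqs = 80.0 * (2.0 ** (np.arange(n_mels) / 4.0))
    wave = np.zeros(n_samples)
    for band in range(n_mels):
        # hold each frame's band energy constant for `hop` samples
        amp = np.repeat(mel[band], hop)
        wave += amp * np.sin(2 * np.pi * freqs[band] * t)
    peak = np.max(np.abs(wave))
    return wave / peak if peak > 0 else wave

# fake mel spectrum: 8 bands x 20 frames of nonnegative energies
mel = np.abs(np.random.default_rng(1).normal(size=(8, 20)))
audio = toy_vocoder(mel)
print(audio.shape)
```

The point of the sketch is only the interface: mel spectrum in, time-domain waveform out, with the waveform length fixed by the frame count and hop size.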
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for converting accents, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the apparatus 500 for converting accents of the present embodiment includes: an acquisition unit 501, a first acoustic feature determination unit 502, a linguistic feature determination unit 503, and a conversion unit 504.
An obtaining unit 501 configured to obtain a target utterance text.
A first acoustic feature determination unit 502 configured to determine a first acoustic feature indicating a target accent based on the target spoken text.
A linguistic feature determination unit 503 configured to determine linguistic features in the first acoustic features.
And the conversion unit 504 is configured to determine audio with a target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and output the audio.
In some optional implementations of the present embodiment, the first acoustic feature determination unit 502 is further configured to: and determining a first acoustic feature corresponding to the target speaking text and used for indicating the target accent according to the target speaking text and the pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
In some optional implementations of this embodiment, the linguistic feature determination unit 503 is further configured to: and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
In some optional implementations of this embodiment, the conversion unit 504 is further configured to: determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent; based on the second acoustic feature, audio having a target accent is determined.
In some optional implementations of this embodiment, the apparatus further comprises a training unit, not shown in fig. 5, configured to: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features; taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as input of an initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In some optional implementations of the embodiment, the audio with the target accent comprises at least one of: speech audio with a target accent, singing audio with a target accent.
In some optional implementations of the embodiment, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and the conversion unit 504 is further configured to: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
It should be understood that the units 501 to 504 described in the apparatus 500 for converting accents correspond to the respective steps in the method described with reference to fig. 2, respectively. Thus, the operations and features described above for the method for converting accents apply equally to the apparatus 500 and the units comprised therein, and are not described in detail here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 6 is a block diagram of an electronic device for converting accents according to an embodiment of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 6, the electronic apparatus includes: one or more processors 601, memory 602, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses 605 and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses 605 may be used, along with multiple memories, if desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 6, one processor 601 is taken as an example.
The memory 602 is a non-transitory computer-readable storage medium as provided herein. The memory stores instructions executable by the at least one processor, so that the at least one processor performs the method for converting accents provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the method for converting accents provided herein.
The memory 602, which is a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and units, such as program instructions/units corresponding to the method for converting accents in the embodiment of the present application (for example, the acquisition unit 501, the first acoustic feature determination unit 502, the linguistic feature determination unit 503, and the conversion unit 504 shown in fig. 5). The processor 601 executes various functional applications of the server and data processing by executing non-transitory software programs, instructions, and modules stored in the memory 602, that is, implements the method for converting accents in the above method embodiments.
The memory 602 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic apparatus for converting accents, and the like. Further, the memory 602 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 602 optionally includes memory located remotely from the processor 601, and these remote memories may be connected over a network to an electronic device for converting accents. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for converting accents may further comprise: an input device 603 and an output device 604. The processor 601, the memory 602, the input device 603, and the output device 604 may be connected by a bus 605 or other means, and are exemplified by the bus 605 in fig. 6.
The input device 603 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic apparatus for converting accents, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or other input device. The output devices 604 may include a display device, auxiliary lighting devices (e.g., LEDs), and tactile feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical solution of the embodiments of the present application, by performing speech synthesis, recognition, and conversion on the acquired spoken text data of a user, any text input by the user can be accurately and quickly converted into audio with the target accent.
In accordance with one or more embodiments of the present disclosure, there is provided a method for converting accents, comprising: acquiring a target speaking text; determining a first acoustic feature indicative of a target accent based on the target speaking text; determining linguistic features in the first acoustic feature; and determining audio with the target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and outputting the audio.
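The four steps of the method compose into a simple pipeline. The sketch below wires hypothetical stubs together in the described order (speech synthesis model → recognition model → conversion model → vocoder); every function body is a placeholder invented for illustration, shown only to make the data flow between the steps concrete.

```python
import numpy as np

def synthesis_model(text: str) -> np.ndarray:
    """Step 2 stub: target speaking text -> first acoustic feature
    indicating the target accent (deterministic pseudo-features)."""
    codes = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return np.outer(codes % 13, np.ones(4)).astype(float)

def recognition_model(first_feat: np.ndarray) -> np.ndarray:
    """Step 3 stub: extract linguistic features from the first
    acoustic feature (here: per-row mean)."""
    return first_feat.mean(axis=1)

def conversion_model(ling: np.ndarray, text: str) -> np.ndarray:
    """Step 4a stub: linguistic features + text -> second acoustic
    feature (here: tiled into a fake mel spectrum)."""
    return np.tile(ling, (8, 1))

def vocoder(mel: np.ndarray) -> np.ndarray:
    """Step 4b stub: second acoustic feature -> waveform samples."""
    return mel.flatten()

def convert_accent(text: str) -> np.ndarray:
    first = synthesis_model(text)       # first acoustic feature
    ling = recognition_model(first)     # linguistic features
    mel = conversion_model(ling, text)  # second acoustic feature
    return vocoder(mel)                 # audio with the target accent

audio = convert_accent("hello")
print(audio.shape)
```

In an actual embodiment each stub would be a trained model (the pre-trained speech synthesis, recognition, and conversion models, plus the neural network vocoder); only the order of composition is taken from the text.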
In accordance with one or more embodiments of the present disclosure, determining a first acoustic feature for indicating a target accent based on a target spoken text includes: and determining a first acoustic feature corresponding to the target speaking text and used for indicating the target accent according to the target speaking text and the pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
According to one or more embodiments of the present disclosure, determining linguistic features in the first acoustic feature comprises: and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
According to one or more embodiments of the present disclosure, determining audio corresponding to a target spoken text and having a target accent based on linguistic features and the target spoken text includes: determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent; based on the second acoustic feature, audio having a target accent is determined.
According to one or more embodiments of the present disclosure, the method for converting accents further comprises: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features; taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as input of an initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In accordance with one or more embodiments of the present disclosure, audio having a target accent includes at least one of: speech audio with a target accent, singing audio with a target accent.
In accordance with one or more embodiments of the present disclosure, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and determining audio having a target accent based on the second acoustic feature, comprising: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
In accordance with one or more embodiments of the present disclosure, there is provided an apparatus for converting accents, comprising: an acquisition unit configured to acquire a target utterance text; a first acoustic feature determination unit configured to determine a first acoustic feature indicating a target accent based on the target spoken text; a linguistic feature determination unit configured to determine a linguistic feature in the first acoustic features; and the conversion unit is configured to determine the audio with the target accent corresponding to the target speaking text based on the linguistic characteristics and the target speaking text and output the audio.
According to one or more embodiments of the present disclosure, the first acoustic feature determination unit is further configured to: and determining a first acoustic feature corresponding to the target speaking text and used for indicating the target accent according to the target speaking text and the pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
According to one or more embodiments of the present disclosure, the linguistic feature determination unit is further configured to: and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
According to one or more embodiments of the present disclosure, the conversion unit is further configured to: determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent; based on the second acoustic feature, audio having a target accent is determined.
According to one or more embodiments of the present disclosure, the apparatus for converting accents further comprises a training unit, not shown in fig. 5, configured to: acquiring an initial neural network model; acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features; taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as input of an initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model; and determining the trained initial neural network model as a conversion model.
In accordance with one or more embodiments of the present disclosure, audio having a target accent includes at least one of: speech audio with a target accent, singing audio with a target accent.
In accordance with one or more embodiments of the present disclosure, the second acoustic feature includes a mel-frequency spectrum corresponding to the target accent; and the conversion unit is further configured to: and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
It should be understood that the above embodiments are merely exemplary and are not limiting; other methods known in the art for converting accents may also be implemented. Steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, which is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A method for converting accents, comprising:
acquiring a target speaking text;
determining a first acoustic feature indicative of a target accent based on the target spoken text;
determining linguistic features in the first acoustic features;
and determining the audio with the target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and outputting the audio.
2. The method of claim 1, wherein determining, based on the targeted spoken text, a first acoustic feature indicative of a targeted accent comprises:
and determining a first acoustic feature which is corresponding to the target speaking text and is used for indicating a target accent according to the target speaking text and a pre-trained speech synthesis model, wherein the pre-trained speech synthesis model is used for representing the corresponding relation between the speaking text and the first acoustic feature.
3. The method of claim 1, wherein the determining linguistic features in the first acoustic feature comprises:
and extracting linguistic features in the first acoustic features by using a pre-trained recognition model, wherein the pre-trained recognition model is used for representing the corresponding relation between the first acoustic features and the linguistic features.
4. The method of claim 1, wherein the determining audio corresponding to the target spoken text with a target accent based on the linguistic features and the target spoken text comprises:
determining a second acoustic feature corresponding to the target speaking text and used for generating the audio with the target accent according to the linguistic feature, the target speaking text and a pre-trained conversion model, wherein the conversion model is used for representing the corresponding relation between the linguistic feature and the target speaking text and the second acoustic feature used for generating the audio with the target accent;
based on the second acoustic feature, audio having a target accent is determined.
5. The method of claim 4, wherein the method further comprises:
acquiring an initial neural network model;
acquiring a training sample set, wherein training samples in the training sample set comprise linguistic features, target speaking texts corresponding to the linguistic features and labeled second acoustic features corresponding to the linguistic features and the target speaking texts corresponding to the linguistic features;
taking the linguistic features of the training samples in the training sample set and target speaking texts corresponding to the linguistic features as the input of the initial neural network model, taking second acoustic features corresponding to the input linguistic features and the target speaking texts corresponding to the linguistic features as expected output, and training the initial neural network model;
and determining the trained initial neural network model as the conversion model.
6. The method of any of claims 1-5, wherein the audio with the target accent comprises at least one of: speech audio with a target accent, singing audio with a target accent.
7. The method of claim 4, wherein the second acoustic feature comprises a Mel spectrum corresponding to the target accent; and
the determining audio having a target accent based on the second acoustic feature comprises:
and synthesizing the audio with the target accent according to the Mel frequency spectrum and a preset neural network vocoder.
8. An apparatus for converting accents, comprising:
an acquisition unit configured to acquire a target utterance text;
a first acoustic feature determination unit configured to determine a first acoustic feature indicating a target accent based on the target spoken text;
a linguistic feature determination unit configured to determine linguistic features in the first acoustic features;
and the conversion unit is configured to determine the audio with the target accent corresponding to the target speaking text based on the linguistic features and the target speaking text, and output the audio.
9. An electronic device for converting accents, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
10. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011270379.XA CN112382267A (en) | 2020-11-13 | 2020-11-13 | Method, apparatus, device and storage medium for converting accents |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112382267A true CN112382267A (en) | 2021-02-19 |
Family
ID=74582321
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011270379.XA Pending CN112382267A (en) | 2020-11-13 | 2020-11-13 | Method, apparatus, device and storage medium for converting accents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112382267A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223542A (en) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113539239A (en) * | 2021-07-12 | 2021-10-22 | 网易(杭州)网络有限公司 | Voice conversion method, device, storage medium and electronic equipment |
CN113808572A (en) * | 2021-08-18 | 2021-12-17 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
CN113838452A (en) * | 2021-08-17 | 2021-12-24 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, device and computer storage medium |
WO2022236111A1 (en) * | 2021-05-06 | 2022-11-10 | Sanas.ai Inc. | Real-time accent conversion model |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160005391A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Devices and Methods for Use of Phase Information in Speech Processing Systems |
EP3151239A1 (en) * | 2015-09-29 | 2017-04-05 | Yandex Europe AG | Method and system for text-to-speech synthesis |
CN109285535A (en) * | 2018-10-11 | 2019-01-29 | 四川长虹电器股份有限公司 | Phoneme synthesizing method based on Front-end Design |
US20190088253A1 (en) * | 2017-09-20 | 2019-03-21 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for converting english speech information into text |
CN110197655A (en) * | 2019-06-28 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for synthesizing voice |
US20200082805A1 (en) * | 2017-05-16 | 2020-03-12 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for speech synthesis |
CN111243571A (en) * | 2020-01-14 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111462727A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111477210A (en) * | 2020-04-02 | 2020-07-31 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111785261A (en) * | 2020-05-18 | 2020-10-16 | 南京邮电大学 | Cross-language voice conversion method and system based on disentanglement and explanatory representation |
CN111899719A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160005391A1 (en) * | 2014-07-03 | 2016-01-07 | Google Inc. | Devices and Methods for Use of Phase Information in Speech Processing Systems |
EP3151239A1 (en) * | 2015-09-29 | 2017-04-05 | Yandex Europe AG | Method and system for text-to-speech synthesis |
US20200082805A1 (en) * | 2017-05-16 | 2020-03-12 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for speech synthesis |
US20190088253A1 (en) * | 2017-09-20 | 2019-03-21 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for converting english speech information into text |
CN109285535A (en) * | 2018-10-11 | 2019-01-29 | 四川长虹电器股份有限公司 | Phoneme synthesizing method based on Front-end Design |
CN110197655A (en) * | 2019-06-28 | 2019-09-03 | 百度在线网络技术(北京)有限公司 | Method and apparatus for synthesizing voice |
CN111243571A (en) * | 2020-01-14 | 2020-06-05 | 北京字节跳动网络技术有限公司 | Text processing method, device and equipment and computer readable storage medium |
CN111292720A (en) * | 2020-02-07 | 2020-06-16 | 北京字节跳动网络技术有限公司 | Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment |
CN111326138A (en) * | 2020-02-24 | 2020-06-23 | 北京达佳互联信息技术有限公司 | Voice generation method and device |
CN111462727A (en) * | 2020-03-31 | 2020-07-28 | 北京字节跳动网络技术有限公司 | Method, apparatus, electronic device and computer readable medium for generating speech |
CN111477210A (en) * | 2020-04-02 | 2020-07-31 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device |
CN111489734A (en) * | 2020-04-03 | 2020-08-04 | 支付宝(杭州)信息技术有限公司 | Model training method and device based on multiple speakers |
CN111785261A (en) * | 2020-05-18 | 2020-10-16 | 南京邮电大学 | Cross-language voice conversion method and system based on disentanglement and explanatory representation |
CN111899719A (en) * | 2020-07-30 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for generating audio |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113223542A (en) * | 2021-04-26 | 2021-08-06 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
CN113223542B (en) * | 2021-04-26 | 2024-04-12 | 北京搜狗科技发展有限公司 | Audio conversion method and device, storage medium and electronic equipment |
WO2022236111A1 (en) * | 2021-05-06 | 2022-11-10 | Sanas.ai Inc. | Real-time accent conversion model |
US11948550B2 (en) | 2021-05-06 | 2024-04-02 | Sanas.ai Inc. | Real-time accent conversion model |
CN113539239A (en) * | 2021-07-12 | 2021-10-22 | 网易(杭州)网络有限公司 | Voice conversion method, device, storage medium and electronic equipment |
CN113838452A (en) * | 2021-08-17 | 2021-12-24 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, device and computer storage medium |
CN113838452B (en) * | 2021-08-17 | 2022-08-23 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, device and computer storage medium |
CN113808572A (en) * | 2021-08-18 | 2021-12-17 | 北京百度网讯科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102581346B1 (en) | Multilingual speech synthesis and cross-language speech replication | |
CN112382267A (en) | Method, apparatus, device and storage medium for converting accents | |
US11881210B2 (en) | Speech synthesis prosody using a BERT model | |
US11605371B2 (en) | Method and system for parametric speech synthesis | |
CN112382270A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN111899719A (en) | Method, apparatus, device and medium for generating audio | |
CN114203147A (en) | System and method for text-to-speech cross-speaker style delivery and for training data generation | |
KR102594081B1 (en) | Predicting parametric vocoder parameters from prosodic features | |
Astrinaki et al. | Reactive and continuous control of HMM-based speech synthesis | |
KR20230056741A (en) | Synthetic Data Augmentation Using Voice Transformation and Speech Recognition Models | |
US11475874B2 (en) | Generating diverse and natural text-to-speech samples | |
CN112365879A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
CN110600013A (en) | Training method and device for non-parallel corpus voice conversion data enhancement model | |
JP2022133392A (en) | Speech synthesis method and device, electronic apparatus, and storage medium | |
CN113963679A (en) | Voice style migration method and device, electronic equipment and storage medium | |
EP4205105A1 (en) | System and method for cross-speaker style transfer in text-to-speech and training data generation | |
CN112382269A (en) | Audio synthesis method, device, equipment and storage medium | |
CN112382274A (en) | Audio synthesis method, device, equipment and storage medium | |
Jayakumari et al. | An improved text to speech technique for tamil language using hidden Markov model | |
Ajayi et al. | Systematic review on speech recognition tools and techniques needed for speech application development | |
Louw | Neural speech synthesis for resource-scarce languages | |
US11335321B2 (en) | Building a text-to-speech system from a small amount of speech data | |
Astrinaki et al. | sHTS: A streaming architecture for statistical parametric speech synthesis | |
Docasal et al. | Enhancing Voice Cloning Quality through Data Selection and Alignment-based Metrics | |
Mishra et al. | Emotion Detection From Speech |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |