CN113035239A - Chinese-English bilingual cross-language emotion voice synthesis device - Google Patents

Info

Publication number
CN113035239A
CN113035239A (application CN201911253410.6A)
Authority
CN
China
Prior art keywords
device body
chinese
english
voice synthesis
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911253410.6A
Other languages
Chinese (zh)
Inventor
吴沛文
李曜
吴云清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Aviation Electric Co Ltd
Original Assignee
Shanghai Aviation Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Aviation Electric Co Ltd filed Critical Shanghai Aviation Electric Co Ltd
Priority to CN201911253410.6A
Publication of CN113035239A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00
    • G10L25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L15/00: Speech recognition
    • G10L15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese-English bilingual cross-language emotion voice synthesis device. It comprises a device body with an armband connected between its upper and lower surfaces; an LCD (liquid crystal display) on the upper front surface and a text input unit on the lower front surface; a switch on the right of the front surface between the LCD display and the text input unit; a loudspeaker on the right of the upper surface; and a wireless transceiver on the upper right side. A voice synthesis FPGA inside the device body is connected respectively to an SD (secure digital) card, the switch, the loudspeaker, a voice recognition chip, and a single chip microcomputer. The beneficial effect of the invention is that it can synthesize Chinese or English speech carrying the speaker's style and emotion on behalf of people who speak only one of the two languages, so that Chinese and English speakers of different languages can communicate naturally and smoothly.

Description

Chinese-English bilingual cross-language emotion voice synthesis device
Technical Field
The invention relates to the technical field of voice synthesis, in particular to a Chinese-English bilingual cross-language emotion voice synthesis device.
Background
Speech is one of the most natural modes of communication between people and also one of the most natural modes of human-machine interaction. With the rapid development of artificial intelligence, intelligent electronic products are increasingly integrated into daily life, and there is growing demand for speech synthesis technology, particularly emotion speech synthesis that can convey a speaker's style and emotion. We live in a multi-ethnic, multi-language world in which people who speak different languages cannot communicate conveniently, and a cross-language emotion voice synthesis device is urgently needed to bridge the gap between languages. In addition, people who are mute or whose speech is unclear urgently need such a device to speak on their behalf and so solve their communication problem.
Disclosure of Invention
The object of the invention is to provide a Chinese-English bilingual cross-language emotion voice synthesis device that helps people who speak only one of Chinese and English communicate naturally across the two languages, and that also helps mute people and others with speech disabilities communicate naturally with other people by means of the system.
To achieve this purpose, the technical scheme of the invention is as follows. A Chinese-English bilingual cross-language emotion voice synthesis device comprises a device body of cuboid shape. An armband is connected between the upper and lower surfaces of the device body so that it can be fixed on an arm. An LCD display is arranged on the upper part of the front surface of the device body, and a text input device on the lower part. A switch with three keys is arranged on the right side of the front surface between the LCD display and the text input device: the left key controls Chinese emotion voice synthesis, the middle key is the power switch key, and the right key controls English emotion voice synthesis. A loudspeaker is arranged on the right side of the upper surface. A wireless transceiver is arranged on the upper part of the right surface for wireless connection to devices such as mobile phones and vehicle-mounted computers, extending the device to further cross-language emotion voice synthesis applications. An earphone hole is formed in the right side of the device body, with a microphone below it; the microphone and the earphone hole form one integrated element. A power supply is arranged at the lower rear of the device body. A voice synthesis FPGA is arranged inside the device body and is connected respectively to an SD card, the switch, the loudspeaker, a voice recognition chip, and a single chip microcomputer; the single chip microcomputer is connected to the voice recognition chip, the wireless transceiver, and the LCD display; the microphone is connected to the voice recognition chip; and the power supply is connected to the switch.
Compared with the prior art, the invention has the following beneficial effects: it can synthesize Chinese or English speech carrying the speaker's style and emotion on behalf of people who speak only one of the two languages, so that Chinese and English speakers of different languages can communicate naturally and smoothly, solving the problem of being unable to communicate because of a language barrier. At the same time, the device can help people who are mute or whose speech is unclear communicate with others in natural Chinese and English, solving the problem of being unable to speak.
In addition to the technical problems, technical features, and advantageous effects described above, other technical problems solved by the invention, other technical features of its solutions, and the advantageous effects they bring are described in further detail below with reference to the accompanying drawings.
Drawings
FIG. 1 is a schematic diagram of a device for synthesizing Chinese-English bilingual cross-language emotion speech according to the present invention.
FIG. 2 is a schematic diagram of an internal control system of the Chinese-English bilingual cross-language emotion speech synthesis apparatus according to the present invention.
FIG. 3 is a schematic flow chart showing how the FPGA in the Chinese-English bilingual cross-language emotion voice synthesis device implements the cross-language emotion voice synthesis.
In the figures: 1. wireless transceiver; 2. single chip microcomputer; 3. voice recognition chip; 4. microphone; 5. SD card; 6. FPGA; 7. switch; 8. power supply; 9. voice parameter generator; 10. loudspeaker; 11. text input device; 12. LCD display; 13. armband; 14. Chinese-English bilingual cross-language emotion voice synthesis device body; 15. earphone hole.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments and drawings. Here, the description of the embodiments is provided to help understanding of the present invention, but the present invention is not limited thereto. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to FIG. 1, the Chinese-English bilingual cross-language emotion voice synthesis device of the invention includes a device body 14 of cuboid shape. An armband 13 is connected between its upper and lower surfaces so that the device can be fixed on an arm; being light and compact, it is easy to carry. On the upper front of the device body 14 is an LCD display 12 for displaying text and other information. On the lower front is a text input device 11 (a keyboard) for entering Chinese or English text. On the right side between the LCD display 12 and the text input device 11 is a switch 7 with three keys: the middle key is the power switch of the device, the left key triggers Chinese emotion voice synthesis, and the right key triggers English emotion voice synthesis. The loudspeaker 10 on the upper right of the device body 14 plays the synthesized Chinese and English speech. The wireless transceiver 1 on the upper right surface can connect wirelessly to devices such as mobile phones and vehicle-mounted computers, extending the device to further cross-language emotion voice synthesis applications. The earphone hole 15 below the wireless transceiver 1 accepts a wired or wireless earphone for the user's convenience. The microphone 4 below the earphone hole 15 is used for voice input; the microphone 4 and the earphone hole 15 form one integrated element.
The power supply 8 is arranged at the rear lower part of the device body 14, and the power supply 8 is used for supplying power to the whole Chinese-English bilingual cross-language emotion voice synthesis device.
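The internal wiring described above (and shown in FIG. 2) can be summarized as a small connection graph. The sketch below is illustrative only: the component names and the graph structure are our rendering of the connections listed in the description, not anything defined by the patent itself.

```python
# Hypothetical model of the device's internal wiring (after FIG. 2).
# Each key lists the components the description says it is connected to.
CONNECTIONS = {
    "speech_synthesis_fpga": ["sd_card", "switch", "speaker",
                              "speech_recognition_chip", "mcu"],
    "mcu": ["speech_recognition_chip", "wireless_transceiver", "lcd_display"],
    "microphone": ["speech_recognition_chip"],
    "power_supply": ["switch"],
}

def reachable_from(start, connections):
    """Return every component reachable from `start`, treating links as bidirectional."""
    # Build an undirected view of the wiring.
    graph = {}
    for src, dsts in connections.items():
        for dst in dsts:
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set()).add(src)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen

# Per the description, every component should be reachable from the FPGA,
# i.e. the control system forms one connected unit.
components = reachable_from("speech_synthesis_fpga", CONNECTIONS)
```

Walking the graph from the FPGA reaches all ten components, which matches the description's claim that the FPGA and single chip microcomputer together tie the whole device into one control system.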
FIG. 2 is a schematic diagram of the control system inside the device. The Chinese-English bilingual cross-language emotion voice synthesis device is designed for two groups of users:
The first group: people who speak only one of Chinese and English. When two people who speak different languages communicate by means of the invention, the original voice signal of the other party's speech is transmitted through the microphone 4 and the voice recognition chip 3 to the FPGA 6 (the voice recognition chip can be purchased off the shelf, or a Chinese-English voice recognition chip can be custom-ordered from a chip manufacturer). The text of the recognized speech is passed through the single chip microcomputer 2 to the LCD display 12, where it is displayed; at the same time, the single chip microcomputer 2 connects through the wireless transceiver 1 to a wireless network (via a mobile phone, a vehicle-mounted computer, WiFi, and the like) to translate the text between Chinese and English, and the translated text is likewise shown on the LCD display 12. The left and right keys of the switch 7 select whether the text to be synthesized is Chinese or English; the keyboard of the text input device 11 can make the same selection, and the text input device 11 can also be set to translate the text to be synthesized automatically without manual control. The text to be synthesized is transmitted through the single chip microcomputer to the FPGA 6, in which a speech synthesis system is loaded (for example, HTS, the hidden-Markov-model-based synthesis system). The synthesis system calls the corpora stored on the SD card 5 (a neutral English corpus, a neutral Chinese corpus, and a Chinese emotion corpus covering 11 emotions) together with the original voice signal from the microphone 4, and obtains by adaptive training an acoustic model of target-language emotion speech similar to the speaker's style. Speech parameters are generated by the speech parameter generator 9 (for example, a STRAIGHT vocoder), the parameter signal is sent to the loudspeaker 10, and the synthesized target-language emotion speech, which the user can understand, is played. When the user replies, the voice signal is input through the microphone 4 and synthesized by the same steps into emotion speech the other party can understand.
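As a rough illustration of this conversation flow, the following Python sketch chains stand-in functions for the three stages (recognition, translation, emotion synthesis). All function names and the toy translation table are hypothetical; the real device uses a hardware recognition chip, a network translation service, and the HTS/STRAIGHT-based synthesis system described above.

```python
# Toy translation table standing in for the network translation service.
TRANSLATIONS = {("zh", "en", "你好"): "hello",
                ("en", "zh", "hello"): "你好"}

def recognize(audio, lang):
    """Stand-in for the voice recognition chip: 'audio' is already text here."""
    return audio

def translate(text, src, dst):
    """Stand-in for the translation step done over the wireless transceiver."""
    return TRANSLATIONS.get((src, dst, text), text)

def synthesize(text, lang, emotion):
    """Stand-in for the FPGA synthesis system: returns a description of the output."""
    return {"text": text, "lang": lang, "emotion": emotion}

def cross_language_reply(audio, src_lang, dst_lang, emotion="neutral"):
    """Microphone -> recognition -> translation -> emotion synthesis -> speaker."""
    recognized = recognize(audio, src_lang)                 # shown on the LCD
    translated = translate(recognized, src_lang, dst_lang)  # also shown on the LCD
    return synthesize(translated, dst_lang, emotion)        # played by the speaker

out = cross_language_reply("你好", "zh", "en", emotion="happy")
```

The point of the sketch is the ordering of the stages, which mirrors the data path microphone 4 / chip 3 / MCU 2 / FPGA 6 / loudspeaker 10 described above.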
The second group: people who are mute or whose speech is unclear (or who simply prefer typing). The user types text on the keyboard of the text input device 11. The text is displayed on the LCD display 12 via the single chip microcomputer 2; after a wireless network connection is made through the wireless transceiver, it is also translated into the other of the two languages and likewise displayed on the LCD display 12. The language to be synthesized is selected with the keys of the text input device 11 or the switch 7. The text signal of the selected language is then input through the single chip microcomputer into the FPGA 6; the FPGA-based Chinese-English bilingual cross-language emotion voice synthesis system (composed of the speech synthesis system loaded in the FPGA, the SD card 5, and the speech parameter generator 9) synthesizes the emotion speech parameter signal of the target language; and the signal is transmitted to the loudspeaker 10, which plays the synthesized target-language emotion speech.
FIG. 3 shows the flow by which the FPGA implements the Chinese-English bilingual cross-language emotion voice synthesis. First, according to the speaker's voice signal input through the microphone 4, the FPGA emotion voice synthesis system selects Chinese emotion speech material similar to the speaker's style from the 11-emotion Chinese emotion corpus on the SD card 5 and trains a Chinese speaker-dependent target-emotion average acoustic model. It then selects neutral speech material in the target language (Chinese or English) from the SD card according to the text to be synthesized input via the single chip microcomputer 2, and trains a speaker-dependent neutral average acoustic model of the target language for that text. Next, the target emotion of the Chinese speaker-dependent target-emotion average acoustic model is transplanted into the speaker-dependent neutral average acoustic model of the target language, yielding a speaker-dependent target-emotion average acoustic model of the target language for the text to be synthesized. Finally, a speech parameter signal is generated by the speech parameter generator 9 and transmitted to the loudspeaker 10, which plays the target-emotion speech in the target language.
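The emotion "transplantation" step can be illustrated numerically. In the toy sketch below, each "average acoustic model" is reduced to two scalar features, and the emotion is transplanted by carrying the offset between the Chinese emotional and neutral models over to the neutral model of the target language. This is only an assumption-laden caricature of the idea; real HMM-based adaptation operates on full distributions of spectral and prosodic parameters, not two scalar means.

```python
def transplant_emotion(zh_emotional, zh_neutral, target_neutral):
    """Apply the per-feature emotion offset (emotional - neutral) to the target model."""
    return {feat: target_neutral[feat] + (zh_emotional[feat] - zh_neutral[feat])
            for feat in target_neutral}

# Toy "average acoustic models": mean F0 (Hz) and mean speaking rate (syllables/s).
# The numbers are made up for illustration.
zh_happy   = {"f0": 260.0, "rate": 5.5}
zh_neutral = {"f0": 220.0, "rate": 4.5}
en_neutral = {"f0": 200.0, "rate": 4.0}

# "Happy" English model: English neutral base plus the Chinese happy-minus-neutral offset.
en_happy = transplant_emotion(zh_happy, zh_neutral, en_neutral)
```

The English model keeps its own baseline (the speaker-style component) while inheriting the emotional shift learned from the Chinese corpus, which is the essence of the cross-language transfer described above.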
The Chinese-English bilingual cross-language emotion voice synthesis device of the invention has the following features:
1. A 3.3 V/5 V DC battery inside the device supplies its power.
2. The armband 13 above and below the device body allows the device to be fixed on an arm; the device is light and easy to carry.
3. The wireless transceiver 1 can connect to mobile phones, vehicle-mounted computers, WiFi, and other wireless networks to perform Chinese-English text translation; once connected to the internet, the device can also be used to listen to music, broadcasts, and the like.
4. The device can translate between Chinese and English speech in real time, and the translated, synthesized speech carries the speaker's style and emotion; it therefore has high practical value and can greatly promote communication between Chinese and English speakers.
5. The device can help people who are mute or whose speech is unclear: its text-to-speech function lets them "speak", solving their speaking difficulties and their communication barriers with other people.
6. The device can be applied not only to communication between people but also to intelligent electronic products such as mobile phones, computers, and robots; connected by wire or wirelessly, or embedded, it can make such products more intelligent and human-friendly and thus promote the development of artificial intelligence.
The foregoing describes embodiments of the invention in some detail, but it should not be construed as limiting the scope of the invention. A person skilled in the art can make several variations and modifications without departing from the inventive concept, and these fall within the scope of the invention. The protection scope of this patent is therefore defined by the appended claims.

Claims (1)

1. A Chinese-English bilingual cross-language emotion voice synthesis device, characterized by comprising: a device body of cuboid shape; an armband connected between the upper and lower surfaces of the device body to fix it on an arm; an LCD display arranged on the upper part of the front surface of the device body; a text input device arranged on the lower part of the front surface; a switch arranged on the right side of the front surface between the LCD display and the text input device and divided into a left key, a middle key, and a right key, the left key controlling Chinese emotion voice synthesis, the middle key being the power switch key, and the right key controlling English emotion voice synthesis; a loudspeaker arranged on the right side of the upper surface; a wireless transceiver arranged on the upper part of the right surface for wireless connection to devices such as mobile phones and vehicle-mounted computers to extend the device to further cross-language emotion voice synthesis applications; an earphone hole formed in the right side of the device body, with a microphone below it, the microphone and the earphone hole forming one integrated element; a power supply arranged at the lower rear of the device body; and a voice synthesis FPGA arranged inside the device body, the voice synthesis FPGA being connected respectively to an SD card, the switch, the loudspeaker, a voice recognition chip, and a single chip microcomputer, the single chip microcomputer being connected to the voice recognition chip, the wireless transceiver, and the LCD display, the microphone being connected to the voice recognition chip, and the power supply being connected to the switch.
Application CN201911253410.6A, filed 2019-12-09 (priority date 2019-12-09): Chinese-English bilingual cross-language emotion voice synthesis device. Published as CN113035239A. Status: Pending.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911253410.6A CN113035239A (en) 2019-12-09 2019-12-09 Chinese-English bilingual cross-language emotion voice synthesis device


Publications (1)

Publication Number Publication Date
CN113035239A (en) 2021-06-25

Family

ID=76451104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911253410.6A Pending CN113035239A (en) 2019-12-09 2019-12-09 Chinese-English bilingual cross-language emotion voice synthesis device

Country Status (1)

Country Link
CN (1) CN113035239A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09292971A (en) * 1996-04-26 1997-11-11 Sony Corp Translation device
JP2007148039A (en) * 2005-11-28 2007-06-14 Matsushita Electric Ind Co Ltd Speech translation device and speech translation method
US20110165912A1 (en) * 2010-01-05 2011-07-07 Sony Ericsson Mobile Communications Ab Personalized text-to-speech synthesis and personalized speech feature extraction
CN106951417A (en) * 2017-05-05 2017-07-14 李宗展 Recognize the multi-lingual inter-translation method and portable machine of Chinese dialects
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN206649899U (en) * 2016-10-25 2017-11-17 北京分音塔科技有限公司 A kind of communicator for realizing real-time voice intertranslation
CN110149805A (en) * 2017-12-06 2019-08-20 创次源股份有限公司 Double-directional speech translation system, double-directional speech interpretation method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination