CN113707124A - Linkage broadcasting method and device of voice operation, electronic equipment and storage medium - Google Patents

Linkage broadcasting method and device of voice operation, electronic equipment and storage medium

Info

Publication number
CN113707124A
CN113707124A (application CN202111001106.XA)
Authority
CN
China
Prior art keywords
speech
pronunciation
voice
mouth shape
final
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111001106.XA
Other languages
Chinese (zh)
Inventor
张殷豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Bank Co Ltd filed Critical Ping An Bank Co Ltd
Priority to CN202111001106.XA priority Critical patent/CN113707124A/en
Publication of CN113707124A publication Critical patent/CN113707124A/en
Pending legal-status Critical Current



Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Abstract

The invention relates to the field of artificial intelligence and discloses a linkage broadcasting method for script speech, comprising the following steps: acquiring the script speech to be broadcast and recognizing the script text of the script speech; extracting the syllable of each character in the script text and identifying the initial and the final within each syllable; analyzing the mouth-shape positions of the initial and the final, and determining pronunciation pictures of the initial and the final according to the mouth-shape positions; combining the pronunciation pictures according to the positions of the initial and the final within the syllable to obtain a short pronunciation video of the corresponding character; and synthesizing the short pronunciation videos according to the position of each character in the script text to generate a broadcast mouth shape for the script speech, then broadcasting the script speech in accordance with the broadcast mouth shape. The invention also relates to blockchain technology: the pronunciation pictures may be stored on a blockchain. The method and device keep the mouth shape consistent with the characters while the script speech is being broadcast, improving the user experience.

Description

Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a linkage broadcasting method and device for script speech, an electronic device, and a computer-readable storage medium.
Background
An intelligent video interview is a process in which a service system provides customers with an interactive question-and-answer interface and confirms that they understand the important information and risks of the products they are purchasing. Intelligent video interviews reduce the volume of business handled offline, so that more and more business handling becomes paperless and online. During an intelligent video session, to make the user experience more lifelike, an animated robot can be displayed on the interface: the robot's mouth moves while speech is being broadcast, and closes when the broadcast pauses to wait for the customer's answer.
At present, script broadcasting usually provides only two mouth-shape pictures: a dynamic picture shown while the robot is speaking, and a static picture shown while it is not. This approach satisfies some broadcasting needs, but it cannot match the mouth shape to the script, so the mouth movements linking one character to the next appear unsmooth during voice broadcast, which in turn degrades the user experience during the intelligent video interview.
Disclosure of Invention
The invention provides a linkage broadcasting method and device for script speech, an electronic device, and a computer-readable storage medium, with the main aim of keeping the mouth shape consistent with the characters while the script speech is being broadcast and improving the user experience.
To achieve the above object, the invention provides a linkage broadcasting method for script speech, comprising:
acquiring the script speech to be broadcast, and recognizing the script text of the script speech;
extracting the syllable of each character in the script text, and identifying the initial and the final within each syllable;
analyzing the mouth-shape positions of the initial and the final, and determining pronunciation pictures of the initial and the final according to the mouth-shape positions;
combining the pronunciation pictures according to the positions of the initial and the final within the syllable, to obtain a short pronunciation video of the character corresponding to the initial and the final;
synthesizing the short pronunciation videos according to the position of each character in the script text to generate a broadcast mouth shape for the script speech, and broadcasting the script speech in accordance with the broadcast mouth shape.
Optionally, recognizing the script text of the script speech includes:
feature-encoding the script speech with an encoder in a speech recognition model to obtain feature-encoded speech;
decoding the feature-encoded speech into a character sequence with a decoder in the speech recognition model to obtain a feature character sequence;
and extracting the text information of the feature character sequence to obtain the script text of the script speech.
Optionally, feature-encoding the script speech with an encoder in a speech recognition model to obtain feature-encoded speech includes:
calculating weight values for the Mel cepstral coefficients of the script speech using a self-attention module in the encoder;
updating the weight information of the Mel cepstral coefficients of the script speech according to the weight values;
and activating the weight-updated script speech with a feed-forward neural network in the encoder to obtain the feature-encoded speech.
Optionally, decoding the feature-encoded speech into a character sequence with a decoder in the speech recognition model to obtain a feature character sequence includes:
masking the character information of the feature-encoded speech with a mask layer in the decoder to obtain feature character information;
calculating the character sequence of the feature character information with an attention module in the decoder;
and outputting the character sequence through a fully connected neural network in the decoder to obtain the feature character sequence.
Optionally, analyzing the mouth-shape positions of the initial and the final includes:
identifying the mouth open/close type of the initial and of the final, and calculating the open/close dimension of each according to its open/close type, so as to determine the mouth-shape positions of the initial and the final.
Optionally, combining the pronunciation pictures according to the positions of the initial and the final within the syllable to obtain a short pronunciation video of the corresponding character includes:
obtaining the sequence position of each pronunciation picture within the syllable, and synthesizing the pronunciation pictures into video in order of sequence position to form the short pronunciation video of the character corresponding to the initial and the final.
Optionally, after synthesizing the short pronunciation videos, the method further includes: setting the broadcasting speed of the script so that the script speech sounds natural to the user during broadcast.
To solve the above problems, the invention also provides a linkage broadcasting device for script speech, the device comprising:
a script text recognition module, configured to acquire the script speech to be broadcast and recognize the script text of the script speech;
an initial-and-final identification module, configured to extract the syllable of each character in the script text and identify the initial and the final within each syllable;
a pronunciation picture determination module, configured to analyze the mouth-shape positions of the initial and the final and determine pronunciation pictures of the initial and the final according to the mouth-shape positions;
a short pronunciation video generation module, configured to combine the pronunciation pictures according to the positions of the initial and the final within the syllable to obtain a short pronunciation video of the corresponding character;
and a script speech broadcasting module, configured to synthesize the short pronunciation videos according to the position of each character in the script text to generate a broadcast mouth shape for the script speech, and to broadcast the script speech in accordance with the broadcast mouth shape.
To solve the above problems, the invention also provides an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to implement the linkage broadcasting method for script speech described above.
To solve the above problems, the invention also provides a computer-readable storage medium storing at least one computer program, the at least one computer program being executed by a processor in an electronic device to implement the linkage broadcasting method for script speech described above.
By recognizing the script text of the script speech, and the initials and finals contained in its characters, the embodiment determines the compositional structure of each character in the script speech. By analyzing the mouth-shape positions of each initial and final, it determines their pronunciation pictures and generates a short pronunciation video for the corresponding character, which guarantees an accurate pronunciation mouth shape and keeps the mouth shape consistent with each character as it is pronounced. The embodiment then synthesizes the short pronunciation videos according to the position of each character in the script text to generate the broadcast mouth shape of the script speech, so that the mouth movements linking one character to the next are smooth during broadcast and the user experience of the intelligent video interview improves. The linkage broadcasting method, device, electronic device, and storage medium for script speech therefore keep the mouth shape consistent with the characters during broadcast and improve the user experience.
Drawings
Fig. 1 is a schematic flow chart of a linkage broadcasting method for script speech according to an embodiment of the invention;
fig. 2 is a schematic block diagram of a linkage broadcasting device for script speech according to an embodiment of the invention;
fig. 3 is a schematic diagram of the internal structure of an electronic device implementing the linkage broadcasting method for script speech according to an embodiment of the invention;
the implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a linkage broadcasting method for script speech. The execution body of the method includes, but is not limited to, at least one electronic device, such as a server or a terminal, that can be configured to execute the method provided by the embodiment. In other words, the method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes, but is not limited to: a single server, a server cluster, a cloud server, a cloud server cluster, and the like. The server may be an independent server, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms.
Fig. 1 is a schematic flow chart of a linkage broadcasting method for script speech according to an embodiment of the invention. In this embodiment, the method includes:
S1, acquiring the script speech to be broadcast, and recognizing the script text of the script speech.
In this embodiment, script speech is the speech an intelligent robot plays to a user during human-machine interaction, for example the verification questions the robot asks in the video-signing scene of an intelligent video interview; the script text is the textual form of that speech.
It should be noted that before feature-encoding the script speech with the encoder in the speech recognition model, the embodiment also extracts the Mel-scale Frequency Cepstral Coefficients (MFCCs) of the speech as the input features of the subsequent speech recognition model, which helps guarantee recognition accuracy.
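As a hedged illustration (the patent names MFCC features but no particular implementation), the extraction step can be sketched with NumPy alone; the frame sizes, filter counts, and the synthetic test signal below are all assumptions:

```python
import numpy as np

def mfcc(signal, sr=16000, n_fft=512, hop=160, n_mels=26, n_mfcc=13):
    """Compute a simplified MFCC matrix of shape (frames, coefficients)."""
    # 1. Slice the waveform into overlapping frames and apply a Hamming window.
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hamming(n_fft))
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft  # power spectrum

    # 2. Build a triangular mel filterbank between 0 Hz and sr/2.
    hz_to_mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel_to_hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    # 3. Log mel energies followed by a DCT give the cepstral coefficients.
    log_mel = np.log(power @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1) / (2 * n_mels)))
    return log_mel @ dct.T

# Example input: 0.1 s of a 440 Hz tone (a stand-in for real script speech).
t = np.arange(0, 0.1, 1 / 16000)
coeffs = mfcc(np.sin(2 * np.pi * 440 * t))
```

In practice a library routine (e.g. from librosa or python_speech_features) would replace this sketch, but the framing, mel filterbank, and DCT stages are the same.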
In one embodiment, recognizing the script text of the script speech includes: feature-encoding the script speech with an encoder in a speech recognition model to obtain feature-encoded speech; decoding the feature-encoded speech into a character sequence with a decoder in the speech recognition model to obtain a feature character sequence; and extracting the text information of the feature character sequence to obtain the script text of the script speech.
Further, in an optional embodiment, feature-encoding the script speech with the encoder to obtain feature-encoded speech includes: calculating weight values for the Mel cepstral coefficients of the script speech using a self-attention module in the encoder, updating the weight information of the Mel cepstral coefficients according to the weight values, and activating the weight-updated script speech with a feed-forward neural network in the encoder to obtain the feature-encoded speech.
The self-attention module identifies the correlation between each Mel cepstral coefficient and the others in the script speech and updates the corresponding weight information, so that each Mel cepstral coefficient carries contextual speech feature information; the feed-forward neural network activates the weight-updated script speech to pass the data onward.
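The weight-value calculation described above can be sketched as plain scaled dot-product self-attention over a matrix of MFCC frames; the shapes, random projections, and single-head form below are illustrative assumptions, not the patent's actual model:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Re-weight each frame's features by its similarity to every other
    frame, so each MFCC vector absorbs contextual information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # frame-to-frame similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: rows sum to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
frames = rng.standard_normal((7, 13))                # 7 MFCC frames, 13 coeffs
w = [rng.standard_normal((13, 13)) * 0.1 for _ in range(3)]
out, attn = self_attention(frames, *w)               # out: context-mixed frames
```

Each row of `attn` is the weight distribution the text calls the "weight values" of one coefficient frame.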
Further, in an optional embodiment, decoding the feature-encoded speech into a character sequence with the decoder to obtain a feature character sequence includes: masking the character information of the feature-encoded speech with a mask layer in the decoder to obtain feature character information; calculating the character sequence of the feature character information with an attention module in the decoder; and outputting the character sequence through a fully connected neural network in the decoder to obtain the feature character sequence.
The character information mask aligns the sequence lengths of the context vectors passed in by the encoder and masks their information; the character sequence calculation for the feature character information follows the same principle as the weight value calculation, and the initial feature character sequence is output through the activation function of the fully connected neural network.
Further, in an optional embodiment, the text information extraction from the feature character sequence may be implemented with a beam search algorithm.
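A minimal sketch of beam search over toy per-step token distributions (the tokens and probabilities are invented for illustration; a real decoder would supply its own output distributions):

```python
import math

def beam_search(step_log_probs, beam_width=2):
    """Keep only the `beam_width` best partial sequences at each step.

    `step_log_probs` is a list of dicts mapping a candidate token to its
    log-probability at that step."""
    beams = [([], 0.0)]  # (token sequence, cumulative log-probability)
    for dist in step_log_probs:
        candidates = [
            (seq + [tok], score + lp)
            for seq, score in beams
            for tok, lp in dist.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the best beam_width
    return beams[0][0]                   # best-scoring full sequence

# Toy three-step distribution over two tokens each.
steps = [
    {"ni": math.log(0.6), "li": math.log(0.4)},
    {"hao": math.log(0.7), "bao": math.log(0.3)},
    {"ma": math.log(0.9), "me": math.log(0.1)},
]
best = beam_search(steps)
```

Unlike greedy decoding, the beam keeps runner-up hypotheses alive in case a later step makes them the better overall path.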
S2, extracting the syllable of each character in the script text, and identifying the initial and the final within each syllable.
It should be understood that characters are composed of different syllables; for example, the character meaning "fuse" has the syllable "rong", spelled "r, o, n, g". By extracting the syllable of each character in the script text, the embodiment determines the compositional structure of each character, a precondition for recognizing each character's pronunciation mouth shape later. Further, by identifying the initial and the final within the syllable, the embodiment splits the syllable into its component attributes, guaranteeing the accuracy of the mouth shape corresponding to the syllable. The initial is the consonant before the final and forms a complete syllable together with it, such as b, p, m, f, d, t; the final usually consists of a head, a belly, and a tail, such as a, o, ai, ei, ui, ao. For example, in the syllable [guan], [g] is the initial and [uan] is the final, in which [u] is the head, [a] the belly, and [n] the tail of the final.
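The initial-and-final identification can be sketched as a longest-prefix match against the pinyin initial inventory; treating `y`/`w` as initials is a simplifying assumption, and the function below is illustrative rather than the patent's actual method:

```python
# Pinyin initials, multi-letter ones first so "zh"/"ch"/"sh" match before "z"/"c"/"s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(syllable):
    """Split a pinyin syllable into (initial, final).

    Syllables such as "ai" or "er" have no initial, so the initial
    slot comes back empty."""
    for ini in INITIALS:
        if syllable.startswith(ini):
            return ini, syllable[len(ini):]
    return "", syllable
```

For the [guan] example in the text, `split_syllable("guan")` yields the initial `g` and the final `uan`.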
S3, analyzing the mouth-shape positions of the initial and the final, and determining pronunciation pictures of the initial and the final according to the mouth-shape positions.
In this embodiment, the mouth-shape position comprises the degree of vertical opening and the degree of horizontal stretching. Different initials and finals open and stretch the mouth to different degrees, so the mouth-shape pronunciation pictures of the initials and finals are determined by analyzing their mouth-shape positions.
In one embodiment, analyzing the mouth-shape positions of the initial and the final includes: identifying the mouth open/close type of the initial and of the final, and calculating the open/close dimension of each according to its open/close type, so as to determine the mouth-shape positions of the initial and the final.
The mouth open/close types are open and closed, and the open/close dimension is the extreme the initial or final reaches within its type, such as maximum opening, minimum opening, maximum stretch, or minimum stretch.
It should be understood that once an initial or final has a definite mouth-shape position, the mouth that utters it has a definite mouth-shape picture; the embodiment therefore determines the pronunciation pictures of the initial and the final from their mouth-shape positions in order to generate the pronunciation mouth shape of the corresponding character.
Further, determining the pronunciation picture of an initial or final from its mouth-shape position means taking the mouth-shape picture at that position as its pronunciation picture; for example, if the mouth-shape position of an initial is the position of maximum opening, the mouth-shape picture at the position of maximum opening is taken as that initial's pronunciation picture.
Furthermore, to ensure the security and privacy of the pronunciation pictures, they may also be stored in a blockchain node.
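Determining a pronunciation picture from a mouth-shape position then reduces to a table lookup; the table keys and file names below are hypothetical placeholders, not assets defined by the patent:

```python
# Hypothetical lookup: each (open/close type, extreme dimension) pair is
# assumed to name one pre-rendered mouth-shape image on disk.
MOUTH_PICTURES = {
    ("open", "max_aperture"): "mouth_open_wide.png",
    ("open", "min_aperture"): "mouth_open_narrow.png",
    ("closed", "max_stretch"): "mouth_spread.png",
    ("closed", "min_stretch"): "mouth_rounded.png",
}

def pronunciation_picture(open_close_type, dimension):
    """Map a phoneme's analysed mouth-shape position to its picture."""
    return MOUTH_PICTURES[(open_close_type, dimension)]

# e.g. a phoneme whose mouth-shape position is the maximum opening
picture = pronunciation_picture("open", "max_aperture")
```

A production system would key this table per initial and per final; the point is only that the analysis step ends in a deterministic position-to-picture mapping.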
S4, combining the pronunciation pictures according to the positions of the initial and the final within the syllable, to obtain a short pronunciation video of the character corresponding to the initial and the final.
In this embodiment, this step includes: obtaining the sequence position of each pronunciation picture within the syllable, and synthesizing the pronunciation pictures into video in order of sequence position to form the short pronunciation video of the character corresponding to the initial and the final.
In an alternative embodiment, the video synthesis of the pronunciation pictures can be performed with an ffmpeg image-sequence command of the form `ffmpeg -f image2 -i image%d.jpg video.mp4` (the output name and format depend on the deployment).
For example, for the syllable [guan] with initial [g] and final [uan], the pronunciation pictures comprise the initial picture [g], the final-head picture [u], the final-belly picture [a], and the final-tail picture [n], whose sequence positions in the syllable [guan] are 1, 2, 3, and 4 respectively; a short pronunciation video can therefore be obtained by combining the pictures [g], [u], [a], and [n] in that order.
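The [guan] example can be sketched as follows; the `.png` naming is a hypothetical convention:

```python
def syllable_picture_sequence(initial, final):
    """Order the per-phoneme pictures by their position in the syllable:
    the initial first, then the final spelled out letter by letter (head,
    belly, tail), matching the [g][u][a][n] example in the text."""
    sequence = []
    if initial:
        sequence.append(initial)
    sequence.extend(final)  # iterate the final letter by letter
    return [(pos + 1, f"{letter}.png") for pos, letter in enumerate(sequence)]

frames = syllable_picture_sequence("g", "uan")
```

The returned (sequence position, picture) pairs are exactly the ordering the video-synthesis step consumes.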
It should be noted that after obtaining the short pronunciation video of the character corresponding to the initial and the final, the embodiment also fixes the playing time of the short pronunciation video to a preset duration; the preset duration can be set to 0.25 s, or according to the actual business scenario.
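Fixing each character's video to the 0.25 s budget can be done by choosing the image-sequence frame rate so that all of the character's pictures fit inside that duration; the sketch below only builds the ffmpeg argument list (the file names and output format are assumptions) rather than running it:

```python
def ffmpeg_short_video_cmd(picture_count, duration=0.25,
                           pattern="image%d.jpg", out="word.mp4"):
    """Build (but do not run) an ffmpeg command that plays `picture_count`
    mouth-shape pictures evenly within a fixed `duration` in seconds."""
    framerate = picture_count / duration  # pictures shown per second
    return ["ffmpeg", "-f", "image2", "-framerate", f"{framerate:g}",
            "-i", pattern, "-t", f"{duration:g}", out]

cmd = ffmpeg_short_video_cmd(4)  # the four pictures of the syllable [guan]
```

With four pictures in 0.25 s this asks for 16 frames per second; the command would then be executed via `subprocess.run(cmd)` in a real pipeline.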
S5, synthesizing the short pronunciation videos according to the position of each character in the script text to generate a broadcast mouth shape for the script speech, and broadcasting the script speech in accordance with the broadcast mouth shape.
Synthesizing the short pronunciation videos according to the position of each character in the script text generates the broadcast mouth shape of the script speech and keeps it consistent with the characters during broadcast, so that the characters of the script speech join naturally with the mouth shapes and the intelligent robot appears to speak fluently and naturally.
Further, after synthesizing the short pronunciation videos, the embodiment also sets the broadcasting speed of the script, which can be set through a player, so that the script speech sounds natural to the user during broadcast.
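One conventional way to stitch the per-character short videos together in script order is ffmpeg's concat demuxer, which reads a plain list of `file` lines; the sketch below generates that list (the file names are hypothetical):

```python
def concat_list(video_paths):
    """Write the ffmpeg concat-demuxer playlist that stitches the
    per-character short videos together in script order, one
    `file '<path>'` line per video."""
    return "".join(f"file '{p}'\n" for p in video_paths)

# Three per-character videos in the order they occur in the script text.
playlist = concat_list(["ni.mp4", "hao.mp4", "ma.mp4"])
```

Saving the playlist as `list.txt` and running `ffmpeg -f concat -i list.txt -c copy utterance.mp4` would then join the clips without re-encoding, assuming all clips share a codec and resolution.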
By recognizing the script text of the script speech, and the initials and finals contained in its characters, the embodiment determines the compositional structure of each character in the script speech. By analyzing the mouth-shape positions of each initial and final, it determines their pronunciation pictures and generates a short pronunciation video for the corresponding character, which guarantees an accurate pronunciation mouth shape and keeps the mouth shape consistent with each character as it is pronounced. The embodiment then synthesizes the short pronunciation videos according to the position of each character in the script text to generate the broadcast mouth shape of the script speech, so that the mouth movements linking one character to the next are smooth during broadcast and the user experience of the intelligent video interview improves. The linkage broadcasting method for script speech therefore keeps the mouth shape consistent with the characters during broadcast and improves the user experience.
Fig. 2 is a functional block diagram of the linkage broadcasting device for script speech according to the invention.
The linkage broadcasting device 100 for script speech may be installed in an electronic device. According to the functions it implements, the device may comprise a script text recognition module 101, an initial-and-final identification module 102, a pronunciation picture determination module 103, a short pronunciation video generation module 104, and a script speech broadcasting module 105. A module, which may also be called a unit, is a series of computer program segments stored in the memory of the electronic device that can be executed by its processor and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the speech and technology word recognition module 101 is configured to acquire a speech term sound to be broadcasted and recognize a speech and technology word of the speech term sound;
the initial and final identification module 102 is configured to extract syllables of each text in the dialect text, and identify an initial and a final in the syllables;
the pronunciation picture determination module 103 is configured to analyze mouth shape positions of the initial consonant and the final, and determine pronunciation pictures of the initial consonant and the final according to the mouth shape positions;
the pronunciation short video generation module 104 is configured to combine the pronunciation pictures according to the positions of the initials and the finals in the syllables to obtain pronunciation short videos of the characters corresponding to the initials and the finals;
the speech term sound broadcasting module 105 is configured to synthesize the pronunciation short videos according to the position of each text in the speech technology text to generate a broadcasting mouth shape of the speech technology sound, and execute broadcasting of the speech technology sound according to the broadcasting mouth shape.
In detail, the modules of the linkage broadcasting device 100 for script speech use the same technical means as the linkage broadcasting method described with reference to fig. 1 and produce the same technical effects, which are not repeated here.
Fig. 3 is a schematic structural diagram of an electronic device 1 implementing the linkage broadcasting method for script speech.
The electronic device 1 may include a processor 10, a memory 11, a communication bus 12, and a communication interface 13, and may further include a computer program, such as a linkage broadcasting program for script speech, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may consist of a single packaged integrated circuit or of multiple integrated circuits packaged for the same or different functions, and may include one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. As the control unit (Control Unit) of the electronic device 1, the processor 10 connects the components of the electronic device 1 through various interfaces and lines, and executes the functions and processes the data of the electronic device 1 by running or executing the programs or modules stored in the memory 11 (for example, the linkage broadcasting program for script speech) and calling the data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk of the electronic device 1. In other embodiments, the memory 11 may be an external storage device of the electronic device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the electronic device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only to store application software installed in the electronic device 1 and various types of data, such as the code of the linkage broadcast program for conversational speech, but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, and may be divided into an address bus, a data bus, a control bus, and the like. The bus is arranged to enable communication between the memory 11, the at least one processor 10, and other components.
The communication interface 13 is used for communication between the electronic device 1 and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., a Wi-Fi interface or a Bluetooth interface), which is generally used to establish a communication connection between the electronic device 1 and other electronic devices. The user interface may be a display or an input unit such as a keyboard, and may optionally be a standard wired or wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is used to display information processed in the electronic device 1 and to present a visualized user interface.
Fig. 3 shows only the electronic device 1 with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component. Preferably, the power supply is logically connected to the at least one processor 10 through a power management device, so that charge management, discharge management, power consumption management, and the like are implemented through the power management device. The power supply may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and other components. The electronic device 1 may further include various sensors, a Bluetooth module, a Wi-Fi module, and the like, which are not described here again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The linkage broadcast program for conversational speech stored in the memory 11 of the electronic device 1 is a combination of a plurality of computer programs which, when run on the processor 10, can realize:
acquiring conversational speech to be broadcast, and recognizing the script text of the conversational speech;
extracting the syllable of each character in the script text, and identifying the initial consonant and the final in the syllable;
analyzing the mouth shape positions of the initial and the final respectively, and determining pronunciation pictures of the initial and the final according to the mouth shape positions;
combining the pronunciation pictures according to the positions of the initial and the final in the syllable to obtain a pronunciation short video of the character corresponding to the initial and the final;
synthesizing the pronunciation short videos according to the position of each character in the script text to generate a broadcast mouth shape for the conversational speech, and broadcasting the conversational speech according to the broadcast mouth shape.
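As an illustration of the steps above, the following Python sketch splits each character's (toneless) pinyin syllable into an initial and a final and concatenates the corresponding mouth-shape pictures into an ordered frame list standing in for the "pronunciation short video". The initials table follows the standard Hanyu Pinyin scheme (with y/w treated as initials for simplicity); the picture-naming scheme and helper names are hypothetical, not taken from the patent.

```python
# Standard Mandarin initials, sorted longest first so "zh" matches before "z".
INITIALS = sorted(
    ["b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h", "j", "q", "x",
     "zh", "ch", "sh", "r", "z", "c", "s", "y", "w"],
    key=len, reverse=True,
)


def split_syllable(pinyin: str):
    """Split a toneless pinyin syllable into (initial, final)."""
    for ini in INITIALS:
        if pinyin.startswith(ini):
            return ini, pinyin[len(ini):]
    return "", pinyin  # zero-initial syllable, e.g. "an"


def viseme_frames(unit: str):
    """Hypothetical mouth-shape picture lookup: one frame name per unit."""
    return [f"mouth_{unit}.png"] if unit else []


def pronunciation_video(pinyins):
    """Combine per-unit pictures in syllable order, then in character order."""
    frames = []
    for syl in pinyins:
        ini, fin = split_syllable(syl)
        frames.extend(viseme_frames(ini) + viseme_frames(fin))
    return frames


if __name__ == "__main__":
    # "zhong guo" -> zh + ong, g + uo
    print(pronunciation_video(["zhong", "guo"]))
    # ['mouth_zh.png', 'mouth_ong.png', 'mouth_g.png', 'mouth_uo.png']
```

In a real system each unit would map to several keyframes of a mouth model rather than a single picture, but the ordering logic (initial before final, characters in text order) is the same.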
Specifically, for the implementation of the computer program by the processor 10, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here.
Further, if the integrated modules/units of the electronic device 1 are implemented in the form of software functional units and sold or used as separate products, they may be stored in a computer-readable storage medium. The computer-readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, or a read-only memory (ROM).
The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor of an electronic device, can implement:
acquiring conversational speech to be broadcast, and recognizing the script text of the conversational speech;
extracting the syllable of each character in the script text, and identifying the initial consonant and the final in the syllable;
analyzing the mouth shape positions of the initial and the final respectively, and determining pronunciation pictures of the initial and the final according to the mouth shape positions;
combining the pronunciation pictures according to the positions of the initial and the final in the syllable to obtain a pronunciation short video of the character corresponding to the initial and the final;
synthesizing the pronunciation short videos according to the position of each character in the script text to generate a broadcast mouth shape for the conversational speech, and broadcasting the conversational speech according to the broadcast mouth shape.
In the embodiments provided by the present invention, it should be understood that the disclosed apparatus, device, and method can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into modules is only one kind of logical functional division, and other divisions are possible in actual implementation.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
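A minimal sketch of the hash-linked structure described above: each block stores a batch of records plus the hash of the previous block, so tampering with any block invalidates every later link. This is an illustrative toy, not part of the patent's method.

```python
import hashlib
import json


def make_block(records, prev_hash):
    """Build a block whose hash commits to its records and predecessor."""
    body = {"records": records, "prev": prev_hash}
    digest = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": digest}


def verify_chain(chain):
    """Recompute every hash and check each back-link to the previous block."""
    for i, block in enumerate(chain):
        body = {"records": block["records"], "prev": block["prev"]}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev"] != chain[i - 1]["hash"]:
            return False
    return True


if __name__ == "__main__":
    genesis = make_block(["tx0"], "0" * 64)
    chain = [genesis, make_block(["tx1", "tx2"], genesis["hash"])]
    print(verify_chain(chain))  # True
```

Altering any record changes that block's recomputed hash, so `verify_chain` rejects the whole chain from that point onward.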
The embodiments of the present application can acquire and process the relevant data based on artificial intelligence technology. Artificial intelligence (AI) covers theories, methods, techniques, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or devices recited in the system claims may also be implemented by one unit or device in software or hardware. Terms such as first and second are used to denote names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope.

Claims (10)

1. A linkage broadcast method of conversational speech, characterized by comprising:
acquiring conversational speech to be broadcast, and recognizing the script text of the conversational speech;
extracting the syllable of each character in the script text, and identifying the initial consonant and the final in the syllable;
analyzing the mouth shape positions of the initial and the final respectively, and determining pronunciation pictures of the initial and the final according to the mouth shape positions;
combining the pronunciation pictures according to the positions of the initial and the final in the syllable to obtain a pronunciation short video of the character corresponding to the initial and the final;
synthesizing the pronunciation short videos according to the position of each character in the script text to generate a broadcast mouth shape for the conversational speech, and broadcasting the conversational speech according to the broadcast mouth shape.
2. The linkage broadcast method of conversational speech according to claim 1, wherein recognizing the script text of the conversational speech comprises:
performing feature coding on the conversational speech by using an encoder of a speech recognition model to obtain feature-coded speech;
performing character-sequence decoding on the feature-coded speech by using a decoder of the speech recognition model to obtain a feature character sequence;
extracting the text information of the feature character sequence to obtain the script text of the conversational speech.
3. The linkage broadcast method of conversational speech according to claim 2, wherein performing feature coding on the conversational speech by using an encoder of a speech recognition model to obtain feature-coded speech comprises:
calculating weight values of the Mel-frequency cepstral coefficients of the conversational speech by using a self-attention module in the encoder;
updating the weight information of the Mel-frequency cepstral coefficients of the conversational speech according to the weight values;
activating the conversational speech with the updated weight information by using a feedforward neural network in the encoder to obtain the feature-coded speech.
4. The linkage broadcast method of conversational speech according to claim 2, wherein performing character-sequence decoding on the feature-coded speech by using a decoder of the speech recognition model to obtain a feature character sequence comprises:
masking the character information of the feature-coded speech by using a mask layer in the decoder to obtain feature character information;
calculating a character sequence of the feature character information by using an attention module in the decoder;
outputting the character sequence by using a fully connected neural network in the decoder to obtain the feature character sequence.
5. The linkage broadcast method of conversational speech according to claim 1, wherein analyzing the mouth shape positions of the initial and the final respectively comprises:
identifying the mouth-shape opening and closing types of the initial and the final respectively, and calculating the opening and closing dimensions of the initial and the final according to the mouth-shape opening and closing types, so as to determine the mouth shape positions of the initial and the final.
6. The linkage broadcast method of conversational speech according to any one of claims 1 to 5, wherein combining the pronunciation pictures according to the positions of the initial and the final in the syllable to obtain a pronunciation short video of the character corresponding to the initial and the final comprises:
acquiring the sequential positions of the pronunciation pictures in the syllable, and synthesizing the pronunciation pictures into video in that order to form the pronunciation short video of the character corresponding to the initial and the final.
7. The linkage broadcast method of conversational speech according to claim 1, further comprising, after synthesizing the pronunciation short videos:
setting a broadcast speed of the conversational speech so that the conversational speech achieves the desired auditory effect for the user during broadcasting.
8. A linkage broadcast device for conversational speech, the device comprising:
a script text recognition module, configured to acquire conversational speech to be broadcast and recognize the script text of the conversational speech;
an initial-and-final identification module, configured to extract the syllable of each character in the script text and identify the initial consonant and the final in the syllable;
a pronunciation picture determining module, configured to analyze the mouth shape positions of the initial and the final respectively, and to determine pronunciation pictures of the initial and the final according to the mouth shape positions;
a pronunciation short video generation module, configured to combine the pronunciation pictures according to the positions of the initial and the final in the syllable to obtain a pronunciation short video of the character corresponding to the initial and the final;
a conversational-speech broadcasting module, configured to synthesize the pronunciation short videos according to the position of each character in the script text to generate a broadcast mouth shape for the conversational speech, and to broadcast the conversational speech according to the broadcast mouth shape.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, and the computer program is executed by the at least one processor to enable the at least one processor to perform the linkage broadcast method of conversational speech according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the linkage broadcast method of conversational speech according to any one of claims 1 to 7.
CN202111001106.XA 2021-08-30 2021-08-30 Linkage broadcasting method and device of voice operation, electronic equipment and storage medium Pending CN113707124A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111001106.XA CN113707124A (en) 2021-08-30 2021-08-30 Linkage broadcasting method and device of voice operation, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113707124A true CN113707124A (en) 2021-11-26

Family

ID=78656493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111001106.XA Pending CN113707124A (en) 2021-08-30 2021-08-30 Linkage broadcasting method and device of voice operation, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113707124A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482975A (en) * 2008-01-07 2009-07-15 丰达软件(苏州)有限公司 Method and apparatus for converting words into animation
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth-shape animation synthesis apparatus and method, and readable storage medium
US20200135171A1 (en) * 2017-02-28 2020-04-30 National Institute Of Information And Communications Technology Training Apparatus, Speech Synthesis System, and Speech Synthesis Method
CN113112575A (en) * 2021-04-08 2021-07-13 深圳市山水原创动漫文化有限公司 Mouth shape generation method and device, computer equipment and storage medium

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114449313A (en) * 2022-02-10 2022-05-06 上海幻电信息科技有限公司 Method and device for adjusting playing speed of sound and picture of video
WO2023151424A1 (en) * 2022-02-10 2023-08-17 上海幻电信息科技有限公司 Method and apparatus for adjusting playback rate of audio picture of video
CN114449313B (en) * 2022-02-10 2024-03-26 上海幻电信息科技有限公司 Method and device for adjusting audio and video playing rate of video
CN114677634A (en) * 2022-05-30 2022-06-28 成都新希望金融信息有限公司 Surface label identification method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111292720B (en) Speech synthesis method, device, computer readable medium and electronic equipment
CN112086086A (en) Speech synthesis method, device, equipment and computer readable storage medium
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN109817244B (en) Spoken language evaluation method, device, equipment and storage medium
CN112397047A (en) Speech synthesis method, device, electronic equipment and readable storage medium
CN112765971B (en) Text-to-speech conversion method and device, electronic equipment and storage medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN112735371B (en) Method and device for generating speaker video based on text information
CN113707124A (en) Linkage broadcasting method and device of voice operation, electronic equipment and storage medium
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN113096242A (en) Virtual anchor generation method and device, electronic equipment and storage medium
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN115511704B (en) Virtual customer service generation method and device, electronic equipment and storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
Fellbaum et al. Principles of electronic speech processing with applications for people with disabilities
KR20190109651A (en) Voice imitation conversation service providing method and sytem based on artificial intelligence
CN114242093A (en) Voice tone conversion method and device, computer equipment and storage medium
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN113555003A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113808577A (en) Intelligent extraction method and device of voice abstract, electronic equipment and storage medium
CN112233648A (en) Data processing method, device, equipment and storage medium combining RPA and AI

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination