WO2021102647A1 - Data processing method, device, and storage medium

Data processing method, device, and storage medium

Info

Publication number
WO2021102647A1
WO2021102647A1 (PCT/CN2019/120706)
Authority
WO
WIPO (PCT)
Prior art keywords
target
user
tone color
data
timbre
Prior art date
Application number
PCT/CN2019/120706
Other languages
English (en)
French (fr)
Inventor
郝杰
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to PCT/CN2019/120706 priority Critical patent/WO2021102647A1/zh
Priority to CN201980100970.XA priority patent/CN114514576A/zh
Publication of WO2021102647A1 publication Critical patent/WO2021102647A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition

Definitions

  • This application relates to terminal technology, in particular to a data processing method, device and storage medium.
  • the embodiments of the present application provide a data processing method, device, and storage medium.
  • the embodiment of the present application provides a data processing method, including:
  • obtaining data to be processed; translating the first voice data in the data to be processed to obtain a translated text; performing image recognition on the first image data in the data to be processed to obtain a recognition result; determining a target tone color template using the translated text and/or the recognition result;
  • selecting the target tone color template from a tone color template database; using the selected target tone color template to convert the translated text into audio data matching the target tone color;
  • the audio data is output.
  • the embodiment of the present application also provides a data processing device, including:
  • the obtaining unit is configured to obtain the data to be processed
  • the first processing unit is configured to translate the first voice data in the to-be-processed data to obtain a translated text; perform image recognition on the first image data in the to-be-processed data to obtain a recognition result;
  • the second processing unit is configured to use the translated text and/or the recognition result to determine the target tone color template; select the target tone color template from the tone color template database; and use the selected target tone color template to convert the translated text into audio data matching the target timbre;
  • the output unit is configured to output the audio data.
  • the embodiment of the present application further provides a data processing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above methods when executing the program.
  • the embodiment of the present application also provides a storage medium on which computer instructions are stored, and when the instructions are executed by a processor, the steps of any one of the foregoing methods are implemented.
  • the data processing method, device, and storage medium provided by the embodiments of the application obtain data to be processed; translate the first voice data in the data to be processed to obtain a translated text; perform image recognition on the first image data in the data to be processed to obtain a recognition result; determine a target timbre template using the translated text and/or the recognition result; select the target timbre template from a timbre template database; use the selected target timbre template to convert the translated text into audio data matching the target timbre; and output the audio data. In this way, the target timbre is used to play the conversation content of the second user using the second terminal to the first user using the first terminal, prompting the first user to take a strong interest in the second user's conversation content, realizing voice-changing communication between the second user and the first user, and bringing the first user a more immersive interactive experience.
  • FIG. 1 is a schematic diagram of the implementation process of a data processing method according to an embodiment of the application
  • FIG. 2 is a schematic diagram of an implementation process of determining a target tone color template based on the translated text by the first terminal in an embodiment of the application;
  • FIG. 3 is a schematic diagram of an implementation process of a first terminal determining a target tone color template based on a recognition result of the first image data according to an embodiment of the application;
  • FIG. 4 is a schematic diagram of the implementation process of the first terminal determining the target tone color template based on the recognition result of the translated text and the first image data according to the embodiment of the application;
  • FIG. 5 is a first schematic diagram of the implementation process of the first terminal generating audio data matching the target timbre according to an embodiment of the application;
  • FIG. 6 is a second schematic diagram of the implementation process of the first terminal generating audio data matching the target timbre according to an embodiment of the application;
  • FIG. 7 is a third schematic diagram of the implementation process of the first terminal generating audio data matching the target timbre according to an embodiment of the application;
  • FIG. 8 is a fourth schematic diagram of the implementation process of the first terminal generating audio data matching the target timbre according to an embodiment of the application;
  • FIG. 9 is a fifth schematic diagram of the implementation process of the first terminal generating audio data matching the target timbre according to an embodiment of the application.
  • FIG. 10a is a schematic diagram of an implementation process of the first terminal playing the conversation content of the second user through the target timbre according to the embodiment of the application;
  • FIG. 10b is a schematic diagram of another implementation process for the first terminal to play the conversation content of the second user through the target timbre according to the embodiment of this application;
  • FIG. 11 is a schematic diagram of interactive communication between a first user and a second user according to an embodiment of the application
  • FIG. 12 is a schematic diagram 1 of the structure of the data processing device according to an embodiment of the application.
  • FIG. 13 is a second schematic diagram of the structure of the data processing device according to an embodiment of the application.
  • With the rapid development of culture, performance culture has become increasingly popular, showing a globalization trend and gradually entering public life in specific forms, such as anime character performances.
  • In such performances, the audience can only see the performer's figure but cannot hear the performer communicate with the audience in a personalized timbre, and thus cannot enjoy a more immersive interactive experience, resulting in a poor performance effect.
  • performers can wear doll-shaped costumes such as Mickey Mouse and Donald Duck for role performances, and interact with the audience watching the performance.
  • During a performance, a client can collect the performer's audio and send the collected audio to a server; the server recognizes the audio data to obtain recognized text, translates the recognized text to obtain a translation result, sends the translation result back to the client, and the speech is broadcast through an earphone device, realizing interaction between the performer and the audience watching the performance. However, the performer cannot communicate with the audience in a changed voice with a personalized timbre, such as the timbre of Mickey Mouse or Donald Duck or the timbre of a celebrity such as Andy Lau, and thus cannot bring the audience a more immersive interactive experience, resulting in a poor performance effect.
  • performers are even more unable to achieve cross-language communication with audiences of different mother tongues with personalized timbre.
  • Based on this, in various embodiments of the application, the first terminal obtains data to be processed; translates the first voice data in the data to be processed to obtain a translated text; performs image recognition on the first image data in the data to be processed to obtain a recognition result; determines a target tone color template using the translated text and/or the recognition result; selects the target tone color template from a tone color template database; uses the selected target tone color template to convert the translated text into audio data matching the target timbre (which may include online conversion and offline conversion); and outputs the audio data. In this way, the target timbre is used to play the conversation content of the second user using the second terminal to the first user using the first terminal, prompting the first user to take a strong interest in the second user's conversation content and realizing voice-changing communication between the second user and the first user.
  • FIG. 1 is a schematic diagram of the implementation flow of the data processing method according to the embodiment of the application; as shown in FIG. 1, the method includes:
  • Step 101 The first terminal obtains data to be processed.
  • the data to be processed includes: first voice data and first image data.
  • the first voice data includes voice data generated by the second user when the first user who uses the first terminal interacts with the second user who uses the second terminal.
  • the first image data includes image data of the clothing worn by the second user when the second user interacts with the first user.
  • the specific types of the first terminal and the second terminal are not limited in this application; for example, they may be smart phones, personal computers, notebook computers, tablet computers, and portable wearable devices.
  • the following describes how the first terminal obtains the to-be-processed data.
  • the second terminal may be provided with or connected to a voice collection module, such as a microphone, through which the voice of the second user is collected to obtain the first voice data;
  • the second terminal establishes communication with the first terminal, and transmits the collected first voice data to the first terminal through a wireless transmission module.
  • the wireless transmission module may be a Bluetooth module, a wireless fidelity (WiFi, Wireless Fidelity) module, or the like.
  • For example, in an anime performance scenario, when the second user interacts with the first user through role-playing, the second user initiates a dialogue about currently popular music and songs; the second terminal collects the second user's voice using the voice collection module to obtain the first voice data; the second terminal establishes communication with the first terminal and sends the first voice data to the first terminal through the wireless transmission module.
  • In a conference scenario with simultaneous interpretation, when the second user interacts with the first user, the second terminal collects the second user's voice using the voice collection module to obtain the first voice data; the second terminal establishes communication with the first terminal and sends the first voice data to the first terminal.
  • the second terminal may be provided with or connected to an image acquisition module, such as a camera, through which images of the clothing worn by the second user are collected to obtain the first image data; the second terminal establishes communication with the first terminal and transmits the collected first image data to the first terminal through a wireless transmission module.
  • For example, in an anime performance scenario, when the second user interacts with the first user through role-playing while wearing a Mickey Mouse costume, the second terminal collects images of the clothing worn by the second user using the image acquisition module to obtain the first image data; the second terminal establishes communication with the first terminal and sends the first image data to the first terminal through the wireless transmission module.
  • In a conference scenario with simultaneous interpretation, when the second user wears a crisp suit, the second terminal collects images of the clothing worn by the second user using the image acquisition module to obtain the first image data; the second terminal establishes communication with the first terminal and sends the first image data to the first terminal through the wireless transmission module.
  • Here, the second terminal sends the second user's first voice data to the first terminal, and the first terminal can subsequently translate the first voice data, thereby helping the first user understand the second user's conversation content in a language familiar to the first user and promoting smoother communication between the first user and the second user.
  • Here, the second terminal sends the second user's first voice data to the first terminal, and the first terminal can subsequently determine a target tone color template according to the second user's conversation content and play the second user's conversation content to the first user through the target timbre, thereby helping the first user deeply understand the second user's speech content.
  • Here, the second terminal sends the first image data of the clothing worn by the second user to the first terminal, and the first terminal can subsequently determine the target tone color template according to the clothing worn by the second user and play the second user's conversation content to the first user through the target timbre, thereby arousing the first user's interest in the second user's conversation content.
  • Step 102: The first terminal translates the first voice data in the to-be-processed data to obtain a translated text; performs image recognition on the first image data in the to-be-processed data to obtain a recognition result; and determines a target tone color template using the translated text and/or the recognition result.
  • the translating the first voice data in the to-be-processed data to obtain the translated text includes: performing speech recognition on the first voice data using a speech recognition technology to obtain recognized text; and translating the recognized text using a preset translation model to obtain the translated text.
  • the translation model is used to translate text in a first language into text in at least one second language; the first language is different from the second language.
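  • As an illustration, the two-stage pipeline above can be sketched in Python as follows; `recognize_speech` and `translate` are hypothetical placeholders standing in for whatever speech recognition technology and preset translation model an implementation actually uses.

```python
# Hedged sketch of the two-stage pipeline: speech recognition first,
# then text translation. The helpers are placeholders, not a real API.

def recognize_speech(first_voice_data: bytes) -> str:
    """Placeholder ASR step: returns the recognized text."""
    ...

def translate(recognized_text: str, source_lang: str, target_lang: str) -> str:
    """Placeholder translation model: first language -> second language."""
    ...

def voice_to_translated_text(first_voice_data: bytes,
                             source_lang: str, target_lang: str) -> str:
    # The first language must differ from the second language.
    assert source_lang != target_lang
    recognized = recognize_speech(first_voice_data)
    return translate(recognized, source_lang, target_lang)
```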
  • the performing image recognition on the first image data in the to-be-processed data to obtain the recognition result includes: performing image preprocessing on the first image data to obtain preprocessed first image data; extracting feature data from the preprocessed first image data; and performing image recognition on the extracted feature data using an image recognition technology to obtain the recognition result.
  • the image preprocessing of the first image data includes performing data enhancement and normalization on the first image data.
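  • As an illustration only, the preprocessing step might look like the sketch below; the horizontal flip used for data enhancement and the min-max normalization are common assumed choices, not prescribed by the application.

```python
import numpy as np

def preprocess_image(image: np.ndarray, augment: bool = False) -> np.ndarray:
    """Data enhancement plus normalization, as described above."""
    if augment:
        image = image[:, ::-1]                    # simple enhancement: horizontal flip
    image = image.astype(np.float32)
    span = image.max() - image.min()
    return (image - image.min()) / (span + 1e-8)  # normalize to [0, 1]
```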
  • the following describes how the first terminal determines the target tone color template.
  • the determination of the target tone color template by the first terminal may specifically include the following situations:
  • the first terminal determines a target tone color template based on the translated text corresponding to the first voice data
  • the first terminal determines the target tone color template based on the recognition result corresponding to the first image data
  • the first terminal determines a target tone color template based on the translated text and the recognition result in combination with the selection of the first user.
  • the second terminal sends the first voice data of the second user to the first terminal.
  • the first terminal can determine the target tone color template according to the conversation content of the second user.
  • the use of the translated text to determine the target tone color template includes: searching the translated text for a first text corresponding to a preset character string; when the first text corresponding to the preset character string is found in the translated text, determining the target tone color template based on the first text.
  • For example, suppose the recognized text corresponding to a dialogue initiated by the second user about the Huawei incident is "Ren Zhengfei is a great entrepreneur", and the first text corresponding to the preset character string is "Ren Zhengfei". Since the first text corresponding to the preset character string can be found in the recognized text "Ren Zhengfei is a great entrepreneur", Ren Zhengfei's tone color template is determined as the target tone color template.
  • Referring to FIG. 2, the implementation process of the first terminal determining the target tone color template based on the translated text includes:
  • Step 1 The first terminal searches for the first text corresponding to the preset character string from the translated text;
  • Step 2 When the first text corresponding to the preset character string is searched from the translated text, the target tone color template is determined based on the first text.
  • In this way, a character mentioned in the second user's conversation content is used to determine the target tone color template, which can arouse the first user's strong interest in the conversation initiated by the second user; subsequently, that character's timbre can be used to play the second user's conversation content to the first user, achieving voice-changing communication.
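  • A minimal sketch of this string-search rule follows; the preset strings and template identifiers are illustrative stand-ins, not values from the application.

```python
# Illustrative preset character strings mapped to timbre template names.
PRESET_STRINGS = {
    "Ren Zhengfei": "ren_zhengfei_timbre",
    "Andy Lau": "andy_lau_timbre",
}

def template_from_text(translated_text: str, default: str = "default_timbre") -> str:
    for preset, template in PRESET_STRINGS.items():
        if preset in translated_text:   # the "first text" was found
            return template
    return default                      # no match: fall back to the default
```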
  • the second terminal sends the first image data of the clothing worn by the second user to the first terminal, so that the first terminal can determine the target according to the clothing worn by the second user Tone color template.
  • the use of the recognition result to determine the target tone color template includes: determining whether the recognition result indicates that the first image corresponding to the first image data matches a preset image; when the recognition result indicates that the first image corresponding to the first image data matches the preset image, determining the target tone color template based on the first image.
  • For example, when the second user interacts with the first user through role-playing while wearing a Mickey Mouse costume, and the clothing corresponding to the preset image is Mickey Mouse clothing, the first image corresponding to the first image data matches the preset image, so the Mickey Mouse timbre is determined as the target timbre template.
  • Referring to FIG. 3, the implementation process of the first terminal determining the target tone color template based on the recognition result of the first image data includes:
  • Step 1 The first terminal judges whether the recognition result of the first image data indicates that the first image corresponding to the first image data matches a preset image.
  • For example, the clothing corresponding to preset images may be a suit, Mickey Mouse clothing, Donald Duck clothing, and so on.
  • Assume the clothing worn by the second user is a suit; that is, the clothing corresponding to the first image is a suit.
  • Step 2 When the recognition result indicates that the first image corresponding to the first image data matches the preset image, determine the target tone color template based on the first image.
  • Step 3 When the recognition result indicates that the first image corresponding to the first image data does not match the preset image, the timbre template set as the default is used as the target timbre template.
  • the clothing worn by the second user may be used to determine the target tone color template.
  • In this way, the first user can be inspired to take great interest in the conversation initiated by the second user, and the timbre corresponding to the clothing worn by the second user can subsequently be used to play the second user's conversation content to the first user, achieving the effect of voice and appearance unity.
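  • The image-based rule can be sketched the same way; the clothing labels and template names below are assumptions standing in for the actual recognition result and database entries.

```python
# Illustrative mapping from recognized clothing labels to timbre templates.
PRESET_IMAGES = {
    "mickey_mouse_costume": "mickey_mouse_timbre",
    "donald_duck_costume": "donald_duck_timbre",
}

def template_from_image(recognized_label: str, default: str = "default_timbre") -> str:
    # When the first image matches a preset image, use its template;
    # otherwise fall back to the default template (Step 3 above).
    return PRESET_IMAGES.get(recognized_label, default)
```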
  • the second terminal sends the first voice data and first image data of the second user to the first terminal.
  • In this way, the first terminal can determine the target tone color template based on the clothing worn by the second user, the content of the second user's conversation, and the selection of the first user.
  • Referring to FIG. 4, the implementation process of the first terminal determining the target tone color template based on the translated text and the recognition result of the first image data includes:
  • Step 1: The first terminal judges whether the translated text contains the first text corresponding to the preset character string, and judges whether the recognition result of the first image data indicates that the first image corresponding to the first image data matches the preset image.
  • Step 2: When the first text corresponding to the preset character string is found in the translated text and the recognition result indicates that the first image corresponding to the first image data matches the preset image, prompt information is displayed.
  • the prompt information is used to prompt the user to select a desired tone color template from the tone color template corresponding to the first text and the tone color template corresponding to the first image.
  • a list of tone color templates may also be displayed on the display interface, where the list of tone color templates is used by the user to select the desired tone color template.
  • Step 3 Receive a first operation for the prompt information; in response to the first operation, use the timbre template selected by the user as the target timbre template.
  • the target timbre template may be determined according to the timbre template selected by the user.
  • In this way, the first user can be motivated to take great interest in the conversation initiated by the second user, and the timbre selected by the first user can subsequently be used to play the second user's conversation content to the first user, improving the first user's satisfaction.
  • Otherwise, the tone color template set as the default is used as the target tone color template.
  • Step 103 The first terminal selects the target tone color template from the tone color template database, and uses the selected target tone color template to convert the translated text into audio data matching the target tone color.
  • the first terminal before the first terminal selects the target tone color template from the tone color template database, it also needs to establish a tone color template database.
  • In an embodiment, the method further includes: collecting at least two pieces of voice data, and using the at least two pieces of voice data as training data;
  • inputting the training data at the input layer of a convolutional neural network, and performing input-to-output mapping on the training data in at least one feature extraction layer of the convolutional neural network to obtain at least two pieces of feature data;
  • generating a tone color template database based on the at least two pieces of feature data.
  • a piece of voice data may refer to a user's voice data collected after authorization by the user.
  • Using the convolutional neural network, the timbres of different users can be collected quickly, and by cloning a user's timbre, personalized timbre templates of classic characters (such as a Mickey Mouse timbre template) and of celebrities can be obtained.
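  • As a rough sketch under stated assumptions (mel-spectrogram input, illustrative layer sizes, PyTorch as the framework), a convolutional feature extractor of this kind might look like the following; the application does not specify the network architecture.

```python
import torch
import torch.nn as nn

class TimbreEncoder(nn.Module):
    """Maps a voice sample to a fixed-size timbre embedding."""

    def __init__(self, n_mels: int = 80, embed_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(        # the feature extraction layers
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, embed_dim, kernel_size=5, padding=2),
            nn.ReLU(),
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> (batch, embed_dim)
        return self.features(mel).mean(dim=-1)   # average over time

# Each user's embedding could then be stored as one entry of the
# tone color template database.
```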
  • said using the selected target timbre template to convert the translated text into audio data matching the target timbre includes:
  • obtaining the target language of the receiver of the audio data; when the language corresponding to the translated text and the target language belong to the same language, using the selected target timbre template to convert the translated text into audio data matching the target timbre.
  • The target language of the receiver can be determined according to the audio of the receiver of the audio data, or according to text information input by the receiver of the audio data.
  • In an embodiment, the method further includes: when the language corresponding to the translated text and the target language belong to different languages, converting the translated text into text matching the target language;
  • using the selected target tone color template, the converted text is converted into audio data matching the target tone color.
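  • A minimal sketch of this branch is shown below; `detect_language` and `tts` are assumed placeholder helpers, and `translate` is the placeholder from the earlier pipeline sketch.

```python
def detect_language(text: str) -> str:
    """Placeholder language detector for the translated text."""
    ...

def tts(text: str, timbre_template: str) -> bytes:
    """Placeholder TTS step: synthesize audio in the target timbre."""
    ...

def audio_for_receiver(translated_text: str, target_lang: str,
                       timbre_template: str) -> bytes:
    lang = detect_language(translated_text)
    if lang != target_lang:
        # Different languages: convert the text to the target language first.
        translated_text = translate(translated_text, lang, target_lang)
    return tts(translated_text, timbre_template)
```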
  • the following describes how the first terminal uses the target tone color template to generate audio data.
  • the first terminal uses the selected target tone color template to generate audio data in combination with the translated text, which may specifically include the following situations:
  • In the first case, the target timbre template is used to perform text-to-speech (TTS, Text To Speech) conversion on the translated text to obtain audio data.
  • In the second case, the target timbre template is used in combination with the intonation of the second user to perform TTS conversion on the translated text to obtain audio data.
  • In the third case, the target timbre template is used in combination with the intonation and emotion of the second user to perform TTS conversion on the translated text to obtain audio data.
  • In the fourth case, the target timbre template is used in combination with the intonation, emotion, and speech rate of the second user to perform TTS conversion on the translated text to obtain audio data.
  • In the fifth case, multiple target tone color templates are used to perform TTS conversion on the translated text to obtain audio data.
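  • The first four cases differ only in which prosodic features accompany the conversion, so they can be sketched with one hedged signature (the fifth, multi-template case is sketched after the FIG. 9 steps below); the feature names are assumptions.

```python
# One signature covering cases one to four: optional prosodic features
# extracted from the second user's first voice data are folded into the
# TTS conversion when present.

def synthesize(text: str, template: str, **features) -> bytes:
    """Placeholder synthesizer accepting optional prosodic features."""
    ...

def tts_with_features(translated_text: str, timbre_template: str,
                      intonation=None, emotion=None, speech_rate=None) -> bytes:
    features = {name: value for name, value in
                {"intonation": intonation, "emotion": emotion,
                 "speech_rate": speech_rate}.items() if value is not None}
    return synthesize(translated_text, timbre_template, **features)
```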
  • In the first case, the using the selected target timbre template to convert the translated text into audio data matching the target timbre includes: the first terminal uses the selected target timbre template to perform text-to-speech (TTS, Text To Speech) conversion on the translated text to generate audio data matching the target timbre.
  • Taking one target timbre as an example, the implementation process of the first terminal generating audio data matching the target timbre is described with reference to FIG. 5, including:
  • Step 1 The first terminal determines the target tone color template.
  • Assume the recognized text corresponding to the dialogue initiated by the second user to the first user about the Huawei incident is "Ren Zhengfei is a great entrepreneur"; the first terminal uses Ren Zhengfei's timbre template as the target timbre template based on the translated text corresponding to the recognized text.
  • Step 2 The first terminal uses the target tone color template to perform TTS conversion on the translated text to generate audio data matching the target tone color.
  • In this way, the first terminal broadcasts the second user's conversation content "Ren Zhengfei is a great entrepreneur" to the first user through Ren Zhengfei's timbre, thereby inspiring the first user to take a strong interest in the second user's conversation content.
  • In an embodiment, when generating audio data matching the target timbre, the method further includes: the first terminal performs feature extraction on the first voice data to obtain intonation features; and uses the selected target timbre template, in combination with the intonation features, to perform TTS conversion on the translated text to generate audio data matching the target timbre.
  • the intonation feature may characterize the rise and fall of the second user's voice.
  • performing intonation feature extraction on the first speech data to obtain the intonation features includes: extracting the fundamental frequency values of the voiced segments from the first speech data using an autocorrelation method; interpolating over the silent and unvoiced segments in the speech data to obtain a complete fundamental frequency curve; fitting the fundamental frequency curve to obtain a continuous smooth pitch curve; and taking the logarithm of the obtained curve and filtering it to obtain the intonation features.
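  • The autocorrelation step for one voiced frame can be sketched as follows; the 60-400 Hz search range is an assumption, and the interpolation, fitting, log-scaling, and filtering stages described above are omitted for brevity.

```python
import numpy as np

def frame_f0(frame: np.ndarray, sample_rate: int,
             f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Estimate the fundamental frequency of one voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sample_rate / f_max), int(sample_rate / f_min)
    lag = lo + int(np.argmax(ac[lo:hi]))   # strongest periodicity in range
    return sample_rate / lag
```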
  • In this way, the first terminal may play the second user's conversation content to the first user through the target timbre determined based on the clothing worn by the second user, combined with the second user's intonation; or through the target timbre determined based on the second user's conversation content, combined with the second user's intonation. In this way, the first user will not only take great interest in the second user's conversation content, but also develop a sense of intimacy with the second user himself.
  • Referring to FIG. 6, the implementation process of the first terminal generating audio data matching the target timbre is described, including:
  • Step 1 The first terminal determines the target tone color template.
  • the first terminal uses Ren Zhengfei's tone color template as a target tone color template based on the translated text corresponding to the recognized text.
  • Step 2 The first terminal performs feature extraction on the first voice data of the second user to obtain intonation features.
  • Step 3 The first terminal uses the target tone color template to perform TTS conversion on the translated text in combination with the intonation feature to generate audio data matching the target tone color.
  • In this way, the first terminal broadcasts the second user's conversation content "Ren Zhengfei is a great entrepreneur" to the first user through Ren Zhengfei's timbre, combined with the second user's intonation.
  • In an embodiment, when generating audio data matching the target timbre, the method further includes: the first terminal performs feature extraction on the first voice data to obtain emotional features; and uses the selected target timbre template, in combination with the emotional features and the intonation features, to generate audio data matching the target timbre.
  • the emotional feature may represent the emotions generated by the second user during the conversation, such as anger, fear, sadness, and the like.
  • performing emotional feature extraction on the first voice data to obtain emotional features may include: extracting formant features from the voice data; and identifying the user's emotional features based on the extracted formant features.
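  • One way to realize the formant step is linear predictive coding; the sketch below uses librosa's LPC routine, and the LPC order of 12 is an assumption. Mapping the resulting formant frequencies to emotion labels such as anger, fear, or sadness would be left to a downstream classifier.

```python
import numpy as np
import librosa

def formant_features(frame: np.ndarray, sample_rate: int, order: int = 12) -> np.ndarray:
    """Estimate formant frequency candidates (F1, F2, ...) for one frame."""
    a = librosa.lpc(frame.astype(float), order=order)     # LPC coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]    # keep upper half-plane
    freqs = np.angle(roots) * sample_rate / (2 * np.pi)   # radians -> Hz
    return np.sort(freqs)
```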
  • In this way, the first terminal may play the second user's conversation content to the first user through the target timbre determined based on the clothing worn by the second user, combined with the second user's intonation and emotion; or through the target timbre determined based on the second user's conversation content, combined with the second user's intonation and emotion. In this way, the first user will not only take great interest in the second user's conversation content, but also become curious about the second user himself.
  • Referring to FIG. 7, the implementation process of the first terminal generating audio data matching the target timbre is described, including:
  • Step 1 The first terminal determines the target tone color template.
  • the first terminal uses Ren Zhengfei's tone color template as a target tone color template based on the translated text corresponding to the recognized text.
  • Step 2 The first terminal performs feature extraction on the first voice data of the second user to obtain intonation and emotional features.
  • Step 3 The first terminal uses the target tone color template to perform TTS conversion on the translated text in combination with the intonation and emotional characteristics to generate audio data matching the target tone color.
  • In this way, the first terminal broadcasts the second user's conversation content "Ren Zhengfei is a great entrepreneur" to the first user through Ren Zhengfei's timbre, combined with the second user's intonation and emotion.
  • In an embodiment, when generating audio data matching the target timbre, the method further includes: the first terminal performs feature extraction on the first voice data to obtain speech rate features; and uses the selected target timbre template, in combination with the intonation, emotional, and speech rate features, to generate audio data matching the target timbre.
  • the speaking rate feature represents the amount of vocabulary spoken by the second user in a unit time.
  • the process of the first terminal performing feature extraction on the second user's first voice data to obtain the speech rate feature includes: counting the vocabulary spoken per unit time based on the first voice data, and obtaining the speech rate feature from the counted vocabulary.
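  • A minimal sketch, assuming the recognized text from the ASR step and whitespace tokenization (a real system would count vocabulary with language-appropriate word segmentation):

```python
def speech_rate_feature(recognized_text: str, duration_seconds: float) -> float:
    """Vocabulary spoken per unit time, as described above."""
    words = recognized_text.split()          # crude whitespace tokenization
    return len(words) / max(duration_seconds, 1e-6)
```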
  • In this way, the first terminal may play the second user's conversation content to the first user through the target timbre determined based on the clothing worn by the second user, combined with the second user's intonation, emotion, and speech rate; or through the target timbre determined based on the second user's conversation content, combined with the second user's intonation, emotion, and speech rate. In this way, the first user will not only take great interest in the second user's conversation content, but also become curious about the second user himself.
  • Referring to FIG. 8, the implementation process of the first terminal generating audio data matching the target timbre is described, including:
  • Step 1 The first terminal determines the target tone color template.
  • the first terminal uses Ren Zhengfei's tone color template as a target tone color template based on the translated text corresponding to the recognized text.
  • Step 2 The first terminal performs feature extraction on the first voice data of the second user to obtain features of intonation, emotion, and speech rate.
  • Step 3 The first terminal uses the target tone color template to perform TTS conversion on the translated text in combination with the characteristics of intonation, emotion, and speech rate to generate audio data matching the target tone color.
  • In this way, the first terminal broadcasts the second user's conversation content "Ren Zhengfei is a great entrepreneur" to the first user through Ren Zhengfei's timbre, combined with the second user's intonation, emotion, and speech rate.
  • In an embodiment, the number of target timbre templates determined by the first terminal according to the recognized text of the second user is at least two; when generating audio data matching the target timbre, the method further includes: the first terminal segments the translated text according to the number of target timbre templates to obtain at least two paragraphs; performs TTS conversion on the at least two paragraphs using the at least two target timbre templates to obtain at least two audio segments; and splices the at least two audio segments to obtain the audio data.
  • In this way, the first terminal may determine a plurality of target timbres based on the second user's conversation content and use the plurality of target timbres to play the second user's conversation content to the first user. In this way, the first user will be extremely curious about the second user's conversation content.
  • Referring to FIG. 9, the implementation process of the first terminal generating audio data matching the target timbre is described, including:
  • Step 1 The first terminal determines multiple target tone color templates.
  • the first terminal uses the tone color templates of Ren Zhengfei and Andy Lau as the target tone color templates based on the translated text corresponding to the recognized text.
  • Step 2 The first terminal translates the first voice data of the second user to obtain a translated text, and segments the translated text to obtain two paragraphs.
  • Step 3 The first terminal uses two target tone color templates to perform TTS conversion on the two paragraphs to generate two audio clips;
  • Step 4 Splicing the two audio clips to obtain audio data.
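  • The flow above can be sketched as follows, reusing the placeholder `tts` from the earlier sketches; splitting the text evenly by word count is an assumption, as the application does not specify the segmentation rule.

```python
def multi_timbre_audio(translated_text: str, templates: list) -> bytes:
    words = translated_text.split()
    size = -(-len(words) // len(templates))   # ceil: words per paragraph
    paragraphs = [" ".join(words[i:i + size])
                  for i in range(0, len(words), size)]
    clips = [tts(p, t) for p, t in zip(paragraphs, templates)]  # one TTS pass each
    return b"".join(clips)                    # splice the segments into one stream
```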
  • Step 104 Output the audio data.
  • the first terminal may play audio matching the target tone color through an audio output module.
  • the audio output module may be implemented by a speaker of the first terminal.
  • When the second user interacts with the first user, the first terminal does not use the second user's own timbre to play the second user's conversation content to the first user; instead, it plays the second user's conversation content to the first user through a target timbre determined based on the clothing worn by the second user.
  • Alternatively, when the second user interacts with the first user, the first terminal does not use the second user's own timbre to play the second user's conversation content to the first user, but plays it through a target timbre determined based on the second user's conversation content.
  • Alternatively, when the second user interacts with the first user, the first terminal does not use the second user's own timbre to play the second user's conversation content to the first user, but plays it through the target timbre selected by the first user.
  • Referring to FIG. 10a, the implementation process of the first terminal playing the second user's conversation content through the target timbre is described, including:
  • Step 1 The second terminal sends the first voice data and the first image data of the second user to the first terminal.
  • Specifically, the second terminal uses a microphone to collect the second user's audio, such as "hello, where are you from?", to obtain the first voice data, and uses a camera to capture the Mickey Mouse costume worn by the second user to obtain the first image data; it sends the first voice data and the first image data to the first terminal through a wireless transmission module.
  • Step 2 The first terminal translates the first voice data to obtain a translated text; performs image recognition on the first image data to obtain a recognition result.
  • Step 3 The first terminal uses the translated text and/or the recognition result to determine a target tone color template.
  • Step 4 The first terminal selects the target tone color template from the tone color template database, and uses the selected target tone color template to convert the translated text into audio data matching the target tone color.
  • Step 5 The first terminal outputs the audio data.
  • the first terminal broadcasts the translated text "children, where are you from” corresponding to the conversation content of the second user to the first user through the Mickey Mouse timbre.
  • Referring to FIG. 10b, another implementation process of the first terminal playing the second user's conversation content through the target timbre is described, including:
  • Step 1 The second terminal sends the first voice data and the first image data of the second user to the first terminal.
  • Specifically, the second terminal uses a microphone to collect the second user's audio, such as "hello, where are you from?", to obtain the first voice data, and uses a camera to capture the Mickey Mouse costume worn by the second user to obtain the first image data; it sends the first voice data and the first image data to the first terminal through a wireless transmission module.
  • Step 2 The first terminal translates the first voice data to obtain a translated text; performs image recognition on the first image data to obtain a recognition result.
  • Step 3 The first terminal uses the translated text and/or the recognition result to determine a target tone color template.
  • Step 4 The first terminal selects the target tone color template from the tone color template database, and sends the translated text and the target tone color template to the server.
  • the server uses the target tone color template to convert the translated text into audio data matching the target tone color; and returns the audio data to the first terminal.
  • Step 5 The first terminal receives and outputs audio data sent by the server.
  • the first terminal plays the translated text "children, where are you from” corresponding to the conversation content of the second user to the first user through the Mickey Mouse timbre.
  • In this way, the first terminal may convert the translated text into audio data matching the target timbre, or the server may do so; both online conversion and offline conversion are supported, making the implementation more flexible.
  • In summary, the first terminal receives the data to be processed sent by the second terminal; translates the first voice data in the data to be processed to obtain the translated text; performs image recognition on the first image data in the data to be processed to obtain the recognition result; determines the target timbre template using the translated text and/or the recognition result; selects the target timbre template from the timbre template database; uses the selected target timbre template to convert the translated text into audio data matching the target timbre (which may include online conversion and offline conversion); and outputs the audio data. The target timbre can thus be used to play the conversation content of the second user using the second terminal to the first user, encouraging the first user to take a strong interest in the second user's conversation content and realizing voice-changing communication between the second user and the first user.
  • FIG. 12 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application; as shown in FIG. 12, the data processing device includes:
  • the obtaining unit 121 is configured to obtain data to be processed
  • the first processing unit 122 is configured to translate the first voice data in the to-be-processed data to obtain a translated text; perform image recognition on the first image data in the to-be-processed data to obtain a recognition result;
  • the second processing unit 123 is configured to use the translated text and/or the recognition result to determine the target tone color template; select the target tone color template from the tone color template database; use the selected target tone color template to convert the translated text to the target tone Audio data with timbre matching;
  • the output unit 124 is configured to output the audio data.
  • the first processing unit 122 is configured to perform voice recognition on the first voice data using a voice recognition technology to obtain recognized text, and translate the recognized text using a preset translation model to obtain the translated text;
  • the translation model is used to translate text in a first language into text in at least one second language; the first language is different from the second language.
  • the first processing unit 122 is configured to perform image preprocessing on the first image data in the to-be-processed data to obtain preprocessed first image data; extract feature data from the preprocessed first image data;
  • and perform image recognition on the extracted feature data using an image recognition technology to obtain the recognition result.
  • the image preprocessing includes: performing data enhancement and normalization on the first image data.
  • the second processing unit 123 is configured to search the translated text for a first text corresponding to a preset character string;
  • and, when the first text corresponding to the preset character string is found in the translated text, determine the target tone color template based on the first text.
  • the second processing unit 123 is configured to determine whether the recognition result indicates that the first image corresponding to the first image data matches a preset image;
  • and, when it does, determine the target tone color template based on the first image.
  • the second processing unit 123 is configured to obtain the target language of the receiver of the audio data;
  • and, when the language corresponding to the translated text and the target language belong to the same language, use the selected target timbre template to convert the translated text into audio data matching the target timbre.
  • the second processing unit 123 is configured to convert the translated text into text matching the target language when the language corresponding to the translated text and the target language belong to different languages; Using the selected target tone color template, the converted text is converted into audio data matching the target tone color.
  • the device further includes:
  • a generating unit, configured to collect at least two pieces of voice data, and use the at least two pieces of voice data as training data;
  • input the training data at the input layer of the convolutional neural network, and perform input-to-output mapping on the training data in at least one feature extraction layer of the convolutional neural network to obtain at least two pieces of feature data;
  • and generate a tone color template database based on the obtained feature data.
  • the second processing unit 123 is configured to: use the selected target timbre template to perform TTS conversion on the translated text to generate audio data matching the target timbre.
  • the second processing unit 123 is further configured to: perform feature extraction on the first voice data to obtain intonation features; and use the selected target timbre template, in combination with the intonation features, to perform TTS conversion on the translated text to generate audio data matching the target timbre.
  • the second processing unit 123 is further configured to: perform feature extraction on the first voice data to obtain emotional features; and use the selected target timbre template, in combination with the emotional features and the intonation features, to generate audio data matching the target timbre.
  • the second processing unit 123 is further configured to: perform feature extraction on the first voice data to obtain speech rate features; and use the selected target timbre template, in combination with the intonation, emotional, and speech rate features, to generate audio data matching the target timbre.
  • the second processing unit 123 is further configured to: segment the translated text according to the number of target timbre templates to obtain at least two paragraphs; perform TTS conversion on the at least two paragraphs using the at least two target timbre templates to obtain at least two audio segments;
  • and splice the at least two audio segments to obtain the audio data.
  • In practice, the acquisition unit 121 and the output unit 124 can be implemented by a communication interface in the data processing device; the first processing unit 122 and the second processing unit 123 can be implemented by a processor in the data processing device.
  • It should be noted that, when the device provided in the above embodiment performs data processing, the division into the above program modules is only used as an example for illustration. In practical applications, the above processing can be allocated to different program modules as needed; that is, the internal structure of the terminal is divided into different program modules to complete all or part of the processing described above.
  • the device provided in the foregoing embodiment and the data processing method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • FIG. 13 is a schematic diagram of the hardware composition structure of the data processing device of the embodiment of the application, as shown in FIG. 13,
  • the data processing device 130 includes a memory 133, a processor 132, and a computer program that is stored on the memory 133 and can run on the processor 132; the processor 132 located in the data processing device implements the steps of the method provided by one or more of the technical solutions on the data processing device side when executing the program.
  • Specifically, when the processor 132 located in the data processing device 130 executes the program, the following is realized: receiving the to-be-processed data sent by the second terminal; translating the first voice data in the to-be-processed data to obtain a translated text, and performing image recognition on the first image data to obtain a recognition result; determining a target tone color template using the translated text and/or the recognition result; selecting the target tone color template from the tone color template database; using the selected target tone color template to convert the translated text into audio data matching the target tone color; and outputting the audio data.
  • when the processor 132 located in the data processing device 130 executes the program, it realizes: acquiring a target language matching the language of the recipient of the audio data; and, when the language corresponding to the translated text and the target language belong to the same language, using the selected target timbre template to convert the translated text into audio data matching the target timbre.
  • when the processor 132 located in the data processing device 130 executes the program, it realizes: acquiring a target language matching the language of the recipient of the audio data; when the language corresponding to the translated text and the target language belong to different languages, converting the translated text into text matching the target language; and, using the selected target tone color template, converting the converted text into audio data matching the target tone color.
  • when the processor 132 located in the data processing device 130 executes the program, it realizes: collecting at least one piece of voice data; using the collected voice data as training data and inputting it into the convolutional neural network to obtain feature data; and generating a tone color template database based on the obtained feature data.
  • the processor 132 located in the data processing device 130 executes the program, it realizes that: using the selected target timbre template, perform a text-to-speech TTS conversion on the translated text to generate audio data matching the target timbre.
  • when the processor 132 located in the data processing device 130 executes the program, it realizes: sending the translated text and the target timbre template to the server; the translated text and the target timbre template are used by the server to perform TTS conversion on the translated text to generate audio data matching the target timbre; and receiving the audio data returned by the server.
  • the processor 132 located in the data processing device 130 executes the program, it is realized that the audio data is synchronously output as the data to be processed is acquired.
  • the data processing device further includes a communication interface 131; various components in the data processing device are coupled together through the bus system 134. It can be understood that the bus system 134 is configured to implement connection and communication between these components. In addition to the data bus, the bus system 134 also includes a power bus, a control bus, and a status signal bus.
  • the memory 133 in this embodiment may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM, Read Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM, Ferromagnetic Random Access Memory), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory can be a magnetic disk memory or a magnetic tape memory.
  • the volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as a static random access memory (SRAM), a synchronous static random access memory (SSRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a SyncLink dynamic random access memory (SLDRAM), and a direct Rambus random access memory (DRRAM).
  • the memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memories.
  • the method disclosed in the foregoing embodiments of the present application may be applied to the processor 132 or implemented by the processor 132.
  • the processor 132 may be an integrated circuit chip with signal processing capabilities. In the implementation process, the steps of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 132 or instructions in the form of software.
  • the aforementioned processor 132 may be a general-purpose processor, a DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the processor 132 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as execution and completion by a hardware decoding processor, or execution and completion by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium, and the storage medium is located in a memory.
  • the processor 132 reads information in the memory and completes the steps of the foregoing method in combination with its hardware.
  • the embodiment of the present application also provides a storage medium, which is specifically a computer storage medium, and more specifically, a computer-readable storage medium.
  • a storage medium which is specifically a computer storage medium, and more specifically, a computer-readable storage medium.
  • Stored thereon are computer instructions, that is, a computer program; when the computer instructions are executed by a processor, the method provided by one or more of the technical solutions on the data processing device side is implemented.
  • the disclosed method and smart device can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other division manners in actual implementation, for example: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may be used individually as a unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • a person of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be implemented by a program instructing relevant hardware.
  • the foregoing program can be stored in a computer-readable storage medium; when executed, the program performs the steps including those of the foregoing method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
  • If the aforementioned integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • Based on this understanding, the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a data processing device, a network device, or the like) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A data processing method, device, and storage medium. The method includes: a first terminal obtains data to be processed (101); the first terminal translates first voice data in the data to be processed to obtain a translated text, performs image recognition on first image data in the data to be processed to obtain a recognition result, and determines a target tone color template using the translated text and/or the recognition result (102); the first terminal selects the target tone color template from a tone color template database, and uses the selected target tone color template to convert the translated text into audio data matching the target tone color (103); and outputs the audio data (104).

Description

Data processing method, device, and storage medium

Technical Field

This application relates to terminal technology, and in particular to a data processing method, device, and storage medium.

Background

With the rapid development of culture, performance culture has become increasingly popular, showing a globalization trend and gradually entering public life in specific forms, such as anime character performances. Usually, in an anime character performance, the audience can only see the performer's figure but cannot hear the performer communicate with the audience in a personalized, changed voice, and thus cannot enjoy a more immersive interactive experience, resulting in a poor performance effect.
Summary

The embodiments of the present application provide a data processing method, device, and storage medium.

An embodiment of the present application provides a data processing method, including:

obtaining data to be processed;

translating first voice data in the data to be processed to obtain a translated text; performing image recognition on first image data in the data to be processed to obtain a recognition result; and determining a target tone color template using the translated text and/or the recognition result;

selecting the target tone color template from a tone color template database, and using the selected target tone color template to convert the translated text into audio data matching the target tone color;

outputting the audio data.
An embodiment of the present application further provides a data processing device, including:

an obtaining unit, configured to obtain data to be processed;

a first processing unit, configured to translate first voice data in the data to be processed to obtain a translated text, and perform image recognition on first image data in the data to be processed to obtain a recognition result;

a second processing unit, configured to determine a target tone color template using the translated text and/or the recognition result, select the target tone color template from a tone color template database, and use the selected target tone color template to convert the translated text into audio data matching the target tone color;

an output unit, configured to output the audio data.

An embodiment of the present application further provides a data processing device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of any one of the above methods when executing the program.

An embodiment of the present application further provides a storage medium on which computer instructions are stored, where the instructions, when executed by a processor, implement the steps of any one of the above methods.

According to the data processing method, device, and storage medium provided by the embodiments of the present application, data to be processed is obtained; first voice data in the data to be processed is translated to obtain a translated text; image recognition is performed on first image data in the data to be processed to obtain a recognition result; a target tone color template is determined using the translated text and/or the recognition result; the target tone color template is selected from a tone color template database; the translated text is converted, using the selected target tone color template, into audio data matching the target tone color; and the audio data is output. In this way, the conversation content of the second user using the second terminal is played in the target timbre to the first user using the first terminal, prompting the first user to take a strong interest in the second user's conversation content, realizing voice-changing communication between the second user and the first user, and bringing the first user a more immersive interactive experience.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an implementation of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of the first terminal determining a target timbre template based on the translated text according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of the first terminal determining a target timbre template based on the recognition result of the first image data according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of the first terminal determining a target timbre template based on the translated text and the recognition result of the first image data according to an embodiment of the present application;
FIG. 5 is a first schematic flowchart of the first terminal generating audio data matching the target timbre according to an embodiment of the present application;
FIG. 6 is a second schematic flowchart of the first terminal generating audio data matching the target timbre according to an embodiment of the present application;
FIG. 7 is a third schematic flowchart of the first terminal generating audio data matching the target timbre according to an embodiment of the present application;
FIG. 8 is a fourth schematic flowchart of the first terminal generating audio data matching the target timbre according to an embodiment of the present application;
FIG. 9 is a fifth schematic flowchart of the first terminal generating audio data matching the target timbre according to an embodiment of the present application;
FIG. 10a is a schematic flowchart of the first terminal playing the second user's talk content in the target timbre according to an embodiment of the present application;
FIG. 10b is another schematic flowchart of the first terminal playing the second user's talk content in the target timbre according to an embodiment of the present application;
FIG. 11 is a schematic diagram of interactive communication between the first user and the second user according to an embodiment of the present application;
FIG. 12 is a first schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 13 is a second schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
Before the technical solutions of the embodiments of the present application are described in detail, the related art is briefly described.
In the related art, with the rapid development of culture, performance culture has become increasingly popular, shows a globalizing trend, and is gradually entering public life in specific forms, such as 2D-character performances. At present, in a 2D-character performance, the audience can only see the performer's figure; they cannot hear the performer talk with them in a personalized, transformed voice, so the performance cannot give the audience a more immersive interactive experience, which leads to a poor performance effect.
In practical applications, a performer may dress in a doll costume such as Mickey Mouse or Donald Duck to play a role and interact with the audience watching the performance. During the performance, a client collects the performer's audio and sends it to a server; the server recognizes the audio data to obtain a recognized text, translates the recognized text to obtain a translation result, and sends the translation result back to the client, which broadcasts the speech through an earphone device, thereby realizing interaction between the performer and the audience. However, the performer cannot communicate with the audience in a transformed, personalized timbre, such as the timbre of Mickey Mouse, Donald Duck, or a star such as Andy Lau (刘德华), so the audience cannot get a more immersive interactive experience, resulting in a poor performance effect. Against the background of globalization, the performer is even less able to communicate across languages with audiences of different native languages in a personalized timbre.
Based on this, in various embodiments of the present application, the first terminal acquires data to be processed; translates first voice data in the data to be processed to obtain a translated text; performs image recognition on first image data in the data to be processed to obtain a recognition result; determines a target timbre template by using the translated text and/or the recognition result; selects the target timbre template from a timbre template database; converts the translated text into audio data matching the target timbre by using the selected target timbre template (which may include online conversion and offline conversion); and outputs the audio data. In this way, the talk content of a second user using the second terminal is played, in the target timbre, to a first user using the first terminal, arousing the first user's strong interest in the second user's talk content and enabling voice-changed communication between the second user and the first user.
The present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the present application provides a data processing method applied to a first terminal. FIG. 1 is a schematic flowchart of an implementation of the method; as shown in FIG. 1, the method includes:
Step 101: the first terminal acquires data to be processed. The data to be processed includes first voice data and first image data.
The first voice data includes voice data produced by a second user using a second terminal when the second user interacts with a first user using the first terminal. The first image data includes image data of the clothing worn by the second user during that interaction. The specific types of the first terminal and the second terminal are not limited in this application; for example, each may be a smartphone, a personal computer, a notebook computer, a tablet computer, a portable wearable device, or the like.
The following describes how the first terminal acquires the data to be processed.
In an embodiment, the second terminal may be provided with or connected to a voice capture module, such as a microphone, which captures the second user's voice to obtain the first voice data; the second terminal establishes communication with the first terminal and transmits the captured first voice data to the first terminal through a wireless transmission module. The wireless transmission module may be a Bluetooth module, a Wireless Fidelity (WiFi) module, or the like.
For example, in a 2D-character performance scenario, when the second user interacts with the first user through role play, the second user starts a conversation about a currently popular song; the second terminal captures the second user's voice with the voice capture module to obtain the first voice data, establishes communication with the first terminal, and sends the first voice data to the first terminal through the wireless transmission module. In a conference scenario with simultaneous interpretation, when the second user interacts with the first user, the second user starts a conversation about the Hong Kong issue; the second terminal captures the second user's voice with the voice capture module to obtain the first voice data, establishes communication with the first terminal, and sends the first voice data to the first terminal.
In another embodiment, the second terminal may be provided with or connected to an image capture module, such as a camera, which captures images of the clothing worn by the second user to obtain the first image data; the second terminal establishes communication with the first terminal and transmits the captured first image data to the first terminal through the wireless transmission module.
For example, in a 2D-character performance scenario, when the second user interacts with the first user through role play while wearing a Mickey-Mouse-shaped costume, the second terminal captures the costume worn by the second user with the image capture module to obtain the first image data, establishes communication with the first terminal, and sends the first image data to the first terminal through the wireless transmission module. In a conference scenario with simultaneous interpretation, when the second user interacts with the first user while wearing a crisp suit, the second terminal captures the clothing worn by the second user with the image capture module to obtain the first image data, establishes communication with the first terminal, and sends the first image data to the first terminal through the wireless transmission module.
Here, the second terminal sends the second user's first voice data to the first terminal, and the first terminal can subsequently translate the first voice data, helping the first user understand the second user's talk content in a language the first user is familiar with, thereby making the communication between the first user and the second user smoother.
Here, the second terminal sends the second user's first voice data to the first terminal, and the first terminal can subsequently determine a target timbre template according to the second user's talk content and play the second user's talk content to the first user in the target timbre, helping the first user deeply understand the second user's speech.
Here, the second terminal sends the first image data of the clothing worn by the second user to the first terminal, and the first terminal can subsequently determine a target timbre template according to the clothing worn by the second user and play the second user's talk content to the first user in the target timbre, arousing the first user's interest in the second user's talk content.
Step 102: the first terminal translates the first voice data in the data to be processed to obtain a translated text; performs image recognition on the first image data in the data to be processed to obtain a recognition result; and determines a target timbre template by using the translated text and/or the recognition result.
In an embodiment, translating the first voice data in the data to be processed to obtain the translated text includes:
performing speech recognition on the first voice data by using a speech recognition technique to obtain a recognized text; and
translating the recognized text by using a preset translation model to obtain the translated text.
The translation model is used to translate text in a first language into text in at least one second language, the first language being different from the second language.
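By way of a non-limiting illustration, the "recognize, then translate" pipeline above can be sketched in a few lines. The concrete models named below (Whisper for recognition, a Helsinki-NLP model as the preset translation model) are assumptions made for the sketch only; the embodiments do not prescribe particular models.

```python
# Minimal sketch of the two-stage pipeline: first voice data -> recognized
# text -> translated text. Model choices are illustrative stand-ins.
from transformers import pipeline

# Speech recognition: produces the recognized text from the first voice data.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Preset translation model: first language -> second language (here zh -> en).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

def translate_first_voice_data(wav_path: str) -> str:
    recognized = asr(wav_path)["text"]                            # recognized text
    return translator(recognized)[0]["translation_text"]          # translated text

if __name__ == "__main__":
    print(translate_first_voice_data("second_user_utterance.wav"))
```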
In an embodiment, performing image recognition on the first image data in the data to be processed to obtain the recognition result includes:
performing image preprocessing on the first image data in the data to be processed to obtain preprocessed first image data; extracting feature data from the preprocessed first image data; and performing image recognition on the extracted feature data by using an image recognition technique to obtain the recognition result.
The image preprocessing of the first image data includes data augmentation, normalization, and the like.
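A minimal sketch of this image branch, assuming a generic pretrained classifier stands in for the unspecified recognizer, might look as follows; the specific network and normalization constants are assumptions, and the costume classes would be defined by the application.

```python
# Illustrative sketch: preprocess the first image data (resize + normalize),
# then extract features and classify with a pretrained ResNet.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                              # scales to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],    # normalization step
                         std=[0.229, 0.224, 0.225]),
])

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

def recognize_costume(image_path: str) -> int:
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)                               # feature extraction + recognition
    return int(logits.argmax(dim=1))                    # recognition result (class index)
```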
The following describes how the first terminal determines the target timbre template.
The first terminal determines the target timbre template in, specifically, the following cases:
In the first case, the first terminal determines the target timbre template based on the translated text corresponding to the first voice data.
In the second case, the first terminal determines the target timbre template based on the recognition result corresponding to the first image data.
In the third case, the first terminal determines the target timbre template based on the translated text and the recognition result, combined with the first user's selection.
In practical applications, the second terminal sends the second user's first voice data to the first terminal, so the first terminal can determine the target timbre template according to the second user's talk content.
Based on this, in an embodiment, determining the target timbre template by using the translated text includes: searching the translated text for a first text corresponding to a preset character string; and, when the first text corresponding to the preset character string is found in the translated text, determining the target timbre template based on the first text.
For example, in a conference scenario with simultaneous interpretation, when the second user interacts with the first user, the recognized text of the conversation the second user starts about the Huawei events is "Ren Zhengfei (任正非) is a remarkable entrepreneur". Assuming the first text corresponding to the preset character string is "任正非", since the first text corresponding to the preset character string can be found in the recognized text, Ren Zhengfei's timbre is determined as the target timbre template.
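The preset-string search itself reduces to a substring lookup over a name-to-template table. The table below is hypothetical; a deployment would populate it from the timbre template database.

```python
# Minimal sketch: search the translated text for a preset string and map the
# first match to a timbre template id. Names and ids here are hypothetical.
PRESET_TIMBRES = {
    "任正非": "ren_zhengfei_timbre",
    "刘德华": "andy_lau_timbre",
    "Mickey Mouse": "mickey_mouse_timbre",
}
DEFAULT_TIMBRE = "default_timbre"

def pick_timbre_template(translated_text: str) -> str:
    for preset_string, template_id in PRESET_TIMBRES.items():
        if preset_string in translated_text:   # first text matching a preset string
            return template_id
    return DEFAULT_TIMBRE                      # fall back to the default template

assert pick_timbre_template("任正非是个了不起的企业家") == "ren_zhengfei_timbre"
```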
In an example, taking the translated text corresponding to the first voice data as an example, FIG. 2 shows the flow of the first terminal determining the target timbre template based on the translated text, including:
Step 1: the first terminal searches the translated text for the first text corresponding to a preset character string.
Assume the first text corresponding to the preset character string is "任正非" (Ren Zhengfei) or "刘德华" (Andy Lau), and the translated text is "Andy Lau is one of my favorite stars".
Step 2: when the first text corresponding to the preset character string is found in the translated text, the target timbre template is determined based on the first text.
It should be noted that, here, to promote the interaction between the second user and the first user, the target timbre template is determined by using a figure mentioned in the second user's talk content. This can arouse the first user's great interest in the conversation started by the second user, and the second user's talk content can subsequently be played to the first user in that figure's timbre, realizing voice-changed communication.
In practical applications, the second terminal sends the first image data of the clothing worn by the second user to the first terminal, so the first terminal can determine the target timbre template according to the clothing worn by the second user.
Based on this, in an embodiment, determining the target timbre template by using the recognition result includes: judging whether the recognition result indicates that the first image corresponding to the first image data matches a preset image; and, when the recognition result indicates that the first image corresponding to the first image data matches the preset image, determining the target timbre template based on the first image.
For example, in a 2D-character performance scenario, when the second user interacts with the first user through role play while wearing a Mickey-Mouse-shaped costume, and assuming the clothing corresponding to the preset image is a Mickey Mouse costume, since the first image corresponding to the first image data matches the preset image, the Mickey Mouse timbre is determined as the target timbre template.
In an example, taking the recognition result of the first image data as an example, FIG. 3 shows the flow of the first terminal determining the target timbre template based on the recognition result of the first image data, including:
Step 1: the first terminal judges whether the recognition result of the first image data indicates that the first image corresponding to the first image data matches a preset image.
Assume the clothing corresponding to the preset images includes a suit, a Mickey Mouse costume, a Donald Duck costume, and so on, and the second user is wearing a suit, i.e., the clothing corresponding to the first image is a suit.
Step 2: when the recognition result indicates that the first image corresponding to the first image data matches a preset image, the target timbre template is determined based on the first image.
Step 3: when the recognition result indicates that the first image corresponding to the first image data does not match any preset image, the timbre template set as the default is used as the target timbre template.
It should be noted that, here, to promote the interaction between the second user and the first user, the target timbre template can be determined by using the clothing worn by the second user. This can arouse the first user's great interest in the conversation started by the second user, and the second user's talk content can subsequently be played to the first user in the timbre corresponding to the clothing worn by the second user, achieving the effect of the voice matching the character.
In practical applications, the second terminal sends the second user's first voice data and first image data to the first terminal, so the first terminal can determine the target timbre template according to the clothing worn by the second user, the second user's talk content, and the first user's selection.
In an example, FIG. 4 shows the flow of the first terminal determining the target timbre template based on the translated text and the recognition result of the first image data, including:
Step 1: the first terminal judges whether the translated text contains the first text corresponding to a preset character string, and judges whether the recognition result of the first image data indicates that the first image corresponding to the first image data matches a preset image.
Step 2: when the first text corresponding to the preset character string is found in the translated text and the recognition result indicates that the first image corresponding to the first image data matches a preset image, prompt information is displayed.
The prompt information is used to prompt the user to select a desired timbre template from the timbre template corresponding to the first text and the timbre template corresponding to the first image.
Here, a timbre template list may also be shown on the display interface for the user to select the timbre template he or she desires.
Step 3: a first operation on the prompt information is received; in response to the first operation, the timbre template selected by the user is used as the target timbre template.
It should be noted that, here, to promote the interaction between the second user and the first user, the target timbre template can be determined according to the timbre template selected by the user. This can arouse the first user's great interest in the conversation started by the second user, and the second user's talk content can subsequently be played to the first user in the timbre selected by the first user, improving the first user's satisfaction.
Here, when the first text corresponding to the preset character string is not found in the translated text and the recognition result indicates that the first image corresponding to the first image data does not match any preset image, the timbre template set as the default may also be used as the target timbre template.
Step 103: the first terminal selects the target timbre template from a timbre template database, and converts the translated text into audio data matching the target timbre by using the selected target timbre template.
In practical applications, before the first terminal selects the target timbre template from the timbre template database, the timbre template database needs to be built.
Based on this, in an embodiment, the method further includes:
the first terminal collects at least two pieces of voice data and uses the at least two pieces of voice data as training data;
the training data is input at the input layer of a convolutional neural network, and input-to-output mapping is performed on the training data at at least one feature-extraction layer of the convolutional neural network to obtain at least two pieces of feature data;
at least two timbre templates are obtained based on the at least two pieces of feature data;
a timbre template database is generated based on the at least two timbre templates.
One piece of voice data may refer to the voice data of one user, collected with that user's authorization.
It should be noted that the convolutional neural network makes it possible to quickly capture the timbres of different users; by cloning users' timbres, personalized timbre templates of multiple classic characters are obtained, such as a timbre template of the Mickey Mouse character, timbre templates of stars, and timbre templates of 2D characters.
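A sketch of such a timbre-template builder is given below; the two-layer architecture and 64-band log-mel input are assumptions, since the embodiments only require a convolutional network with at least one feature-extraction layer.

```python
# Sketch: feed captured voice samples through a small CNN and keep one
# embedding ("timbre template") per speaker in an in-memory database.
import torch
import torch.nn as nn

class TimbreEncoder(nn.Module):
    def __init__(self, embedding_dim: int = 128):
        super().__init__()
        self.features = nn.Sequential(                  # feature-extraction layers
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, embedding_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, 1, n_mels, frames) log-mel spectrogram
        return self.proj(self.features(mel).flatten(1))

encoder = TimbreEncoder().eval()
timbre_database = {}                                    # template id -> embedding

def register_timbre(template_id: str, mel: torch.Tensor) -> None:
    with torch.no_grad():
        timbre_database[template_id] = encoder(mel.unsqueeze(0)).squeeze(0)

register_timbre("mickey_mouse_timbre", torch.randn(1, 64, 200))  # dummy sample
```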
In an embodiment, converting the translated text into audio data matching the target timbre by using the selected target timbre template includes:
the first terminal acquires the target language of the receiver of the audio data;
judging whether the language corresponding to the translated text and the target language belong to the same language;
when it is determined that the language corresponding to the translated text and the target language belong to the same language, converting the translated text into audio data matching the target timbre by using the selected target timbre template.
In practical applications, the receiver's target language may be determined from the audio of the receiver of the audio data, or from text information input by the receiver of the audio data.
In an embodiment, the method further includes:
when the language corresponding to the translated text and the target language belong to different languages, converting the translated text into text matching the target language;
converting the converted text into audio data matching the target timbre by using the selected target timbre template.
In a globalized scenario, converting the translated text into text matching the target language of the receiver of the audio data helps the first user understand the second user's talk content in a language the first user is familiar with, thereby ensuring that both parties can communicate smoothly.
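The language check before synthesis can be sketched as follows; the langdetect library is an illustrative choice of language identifier (it returns codes such as "en" or "zh-cn"), and retranslate stands for the text-to-text conversion step described above.

```python
# Sketch of the pre-synthesis language check: only re-translate when the
# translated text and the receiver's target language differ.
from langdetect import detect

def prepare_text(translated_text: str, target_language: str, retranslate) -> str:
    if detect(translated_text) == target_language:         # same language: use as-is
        return translated_text
    return retranslate(translated_text, target_language)   # convert first
```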
The following describes how the first terminal generates the audio data by using the target timbre template.
The first terminal generates the audio data by using the selected target timbre template in combination with the translated text in, specifically, the following cases:
In the first case, text-to-speech (TTS) conversion is performed on the translated text by using the target timbre template to obtain the audio data.
In the second case, TTS conversion is performed on the translated text by using the target timbre template in combination with the second user's intonation to obtain the audio data.
In the third case, TTS conversion is performed on the translated text by using the target timbre template in combination with the second user's intonation and emotion to obtain the audio data.
In the fourth case, TTS conversion is performed on the translated text by using the target timbre template in combination with the second user's intonation, emotion, and speech rate to obtain the audio data.
In the fifth case, TTS conversion is performed on the translated text by using multiple target timbre templates to obtain the audio data.
In an embodiment, converting the translated text into audio data matching the target timbre by using the selected target timbre template includes: the first terminal performs TTS conversion on the translated text by using the selected target timbre template to generate audio data matching the target timbre.
In an example, taking the target timbre as an example, FIG. 5 shows the flow of the first terminal generating the audio data matching the target timbre, including:
Step 1: the first terminal determines the target timbre template.
The recognized text of the conversation the second user starts with the first user about the Huawei events is "Ren Zhengfei is a remarkable entrepreneur"; based on the translated text corresponding to the recognized text, the first terminal uses Ren Zhengfei's timbre template as the target timbre template.
Step 2: the first terminal performs TTS conversion on the translated text by using the target timbre template to generate audio data matching the target timbre.
The first terminal plays the second user's talk content "Ren Zhengfei is a remarkable entrepreneur" to the first user in Ren Zhengfei's timbre, thereby arousing the first user's strong interest in the second user's talk content.
In an embodiment, when generating the audio data matching the target timbre, the method further includes: the first terminal performs feature extraction on the first voice data to obtain an intonation feature; and performs TTS conversion on the translated text by using the selected target timbre template in combination with the intonation feature to generate audio data matching the target timbre.
The intonation feature can characterize the stress and pacing of the second user's speech.
In an embodiment, performing intonation feature extraction on the first voice data to obtain the intonation feature includes: extracting fundamental frequency values of voiced segments from the first voice data by using an autocorrelation method; interpolating over the silent and unvoiced segments of the first voice data to finally obtain a fundamental frequency curve; fitting the fundamental frequency curve to obtain a continuous, smooth frequency curve; and taking the logarithm of the obtained continuous smooth curve and filtering it to obtain the intonation feature.
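As a sketch of this intonation pipeline, the pYIN tracker in librosa (an autocorrelation-family F0 estimator) can supply the voiced-segment fundamental frequencies, after which interpolation, the logarithm, and smoothing yield the contour; the frame parameters below are assumptions.

```python
# Sketch: F0 of voiced segments -> interpolate over unvoiced/silent frames
# -> continuous curve -> log -> smoothing filter -> intonation feature.
import librosa
import numpy as np
from scipy.ndimage import uniform_filter1d

def intonation_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60.0, fmax=400.0, sr=sr)
    frames = np.arange(len(f0))
    # Interpolate across unvoiced/silent frames -> continuous F0 curve.
    f0 = np.interp(frames, frames[voiced], f0[voiced])
    # Log then filter -> smooth intonation contour.
    return uniform_filter1d(np.log(f0), size=9)
```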
When the second user interacts with the first user, the first terminal can play the second user's talk content to the first user in a target timbre determined based on the clothing worn by the second user, combined with the second user's intonation; or in a target timbre determined based on the second user's talk content, combined with the second user's intonation. In this way, the first user not only takes great interest in the second user's talk content but also feels closer to the second user.
In an example, taking the target timbre and the second user's intonation as an example, FIG. 6 shows the flow of the first terminal generating the audio data matching the target timbre, including:
Step 1: the first terminal determines the target timbre template.
The recognized text of the conversation the second user starts with the first user about the Huawei events is "Ren Zhengfei is a remarkable entrepreneur".
Based on the translated text corresponding to the recognized text, the first terminal uses Ren Zhengfei's timbre template as the target timbre template.
Step 2: the first terminal performs feature extraction on the second user's first voice data to obtain the intonation feature.
Step 3: the first terminal performs TTS conversion on the translated text by using the target timbre template in combination with the intonation feature to generate audio data matching the target timbre.
The first terminal plays the second user's talk content "Ren Zhengfei is a remarkable entrepreneur" to the first user in Ren Zhengfei's timbre, combined with the second user's intonation.
In an embodiment, when generating the audio data matching the target timbre, the method further includes: the first terminal performs feature extraction on the first voice data to obtain an emotion feature; and generates audio data matching the target timbre by using the selected target timbre template in combination with the emotion feature and the intonation feature.
The emotion feature can characterize the emotion the second user expresses while talking, for example anger, fear, or sadness.
Specifically, performing emotion feature extraction on the first voice data to obtain the emotion feature may include: extracting formant features from the voice data, and recognizing the user's emotion feature based on the extracted formant features.
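One common way to obtain formant features is from the roots of an LPC polynomial, as sketched below; the classifier at the end is a placeholder, since the embodiments do not fix how formants map to emotions.

```python
# Sketch: LPC-based formant estimation as the emotion feature, followed by a
# hypothetical rule standing in for a trained emotion classifier.
import librosa
import numpy as np

def formant_features(frame: np.ndarray, sr: int = 16000, order: int = 12):
    a = librosa.lpc(frame, order=order)            # LPC coefficients
    roots = [r for r in np.roots(a) if np.imag(r) > 0]
    freqs = sorted(np.angle(r) * sr / (2 * np.pi) for r in roots)
    return freqs[:3]                               # F1..F3 as emotion features

def classify_emotion(formants) -> str:
    # Placeholder rule; a trained classifier would go here.
    return "excited" if formants and formants[0] > 600 else "neutral"
```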
When the second user interacts with the first user, the first terminal can play the second user's talk content to the first user in a target timbre determined based on the clothing worn by the second user, combined with the second user's intonation and emotion; or in a target timbre determined based on the second user's talk content, combined with the second user's intonation and emotion. In this way, the first user not only takes great interest in the second user's talk content but also becomes curious about the second user.
In an example, taking the target timbre and the second user's intonation and emotion as an example, FIG. 7 shows the flow of the first terminal generating the audio data matching the target timbre, including:
Step 1: the first terminal determines the target timbre template.
The recognized text of the conversation the second user starts with the first user about the Huawei events is "Ren Zhengfei is a remarkable entrepreneur".
Based on the translated text corresponding to the recognized text, the first terminal uses Ren Zhengfei's timbre template as the target timbre template.
Step 2: the first terminal performs feature extraction on the second user's first voice data to obtain the intonation and emotion features.
Step 3: the first terminal performs TTS conversion on the translated text by using the target timbre template in combination with the intonation and emotion features to generate audio data matching the target timbre.
The first terminal plays the second user's talk content "Ren Zhengfei is a remarkable entrepreneur" to the first user in Ren Zhengfei's timbre, combined with the second user's intonation and emotion.
In an embodiment, when generating the audio data matching the target timbre, the method further includes:
the first terminal performs feature extraction on the first voice data to obtain a speech-rate feature;
audio data matching the target timbre is generated by using the selected target timbre template in combination with the speech-rate feature, the emotion feature, and the intonation feature.
The speech-rate feature characterizes the number of words the second user speaks per unit time.
Here, the process of the first terminal performing feature extraction on the second user's first voice data to obtain the speech-rate feature includes: counting, based on the first voice data, the number of words per unit time, and obtaining the speech-rate feature based on the counted number of words.
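This speech-rate feature is a simple ratio, for example:

```python
# Minimal sketch of the speech-rate feature: recognized words per second of
# audio. The word list would come from the speech-recognition step above.
def speech_rate(recognized_words: list[str], duration_seconds: float) -> float:
    return len(recognized_words) / max(duration_seconds, 1e-6)

# e.g. 6 words spoken over 2.5 seconds -> 2.4 words per second
rate = speech_rate("Ren Zhengfei is a remarkable entrepreneur".split(), 2.5)
```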
When the second user interacts with the first user, the first terminal can play the second user's talk content to the first user in a target timbre determined based on the clothing worn by the second user, combined with the second user's intonation, emotion, and speech rate; or in a target timbre determined based on the second user's talk content, combined with the second user's intonation, emotion, and speech rate. In this way, the first user not only takes great interest in the second user's talk content but also becomes curious about the second user.
In an example, taking the target timbre and the second user's intonation, emotion, and speech rate as an example, FIG. 8 shows the flow of the first terminal generating the audio data matching the target timbre, including:
Step 1: the first terminal determines the target timbre template.
The recognized text of the conversation the second user starts with the first user about the Huawei events is "Ren Zhengfei is a remarkable entrepreneur".
Based on the translated text corresponding to the recognized text, the first terminal uses Ren Zhengfei's timbre template as the target timbre template.
Step 2: the first terminal performs feature extraction on the second user's first voice data to obtain the intonation, emotion, and speech-rate features.
Step 3: the first terminal performs TTS conversion on the translated text by using the target timbre template in combination with the intonation, emotion, and speech-rate features to generate audio data matching the target timbre.
The first terminal plays the second user's talk content "Ren Zhengfei is a remarkable entrepreneur" to the first user in Ren Zhengfei's timbre, combined with the second user's intonation, emotion, and speech rate.
In an embodiment, the number of target timbre templates determined by the first terminal according to the second user's recognized text is at least two. When generating the audio data matching the target timbres, the method further includes: the first terminal segments the translated text according to the number of target timbre templates to obtain at least two passages; performs TTS conversion on the at least two passages respectively by using the at least two target timbre templates to obtain at least two audio clips; and splices the at least two audio clips to obtain the audio data.
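A sketch of this segment-convert-splice path follows; tts stands for the single-template synthesis step described earlier, and the even character-based split is an assumption (any segmentation policy would do).

```python
# Sketch: split the translated text into one segment per target timbre
# template, synthesize each segment with its template, splice the clips.
import numpy as np

def multi_timbre_tts(translated_text: str, template_ids: list[str], tts) -> np.ndarray:
    n = len(template_ids)
    step = max(1, len(translated_text) // n)
    segments = [translated_text[i * step: (i + 1) * step if i < n - 1 else None]
                for i in range(n)]                 # one passage per template
    clips = [tts(seg, tid) for seg, tid in zip(segments, template_ids)]
    return np.concatenate(clips)                   # splice the audio clips
```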
When the second user interacts with the first user, the first terminal can determine multiple target timbres based on the second user's talk content and play the second user's talk content to the first user in the multiple target timbres; in this way, the first user becomes very curious about the second user's talk content.
In an example, taking multiple target timbres as an example, FIG. 9 shows the flow of the first terminal generating the audio data matching the target timbres, including:
Step 1: the first terminal determines multiple target timbre templates.
The recognized text of the conversation the second user starts with the first user about the Huawei events is "Ren Zhengfei is a remarkable entrepreneur; Andy Lau is one of my favorite stars".
Based on the translated text corresponding to the recognized text, the first terminal uses Ren Zhengfei's and Andy Lau's timbre templates as the target timbre templates.
Step 2: the first terminal translates the second user's first voice data to obtain the translated text, and segments the translated text into two passages.
Step 3: the first terminal performs TTS conversion on the two passages by using the two target timbre templates to generate two audio clips.
Step 4: the two audio clips are spliced to obtain the audio data.
Step 104: output the audio data.
Here, the first terminal can play the audio matching the target timbre through an audio output module, which may be implemented by the speaker of the first terminal.
Specifically, when the second user interacts with the first user, the first terminal does not play the second user's talk content to the first user in the second user's own timbre, but in a target timbre determined based on the clothing worn by the second user.
When the second user interacts with the first user, the first terminal does not play the second user's talk content to the first user in the second user's own timbre, but in a target timbre determined based on the second user's talk content.
When the second user interacts with the first user, the first terminal does not play the second user's talk content to the first user in the second user's own timbre, but in a target timbre selected by the first user.
In an example, taking offline conversion as an example, FIG. 10a shows the flow of the first terminal playing the second user's talk content in the target timbre, including:
Step 1: the second terminal sends the second user's first voice data and first image data to the first terminal.
As shown in FIG. 11, the second terminal captures the second user's audio, such as "hello, where are you from?", with the microphone to obtain the first voice data, and captures the Mickey Mouse costume worn by the second user with the camera to obtain the first image data; the first voice data and the first image data are sent to the first terminal through the wireless transmission module.
Step 2: the first terminal translates the first voice data to obtain the translated text, and performs image recognition on the first image data to obtain the recognition result.
Step 3: the first terminal determines the target timbre template by using the translated text and/or the recognition result.
Step 4: the first terminal selects the target timbre template from the timbre template database, and converts the translated text into audio data matching the target timbre by using the selected target timbre template.
Step 5: the first terminal outputs the audio data.
As shown in FIG. 11, the first terminal plays the translated text of the second user's talk content, "Little friend, where are you from?", to the first user in the Mickey Mouse timbre.
In an example, taking online conversion as an example, FIG. 10b shows the flow of the first terminal playing the second user's talk content in the target timbre, including:
Step 1: the second terminal sends the second user's first voice data and first image data to the first terminal.
The second terminal captures the second user's audio, such as "hello, where are you from?", with the microphone to obtain the first voice data, and captures the Mickey Mouse costume worn by the second user with the camera to obtain the first image data; the first voice data and the first image data are sent to the first terminal through the wireless transmission module.
Step 2: the first terminal translates the first voice data to obtain the translated text, and performs image recognition on the first image data to obtain the recognition result.
Step 3: the first terminal determines the target timbre template by using the translated text and/or the recognition result.
Step 4: the first terminal selects the target timbre template from the timbre template database, and sends the translated text and the target timbre template to a server.
The server converts the translated text into audio data matching the target timbre by using the target timbre template, and returns the audio data to the first terminal.
Step 5: the first terminal receives the audio data sent by the server and outputs it.
The first terminal plays the translated text of the second user's talk content, "Little friend, where are you from?", to the first user in the Mickey Mouse timbre.
Here, the translated text may be converted into audio data matching the target timbre by the first terminal, or by the server; both online conversion and offline conversion are supported, making the implementation more flexible.
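The online path can be sketched as a simple request to a conversion server; the endpoint and payload shape below are hypothetical, and local_tts stands for the offline converter on the first terminal.

```python
# Sketch of choosing between online (server-side) and offline (on-terminal)
# conversion. The server URL and JSON fields are placeholders.
import requests

def convert_online(translated_text: str, template_id: str) -> bytes:
    resp = requests.post(
        "https://example.com/tts/convert",        # placeholder endpoint
        json={"text": translated_text, "timbre_template": template_id},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.content                           # audio data matching the timbre

def convert(translated_text: str, template_id: str, online: bool, local_tts):
    return (convert_online(translated_text, template_id)
            if online else local_tts(translated_text, template_id))
```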
It should be understood that the order in which the steps are described in the above embodiments does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
With the data processing method provided by the embodiments of the present application, the first terminal receives the data to be processed sent by the second terminal; translates the first voice data in the data to be processed to obtain a translated text; performs image recognition on the first image data in the data to be processed to obtain a recognition result; determines a target timbre template by using the translated text and/or the recognition result; selects the target timbre template from a timbre template database; converts the translated text into audio data matching the target timbre by using the selected target timbre template (which may include online conversion and offline conversion); and outputs the audio data. The talk content of the second user using the second terminal can thus be played, in the target timbre, to the first user using the first terminal, arousing the first user's strong interest in the second user's talk content and enabling voice-changed communication between the second user and the first user.
To implement the data processing method of the embodiments of the present application, an embodiment of the present application further provides a data processing apparatus, arranged on the first terminal. FIG. 12 is a schematic structural diagram of the data processing apparatus; as shown in FIG. 12, the data processing apparatus includes:
an acquiring unit 121 configured to acquire data to be processed;
a first processing unit 122 configured to translate first voice data in the data to be processed to obtain a translated text, and to perform image recognition on first image data in the data to be processed to obtain a recognition result;
a second processing unit 123 configured to determine a target timbre template by using the translated text and/or the recognition result, select the target timbre template from a timbre template database, and convert the translated text into audio data matching the target timbre by using the selected target timbre template;
an output unit 124 configured to output the audio data.
In an embodiment, the first processing unit 122 is configured to perform speech recognition on the first voice data by using a speech recognition technique to obtain a recognized text,
and to translate the recognized text by using a preset translation model to obtain the translated text.
The translation model is used to translate text in a first language into text in at least one second language, the first language being different from the second language.
In an embodiment, the first processing unit 122 is configured to perform image preprocessing on the first image data in the data to be processed to obtain preprocessed first image data,
extract feature data from the preprocessed first image data,
and perform image recognition on the extracted feature data by using an image recognition technique to obtain the recognition result.
The image preprocessing includes data augmentation, normalization, and the like of the first image data.
In an embodiment, the second processing unit 123 is configured to search the translated text for a first text corresponding to a preset character string,
and, when the first text corresponding to the preset character string is found in the translated text, to determine the target timbre template based on the first text.
In an embodiment, the second processing unit 123 is configured to judge whether the recognition result indicates that the first image corresponding to the first image data matches a preset image,
and, when the recognition result indicates that the first image corresponding to the first image data matches the preset image, to determine the target timbre template based on the first image.
In an embodiment, the second processing unit 123 is configured to acquire the target language of the receiver of the audio data,
judge whether the language corresponding to the translated text and the target language belong to the same language,
and, when it is determined that they belong to the same language, convert the translated text into audio data matching the target timbre by using the selected target timbre template.
In an embodiment, the second processing unit 123 is configured to, when the language corresponding to the translated text and the target language belong to different languages, convert the translated text into text matching the target language, and to convert the converted text into audio data matching the target timbre by using the selected target timbre template.
In an embodiment, the apparatus further includes:
a generating unit configured to collect at least two pieces of voice data and use the at least two pieces of voice data as training data;
input the training data at the input layer of a convolutional neural network, and perform input-to-output mapping on the training data at at least one feature-extraction layer of the convolutional neural network to obtain at least two pieces of feature data;
obtain at least two timbre templates based on the at least two pieces of feature data;
and generate a timbre template database based on the at least two timbre templates.
In an embodiment, the second processing unit 123 is configured to:
perform TTS conversion on the translated text by using the selected target timbre template to generate audio data matching the target timbre.
In an embodiment, the second processing unit 123 is configured to:
perform feature extraction on the first voice data to obtain an intonation feature;
and perform TTS conversion on the translated text by using the selected target timbre template in combination with the intonation feature to generate audio data matching the target timbre.
In an embodiment, the second processing unit 123 is configured to:
perform feature extraction on the first voice data to obtain an emotion feature;
and generate audio data matching the target timbre by using the selected target timbre template in combination with the emotion feature and the intonation feature.
In an embodiment, the second processing unit 123 is configured to:
perform feature extraction on the first voice data to obtain a speech-rate feature;
and generate audio data matching the target timbre by using the selected target timbre template in combination with the speech-rate feature, the emotion feature, and the intonation feature.
In an embodiment, the second processing unit 123 is configured to:
segment the translated text according to the number of target timbre templates to obtain at least two passages;
perform TTS conversion on the at least two passages respectively by using the at least two target timbre templates to obtain at least two audio clips;
and splice the at least two audio clips to obtain the audio data.
In practical applications, the acquiring unit 121 and the output unit 124 may be implemented by a communication interface in the data processing apparatus; the first processing unit 122 and the second processing unit 123 may both be implemented by a processor in the data processing apparatus.
It should be noted that, when the apparatus provided by the above embodiment performs data processing, the division into the above program modules is only used as an example; in practical applications, the above processing may be assigned to different program modules as needed, i.e., the internal structure of the terminal may be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided by the above embodiment and the data processing method embodiments belong to the same concept; for the specific implementation process, refer to the method embodiments, which will not be repeated here.
Based on the hardware implementation of the above device, an embodiment of the present application further provides a data processing apparatus arranged on the first terminal. FIG. 13 is a schematic diagram of the hardware structure of the data processing apparatus. As shown in FIG. 13, the data processing apparatus 130 includes a memory 133, a processor 132, and a computer program stored in the memory 133 and executable on the processor 132; when the processor 132 of the data processing apparatus executes the program, the method provided by one or more of the technical solutions on the data processing apparatus side is implemented.
Specifically, when the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: receiving data to be processed sent by the second terminal, the data to be processed being acquired by the second terminal; translating the data to be processed to obtain a translated text; selecting a target timbre template from a timbre template database; converting the translated text into audio data matching the target timbre by using the selected target timbre template; and outputting the audio data.
When the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: acquiring a target language matching the language of the receiver of the audio data;
judging whether the language corresponding to the translated text is the same as the target language;
and, when it is determined that the language corresponding to the translated text is the same as the target language, converting the translated text into audio data matching the target timbre by using the selected target timbre template.
When the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: acquiring a target language matching the language of the receiver of the audio data;
when the language corresponding to the translated text is different from the target language, converting the translated text into text matching the target language;
and converting the converted text into audio data matching the target timbre by using the selected target timbre template.
When the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: collecting at least one piece of voice data;
extracting, for each piece of the at least one piece of voice data, the feature data of the corresponding voice data;
determining the corresponding timbre template based on the extracted feature data and a neural network model;
and generating a timbre template database based on the determined timbre templates.
When the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: performing text-to-speech (TTS) conversion on the translated text by using the selected target timbre template to generate audio data matching the target timbre.
When the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: sending the translated text and the target timbre template to a data processing apparatus, the translated text and the target timbre template being used by that data processing apparatus to perform TTS conversion on the translated text and generate audio data matching the target timbre;
and receiving the audio data matching the target timbre sent by that data processing apparatus.
When the processor 132 of the data processing apparatus 130 executes the program, the following is implemented: the audio data is output synchronously as the data to be processed is acquired.
It should be noted that the specific steps implemented when the processor 132 of the data processing apparatus 130 executes the program have been described in detail above and will not be repeated here.
It can be understood that the data processing apparatus further includes a communication interface 131; the components of the data processing apparatus are coupled together through a bus system 134. It can be understood that the bus system 134 is configured to realize connection and communication between these components. Besides a data bus, the bus system 134 also includes a power bus, a control bus, and a status signal bus.
It can be understood that the memory 133 in this embodiment may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a ferromagnetic random access memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memory.
The methods disclosed in the above embodiments of the present application may be applied to the processor 132 or implemented by the processor 132. The processor 132 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 132 or by instructions in the form of software. The above processor 132 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 132 may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium; the storage medium is located in the memory, and the processor 132 reads the information in the memory and completes the steps of the foregoing methods in combination with its hardware.
An embodiment of the present application further provides a storage medium, specifically a computer storage medium, more specifically a computer-readable storage medium, on which computer instructions, i.e., a computer program, are stored; when the computer instructions are executed by a processor, the method provided by one or more of the technical solutions on the data processing apparatus side is implemented.
In the several embodiments provided in the present application, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other division manners in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of the present application may all be integrated into a single processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the above integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
A person of ordinary skill in the art can understand that all or part of the steps of the above method embodiments may be completed by a program instructing the relevant hardware; the foregoing program may be stored in a computer-readable storage medium and, when executed, performs the steps including those of the above method embodiments. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence, or the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a data processing apparatus, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
In addition, the technical solutions described in the embodiments of the present application may be combined arbitrarily without conflict.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present application shall be covered by the protection scope of the present application.

Claims (12)

  1. A data processing method, comprising:
    acquiring data to be processed;
    translating first voice data in the data to be processed to obtain a translated text, and performing image recognition on first image data in the data to be processed to obtain a recognition result;
    determining a target timbre template by using the translated text and/or the recognition result;
    selecting the target timbre template from a timbre template database, and converting the translated text into audio data matching a target timbre by using the selected target timbre template;
    outputting the audio data.
  2. The method according to claim 1, wherein converting the translated text into audio data matching the target timbre by using the selected target timbre template comprises:
    acquiring a target language of a receiver of the audio data;
    judging whether a language corresponding to the translated text and the target language belong to the same language;
    when it is determined that the language corresponding to the translated text and the target language belong to the same language, converting the translated text into audio data matching the target timbre by using the selected target timbre template.
  3. The method according to claim 2, wherein the method further comprises:
    when the language corresponding to the translated text and the target language belong to different languages, converting the translated text into text matching the target language;
    converting the converted text into audio data matching the target timbre by using the selected target timbre template.
  4. The method according to any one of claims 1 to 3, wherein determining the target timbre template by using the translated text comprises:
    searching the translated text for a first text corresponding to a preset character string;
    when the first text corresponding to the preset character string is found in the translated text, determining the target timbre template based on the first text.
  5. The method according to any one of claims 1 to 3, wherein determining the target timbre template by using the recognition result comprises:
    judging whether the recognition result indicates that a first image corresponding to the first image data matches a preset image;
    when the recognition result indicates that the first image corresponding to the first image data matches the preset image, determining the target timbre template based on the first image.
  6. The method according to claim 1, wherein converting the translated text into audio data matching the target timbre by using the selected target timbre template comprises:
    performing text-to-speech (TTS) conversion on the translated text by using the selected target timbre template to generate audio data matching the target timbre.
  7. The method according to claim 6, wherein, when generating the audio data matching the target timbre, the method further comprises:
    performing feature extraction on the first voice data to obtain an intonation feature;
    performing TTS conversion on the translated text by using the selected target timbre template in combination with the intonation feature to generate audio data matching the target timbre.
  8. The method according to claim 7, wherein, when generating the audio data matching the target timbre, the method further comprises:
    performing feature extraction on the first voice data to obtain an emotion feature;
    generating audio data matching the target timbre by using the selected target timbre template in combination with the emotion feature and the intonation feature.
  9. The method according to claim 8, wherein, when generating the audio data matching the target timbre, the method further comprises:
    performing feature extraction on the first voice data to obtain a speech-rate feature;
    generating audio data matching the target timbre by using the selected target timbre template in combination with the speech-rate feature, the emotion feature, and the intonation feature.
  10. A data processing apparatus, comprising:
    an acquiring unit configured to acquire data to be processed;
    a first processing unit configured to translate first voice data in the data to be processed to obtain a translated text, and to perform image recognition on first image data in the data to be processed to obtain a recognition result;
    a second processing unit configured to determine a target timbre template by using the translated text and/or the recognition result, select the target timbre template from a timbre template database, and convert the translated text into audio data matching a target timbre by using the selected target timbre template;
    an output unit configured to output the audio data.
  11. A data processing apparatus, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 9 when executing the program.
  12. A storage medium on which computer instructions are stored, wherein the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 9.
PCT/CN2019/120706 2019-11-25 2019-11-25 Data processing method, apparatus and storage medium WO2021102647A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/120706 WO2021102647A1 (zh) 2019-11-25 2019-11-25 Data processing method, apparatus and storage medium
CN201980100970.XA CN114514576A (zh) 2019-11-25 2019-11-25 Data processing method, apparatus and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/120706 WO2021102647A1 (zh) 2019-11-25 2019-11-25 Data processing method, apparatus and storage medium

Publications (1)

Publication Number Publication Date
WO2021102647A1 true WO2021102647A1 (zh) 2021-06-03

Family

ID=76129013

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/120706 WO2021102647A1 (zh) 2019-11-25 2019-11-25 Data processing method, apparatus and storage medium

Country Status (2)

Country Link
CN (1) CN114514576A (zh)
WO (1) WO2021102647A1 (zh)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130338997A1 (en) * 2007-03-29 2013-12-19 Microsoft Corporation Language translation of visual and audio input
CN107992485A (zh) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 Simultaneous interpretation method and apparatus
CN108231062A (zh) * 2018-01-12 2018-06-29 科大讯飞股份有限公司 Speech translation method and apparatus
CN110415680A (zh) * 2018-09-05 2019-11-05 满金坝(深圳)科技有限公司 Simultaneous interpretation method, simultaneous interpretation apparatus, and electronic device
CN109543021A (zh) * 2018-11-29 2019-03-29 北京光年无限科技有限公司 Story data processing method and system for an intelligent robot
CN109658916A (zh) * 2018-12-19 2019-04-19 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, storage medium, and computer device
CN110401671A (zh) * 2019-08-06 2019-11-01 董玉霞 Simultaneous interpretation system and simultaneous interpretation terminal

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783403A (zh) * 2022-02-18 2022-07-22 腾讯科技(深圳)有限公司 Method and apparatus for generating audiobook, device, storage medium, and program product

Also Published As

Publication number Publication date
CN114514576A (zh) 2022-05-17

Similar Documents

Publication Publication Date Title
Czyzewski et al. An audio-visual corpus for multimodal automatic speech recognition
CN108806656B (zh) Automatic generation of songs
WO2021083071A1 (zh) Voice conversion, file generation, broadcasting, and voice processing methods, devices, and medium
JP6876752B2 (ja) Response method and apparatus
CN108806655B (zh) Automatic generation of songs
CN106898340B (zh) Song synthesis method and terminal
US20200126566A1 (en) Method and apparatus for voice interaction
WO2020253509A1 (zh) Scenario- and emotion-oriented Chinese speech synthesis method, apparatus, and storage medium
JP2019057273A (ja) Method and apparatus for pushing information
JP2023022150A (ja) Bidirectional speech translation system, bidirectional speech translation method, and program
CN110675886B (zh) Audio signal processing method and apparatus, electronic device, and storage medium
WO2019242414A1 (zh) Voice processing method and apparatus, storage medium, and electronic device
CN112840396A (zh) Electronic device for processing user utterance and control method thereof
CN109543021B (zh) Story data processing method and system for an intelligent robot
KR20200027331A (ko) Speech synthesis apparatus
JP2004101901A (ja) Speech dialogue apparatus and speech dialogue program
TW202018696A Speech recognition method, apparatus, and computing device
JP2023527473A (ja) Audio playback method and apparatus, computer-readable storage medium, and electronic device
CN116917984A (zh) Interactive content output
CN116312471A (zh) Voice migration and voice interaction method and apparatus, electronic device, and storage medium
CN114283820A (zh) Multi-role voice interaction method, electronic device, and storage medium
WO2021102647A1 (zh) Data processing method, apparatus and storage medium
US11790913B2 (en) Information providing method, apparatus, and storage medium, that transmit related information to a remote terminal based on identification information received from the remote terminal
JP7333371B2 (ja) Speaker-separation-based automatic interpretation method, user terminal providing a speaker-separation-based automatic interpretation service, and speaker-separation-based automatic interpretation service providing system
KR102584436B1 (ko) System, user terminal, and method for providing an automatic interpretation service based on speaker separation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19954459; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 19954459; Country of ref document: EP; Kind code of ref document: A1)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21.11.2022))