CN113516962A - Voice broadcasting method and device, storage medium and electronic equipment - Google Patents

Voice broadcasting method and device, storage medium and electronic equipment

Info

Publication number
CN113516962A
CN113516962A (application number CN202110378920.7A)
Authority
CN
China
Prior art keywords
voice
audio
broadcast
emotion
electronic equipment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110378920.7A
Other languages
Chinese (zh)
Other versions
CN113516962B (en)
Inventor
楚晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202110378920.7A
Publication of CN113516962A
Application granted
Publication of CN113516962B
Legal status: Active


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The embodiments of the present application disclose a voice broadcasting method and device, a storage medium, and an electronic device. The technical solution provided by the embodiments is applied to a voice cache server. The voice cache server intercepts a first speech synthesis request sent by the electronic device to the speech synthesis server, acquires the first broadcast text carried by the first speech synthesis request, acquires the recorded first emotion audio when such an audio corresponding to the first broadcast text is currently stored, and feeds the first emotion audio back to the electronic device as the first synthesized audio corresponding to the first broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, thereby improving the voice broadcasting effect.

Description

Voice broadcasting method and device, storage medium and electronic equipment
Technical Field
The present application relates to the field of voice processing technologies, and in particular, to a voice broadcasting method, apparatus, storage medium, and electronic device.
Background
At present, voice broadcasting is implemented by speech synthesis, that is, the corresponding voice is synthesized from the broadcast text and played to the user. However, the voice generated by the speech synthesis methods in the related art sounds stiff and mechanical, lacks emotion, and gives a poor broadcasting effect.
Disclosure of Invention
The embodiment of the application provides a voice broadcast method, a voice broadcast device, a storage medium and electronic equipment, which can improve the voice broadcast effect.
In a first aspect, an embodiment of the present application provides a voice broadcast method, which is applied to a voice cache server, and includes:
intercepting a first voice synthesis request sent by electronic equipment to a voice synthesis server;
acquiring a first broadcast text carried by the first voice synthesis request;
when a recorded first emotion audio corresponding to the first broadcast text is stored currently, acquiring the first emotion audio;
and feeding back the first emotion audio to the electronic equipment as a first synthetic audio corresponding to the first broadcast text so as to enable the electronic equipment to perform voice broadcast.
In a second aspect, an embodiment of the present application provides a voice broadcasting method, which is applied to a voice synthesis server, and includes:
receiving a third voice synthesis request sent by the electronic equipment;
acquiring a third broadcast text carried by the third voice synthesis request;
when a recorded third emotion audio corresponding to the third broadcast text is stored currently, acquiring the third emotion audio;
and feeding back the third emotion audio to the electronic equipment as a fifth synthesized audio corresponding to the third broadcast text, so that the electronic equipment can perform voice broadcast.
In a third aspect, an embodiment of the present application provides a voice broadcast device, which is applied to a voice cache server, and includes:
the interception module is used for intercepting a first voice synthesis request sent by the electronic equipment to the voice synthesis server;
the first acquisition module is used for acquiring a first broadcast text carried by the first voice synthesis request;
the second obtaining module is used for obtaining the first emotion audio when the recorded first emotion audio corresponding to the first broadcast text is stored currently;
and the feedback module is used for feeding back the first emotion audio to the electronic equipment as a first synthesized audio corresponding to the first broadcast text so as to enable the electronic equipment to perform voice broadcast.
In a fourth aspect, an embodiment of the present application provides a voice broadcast device, which is applied to a voice synthesis server, and includes:
the receiving module is used for receiving a third voice synthesis request sent by the electronic equipment;
a third obtaining module, configured to obtain a third broadcast text carried by the third voice synthesis request;
the fourth obtaining module is used for obtaining a third emotion audio when the recorded third emotion audio corresponding to the third broadcast text is stored currently;
and the second feedback module is used for feeding the third emotion audio back to the electronic equipment as a fifth synthesized audio corresponding to the third broadcast text, so that the electronic equipment can perform voice broadcast.
In a fifth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, and when the computer program runs on a computer, the computer is caused to execute the voice broadcasting method provided in any embodiment of the present application.
In a sixth aspect, an embodiment of the present application further provides an electronic device, which includes a processor and a memory, where the memory has a computer program, and the processor is configured to execute the voice broadcast method provided in any embodiment of the present application by calling the computer program.
The technical solution provided by the embodiments of the present application is applied to a voice cache server. The voice cache server intercepts a first speech synthesis request sent by the electronic device to the speech synthesis server and acquires the first broadcast text carried by the request. When a recorded first emotion audio corresponding to the first broadcast text is currently stored, the first emotion audio is acquired and fed back to the electronic device as the first synthesized audio corresponding to the first broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, so that the broadcast voice carries emotional expression such as happiness, excitement, sadness or anger, thereby improving the voice broadcasting effect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a first flowchart of a voice broadcast method according to an embodiment of the present application.
Fig. 2 is a schematic view of a first application scenario of a voice broadcast method according to an embodiment of the present application.
Fig. 3 is a second flowchart of a voice broadcast method according to an embodiment of the present application.
Fig. 4 is a third flowchart illustrating a voice broadcast method according to an embodiment of the present application.
Fig. 5 is a schematic view of a second application scenario of a voice broadcast method according to an embodiment of the present application.
Fig. 6 is a first schematic structural diagram of a voice broadcast device provided in an embodiment of the present application.
Fig. 7 is a second schematic structural diagram of a voice broadcast device according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a first electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without inventive step, are within the scope of the present application.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
The embodiments of the present application provide a voice broadcasting method. The execution subject of the voice broadcasting method may be the voice broadcasting device provided by the embodiments of the present application, or a server integrated with the voice broadcasting device, where the voice broadcasting device may be implemented in hardware or software.
An embodiment of the present application provides a voice broadcasting method, please refer to fig. 1, where the voice broadcasting method is applied to a voice cache server, and may include the following steps:
101. a first voice synthesis request sent by the electronic equipment to the voice synthesis server is intercepted.
The voice synthesis request refers to a voice synthesis request sent to the voice synthesis server after the electronic device acquires the broadcast text, so that the voice synthesis server synthesizes a corresponding synthesis audio according to the broadcast text carried by the voice synthesis request and feeds the synthesis audio back to the electronic device to be broadcast as the broadcast audio.
The electronic device here is a device running the voice assistant client, and may be, for example, a smartphone or a tablet computer; it interacts with the voice cache server and the voice synthesis server provided by this solution.
According to the scheme, a first voice synthesis request sent by the electronic equipment to the voice synthesis server is intercepted through the voice cache server.
102. And acquiring a first broadcast text carried by the first voice synthesis request.
The broadcast text is the text content corresponding to the audio to be broadcast, and is obtained after the voice instruction issued by the user is recognized.
For example, if the audio of the user's instruction is "Can you sing 'Sweet Honey' for me?", the corresponding broadcast text may be the lyrics of the song "Sweet Honey"; if the audio of the user's instruction is "Tell me a joke", the corresponding broadcast text may be the text of a joke, and so on.
Generally speaking, voice interaction includes the following steps:
(1) the user wakes up the voice assistant client through the wake-up word and issues an instruction through voice;
(2) the voice assistant client sends the received audio of the instruction issued by the user to the voice assistant server;
(3) the voice assistant server recognizes the audio of the user's instruction as text;
(4) the voice assistant server performs semantic processing on the recognized text to obtain a corresponding intention result;
(5) the voice assistant server further processes the intention result to obtain a broadcast text;
(6) the voice assistant server side sends the broadcast text to the voice assistant client side;
(7) after receiving the broadcast text, the voice assistant client sends the broadcast text to a server for speech synthesis; the server synthesizes the corresponding audio and feeds it back to the voice assistant client;
(8) the voice assistant client plays the received synthesized audio to the user as the broadcast audio.
It is understood that the solution of the present application concerns steps (7) and (8) above.
In the embodiment of the application, for example, after the voice cache server intercepts the first voice synthesis request, the first broadcast text carried by the first voice synthesis request is acquired.
103. When the recorded first emotion audio corresponding to the first broadcast text is stored currently, the first emotion audio is acquired.
The emotion audio refers to emotionally rich audio recorded by a real person; for example, it can be a real human voice with emotional expression such as happiness, excitement, sadness, or anger.
In the embodiment of the application, for example, emotion audios and/or synthesized audios corresponding to different broadcast texts are stored in the voice cache server, the voice cache server may detect whether a first emotion audio corresponding to a first broadcast text is stored in the voice cache server, and if the first emotion audio corresponding to the first broadcast text is stored in the voice cache server, the first emotion audio is acquired.
104. The first emotion audio is fed back to the electronic equipment as first synthetic audio corresponding to the first broadcast text, so that the electronic equipment can broadcast the first emotion audio in a voice mode.
It is understood that, referring to fig. 2, the electronic device sends the voice synthesis request to the voice synthesis server to synthesize the synthesized audio corresponding to the broadcasted text through the voice synthesis server, and in the embodiment of the present application, the voice cache server intercepts the first voice synthesis request originally sent to the voice synthesis server. Therefore, when the voice cache server detects that the first emotion audio corresponding to the first broadcast text carried in the first voice synthesis request is stored, the first emotion audio is acquired, is used as the first synthesis audio corresponding to the first broadcast text and is fed back to the electronic equipment so as to be used for voice broadcast of the electronic equipment.
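As an illustrative, non-limiting sketch of the cache-hit path of steps 101 to 104 (all function and variable names below, such as EMOTION_AUDIO_CACHE, are assumptions of this description rather than elements of the application), the behavior may be summarized as follows:

```python
# Minimal sketch of the voice cache server's cache-hit path (steps 101-104);
# all names here are illustrative, not taken from the application.
EMOTION_AUDIO_CACHE = {}  # first broadcast text -> recorded first emotion audio (bytes)

def handle_intercepted_request(request):
    """Handle an intercepted first speech synthesis request."""
    broadcast_text = request["broadcast_text"]                # step 102
    emotion_audio = EMOTION_AUDIO_CACHE.get(broadcast_text)   # step 103
    if emotion_audio is not None:
        # Step 104: feed the recorded emotion audio back to the electronic
        # device as the first synthesized audio for this broadcast text.
        return {"synthesized_audio": emotion_audio}
    return None  # cache miss: forwarded to the speech synthesis server (see later sketch)
```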
In particular implementation, the present application is not limited by the execution sequence of the described steps, and some steps may be performed in other sequences or simultaneously without conflict.
Therefore, the voice broadcasting method provided by the embodiments of the present application is applied to a voice cache server. The voice cache server intercepts the first speech synthesis request sent by the electronic device to the speech synthesis server, acquires the first broadcast text carried by the first speech synthesis request, acquires the recorded first emotion audio when such an audio corresponding to the first broadcast text is currently stored, and feeds the first emotion audio back to the electronic device as the first synthesized audio corresponding to the first broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, so that the broadcast voice carries emotional expression such as happiness, excitement, sadness or anger, thereby improving the voice broadcasting effect.
The method according to the preceding embodiment is illustrated in further detail below by way of example.
An embodiment of the present application provides a voice broadcasting method, please refer to fig. 3, where the voice broadcasting method is applied to a voice cache server, and may include the following steps:
201. a first voice synthesis request sent by the electronic equipment to the voice synthesis server is intercepted.
In one embodiment, before the step of "intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server", the method may further include the following steps:
(1) acquiring historical broadcast texts that have been carried in speech synthesis requests no fewer than a first preset number of times;
(2) acquiring and storing the recorded emotion audio corresponding to the historical broadcast texts.
The first preset number of times can be set according to actual conditions. For example, the number of requests corresponding to all historical broadcast texts can be counted, and broadcast texts requested more than 10,000 times can be recorded by real voice actors to obtain the corresponding recorded emotion audio; as another example, the broadcast texts requested more than 100,000 times can be recorded by real voice actors to obtain the corresponding recorded emotion audio.
That is, emotion audio recorded with a real human voice is obtained for frequently used historical broadcast texts, and the emotion audio corresponding to these historical broadcast texts is stored. When such a historical broadcast text is requested in a speech synthesis request sent by the electronic device, the corresponding emotion audio is fed back to the electronic device for voice broadcasting.
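A minimal sketch of how such frequently requested historical broadcast texts could be selected for recording is given below; the threshold value and all names are assumptions for illustration only, not elements of the application.

```python
from collections import Counter

FIRST_PRESET_COUNT = 10000  # example threshold mentioned above; an assumption here

def texts_to_record(historical_requests):
    """Return historical broadcast texts requested at least FIRST_PRESET_COUNT times."""
    counts = Counter(req["broadcast_text"] for req in historical_requests)
    return [text for text, n in counts.items() if n >= FIRST_PRESET_COUNT]
```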
In one embodiment, before the step of intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server, the method may further include:
(1) acquiring a keyword of a real-time hotspot;
(2) acquiring a hot text related to the keyword;
(3) and acquiring and storing the recorded emotion audio corresponding to the hot text.
The real-time hotspot is a trending search event on the network. For example, if the trending event is "major archaeological discovery at the Sanxingdui site", the keywords here may be "Sanxingdui site" or "cultural relics unearthed at Sanxingdui", and the hot text corresponding to the keyword "Sanxingdui" may be related text content such as an introduction to the site, its excavation history, the unearthed cultural relics, the important remains, and the historical value of Sanxingdui; this text content is recorded to obtain the corresponding emotion audio.
For example, when the electronic device receives the user's voice "I want to learn about the Sanxingdui site", the corresponding broadcast text may be the text content of the introduction to the Sanxingdui site, and the corresponding broadcast audio is the emotion audio corresponding to that text content.
For another example, if the user continues with the voice "I want to know more about the Sanxingdui site", the corresponding broadcast text may be the text content related to the excavation history, the unearthed cultural relics, the important remains, the historical value, and so on, and the corresponding broadcast audio is the emotion audio corresponding to that text content. It should be noted that the above examples involve semantic recognition and intention recognition of speech, which are not the technical problem to be solved by the present application and are therefore not described in detail here.
Of course, the emotion audio may be obtained in advance in the above manner, or may be obtained in advance in other manners, which is not limited specifically herein.
According to the embodiments of the present application, the emotion audio is acquired in advance, so that when the electronic device requests related content, the corresponding emotion audio is fed back to the electronic device, and the broadcasting effect is better.
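The hotspot-driven pre-caching described above can be sketched as the following pipeline; the three callables are placeholders for whatever services supply trending keywords, related texts, and studio recordings, and are not interfaces defined by the application.

```python
def cache_hotspot_recordings(get_hot_keywords, get_hot_texts, get_recording, emotion_cache):
    """Store recorded emotion audio for texts related to real-time hot topics in advance."""
    for keyword in get_hot_keywords():              # (1) keywords of real-time hotspots
        for text in get_hot_texts(keyword):         # (2) hot texts related to the keyword
            emotion_cache[text] = get_recording(text)  # (3) recorded emotion audio
```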
The first voice synthesis request refers to a voice synthesis request sent by the electronic equipment to the voice synthesis server after the electronic equipment acquires the broadcast text, so that the voice synthesis server synthesizes a corresponding synthesis audio according to the broadcast text carried by the voice synthesis request and feeds the synthesis audio back to the electronic equipment to be used as the broadcast audio for broadcasting.
According to the scheme, a first voice synthesis request sent by the electronic equipment to the voice synthesis server is intercepted through the voice cache server.
202. And acquiring a first broadcast text carried by the first voice synthesis request.
After the voice cache server intercepts the first voice synthesis request, a first broadcast text carried by the first voice synthesis request is obtained.
203. When a first emotion audio which corresponds to the first broadcast text and is matched with the specified voiceprint feature of the first voice synthesis request is stored currently, the first emotion audio is obtained.
As described above, the voice cache server may obtain in advance the emotion audio corresponding to commonly used broadcast texts and the emotion audio corresponding to text content related to real-time hotspots. When the voice cache server intercepts the first speech synthesis request and obtains the first broadcast text carried by it, the voice cache server may detect whether it stores a first emotion audio that corresponds to the first broadcast text and whose voiceprint feature matches the voiceprint feature specified by the first speech synthesis request; if such a first emotion audio is stored, the first emotion audio is obtained.
The voice assistant in the electronic device offers a choice among voices with a plurality of different voiceprint features, such as female voice 1, female voice 2, male voice 1, and male voice 2. If an electronic device uses female voice 1 as the interactive voice of the voice assistant, the speech synthesis request sent by that device carries the voiceprint feature corresponding to female voice 1, which is the specified voiceprint feature, so that the interaction voice remains consistent. It is understood that the pre-acquired emotion audio also includes multiple versions with different voiceprint features; for example, one broadcast text may have emotion audio in female voice 1, female voice 2, male voice 1, and male voice 2.
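One way to realize the voiceprint-aware lookup described above is to key the cache on both the broadcast text and the specified voiceprint feature, as in the illustrative sketch below (the names are assumptions, not elements of the application):

```python
# (broadcast text, voiceprint feature id) -> recorded emotion audio; illustrative only
VOICEPRINT_EMOTION_CACHE = {}

def lookup_emotion_audio(broadcast_text, voiceprint_id):
    """Return the recording only when both the text and the requested voice match."""
    return VOICEPRINT_EMOTION_CACHE.get((broadcast_text, voiceprint_id))
```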
204. The first emotion audio is fed back to the electronic equipment as first synthetic audio corresponding to the first broadcast text, so that the electronic equipment can broadcast the first emotion audio in a voice mode.
As described above, the electronic device sends the voice synthesis request to the voice synthesis server to synthesize the synthesized audio corresponding to the broadcast text through the voice synthesis server, and the voice cache server intercepts the first voice synthesis request originally sent to the voice synthesis server, so that when the voice cache server detects that the first emotion audio corresponding to the first broadcast text carried in the first voice synthesis request is stored, the first emotion audio is obtained and is used as the first synthesized audio corresponding to the first broadcast text and is fed back to the electronic device, so that the electronic device can perform voice broadcast.
In one embodiment, after the step of intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server, the method may further include:
(1) when the recorded first emotion audio corresponding to the first broadcast text is not stored at present, the first voice synthesis request is sent to a voice synthesis server, so that the voice synthesis server carries out audio synthesis processing according to the first broadcast text carried by the first voice synthesis request to generate second synthesized audio, and the second synthesized audio is fed back to the electronic equipment to be provided for the electronic equipment to carry out voice broadcast.
It can be understood that when, after intercepting the first speech synthesis request, the voice cache server detects that no recorded first emotion audio corresponding to the first broadcast text exists in its storage space, the voice cache server sends the first speech synthesis request to the speech synthesis server, so that the speech synthesis server performs audio synthesis on the first broadcast text carried by the first speech synthesis request to obtain the corresponding second synthesized audio, which is then fed back to the electronic device for voice broadcasting.
In one embodiment, the step of "generating the second synthesized audio" may further include:
(1) the second synthesized audio is obtained and stored.
In the embodiment of the application, the voice cache server obtains the second synthesized audio generated by the voice synthesis server, and stores the second synthesized audio for use when a broadcast text corresponding to the second synthesized audio is requested in the next voice interaction.
In one embodiment, the step "after intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server" may further include:
(1) when the first emotion audio corresponding to the recording of the first broadcast text is not stored currently, but the second synthesized audio corresponding to the first broadcast text is stored currently, the second synthesized audio is fed back to the electronic equipment so that the electronic equipment can broadcast the second synthesized audio.
It can be understood that, in the embodiment of the present application, the voice cache server intercepts a first voice synthesis request originally sent to the voice synthesis server, and after the voice cache server intercepts the first voice synthesis request, the voice cache server may detect whether a first emotion audio corresponding to a first broadcast text is stored in the voice cache server, and if the first emotion audio corresponding to the first broadcast text is stored in the voice cache server, the first emotion audio is obtained; if the first emotion audio is not stored, whether a second synthesized audio corresponding to the first broadcast text is stored or not can be detected, and if the second synthesized audio exists, the second synthesized audio is fed back to the electronic equipment so that the electronic equipment can perform voice broadcast.
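The cache-miss branch just described can be sketched as follows; `synthesize` stands in for forwarding the request to the speech synthesis server and is an assumption of this sketch, not an interface defined by the application.

```python
def handle_cache_miss(request, synth_cache, synthesize):
    """No recorded first emotion audio is stored for this broadcast text."""
    text = request["broadcast_text"]
    if text in synth_cache:
        # A second synthesized audio for this text is already stored: reuse it.
        return {"synthesized_audio": synth_cache[text]}
    audio = synthesize(request)        # forward to the speech synthesis server
    synth_cache[text] = audio          # store the second synthesized audio for reuse
    return {"synthesized_audio": audio}
```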
In one embodiment, the step of "obtaining and storing the second synthesized audio" may further include:
(1) if the number of requests for the second synthesized audio within a preset duration is not greater than a second preset number of times, deleting the second synthesized audio.
In the embodiments of the present application, the emotion audio and the synthesized audio generated by the speech synthesis server are stored in the voice cache server, and when the number of requests for a synthesized audio within the preset duration is not greater than the second preset number of times, that synthesized audio is deleted. In other words, synthesized audio that has rarely been requested for a long time is deleted, reducing the load on the storage space.
The preset duration and the second preset number of times can be set according to actual conditions. For example, the preset duration may be one month, one week, or three days, and the second preset number of times may be 0, 10, or 20.
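A possible realization of this eviction rule, under the assumption that the cache server keeps a per-text log of request timestamps, is sketched below; the concrete duration and count, like the names, are only examples.

```python
import time

PRESET_DURATION = 7 * 24 * 3600   # e.g. one week, in seconds
SECOND_PRESET_COUNT = 10          # e.g. the second preset number of times

def evict_cold_synthesized_audio(synth_cache, request_log, now=None):
    """Delete synthesized audio requested no more than SECOND_PRESET_COUNT times
    within PRESET_DURATION; request_log maps broadcast text -> request timestamps."""
    now = time.time() if now is None else now
    for text in list(synth_cache):
        recent = [t for t in request_log.get(text, []) if now - t <= PRESET_DURATION]
        if len(recent) <= SECOND_PRESET_COUNT:
            del synth_cache[text]
```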
In an embodiment, after the step of "using the first emotion audio as the first synthesized audio corresponding to the first broadcast text, feeding back the first synthesized audio to the electronic device for voice broadcast by the electronic device", the method may further include:
(1) intercepting a second voice synthesis request sent by the electronic equipment to the voice synthesis server, wherein the second voice synthesis request carries a second broadcast text;
(2) acquiring a second emotion audio corresponding to a part of text in the second broadcast text and acquiring a third synthesized audio corresponding to another part of text in the second broadcast text;
(3) splicing the second emotion audio and the third synthesized audio to obtain a spliced audio;
(4) and feeding back the spliced audio to the electronic equipment as a fourth synthetic audio corresponding to the second broadcast text, so that the electronic equipment can perform voice broadcast.
For example, the audio of the user's instruction may be "How is the weather today?", and the corresponding second broadcast text may be the text describing today's weather, for example, "Nanshan, Shenzhen is cloudy turning to rain today, 23 to 31 degrees; don't forget to take an umbrella." One part of the second broadcast text, "don't forget to take an umbrella", can be a second emotion audio recorded in advance, for example with a lively or reminding tone. The other part of the second broadcast text, "Nanshan, Shenzhen is cloudy turning to rain today, 23 to 31 degrees", is the broadcast text obtained from the specific weather conditions of the day and can be synthesized by the speech synthesis server to obtain the corresponding third synthesized audio. The second emotion audio and the third synthesized audio are then spliced to obtain the fourth synthesized audio, which is fed back to the electronic device for voice broadcasting.
It will be appreciated that the voiceprint features of the second emotion audio and the third synthesized audio are consistent. In addition, the spliced audio may be formed from an emotion audio and a synthesized audio, from two emotion audios, or from two synthesized audios, which is not specifically limited here.
In the embodiments of the present application, when audio is spliced, a complete sentence can be split at the positions of punctuation marks. For example, a second broadcast text such as "I can't sing in tune yet, and I don't dare ask you to listen; how about I sing 'Sweet Honey' for you?" can be split into the segments "I can't sing in tune yet", "and I don't dare ask you to listen", and "how about I sing 'Sweet Honey' for you?". The emotion audio or synthesized audio stored in the voice cache server for each split segment of the second broadcast text is then obtained; if some segments do not exist in the voice cache server, speech synthesis can be performed for them by the speech synthesis server. Once the audio for all segments of the second broadcast text has been obtained, audio splicing is performed to obtain the corresponding fourth synthesized audio.
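A simplified sketch of this splitting-and-splicing flow is given below; splitting on punctuation with a regular expression and concatenating raw bytes are simplifying assumptions, since real audio splicing would need to respect the audio format, and all names are illustrative.

```python
import re

def build_spliced_audio(broadcast_text, emotion_cache, synth_cache, synthesize):
    """Split the broadcast text at punctuation, fetch or synthesize audio for each
    segment, and concatenate the pieces into the fourth synthesized audio."""
    segments = [s.strip() for s in re.split(r"[，。！？；,.!?;]", broadcast_text) if s.strip()]
    pieces = []
    for segment in segments:
        audio = emotion_cache.get(segment) or synth_cache.get(segment)
        if audio is None:
            audio = synthesize(segment)   # fall back to the speech synthesis server
            synth_cache[segment] = audio
        pieces.append(audio)
    return b"".join(pieces)               # placeholder for real audio splicing
```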
Therefore, the voice broadcasting method provided by the embodiments of the present application is applied to a voice cache server. The voice cache server intercepts the first speech synthesis request sent by the electronic device to the speech synthesis server, acquires the first broadcast text carried by the first speech synthesis request, acquires the recorded first emotion audio when such an audio corresponding to the first broadcast text is currently stored, and feeds the first emotion audio back to the electronic device as the first synthesized audio corresponding to the first broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, thereby improving the voice broadcasting effect.
An embodiment of the present application further provides a voice broadcasting method, please refer to fig. 4, where the voice broadcasting method is applied to a voice synthesis server, and may include the following steps:
301. and receiving a third voice synthesis request sent by the electronic equipment.
The voice cache server and the voice synthesis server in the foregoing embodiment are two completely independent modules, and the voice synthesis server in this embodiment integrates all functions of the voice cache server and the voice synthesis server in the foregoing embodiment.
In this embodiment, the speech synthesis server receives a third speech synthesis request sent by the electronic device.
The voice synthesis request refers to a voice synthesis request sent to the voice synthesis server after the electronic device acquires the broadcast text, so that the voice synthesis server synthesizes a corresponding synthesis audio according to the broadcast text carried by the voice synthesis request and feeds the synthesis audio back to the electronic device to be broadcast as the broadcast audio.
302. And acquiring a third broadcast text carried by the third voice synthesis request.
The broadcast text is the text content corresponding to the audio to be broadcast, and is obtained after the voice instruction issued by the user is recognized.
In this embodiment, for example, after receiving the third speech synthesis request, the speech synthesis server obtains a third broadcast text carried in the third speech synthesis request.
303. And when the recorded third emotion audio corresponding to the third broadcast text is currently stored, acquiring the third emotion audio.
In this embodiment, for example, the speech synthesis server stores emotion audios and/or synthesized audios corresponding to different broadcast texts, and the speech synthesis server may detect whether a third emotion audio corresponding to a third broadcast text is stored in the speech synthesis server, and if the third emotion audio corresponding to the third broadcast text is stored in the speech synthesis server, obtain the third emotion audio.
304. And feeding back the third emotion audio to the electronic equipment as a fifth synthesized audio corresponding to the third broadcast text so as to enable the electronic equipment to perform voice broadcast.
In this embodiment, referring to fig. 5, when the voice synthesis server detects that a third emotion audio corresponding to a third broadcast text carried in a third voice synthesis request is stored, the third emotion audio is acquired, and is used as a fifth synthesized audio corresponding to the third broadcast text and fed back to the electronic device, so that the electronic device performs voice broadcast.
It can be understood that the speech synthesis server in this embodiment integrates all functions of the voice cache server and the speech synthesis server in the foregoing embodiments, so the speech synthesis server provided in this embodiment can implement all functions of the voice cache server; since those functions have been described in detail above, they are not repeated here.
Therefore, the voice broadcasting method provided by this embodiment of the present application is applied to a speech synthesis server. The speech synthesis server receives a third speech synthesis request sent by the electronic device, acquires the third broadcast text carried by the third speech synthesis request, acquires the recorded third emotion audio when such an audio corresponding to the third broadcast text is currently stored, and feeds the third emotion audio back to the electronic device as the fifth synthesized audio corresponding to the third broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, thereby improving the voice broadcasting effect.
In one embodiment, a voice broadcasting device is also provided. Referring to fig. 6, fig. 6 is a first schematic structural diagram of a voice broadcasting device 400 according to an embodiment of the present application. The voice broadcasting device 400 is applied to a voice cache server and includes an interception module 401, a first obtaining module 402, a second obtaining module 403, and a first feedback module 404, as follows:
the intercepting module 401 is configured to intercept a first speech synthesis request sent by the electronic device to the speech synthesis server;
a first obtaining module 402, configured to obtain a first broadcast text carried in the first voice synthesis request;
a second obtaining module 403, configured to obtain a first emotion audio when a recorded first emotion audio corresponding to the first broadcast text is currently stored;
a first feedback module 404, configured to feed back the first emotion audio to the electronic device as a first synthesized audio corresponding to the first broadcast text, so that the electronic device performs voice broadcast.
In an implementation manner, the second obtaining module 403 may also be configured to obtain a history broadcast text in which the number of request times carried in the voice synthesis request is not less than a first preset number of times; and acquiring and storing the recorded emotion audio corresponding to the history broadcast text.
In an embodiment, the second obtaining module 403 may be further configured to obtain a keyword of a real-time hotspot; acquiring a hot text related to the keyword; and acquiring and storing the recorded emotion audio corresponding to the hot text.
In an embodiment, the second obtaining module 403 may be further configured to obtain the first emotion audio when the first emotion audio corresponding to the first broadcast text and matching the voiceprint feature with the specified voiceprint feature of the first speech synthesis request is currently stored.
In one embodiment, the second obtaining module 403 may be further configured to obtain and store a second synthesized audio synthesized by the speech synthesis server.
In an embodiment, the voice broadcasting device may further include a deleting module, where the deleting module is configured to delete the second synthesized audio if the number of requests of the second synthesized audio within a preset time period is not greater than a second preset number.
In an implementation manner, the intercepting module 401 may be further configured to intercept a second voice synthesis request sent by the electronic device to the voice synthesis server, where the second voice synthesis request carries a second broadcast text.
In an implementation manner, the second obtaining module 403 may be further configured to obtain a second emotion audio corresponding to a part of the second broadcast text, and obtain a third synthesized audio corresponding to another part of the second broadcast text.
In an embodiment, the apparatus for broadcasting voice may further include a splicing module, where the splicing module may splice the second emotion audio and the third synthesized audio to obtain a spliced audio.
In an implementation manner, the first feedback module 404 may be further configured to feed back the spliced audio to the electronic device as a fourth synthesized audio corresponding to the second broadcast text, so that the electronic device performs voice broadcast.
It should be noted that the voice broadcast device provided in the embodiment of the present application and the voice broadcast method in the embodiment in which the voice cache server and the voice synthesis server are two completely independent modules belong to the same concept, and any method provided in the voice broadcast method embodiment can be implemented by the voice broadcast device, and a specific implementation process thereof is described in detail in the voice broadcast method embodiment and is not described herein again.
As can be seen from the above, the voice broadcasting device 400 provided by the embodiments of the present application intercepts, through the interception module 401, a first speech synthesis request sent by the electronic device to the speech synthesis server; obtains, through the first obtaining module 402, the first broadcast text carried by the first speech synthesis request; obtains, through the second obtaining module 403, the first emotion audio when the recorded first emotion audio corresponding to the first broadcast text is currently stored; and feeds, through the first feedback module 404, the first emotion audio back to the electronic device as the first synthesized audio corresponding to the first broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, thereby improving the voice broadcasting effect.
In one embodiment, a voice broadcasting device is also provided. Referring to fig. 7, fig. 7 is a second schematic structural diagram of a voice broadcasting device according to an embodiment of the present application. The voice broadcasting device 500 is applied to a speech synthesis server and includes a receiving module 501, a third obtaining module 502, a fourth obtaining module 503, and a second feedback module 504, as follows:
a receiving module 501, configured to receive a third speech synthesis request sent by an electronic device;
a third obtaining module 502, configured to obtain a third broadcast text carried in the third speech synthesis request;
a fourth obtaining module 503, configured to obtain a third emotion audio when a recorded third emotion audio corresponding to the third broadcast text is currently stored;
and a second feedback module 504, configured to feed the third emotion audio back to the electronic device as a fifth synthesized audio corresponding to the third broadcast text, so that the electronic device performs voice broadcast.
It should be noted that the voice broadcast method in the embodiment in which the voice broadcast device provided in the embodiment of the present application and the voice synthesis server integrate all functions of the voice cache server and the voice synthesis server in the foregoing embodiment belongs to the same concept, and any method provided in the embodiment of the voice broadcast method can be implemented by the voice broadcast device, and a specific implementation process thereof is described in detail in the embodiment of the voice broadcast method, and is not described herein again.
As can be seen from the above, the voice broadcasting device 500 provided by the embodiments of the present application receives, through the receiving module 501, a third speech synthesis request sent by the electronic device; obtains, through the third obtaining module 502, the third broadcast text carried by the third speech synthesis request; obtains, through the fourth obtaining module 503, the third emotion audio when the recorded third emotion audio corresponding to the third broadcast text is currently stored; and feeds, through the second feedback module 504, the third emotion audio back to the electronic device as the fifth synthesized audio corresponding to the third broadcast text, so that the electronic device performs voice broadcasting. When emotion audio recorded by a real person is currently stored, this solution can feed the emotion audio back to the electronic device as the synthesized audio corresponding to the broadcast text, thereby improving the voice broadcasting effect.
The embodiment of the application also provides the electronic equipment. It should be noted that the electronic device herein refers to a device corresponding to a server, and may be, for example, a server, and specifically, a voice cache server and a voice synthesis server provided in the present application. Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. The electronic device 600 comprises a processor 601 and a memory 602. The processor 601 is electrically connected to the memory 602.
The processor 601 is a control center of the electronic device 600, connects various parts of the whole electronic device by using various interfaces and lines, and performs various functions of the electronic device and processes data by running or calling a computer program stored in the memory 602 and calling data stored in the memory 602, thereby performing overall monitoring of the electronic device.
The memory 602 may be used to store computer programs and data. The memory 602 stores computer programs comprising instructions executable in the processor. The computer program may constitute various functional modules. The processor 601 executes various functional applications and data processing by calling a computer program stored in the memory 602.
In this embodiment, the processor 601 in the electronic device 600 loads instructions corresponding to one or more processes of the computer program into the memory 602 according to the following steps, and the processor 601 runs the computer program stored in the memory 602, thereby implementing various functions:
intercepting a first voice synthesis request sent by electronic equipment to a voice synthesis server;
acquiring a first broadcast text carried by the first voice synthesis request;
when a recorded first emotion audio corresponding to the first broadcast text is stored currently, acquiring the first emotion audio;
and feeding back the first emotion audio to the electronic equipment as a first synthetic audio corresponding to the first broadcast text so as to enable the electronic equipment to perform voice broadcast.
In one embodiment, before the processor 601 performs intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server, it may perform: acquiring a history broadcast text of which the request times carried in the voice synthesis request are not less than a first preset time; and acquiring and storing the recorded emotion audio corresponding to the history broadcast text.
In one embodiment, before the processor 601 performs intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server, it may perform: acquiring a keyword of a real-time hotspot; acquiring a hot text related to the keyword; and acquiring and storing the recorded emotion audio corresponding to the hot text.
In one embodiment, when a first emotion audio corresponding to the first broadcast text and matching the voiceprint feature with the specified voiceprint feature of the first speech synthesis request is currently stored, processor 601 may perform the obtaining of the first emotion audio.
In one embodiment, after the processor 601 executes intercepting the first speech synthesis request sent by the electronic device to the speech synthesis server, the following steps may be executed: when the recorded first emotion audio corresponding to the first broadcast text is not stored at present, the first voice synthesis request is sent to a voice synthesis server, so that the voice synthesis server carries out audio synthesis processing according to the first broadcast text carried by the first voice synthesis request to generate second synthesized audio, and the second synthesized audio is fed back to the electronic equipment to be used for the electronic equipment to carry out voice broadcast.
In one embodiment, after the processor 601 performs generating the second synthesized audio, it may perform: the second synthesized audio is obtained and stored.
In one embodiment, after the processor 601 executes the acquiring and storing of the second synthesized audio, it may execute: if the number of requests for the second synthesized audio within a preset duration is not greater than a second preset number of times, deleting the second synthesized audio.
In one embodiment, after the processor 601 executes the feedback of the first emotion audio as the first synthesized audio corresponding to the first broadcast text to the electronic device for the electronic device to perform voice broadcast, the following steps may be executed: intercepting a second voice synthesis request sent by the electronic equipment to the voice synthesis server, wherein the second voice synthesis request carries a second broadcast text; acquiring a second emotion audio corresponding to a part of text in the second broadcast text and acquiring a third synthesized audio corresponding to another part of text in the second broadcast text; splicing the second emotion audio and the third synthesized audio to obtain a spliced audio; and feeding back the spliced audio to the electronic equipment as a fourth synthetic audio corresponding to the second broadcast text, so that the electronic equipment can perform voice broadcast.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and a part which is not described in detail in a certain embodiment may be referred to the above detailed description of the voice broadcast method, and is not described herein again.
It should be noted that, for the voice broadcasting method described in the embodiment of the present application, it can be understood by those skilled in the art that all or part of the process for implementing the voice broadcasting method described in the embodiment of the present application can be completed by controlling the relevant hardware through a computer program, where the computer program can be stored in a computer-readable storage medium, such as a memory, and executed by at least one processor, and during the execution process, the process of the embodiment of the voice broadcasting method can be included. The storage medium may be a magnetic disk, an optical disk, a Read Only Memory (ROM), a Random Access Memory (RAM), or the like.
For the voice broadcast device in the embodiment of the present application, each functional module may be integrated in one processing chip, or each module may exist alone physically, or two or more modules are integrated in one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium, such as a read-only memory, a magnetic or optical disk, or the like.
Furthermore, the terms "first", "second", and "third", etc. in this application are used to distinguish different objects, and are not used to describe a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to only those steps or modules listed, but rather, some embodiments may include other steps or modules not listed or inherent to such process, method, article, or apparatus.
The voice broadcasting method, the voice broadcasting device, the voice broadcasting storage medium and the electronic device provided by the embodiments of the present application are described in detail above, and a specific example is applied in the text to explain the principle and the implementation of the present application, and the description of the above embodiments is only used to help understanding the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (14)

1. A voice broadcasting method, applied to a voice cache server, characterized by comprising:
intercepting a first voice synthesis request sent by electronic equipment to a voice synthesis server;
acquiring a first broadcast text carried by the first voice synthesis request;
when a recorded first emotion audio corresponding to the first broadcast text is stored currently, acquiring the first emotion audio;
and feeding back the first emotion audio to the electronic equipment as a first synthetic audio corresponding to the first broadcast text so as to enable the electronic equipment to perform voice broadcast.
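Claim 1 above describes a cache-hit flow on the voice cache server: intercept the request, read the broadcast text it carries, and answer from the recorded emotion audio when one is cached. The Python sketch below only illustrates that flow; the class, field names, and the synthesis_client fallback are assumptions, not the claimed implementation.

```python
# Illustrative cache-server interceptor for the claim-1 flow (all names are assumed).
from typing import Dict, Optional


class VoiceCacheServer:
    def __init__(self, emotion_cache: Dict[str, bytes], synthesis_client):
        self.emotion_cache = emotion_cache        # broadcast text -> recorded emotion audio
        self.synthesis_client = synthesis_client  # proxy to the voice synthesis server

    def handle_synthesis_request(self, request: Dict[str, str]) -> bytes:
        # Intercept the first voice synthesis request and read the broadcast text it carries.
        first_broadcast_text = request["text"]
        # If a recorded first emotion audio is currently stored, feed it back as the
        # first synthesized audio for the electronic equipment to broadcast.
        cached: Optional[bytes] = self.emotion_cache.get(first_broadcast_text)
        if cached is not None:
            return cached
        # Otherwise forward to the voice synthesis server (the claims 5-7 path).
        return self.synthesis_client.synthesize(first_broadcast_text)
```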
2. The voice broadcasting method according to claim 1, wherein before intercepting the first voice synthesis request sent by the electronic equipment to the voice synthesis server, the method further comprises:
acquiring a historical broadcast text whose number of times of being carried in voice synthesis requests is not less than a first preset number;
and acquiring and storing a recorded emotion audio corresponding to the historical broadcast text.
3. The voice broadcasting method according to claim 1, wherein before intercepting the first voice synthesis request sent by the electronic equipment to the voice synthesis server, the method further comprises:
acquiring a keyword of a real-time hotspot;
acquiring a hot text related to the keyword;
and acquiring and storing the recorded emotion audio corresponding to the hot text.
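Claims 2 and 3 both pre-populate the cache before any interception: one from historical request counts, the other from real-time hotspot keywords. The sketch below is a hedged illustration of the two warm-up passes; the thresholds, log format, and helper callables are all assumptions.

```python
# Illustrative cache warm-up for claims 2 and 3 (names and data shapes are assumed).
from collections import Counter
from typing import Callable, Dict, Iterable, List


def warm_cache_from_history(request_log: Iterable[str],
                            first_preset_count: int,
                            fetch_recorded_audio: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Claim 2: cache recorded emotion audio for texts requested at least the preset number of times."""
    counts = Counter(request_log)
    return {text: fetch_recorded_audio(text)
            for text, times in counts.items() if times >= first_preset_count}


def warm_cache_from_hotspots(hotspot_keywords: Iterable[str],
                             find_hot_texts: Callable[[str], List[str]],
                             fetch_recorded_audio: Callable[[str], bytes]) -> Dict[str, bytes]:
    """Claim 3: cache recorded emotion audio for hot texts related to real-time hotspot keywords."""
    cache: Dict[str, bytes] = {}
    for keyword in hotspot_keywords:
        for hot_text in find_hot_texts(keyword):
            cache[hot_text] = fetch_recorded_audio(hot_text)
    return cache
```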
4. The voice broadcasting method according to claim 1, wherein, when the recorded first emotion audio corresponding to the first broadcast text is currently stored, acquiring the first emotion audio comprises:
acquiring the first emotion audio when a first emotion audio that corresponds to the first broadcast text and matches a voiceprint feature specified by the first voice synthesis request is currently stored.
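Claim 4 ties the cached clip to a specified voiceprint feature as well as to the text. One illustrative way to model that, purely as an assumption, is to key the cache on the pair (broadcast text, voiceprint identifier):

```python
# Hypothetical cache keyed by (broadcast text, voiceprint identifier).
from typing import Dict, Optional, Tuple

emotion_cache_by_voiceprint: Dict[Tuple[str, str], bytes] = {}


def lookup_emotion_audio(first_broadcast_text: str, voiceprint_id: str) -> Optional[bytes]:
    """Return the recorded clip only when it matches both the text and the specified voiceprint."""
    return emotion_cache_by_voiceprint.get((first_broadcast_text, voiceprint_id))
```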
5. The voice broadcasting method according to claim 1, further comprising, after intercepting the first voice synthesis request sent by the electronic equipment to the voice synthesis server:
when the recorded first emotion audio corresponding to the first broadcast text is not currently stored, sending the first voice synthesis request to the voice synthesis server, so that the voice synthesis server performs audio synthesis processing according to the first broadcast text carried by the first voice synthesis request to generate a second synthesized audio, and the second synthesized audio is fed back to the electronic equipment for the electronic equipment to perform voice broadcast.
6. The voice broadcasting method according to claim 5, further comprising, after generating the second synthesized audio:
the second synthesized audio is obtained and stored.
7. The voice broadcasting method according to claim 6, further comprising, after intercepting the first voice synthesis request sent by the electronic equipment to the voice synthesis server:
when the recorded first emotion audio corresponding to the first broadcast text is not currently stored but the second synthesized audio corresponding to the first broadcast text is currently stored, feeding back the second synthesized audio to the electronic equipment so that the electronic equipment can broadcast the second synthesized audio.
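Claims 5 to 7 together describe the cache-miss path: forward the request to the voice synthesis server, keep the returned second synthesized audio, and serve that stored copy when the same text later misses the emotion cache again. The sketch below is an assumed, simplified rendering of that ordering, with all names invented for illustration.

```python
# Illustrative miss handling for claims 5-7 (all names are assumptions).
from typing import Dict, Optional


def answer_synthesis_request(first_broadcast_text: str,
                             emotion_cache: Dict[str, bytes],
                             synthesized_cache: Dict[str, bytes],
                             synthesis_client) -> bytes:
    recorded = emotion_cache.get(first_broadcast_text)
    if recorded is not None:                       # claim 1: recorded emotion audio wins
        return recorded
    stored: Optional[bytes] = synthesized_cache.get(first_broadcast_text)
    if stored is not None:                         # claim 7: reuse the stored second synthesized audio
        return stored
    second_synthesized_audio = synthesis_client.synthesize(first_broadcast_text)  # claim 5: forward
    synthesized_cache[first_broadcast_text] = second_synthesized_audio            # claim 6: store
    return second_synthesized_audio
```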
8. The voice broadcasting method according to claim 6, further comprising, after acquiring and storing the second synthesized audio:
and deleting the second synthesized audio if the number of times the second synthesized audio is requested within a preset time length is not more than a second preset number.
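Claim 8 evicts a stored synthesized clip when it is requested no more than a second preset number of times within a preset time length. A minimal sketch follows, with assumed bookkeeping of per-text request timestamps; the window and threshold parameters are placeholders.

```python
# Illustrative eviction for claim 8 (bookkeeping structures are assumptions).
import time
from typing import Dict, List


def evict_cold_synthesized_audio(synthesized_cache: Dict[str, bytes],
                                 request_timestamps: Dict[str, List[float]],
                                 preset_window_seconds: float,
                                 second_preset_count: int) -> None:
    """Delete clips whose request count within the window does not exceed the preset count."""
    now = time.time()
    for text in list(synthesized_cache):
        recent = [t for t in request_timestamps.get(text, [])
                  if now - t <= preset_window_seconds]
        if len(recent) <= second_preset_count:
            del synthesized_cache[text]
            request_timestamps.pop(text, None)
```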
9. The voice broadcasting method according to claim 6, wherein after the first emotion audio is fed back to the electronic equipment as the first synthesized audio corresponding to the first broadcast text for the electronic equipment to perform voice broadcast, the method further comprises:
intercepting a second voice synthesis request sent by the electronic equipment to the voice synthesis server, wherein the second voice synthesis request carries a second broadcast text;
acquiring a second emotion audio corresponding to a part of text in the second broadcast text and acquiring a third synthesized audio corresponding to another part of text in the second broadcast text;
splicing the second emotion audio and the third synthesized audio to obtain a spliced audio;
and feeding back the spliced audio to the electronic equipment as a fourth synthesized audio corresponding to the second broadcast text, so that the electronic equipment can perform voice broadcast.
10. A voice broadcasting method, applied to a voice synthesis server, characterized by comprising:
receiving a third voice synthesis request sent by the electronic equipment;
acquiring a third broadcast text carried by the third voice synthesis request;
when a recorded third emotion audio corresponding to the third broadcast text is stored currently, acquiring the third emotion audio;
and feeding back the third emotion audio to the electronic equipment as a fifth synthesized audio corresponding to the third broadcast text, so that the electronic equipment can perform voice broadcast.
11. A voice broadcasting device, applied to a voice cache server, characterized by comprising:
an interception module, configured to intercept a first voice synthesis request sent by electronic equipment to a voice synthesis server;
a first acquisition module, configured to acquire a first broadcast text carried by the first voice synthesis request;
a second acquisition module, configured to acquire the first emotion audio when a recorded first emotion audio corresponding to the first broadcast text is currently stored;
and a first feedback module, configured to feed back the first emotion audio to the electronic equipment as a first synthesized audio corresponding to the first broadcast text, so that the electronic equipment can perform voice broadcast.
12. A voice broadcasting device, applied to a voice synthesis server, characterized by comprising:
a receiving module, configured to receive a third voice synthesis request sent by electronic equipment;
a third acquisition module, configured to acquire a third broadcast text carried by the third voice synthesis request;
a fourth acquisition module, configured to acquire the third emotion audio when a recorded third emotion audio corresponding to the third broadcast text is currently stored;
and a second feedback module, configured to feed back the third emotion audio to the electronic equipment as a fifth synthesized audio corresponding to the third broadcast text, so that the electronic equipment can perform voice broadcast.
13. A computer storage medium having a computer program stored thereon, characterized in that the computer program, when running on a computer, causes the computer to execute the voice broadcasting method according to any one of claims 1 to 10.
14. An electronic device comprising a processor and a memory, the memory storing a computer program, wherein the processor is configured to execute the voice broadcasting method according to any one of claims 1 to 10 by calling the computer program.
CN202110378920.7A 2021-04-08 2021-04-08 Voice broadcasting method and device, storage medium and electronic equipment Active CN113516962B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110378920.7A CN113516962B (en) 2021-04-08 2021-04-08 Voice broadcasting method and device, storage medium and electronic equipment

Publications (2)

Publication Number Publication Date
CN113516962A (en) 2021-10-19
CN113516962B CN113516962B (en) 2024-04-02

Family

ID=78061397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110378920.7A Active CN113516962B (en) 2021-04-08 2021-04-08 Voice broadcasting method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113516962B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1584980A (en) * 2004-06-01 2005-02-23 安徽中科大讯飞信息科技有限公司 Method for synthetic output with prompting sound and text sound in speech synthetic system
CN1945692A (en) * 2006-10-16 2007-04-11 安徽中科大讯飞信息科技有限公司 Intelligent method for improving prompting voice matching effect in voice synthetic system
US20080235024A1 (en) * 2007-03-20 2008-09-25 Itzhack Goldberg Method and system for text-to-speech synthesis with personalized voice
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN111261139A (en) * 2018-11-30 2020-06-09 上海擎感智能科技有限公司 Character personification broadcasting method and system
CN112133281A (en) * 2020-09-15 2020-12-25 北京百度网讯科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium
CN112489620A (en) * 2020-11-20 2021-03-12 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment

Also Published As

Publication number Publication date
CN113516962B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
CN110473546B (en) Media file recommendation method and device
CN108831437B (en) Singing voice generation method, singing voice generation device, terminal and storage medium
US20150179175A1 (en) Identification of utterance subjects
US10824664B2 (en) Method and apparatus for providing text push information responsive to a voice query request
US11640832B2 (en) Emotion-based voice interaction method, storage medium and terminal device using pitch, fluctuation and tone
US20140172419A1 (en) System and method for generating personalized tag recommendations for tagging audio content
CN109979450B (en) Information processing method and device and electronic equipment
KR20130081176A (en) Mobile terminal and mothod for controling of the same
CN104252464A (en) Information processing method and information processing device
CN108899036A (en) A kind of processing method and processing device of voice data
JP2020003774A (en) Method and apparatus for processing speech
CN109346057A (en) A kind of speech processing system of intelligence toy for children
US20220093103A1 (en) Method, system, and computer-readable recording medium for managing text transcript and memo for audio file
CN110968673B (en) Voice comment playing method and device, voice equipment and storage medium
CN113761268A (en) Playing control method, device, equipment and storage medium of audio program content
CN111563182A (en) Voice conference record storage processing method and device
CN103559242A (en) Method for achieving voice input of information and terminal device
CN112102807A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113516962A (en) Voice broadcasting method and device, storage medium and electronic equipment
CN109616116B (en) Communication system and communication method thereof
CN115563262B (en) Processing method and related device for dialogue data in machine voice call-out scene
CN114783408A (en) Audio data processing method and device, computer equipment and medium
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
CN112712793A (en) ASR (error correction) method based on pre-training model under voice interaction and related equipment
KR102376552B1 (en) Voice synthetic apparatus and voice synthetic method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant