WO2022077927A1 - Method, apparatus, device and computer storage medium for generating broadcast speech - Google Patents

Method, apparatus, device and computer storage medium for generating broadcast speech

Info

Publication number
WO2022077927A1
WO2022077927A1 (PCT/CN2021/097840)
Authority
WO
WIPO (PCT)
Prior art keywords
broadcast
scene
voice
template
speech
Prior art date
Application number
PCT/CN2021/097840
Other languages
English (en)
French (fr)
Inventor
丁世强
黄际洲
吴迪
Original Assignee
北京百度网讯科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司
Priority to JP2021576765A (published as JP2023502815A)
Priority to EP21827167.4A (published as EP4012576A4)
Priority to KR1020217042726A (published as KR20220051136A)
Priority to US17/622,922 (published as US20220406291A1)
Publication of WO2022077927A1

Classifications

    • G06F 16/367 Ontology
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3343 Query execution using phonetics
    • G06F 40/247 Thesauruses; Synonyms
    • G06F 40/253 Grammatical analysis; Style critique
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F 40/35 Discourse or dialogue representation
    • G06F 40/56 Natural language generation
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/07 Concatenation rules
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 21/10 Transforming into visible information

Definitions

  • The present application relates to the field of computer application technologies, and in particular to a method, apparatus, device and computer storage medium for generating broadcast speech under speech technology and knowledge graph technology.
  • Although voice broadcast largely meets users' needs in terms of sound, the effect is unsatisfactory because the content of the voice broadcast is fixed in each scenario. For example, at the start of navigation, "start to go" is broadcast no matter which voice package the user uses.
  • In view of this, the present application provides a method, apparatus, device and computer storage medium for generating broadcast voice, so as to improve the effect of voice broadcast.
  • the present application provides a method for generating broadcast speech, including:
  • speech matching a scene is obtained from a voice package, and a broadcast template preconfigured for the scene is obtained;
  • the broadcast template is filled with the speech to generate broadcast speech.
  • the present application provides a device for generating broadcast voice, including:
  • the speech acquisition module is used to obtain the speech matching the scene from the voice package
  • a template acquisition module used to acquire a broadcast template pre-configured for the scene
  • the voice generation module is used for filling the broadcast template with the speech to generate the broadcast voice.
  • the present application provides an electronic device, comprising:
  • at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.
  • the present application provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause the computer to execute the above method.
  • the present application fills the broadcast template with the speech in the voice package that matches the scene to obtain the broadcast speech, so that the broadcast speech can well reflect the personality characteristics of the entity of the voice package, which greatly improves the broadcast effect.
  • users can feel that they are really listening to the entity of the voice package speaking.
  • FIG. 1 is a schematic diagram of the principle of generating broadcast speech in the prior art.
  • FIG. 2 shows an exemplary system architecture to which embodiments of the present application may be applied
  • FIG. 3 is a flow chart of a main method provided by the embodiment of the present application.
  • FIG. 4 is a schematic diagram of the principle of generating a broadcast voice provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a method for mining style speech provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of a method for mining knowledge words provided by an embodiment of the present application.
  • FIG. 7 is an example diagram of a partial knowledge graph provided by an embodiment of the present application.
  • FIG. 8 is a structural diagram of an apparatus for generating a broadcast voice provided by an embodiment of the present application.
  • FIG. 9 is a block diagram of an electronic device used to implement an embodiment of the present application.
  • Generating broadcast text can include, but is not limited to, two situations:
  • One is dialogue-based broadcast text generation. That is, after a user's voice instruction is received, the reply text generated in response to the instruction serves as the broadcast text. In this case, the reply text is generated mainly after the scene and the user's intention are analyzed based on dialogue understanding.
  • The other is active broadcast text generation. That is, during the voice broadcast process of a function, voice broadcast is performed actively. For example, during navigation, broadcast texts such as "start to go" and "turn left ahead" are broadcast actively. In this case, the broadcast text is generated after scene analysis is performed based on the current actual situation.
  • After the broadcast text is generated, speech synthesis is performed using the timbre information in the voice package to obtain the speech to be broadcast.
  • With the broadcast speech generated by the above prior art, the speech content broadcast by different voice packages in the same scenario is the same, and only the timbre differs.
  • For example, whether the user uses his son's voice package or a celebrity's voice package, "find the nearest cafe, located at ***" is broadcast in the "query a cafe" scenario.
  • FIG. 2 shows an exemplary system architecture of the method for generating broadcast voice or the apparatus for generating broadcast voice according to the embodiments of the present application.
  • the system architecture may include terminal devices 101 and 102 , a network 103 and a server 104 .
  • the network 103 is a medium used to provide a communication link between the terminal devices 101 , 102 and the server 104 .
  • the network 103 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
  • the user can interact with the server 104 through the network 103 using the terminal devices 101 and 102 .
  • Various applications may be installed on the terminal devices 101 and 102 , such as voice interaction applications, map applications, web browser applications, communication applications, and the like.
  • the terminal devices 101 and 102 may be various electronic devices that support voice broadcast. Including but not limited to smartphones, tablets, laptops, smart wearables, etc.
  • the apparatus for generating broadcast voice provided by this application can be set and run in the above-mentioned server 104 , or can be set and run in the terminal devices 101 and 102 . It can be implemented as a plurality of software or software modules (for example, used to provide distributed services), or can be implemented as a single software or software module, which is not specifically limited herein.
  • the server 104 may be a single server or a server group composed of multiple servers. It should be understood that the numbers of terminal devices, networks and servers in FIG. 2 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • FIG. 3 is a flow chart of a main method provided by an embodiment of the present application. As shown in FIG. 3 , the method may include the following steps:
  • the speech matching the scene is obtained from the voice package.
  • In addition to timbre information, the voice package also includes various kinds of speech information.
  • "Speech" here can be understood as the way of speaking. When expressing the same meaning, different expressions, that is, different speech, can be used. In this embodiment of the present application, different voice packages may use different speech for the same scene.
  • the speech includes at least one of address speech, style speech and knowledge speech. Address speech is a way of addressing the user. Style speech is an expression in a particular style. Knowledge speech is an expression based on specific knowledge content.
  • As an example of style speech, for the same scene "speeding", when a user uses a family member's voice package, a warm-hearted style is reflected, and the style speech "You are speeding; please drive carefully and stay safe" can be used.
  • When the user uses a comedian's voice package, a funny style is reflected, and the style speech "We are just an ordinary driver; stop pretending to drive an F1 car and slow down" can be used.
  • a broadcast template preconfigured for the scene is acquired.
  • a broadcast template may be configured for each scene in advance.
  • a broadcast template may include a combination of at least one kind of speech.
  • the broadcast template is filled with the acquired vocabulary to generate broadcast speech.
  • the broadcast template corresponds to the scene, and the voice package contains personalized speech matching the scene.
  • The broadcast text obtained after filling the broadcast template with this speech can well reflect the personality characteristics of the entity of the voice package (such as a son, a wife, a celebrity, etc.), which greatly improves the broadcast effect and makes users feel that they are really listening to the entity of the voice package speaking.
  • After the broadcast text is obtained, the timbre information in the voice package can further be used for speech synthesis to finally generate the broadcast speech. This part is the same as in the prior art and will not be described in detail.
  • As shown in FIG. 4, the present application utilizes the speech information in the voice package in the process of generating the broadcast text.
  • The methods for generating the speech information in the voice package are described in detail below with reference to the embodiments.
  • The address speech in the voice package can be set by the user.
  • components such as an input box or an option for a salutation phrase may be provided to the user in the setting interface for the voice package for the user to input or select the salutation phrase.
  • For example, a user who uses a son's voice package can be provided with a settings interface for the voice package, which includes options for common forms of address such as "dad", "mom", "grandpa", "husband", "wife", "grandma" and "baby" for the user to choose from.
  • An input box can also be provided for user input.
  • pre-set content can be obtained, for example, pre-set by R&D personnel, service providers, and the like.
  • stylistic words can be obtained in advance through search engines.
  • the steps shown in Figure 5 can be used:
  • a search keyword is obtained by concatenating preset style keywords and scene keywords.
  • the style keyword can also be set by the user.
  • an input box of style keywords or a component of selection items may be provided to the user in the setting interface for the voice package for the user to input or select.
  • options for style keywords such as "intimate", "funny", "overbearing" and "TikTok style" may be provided for the user to select on the settings interface of the voice package.
  • stylistic discourse candidates are selected from the search result text corresponding to the search keywords.
  • Suppose the scene keywords are "coffee shop" and "coffee", and the style keyword of the voice package currently used by the user is "warm heart". The search keywords "coffee shop warm heart" and "coffee warm heart" can then be constructed.
  • After a search is performed for each, the search result texts, such as the titles and abstracts of the search results, can be obtained.
  • After ranking by relevance to the search keywords, the top N search result texts are taken as style speech candidates, where N is a preset positive integer.
  • The above correction of the style speech candidates may be that developers adjust, combine and select among the candidates to obtain the final style speech.
  • An address slot can also be added to the style speech.
  • Besides manual correction, other correction methods can also be used.
  • pre-set content can be acquired, for example, pre-set by R&D personnel, service providers, and the like.
  • knowledge words can be pre-mined based on knowledge graphs.
  • the steps shown in Figure 6 can be used:
  • the voice package corresponds to a certain entity object, and reflects the timbre of the entity object. For example, if the user adopts the voice package of a relative, the entity corresponding to the voice package is the relative. For another example, if the user adopts the voice package of star A, the entity corresponding to the voice package is star A.
  • Each entity has its corresponding knowledge graph. Therefore, in this step, the knowledge graph of the entity corresponding to the voice package can be obtained.
  • Each knowledge node contains specific content and its relations to other knowledge nodes. Take the partial knowledge graph shown in FIG. 7 as an example. For the voice package of "celebrity A", the corresponding entity is "celebrity A". The knowledge graph can include, for example, the knowledge nodes "The Whistleblower", "Luckin Coffee", "The Central Academy of Drama" and "Hangzhou", where the relation between "The Whistleblower" and "celebrity A" is "movie in theaters", the relation between "Luckin Coffee" and "celebrity A" is "advertising endorsement", the relation between "The Central Academy of Drama" and "celebrity A" is "alma mater", and the relation between "Hangzhou" and "celebrity A" is "birthplace".
  • the scene keyword can be matched with the content and association relationship of the knowledge node.
  • a knowledge vocabulary corresponding to the scene is generated by using the acquired knowledge node and the vocabulary template corresponding to the scene.
  • A speech template for knowledge speech can be preset for each scene. For example, for the scene "find a movie theater", the speech template "Come to the movie theater to watch my newly released movie [movie title]" can be set. After the knowledge node "The Whistleblower" is determined in step 602, it is filled into the slot [movie title] in the speech template, generating the knowledge speech "Come to the movie theater to watch my newly released movie The Whistleblower".
  • A voice package may contain some or all of address speech, style speech and knowledge speech.
  • The matching may be performed based on text similarity; for example, if the text similarity between the speech and the keywords of the scene is greater than or equal to a preset similarity threshold, it is considered a match. In this way, fairly comprehensive speech that is close to the scene can be found.
  • The implementations of "obtaining the broadcast template preconfigured for the scene" in step 302 above and "filling the broadcast template with the acquired speech to generate the broadcast speech" in step 303 are described below with reference to the embodiments.
  • At least one broadcast template and attribute information of each broadcast template may be configured in advance for each scene.
  • the broadcast template includes a combination of at least one kind of speech; in addition to the above address speech, style speech and knowledge speech, it may further include basic speech, which can be stored on the server side.
  • the attribute information may include at least one of priority, constraint rules between kinds of speech, and the like.
  • After the broadcast text is obtained, speech synthesis can be performed based on the timbre information in the voice package to obtain the broadcast speech. This way of generating the broadcast speech makes the speech the user hears sound as if it were spoken by his own son, which is very heart-warming and highly personalized.
  • FIG. 8 is a structural diagram of an apparatus for generating a broadcast voice provided by an embodiment of the present application.
  • the apparatus may be an application located on a local terminal, or a functional unit such as a plug-in or a software development kit (SDK) in an application on a local terminal, or may be located on the server side.
  • the apparatus may include: a speech acquisition module 00 , a template acquisition module 10 and a speech generation module 20 , and may further include: a first mining module 30 and a second mining module 40 .
  • the main functions of each constituent unit are as follows:
  • the speech acquisition module 00 is used for acquiring the speech matching the scene from the voice packet.
  • the speech acquisition module 00 can determine the keywords of the scene, and acquire, from the voice package, the speech matching the keywords of the scene.
  • the discourse includes at least one of address discourse, style discourse and knowledge discourse.
  • the template obtaining module 10 is configured to obtain a broadcast template pre-configured for the scene.
  • the template acquisition module 10 can determine at least one broadcast template preconfigured for the scene and the attribute information of each broadcast template, the broadcast template including a combination of at least one kind of speech, and select, according to the attribute information of each broadcast template and the voice package, one broadcast template configured for the scene from the at least one broadcast template.
  • the voice generation module 20 is used for filling the broadcast template by using the words to generate the broadcast voice.
  • the speech generation module 20 may include: a text generation submodule 21 and a speech synthesis submodule 22 .
  • the text generation sub-module 21 is used to fill in the broadcast template by using the words to generate the broadcast text.
  • the speech synthesis sub-module 22 is configured to perform speech synthesis on the broadcast text by using the timbre information in the speech packet to obtain the broadcast speech.
  • the salutation in the voice package it can be set by the user.
  • components such as an input box or an option for a salutation phrase may be provided to the user in the setting interface for the voice package for the user to input or select the salutation phrase.
  • pre-set content can be obtained, for example, pre-set by R&D personnel, service providers, and the like.
  • the stylistic words can be obtained in advance by the first mining module 30 through a search engine.
  • the first mining module 30 is used to obtain the style speech in the voice package by mining in advance in the following manner: obtaining search keywords by concatenating preset style keywords and scene keywords; selecting style speech candidates from the search result texts corresponding to the search keywords; and obtaining the result of correcting the style speech candidates to obtain the style speech.
  • the second mining module 40 is used to obtain the knowledge speech in the voice package by mining in advance in the following manner: acquiring the knowledge graph associated with the voice package; acquiring knowledge nodes matching the scene from the knowledge graph; and generating the knowledge speech of the corresponding scene by using the acquired knowledge nodes and the speech template of the corresponding scene.
  • the present application further provides an electronic device and a readable storage medium.
  • FIG. 9 is a block diagram of an electronic device for the method for generating broadcast speech according to an embodiment of the present application.
  • Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions are by way of example only, and are not intended to limit implementations of the application described and/or claimed herein.
  • the electronic device includes: one or more processors 901, a memory 902, and interfaces for connecting various components, including a high-speed interface and a low-speed interface.
  • the various components are interconnected using different buses and may be mounted on a common motherboard or otherwise as desired.
  • the processor may process instructions executed within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface.
  • multiple processors and/or multiple buses may be used together with multiple memories, if desired.
  • multiple electronic devices may be connected, each providing some of the necessary operations (eg, as a server array, a group of blade servers, or a multiprocessor system).
  • a processor 901 is taken as an example in FIG. 9 .
  • the memory 902 is the non-transitory computer-readable storage medium provided by the present application.
  • the memory stores instructions executable by at least one processor, so that the at least one processor executes the method for generating broadcast speech provided by the present application.
  • the non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to cause a computer to execute the method for generating broadcast speech provided by the present application.
  • the memory 902 can be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as program instructions/modules corresponding to the method for generating broadcast voice in the embodiments of the present application.
  • the processor 901 executes various functional applications and data processing of the server by running the non-transitory software programs, instructions and modules stored in the memory 902, that is, implementing the method for generating broadcast speech in the above method embodiments.
  • the memory 902 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the electronic device, and the like. Additionally, memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 902 may optionally include memory located remotely from processor 901, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
  • the electronic device may further include: an input device 903 and an output device 904 .
  • the processor 901 , the memory 902 , the input device 903 and the output device 904 may be connected by a bus or in other ways, and the connection by a bus is taken as an example in FIG. 9 .
  • the input device 903 can receive input numeric or character information, and generate key signal inputs related to user settings and function control of the electronic device; examples include a touch screen, keypad, mouse, trackpad, touchpad, pointing stick, one or more mouse buttons, trackball, joystick, and other input devices.
  • Output devices 904 may include display devices, auxiliary lighting devices (eg, LEDs), haptic feedback devices (eg, vibration motors), and the like.
  • the display device may include, but is not limited to, a liquid crystal display (LCD), a light emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
  • Various implementations of the systems and techniques described herein can be implemented in digital electronic circuitry, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • The terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals.
  • machine-readable signal refers to any signal used to provide machine instructions and/or data to a programmable processor.
  • the systems and techniques described herein may be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and pointing device (e.g., a mouse or trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
  • the systems and techniques described herein may be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user's computer having a graphical user interface or web browser through which the user may interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components.
  • the components of the system may be interconnected by any form or medium of digital data communication (eg, a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
  • a computer system can include clients and servers.
  • Clients and servers are generally remote from each other and usually interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a method, apparatus, device and computer storage medium for generating broadcast speech, relating to the fields of speech technology and knowledge graph technology. The specific implementation scheme is as follows: obtaining, from a voice package, speech matching a scene, and obtaining a broadcast template preconfigured for the scene; and filling the broadcast template with the speech to generate broadcast speech. The present application enables the broadcast speech to well reflect the personality characteristics of the entity of the voice package, which greatly improves the broadcast effect.

Description

Method, Apparatus, Device and Computer Storage Medium for Generating Broadcast Speech
This application claims priority to Chinese patent application No. 202011105935.8, filed on October 15, 2020 and entitled "Method, Apparatus, Device and Computer Storage Medium for Generating Broadcast Speech".
Technical Field
The present application relates to the field of computer application technologies, and in particular to a method, apparatus, device and computer storage medium for generating broadcast speech under speech technology and knowledge graph technology.
Background
As users' demands on the functions of smart terminals keep growing, more and more applications integrate a voice broadcast function. Users can download and install various voice packages, so that voice broadcasts can use the voice of a person they like.
At present, although voice broadcast largely satisfies users' needs in terms of sound, the effect is not satisfactory because the content of the voice broadcast is fixed in every scenario. For example, at the very beginning of navigation, "Start off" is broadcast no matter which voice package the user uses.
Summary
In view of this, the present application provides a method, apparatus, device and computer storage medium for generating broadcast speech, so as to improve the effect of voice broadcast.
In a first aspect, the present application provides a method for generating broadcast speech, including:
obtaining, from a voice package, speech matching a scene, and obtaining a broadcast template preconfigured for the scene;
filling the broadcast template with the speech to generate broadcast speech.
In a second aspect, the present application provides an apparatus for generating broadcast speech, including:
a speech acquisition module configured to obtain, from a voice package, speech matching a scene;
a template acquisition module configured to obtain a broadcast template preconfigured for the scene;
a speech generation module configured to fill the broadcast template with the speech to generate broadcast speech.
In a third aspect, the present application provides an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method described above.
In a fourth aspect, the present application provides a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method described above.
It can be seen from the above technical solutions that the present application fills a broadcast template with the speech in a voice package that matches the scene to obtain broadcast speech, so that the broadcast speech can well reflect the personality characteristics of the entity of the voice package, which greatly improves the broadcast effect and makes users feel that they are really listening to the entity of the voice package speaking.
Other effects of the above optional implementations will be described below in combination with specific embodiments.
Brief Description of the Drawings
The drawings are used for a better understanding of the solution and do not constitute a limitation of the present application. In the drawings:
FIG. 1 is a schematic diagram of the principle of generating broadcast speech in the prior art;
FIG. 2 shows an exemplary system architecture to which embodiments of the present application may be applied;
FIG. 3 is a flowchart of a main method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of the principle of generating broadcast speech provided by an embodiment of the present application;
FIG. 5 is a flowchart of a method for mining style speech provided by an embodiment of the present application;
FIG. 6 is a flowchart of a method for mining knowledge speech provided by an embodiment of the present application;
FIG. 7 is an example diagram of part of a knowledge graph provided by an embodiment of the present application;
FIG. 8 is a structural diagram of an apparatus for generating broadcast speech provided by an embodiment of the present application;
FIG. 9 is a block diagram of an electronic device used to implement an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding, which should be considered as merely exemplary. Therefore, those of ordinary skill in the art should recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted in the following description.
In the prior art, the principle of generating broadcast speech may be as shown in FIG. 1. Generating broadcast text may include, but is not limited to, two situations:
One is dialogue-based broadcast text generation. That is, after a user's voice instruction is received, the reply text generated in response to the instruction serves as the broadcast text. For example, upon receiving the user's voice instruction "find a coffee shop", the generated reply text is "Found the nearest coffee shop for you, on floor C of Beijing International Building, Zhongguancun South Street, 2.1 km away from you". In this situation, the reply text is generated mainly after the scene and the user's intention are analyzed based on dialogue understanding.
The other is active broadcast text generation. That is, during the voice broadcast process of a certain function, voice broadcast is performed actively. For example, during navigation, broadcast texts such as "Start off" and "Turn left ahead" are broadcast actively. In this situation, the broadcast text is generated mainly after scene analysis is performed based on the current actual conditions.
After the broadcast text is generated, speech synthesis is performed using the timbre information in the voice package to obtain the speech to be broadcast. With the broadcast speech generated by the above prior art, the speech content broadcast by different voice packages in the same scenario is the same, and only the timbre differs. For example, whether the user uses his son's voice package or a celebrity's voice package, "Found the nearest coffee shop, located at ***" is broadcast in the "find a coffee shop" scenario.
FIG. 2 shows an exemplary system architecture to which the method for generating broadcast speech or the apparatus for generating broadcast speech of the embodiments of the present application may be applied.
As shown in FIG. 2, the system architecture may include terminal devices 101 and 102, a network 103 and a server 104. The network 103 is a medium used to provide communication links between the terminal devices 101, 102 and the server 104. The network 103 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
Users may use the terminal devices 101 and 102 to interact with the server 104 through the network 103. Various applications may be installed on the terminal devices 101 and 102, such as voice interaction applications, map applications, web browser applications, and communication applications.
The terminal devices 101 and 102 may be various electronic devices that support voice broadcast, including but not limited to smartphones, tablets, laptops, smart wearable devices, and the like. The apparatus for generating broadcast speech provided by the present application may be deployed and run in the server 104, or may be deployed and run in the terminal devices 101 and 102. It may be implemented as multiple pieces of software or software modules (for example, to provide distributed services), or as a single piece of software or software module, which is not specifically limited herein.
The server 104 may be a single server or a server group composed of multiple servers. It should be understood that the numbers of terminal devices, networks and servers in FIG. 2 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs.
FIG. 3 is a flowchart of a main method provided by an embodiment of the present application. As shown in FIG. 3, the method may include the following steps:
In 301, speech matching a scene is obtained from a voice package.
In this embodiment of the present application, in addition to timbre information, a voice package also includes various kinds of speech information. "Speech" here can be understood as the way of speaking: the same meaning can be expressed in different ways, that is, with different speech. In this embodiment of the present application, different voice packages may use different speech for the same scene. The speech includes at least one of address speech, style speech, knowledge speech, and the like. Address speech is the way of addressing the user. Style speech is an expression in a particular style. Knowledge speech is an expression based on specific knowledge content.
As an example of address speech, when the user uses his son's voice package, the address speech "dad" may be used; when the user uses his wife's voice package, the address speech "husband" may be used. Of course, a voice package may also contain no address speech information. For example, for a celebrity's voice package, address speech may not be used, and basic speech such as "you" is used uniformly instead.
As an example of style speech, for the same scene "speeding", when the user uses a family member's voice package, a warm-hearted style is reflected, and the style speech "You are speeding; please drive carefully and stay safe" may be used. When the user uses a comedian's voice package, a funny style is reflected, and the style speech "We are just an ordinary driver; stop pretending to drive an F1 car and slow down" may be used.
As an example of knowledge speech, for the scene "coffee shop", when the user uses celebrity A's voice package, the knowledge speech "How about a cup of xxx coffee" may be used, where "xxx" may be the coffee brand endorsed by celebrity A. If the user uses celebrity B's voice package, "xxx" in the knowledge speech may be the coffee brand endorsed by celebrity B.
The ways of generating the various kinds of speech in a voice package will be described in detail in subsequent embodiments.
In 302, a broadcast template preconfigured for the scene is obtained.
In this embodiment of the present application, a broadcast template may be configured for each scene in advance. A broadcast template may include a combination of at least one kind of speech.
In 303, the broadcast template is filled with the obtained speech to generate broadcast speech.
The broadcast template corresponds to the scene, and the voice package contains personalized speech matching the scene. The broadcast text obtained after filling the broadcast template with this speech can well reflect the personality characteristics of the entity of the voice package (for example, a son, a wife, a celebrity, etc.), which greatly improves the broadcast effect and makes users feel that they are really listening to the entity of the voice package speaking.
After the broadcast text is obtained, speech synthesis may further be performed using the timbre information in the voice package to finally generate the broadcast speech. This part is the same as in the prior art and is not described in detail.
As shown in FIG. 4, the present application utilizes the speech information in the voice package in the process of generating the broadcast text. The ways of generating the speech information in the voice package are described in detail below with reference to embodiments.
The address speech in the voice package can be set by the user. As a preferred implementation, components such as an input box or options for address speech may be provided to the user in the settings interface for the voice package, for the user to enter or select the address speech. For example, a user who uses his son's voice package may be provided with a settings interface for the voice package that contains options for common forms of address such as "dad", "mom", "grandpa", "husband", "wife", "grandma" and "baby" for the user to choose from. An input box may also be provided for the user to enter a form of address.
For the style speech in the voice package, preset content can be obtained, for example, preset by R&D personnel, service providers, and so on. As a preferred implementation, however, style speech can be mined in advance through a search engine, for example, using the steps shown in FIG. 5:
In 501, search keywords are obtained by concatenating preset style keywords and the keywords of the scene.
The style keywords can also be set by the user. For example, an input box or selection components for style keywords may be provided to the user in the settings interface for the voice package, for the user to enter or select. For example, options for style keywords such as "caring", "funny", "bossy" and "TikTok style" may be provided on the settings interface of the voice package for the user to choose from.
In 502, style speech candidates are selected from the search result texts corresponding to the search keywords.
Suppose the current scene is querying a coffee shop, the scene keywords are "coffee shop" and "coffee", and the style keyword of the voice package currently used by the user is "warm heart". The search keywords "coffee shop warm heart" and "coffee warm heart" can be constructed. After a search is performed for each, the search result texts, such as the titles and abstracts of the search results, can be obtained. After ranking by relevance to the search keywords, the top N search result texts are taken as style speech candidates, where N is a preset positive integer.
In 503, the style speech is obtained after the style speech candidates are corrected.
In this embodiment, the above correction of the style speech candidates may be that developers adjust, combine and select among the style speech candidates to obtain the final style speech. An address slot may also be added to the style speech. Besides manual correction, other correction methods may also be used.
For example, from the style speech candidates "Coffee keeps me wide awake; if you want a good night's sleep, quit it first", "Take a sip of coffee; although it is a little bitter, the sweetness that follows will make you forget the bitterness" and "Life is like a cup of coffee, sweet within the bitterness and joy within the sweetness", manual correction may yield the style speech: "Coffee can refresh your mind, but it affects sleep. [Address], remember to get more rest".
For the knowledge speech in the voice package, preset content can be obtained, for example, preset by R&D personnel, service providers, and so on. As a preferred implementation, however, knowledge speech can be mined in advance based on a knowledge graph, for example, using the steps shown in FIG. 6:
In 601, the knowledge graph associated with the voice package is obtained.
A voice package usually corresponds to a certain entity and reflects the timbre of that entity. For example, if the user uses a relative's voice package, the entity corresponding to the voice package is that relative. As another example, if the user uses celebrity A's voice package, the entity corresponding to the voice package is celebrity A. Each entity has its corresponding knowledge graph; therefore, in this step, the knowledge graph of the entity corresponding to the voice package can be obtained.
In 602, knowledge nodes matching the scene are obtained from the knowledge graph.
In the knowledge graph, each knowledge node contains specific content as well as relations to other knowledge nodes. Take the partial knowledge graph shown in FIG. 7 as an example. For the voice package of "celebrity A", the corresponding entity is "celebrity A". The knowledge graph may contain knowledge nodes such as "The Whistleblower", "Luckin Coffee", "The Central Academy of Drama" and "Hangzhou", where the relation between "The Whistleblower" and "celebrity A" is "movie in theaters", the relation between "Luckin Coffee" and "celebrity A" is "advertising endorsement", the relation between "The Central Academy of Drama" and "celebrity A" is "alma mater", and the relation between "Hangzhou" and "celebrity A" is "birthplace". When knowledge nodes matching the scene are obtained, the scene keywords can be matched against the content and relations of the knowledge nodes.
In 603, the knowledge speech for the corresponding scene is generated by using the obtained knowledge nodes and the speech template for the corresponding scene.
For each scene, a speech template for knowledge speech can be preset. For example, for the scene "find a movie theater", the speech template "Come to the movie theater to watch my newly released movie [movie title]" can be set. After the knowledge node "The Whistleblower" is determined in step 602, it is filled into the slot [movie title] in the speech template, generating the knowledge speech "Come to the movie theater to watch my newly released movie The Whistleblower".
A voice package may contain some or all of address speech, style speech and knowledge speech. As a preferred implementation, when "speech matching a scene is obtained from a voice package" in step 301 above, the keywords of the scene may first be determined, and then the speech matching the keywords of the scene is obtained from the voice package. The matching may be based on text similarity: for example, if the text similarity between the speech and the keywords of the scene is greater than or equal to a preset similarity threshold, it is considered a match. In this way, fairly comprehensive speech that is close to the scene can be found.
Besides the above preferred implementation, other approaches may also be used, for example, presetting the matching relations between speech and each scene, and so on.
The implementations of "obtaining a broadcast template preconfigured for the scene" in step 302 above and "filling the broadcast template with the obtained speech to generate broadcast speech" in step 303 are described below with reference to embodiments.
At least one broadcast template, as well as attribute information of each broadcast template, may be configured in advance for each scene. A broadcast template includes a combination of at least one kind of speech; in addition to the address speech, style speech and knowledge speech described above, it may further include basic speech, and the basic speech can be stored on the server side. The attribute information may include at least one of priority, constraint rules between kinds of speech, and the like.
As an example, suppose that for the topic "find a coffee shop", six broadcast templates are configured, with the priorities and constraint rules shown in Table 1.
Table 1
Broadcast template | Priority | Constraint rule
[Address][Knowledge speech] | 10 | The knowledge speech must not contain an address
[Knowledge speech] | 9 |
[Address][Basic speech][Style speech] | 7 | The style speech must not contain an address
[Basic speech][Style speech] | 5 |
[Address][Basic speech] | 2 |
[Basic speech] | 0 |
Suppose the user uses his son's voice package. In the "find a coffee shop" scenario, the speech in the voice package matching the scene is obtained as follows:
Address speech: dad;
Style speech: Coffee can refresh your mind, but it affects sleep. [Address], remember to get more rest.
The broadcast templates shown in Table 1 are screened in descending order of priority. Since there is no knowledge speech matching the scene, the first two templates are not used. Since the constraint rule of the third template requires that the style speech contain no address, it cannot be used either. Therefore, the fourth template "[Basic speech][Style speech]" can be used.
The basic speech for the scene, "Found the nearest coffee shop, located at ***", is obtained from the server side, and the style speech for the scene, "Coffee can refresh your mind, but it affects sleep. [Address], remember to get more rest", is obtained from the voice package. The fourth template is filled with them, finally yielding the broadcast text: "Found the nearest coffee shop, located at ***. Coffee can refresh your mind, but it affects sleep. Dad, remember to get more rest".
After the broadcast text is obtained, speech synthesis can be performed based on the timbre information in the voice package to obtain the broadcast speech. Generating broadcast speech in this way makes the speech the user hears sound as if it were spoken by his own son, which is very heart-warming and highly personalized.
The above is a detailed description of the method provided by the present application. The apparatus provided by the present application is described in detail below with reference to embodiments.
FIG. 8 is a structural diagram of an apparatus for generating broadcast speech provided by an embodiment of the present application. The apparatus may be an application located on a local terminal, or a functional unit such as a plug-in or software development kit (SDK) in an application on a local terminal, or may be located on the server side. As shown in FIG. 8, the apparatus may include a speech acquisition module 00, a template acquisition module 10 and a speech generation module 20, and may further include a first mining module 30 and a second mining module 40. The main functions of the constituent units are as follows:
The speech acquisition module 00 is configured to obtain, from a voice package, speech matching a scene.
As a preferred implementation, the speech acquisition module 00 may determine the keywords of the scene, and obtain, from the voice package, the speech matching the keywords of the scene.
The speech includes at least one of address speech, style speech and knowledge speech.
The template acquisition module 10 is configured to obtain a broadcast template preconfigured for the scene.
As a preferred implementation, the template acquisition module 10 may determine at least one broadcast template preconfigured for the scene and attribute information of each broadcast template, the broadcast template including a combination of at least one kind of speech, and select, according to the attribute information of each broadcast template and the voice package, one broadcast template configured for the scene from the at least one broadcast template.
The speech generation module 20 is configured to fill the broadcast template with the speech to generate broadcast speech.
Specifically, the speech generation module 20 may include a text generation submodule 21 and a speech synthesis submodule 22.
The text generation submodule 21 is configured to fill the broadcast template with the speech to generate broadcast text.
The speech synthesis submodule 22 is configured to perform speech synthesis on the broadcast text by using the timbre information in the voice package to obtain the broadcast speech.
The address speech in the voice package can be set by the user. As a preferred implementation, components such as an input box or options for address speech may be provided to the user in the settings interface for the voice package, for the user to enter or select the address speech.
For the style speech in the voice package, preset content can be obtained, for example, preset by R&D personnel, service providers, and so on. As a preferred implementation, however, the style speech can be mined in advance by the first mining module 30 through a search engine.
The first mining module 30 is configured to mine the style speech in the voice package in advance in the following manner:
obtaining search keywords by concatenating preset style keywords and the keywords of the scene;
selecting style speech candidates from the search result texts corresponding to the search keywords;
obtaining the result of correcting the style speech candidates to obtain the style speech. As one implementation, the style speech candidates may be corrected manually.
The second mining module 40 is configured to mine the knowledge speech in the voice package in advance in the following manner:
obtaining the knowledge graph associated with the voice package;
obtaining, from the knowledge graph, knowledge nodes matching the scene;
generating the knowledge speech for the corresponding scene by using the obtained knowledge nodes and the speech template for the corresponding scene.
According to embodiments of the present application, the present application further provides an electronic device and a readable storage medium.
As shown in FIG. 9, it is a block diagram of an electronic device for the method for generating broadcast speech according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present application described and/or claimed herein.
As shown in FIG. 9, the electronic device includes one or more processors 901, a memory 902, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other ways as required. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other implementations, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple electronic devices may be connected, with each device providing some of the necessary operations (for example, as a server array, a group of blade servers, or a multiprocessor system). One processor 901 is taken as an example in FIG. 9.
The memory 902 is the non-transitory computer-readable storage medium provided by the present application. The memory stores instructions executable by at least one processor, so that the at least one processor performs the method for generating broadcast speech provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions, and the computer instructions are used to cause a computer to perform the method for generating broadcast speech provided by the present application.
As a non-transitory computer-readable storage medium, the memory 902 can be used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the method for generating broadcast speech in the embodiments of the present application. The processor 901 runs the non-transitory software programs, instructions and modules stored in the memory 902 to execute the various functional applications and data processing of the server, that is, to implement the method for generating broadcast speech in the above method embodiments.
The memory 902 may include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function, and the data storage area may store data created according to the use of the electronic device, and so on. In addition, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 902 may optionally include memories located remotely from the processor 901, and these remote memories may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device may further include an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903 and the output device 904 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 9.
The input device 903 may receive entered numeric or character information, and generate key signal inputs related to the user settings and function control of the electronic device; examples include a touch screen, a keypad, a mouse, a trackpad, a touchpad, a pointing stick, one or more mouse buttons, a trackball, and a joystick. The output device 904 may include display devices, auxiliary lighting devices (for example, LEDs), haptic feedback devices (for example, vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor, and which may receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit the data and instructions to the storage system, the at least one input device, and the at least one output device.
These computer programs (also called programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, device, and/or apparatus (for example, magnetic disks, optical disks, memories, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including machine-readable media that receive machine instructions as machine-readable signals. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having a display device (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, speech input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
A computer system may include clients and servers. Clients and servers are generally remote from each other and usually interact through a communication network. The relationship of client and server arises from computer programs that run on the respective computers and have a client-server relationship with each other.
It should be understood that steps may be reordered, added or deleted using the various forms of processes shown above. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and no limitation is imposed herein.
The above specific implementations do not constitute a limitation on the protection scope of the present application. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions can be made according to design requirements and other factors. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall be included within the protection scope of the present application.

Claims (16)

  1. A method for generating broadcast speech, comprising:
    obtaining, from a voice package, speech matching a scene, and obtaining a broadcast template preconfigured for the scene;
    filling the broadcast template with the speech to generate broadcast speech.
  2. The method according to claim 1, wherein the speech comprises at least one of address speech, style speech and knowledge speech.
  3. The method according to claim 1, wherein the obtaining, from a voice package, speech matching a scene comprises:
    determining keywords of the scene;
    obtaining, from the voice package, speech matching the keywords of the scene.
  4. The method according to claim 1, wherein the obtaining a broadcast template preconfigured for the scene comprises:
    determining at least one broadcast template preconfigured for the scene and attribute information of each broadcast template, the broadcast template comprising a combination of at least one kind of speech;
    selecting, according to the attribute information of each broadcast template and the voice package, one broadcast template configured for the scene from the at least one broadcast template.
  5. The method according to claim 1, wherein the filling the broadcast template with the speech to generate broadcast speech comprises:
    filling the broadcast template with the speech to generate broadcast text;
    performing speech synthesis on the broadcast text by using timbre information in the voice package to obtain the broadcast speech.
  6. The method according to claim 2, wherein the style speech in the voice package is mined in advance in the following manner:
    obtaining search keywords by concatenating preset style keywords and keywords of a scene;
    selecting style speech candidates from search result texts corresponding to the search keywords;
    obtaining a result of correcting the style speech candidates to obtain the style speech.
  7. The method according to claim 2, wherein the knowledge speech in the voice package is mined in advance in the following manner:
    obtaining a knowledge graph associated with the voice package;
    obtaining, from the knowledge graph, knowledge nodes matching a scene;
    generating knowledge speech for the corresponding scene by using the obtained knowledge nodes and a speech template for the corresponding scene.
  8. An apparatus for generating broadcast speech, comprising:
    a speech acquisition module configured to obtain, from a voice package, speech matching a scene;
    a template acquisition module configured to obtain a broadcast template preconfigured for the scene;
    a speech generation module configured to fill the broadcast template with the speech to generate broadcast speech.
  9. The apparatus according to claim 8, wherein the speech comprises at least one of address speech, style speech and knowledge speech.
  10. The apparatus according to claim 8, wherein the speech acquisition module is specifically configured to determine keywords of the scene, and obtain, from the voice package, speech matching the keywords of the scene.
  11. The apparatus according to claim 8, wherein the template acquisition module is specifically configured to determine at least one broadcast template preconfigured for the scene and attribute information of each broadcast template, the broadcast template comprising a combination of at least one kind of speech, and select, according to the attribute information of each broadcast template and the voice package, one broadcast template configured for the scene from the at least one broadcast template.
  12. The apparatus according to claim 11, wherein the speech generation module comprises:
    a text generation submodule configured to fill the broadcast template with the speech to generate broadcast text;
    a speech synthesis submodule configured to perform speech synthesis on the broadcast text by using timbre information in the voice package to obtain the broadcast speech.
  13. The apparatus according to claim 9, further comprising a first mining module configured to mine the style speech in the voice package in advance in the following manner:
    obtaining search keywords by concatenating preset style keywords and keywords of a scene;
    selecting style speech candidates from search result texts corresponding to the search keywords;
    obtaining a result of correcting the style speech candidates to obtain the style speech.
  14. The apparatus according to claim 9, further comprising a second mining module configured to mine the knowledge speech in the voice package in advance in the following manner:
    obtaining a knowledge graph associated with the voice package;
    obtaining, from the knowledge graph, knowledge nodes matching a scene;
    generating knowledge speech for the corresponding scene by using the obtained knowledge nodes and a speech template for the corresponding scene.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method according to any one of claims 1-7.
  16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-7.
PCT/CN2021/097840 2020-10-15 2021-06-02 Method, apparatus, device and computer storage medium for generating broadcast speech WO2022077927A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2021576765A JP2023502815A (ja) 2020-10-15 2021-06-02 放送音声を生成する方法、装置、機器、およびコンピュータ記憶媒体
EP21827167.4A EP4012576A4 (en) 2020-10-15 2021-06-02 METHOD AND APPARATUS FOR GENERATING A BROADCAST VOICE, AND DEVICE AND COMPUTER STORAGE MEDIA
KR1020217042726A KR20220051136A (ko) 2020-10-15 2021-06-02 방송 음성을 생성하는 방법, 장치, 기기 및 컴퓨터 기록 매체
US17/622,922 US20220406291A1 (en) 2020-10-15 2021-06-02 Method for generating broadcast speech, device and computer storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011105935.8 2020-10-15
CN202011105935.8A CN112269864B (zh) 2020-10-15 2020-10-15 生成播报语音的方法、装置、设备和计算机存储介质

Publications (1)

Publication Number Publication Date
WO2022077927A1 (zh)

Family

ID=74338621

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097840 WO2022077927A1 (zh) 2020-10-15 2021-06-02 生成播报语音的方法、装置、设备和计算机存储介质

Country Status (6)

Country Link
US (1) US20220406291A1 (zh)
EP (1) EP4012576A4 (zh)
JP (1) JP2023502815A (zh)
KR (1) KR20220051136A (zh)
CN (1) CN112269864B (zh)
WO (1) WO2022077927A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112269864B (zh) * 2020-10-15 2023-06-23 北京百度网讯科技有限公司 生成播报语音的方法、装置、设备和计算机存储介质
CN113452853B (zh) * 2021-07-06 2022-11-18 中国电信股份有限公司 语音交互方法及装置、电子设备、存储介质
CN115063999A (zh) * 2022-05-23 2022-09-16 江苏天安智联科技股份有限公司 一种基于车联网的智慧导航系统


Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH11102196A (ja) * 1997-09-29 1999-04-13 Ricoh Co Ltd 音声対話システム及び音声対話方法及び記録媒体
US9189742B2 (en) * 2013-11-20 2015-11-17 Justin London Adaptive virtual intelligent agent
US10692497B1 (en) * 2016-11-01 2020-06-23 Scott Muske Synchronized captioning system and methods for synchronizing captioning with scripted live performances
US10645460B2 (en) * 2016-12-30 2020-05-05 Facebook, Inc. Real-time script for live broadcast
CN108897848A (zh) * 2018-06-28 2018-11-27 北京百度网讯科技有限公司 机器人互动方法、装置及设备
CN109273001B (zh) * 2018-10-25 2021-06-18 珠海格力电器股份有限公司 一种语音播报方法、装置、计算装置和存储介质
CN110266981B (zh) * 2019-06-17 2023-04-18 深圳壹账通智能科技有限公司 视频录制的方法、装置、计算机设备和存储介质
CN110399457B (zh) * 2019-07-01 2023-02-03 吉林大学 一种智能问答方法和系统
CN110600000B (zh) * 2019-09-29 2022-04-15 阿波罗智联(北京)科技有限公司 语音播报方法、装置、电子设备及存储介质
CN110674241B (zh) * 2019-09-30 2020-11-20 百度在线网络技术(北京)有限公司 地图播报的管理方法、装置、电子设备和存储介质
CN110808028B (zh) * 2019-11-22 2022-05-17 芋头科技(杭州)有限公司 嵌入式语音合成方法、装置以及控制器和介质
CN111339246B (zh) * 2020-02-10 2023-03-21 腾讯云计算(北京)有限责任公司 查询语句模板的生成方法、装置、设备及介质
CN111506770B (zh) * 2020-04-22 2023-10-27 新华智云科技有限公司 一种采访视频集锦生成方法和系统
CN111578965B (zh) * 2020-04-30 2022-07-08 百度在线网络技术(北京)有限公司 导航播报信息处理方法、装置、电子设备和存储介质
CN111583931A (zh) * 2020-04-30 2020-08-25 中国银行股份有限公司 业务数据处理方法及装置
CN111681640B (zh) * 2020-05-29 2023-09-15 阿波罗智联(北京)科技有限公司 播报文本的确定方法、装置、设备和介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9978359B1 (en) * 2013-12-06 2018-05-22 Amazon Technologies, Inc. Iterative text-to-speech with user feedback
CN110017847A (zh) * 2019-03-21 2019-07-16 腾讯大地通途(北京)科技有限公司 一种自适应导航语音播报方法、装置及系统
CN109979457A (zh) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 一种应用于智能对话机器人的千人千面的方法
CN110534088A (zh) * 2019-09-25 2019-12-03 招商局金融科技有限公司 语音合成方法、电子装置及存储介质
CN111259125A (zh) * 2020-01-14 2020-06-09 百度在线网络技术(北京)有限公司 语音播报的方法和装置、智能音箱、电子设备、存储介质
CN112269864A (zh) * 2020-10-15 2021-01-26 北京百度网讯科技有限公司 生成播报语音的方法、装置、设备和计算机存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4012576A4

Also Published As

Publication number Publication date
CN112269864A (zh) 2021-01-26
JP2023502815A (ja) 2023-01-26
EP4012576A4 (en) 2022-08-17
EP4012576A1 (en) 2022-06-15
US20220406291A1 (en) 2022-12-22
CN112269864B (zh) 2023-06-23
KR20220051136A (ko) 2022-04-26

Similar Documents

Publication Publication Date Title
WO2022077927A1 (zh) Method, apparatus, device and computer storage medium for generating broadcast speech
US11983638B2 (en) Example-driven machine learning scheme for dialog system engines
CN113571058B (zh) 语音动作可发现性系统
CN107112013B (zh) 用于创建可定制对话系统引擎的平台
CN108701454B (zh) 对话系统中的参数收集和自动对话生成
JP6987814B2 (ja) 自然言語会話に関連する情報の視覚的提示
US10860289B2 (en) Flexible voice-based information retrieval system for virtual assistant
CN114303132A (zh) 在虚拟个人助手中使用唤醒词进行上下文关联和个性化的方法和系统
US11514907B2 (en) Activation of remote devices in a networked system
US11551676B2 (en) Techniques for dialog processing using contextual data
WO2021068467A1 (zh) 语音包的推荐方法、装置、电子设备和存储介质
US20210158814A1 (en) Interfacing with applications via dynamically updating natural language processing
KR20220062360A (ko) 동적으로 업데이트되는 자연어 처리를 통한 애플리케이션과의 인터페이스
JP2024063034A (ja) オーディオクエリのオーバーラップ処理の協調

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2021576765

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 2021827167

Country of ref document: EP

Effective date: 20211227

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21827167

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE