WO2023273749A1 - Broadcasting text generation method and apparatus, and electronic device - Google Patents

Broadcasting text generation method and apparatus, and electronic device

Info

Publication number
WO2023273749A1
Authority
WO
WIPO (PCT)
Prior art keywords
broadcast
length parameter
text
information
target
Prior art date
Application number
PCT/CN2022/095805
Other languages
French (fr)
Chinese (zh)
Inventor
Chen Kaiji (陈开济)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202280029750.4A (published as CN117203703A)
Publication of WO2023273749A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence (AI), and in particular to a method, device and electronic device for generating broadcast text.
  • A voice assistant or virtual assistant is agent software that can perform tasks or services on behalf of an individual, and is widely used in devices such as smartphones, smart speakers, and smart vehicle terminals (electronic control units, ECUs).
  • A voice assistant or virtual assistant provides a voice user interface (VUI) and completes corresponding tasks or provides related services according to the user's voice command input. After executing the voice command issued by the user, the voice assistant generates broadcast text and produces the corresponding broadcast voice through a text-to-speech (TTS) module, informing the user of the broadcast content and guiding the user to continue using the device.
  • The broadcast text of current voice assistants is generally fixed: when interacting with different users, there is no difference in the broadcast voice or broadcast text. How to provide users with broadcasts that conform to their personal usage habits, and thereby improve the naturalness of interaction, is an urgent problem to be solved.
  • the embodiments of the present application provide a method, device, terminal device and system for generating broadcast text.
  • an embodiment of the present application provides a method for generating broadcast text, the method comprising: receiving a user's voice command; acquiring the broadcast content corresponding to the voice command; and generating a target broadcast text according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
  • the generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as input to a model, which outputs the target broadcast text; the target broadcast text is a broadcast text whose duration matches the broadcast length parameter.
  • Based on the broadcast length parameter, the model can provide users with voice assistant broadcast texts that conform to their personal historical usage habits, delivering a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
  • the model is a generative model or a retrieval model. Generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as input to the generative model, which outputs a target broadcast text whose duration matches the broadcast length parameter; or using the broadcast content and the broadcast length parameter as input to the retrieval model, which retrieves a text template of limited length from a predefined template library according to the broadcast length parameter and, from that template, outputs a target broadcast text whose duration matches the historical listening duration information.
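The retrieval-model variant above can be illustrated with a minimal sketch: pick, from a predefined template library, the template whose estimated spoken duration best matches the broadcast length parameter, then fill in the broadcast content. The template texts, the assumed words-per-second speech rate, and all function names below are illustrative assumptions, not the implementation specified by the patent.

```python
# Hypothetical template library: same content at three verbosity levels.
TEMPLATE_LIBRARY = [
    "{city} {temp}.",                                                      # terse
    "Today in {city}: {temp}.",                                            # medium
    "Good day! The weather in {city} today is {temp}, have a nice trip.",  # verbose
]

WORDS_PER_SECOND = 2.5  # assumed average TTS speech rate


def estimate_duration(text: str) -> float:
    """Rough spoken duration of a text (placeholders counted as one word)."""
    return len(text.split()) / WORDS_PER_SECOND


def retrieve_template(target_duration: float) -> str:
    """Return the template whose estimated duration is closest to the target."""
    return min(TEMPLATE_LIBRARY,
               key=lambda t: abs(estimate_duration(t) - target_duration))


def generate_broadcast_text(content: dict, target_duration: float) -> str:
    """Retrieve a length-matched template and fill in the broadcast content."""
    return retrieve_template(target_duration).format(**content)
```

With a short historical listening duration the terse template is selected; with a long one, the verbose template.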
  • the broadcast length parameter is associated with device information
  • the first broadcast length parameter is determined according to the device information
  • the target broadcast text is generated according to the broadcast length parameter and the broadcast content, specifically including: generating a first target broadcast text according to the first broadcast length parameter and the broadcast content; the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  • the broadcast length parameter is associated with scene information
  • the second broadcast length parameter is determined according to the scene information
  • the target broadcast text is generated according to the broadcast length parameter and the broadcast content, specifically including: generating a second target broadcast text according to the second broadcast length parameter and the broadcast content; the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  • the broadcast length parameter is associated with device information and scene information
  • the third broadcast length parameter is determined according to the device information and scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a third target broadcast text according to the third broadcast length parameter and the broadcast content; the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  • the broadcast length parameter is associated with device information and/or scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information and/or scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content.
  • In this way, the broadcast length parameter obtained through the classification model conforms to the user's personal historical usage habits and adapts to the device and/or the current scene of the voice assistant broadcast, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
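As a toy stand-in for the classification model described above, the sketch below maps (historical listening duration, device, scene) features to a discrete length category. The feature encoding, bias tables, and thresholds are invented for illustration; the patent's actual model would be trained from logged user data.

```python
# Assumed per-device and per-scene offsets (seconds); purely illustrative.
DEVICE_BIAS = {"speaker": 1.0, "phone": 0.0, "car": -1.0}
SCENE_BIAS = {"rest": 1.0, "chat": 0.0, "driving": -1.5}


def classify_length(history_avg_s: float, device: str, scene: str) -> str:
    """Map user history, device, and scene to a broadcast-length category."""
    score = history_avg_s + DEVICE_BIAS.get(device, 0.0) + SCENE_BIAS.get(scene, 0.0)
    if score < 3.0:
        return "short"
    if score < 8.0:
        return "medium"
    return "long"
```

A user who historically listens briefly while driving would get a "short" category; a user who listens long on a smart speaker at rest would get "long".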
  • the broadcast length parameter is associated with device information and/or scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information and/or scene information into a regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content.
  • In this way, the regression model can be used to generate a voice assistant broadcast that conforms to personal historical usage habits and adapts to the device and/or the current scene, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
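Unlike the classification variant, the regression model above outputs a continuous length limit. In the sketch below a one-feature least-squares fit on invented logged data stands in for the trained regression model; the data points and function names are assumptions for illustration only.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


# Toy history: (average historical listening duration, preferred length), seconds.
history = [(2.0, 2.5), (4.0, 4.5), (6.0, 6.5), (8.0, 8.5)]
a, b = fit_linear([h[0] for h in history], [h[1] for h in history])


def predict_length_limit(history_avg_s: float) -> float:
    """Continuous length limit value (the 'fifth broadcast length parameter')."""
    return a * history_avg_s + b
```

A real system would add device and scene features as further regressors rather than fitting on listening duration alone.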
  • the broadcast length parameter is associated with device information and/or scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: linearly encoding the device information, scene information and/or historical listening duration information respectively and fusing the encodings to obtain a sixth broadcast length parameter, where the sixth broadcast length parameter is a characterization vector of the broadcast length parameter; and using the sixth broadcast length parameter, the broadcast content, and whether the voice instruction is executable or non-executable as the input of a pre-trained language model, which outputs a sixth target broadcast text.
  • In this way, the pre-trained language model can be used to generate voice assistant broadcasts that conform to personal historical usage habits and adapt to the device and/or the current scene, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
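The "linearly encode each input, then fuse" step above can be sketched numerically as follows. Each feature group gets its own linear layer, and the encodings are fused by element-wise summation into one characterization vector. The weights here are random stand-ins; in the patent's scheme they would be learned jointly with the pre-trained language model, and the fusion method (sum vs. concatenation) is an assumption.

```python
import random

random.seed(0)
DIM = 4  # assumed size of the characterization vector


def linear_encode(features, weights, bias):
    """One linear layer: out[i] = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def random_layer(in_dim):
    """Random stand-in weights for an untrained linear layer."""
    return ([[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(DIM)],
            [0.0] * DIM)


dev_layer, scene_layer, dur_layer = random_layer(2), random_layer(3), random_layer(1)


def fuse(device_feats, scene_feats, duration_feats):
    """Element-wise sum of the three linear encodings (the 'sixth' parameter)."""
    encoded = [linear_encode(device_feats, *dev_layer),
               linear_encode(scene_feats, *scene_layer),
               linear_encode(duration_feats, *dur_layer)]
    return [sum(vals) for vals in zip(*encoded)]
```

The resulting vector would then be fed, together with the broadcast content and the executability flag, to the language model.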
  • the acquiring the broadcast content corresponding to the voice command includes: acquiring intent and slot information according to the voice command; determining, according to the intent and slot information, whether the voice command is executable; and, in the case that the voice command is not executable, generating broadcast content, where the broadcast content is inquiry information. In this way, when the voice command cannot be executed, the broadcast content with which the voice assistant queries the user can be obtained.
  • the determining the broadcast content according to the dialog state includes: acquiring intent and slot information according to the voice instruction; determining whether the voice instruction is executable; in the case that the voice instruction is executable, determining the third-party service that executes the intent; and obtaining the broadcast content from the third-party service, where the broadcast content is the result information corresponding to the voice instruction.
  • That is, when the voice command is executable, the broadcast content returned after the third-party service executes the voice command is obtained.
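The executable/non-executable branching above can be sketched as a small dialog step: a command is treated as executable when all required slots are filled, in which case a third-party service returns the result; otherwise the broadcast content is an inquiry. The required-slot table, the service stub, and the wording are illustrative assumptions.

```python
# Hypothetical required-slot table for one intent.
REQUIRED_SLOTS = {"check_weather": ["location", "time"]}


def weather_service(slots):
    """Stand-in for the third-party service that executes the intent."""
    return f"Sunny in {slots['location']} {slots['time']}, 22 degrees."


def get_broadcast_content(intent, slots):
    """Return result info if executable, else an inquiry for the missing slot."""
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    if missing:  # not executable: broadcast content is inquiry information
        return f"Please tell me the {missing[0]}."
    return weather_service(slots)  # executable: result from the service
```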
  • the method further includes: controlling the broadcast speed of the target broadcast text according to the broadcast length parameter. In this way, it is possible to generate broadcast voices that conform to personal historical usage habits and adapt to the device and/or the current scene, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
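One plausible reading of the speed control above: given the target broadcast text and the duration the user historically listens for, derive a TTS speed factor so the broadcast fits that duration. The words-per-second baseline and the clamping range below are assumptions to keep the speech intelligible, not values from the patent.

```python
BASE_WPS = 2.5                     # assumed normal TTS rate, words per second
MIN_SPEED, MAX_SPEED = 0.75, 1.5   # assumed intelligibility limits


def speed_factor(text: str, target_duration_s: float) -> float:
    """TTS speed multiplier so the text roughly fits the target duration."""
    natural = len(text.split()) / BASE_WPS
    factor = natural / target_duration_s
    return max(MIN_SPEED, min(MAX_SPEED, factor))
```

A 10-word broadcast at the assumed rate takes 4 s naturally; a user who listens for only 1 s gets the maximum speed-up, while a very long target duration clamps to the minimum speed rather than slowing to a crawl.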
  • the method further includes: recording the current broadcast duration of the target broadcast text, and obtaining the historical listening duration information. In this way, a personalized broadcast experience that conforms to personal historical usage habits can be obtained, and the naturalness of voice assistant interaction can be improved.
  • an embodiment of the present application provides a method for broadcasting text, the method comprising: receiving a user's voice command; generating a target broadcast text corresponding to the voice command; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information.
  • the beneficial effect of controlling the broadcast speed of the target broadcast text according to the broadcast length parameter is the same as that of the embodiments of the first aspect of the present application, in which the target broadcast text is generated from the broadcast length parameter, and will not be repeated here.
  • the broadcast length parameter is associated with device information
  • the first broadcast length parameter is determined according to the device information
  • the broadcast speed of the target broadcast text is controlled according to the broadcast length parameter, including: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter; the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  • the broadcast length parameter is associated with scene information
  • the second broadcast length parameter is determined according to the scene information
  • the broadcast speed of the target broadcast text is controlled according to the broadcast length parameter, including: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter; the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  • the broadcast length parameter is associated with device information and scene information, and a third broadcast length parameter is determined according to the device information and scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter; the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  • the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, device information and/or scene information into the classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
  • the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, device information and/or scene information into the regression model; Outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
  • an embodiment of the present application provides an electronic device, including: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory; when the programs stored in the memory are executed, the processor is configured to execute the method described in any one of the foregoing embodiments.
  • an embodiment of the present application provides a storage medium, where instructions are stored in the storage medium; when the instructions are run on a terminal, the terminal is made to execute the method described in any one of the foregoing embodiments.
  • Fig. 1 is a schematic diagram of an artificial intelligence main frame
  • FIG. 2 is a schematic diagram of the application system of the voice assistant proposed in the embodiment of the present application.
  • FIG. 3 is a functional architecture diagram of the voice assistant in the embodiment of the present application.
  • FIG. 4 is a flowchart of a method for generating broadcast text proposed in Embodiment 1 of the present application.
  • FIG. 5 is an application schematic diagram of a method for generating broadcast text proposed in Embodiment 1 of the present application.
  • FIG. 6 is a schematic structural diagram of a random forest-based machine learning model based on a method for generating broadcast text proposed in Embodiment 3 of the present application;
  • FIG. 7 is a schematic diagram of a structure of a typical pre-trained language model based on a method for generating broadcast text proposed in Embodiment 4 of the present application.
  • Natural language generation is a part of natural language processing, which generates natural language from machine representation systems such as knowledge bases or logical forms.
  • NLG can be regarded as the reverse of natural language understanding (NLU): NLU must determine the meaning of the input language and produce a machine representation, while NLG must decide how to convert the conceptual machine representation into natural language that users can understand.
  • the user wakes up the voice assistant and issues a voice command related to querying the weather.
  • the voice assistant uses natural language understanding (NLU) to interpret the weather-related voice command issued by the user, classifies the voice command according to a natural language classification system similar to Table 1, queries the weather according to the classification result, and selects the corresponding template according to the weather query result to generate the broadcast text corresponding to the weather, or generates the broadcast text corresponding to the weather information category and its associated attributes; the broadcast text content matches the category to which the voice command belongs.
  • This solution generates different types of broadcast text according to the different voice commands input by the user, but the content of the broadcast text is related only to the type of voice command; it does not consider the user's personal usage habits, differences between devices, or differences in the user's current scene, and so cannot provide a weather broadcast experience personalized to each user.
  • the embodiment of the present application proposes a method for generating broadcast text, which relates to the field of AI and is applicable to voice assistants.
  • the voice assistant can generate a personalized broadcast text based on the user's personal usage habits, device differences and/or the environment, and generate broadcast voice information at a corresponding speech rate through TTS, informing the user of the broadcast content and guiding the user to continue using the device.
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements. The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • Intelligent information chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
  • IT value chain reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of artificial intelligence, the realization of information provision and processing technology, to the systematic industrial ecological process.
  • the infrastructure 10 provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • sensors are used to communicate with the outside to obtain data streams;
  • smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs) are used to provide training, computation and execution capabilities;
  • basic platforms are used for cloud storage and cloud computing, network interconnection, etc., including the distributed computing framework and network-related platform guarantees and support.
  • the data 11 on the upper layer of the infrastructure 10 is used to represent data sources in the field of artificial intelligence.
  • the data 11 of the upper layer of the infrastructure 10 comes from the voice commands acquired on the terminal side, the equipment information of the terminal used, and the scene information obtained through sensor communication with the outside .
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • the data processing process includes front-end processing, speech recognition (ASR), semantic understanding (NLU), dialog management (DM), natural language generation (NLG), speech synthesis (TTS) and other processing.
  • some general-purpose capabilities can be formed based on the results of data processing, such as algorithms or a general-purpose system.
  • a personalized broadcast text can be generated based on the result of the data processing, and a broadcast voice at the corresponding speech rate can be generated, providing a broadcast experience personalized to each user.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they package the overall artificial intelligence solution, commercialize intelligent information decision-making, and realize practical applications. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, automatic driving, and smart terminals.
  • a broadcast text generation method proposed in the embodiment of the present application can be applied to voice assistants of smart devices in the fields of smart terminals, smart homes, smart security, and automatic driving.
  • These smart devices provide a voice user interface (VUI) and complete corresponding tasks or provide related services according to the voice commands input by users.
  • smart devices include smart TVs, smart speakers, robots, smart air conditioners, smart smoke alarms, smart fire extinguishers, smart vehicle terminals, mobile phones, tablets, laptops, desktop computers, and all-in-one machines.
  • FIG. 2 is a schematic diagram of the application system of the voice assistant proposed by the embodiment of the present application.
  • the data collection device 260 is used to collect information such as user information, device information, scene information and/or historical listening duration, and store the information in the database 230 .
  • the data acquisition device 260 corresponds to the sensors of the infrastructure in Figure 1, including devices such as motion sensors, displacement sensors, and infrared sensors that communicate with the smart device, and is used to collect the user's current scene information, such as exercising, in a meeting, resting or chatting.
  • the data collection device 260 also includes a camera device, GPS, and other devices that are communicatively connected with the smart device, and is used to collect scene information of the user's current location or place, such as in a vehicle, living room or bedroom.
  • the data collection device 260 also includes a timer, which is used to record the start time, end time and broadcast duration of the broadcast voice.
  • the broadcast duration is recorded in the user information as the user's historical listening duration.
  • the client device 240 corresponds to the basic platform of the infrastructure in FIG. 1, and is used for interacting with the user, obtaining the voice command sent by the user, broadcasting the broadcast content of the voice command, showing the broadcast content to the user, and storing the information in the database 230;
  • the client device 240 includes devices providing a voice user interface (VUI), such as a smartphone or a smart vehicle terminal, equipped with a display screen, a microphone, a speaker, a button, a Bluetooth earphone microphone, and the like.
  • the microphone can be a sound-pickup device, including a microphone integrated in the smart device, a microphone or microphone array connected to the smart device, or a microphone or microphone array connected to the smart device through a short-range connection technology, and is used to collect the voice commands issued by the user.
  • the training device 220 corresponds to the smart chip of the infrastructure in FIG. 1 , and trains the voice assistant 201 based on data maintained in the database 230 such as user information, device information, scene information and/or historical broadcast duration.
  • the voice assistant 201 can provide a personalized broadcast text in the voice dialogue scene between the user and the client device 240 , and generate a broadcast voice corresponding to the speech rate, inform the user of the broadcast content and guide the user to continue using the client device 240 .
  • the execution device 210 corresponds to the smart chip of the infrastructure in FIG. 1 , and is equipped with an I/O interface 212 for data interaction with the client device 240 .
  • After voice command information is input, the execution device 210 outputs the broadcast content to the client device 240 through the I/O interface 212, for example broadcasting it through the loudspeaker, or displaying it through the voice user interface (VUI) on the screen of a smartphone, smart vehicle terminal, or similar device.
  • the execution device 210 may call data, codes, etc. in the data storage system 250 , and may also store data, code instructions, etc. in the data storage system 250 .
  • the training device 220 and the executing device 210 may be the same smart chip or different smart chips.
  • the database 230 is a data collection of user information, device information and/or scene information stored on a storage medium.
  • the voice assistant 201 is an agent software for executing voice instructions or services.
  • the execution device 210 runs the voice assistant 201. After acquiring the voice instruction issued by the user, it generates a target broadcast text of personalized length according to user information, device information and/or scene information, controls the speech rate of the broadcast voice, informs the user of the broadcast content, and guides the user to continue using the device.
  • the I/O interface 212 returns the target broadcast text of personalized length generated by the voice assistant 201 to the client device 240 as output data, and the client device 240 displays the broadcast text and broadcasts it to the user at a corresponding speech speed.
  • the training device 220 acquires the training data and corpus stored in the database 230 and, based on acquired historical records such as user information, device information and/or scene information, trains the voice assistant 201 with the training target of outputting a broadcast text whose length matches the user's historical listening record, so as to output better target broadcast text.
  • the user can input voice instruction information to the execution device 210 , for example, can operate in a voice user interface (VUI) provided by the client device 240 .
  • the client device 240 can automatically input instructions to the I/O interface 212 and obtain broadcast content. If the client device 240 needs to obtain user authorization for automatically inputting instruction information, the user can set corresponding permissions in the client device 240 .
  • the user can view or listen to the broadcast content output by the execution device 210 on the client device 240 , and the specific presentation form may be specific ways such as display, wake-up sound, and broadcast.
  • the client device 240 can also serve as a voice data collection terminal and store the collected wake-up sound or voiceprint data of the user into the database 230 .
  • Figure 2 is only a schematic diagram of a system application scenario provided by the embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. The system shown in Figure 2 may correspond to one or more device entities.
  • the data storage system 250 is an external memory relative to the execution device 210 . In other cases, the data storage system 250 may also be placed in the execution device 210 .
  • FIG. 3 is a functional architecture diagram of the voice assistant in the embodiment of the present application.
  • the voice assistant 201 comprises a front-end processing module 31, a speech recognition module 32, a semantic understanding module 33, a dialog state module 34, a dialog strategy learning module 35, a natural language generation module 36, a speech synthesis module 37 and a dialog output module 38.
  • the front-end processing module 31 is used to process the voice command input by the user to obtain the data format required by the network model for use by the voice recognition module 32 .
  • the front-end processing module 31 obtains the voice command input by the user in a compressed audio format and decodes it into an audio signal in PCM format; it performs separation, denoising and feature extraction on the audio signal using voiceprint or other features, and obtains mel-frequency cepstral coefficient (MFCC) or filter-bank audio feature vectors through audio processing algorithms such as framing, windowing, and the short-time Fourier transform.
  • the front-end processing module 31 is generally disposed on the terminal side.
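The framing, windowing, short-time Fourier transform, and mel filter-bank steps above can be sketched numerically as follows. Frame sizes, hop length, FFT size and the mel-bank parameters are common defaults for 16 kHz speech, not values specified by the patent, and the helper names are assumptions.

```python
import numpy as np


def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D PCM signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_filters=26, n_fft=512, sr=16000):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(
            0, 1, bins[i] - bins[i - 1], endpoint=False)
        fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(
            1, 0, bins[i + 1] - bins[i], endpoint=False)
    return fb


def filterbank_features(signal, n_fft=512):
    """Log mel filter-bank features: framing, windowing, STFT, mel bank, log."""
    frames = frame_signal(signal) * np.hamming(400)           # framing + windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft   # short-time spectrum
    feats = power @ mel_filter_bank(n_fft=n_fft).T            # mel filter bank
    return np.log(feats + 1e-10)
```

Applying a discrete cosine transform to these log filter-bank features would yield the MFCCs mentioned in the passage.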
  • the speech recognition (automatic speech recognition, ASR) module 32 is used to obtain the audio feature vectors produced by the front-end processing module 31 and convert them into text through an acoustic model and a language model for the semantic understanding module 33 to understand.
  • the acoustic model is used to classify the acoustic features and correspond to (decode) phonemes or words
  • the language model is used to decode phonemes or words into a complete sentence.
  • the acoustic model and the language model process the audio feature vectors in series: the audio feature vectors are converted into phonemes or words through the acoustic model, the phonemes or words are then converted into a text sequence through the language model, and the text corresponding to the user's speech is output.
  • the ASR module 32 can be implemented in an end-to-end manner, wherein the acoustic model and the language model adopt a neural network structure, and the acoustic model and the language model are jointly trained so that the result of the training is to output Chinese characters corresponding to the user's voice sequence.
  • the acoustic model may be modeled using a Hidden Markov Model (HMM), and the language model may be an n-gram model.
  • the semantic understanding (natural language understanding, NLU) module 33 is used to convert the text or Chinese character sequence corresponding to the user's voice into structured information, wherein the structured information includes machine-executable intention information and recognizable slot information. Its purpose is to obtain the semantic representation of natural language through the analysis of syntax, semantics and pragmatics.
  • the intent information refers to the task that needs to be performed by the voice command issued by the user;
  • the slot information refers to the parameter information that needs to be determined to perform the task.
  • the user asks the voice assistant 201 "What's the temperature in Nanjing today?"
  • the NLU module 33 understands the text corresponding to the voice command, and obtains the intent of the voice command as "check the weather", with the slots "location: Nanjing" and "time: today".
  • the NLU module 33 can use a classifier to classify the text corresponding to the voice instruction into the intent information that the voice assistant 201 can support, and then use the sequence labeling model to label the slot information in the text.
  • the classifier can be a model usable for classification in traditional machine learning algorithms, for example, a naive Bayes (NB) model, a random forest (RF) model, an SVM classification model, a KNN classification model, etc.; it can also be a deep learning text classification model, for example, the FastText model, TextCNN, etc.
  • the sequence labeling model is used to label each element in the text or Chinese character sequence and output a tag sequence, which can be used to indicate the beginning, end and type of each slot.
  • the sequence labeling model can be one of the following models: linear model, hidden Markov model, maximum entropy Markov model, conditional random field, etc.
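Decoding slots from such a tag sequence can be sketched as follows; the BIO tag scheme and the example tokens are illustrative assumptions (the embodiment does not fix a tag format):

```python
def decode_slots(tokens, tags):
    """Recover slot values from a BIO tag sequence produced by a sequence labeling model."""
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                            # beginning of a slot
            current = tag[2:]
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:   # continuation of the slot
            slots[current] += token
        else:                                               # "O" closes any open slot
            current = None
    return slots

# "What's the temperature in Nanjing today?"
tokens = ["What's", "the", "temperature", "in", "Nan", "jing", "today"]
tags   = ["O", "O", "O", "O", "B-location", "I-location", "B-time"]
print(decode_slots(tokens, tags))  # {'location': 'Nanjing', 'time': 'today'}
```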
  • the NLU module 33 may also use an end-to-end model to simultaneously output intent information and slot information.
  • the dialog state tracking (dialog state tracking, DST) module 34 is used to manage the dialog state of the voice assistant 201.
  • the DST module 34 uses the intent information and slot information of the current round of dialogue output by the NLU module 33 to maintain the current round of dialogue intention, filled slots and dialogue status in the multi-round dialogue scene.
  • the input of the DST module 34 is the last round of dialogue state, the broadcast content returned by the last round of third-party applications, and the intent information and slot information of the current round of dialogue, and the output is the current round of dialogue state.
  • the DST module 34 records the dialog history and dialog state of the voice assistant 201, and helps the voice assistant 201 understand the user's voice instruction in the current round of dialogue in combination with the dialog history recorded by the context manager (that is, the database 230 in FIG. 2), so as to give appropriate feedback.
  • the NLU module 33 outputs the intent of the current round of dialogue as "check the weather", with the slots "location: there" and "time:". Because the DST module 34 has recorded the first round of dialogue state, the system determines, in combination with the dialog history recorded by the context manager, that "there" in the slot "location: there" refers to "Nanjing", and fills "Nanjing" into the location slot.
  • the DST module 34 outputs the dialogue state information of the current round, including intent information (check the weather), filled slots (Nanjing) and unfilled slots (time:).
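The slot-carrying behavior described above can be sketched as a simple state-merge function; the dictionary shapes here are illustrative assumptions, not the module's actual data structures:

```python
def update_dialog_state(prev_state, nlu_output):
    """Merge the current turn's intent/slots into the tracked dialog state;
    slots left empty this turn keep their value from earlier turns."""
    state = {
        "intent": nlu_output["intent"] or prev_state.get("intent"),
        "slots": dict(prev_state.get("slots", {})),
    }
    for name, value in nlu_output["slots"].items():
        if value:                                  # only overwrite with concrete values
            state["slots"][name] = value
        state["slots"].setdefault(name, None)      # record unfilled slots too
    return state

prev = {"intent": "check_weather", "slots": {"location": "Nanjing"}}
turn = {"intent": "check_weather", "slots": {"location": "", "time": "today"}}
print(update_dialog_state(prev, turn))
```

Here the empty "location" slot of the current turn is filled with "Nanjing" carried over from the previous round, matching the "there" → "Nanjing" example above.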
  • the dialog policy learning (dialog policy learning, DPL) module 35 is used to determine the next action performed by the voice assistant 201, including asking the user, executing the user's instruction, recommending other user instructions, and generating a reply.
  • the DPL module 35 uses the dialog state information output by the DST module 34 to determine the next execution action.
  • the DPL module 35 may determine, according to the state of the current round of dialogue, that the next action to perform is to generate broadcast content asking the user.
  • For example, when the dialogue state information of the current round output by the DST module 34 contains an unfilled slot (time:), the DPL module 35 can determine that the next action is to ask the user "Which day?", maintaining the control logic of the dialogue system and ensuring that the dialogue can continue.
  • the execution action information is an action tag or structured information, such as "REQUEST-SLOT: date", indicating that the date should be queried from the user next.
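A minimal rule-based sketch of such a policy decision (the action-tag strings follow the "REQUEST-SLOT" example above; the overall rule is an illustrative assumption):

```python
def next_action(state):
    """Rule-based policy: ask for any unfilled slot, otherwise execute the intent."""
    for name, value in state["slots"].items():
        if value is None:
            return {"action": "REQUEST-SLOT", "slot": name}
    return {"action": "EXECUTE", "intent": state["intent"], "slots": state["slots"]}

state = {"intent": "check_weather", "slots": {"location": "Nanjing", "time": None}}
print(next_action(state))  # {'action': 'REQUEST-SLOT', 'slot': 'time'}
```

A learned DPL module would replace this rule with a trained model, but the output contract (an action tag plus parameters) is the same.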
  • the DPL module 35 can also determine, according to the current round of dialogue state, that the next action is to select an appropriate third-party application (app) to execute the voice command, send the intent and slot information to the selected third-party application, and obtain the execution result returned by the third-party application, where the execution result is the broadcast content corresponding to the voice command.
  • a third-party application is an application that can execute or fulfill the intent of the voice command according to the slot information and return the broadcast content, such as an app that can query the weather, an app that can provide product information, an app that can provide navigation or positioning information, etc.
  • the broadcast content determined by the DPL module 35 according to the current round of dialogue state, or returned by the third-party application (app) or server after executing the voice command according to the intent and slot information, can be used as an input parameter of the next round of dialogue state of the DST module 34, and can also be used as an input parameter of the NLG module 36.
  • the natural language generation (NLG) module 36 is a translator that converts structured information into natural language expressions, and is currently widely used in voice assistants.
  • the NLG module 36 is used to obtain the current dialogue state maintained by the DST module 34, the next action determined by the DPL module 35, and/or the broadcast content returned by the third-party application (app), and, combined with user information, device information and/or scene information, generate a target broadcast text of personalized length.
  • For example, the current dialog state maintained by the DST module 34 is intent information (check the weather), a filled slot (Nanjing) and an unfilled slot (time:), the next action determined by the DPL module 35 is to ask the user, and the broadcast text generated by the NLG module 36 is "Which day do you need to inquire about?"
  • the NLG module 36 inputs the current dialogue state and the broadcast content returned by the third-party application into a template matching the current intent, device or scene, and outputs a target broadcast text of the length configured by the template.
  • the NLG module 36 can also use a model-based (black-box) approach to output a target broadcast text of personalized length.
  • the user portrait (user profile, UP) module 213 is used to obtain user information by querying the data in the database 230 shown in FIG. 2, and to record in the user information information such as the user's historical listening duration of voice assistant broadcasts.
  • user information, also known as a user portrait, describes the user's usage habits by collecting data in dimensions such as the user's social attributes, consumption habits, preference characteristics, and system usage behavior, and analyzes these characteristics statistically to tap potential value information, so as to abstract a whole picture of the user, which can be used to recommend personalized content to the user or to provide services in line with the user's habits.
  • a device profile (device profile, DP) module 214 is used to obtain device information of the client device 240 shown in FIG. 2.
  • the scene perception (context awareness, CA) module 215 is used to obtain the current scene information through the data acquisition device 260 shown in FIG. 2, where the scene information includes the room category, background noise level, the user's current motion state, etc.
  • the CA module 215 , the DP module 214 , and the UP module 213 may also be external modules relative to the voice assistant 201 , which are not specifically limited here.
  • the voice assistant understands the user's voice command through the natural language understanding (NLU) module 33 and sends it to the corresponding third-party application (app) for execution; it can obtain the structured broadcast content returned by the third-party application, and uses the NLG module 36 to convert the returned structured broadcast content into a broadcast text for the TTS module to generate a broadcast voice informing the user of the broadcast content.
  • the speech synthesis (Text-to-Speech, TTS) module 37 is used to control the broadcast speed of the target broadcast text according to the broadcast length parameter, and the broadcast length parameter indicates historical listening duration information.
  • when the TTS module 37 converts the target broadcast text into broadcast voice, it introduces the broadcast length parameter and, combined with user information, device information and/or scene information, controls the speech rate of the broadcast, thereby limiting the broadcast duration of the target broadcast text; while ensuring the accuracy of speech generation, it also controls characteristics of the generated speech such as speech rate, timbre and volume.
  • the dialog output module 38 is configured to generate a corresponding broadcast card according to the target broadcast voice, and then present it to the user.
  • the embodiment of the present application proposes a method for generating broadcast text. The method is applied to a voice assistant: by receiving the user's voice command, the broadcast content corresponding to the voice command is obtained, and the target broadcast text is generated according to the broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
  • FIG. 4 is a flow chart of a method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in Figure 4, the voice assistant performs the following steps S401-S404.
  • the voice assistant 201 receives a user's voice instruction.
  • the voice assistant 201 performs front-end processing on the voice command "What's the temperature in Nanjing today?" to obtain an audio feature vector; recognizes the audio feature vector as text through an acoustic model and a language model; understands the text, obtaining the intent corresponding to the voice command as "check the weather" and the slots as "location: Nanjing" and "time: today"; and manages the dialog state, obtaining the current dialog state, including intent information, filled slots, and unfilled slots, according to the last round of dialog state, the content of the last round of broadcast, and the intent information and slot information corresponding to the current voice command, so as to determine whether the voice command can be executed.
  • the voice assistant 201 can determine, according to the current dialog state, that the voice command is executable and select the third-party application that executes the intent information; send the intent information and slot information corresponding to the voice command to the third-party application; and obtain the execution result returned by the third-party application (app) or server, where the execution result is the broadcast content corresponding to the current voice command.
  • For example, the user sends the voice command "What's the temperature in Nanjing today?" to the voice assistant 201; the voice assistant 201 selects an appropriate third-party application (app) to execute the voice command in combination with the intent information and slot information related to the user request, and the execution result related to the user request returned by the third-party application (app) is the structured broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"}.
  • the voice assistant 201 may generate the broadcast content according to the dialogue state.
  • the voice assistant 201 acquires the next action information determined by the DPL module 35, where the action information is an action tag or structured information, and determines the broadcast content to be "REQUEST-SLOT: date", indicating that the time should be asked of the user next.
  • the NLG module 36 can be a generative model: the broadcast content is used as the input of the generative model, the user's broadcast length parameter is used as an additional parameter, and the length of the output broadcast text is implicitly limited by the training data, so as to generate the target broadcast text, namely a broadcast text whose duration matches the broadcast length parameter.
  • the length or length range of the text generated by the generative model can be limited by the input broadcast length parameter: the broadcast content and the broadcast length parameter are used as the input of the model, and a target broadcast text of the limited length is output.
  • the NLG module 36 can be a retrieval model, which takes the broadcast content as input to the retrieval model, retrieves the corresponding template according to the broadcast content, and generates the target broadcast text through the retrieved template.
  • the broadcast content and the user's broadcast length parameter are used as the input of the retrieval model; the template corresponding to the broadcast content is retrieved in the predefined template library according to the length defined by the broadcast length parameter, and the retrieved template outputs the target broadcast text.
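A minimal sketch of such length-aware template retrieval; the template library, length-class threshold and placeholder names are illustrative assumptions, not the embodiment's actual templates:

```python
TEMPLATES = {
    ("check_weather", "concise"):  "{location}, {low} to {high} degrees {unit}.",
    ("check_weather", "moderate"): "{location} today: a low of {low} and a high of "
                                   "{high} degrees {unit}.",
}

def retrieve_template(intent, length_param):
    """Pick the template whose length class matches the broadcast length parameter."""
    length_class = "concise" if length_param <= 10 else "moderate"
    return TEMPLATES[(intent, length_class)]

content = {"temperature": "15-23", "unit": "C", "location": "Nanjing"}
low, high = content["temperature"].split("-")
text = retrieve_template("check_weather", 20).format(
    location=content["location"], low=low, high=high, unit="Celsius")
print(text)  # Nanjing today: a low of 15 and a high of 23 degrees Celsius.
```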
  • the broadcast length parameter may be determined according to an average value or a weighted average value of at least one piece of historical listening duration information.
  • the user portrait (UP) module 213 obtains the user information, obtains the historical listening duration information of each time the user listens to a voice broadcast, and obtains the broadcast length parameter from the statistical average or weighted average of the historical listening duration information; the minimum/maximum value or the latest value of the historical listening duration can also be used as the broadcast length parameter.
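The average and weighted-average options above can be sketched as follows (the weight vector favoring recent sessions is an illustrative assumption):

```python
def broadcast_length_param(history, weights=None):
    """Derive the broadcast length parameter from historical listening durations
    (seconds): plain average, or a weighted average, e.g. favouring recent sessions."""
    if weights is None:
        return sum(history) / len(history)
    return sum(w * h for w, h in zip(weights, history)) / sum(weights)

history = [4.0, 5.0, 6.0]                          # three recorded listening durations
print(broadcast_length_param(history))             # 5.0
print(broadcast_length_param(history, [1, 2, 3]))  # recent sessions weigh more
```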
  • the NLG module 36, according to the returned broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"} and the broadcast length parameter 20, generates a target broadcast text of about 20 characters: "Nanjing is sunny today, with a minimum of 15 degrees Celsius and a maximum of 23 degrees Celsius."
  • the value of the historical listening time may be an initial value.
  • the value can be a precise numerical record, such as "5 seconds" or "20 words", or an identifier mapped to a certain duration range, such as "medium" or "concise"; the initial value can also be configured by other means, and the embodiment of the present application does not limit the initial value of the historical listening duration.
  • each time the user listens to a voice broadcast, the voice assistant 201 records the listening duration of the broadcast and collects the duration information of each listened broadcast in the user portrait, obtaining multiple pieces of historical listening duration information.
  • the recording of the listening duration can be timed from the moment when the broadcast starts, and ends when one of the following situations occurs: the broadcast is completed, the broadcast is interrupted, or the program is closed or switched to another program.
  • the listening duration is the time interval from the start of the timer to the end of the timer.
  • the TTS module 37 controls the speech rate of the broadcast voice, using the broadcast length parameter as the speech rate limit of the broadcast voice, and converts the broadcast text into a broadcast voice that conforms to the current user's historical listening habits.
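A minimal sketch of choosing the speech rate from the broadcast length parameter; the characters-per-second units and the clamping range keeping the voice intelligible are illustrative assumptions:

```python
def speech_rate(text, target_seconds, min_rate=2.0, max_rate=8.0):
    """Characters per second needed to finish `text` within the target listening
    duration, clamped so the synthesized voice stays natural."""
    rate = len(text) / target_seconds
    return min(max(rate, min_rate), max_rate)

print(speech_rate("x" * 20, 5.0))   # 4.0 chars/s for a 20-char text in 5 s
print(speech_rate("x" * 100, 5.0))  # clamped at the 8.0 chars/s ceiling
```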
  • FIG. 5 is a schematic diagram of the application of the method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in Figure 5, user A wakes up the voice assistant 201 and asks "how is the temperature in Nanjing today?"
  • the front-end processing module 31 performs audio decoding on the voice command "What's the temperature in Nanjing today?" input by user A, and decodes it into an audio signal in PCM format; it separates, denoises and performs feature extraction on the audio signal using voiceprint or other features, and obtains audio feature vectors through audio processing algorithms such as framing, windowing, and short-time Fourier transform.
  • the ASR module 32 converts audio feature vectors into text through an acoustic model and a language model. Specifically, the acoustic features in the audio feature vector are converted into phonemes or words through the acoustic model, and then the phonemes or words are converted into text sequences through the language model, and the text corresponding to the voice command of user A is output.
  • the NLU module 33 understands the text, and obtains that the user's intent is "check the weather" and the slot is "location: Nanjing".
  • the DST module 34 uses the intent "check the weather" of the current round of dialogue output by the NLU module 33 and the slot "location: Nanjing", and outputs the dialogue state information of the current round, comprising intent information (check the weather), a filled slot (Nanjing) and (time: today).
  • the DPL module 35 uses the dialogue state information output by the DST module 34 to determine that the next action is to execute the instruction; using the slot information as parameters, the DPL module 35 selects a suitable third-party service or application (app) according to the intent information to execute the user's voice command, and sends "check the weather" to the corresponding third-party application (service provider W).
  • the NLG module 36 acquires the returned broadcast content as the structured information {"temperature": "15-23", "unit": "C", "location": "Nanjing"}.
  • the historical listening duration t_A = 5 s of user A is acquired through the UP module 213, and the character length of the generated broadcast text is determined to be 20 after conversion through the mapping table, so the broadcast length parameter is 20.
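The mapping table is not given in this excerpt; as a hedged sketch, a constant broadcast speed of about 4 characters per second (an assumption consistent with the 5 s → 20 characters example above) reproduces the conversion:

```python
# Hypothetical mapping: ~4 characters per second of broadcast speech.
CHARS_PER_SECOND = 4

def duration_to_char_length(listening_seconds):
    """Map a historical listening duration to a broadcast length parameter."""
    return round(listening_seconds * CHARS_PER_SECOND)

print(duration_to_char_length(5))    # user A: 5 s  -> 20-character broadcast text
print(duration_to_char_length(2.5))  # user B: 2.5 s -> 10-character broadcast text
```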
  • the NLG module 36 generates, according to the broadcast content returned above and the broadcast length parameter, a target broadcast text of about 20 characters: "Nanjing is sunny today, the lowest is 15 degrees Celsius, and the highest is 23 degrees Celsius".
  • the voice assistant 201 sends the user's listening time to the UP module 213, and the UP module 213 records the user A's listening time to the broadcast.
  • the voice command input by user B is the same as that of user A, and the process of DPL module 35 obtaining the returned result is the same as that of user A.
  • the character length of the generated broadcast text is about 10 characters, and the broadcast length parameter is 10; the target broadcast text generated by the NLG module 36 is "Sunny, 15 to 23 degrees Celsius", and the TTS module 37 generates a broadcast voice with a duration of 1.5-2.5s.
  • the method for generating the broadcast text proposed in the embodiment of the present application can generate broadcast texts of different lengths for the same voice command, so that the voice assistant can generate personalized broadcast texts according to the user's usage habits, and then perform personalized broadcasts according to the personalized broadcast texts.
  • the method for generating the broadcast text proposed in the embodiment of the present application introduces user information in the generation stage of the broadcast text and broadcast voice, and controls the level of detail of the target broadcast text according to the user's historical listening duration recorded in the user information, providing each user with a personalized interactive experience.
  • a method for generating broadcast text proposed in the embodiment of the present application, on the basis of Embodiment 1, imports user information, device information and/or current scene information; the user's voice instruction is combined with the user's historical listening duration, device information and/or current scene information to generate a broadcast text whose length matches the user's historical listening habits, broadcast at a corresponding speech rate to provide a personalized broadcast experience.
  • user information includes the user's historical listening time; device information includes configuration information such as display resolution, size, and broadcast device type of the broadcast device; scene information includes information such as room type, background noise level, and the user's current exercise status.
  • the voice assistant obtains the device information of the broadcasting device in use through the DP module 214, and obtains the current scene information through the CA module 215; the UP module 213 uses the device information and scene information as indexes to search the database 230 to obtain the finest-grained historical listening duration information that meets the threshold requirements, as shown in the list in Table 2.
  • the historical listening duration of the user is divided into three levels and calculated according to the device information and the current scene, and the broadcast text is broadcast at a corresponding speaking speed.
  • the broadcast length parameters are calculated according to the three-level listening duration.
  • In Table 2 there are mainly the following available broadcast length parameters: the overall listening duration t_total, the mobile phone listening duration t_d1, the TV listening duration t_d2, the in-vehicle listening duration t_e1, the living-room listening duration t_e2, and the listening duration t_d1e1 of the mobile phone in the vehicle. According to the data in Table 2, it can be obtained:
  • where average() is the mean function, and the index values in brackets denote: d1 the mobile phone, d2 the TV, e1 the vehicle, e2 the living room, and d1e1 the mobile phone in the vehicle.
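The three levels of averages (overall, per device, per scene, and per device-scene pair) can be sketched from raw listening records; the record tuples and example values here are illustrative assumptions standing in for the data of Table 2:

```python
def level_averages(records):
    """Compute listening statistics from (device, scene, duration) records:
    overall t_total, per-device t_d, per-scene t_e, per-pair t_de."""
    def avg(values):
        return sum(values) / len(values)
    t_total = avg([r[2] for r in records])
    t_d = {d: avg([r[2] for r in records if r[0] == d]) for d, _, _ in records}
    t_e = {e: avg([r[2] for r in records if r[1] == e]) for _, e, _ in records}
    t_de = {(d, e): avg([r[2] for r in records if (r[0], r[1]) == (d, e)])
            for d, e, _ in records}
    return t_total, t_d, t_e, t_de

records = [("phone", "vehicle", 4.0), ("phone", "living_room", 6.0),
           ("tv", "living_room", 8.0)]
t_total, t_d, t_e, t_de = level_averages(records)
print(t_total)                      # 6.0
print(t_d["phone"])                 # 5.0
print(t_de[("phone", "vehicle")])   # 4.0
```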
  • multiple pieces of historical listening duration information may be determined according to device information or scene information; and the broadcast length parameter may be determined according to the average or weighted average of multiple pieces of historical listening duration information.
  • the broadcast length parameter determined according to the device information is recorded as the first broadcast length parameter; the broadcast length parameter determined according to the scene information is recorded as the second broadcast length parameter.
  • the voice assistant uses the broadcast length parameter obtained through the first-level calculation.
  • the calculation of the first level is to calculate the overall listening time t_total.
  • the overall listening duration t_total is consistent with the user's historical listening duration defined in Embodiment 1, which is the average or weighted average of multiple pieces of historical listening duration information, and the overall listening duration t_total is used as the broadcast length parameter.
  • the voice assistant can determine the user's overall listening duration t_total based on the statistical average or weighted average of the user's listening history, and the overall listening duration t_total is used as the broadcast length parameter.
  • the calculation of the second level is to count the multiple pieces of historical listening duration information on the corresponding device according to the listening duration t_d under the device information, or count the multiple historical listening duration information under the corresponding scene according to the listening duration t_e under the scene information.
  • the device corresponding to the device information in Table 2 may be a smart terminal such as a mobile phone or a TV; the scene corresponding to the scene information may be a place such as a vehicle, a bedroom or a living room, and a state of motion such as exercising or resting.
  • the voice assistant can obtain user A's broadcast length parameter under the mobile terminal according to the statistical average or weighted average of each piece of historical listening duration information recorded on the mobile terminal.
  • the voice assistant logged in by user B can, according to the statistical average or weighted average of each recorded historical duration of listening to broadcasts in the living room, obtain the user's broadcast length parameter in the same scene across different smart terminals.
  • At least one piece of historical listening duration information can be determined according to device information and scene information; and the broadcast length parameter can be determined according to the average or weighted average of multiple pieces of historical listening duration information.
  • the historical listening duration information obtained through the third-level calculation is used.
  • the broadcast length parameter determined according to the combination of device information and scene information is recorded as the third broadcast length parameter.
  • the calculation of the third level is to count the historical listening duration of the user of the current device d in the current scene e according to the listening duration t_de of the device scene.
  • the voice assistant logged in by user C can, according to the statistical average or weighted average of each recorded piece of historical listening duration information for listening to the weather broadcast in the vehicle, obtain the broadcast length parameter for user C listening to broadcasts in the vehicle through the mobile terminal.
  • after the voice assistant completes a broadcast text listening event, it sends the user's listening duration to the UP module 213, and the UP module 213 records the listening duration and time at the corresponding level in the three-level historical listening duration information list shown in Table 2.
  • Embodiment 2 of the present application is aimed at users with different historical listening durations on different devices and in different scenarios; the voice assistant 201 can generate target broadcast texts of different lengths to provide users with a more refined personalized interactive experience.
  • Embodiment 2 of the present application conducts refined statistics on the user's historical listening time according to the type of device and the scene in which it is located, so as to provide a personalized broadcast voice interaction experience that is more suitable for the user's usage scene.
  • during the broadcast text generation process, the dialog system of the voice assistant 201 can, in combination with the user's historical listening duration information, device-related parameters and/or current scene information, provide the user with a broadcast voice whose length and speech rate conform to the current user's listening history and suit the user's device information and scene information, thereby improving the naturalness of voice interaction and greatly improving the user experience.
  • a method for generating a broadcast text proposed in the embodiment of the present application can, on the basis of Embodiments 1 and 2, obtain the broadcast length parameter through a machine learning model. The machine learning model can be implemented based on a random forest and trained on the user's historical listening duration, screen size, screen resolution and/or the noise level of the environment in which the user listens to broadcasts, and the room type; these features are input into the machine learning model, which outputs the broadcast length parameter. The target broadcast text is generated according to the broadcast length parameter and the broadcast content, and broadcast at the corresponding speech rate, providing a personalized broadcast experience.
  • FIG. 6 is a schematic structural diagram of a random forest-based machine learning model of a broadcast text generation method proposed in Embodiment 3 of the present application. As shown in Figure 6, x in the figure is the input feature of the machine learning model, and the broadcast length parameter y is output.
  • the input feature x includes data such as user information, device information, and/or scene information; wherein, the user information includes the user's historical listening time; the device information includes the screen size and screen resolution of the current broadcasting device; the scene information is related to The data includes the level of ambient noise, the type of room it is in, and so on.
  • the broadcast length parameter y includes classification results such as "concise” or “moderate”, or a predicted length limit value L of the broadcast text.
  • the machine learning model may be a classification model: feature data such as user information, device information, and/or scene information are input, and the output broadcast length parameter y is the length classification result of the target broadcast text, denoted as the fourth broadcast length parameter, e.g. concise, moderate, or detailed.
  • the classification model can be trained using a standard random forest classifier.
  • the machine learning model can be a regression model, which inputs feature data such as user information, device information, and/or scene information, and outputs the broadcast length parameter y, which is the length limit value L of the target broadcast text, denoted as the fifth broadcast length parameter; the regression model can be trained using a standard random forest regressor.
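As a sketch of the regression variant, the following uses scikit-learn's RandomForestRegressor; the feature layout (duration, screen size, resolution, noise level, room type id), the training values and the labels are all illustrative assumptions, not data from the embodiment:

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature vectors: [historical listening duration (s), screen size
# (inches), screen resolution (megapixels), ambient noise (dB), room type id].
X = [[5.0, 6.1, 2.1, 40, 0],
     [2.5, 6.1, 2.1, 70, 1],
     [8.0, 55.0, 8.3, 35, 2],
     [3.0, 10.0, 4.1, 60, 1]]
y = [20, 10, 32, 12]   # expected broadcast length parameter (characters)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = model.predict([[5.0, 6.1, 2.1, 45, 0]])[0]
print(round(pred))     # a length limit between the training labels
```

The classification variant would swap in RandomForestClassifier with class labels such as "concise" / "moderate" / "detailed".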
  • each initial model of the above machine learning models is obtained through offline training; the model then continuously collects the user's historical listening duration under conditions of a specific screen size, screen resolution, and/or environmental noise level and room category, and keeps learning to provide broadcast length parameters adapted to the user's historical listening habits.
  • the training data of the machine learning model includes the user's historical listening time and/or device information, such as the screen size and/or screen resolution of the current broadcast device, and scene information, such as the level of ambient noise and/or the type of room in which it is located, etc.
  • the label of each piece of training data is the expected broadcast length parameter.
  • Each piece of training data can be obtained through the steps corresponding to Embodiment 1 and Embodiment 2 above, or collected from the network environment in combination with user feedback, which is not limited here.
  • the NLG module 36 uses the broadcast length parameter output by the above machine learning model to control the length of the generated target broadcast text.
  • the TTS module 37 uses the broadcast length parameter output by the machine learning model to control the speech rate of the broadcast voice and broadcasts at the corresponding rate.
  • the method for generating broadcast text proposed in this embodiment of the present application introduces a machine learning model that obtains the broadcast length parameter from the user's historical listening duration, device information, and/or scene information, limits the length of the broadcast text and broadcast voice according to that parameter, and, through an online learning mechanism, keeps the machine learning model learning continuously and updates the personalized broadcast length parameters to match the user.
  • the personalized experience provided by voice assistant 201 using the broadcast text generation method of Embodiment 3 of the present application becomes more accurate with use.
  • the broadcast text generation method proposed in Embodiment 3 of the present application learns the mapping from the user's historical listening duration to the expected broadcast text length and broadcast voice duration through a machine learning model, and provides a more accurate personalized experience through online learning.
  • by contrast, Embodiment 1 uses rule-based mapping.
  • a method for generating broadcast text proposed in an embodiment of this application can use pre-trained language models, such as the BERT language model or the GPT-2 language model, integrate the broadcast length parameters into the controllable NLG module 36/TTS module 37, and generate broadcast text or voice end-to-end.
  • FIG. 7 is a schematic structural diagram of a typical pre-trained language model in the broadcast text generation method proposed in Embodiment 4 of the present application.
  • this module uses linear encoders (linear) to encode the different types of user information, device information, and/or scene information, and obtains the representation vector of the broadcast length parameter through the fusion module (fusion), denoted as the sixth broadcast length parameter;
  • the sixth broadcast length parameter, together with the broadcast content of the current user's voice command output by the DST module 34 and the current-round dialogue state output by the DPL module 35, is input into the GPT-2 language model, which generates a target broadcast text whose length matches the user's listening history.
  • the NLG module 36 first pre-trains the GPT-2 language model on unlabeled text data to obtain language feature information. It then fine-tunes on broadcast content information (the broadcast content, dialogue state, and corresponding user information, device information, and/or scene information) together with broadcast results that received positive user feedback, learning the encoder parameters for each input and adjusting the output-layer parameters of the pre-trained GPT-2 model, so that the model adapts to the task of generating target broadcast texts whose length matches the user's listening history.
  • the method for generating broadcast text proposed in the embodiments of the present application introduces not only user information but also device information and/or scene information when generating the broadcast text, producing broadcast texts of different lengths. It collects, via the user information, the historical durations for which users listen to broadcast texts, and stores those listening durations together with the environment and/or the device used.
  • guiding broadcast text generation with these parameters can produce target broadcast texts that match user habits, adapt to device information and/or usage scenarios, improve interaction experience and efficiency, and provide a personalized voice assistant 201 that better understands the user.
  • for broadcast text or voice that the voice assistant 201 produces proactively, such as welcome words or the text or voice generated when the system is turned on or off, the methods of the above embodiments of the present application can likewise be used to generate broadcast text or voice matching the user's personalized usage records, device information, and/or scene information.
  • the embodiment of this application proposes a method for broadcasting text that can generate broadcast voice according to a user request, introduce user information at the broadcast voice generation stage, and control the speech rate of the target broadcast voice according to the user's historical listening duration recorded in the user information, so as to provide each user with a personalized interactive experience with the voice assistant.
  • the embodiment of the present application proposes a method for broadcasting text, including: receiving a user's voice command; generating a target broadcast text corresponding to the voice command; controlling the broadcast speed of the target broadcast text according to the broadcast length parameter, and the broadcast length parameter indicates the historical listening duration information.
  • the voice assistant can determine the broadcast length parameter based on the average or weighted average of multiple pieces of historical listening duration information. For details, reference may be made to the implementation manner related to determining the broadcast length parameter in Embodiment 1, which will not be repeated here.
  • the broadcast length parameter is associated with device information
  • the first broadcast length parameter can be determined according to the device information
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter; the first broadcast length parameter indicates the first historical listening duration information associated with the device information.
  • the broadcast length parameter is associated with the scene information
  • the second broadcast length parameter can be determined according to the scene information
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter; the second broadcast length parameter indicates the second historical listening duration information associated with the scene information.
  • the broadcast length parameter is associated with device information and scene information
  • the third broadcast length parameter can be determined according to the device information and scene information
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter; the third broadcast length parameter indicates the third historical listening duration information associated with the device information and scene information. For details, refer to the implementation related to the third broadcast length parameter in Embodiment 2, which is not repeated here.
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting the historical listening duration information, device information, and/or scene information into the classification model; outputting the fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting the historical listening duration information, device information, and/or scene information into the regression model; outputting the fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
  • an embodiment of the present application provides an electronic device, including: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory; when the programs stored in the memory are executed, the processor is configured to perform the method of any one of the above embodiments.
  • an embodiment of the present application provides a storage medium storing an instruction; when the instruction is run on a terminal, the terminal is caused to execute the method of any one of the foregoing embodiments.
  • the broadcast text listening duration defined in the embodiments of the present application may also be converted into an equivalent index, such as the time the user spends viewing the broadcast text in a plain-text generation scenario.
  • computer-readable media may include, but are not limited to: magnetic storage devices (e.g., hard disks, floppy disks, or tapes, etc.), optical disks (e.g., compact discs (compact discs, CDs), digital versatile discs (digital versatile discs, DVDs), etc.), smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), card, stick or key drive, etc.).
  • various storage media described herein can represent one or more devices and/or other machine-readable media for storing information.
  • the term "machine-readable medium” may include, but is not limited to, wireless channels and various other media capable of storing, including and/or carrying instructions and/or data.
  • the sequence numbers of the above processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the storage medium includes several instructions to enable a computer device (which may be a personal computer, a server, or an access network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the embodiments of the present application.
  • the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.

Abstract

The present application relates to the field of artificial intelligence (AI), and provides a broadcast text generation method applied to a speech assistant. The method comprises: receiving a speech instruction of a user; acquiring broadcast content corresponding to the speech instruction; and generating target broadcast text according to a broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening duration information. In the present application, broadcast text is differentiated based on the historical duration for which a user listens to broadcast text, in combination with the scenario the user is in and the device used; in a specific scenario, the historical listening duration guides the generation of broadcast text, and the broadcast speed of the target broadcast text is further controlled according to the broadcast length parameter, so as to obtain target broadcast speech that matches the user's historical usage habits and adapts to the device information and usage scenario, thereby improving interaction experience and efficiency and providing a personalized speech assistant that better understands the user.

Description

A method, apparatus, and electronic device for generating broadcast text
This application claims priority to Chinese patent application No. 202110741280.1, filed with the China Patent Office on June 30, 2021 and entitled "A Method, Apparatus, and Electronic Device for Generating Broadcast Text", and to international application No. PCT/CN2022/084068, filed on March 30, 2022. The entire contents of both applications are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence (AI), and in particular to a method, apparatus, and electronic device for generating broadcast text.
Background
A voice assistant or virtual assistant is agent software that can perform tasks or services on behalf of an individual, and is widely used in devices such as smartphones, smart speakers, and smart in-vehicle terminals (electronic control units, ECUs). A voice assistant or virtual assistant provides a voice user interface (VUI) and completes corresponding tasks or provides related services according to the user's voice command input. After executing a voice command issued by the user, the voice assistant generates broadcast text and produces the corresponding broadcast voice through a text-to-speech (TTS) module, informing the user of the broadcast content and guiding the user to continue using the device.
Current voice assistants generally broadcast text in a fixed manner; when interacting with different users, the broadcast voice/broadcast text does not differ. How to provide users with broadcasts that match their personal usage habits and improve the naturalness of interaction is an urgent problem to be solved.
Summary of the Invention
To solve the above problems, the embodiments of the present application provide a method, apparatus, terminal device, and system for generating broadcast text.
In a first aspect, an embodiment of the present application provides a method for generating broadcast text, the method including: receiving a user's voice command; acquiring broadcast content corresponding to the voice command; and generating a target broadcast text according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information. This provides users with voice assistant broadcasts that match their personal historical usage habits, offers each user a personalized broadcast experience, and improves the naturalness of voice assistant interaction.
In a possible implementation, generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as inputs to a model, where the model outputs the target broadcast text, a broadcast text whose duration matches the broadcast length parameter. In this way, the model can, according to the broadcast length parameter, provide users with voice assistant broadcast text that matches their personal historical usage habits, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the model is a generative model or a retrieval model. Generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as inputs to a generative model, where the generative model outputs the target broadcast text, a broadcast text whose duration matches the broadcast length parameter; or using the broadcast content and the broadcast length parameter as inputs to a retrieval model, where the retrieval model retrieves a text template of limited length from a predefined template library according to the broadcast length parameter and outputs, via the retrieved limited-length template, the target broadcast text, a broadcast text whose duration matches the historical listening duration information. In this way, a generative or retrieval model can provide users with voice assistant broadcast text that matches their personal historical usage habits, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
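The retrieval-style generation described above can be sketched in a few lines. This is an illustrative assumption, not the patent's implementation: the template texts, the `{time}` slot, and the "longest template that fits the length limit" selection rule are all invented for the example.

```python
# Hypothetical template library; a real library would be far larger and
# organized per intent. Templates of increasing detail for one intent:
TEMPLATE_LIBRARY = [
    "Done.",
    "Alarm set for {time}.",
    "OK, I have set an alarm for {time} as you requested.",
]

def retrieve_template(length_limit: int) -> str:
    """Return the most detailed template that fits within the length limit L."""
    fitting = [t for t in TEMPLATE_LIBRARY if len(t) <= length_limit]
    # Fall back to the shortest template if nothing fits the limit.
    return max(fitting, key=len) if fitting else min(TEMPLATE_LIBRARY, key=len)

def generate_target_text(length_limit: int, slots: dict) -> str:
    """Fill the retrieved limited-length template with the slot values."""
    return retrieve_template(length_limit).format(**slots)

print(generate_target_text(30, {"time": "7:00"}))   # a medium template fits
print(generate_target_text(100, {"time": "7:00"}))  # the detailed template fits
```

A tight length limit selects the terse acknowledgement, while a generous limit selects the detailed confirmation, which is the behavior the broadcast length parameter is meant to induce.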
In a possible implementation, the broadcast length parameter is associated with device information, and a first broadcast length parameter is determined according to the device information. Generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a first target broadcast text according to the first broadcast length parameter and the broadcast content, where the first broadcast length parameter indicates first historical listening duration information associated with the device information. This provides users with voice assistant broadcasts that match personal historical usage habits and adapt to the device, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with scene information, and a second broadcast length parameter is determined according to the scene information. Generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a second target broadcast text according to the second broadcast length parameter and the broadcast content, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information. This provides users with voice assistant broadcasts that match personal historical usage habits and the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with device information and scene information, and a third broadcast length parameter is determined according to the device information and scene information. Generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a third target broadcast text according to the third broadcast length parameter and the broadcast content, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information. This provides users with voice assistant broadcasts that match personal historical usage habits and adapt to both the device and the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information, and/or scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content. In this way, the broadcast length parameter obtained through the classification model yields voice assistant broadcasts that match personal historical usage habits and adapt to the device and/or the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information, and/or scene information into the regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content. In this way, the regression model generates voice assistant broadcasts that match personal historical usage habits and adapt to the device and/or the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
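The classification and regression models above can be sketched with the random forest learners the earlier embodiment mentions. This is a minimal illustration under assumptions: the feature layout, toy training rows, category labels, and length budgets are all invented, not taken from the patent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Assumed feature layout per row: [historical listening duration (s),
# screen size (in), screen resolution (kilo-pixels), ambient noise (dB),
# room category id]. The rows themselves are toy data.
X = np.array([
    [4.0,  6.1,  2500, 35, 0],
    [12.0, 10.5, 4000, 40, 1],
    [25.0, 55.0, 8300, 30, 2],
    [5.5,  6.1,  2500, 70, 3],
])

# Classification model: outputs a length category (0 = concise, 1 = moderate,
# 2 = detailed), i.e. the "fourth broadcast length parameter".
y_category = np.array([0, 1, 2, 0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_category)

# Regression model: outputs a length limit value L (here a character budget),
# i.e. the "fifth broadcast length parameter".
y_limit = np.array([20.0, 60.0, 140.0, 25.0])
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_limit)

features = np.array([[6.0, 6.1, 2500, 38, 0]])
category = int(clf.predict(features)[0])        # fourth broadcast length parameter
length_limit = float(reg.predict(features)[0])  # fifth broadcast length parameter
print(category, length_limit)
```

Online learning, as described above, would correspond to periodically refitting these models as new (features, observed listening duration) pairs accumulate.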
In a possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: linearly encoding the device information, scene information, and/or historical listening duration information separately and then fusing them to obtain a sixth broadcast length parameter, where the sixth broadcast length parameter is a representation vector of the broadcast length parameter; and using the sixth broadcast length parameter, the broadcast content, and whether the voice command is executable or not as inputs to a pre-trained language model, which outputs a sixth target broadcast text. In this way, the pre-trained language model generates voice assistant broadcasts that match personal historical usage habits and adapt to the device and/or the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
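The encode-then-fuse step above can be sketched numerically. This is an assumed, minimal NumPy version: the hidden size, random weights, and concatenate-then-project fusion are illustrative choices, not the patent's architecture, and a real system would learn these weights during fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # small for illustration; a real model might match GPT-2's hidden size

def make_linear_encoder(in_dim, out_dim=HIDDEN):
    """One linear encoder per information type (user / device / scene)."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    b = np.zeros(out_dim)
    return lambda x: x @ W + b

user_enc = make_linear_encoder(4)    # e.g. historical listening durations
device_enc = make_linear_encoder(3)  # e.g. screen size / resolution
scene_enc = make_linear_encoder(3)   # e.g. noise level / room category
# Fusion module: concatenate the per-type encodings, then project down.
fusion_W = rng.standard_normal((3 * HIDDEN, HIDDEN)) * 0.1

def sixth_broadcast_length_parameter(user, device, scene):
    parts = np.concatenate([user_enc(user), device_enc(device), scene_enc(scene)])
    return parts @ fusion_W  # representation vector of the broadcast length parameter

vec = sixth_broadcast_length_parameter(np.ones(4), np.ones(3), np.ones(3))
print(vec.shape)
# This vector would be fed to the GPT-2 language model together with the
# broadcast content and the current-round dialogue state.
```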
In a possible implementation, acquiring the broadcast content corresponding to the voice command includes: acquiring intent and slot information according to the voice command; determining, according to the intent and slot information, whether the voice command is executable; and, when the voice command is not executable, generating the broadcast content, where the broadcast content is inquiry information. In this way, when the voice command cannot be executed, the broadcast content with which the voice assistant asks the user can be obtained.
In a possible implementation, determining the broadcast content according to the dialogue state includes: acquiring intent and slot information according to the voice command; determining, according to the intent and slot information, whether the voice command is executable; when the voice command is executable, determining the third-party service that executes the intent; and acquiring the broadcast content from the third-party service, where the broadcast content is result information corresponding to the voice command. In this way, when the voice command is executable, the broadcast content returned after the third-party service executes the voice command is obtained.
In a possible implementation, the method further includes: controlling the broadcast speed of the target broadcast text according to the broadcast length parameter. In this way, voice that matches personal historical usage habits and adapts to the device and/or the current scene can be generated, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
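One hypothetical way to realize this speed control is to map the broadcast length parameter to a TTS speech-rate multiplier, speeding up when the target broadcast text exceeds the length limit L so the broadcast still fits the user's typical listening duration. The category names, base rates, and cap below are invented for illustration, not values from the patent.

```python
# Assumed base rates per length category (the fourth broadcast length parameter).
RATE_BY_CATEGORY = {"concise": 1.0, "moderate": 1.1, "detailed": 1.25}

def speech_rate(category, text_len, length_limit=0):
    """Pick a TTS rate multiplier from the length category, speeding up
    (with a cap) when the target broadcast text exceeds the length limit L
    (the fifth broadcast length parameter)."""
    rate = RATE_BY_CATEGORY.get(category, 1.0)
    if length_limit and text_len > length_limit:
        rate *= min(text_len / length_limit, 1.5)  # cap the speed-up at 1.5x
    return round(rate, 2)

print(speech_rate("concise", 40))       # text within limit: base rate
print(speech_rate("moderate", 90, 60))  # text over limit: rate is raised
```

The resulting multiplier would then be passed to the TTS module 37 as its speaking-rate setting.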
In a possible implementation, the method further includes: recording the broadcast duration of the current target broadcast text to obtain the historical listening duration information. In this way, a personalized broadcast experience matching personal historical usage habits can be obtained, improving the naturalness of voice assistant interaction.
In a second aspect, an embodiment of the present application provides a method for broadcasting text, the method including: receiving a user's voice command; generating a target broadcast text corresponding to the voice command; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information. The beneficial effects of controlling the broadcast speed of the target broadcast text according to the broadcast length parameter are the same as those of the embodiments of the first aspect in which the broadcast length parameter is used to generate the target broadcast text, and are not repeated below.
In a possible implementation, the broadcast length parameter is associated with device information, and a first broadcast length parameter is determined according to the device information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, where the first broadcast length parameter indicates first historical listening duration information associated with the device information.
In a possible implementation, the broadcast length parameter is associated with scene information, and a second broadcast length parameter is determined according to the scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
在一种可能的实现方式中,所述所述播报长度参数与场设备信息和场景信息关联,根据所述设备信息和场景信息确定第三播报长度参数,所述所述根据播报长度参数对所述目标播报文本的播报速度进行控制,包括:根据所述第三播报长度参数对所述目标播报文本的播报速度进行控制;所述第三播报长度参数指示与所述设备信息关联的第三历史收听时长信息。In a possible implementation manner, the broadcast length parameter is associated with field device information and scene information, and a third broadcast length parameter is determined according to the device information and scene information, and the broadcast length parameter is used to determine the third broadcast length parameter. Controlling the broadcast speed of the target broadcast text includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter; the third broadcast length parameter indicates the third history associated with the device information Listen to duration information.
In a possible implementation, the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, the device information and/or the scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of a set of length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.

In a possible implementation, the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, the device information and/or the scene information into a regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
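The distinction between the two implementations above (a classification model outputting a length *category* versus a regression model outputting a length *limit value*) can be illustrated with a minimal sketch. The feature names, thresholds, and weights below are hypothetical placeholders standing in for trained models, not the actual models of this application:

```python
def classify_length(avg_listen_s: float, device: str, scene: str) -> str:
    """Classification-model stand-in: outputs a length category
    (the 'fourth broadcast length parameter' above)."""
    score = avg_listen_s
    if device == "smart_speaker":   # assumed: screenless devices favor longer speech
        score += 5.0
    if scene == "driving":          # assumed: driving favors shorter broadcasts
        score -= 10.0
    if score < 10.0:
        return "short"
    if score < 30.0:
        return "medium"
    return "long"


def regress_length_limit(avg_listen_s: float, device: str, scene: str) -> float:
    """Regression-model stand-in: outputs a numeric length limit in seconds
    (the 'fifth broadcast length parameter' above)."""
    weights = {"smart_speaker": 1.2, "phone": 1.0, "car_terminal": 0.7}
    limit = avg_listen_s * weights.get(device, 1.0)
    if scene == "driving":
        limit *= 0.5
    return max(5.0, min(limit, 60.0))  # clamp to a plausible range of seconds
```

Either output can then be used to pick or trim the target broadcast text and set the TTS speech rate.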
In a third aspect, embodiments of this application provide an electronic device, including: at least one memory configured to store a program; and at least one processor configured to execute the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method described in any one of the foregoing embodiments.

In a fourth aspect, embodiments of this application provide a storage medium storing instructions that, when run on a terminal, cause the terminal to perform the method described in any one of the foregoing embodiments.
Description of Drawings
To describe the technical solutions of the embodiments disclosed in this specification more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Clearly, the accompanying drawings in the following description show merely some embodiments disclosed in this specification, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.

The accompanying drawings used in the description of the embodiments or the prior art are briefly introduced below.
FIG. 1 is a schematic diagram of an artificial intelligence framework;

FIG. 2 is a schematic diagram of an application system of the voice assistant according to an embodiment of this application;

FIG. 3 is a functional architecture diagram of the voice assistant in an embodiment of this application;

FIG. 4 is a flowchart of a broadcast text generation method according to Embodiment 1 of this application;

FIG. 5 is a schematic diagram of an application of the broadcast text generation method according to Embodiment 1 of this application;

FIG. 6 is a schematic structural diagram of a random forest-based machine learning model used in the broadcast text generation method according to Embodiment 3 of this application;

FIG. 7 is a schematic diagram of a typical pre-trained language model structure used in the broadcast text generation method according to Embodiment 4 of this application.
Detailed Description
In the following description, "some embodiments" describes a subset of all possible embodiments; it may be the same subset or different subsets of all possible embodiments, and the subsets may be combined with one another when there is no conflict.

In the following description, the terms "first", "second", "third", and the like, or module A, module B, module C, and the like, are used merely to distinguish between similar objects and do not represent a specific ordering of the objects. It can be understood that, where permitted, a specific order or sequence may be interchanged, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

In the following description, the reference numerals denoting steps, such as S110 and S120, do not mean that the steps must be performed in that order; where permitted, the order of the steps may be interchanged, or the steps may be performed simultaneously.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by a person skilled in the technical field of this application. The terms used herein are intended only to describe the embodiments of this application and are not intended to limit this application.
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
Natural language generation (NLG) is a part of natural language processing that generates natural language from a machine representation system such as a knowledge base or a logical form. NLG can be regarded as the reverse of natural language understanding (NLU): NLU must clarify the meaning of the input language and produce a machine representation language, whereas NLG must decide how to convert the conceptualized machine representation language into natural language that the user can receive.

In one possible solution, the user wakes up the voice assistant and issues a voice instruction related to querying the weather. Using its natural language understanding (NLU) capability, the voice assistant understands the weather-related voice instruction and classifies it according to a natural language classification system similar to Table 1; it queries the weather according to the classification result, and then either selects a corresponding template according to the weather query result to generate a broadcast text for the weather, or generates a broadcast text corresponding to the weather information category and its associated attributes. The broadcast text content matches the category to which the voice instruction belongs.
Table 1
Figure PCTCN2022095805-appb-000001
This solution generates different categories of broadcast text according to the different voice instructions input by the user, but the content of the broadcast text is related only to the category of the voice instruction; it does not take into account the user's personal usage habits, device differences, or differences in the user's scene, and therefore cannot provide a weather broadcast experience personalized to each user.

Embodiments of this application propose a broadcast text generation method, which relates to the AI field and is applicable to a voice assistant. By introducing user information, device information and/or scene information, the voice assistant can generate a broadcast text of personalized length according to the user's personal usage habits, device differences and/or the user's environment, generate broadcast voice information at a corresponding speech rate through TTS, inform the user of the broadcast content, and guide the user to continue using the device.
FIG. 1 is a schematic diagram of an artificial intelligence framework, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field. Based on the framework shown in FIG. 1, the framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a refinement process of "data - information - knowledge - wisdom".

The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and the technical realization of information provision and processing, to the industrial ecology of the system.
(1) Infrastructure 10:

The infrastructure 10 provides computing capability support for the artificial intelligence system, enables communication with the outside world, and is supported by a basic platform. Sensors are used to communicate with the outside to obtain data streams; smart chips (hardware acceleration chips such as a CPU, an NPU, a GPU, an ASIC, or an FPGA) provide training, computing, and execution capabilities; the basic platform provides platform assurance and support such as cloud storage, cloud computing, and network interconnection, including a distributed computing framework and networks.
(2) Data 11

The data 11 at the layer above the infrastructure 10 represents the data sources in the artificial intelligence field.

In the broadcast text generation method proposed in the embodiments of this application, the data 11 at the layer above the infrastructure 10 comes from voice instructions acquired on the terminal side, device information of the terminal used, and scene information obtained through sensor communication with the outside.
(3) Data processing 12

Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.

Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on data.

Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to a reasoning control strategy; its typical functions are searching and matching.

Decision-making refers to the process of making decisions after reasoning on intelligent information, and usually provides functions such as classification, ranking, and prediction.

In the broadcast text generation method proposed in the embodiments of this application, the data processing process includes front-end processing, automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), speech synthesis (TTS), and other processing of the received user voice instruction.
(4) General capabilities 13

After the data undergoes the data processing mentioned above, some general capabilities can further be formed based on the data processing results, for example, an algorithm or a general system.

In the embodiments of this application, after the voice instructions input by the user, the device information of the terminal used, and the scene information obtained through sensor communication with the outside undergo the above data processing, a broadcast text of personalized length can be generated based on the data processing results, and broadcast voice at a corresponding speech rate can be generated, providing a broadcast experience personalized to each user.
(5) Smart products and industry applications 14

Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include smart manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, and smart terminals.

The broadcast text generation method proposed in the embodiments of this application can be applied to voice assistants of smart devices in fields such as smart terminals, smart home, smart security, and autonomous driving, providing a voice user interface (VUI) on smartphones, speakers, and smart vehicle-mounted terminals (electronic control unit, ECU), and completing corresponding tasks or providing related services according to the voice instructions input by the user.

For example, smart devices include smart TVs, smart speakers, robots, smart air conditioners, smart smoke alarms, smart fire extinguishers, smart vehicle-mounted terminals, mobile phones, tablets, laptops, desktop computers, and all-in-one machines.
FIG. 2 is a schematic diagram of an application system of the voice assistant according to an embodiment of this application. As shown in FIG. 2, in the system diagram 200, the data collection device 260 is configured to collect information such as user information, device information, scene information and/or historical listening duration, and store this information in the database 230. The data collection device 260 corresponds to the sensors of the infrastructure in FIG. 1 and includes apparatuses communicatively connected to the smart device, such as motion sensors, displacement sensors, and infrared sensors, which collect the user's current scene information, for example exercising, in a meeting, resting, or chatting.

The data collection device 260 also includes apparatuses communicatively connected to the smart device such as a camera and a GPS, which collect scene information about the user's current location or place, for example in a vehicle, a living room, or a bedroom.

The data collection device 260 also includes a timer, configured to record the start time, end time, and duration of the broadcast voice. The broadcast duration is recorded in the user information as the user's historical listening duration.
The client device 240 corresponds to the basic platform of the infrastructure in FIG. 1 and is configured to interact with the user: acquiring the voice instruction issued by the user, broadcasting the broadcast content corresponding to the voice instruction, presenting the broadcast content to the user, and storing this information in the database 230. The client device 240 includes the display screens, microphones, speakers, buttons, Bluetooth headset microphones, and the like of devices that provide a voice user interface (VUI), such as smartphones and smart vehicle-mounted terminals.

The microphone may be a sound pickup device, including an integrated microphone, a microphone or microphone array connected to the smart device, or a microphone or microphone array communicatively connected to the smart device through a short-range connection technology, configured to collect the voice instructions issued by the user.
The training device 220 corresponds to the smart chip of the infrastructure in FIG. 1 and trains the voice assistant 201 based on data maintained in the database 230, such as user information, device information, scene information and/or historical broadcast duration. The voice assistant 201 can provide a broadcast text of personalized length in a voice dialog scene between the user and the client device 240, generate broadcast voice at a corresponding speech rate, inform the user of the broadcast content, and guide the user to continue using the client device 240.
In FIG. 2, the execution device 210 corresponds to the smart chip of the infrastructure in FIG. 1 and is provided with an I/O interface 212 for data interaction with the client device 240. The execution device 210 acquires, through the I/O interface 212, the voice instruction information input by the user through the client device 240, and outputs the broadcast content to the client device 240 through the I/O interface 212, for example, broadcasting the content through a speaker, or presenting the broadcast content through the voice user interface (VUI) on the display screen of a smartphone, a smart vehicle-mounted terminal, or the like.

The execution device 210 may call data, code, and the like in the data storage system 250, and may also store data, code instructions, and the like in the data storage system 250.

The training device 220 and the execution device 210 may be the same smart chip or different smart chips.
The database 230 is a data collection of user information, device information and/or scene information stored on a storage medium.

The voice assistant 201 is agent software for executing voice instructions or services. The execution device 210 runs the voice assistant 201; after acquiring the voice instruction issued by the user, the voice assistant generates a target broadcast text of personalized length according to the user information, the device information and/or the scene information, controls the speech rate of the broadcast voice, informs the user of the broadcast content, and guides the user to continue using the device.

Finally, the I/O interface 212 returns the target broadcast text of personalized length generated by the voice assistant 201 to the client device 240 as output data, and the client device 240 displays the broadcast text and broadcasts it to the user at the corresponding speech rate.
More deeply, the training device 220 acquires the training data and corpora stored in the database 230 and trains the voice assistant 201 based on historical data such as user information, device information and/or scene information, with the training objective of outputting broadcast text whose length matches the user's historical listening records, so that the voice assistant outputs a better target broadcast text.

In the situation shown in FIG. 2, the user can input voice instruction information to the execution device 210, for example, by operating in the voice user interface (VUI) provided by the client device 240. Alternatively, the client device 240 can automatically input instructions to the I/O interface 212 and obtain the broadcast content; if automatically inputting instruction information requires the user's authorization, the user can set the corresponding permission in the client device 240. The user can view or listen to the broadcast content output by the execution device 210 on the client device 240; the specific presentation form may be display, a wake-up sound, broadcasting, or the like. The client device 240 can also serve as a voice data collection terminal and store the collected wake-up sound or voiceprint data of the user in the database 230.

It should be noted that FIG. 2 is merely a schematic diagram of a system application scenario provided by an embodiment of this application, and the positional relationships among the devices, components, and modules shown in the figure do not constitute any limitation. The system in FIG. 2 may correspond to one or more device entities; for example, in FIG. 2 the data storage system 250 is an external memory relative to the execution device 210, but in other cases the data storage system 250 may also be placed inside the execution device 210.
FIG. 3 is a functional architecture diagram of the voice assistant in an embodiment of this application. The functional modules of the voice assistant 201 are described below. As shown in FIG. 3, the voice assistant 201 includes a front-end processing module 31, a speech recognition module 32, a semantic understanding module 33, a dialog state module 34, a dialog policy learning module 35, a natural language generation module 36, a speech synthesis module 37, and a dialog output module 38.

The front-end processing module 31 processes the voice instruction input by the user into the data format required by the network model, for use by the speech recognition module 32.

For example, the front-end processing module 31 acquires a voice instruction input by the user in the opus compression format and decodes it into an audio signal in the pcm format; it performs separation, noise reduction, and feature extraction on the audio signal using voiceprints or other features, and obtains the audio feature vector of a mel-frequency cepstral coefficients (MFCC) filter bank through audio processing algorithms such as framing, windowing, and short-time Fourier transform. The front-end processing module 31 is generally disposed on the terminal side.
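The framing and windowing steps mentioned above can be sketched minimally as follows. The frame length and hop size are illustrative values (25 ms frames with a 10 ms hop at a 16 kHz sampling rate), not parameters specified by this application:

```python
import math


def frame_signal(samples, frame_len=400, hop=160):
    """Split a decoded PCM sample sequence into overlapping frames,
    e.g. 25 ms frames with a 10 ms hop at 16 kHz."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def hann_window(frame):
    """Apply a Hann window to one frame before the short-time Fourier
    transform, reducing spectral leakage at the frame edges."""
    n = len(frame)
    return [x * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]
```

Each windowed frame would then go through the short-time Fourier transform and mel filter bank to produce the MFCC feature vectors consumed by the ASR module.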
The automatic speech recognition (ASR) module 32 acquires the audio feature vector obtained by the front-end processing module 31 and converts it into text through an acoustic model and a language model, for the semantic understanding module 33 to understand. The acoustic model classifies (decodes) acoustic features into phonemes or words, and the language model decodes the phonemes or words into a complete sentence.

For example, the acoustic model and the language model process the audio feature vector in series: the acoustic model converts the audio feature vector into phonemes or words, and the language model then converts the phonemes or words into a character sequence, outputting the text corresponding to the user's speech.

For example, the ASR module 32 may adopt an end-to-end implementation, in which the acoustic model and the language model use neural network structures and are jointly trained so that the training result outputs a Chinese character sequence corresponding to the user's speech. For example, the acoustic model may be modeled using a hidden Markov model (HMM), and the language model may be an n-gram model.
The natural language understanding (NLU) module 33 converts the text or Chinese character sequence corresponding to the user's speech into structured information, where the structured information includes machine-executable intent information and recognizable slot information. Its purpose is to obtain a semantic representation of natural language through syntactic, semantic, and pragmatic analysis.

It can be understood that the intent information refers to the task to be performed by the voice instruction issued by the user, and the slot information refers to the parameter information that must be determined to perform the task.

For example, the user asks the voice assistant 201, "What's the temperature in Nanjing today?" The NLU module 33 understands the text corresponding to this voice instruction and obtains the intent "query weather" with the slots "location: Nanjing" and "time: today".
The NLU module 33 may use a classifier to classify the text corresponding to the voice instruction into intent information supported by the voice assistant 201, and then use a sequence labeling model to label the slot information in the text.

The classifier may be a model usable for classification in traditional machine learning, for example, an NB model, a random forest (RF) model, an SVM classification model, or a KNN classification model; it may also be a deep learning text classification model, for example, a FastText model or TextCNN.

The sequence labeling model labels each element in the text information or Chinese character sequence and outputs a label sequence; these labels can indicate the start, end, and type of a slot. The sequence labeling model may be one of the following: a linear model, a hidden Markov model, a maximum entropy Markov model, a conditional random field, or the like.

The NLU module 33 may also use an end-to-end model to output the intent information and the slot information simultaneously.
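To illustrate how such a label sequence marks the start, end, and type of each slot, the sketch below decodes a BIO-style tag sequence (a common output convention for sequence labeling models such as CRFs; the tag names are hypothetical) into slot values, using the running "南京今天气温如何" example:

```python
def slots_from_bio(tokens, tags):
    """Collect slots from per-token BIO tags: 'B-xxx' opens a slot of
    type xxx, 'I-xxx' continues it, and 'O' is outside any slot."""
    slots, current_type, current_tokens = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:
                slots[current_type] = "".join(current_tokens)
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_type:
                slots[current_type] = "".join(current_tokens)
            current_type, current_tokens = None, []
    if current_type:
        slots[current_type] = "".join(current_tokens)
    return slots


# "南京今天气温如何" -> {"location": "南京", "time": "今天"}
result = slots_from_bio(
    ["南", "京", "今", "天", "气", "温", "如", "何"],
    ["B-location", "I-location", "B-time", "I-time", "O", "O", "O", "O"],
)
```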
The dialog state tracking (DST) module 34 manages the dialog state of the voice assistant 201. Using the intent information and slot information of the current dialog turn output by the NLU module 33, the DST module 34 maintains the current turn's dialog intent, the filled slots, and the dialog state in a multi-turn dialog scene.

The input of the DST module 34 is the dialog state of the previous turn, the broadcast content returned by the third-party application in the previous turn, and the intent information and slot information of the current turn; the output is the dialog state of the current turn.

The DST module 34 records the dialog history and dialog state of the voice assistant 201, helping the voice assistant 201 understand the instruction in the user's speech for the current turn in combination with the dialog history recorded by the context manager (that is, the database 230 in FIG. 2), and give appropriate feedback.

For example, in the first dialog turn, user A asks the voice assistant 201 to "book a flight ticket to Nanjing"; in the second turn, user A asks the voice assistant 201, "How is the weather there?" The NLU module 33 outputs the intent of the current turn as "query weather" with the slots "location: there" and "time: ". Because the DST module 34 has recorded the state of the first turn, the system, in combination with the dialog history recorded by the context manager, understands that "there" in the slot "location: there" refers to "Nanjing" and fills "Nanjing" into the location slot. The DST module 34 then outputs the dialog state information of the current turn, including the intent information (query weather), the filled slot (Nanjing), and the unfilled slot (time: ).
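The two-turn example above can be sketched as a minimal state update. The anaphora-resolution rule (literally matching "那里"/"there" against a slot remembered from an earlier turn) is a deliberately simplified placeholder for the real DST model:

```python
def update_state(prev_state, nlu_intent, nlu_slots):
    """Merge the current turn's NLU output into a new dialog state,
    resolving referring expressions against the previous state."""
    state = {"intent": nlu_intent, "slots": dict(nlu_slots)}
    for name, value in state["slots"].items():
        # Hypothetical rule: '那里' ('there') refers back to the value
        # the same slot held in an earlier turn.
        if value == "那里" and name in prev_state.get("slots", {}):
            state["slots"][name] = prev_state["slots"][name]
    return state


# Turn 1: "book a flight ticket to Nanjing"
turn1 = update_state({}, "book_flight", {"location": "南京"})
# Turn 2: "How is the weather there?" -- location resolves to 南京,
# while the time slot remains unfilled for the dialog policy to request.
turn2 = update_state(turn1, "query_weather", {"location": "那里", "time": None})
```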
The dialog policy learning (DPL) module 35 is used to decide the next action to be performed by the voice assistant 201, including asking the user a question, executing the user's instruction, recommending other instructions to the user, and generating a reply.
The DPL module 35 uses the dialog state information output by the DST module 34 to determine the next action.
In one implementable embodiment, the DPL module 35 may determine, according to the current dialog state, that the next action is to generate broadcast content that asks the user a question.
For example, following the example above, the dialog state information of the current turn output by the DST module 34 contains an unfilled slot (time: ), so the DPL module 35 may determine that the next action is to ask the user "Which day?", in order to maintain the control logic of the dialog system and ensure that the dialogue can continue. The action information is an action tag or structured information, such as "REQUEST-SLOT: date", indicating that the user is to be asked for the date next.
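A rule-style policy of this kind can be sketched as follows; the action names and the function are hypothetical illustrations, since the patent only specifies the "REQUEST-SLOT" tag format:

```python
def next_action(state):
    """Decide the assistant's next action from the current dialog state.

    Returns an action tag such as "REQUEST-SLOT:<name>" when a required
    slot is missing, otherwise "EXECUTE" (assumed name) to dispatch the
    instruction to a third-party app.
    """
    if state.get("unfilled"):
        # Ask the user for the first missing slot, e.g. REQUEST-SLOT:date.
        return "REQUEST-SLOT:" + state["unfilled"][0]
    return "EXECUTE"

print(next_action({"intent": "check_weather",
                   "slots": {"location": "Nanjing"},
                   "unfilled": ["date"]}))   # REQUEST-SLOT:date
print(next_action({"intent": "check_weather",
                   "slots": {"location": "Nanjing", "date": "today"},
                   "unfilled": []}))         # EXECUTE
```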
In one implementable embodiment, the DPL module 35 may determine, according to the current dialog state, that the next action is to select an appropriate third-party application (app) to execute the voice instruction, and send the intent and slot information to the selected third-party application; it then obtains the execution result returned by the third-party application, the execution result being the broadcast content corresponding to the voice instruction.
A third-party application (app) is an application that can execute or satisfy the intent of the voice instruction according to the slot information and return broadcast content, for example an app that can query the weather, an app that can provide product information, or an app that can provide navigation or positioning information.
The broadcast content determined by the DPL module 35 according to the current dialog state, or the broadcast content returned by a third-party application (app) or server after executing the voice instruction according to the intent and slot information, can serve as an input parameter of the DST module 34 for the dialog state of the next turn, and can also serve as an input parameter of the NLG module 36.
The natural language generation (NLG) module 36 is a translator that converts structured information into natural language expressions, and is currently widely used in voice assistants. When generating the voice assistant's broadcast utterance, because layouts and sizes differ across devices and webpage positions, a duration limit parameter needs to be introduced to limit the length of the generated text, so as to adaptively match the requirements on broadcast content and broadcast duration of different users, different devices, and different scenarios.
In the embodiments of the present application, the NLG module 36 is used to obtain the current dialog state maintained by the DST module 34, the next action determined by the DPL module 35, and/or the broadcast content returned by the third-party application (app), and to generate a target broadcast text of personalized length in combination with user information, device information, and/or scene information.
For example, when the current dialog state maintained by the DST module 34 is the intent information (check the weather), the filled slot (Nanjing), and the unfilled slot (time: ), and the next action determined by the DPL module 35 is to ask the user, the broadcast text generated by the NLG module 36 is "Which day would you like to query?"
For example, the NLG module 36 inputs the current dialog state and the broadcast content returned by the third-party application into a template matching the current intent, device, or scene, and outputs a target broadcast text of the length configured by the template. The NLG module 36 may also use a model-based black box to output a target broadcast text of personalized length.
The user profile (UP) module 213 is used to obtain user information by querying the data in the database 230 shown in FIG. 2. The user information records information such as the user's historical listening durations for voice assistant broadcasts.
User information, also called a user portrait, describes the user's usage habits by collecting data in various dimensions such as the user's social attributes, consumption habits, preference characteristics, and behavior when using the system, and analyzes and aggregates these characteristics to mine potential value information, thereby abstracting a full picture of the user information, which is used to recommend personalized content to the user or to provide services that match the user's habits.
The device profile (DP) module 214 is used to obtain device information of the client device 240 shown in FIG. 2, including the display resolution, size, and category, the speaker volume and timbre, and the like.
The context awareness (CA) module 215 is used to obtain current scene information through the data acquisition device 260 shown in FIG. 2. The scene information includes the room category, the background noise level, the user's current motion state, and the like.
The CA module 215, the DP module 214, and the UP module 213 may also be modules external to the voice assistant 201, which is not specifically limited here.
In the embodiments of the present application, after the voice assistant understands the user's voice instruction through the NLU module 33 and sends it to the corresponding third-party application (app) for execution, it can obtain the structured broadcast content returned by the third-party application and use the NLG module 36 to convert the returned structured broadcast content into broadcast text, for the TTS module to generate broadcast speech and inform the user of the broadcast content.
The text-to-speech (TTS) module 37 is used to control the broadcast speed of the target broadcast text according to the broadcast length parameter, where the broadcast length parameter indicates historical listening duration information.
In the embodiments of the present application, when converting the target broadcast text into broadcast speech, the TTS module 37 introduces the broadcast length parameter and controls the speech rate of the broadcast in combination with the user information, device information, and/or scene information, thereby limiting the broadcast duration of the target broadcast text. While ensuring the accuracy of speech generation, it also controls characteristics of the generated speech such as speech rate, timbre, and volume.
The dialog output module 38 is configured to generate a corresponding broadcast card according to the target broadcast speech and then present it to the user.
Embodiment 1
The embodiments of the present application propose a method for generating broadcast text. The method is applied to a voice assistant: a user's voice instruction is received, the broadcast content corresponding to the voice instruction is obtained, and a target broadcast text is generated according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
FIG. 4 is a flowchart of the method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in FIG. 4, the voice assistant performs the following steps S401 to S404.
S401: Receive a user's voice instruction.
The voice assistant 201 receives the user's voice instruction.
For example, after waking up the voice assistant 201, user A issues the voice instruction "What's the temperature in Nanjing today?" to the voice assistant 201.
S402: Obtain the broadcast content corresponding to the voice instruction.
The voice assistant 201 performs front-end processing on the voice instruction "What's the temperature in Nanjing today" to obtain an audio feature vector; recognizes the audio feature vector as text through an acoustic model and a language model; understands the text to obtain the intent corresponding to the voice instruction, "check the weather", with the slots "location: Nanjing" and "time: today"; and manages the dialog state: according to the dialog state of the previous turn, the broadcast content of the previous turn, and the intent information and slot information corresponding to the current voice instruction, it obtains the current dialog state, including the intent information, the filled slots, and the unfilled slots, and determines whether the voice instruction is executable.
In one implementable implementation, when the current dialog state is executable, the voice assistant 201 may determine the third-party application that executes the intent information; send the intent information and slot information corresponding to the voice instruction to that third-party application; and obtain the execution result returned by the third-party application (app) or server, the execution result being the broadcast content corresponding to the current voice instruction.
For example, the user issues the voice instruction "What's the temperature in Nanjing today" to the voice assistant 201. Combining the intent information and slot information related to the user request, the voice assistant 201 selects an appropriate third-party application (app) to execute the voice instruction, and outputs the execution result related to the user request returned by the third-party application (app). The execution result is the structured broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"}.
In one implementable implementation, when there is an unfilled slot in the current dialog state, it is determined that the voice instruction is not executable, and the voice assistant 201 may generate the broadcast content according to the dialog state.
For example, when there is an unfilled slot in the current dialog state, the voice assistant 201 obtains the next action information determined by the DPL module 35. The action information is an action tag or structured information, and the broadcast content is determined to be "REQUEST-SLOT: date", indicating that the user is to be asked for the date next.
S403: Generate a target broadcast text according to the broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
In one implementable implementation, the NLG module 36 may be a generative model. The broadcast content is used as the input of the generative model, and the user's broadcast length parameter is used as an additional parameter whose effect on the length of the output broadcast text is learned implicitly from the training data, generating the target broadcast text, i.e., a broadcast text whose duration matches the broadcast length parameter.
In another implementable implementation, the length or length range of the text generated by the generative model may be explicitly limited by the input broadcast length parameter: the broadcast content and the broadcast length parameter are used as the input of the model, and a target broadcast text of limited length is output.
In one implementable implementation, the NLG module 36 may be a retrieval model. The broadcast content is used as the input of the retrieval model, the corresponding template is retrieved according to the broadcast content, and the target broadcast text is generated through the retrieved template.
In one implementable implementation, the broadcast content and the user's broadcast length parameter are used as the input of the retrieval model, a template corresponding to the broadcast content is retrieved from a predefined template library according to the length limited by the broadcast length parameter, and the target broadcast text is output through the retrieved template.
In one implementable implementation, the broadcast length parameter may be determined according to the average or weighted average of at least one piece of historical listening duration information. For example, the UP module 213 obtains the user information, obtains the historical listening duration information of each time the user listened to a voice broadcast, and computes a statistical average or weighted average of these historical listening durations to obtain the broadcast length parameter; the minimum/maximum value or the most recent value of the historical listening durations may also be used as the broadcast length parameter.
For example, the UP module 213 obtains the user's historical listening duration t = 5 s, and after conversion through a mapping table it is determined that the character length of the broadcast text to be generated is 20, so the broadcast length parameter is 20. According to the returned broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"} and the broadcast length parameter 20, the NLG module 36 generates a target broadcast text of about 20 characters: "Nanjing is sunny today, with a low of 15 degrees Celsius and a high of 23 degrees Celsius".
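The conversion from historical listening duration to a character-length budget can be sketched with a hypothetical mapping table. The bucket boundaries below are assumptions for illustration; the text only states that a 5 s duration maps to a 20-character broadcast length parameter:

```python
# Hypothetical duration -> character-length mapping table (assumed values).
# The example in the text maps a 5 s historical listening duration to a
# 20-character broadcast length parameter, i.e. roughly 4 chars/second.

MAPPING_TABLE = [  # (max average duration in seconds, character budget)
    (2.0, 10),
    (5.0, 20),
    (10.0, 40),
]

def broadcast_length_param(history_durations):
    """Average the historical listening durations, then map to a length."""
    avg = sum(history_durations) / len(history_durations)
    for max_duration, chars in MAPPING_TABLE:
        if avg <= max_duration:
            return chars
    return MAPPING_TABLE[-1][1]  # cap very long listeners at the largest budget

print(broadcast_length_param([5.0, 4.8, 5.2]))  # 20
print(broadcast_length_param([2.1, 1.9]))       # 10
```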
Before the voice assistant of the smart device is enabled, the historical listening duration may take an initial value. The value may be a precise numerical record, such as "5 seconds" or "20 characters", or an identifier mapped to a certain duration range, such as "medium" or "concise". The initial value may also be the average listening duration obtained by the smart device manufacturer through user research, or the average listening duration of the group to which the user belongs. The embodiments of the present application do not limit the initial value of the historical listening duration.
Each time the user listens to a voice broadcast, the voice assistant 201 continuously records the listening duration of the broadcast and collects the duration information of each listened broadcast in the user portrait, obtaining multiple pieces of historical listening duration information.
In one implementable implementation, the recording of the listening duration may start timing from the moment the broadcast starts and end timing when one of the following occurs: the broadcast finishes, the broadcast is interrupted, or the user closes the program or switches to another program. The listening duration is the time interval from the start of timing to the end of timing.
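The timing rule above can be sketched as a small recorder. The event names and the class interface are assumptions for illustration:

```python
import time

# Assumed names for the three end conditions described in the text:
# broadcast finished, broadcast interrupted, program closed/switched.
END_EVENTS = {"finished", "interrupted", "closed", "switched"}

class ListeningRecorder:
    """Record one listening duration per broadcast: start timing when the
    broadcast starts, stop on the first end event, and append the interval
    to the user's listening history."""

    def __init__(self):
        self.history = []   # collected historical listening durations (s)
        self._start = None

    def broadcast_started(self):
        self._start = time.monotonic()

    def on_event(self, event):
        if event in END_EVENTS and self._start is not None:
            self.history.append(time.monotonic() - self._start)
            self._start = None

rec = ListeningRecorder()
rec.broadcast_started()
rec.on_event("interrupted")   # user cut the broadcast short
print(len(rec.history))       # 1
```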
S404: Control the broadcast speed of the target broadcast text according to the broadcast length parameter.
In one implementable implementation, after obtaining the broadcast text, the TTS module 37 uses the broadcast length parameter as a constraint on the speech rate of the broadcast speech, and converts the broadcast text into broadcast speech that matches the current user's historical listening habits.
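Using the broadcast length parameter as a speech-rate constraint can be sketched as follows. The clamping bounds are assumptions; the FIG. 5 example targets a 4.5 s to 5.5 s utterance for a 20-character text and a 5 s listening habit:

```python
def speech_rate(char_count, target_duration, min_rate=2.0, max_rate=8.0):
    """Characters per second needed to fit the text into the target
    duration, clamped to an assumed plausible TTS range."""
    rate = char_count / target_duration
    return max(min_rate, min(max_rate, rate))

# FIG. 5 example: a 20-character text for a user with a 5 s listening habit.
rate = speech_rate(20, 5.0)
print(rate)       # 4.0 characters per second
print(20 / rate)  # 5.0 s, within the 4.5-5.5 s window
```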
FIG. 5 is a schematic diagram of an application of the method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in FIG. 5, user A wakes up the voice assistant 201 and asks "What's the temperature in Nanjing today?"
The front-end processing module 31 performs audio decoding on the voice instruction "What's the temperature in Nanjing today?" input by user A, decoding it into an audio signal in PCM format; it separates the audio signal, denoises it, and extracts features using voiceprint or other characteristics, and obtains an audio feature vector through audio processing algorithms such as framing, windowing, and short-time Fourier transform.
The ASR module 32 converts the audio feature vector into text through an acoustic model and a language model. Specifically, the acoustic model converts the acoustic features in the audio feature vector into phonemes or words, the language model then converts the phonemes or words into a text sequence, and the text corresponding to user A's voice instruction is output.
The NLU module 33 understands the text and obtains the user's intent, "check the weather", with the slot "location: Nanjing".
The DST module 34 uses the "check the weather" intent of the current turn output by the NLU module 33 and the slot "location: Nanjing", and outputs the dialog state information of the current turn, including the intent information (check the weather) and the filled slots (Nanjing) and (time: today).
The DPL module 35 uses the dialog state information output by the DST module 34 to determine that the next action is to execute the instruction. Using the slot information as parameters, the DPL module 35 selects an appropriate third-party service or application (app) according to the intent information to execute the user's voice instruction, and sends the "check the weather" request to the corresponding third-party application (service provider W).
The NLG module 36 obtains the returned broadcast content as the structured information {"temperature": "15-23", "unit": "C", "location": "Nanjing"}. At the same time, it obtains user A's historical listening duration t_A = 5 s through the UP module 213, and after conversion through the mapping table it is determined that the character length of the broadcast text to be generated is 20, so the broadcast length parameter is 20.
According to the returned broadcast content and the broadcast length parameter, the NLG module 36 generates a target broadcast text of about 20 characters: "Nanjing is sunny today, with a low of 15 degrees Celsius and a high of 23 degrees Celsius".
The TTS module 37 performs speed control according to the target broadcast text and the listening duration t = 5 s, and generates broadcast speech with a length of 4.5 s to 5.5 s for broadcasting.
After the broadcast is completed, the voice assistant 201 sends the user's listening duration for this broadcast to the UP module 213, and the UP module 213 records the duration for which user A listened to this broadcast.
User B wakes up the voice assistant 201. The voice instruction input by user B is the same as user A's, and the process by which the DPL module 35 obtains the returned result is the same as for user A. At the same time, user B's historical listening duration t_B = 2 s is obtained through the UP module 213; after conversion it is determined that the character length of the broadcast text to be generated is about 10 characters, so the broadcast length parameter is 10. The target broadcast text generated by the NLG module 36 is "Sunny, 15 to 23 degrees Celsius", and the TTS module 37 generates broadcast speech with a duration of 1.5 s to 2.5 s.
As can be seen from the embodiment shown in FIG. 5, for different users A and B, according to the personalized differences in their historical listening durations, the method for generating broadcast text proposed in the embodiments of the present application can generate broadcast texts of different lengths for the same voice instruction, so that the voice assistant can generate personalized broadcast text according to the user's usage habits and then perform personalized broadcasting based on it.
The method for generating broadcast text proposed in the embodiments of the present application introduces user information in the generation stage of the broadcast text and broadcast speech, and controls the level of detail of the target broadcast text according to the user's historical listening durations recorded in the user information, providing a personalized interaction experience, tailored to each individual, between the user and the voice assistant.
Embodiment 2
On the basis of Embodiment 1, the method for generating broadcast text proposed in the embodiments of the present application introduces data of user information, device information, and/or current scene information, combines the user's voice instruction with the user's historical listening durations, device information, and/or current scene information to generate a broadcast text whose length matches the user's historical listening habits, and broadcasts it at a corresponding speech rate, providing a personalized broadcast experience. The user information includes the user's historical listening durations; the device information includes configuration information such as the display resolution and size of the broadcast device and the broadcast device category; the scene information includes information such as the room category, the background noise level, and the user's current motion state.
The voice assistant obtains the device information of the broadcast device in use through the DP module 214 and the current scene information through the CA module 215. The UP module 213 searches the database 230 using the device information and the scene information as indexes, and obtains the most fine-grained list of historical listening duration information that meets the threshold requirement, as shown in Table 2.
In the dialog system of the voice assistant 201, the user's historical listening duration is divided into three levels according to the device information and the current scene and is calculated separately for each level. The listening duration calculated at the finest-grained level currently available is used as the broadcast length parameter; step S403 of Embodiment 1 is performed to generate the target broadcast text, and step S404 of Embodiment 1 is performed for speed control, so that the broadcast text is broadcast at a speech rate matching the current user's listening habits.
After the user completes a broadcast-text listening event, the historical listening duration of the corresponding level is updated based on the three-level index structure. The three-level historical listening duration information list is shown in Table 2:
Table 2

Time      Device d      Scene e       Listening duration t
20:00:03  mobile phone  vehicle       1.7 s
22:05:03  TV            bedroom       8.1 s
12:20:10  TV            living room   5.2 s
19:05:54  TV            living room   7.1 s
08:03:03  mobile phone  bedroom       2.5 s
08:30:45  mobile phone  living room   3 s
17:35:04  mobile phone  vehicle       1.5 s
18:30:08  mobile phone  vehicle       1.9 s
For example, the broadcast length parameters are calculated according to the three-level listening durations. According to Table 2, the following broadcast length parameters are available: the overall listening duration t_total, the mobile phone listening duration t_d1, the TV listening duration t_d2, the vehicle listening duration t_e1, the living-room listening duration t_e2, and the listening duration of the mobile phone in the vehicle. From the data in Table 2:
t_total = average(all) = 3.875 s;
t_d1 = average(d1) = 2.12 s;
t_d2 = average(d2) = 6.8 s;
t_e1 = average(e1) = 1.7 s;
t_e2 = average(e2) = 5.1 s;
t_d1e1 = average(d1e1) = 1.7 s;
In the above formulas, average() is the mean function; among the index values in parentheses, d1 is the mobile phone, d2 is the TV, e1 is the vehicle, e2 is the living room, and d1e1 is the mobile phone in the vehicle.
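The per-level averages above follow directly from grouping the Table 2 records, which can be checked with a short script:

```python
# Recompute the three-level averages from the Table 2 records.
records = [  # (device, scene, listening duration in seconds)
    ("phone", "vehicle", 1.7), ("tv", "bedroom", 8.1),
    ("tv", "living_room", 5.2), ("tv", "living_room", 7.1),
    ("phone", "bedroom", 2.5), ("phone", "living_room", 3.0),
    ("phone", "vehicle", 1.5), ("phone", "vehicle", 1.9),
]

def avg(values):
    return sum(values) / len(values)

t_total = avg([t for _, _, t in records])                               # all records
t_d1 = avg([t for d, _, t in records if d == "phone"])                  # device level
t_d2 = avg([t for d, _, t in records if d == "tv"])
t_e1 = avg([t for _, e, t in records if e == "vehicle"])                # scene level
t_e2 = avg([t for _, e, t in records if e == "living_room"])
t_d1e1 = avg([t for d, e, t in records if d == "phone" and e == "vehicle"])

print(t_total, t_d1, t_d2, t_e1, t_e2, t_d1e1)
# 3.875, 2.12, 6.8, 1.7, 5.1, 1.7 (up to floating-point rounding)
```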
In one implementable implementation, multiple pieces of historical listening duration information may be determined according to the device information or the scene information, and the broadcast length parameter may be determined according to the average or weighted average of the multiple pieces of historical listening duration information. The broadcast length parameter determined according to the device information is denoted as the first broadcast length parameter; the broadcast length parameter determined according to the scene information is denoted as the second broadcast length parameter.
For example, when the number of historical listening duration records collected under the device information or the scene information is less than a threshold, the voice assistant uses the broadcast length parameter obtained through the first-level calculation.
The first-level calculation computes the overall listening duration t_total. The overall listening duration t_total is consistent with the user's historical listening duration defined in Embodiment 1, i.e., the average or weighted average of multiple pieces of historical listening duration information, and t_total is used as the broadcast length parameter.
For example, when the threshold is set such that only three or more records are valid, and the user has fewer than three listening records, the voice assistant may compute a statistical average or weighted average of the historical duration information of each of the user's voice broadcast listening events, determine the user's overall listening duration t_total, and use t_total as the broadcast length parameter.
For example, when the number of historical listening duration records collected under the device information or the scene information is greater than the threshold, the historical listening duration information obtained through the second-level calculation is used, and the broadcast length parameter is determined according to the average or weighted average of the multiple pieces of historical listening duration information.
The second-level calculation counts, for the listening duration t_d under the device information, the multiple pieces of historical listening duration information on the corresponding device, or, for the listening duration t_e under the scene information, the multiple historical listening durations in the corresponding scene.
For example, the device corresponding to the device information in Table 2 may be a smart terminal such as a mobile phone or a TV; the scene corresponding to the scene information may be a place such as a vehicle, a bedroom, or a living room, or a motion state such as exercising or resting.
示例性地,当用户A通过登录手机终端的语音助手收听天气播报的历史收听时长记录为5条,超过系统设定的阈值3条时,语音助手可以根据手机终端记录的每一条历史收听时长信息统计平均值或加权平均值,获得用户A在手机终端下的报播报长度参数。For example, when user A logs in the voice assistant of the mobile terminal to listen to the historical listening time record of the weather broadcast is 5, which exceeds the threshold of 3 set by the system, the voice assistant can record each piece of historical listening time information according to the mobile terminal Statistical average value or weighted average value to obtain user A's report broadcast length parameter under the mobile terminal.
示例性地,当用户B通过登录手机终端的语音助手在客厅收听天气播报的记录为1条,通过登录智能电视的语音助手在客厅收听天气播报的记录为2条,用户B通过语音助手在客厅收听天气播报的记录达到对话系统设定的阈值3条时,用户B登录的语音助手可以根据每一条在客厅收听播报的历史时长记录统计平均值或加权平均值,获得用户通过不同的智能终端在同一场景下的播报长度参数。For example, when user B logs into the voice assistant of the mobile terminal to listen to the weather broadcast in the living room, there is 1 record, and the record of listening to the weather broadcast in the living room through the voice assistant of the smart TV is 2, and user B uses the voice assistant in the living room. When the record of listening to the weather report reaches the threshold set by the dialog system of 3, the voice assistant logged in by user B can record statistical average or weighted average according to the historical duration of each record of listening to the broadcast in the living room, and obtain the user through different smart terminals. The broadcast length parameter in the same scene.
In a possible implementation, at least one piece of historical listening duration information may be determined according to the device information and the scene information, and the broadcast length parameter may be determined according to the average or weighted average of multiple pieces of historical listening duration information.
Exemplarily, when the number of historical listening duration records collected under the combination of device information and scene information is greater than the threshold, the historical listening duration information obtained through the third-level calculation is used. The broadcast length parameter determined according to the combination of device information and scene information is denoted as the third broadcast length parameter.
The third-level calculation counts the user's historical listening durations on the current device d in the current scene e according to the device-scene listening duration t_de.
Exemplarily, when user C has 3 historical listening duration records for weather broadcasts heard in a vehicle through a mobile phone terminal, reaching the threshold set by the dialogue system, the voice assistant user C is logged into may compute the average or weighted average of each recorded historical duration of listening to broadcasts in the vehicle, obtaining the broadcast length parameter for user C listening to broadcasts in the vehicle through the mobile phone terminal.
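The three-level fallback described above (device+scene, then device or scene alone, then the overall t_total) can be sketched as follows (an illustrative sketch only; the record layout, the threshold value of 3, and the function names are assumptions, not part of this application):

```python
THRESHOLD = 3  # minimum number of records for a level-specific average

def average(durations):
    return sum(durations) / len(durations)

def broadcast_length_parameter(history, device=None, scene=None):
    """Pick the most specific level with enough records.

    history: list of (device, scene, duration_seconds) tuples.
    Third level: device+scene (t_de); second level: device (t_d) or
    scene (t_e) alone; first level: the overall average t_total.
    """
    t_de = [t for d, e, t in history if d == device and e == scene]
    if len(t_de) >= THRESHOLD:
        return average(t_de)                    # third level: t_de
    t_d = [t for d, _, t in history if d == device]
    if len(t_d) >= THRESHOLD:
        return average(t_d)                     # second level: t_d
    t_e = [t for _, e, t in history if e == scene]
    if len(t_e) >= THRESHOLD:
        return average(t_e)                     # second level: t_e
    return average([t for _, _, t in history])  # first level: t_total
```

In the user B example, the living-room records from the mobile phone and the smart TV together reach the threshold, so the scene-level average t_e is used even though neither single device has enough records.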
Meanwhile, after the voice assistant completes a broadcast text listening event, it sends the user's listening duration for that event to the UP module 213, and the UP module 213 records the historical listening duration and its timestamp at the corresponding level in the three-level historical listening duration information list shown in Table 2.
With the broadcast text generation method proposed in Embodiment 2 of this application, for users whose historical listening durations differ across devices and scenes, the voice assistant 201 can generate target broadcast texts of different lengths for the same voice command, providing users with a more finely grained personalized interaction experience. Embodiment 2 performs refined statistics on the user's historical listening duration by device type and by scene, providing a personalized broadcast voice interaction experience better adapted to the user's usage scenario.
By combining the user's historical listening duration information, the device's parameters, and/or the current scene information in the broadcast text generation process, the dialogue system of the voice assistant 201 can provide the user with broadcast speech whose length and speech rate match the user's listening history and fit the device information and scene information, thereby improving the naturalness of voice interaction and greatly improving the user experience.
Embodiment 3
On the basis of Embodiments 1 and 2, the broadcast text generation method proposed in this embodiment of the application may obtain the broadcast length parameter through a machine learning model. The machine learning model may be implemented based on a random forest and is trained on the user's historical durations of listening to broadcasts, the screen size, the screen resolution, and/or the ambient noise level and the room category. These features are input into the machine learning model, which outputs the broadcast length parameter; the target broadcast text is generated according to the broadcast length parameter and the broadcast content and is broadcast at the corresponding speech rate, providing a personalized broadcast experience.
FIG. 6 is a schematic structural diagram of the random-forest-based machine learning model used in the broadcast text generation method proposed in Embodiment 3 of this application. As shown in FIG. 6, x is the input feature of the machine learning model, and the broadcast length parameter y is the output.
Exemplarily, the input feature x includes data such as user information, device information, and/or scene information, where the user information includes the user's historical listening durations; the device information includes the screen size, screen resolution, and the like of the current broadcast device; and the scene-related data includes the ambient noise level, the room category, and the like.
Exemplarily, the broadcast length parameter y includes a classification result such as "concise" or "moderate", or a predicted length limit value L of the broadcast text.
In a possible implementation, the machine learning model is a classification model: feature data such as user information, device information, and/or scene information is input, and the output broadcast length parameter y is the length classification result of the target broadcast text, denoted as the fourth broadcast length parameter, for example concise, moderate, or detailed. The classification model may be trained using a standard random forest classifier.
In a possible implementation, the machine learning model may be a regression model: feature data such as user information, device information, and/or scene information is input, and the output broadcast length parameter y is the length limit value L of the target broadcast text, denoted as the fifth broadcast length parameter. The regression model may be trained using a standard random forest regressor.
Each initial version of the above machine learning model is obtained through offline training. The system then continuously collects the user's historical listening durations under specific conditions of screen size, screen resolution, and/or ambient noise level and room category for online learning, providing broadcast length parameters that adapt to the user's historical listening habits.
The training data of the machine learning model includes the user's historical listening durations and/or device information, for example, the screen size and/or screen resolution of the current broadcast device, as well as scene information, for example, the ambient noise level and/or the room category; the label of each piece of training data is the broadcast length parameter expected to be generated. Each piece of training data may be obtained through the steps corresponding to Embodiments 1 and 2 above, or collected from the network environment in combination with user feedback, which is not limited here.
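The classifier (fourth broadcast length parameter) and regressor (fifth broadcast length parameter) variants can be sketched as follows. This is a minimal illustration under stated assumptions: the use of scikit-learn, the feature layout, and the tiny synthetic training set are all choices made here for demonstration and are not specified by this application.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Features x: [historical listening duration (s), screen size (in),
#              screen resolution width (px), ambient noise (dB)]
X = np.array([
    [5.0, 6.1, 1080, 60.0],    # short listener, noisy phone scene
    [6.0, 6.1, 1080, 65.0],
    [25.0, 55.0, 3840, 35.0],  # long listener, quiet TV scene
    [30.0, 55.0, 3840, 30.0],
])

# Fourth broadcast length parameter: a length category label per sample
y_class = ["concise", "concise", "detailed", "detailed"]
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_class)

# Fifth broadcast length parameter: a length limit value L (characters)
y_limit = [40.0, 45.0, 200.0, 220.0]
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_limit)

label = clf.predict([[5.5, 6.1, 1080, 62.0]])[0]
limit = reg.predict([[28.0, 55.0, 3840, 32.0]])[0]
```

Online learning as described above would periodically refit these models (or use an incrementally trainable learner) as new listening records arrive; the offline-trained model here only stands in for the initial model.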
The NLG module 36 uses the broadcast length parameter output by the above machine learning model to control the generated length of the target broadcast text.
The TTS module 37 uses the broadcast length parameter output by the machine learning model to control the speech rate of the broadcast speech and performs the broadcast at the corresponding speech rate.
Compared with Embodiment 2, the broadcast text generation method proposed in this embodiment of the application introduces a machine learning model, obtains the broadcast length parameter according to the user's historical listening durations, the device information, and/or the scene information, limits the length of the broadcast text and the broadcast speech according to the broadcast length parameter, and keeps the machine learning model learning continuously through an online learning mechanism, updating the broadcast length parameter to match the user's personalization. With the broadcast text generation method of Embodiment 3, the personalized experience of broadcasts generated by the voice assistant 201 becomes more accurate with use.
The broadcast text generation method proposed in Embodiment 3 of this application uses a machine learning model to learn the mapping from the user's historical listening durations to the expected broadcast text length and broadcast speech duration, and provides, through online learning, a personalized experience that becomes more accurate with use; Embodiment 1, by contrast, uses a rule-based mapping.
Embodiment 4
Owing to the development of pre-trained language models, many current NLP tasks can achieve significant metric improvements through this paradigm. The broadcast text generation method proposed in this embodiment of the application may use a pre-trained language model, such as the BERT language model or the GPT-2 language model, to integrate the broadcast length parameter into the controllable NLG module 36/TTS module 37 and generate the broadcast text or speech end to end.
FIG. 7 is a schematic diagram of the structure of the broadcast text generation method proposed in Embodiment 4 of this application based on a typical pre-trained language model. As shown in FIG. 7, the module encodes each category of user information, device information, and/or scene information with a linear encoder (linear), and then obtains the representation vector of the broadcast length parameter through a fusion module (fusion), denoted as the sixth broadcast length parameter. The sixth broadcast length parameter is input into the GPT-2 language model together with the current-round dialogue state output by the DST module 34 and the broadcast content of the current user voice command output by the DPL module 35, to generate a target broadcast text whose length matches the user's listening history.
In a possible implementation, the NLG module 36 first pre-trains the GPT-2 language model with unlabeled text data to acquire linguistic feature information. It then fine-tunes the model using broadcast content information that includes the broadcast content, the dialogue state, the corresponding user information, device information, and/or scene information, together with broadcast results that received positive user feedback, learning the encoder parameters corresponding to each category and adjusting the parameters of the output layer of the pre-trained GPT-2 model, so as to generate a target broadcast text whose length matches the user's listening history and adapt the model to this generation task.
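The per-category linear encoders and the fusion module producing the sixth broadcast length parameter can be sketched numerically as follows (a shape-level illustration only; the dimensions, the concatenate-then-project fusion scheme, and the random stand-in weights — which would in practice be learned during fine-tuning — are assumptions made here, not part of this application):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # representation size fed to the language model (illustrative)

# One linear encoder per information category; random values merely
# stand in for weights that fine-tuning would learn.
W_user = rng.standard_normal((d_model, 2))   # e.g. [t_total, t_de]
W_dev = rng.standard_normal((d_model, 2))    # e.g. [screen size, resolution]
W_scene = rng.standard_normal((d_model, 2))  # e.g. [noise dB, room id]

def encode(W, x):
    return W @ np.asarray(x, dtype=float)

# Fusion: concatenate the per-category encodings and project them down
W_fuse = rng.standard_normal((d_model, 3 * d_model))

def sixth_broadcast_length_parameter(user_x, dev_x, scene_x):
    fused = np.concatenate([encode(W_user, user_x),
                            encode(W_dev, dev_x),
                            encode(W_scene, scene_x)])
    return W_fuse @ fused  # representation vector passed to the LM input

v = sixth_broadcast_length_parameter([15.0, 12.0], [6.1, 1080.0], [60.0, 1.0])
```

The resulting vector v would be prepended to (or otherwise combined with) the dialogue-state and broadcast-content inputs of the GPT-2 model, conditioning generation on the desired length.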
The broadcast text generation method proposed in this embodiment of the application introduces, during broadcast text generation, device information and/or scene information in addition to user information, generating broadcast texts of different lengths. The historical durations of the user listening to broadcast texts are collected through the user information and stored differentially according to the environment and/or the device used; when a broadcast text is generated in a specific scene, the broadcast length parameter for that scene guides the broadcast text generation, producing a target broadcast text that matches the user's habits and fits the device information and/or the usage scene, improving the interaction experience and efficiency and providing a personalized voice assistant 201 that better understands the user.
In addition to generating broadcast text or speech according to a user request, the methods of the above embodiments of this application may also be used when the voice assistant 201 proactively issues a greeting, generates broadcast text or speech when the system is powered on or off, or generates broadcast text or speech in other scenarios that may match the user's personalized usage records, device information, and/or scene information.
Embodiment 5
This embodiment of the application proposes a text broadcasting method that can generate broadcast speech according to a user request, introduce user information in the broadcast speech generation stage, and control the speech rate of the target broadcast speech according to the user's historical listening durations recorded in the user information, providing a personalized interaction experience, distinct for every user, between the user and the voice assistant.
This embodiment of the application proposes a text broadcasting method, including: receiving a voice command of a user; generating a target broadcast text corresponding to the voice command; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information.
The voice assistant may determine the broadcast length parameter according to the average or weighted average of multiple pieces of historical listening duration information. For details, refer to the implementations related to determining the broadcast length parameter in Embodiment 1, which are not repeated here.
In some possible implementations, the broadcast length parameter is associated with device information, and a first broadcast length parameter may be determined according to the device information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, where the first broadcast length parameter indicates first historical listening duration information associated with the device information. For details, refer to the implementations related to the first broadcast length parameter in Embodiment 2, which are not repeated here.
In some possible implementations, the broadcast length parameter is associated with scene information, and a second broadcast length parameter may be determined according to the scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information. For details, refer to the implementations related to the second broadcast length parameter in Embodiment 2, which are not repeated here.
In some possible implementations, the broadcast length parameter is associated with device information and scene information, and a third broadcast length parameter may be determined according to the device information and the scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information. For details, refer to the implementations related to the third broadcast length parameter in Embodiment 2, which are not repeated here.
In some possible implementations, controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting historical listening duration information, device information, and/or scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter. For details, refer to the implementations related to obtaining the fourth broadcast length parameter through a classification model in Embodiment 3, which are not repeated here.
In a possible implementation, controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting historical listening duration information, device information, and/or scene information into a regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter. For details, refer to the implementations related to obtaining the fifth broadcast length parameter through a regression model in Embodiment 3, which are not repeated here.
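One way the broadcast length parameter could drive the broadcast speed is sketched below (an illustrative mapping only; the category scale factors, the characters-per-second base rate, and the clamping range are assumptions made here, not values specified by this application):

```python
def speech_rate_for(length_param, text_chars=0, base_rate=4.0):
    """Map a broadcast length parameter to a TTS speech rate.

    length_param: either a length category ("concise"/"moderate"/"detailed"),
    as output by the classification model, or a target listening duration in
    seconds. base_rate is a nominal rate in characters per second.
    """
    category_scale = {"concise": 1.25, "moderate": 1.0, "detailed": 0.85}
    if isinstance(length_param, str):
        # categorical parameter: speak faster for concise, slower for detailed
        return base_rate * category_scale[length_param]
    # numeric parameter: fit text_chars characters into length_param seconds,
    # clamped to a comfortable range around the base rate
    rate = text_chars / length_param
    return min(max(rate, 0.5 * base_rate), 2.0 * base_rate)

rate = speech_rate_for(10.0, text_chars=60)  # target ~10 s for 60 characters
```

The clamp keeps the synthesized speech within an intelligible range even when the target duration is much shorter or longer than the text would naturally take.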
It can be understood that the embodiments of this application are not isolated embodiments; a person skilled in the art may associate or combine the embodiments, and the associated and combined solutions all fall within the protection scope of the embodiments of this application.
An embodiment of this application provides an electronic device, including: at least one memory configured to store a program; and at least one processor configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method of any one of the above embodiments.
An embodiment of this application provides a storage medium storing instructions that, when run on a terminal, cause the terminal to perform the method of any one of the above embodiments.
In a plain-text generation scenario, the broadcast text listening duration defined in the embodiments of this application may also be converted into an equivalent metric, such as the time the user spends viewing the broadcast text.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation shall not be considered beyond the scope of the embodiments of this application.
In addition, various aspects or features of the embodiments of this application may be implemented as a method, an apparatus, or an article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" used in this application covers a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include but are not limited to: magnetic storage devices (for example, a hard disk, a floppy disk, or a magnetic tape), optical discs (for example, a compact disc (CD) or a digital versatile disc (DVD)), smart cards, and flash memory devices (for example, an erasable programmable read-only memory (EPROM), a card, a stick, or a key drive). In addition, the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable media" may include but is not limited to wireless channels and various other media capable of storing, including, and/or carrying instructions and/or data.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation processes of the embodiments of this application.
A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, an access network device, or the like) to perform all or some of the steps of the methods described in the various embodiments of this application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

  1. A broadcast text generation method, characterized in that the method comprises:
    receiving a voice command of a user;
    obtaining broadcast content corresponding to the voice command; and
    generating a target broadcast text according to a broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening duration information.
  2. The broadcast text generation method according to claim 1, characterized in that the generating a target broadcast text according to a broadcast length parameter and the broadcast content comprises:
    using the broadcast content and the broadcast length parameter as input to a model, the model outputting a target broadcast text, wherein the target broadcast text is a broadcast text whose duration matches the broadcast length parameter.
  3. The method for generating broadcast text according to claim 2, wherein the model is a generative model or a retrieval model; and
    the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    using the broadcast content and the broadcast length parameter as input to the generative model, wherein the generative model generates and outputs the target broadcast text; or
    using the broadcast content and the broadcast length parameter as input to the retrieval model, wherein the retrieval model retrieves a length-limited text template from a predefined template library according to the broadcast length parameter, and the target broadcast text is output through the retrieved length-limited text template, the target broadcast text being a broadcast text whose duration matches the historical listening duration information.
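For illustration only, the retrieval branch of claim 3 can be sketched as selecting, from a template library, the template whose estimated spoken duration is closest to the broadcast length parameter. The template texts, the assumed speaking rate, and the library structure below are the editor's assumptions, not part of the claimed method.

```python
# Hypothetical sketch: pick the template whose estimated spoken duration
# best matches the broadcast length parameter (in seconds).

WORDS_PER_SECOND = 2.5  # assumed average speaking rate

TEMPLATE_LIBRARY = {
    "weather": [
        "Sunny, 25 degrees.",
        "Today will be sunny with a high of 25 degrees and light wind.",
        "Good morning! Today's forecast is sunny with a high of 25 degrees, "
        "a low of 18 degrees, and a gentle breeze from the southeast.",
    ],
}

def estimate_duration(text: str) -> float:
    """Estimate spoken duration in seconds from the word count."""
    return len(text.split()) / WORDS_PER_SECOND

def retrieve_template(topic: str, target_seconds: float) -> str:
    """Return the template whose estimated duration is closest to the target."""
    candidates = TEMPLATE_LIBRARY[topic]
    return min(candidates, key=lambda t: abs(estimate_duration(t) - target_seconds))
```

With a short target the shortest template wins, and a longer target selects a more detailed one.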
  4. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with device information, a first broadcast length parameter is determined according to the device information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically comprises:
    generating a first target broadcast text according to the first broadcast length parameter and the broadcast content, wherein the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  5. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically comprises:
    generating a second target broadcast text according to the second broadcast length parameter and the broadcast content, wherein the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  6. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically comprises:
    generating a third target broadcast text according to the third broadcast length parameter and the broadcast content, wherein the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  7. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    inputting the historical listening duration information, the device information and/or the scene information into a classification model, and outputting a fourth broadcast length parameter, wherein the fourth broadcast length parameter is one of different length categories; and
    generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content.
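For illustration only, the classification model of claim 7 can be mimicked by a rule-based stand-in that maps the historical listening duration, the device, and the scene to a discrete length category. The thresholds, device names, and category labels are the editor's assumptions, not the claimed model.

```python
# Hypothetical rule-based stand-in for a length-category classifier.

def classify_length(history_seconds: float, device: str, scene: str) -> str:
    """Return a discrete broadcast-length category: short, medium, or long."""
    # Driving scenes and small-screen devices favour shorter broadcasts.
    if scene == "driving" or device == "watch":
        return "short"
    if history_seconds < 5:
        return "short"
    if history_seconds < 15:
        return "medium"
    return "long"
```

A trained classifier would learn these boundaries from logged interactions instead of hard-coding them.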
  8. The method for generating broadcast text according to claim 1, wherein the broadcast length parameter is associated with device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    inputting the historical listening duration information, the device information and/or the scene information into a regression model, and outputting a fifth broadcast length parameter, wherein the fifth broadcast length parameter is a length limit value; and
    generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content.
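For illustration only, the regression model of claim 8 differs from claim 7's classifier in that it outputs a continuous length limit rather than a category. A minimal linear sketch is shown below; the weights and feature encodings are the editor's assumptions, not the claimed model.

```python
# Hypothetical linear-regression stand-in producing a length limit in seconds.

DEVICE_WEIGHT = {"watch": 0.5, "phone": 1.0, "speaker": 1.5}
SCENE_WEIGHT = {"driving": 0.6, "home": 1.2}

def predict_length_limit(history_seconds: float, device: str, scene: str) -> float:
    """Predict a broadcast length limit (seconds) from simple weighted features."""
    base = 0.8 * history_seconds  # lean slightly below past listening time
    return base * DEVICE_WEIGHT.get(device, 1.0) * SCENE_WEIGHT.get(scene, 1.0)
```

In a real system the weights would be fit to logged listening data rather than fixed by hand.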
  9. The method for generating broadcast text according to claim 1, wherein the broadcast length parameter is associated with device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    performing linear encoding on the device information, the scene information and/or the historical listening duration information respectively, and fusing the encoded results to obtain a sixth broadcast length parameter, wherein the sixth broadcast length parameter is a representation vector of the broadcast length parameter; and
    using the sixth broadcast length parameter, the broadcast content, and whether the voice instruction is executable or not executable as input to a pre-trained language model, and outputting a sixth target broadcast text.
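For illustration only, the encode-and-fuse step of claim 9 can be sketched with one-hot encodings fused by concatenation into a single representation vector. The vocabularies, dimensions, and scaling are the editor's assumptions; the claim does not specify them.

```python
# Hypothetical sketch: linearly encode each signal, then fuse by concatenation.

DEVICE_VOCAB = {"phone": 0, "speaker": 1, "watch": 2}
SCENE_VOCAB = {"home": 0, "driving": 1}

def one_hot(index: int, size: int) -> list[float]:
    """Return a one-hot vector of the given size."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def encode_length_parameter(device: str, scene: str, history_seconds: float) -> list[float]:
    """Fuse linearly encoded features into one representation vector."""
    device_vec = one_hot(DEVICE_VOCAB[device], len(DEVICE_VOCAB))
    scene_vec = one_hot(SCENE_VOCAB[scene], len(SCENE_VOCAB))
    history_vec = [history_seconds / 60.0]  # scale to roughly [0, 1]
    return device_vec + scene_vec + history_vec  # concatenation as fusion
```

The resulting vector is what a pre-trained language model could consume, alongside the broadcast content, as a conditioning input.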
  10. The method for generating broadcast text according to any one of claims 1-9, wherein the obtaining broadcast content corresponding to the voice instruction comprises:
    obtaining intent and slot information according to the voice instruction;
    determining, according to the intent and slot information, whether the voice instruction is executable; and
    generating the broadcast content when the voice instruction is not executable, wherein the broadcast content is inquiry information.
  11. The method for generating broadcast text according to any one of claims 1-9, wherein the obtaining broadcast content corresponding to the voice instruction comprises:
    obtaining intent and slot information according to the voice instruction;
    determining, according to the intent and slot information, whether the voice instruction is executable;
    determining, when the voice instruction is executable, a third-party service for executing the intent; and
    obtaining the broadcast content from the third-party service, wherein the broadcast content is result information corresponding to the voice instruction.
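For illustration only, the branching logic of claims 10 and 11 can be sketched as: a command is executable only when all required slots for its intent are filled; otherwise the broadcast content becomes a follow-up question. The intent schema and wording are the editor's assumptions.

```python
# Hypothetical sketch of the executability check and resulting broadcast content.

REQUIRED_SLOTS = {"set_alarm": ["time"], "play_music": ["song"]}

def build_broadcast_content(intent: str, slots: dict) -> dict:
    """Decide whether the instruction is executable and produce broadcast content."""
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    if missing:
        # Not executable: ask the user for the missing information (claim 10).
        return {"executable": False, "content": f"Which {missing[0]} would you like?"}
    # Executable: a real system would now invoke the third-party service (claim 11).
    return {"executable": True, "content": f"OK, handling {intent}."}
```

In a full dialogue system the second branch would replace the placeholder string with result information returned by the service.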
  12. The method for generating broadcast text according to any one of claims 1-11, wherein the method further comprises:
    controlling a broadcast speed of the target broadcast text according to the broadcast length parameter.
  13. The method for generating broadcast text according to any one of claims 1-12, wherein the method further comprises:
    recording a broadcast duration of the current target broadcast text to obtain the historical listening duration information.
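For illustration only, the recording step of claim 13 can be sketched as a small history store that logs each broadcast's duration and exposes the historical listening duration as a running mean. The storage shape and the choice of a simple mean are the editor's assumptions.

```python
# Hypothetical sketch: accumulate broadcast durations into listening history.

class ListeningHistory:
    def __init__(self) -> None:
        self.durations: list[float] = []

    def record(self, seconds: float) -> None:
        """Record the duration of the broadcast just played."""
        self.durations.append(seconds)

    def average_seconds(self) -> float:
        """Historical listening duration as a simple mean (0.0 if empty)."""
        if not self.durations:
            return 0.0
        return sum(self.durations) / len(self.durations)
```

The value returned by `average_seconds` is one plausible source for the broadcast length parameter used in claim 1.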
  14. A method for broadcasting text, applied to a voice assistant, wherein the method comprises:
    receiving a voice instruction of a user;
    generating a target broadcast text corresponding to the voice instruction; and
    controlling a broadcast speed of the target broadcast text according to a broadcast length parameter, wherein the broadcast length parameter indicates historical listening duration information.
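For illustration only, the speed control of claim 14 can be sketched as raising or lowering the speech rate just enough for a fixed text to fit the broadcast length parameter, clamped to a comfortable range. The baseline rate and clamping bounds are the editor's assumptions.

```python
# Hypothetical sketch: choose a speech rate so the text fits the target duration.

BASE_WORDS_PER_SECOND = 2.5  # assumed baseline speaking rate

def speech_rate_for(text: str, target_seconds: float) -> float:
    """Return a words-per-second rate so the text fits the target duration."""
    words = len(text.split())
    needed = words / target_seconds
    # Clamp to a comfortable range around the baseline (80% to 150%).
    return min(max(needed, BASE_WORDS_PER_SECOND * 0.8), BASE_WORDS_PER_SECOND * 1.5)
```

A text-to-speech engine would then be driven at the returned rate; texts far too long for the target are capped rather than played unintelligibly fast.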
  15. The method for broadcasting text according to claim 14, wherein the broadcast length parameter is associated with device information, a first broadcast length parameter is determined according to the device information, and the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, wherein the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  16. The method for broadcasting text according to claim 14, wherein the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, wherein the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  17. The method for broadcasting text according to claim 14, wherein the broadcast length parameter is associated with device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter, wherein the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  18. The method for broadcasting text according to claim 14, wherein the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    inputting the historical listening duration information, device information and/or scene information into a classification model, and outputting a fourth broadcast length parameter, wherein the fourth broadcast length parameter is one of different length categories; and
    controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
  19. The method for broadcasting text according to claim 14, wherein the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    inputting the historical listening duration information, device information and/or scene information into a regression model, and outputting a fifth broadcast length parameter, wherein the fifth broadcast length parameter is a length limit value; and
    controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
  20. An electronic device, comprising:
    at least one memory configured to store a program; and
    at least one processor configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1-19.
  21. A storage medium, wherein the storage medium stores instructions, and when the instructions are run on a terminal, the terminal is caused to perform the method according to any one of claims 1-19.
PCT/CN2022/095805 2021-06-30 2022-05-28 Broadcasting text generation method and apparatus, and electronic device WO2023273749A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280029750.4A CN117203703A (en) 2021-06-30 2022-05-28 Method and device for generating broadcast text and electronic equipment

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110741280.1 2021-06-30
CN202110741280 2021-06-30
CNPCT/CN2022/084068 2022-03-30
CN2022084068 2022-03-30

Publications (1)

Publication Number Publication Date
WO2023273749A1 true WO2023273749A1 (en) 2023-01-05

Family

ID=84692502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095805 WO2023273749A1 (en) 2021-06-30 2022-05-28 Broadcasting text generation method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN117203703A (en)
WO (1) WO2023273749A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156275A (en) * 2023-04-19 2023-05-23 江西省气象服务中心(江西省专业气象台、江西省气象宣传与科普中心) Meteorological information broadcasting method and system
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766428A (en) * 2018-06-01 2018-11-06 安徽江淮汽车集团股份有限公司 A kind of voice broadcast control method and system
US20180332357A1 (en) * 2015-11-30 2018-11-15 Sony Corporation Information processing apparatus, information processing method, and program
CN108846054A (en) * 2018-05-31 2018-11-20 出门问问信息科技有限公司 A kind of audio data continuous playing method and device
US20180350366A1 (en) * 2017-05-30 2018-12-06 Hyundai Motor Company Situation-based conversation initiating apparatus, system, vehicle and method
CN110136705A (en) * 2019-04-10 2019-08-16 华为技术有限公司 A kind of method and electronic equipment of human-computer interaction
CN111081244A (en) * 2019-12-23 2020-04-28 广州小鹏汽车科技有限公司 Voice interaction method and device
CN112071313A (en) * 2020-07-22 2020-12-11 特斯联科技集团有限公司 Voice broadcasting method and device, electronic equipment and medium
CN112700775A (en) * 2020-12-29 2021-04-23 维沃移动通信有限公司 Method and device for updating voice receiving period and electronic equipment
CN112820289A (en) * 2020-12-31 2021-05-18 广东美的厨房电器制造有限公司 Voice playing method, voice playing system, electric appliance and readable storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156275A (en) * 2023-04-19 2023-05-23 江西省气象服务中心(江西省专业气象台、江西省气象宣传与科普中心) Meteorological information broadcasting method and system
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model
CN117789680B (en) * 2024-02-23 2024-05-24 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Also Published As

Publication number Publication date
CN117203703A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US11676575B2 (en) On-device learning in a hybrid speech processing system
US20210142794A1 (en) Speech processing dialog management
Latif et al. A survey on deep reinforcement learning for audio-based applications
WO2023273749A1 (en) Broadcasting text generation method and apparatus, and electronic device
US11189277B2 (en) Dynamic gazetteers for personalized entity recognition
US20240153525A1 (en) Alternate response generation
US11250857B1 (en) Polling with a natural language interface
US11580982B1 (en) Receiving voice samples from listeners of media programs
US11276403B2 (en) Natural language speech processing application selection
US11687526B1 (en) Identifying user content
US11070644B1 (en) Resource grouped architecture for profile switching
US11348601B1 (en) Natural language understanding using voice characteristics
US11605376B1 (en) Processing orchestration for systems including machine-learned components
CN114051639A (en) Emotion detection using speaker baseline
US10600419B1 (en) System command processing
US11132994B1 (en) Multi-domain dialog state tracking
US11257482B2 (en) Electronic device and control method
US20230377574A1 (en) Word selection for natural language interface
US11893310B2 (en) System command processing
US11335346B1 (en) Natural language understanding processing
US11996081B2 (en) Visual responses to user inputs
US11288513B1 (en) Predictive image analysis
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
US11430435B1 (en) Prompts for user feedback

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831571

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280029750.4

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE