WO2023273749A1 - Broadcasting text generation method and apparatus, and electronic device - Google Patents

Broadcasting text generation method and apparatus, and electronic device

Info

Publication number
WO2023273749A1
Authority
WO
WIPO (PCT)
Prior art keywords
broadcast
length parameter
text
information
target
Prior art date
Application number
PCT/CN2022/095805
Other languages
French (fr)
Chinese (zh)
Inventor
Chen Kaiji (陈开济)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202280029750.4A (published as CN117203703A)
Publication of WO2023273749A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence (AI), and in particular to a method, device and electronic device for generating broadcast text.
  • A voice assistant or virtual assistant is agent software that can perform tasks or services on behalf of an individual, and is widely used in devices such as smartphones, smart speakers, and smart vehicle terminals (electronic control units, ECUs).
  • A voice assistant or virtual assistant provides a voice user interface (VUI) and completes corresponding tasks or provides related services according to the user's voice command input. After executing the voice command issued by the user, the voice assistant generates broadcast text and produces the corresponding broadcast voice through a text-to-speech (TTS) module, informing the user of the broadcast content and guiding the user to continue using the device.
  • The broadcast text of current voice assistants is generally fixed: when interacting with different users, there is no difference in the broadcast voice or broadcast text. How to provide users with broadcasts that conform to their personal usage habits, and thereby improve the naturalness of interaction, is an urgent problem to be solved.
  • the embodiments of the present application provide a method, device, terminal device and system for generating broadcast text.
  • an embodiment of the present application provides a method for generating broadcast text, the method comprising: receiving a user's voice command; acquiring the broadcast content corresponding to the voice command; and generating a target broadcast text according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
  • the generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as input to a model, which outputs the target broadcast text; the target broadcast text is a broadcast text whose duration matches the broadcast length parameter.
  • Based on the broadcast length parameter, the model can provide users with voice assistant broadcast texts that conform to their personal historical usage habits, delivering a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
  • the model is a generative model or a retrieval model. Generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as input to the generative model, which outputs a target broadcast text whose duration matches the broadcast length parameter; or using the broadcast content and the broadcast length parameter as input to the retrieval model, which retrieves a text template of limited length from a predefined template library according to the broadcast length parameter and, from that template, outputs a target broadcast text whose duration matches the historical listening duration information.
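The retrieval-model variant above can be illustrated with a minimal sketch: pick, from a predefined template library, the template whose estimated spoken duration best matches the broadcast length parameter, then fill in the broadcast content. The template texts, the assumed words-per-second speech rate, and all function names below are illustrative assumptions, not the implementation specified by the patent.

```python
# Hypothetical template library: same content at three verbosity levels.
TEMPLATE_LIBRARY = [
    "{city} {temp}.",                                                      # terse
    "Today in {city}: {temp}.",                                            # medium
    "Good day! The weather in {city} today is {temp}, have a nice trip.",  # verbose
]

WORDS_PER_SECOND = 2.5  # assumed average TTS speech rate


def estimate_duration(text: str) -> float:
    """Rough spoken duration of a text (placeholders counted as one word)."""
    return len(text.split()) / WORDS_PER_SECOND


def retrieve_template(target_duration: float) -> str:
    """Return the template whose estimated duration is closest to the target."""
    return min(TEMPLATE_LIBRARY,
               key=lambda t: abs(estimate_duration(t) - target_duration))


def generate_broadcast_text(content: dict, target_duration: float) -> str:
    """Retrieve a length-matched template and fill in the broadcast content."""
    return retrieve_template(target_duration).format(**content)
```

With a short historical listening duration the terse template is selected; with a long one, the verbose template.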
  • the broadcast length parameter is associated with device information
  • the first broadcast length parameter is determined according to the device information
  • the target broadcast text is generated according to the broadcast length parameter and the broadcast content, specifically including: generating a first target broadcast text according to the first broadcast length parameter and the broadcast content; the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  • the broadcast length parameter is associated with scene information
  • the second broadcast length parameter is determined according to the scene information
  • the target broadcast text is generated according to the broadcast length parameter and the broadcast content, specifically including: generating a second target broadcast text according to the second broadcast length parameter and the broadcast content; the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  • the broadcast length parameter is associated with device information and scene information
  • the third broadcast length parameter is determined according to the device information and scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a third target broadcast text according to the third broadcast length parameter and the broadcast content; the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  • the broadcast length parameter is associated with device information and/or scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information and/or scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content.
  • In this way, the broadcast length parameter obtained through the classification model conforms to the user's personal historical usage habits and adapts to the device and/or the current scene of the voice assistant broadcast, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
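As a toy stand-in for the classification model described above, the sketch below maps (historical listening duration, device, scene) features to a discrete length category. The feature encoding, bias tables, and thresholds are invented for illustration; the patent's actual model would be trained from logged user data.

```python
# Assumed per-device and per-scene offsets (seconds); purely illustrative.
DEVICE_BIAS = {"speaker": 1.0, "phone": 0.0, "car": -1.0}
SCENE_BIAS = {"rest": 1.0, "chat": 0.0, "driving": -1.5}


def classify_length(history_avg_s: float, device: str, scene: str) -> str:
    """Map user history, device, and scene to a broadcast-length category."""
    score = history_avg_s + DEVICE_BIAS.get(device, 0.0) + SCENE_BIAS.get(scene, 0.0)
    if score < 3.0:
        return "short"
    if score < 8.0:
        return "medium"
    return "long"
```

A user who historically listens briefly while driving would get a "short" category; a user who listens long on a smart speaker at rest would get "long".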
  • the broadcast length parameter is associated with device information and/or scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information and/or scene information into a regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content.
  • In this way, the regression model can be used to generate a voice assistant broadcast that conforms to personal historical usage habits and adapts to the device and/or the current scene, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
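Unlike the classification variant, the regression model above outputs a continuous length limit. In the sketch below a one-feature least-squares fit on invented logged data stands in for the trained regression model; the data points and function names are assumptions for illustration only.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx


# Toy history: (average historical listening duration, preferred length), seconds.
history = [(2.0, 2.5), (4.0, 4.5), (6.0, 6.5), (8.0, 8.5)]
a, b = fit_linear([h[0] for h in history], [h[1] for h in history])


def predict_length_limit(history_avg_s: float) -> float:
    """Continuous length limit value (the 'fifth broadcast length parameter')."""
    return a * history_avg_s + b
```

A real system would add device and scene features as further regressors rather than fitting on listening duration alone.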
  • the broadcast length parameter is associated with device information and/or scene information
  • generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: linearly encoding the device information, scene information and/or historical listening duration information respectively and fusing the encodings to obtain a sixth broadcast length parameter, where the sixth broadcast length parameter is a characterization vector of the broadcast length parameter; and using the sixth broadcast length parameter, the broadcast content, and whether the voice instruction is executable or non-executable as the input of a pre-trained language model, which outputs a sixth target broadcast text.
  • In this way, the pre-trained language model can be used to generate voice assistant broadcasts that conform to personal historical usage habits and adapt to the device and/or the current scene, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
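The "linearly encode each input, then fuse" step above can be sketched numerically as follows. Each feature group gets its own linear layer, and the encodings are fused by element-wise summation into one characterization vector. The weights here are random stand-ins; in the patent's scheme they would be learned jointly with the pre-trained language model, and the fusion method (sum vs. concatenation) is an assumption.

```python
import random

random.seed(0)
DIM = 4  # assumed size of the characterization vector


def linear_encode(features, weights, bias):
    """One linear layer: out[i] = sum_j W[i][j] * x[j] + b[i]."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def random_layer(in_dim):
    """Random stand-in weights for an untrained linear layer."""
    return ([[random.uniform(-1, 1) for _ in range(in_dim)] for _ in range(DIM)],
            [0.0] * DIM)


dev_layer, scene_layer, dur_layer = random_layer(2), random_layer(3), random_layer(1)


def fuse(device_feats, scene_feats, duration_feats):
    """Element-wise sum of the three linear encodings (the 'sixth' parameter)."""
    encoded = [linear_encode(device_feats, *dev_layer),
               linear_encode(scene_feats, *scene_layer),
               linear_encode(duration_feats, *dur_layer)]
    return [sum(vals) for vals in zip(*encoded)]
```

The resulting vector would then be fed, together with the broadcast content and the executability flag, to the language model.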
  • the acquiring the broadcast content corresponding to the voice command includes: acquiring intent and slot information according to the voice command; determining, according to the intent and slot information, whether the voice command is executable; and, in the case that the voice command is not executable, generating broadcast content, where the broadcast content is inquiry information. In this way, when the voice command cannot be executed, the broadcast content with which the voice assistant queries the user can be obtained.
  • the determining the broadcast content according to the dialog state includes: acquiring intent and slot information according to the voice instruction; determining whether the voice instruction is executable; in the case that the voice instruction is executable, determining the third-party service that executes the intent; and obtaining the broadcast content from the third-party service, where the broadcast content is the result information corresponding to the voice instruction.
  • That is, when the voice command is executable, the broadcast content returned after the third-party service executes the voice command is obtained.
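The executable/non-executable branching above can be sketched as a small dialog step: a command is treated as executable when all required slots are filled, in which case a third-party service returns the result; otherwise the broadcast content is an inquiry. The required-slot table, the service stub, and the wording are illustrative assumptions.

```python
# Hypothetical required-slot table for one intent.
REQUIRED_SLOTS = {"check_weather": ["location", "time"]}


def weather_service(slots):
    """Stand-in for the third-party service that executes the intent."""
    return f"Sunny in {slots['location']} {slots['time']}, 22 degrees."


def get_broadcast_content(intent, slots):
    """Return result info if executable, else an inquiry for the missing slot."""
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    if missing:  # not executable: broadcast content is inquiry information
        return f"Please tell me the {missing[0]}."
    return weather_service(slots)  # executable: result from the service
```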
  • the method further includes: controlling the broadcast speed of the target broadcast text according to the broadcast length parameter. In this way, it is possible to generate broadcast voices that conform to personal historical usage habits and adapt to the device and/or the current scene, providing a broadcast experience personalized to each user and improving the naturalness of voice assistant interaction.
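One plausible reading of the speed control above: given the target broadcast text and the duration the user historically listens for, derive a TTS speed factor so the broadcast fits that duration. The words-per-second baseline and the clamping range below are assumptions to keep the speech intelligible, not values from the patent.

```python
BASE_WPS = 2.5                     # assumed normal TTS rate, words per second
MIN_SPEED, MAX_SPEED = 0.75, 1.5   # assumed intelligibility limits


def speed_factor(text: str, target_duration_s: float) -> float:
    """TTS speed multiplier so the text roughly fits the target duration."""
    natural = len(text.split()) / BASE_WPS
    factor = natural / target_duration_s
    return max(MIN_SPEED, min(MAX_SPEED, factor))
```

A 10-word broadcast at the assumed rate takes 4 s naturally; a user who listens for only 1 s gets the maximum speed-up, while a very long target duration clamps to the minimum speed rather than slowing to a crawl.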
  • the method further includes: recording the current broadcast duration of the target broadcast text, and obtaining the historical listening duration information. In this way, a personalized broadcast experience that conforms to personal historical usage habits can be obtained, and the naturalness of voice assistant interaction can be improved.
  • an embodiment of the present application provides a method for broadcasting text, the method comprising: receiving a user's voice command; generating a target broadcast text corresponding to the voice command; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information.
  • the beneficial effect of controlling the broadcast speed of the target broadcast text according to the broadcast length parameter is the same as that of the embodiments of the first aspect of the present application, in which the target broadcast text is generated from the broadcast length parameter, and will not be repeated here.
  • the broadcast length parameter is associated with device information
  • the first broadcast length parameter is determined according to the device information
  • the broadcast speed of the target broadcast text is controlled according to the broadcast length parameter, including: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter; the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  • the broadcast length parameter is associated with scene information
  • the second broadcast length parameter is determined according to the scene information
  • the broadcast speed of the target broadcast text is controlled according to the broadcast length parameter, including: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter; the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  • the broadcast length parameter is associated with device information and scene information, and a third broadcast length parameter is determined according to the device information and scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter; the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  • the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, device information and/or scene information into the classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
  • the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, device information and/or scene information into the regression model; Outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
  • an embodiment of the present application provides an electronic device, including: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory; when the programs stored in the memory are executed, the processor is configured to execute the method described in any one of the foregoing embodiments.
  • an embodiment of the present application provides a storage medium, where instructions are stored in the storage medium; when the instructions are run on a terminal, the terminal is made to execute the method described in any one of the foregoing embodiments.
  • Fig. 1 is a schematic diagram of an artificial intelligence main frame
  • FIG. 2 is a schematic diagram of the application system of the voice assistant proposed in the embodiment of the present application.
  • FIG. 3 is a functional architecture diagram of the voice assistant in the embodiment of the present application.
  • FIG. 4 is a flowchart of a method for generating broadcast text proposed in Embodiment 1 of the present application.
  • FIG. 5 is an application schematic diagram of a method for generating broadcast text proposed in Embodiment 1 of the present application.
  • FIG. 6 is a schematic structural diagram of a random forest-based machine learning model based on a method for generating broadcast text proposed in Embodiment 3 of the present application;
  • FIG. 7 is a schematic diagram of a structure of a typical pre-trained language model based on a method for generating broadcast text proposed in Embodiment 4 of the present application.
  • Natural language generation is a part of natural language processing, which generates natural language from machine representation systems such as knowledge bases or logical forms.
  • NLG can be regarded as the reverse of natural language understanding (NLU): NLU must determine the meaning of the input language and produce a machine representation, while NLG must decide how to convert the conceptual machine representation into natural language that users can understand.
  • the user wakes up the voice assistant and issues a voice command related to querying the weather.
  • the voice assistant uses natural language understanding (NLU) to interpret the weather-related voice command issued by the user, classifies the voice command according to a natural language classification system similar to Table 1, queries the weather according to the classification result, and selects the corresponding template according to the weather query result to generate the broadcast text corresponding to the weather, or generates the broadcast text corresponding to the weather information category and its associated attributes; the broadcast text content matches the category to which the voice command belongs.
  • This solution generates different types of broadcast text according to the different voice commands input by the user, but the content of the broadcast text is related only to the type of voice command; it does not consider the user's personal usage habits, differences between devices, or differences in the user's current scene, and so cannot provide a weather broadcast experience personalized to each user.
  • the embodiment of the present application proposes a method for generating broadcast text, which relates to the field of AI and is applicable to voice assistants.
  • the voice assistant can generate a personalized broadcast text based on the user's personal usage habits, device differences and/or the environment, and generate broadcast voice information at a corresponding speech rate through TTS, informing the user of the broadcast content and guiding the user to continue using the device.
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements. The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
  • Intelligent information chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
  • IT value chain reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of artificial intelligence, the realization of information provision and processing technology, to the systematic industrial ecological process.
  • the infrastructure 10 provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • sensors are used to communicate with the outside to obtain data streams;
  • smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs) are used to provide training, computation and execution capabilities;
  • basic platforms are used for cloud storage and cloud computing, network interconnection, etc., including the distributed computing framework and network-related platform guarantees and support.
  • the data 11 on the upper layer of the infrastructure 10 is used to represent data sources in the field of artificial intelligence.
  • the data 11 of the upper layer of the infrastructure 10 comes from the voice commands acquired on the terminal side, the equipment information of the terminal used, and the scene information obtained through sensor communication with the outside .
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • the data processing process includes front-end processing, speech recognition (ASR), semantic understanding (NLU), dialog management (DM), natural language generation (NLG), speech synthesis (TTS) and other processing.
  • some general-purpose capabilities can be formed based on the results of data processing, such as algorithms or a general-purpose system.
  • a personalized broadcast text can be generated based on the result of the data processing, and a broadcast voice at the corresponding speech rate can be generated, providing a broadcast experience personalized to each user.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they package the overall artificial intelligence solution, commercialize intelligent information decision-making, and realize practical applications. Application fields mainly include intelligent manufacturing, intelligent transportation, smart home, smart medical care, smart security, automatic driving, and smart terminals.
  • a broadcast text generation method proposed in the embodiment of the present application can be applied to voice assistants of smart devices in the fields of smart terminals, smart homes, smart security, and automatic driving.
  • These smart devices provide a voice user interface (VUI) and complete corresponding tasks or provide related services according to the voice commands input by users.
  • smart devices include smart TVs, smart speakers, robots, smart air conditioners, smart smoke alarms, smart fire extinguishers, smart vehicle terminals, mobile phones, tablets, laptops, desktop computers, and all-in-one machines.
  • FIG. 2 is a schematic diagram of the application system of the voice assistant proposed by the embodiment of the present application.
  • the data collection device 260 is used to collect information such as user information, device information, scene information and/or historical listening duration, and store the information in the database 230 .
  • the data acquisition device 260 corresponds to the sensors of the infrastructure in Figure 1, including devices such as motion sensors, displacement sensors, and infrared sensors that communicate with the smart device, and is used to collect the user's current scene information, such as exercising, in a meeting, resting or chatting.
  • the data collection device 260 also includes a camera device, GPS, and other devices that are communicatively connected with the smart device, and is used to collect scene information of the user's current location or place, such as in a vehicle, living room or bedroom.
  • the data collection device 260 also includes a timer, which is used to record the start time, end time and broadcast duration of the broadcast voice.
  • the broadcast duration is recorded in the user information as the user's historical listening duration.
  • the client device 240 corresponds to the basic platform of the infrastructure in FIG. 1, and is used for interacting with the user, obtaining the voice command sent by the user, broadcasting the broadcast content of the voice command, showing the broadcast content to the user, and storing the information in the database 230;
  • the client device 240 includes devices providing a voice user interface (VUI), such as a smartphone or a smart vehicle terminal, equipped with a display screen, a microphone, a speaker, a button, a Bluetooth earphone microphone, and the like.
  • the microphone can be a sound-pickup device, including a microphone integrated in the smart device, a microphone or microphone array connected to the smart device, or a microphone or microphone array connected to the smart device through a short-range connection technology, and is used to collect the voice commands issued by the user.
  • the training device 220 corresponds to the smart chip of the infrastructure in FIG. 1 , and trains the voice assistant 201 based on data maintained in the database 230 such as user information, device information, scene information and/or historical broadcast duration.
  • the voice assistant 201 can provide a personalized broadcast text in the voice dialogue scene between the user and the client device 240 , and generate a broadcast voice corresponding to the speech rate, inform the user of the broadcast content and guide the user to continue using the client device 240 .
  • the execution device 210 corresponds to the smart chip of the infrastructure in FIG. 1 , and is equipped with an I/O interface 212 for data interaction with the client device 240 .
  • After voice command information is input, the execution device 210 outputs the broadcast content to the client device 240 through the I/O interface 212, for example broadcasting it through the loudspeaker, or displaying it through the voice user interface (VUI) on the screen of a smartphone, smart vehicle terminal, or similar device.
  • the execution device 210 may call data, codes, etc. in the data storage system 250 , and may also store data, code instructions, etc. in the data storage system 250 .
  • the training device 220 and the executing device 210 may be the same smart chip or different smart chips.
  • the database 230 is a data collection of user information, device information and/or scene information stored on a storage medium.
  • the voice assistant 201 is an agent software for executing voice instructions or services.
  • the execution device 210 runs the voice assistant 201. After acquiring the voice instruction issued by the user, it generates a target broadcast text of personalized length according to user information, device information and/or scene information, controls the speech rate of the broadcast voice, informs the user of the broadcast content, and guides the user to continue using the device.
  • the I/O interface 212 returns the target broadcast text of personalized length generated by the voice assistant 201 to the client device 240 as output data, and the client device 240 displays the broadcast text and broadcasts it to the user at a corresponding speech speed.
  • the training device 220 acquires the training data and corpus stored in the database 230 and, based on acquired historical records such as user information, device information and/or scene information, trains the voice assistant 201 with the training target of outputting a broadcast text whose length matches the user's historical listening record, so as to output better target broadcast text.
  • the user can input voice instruction information to the execution device 210 , for example, can operate in a voice user interface (VUI) provided by the client device 240 .
  • the client device 240 can automatically input instructions to the I/O interface 212 and obtain broadcast content. If the client device 240 needs to obtain user authorization for automatically inputting instruction information, the user can set corresponding permissions in the client device 240 .
  • the user can view or listen to the broadcast content output by the execution device 210 on the client device 240 , and the specific presentation form may be specific ways such as display, wake-up sound, and broadcast.
  • the client device 240 can also serve as a voice data collection terminal and store the collected wake-up sound or voiceprint data of the user into the database 230 .
  • Figure 2 is only a schematic diagram of a system application scenario provided by the embodiment of the present application, and the positional relationship between the devices, components, modules, etc. shown in the figure does not constitute any limitation. The system shown in Figure 2 may correspond to one or more device entities.
  • the data storage system 250 is an external memory relative to the execution device 210 . In other cases, the data storage system 250 may also be placed in the execution device 210 .
  • FIG. 3 is a functional architecture diagram of the voice assistant in the embodiment of the present application.
  • the voice assistant 201 comprises a front-end processing module 31, a speech recognition module 32, a semantic understanding module 33, a dialog state module 34, a dialog strategy learning module 35, a natural language generation module 36, a speech synthesis module 37 and a dialog output module 38.
  • the front-end processing module 31 is used to process the voice command input by the user to obtain the data format required by the network model for use by the voice recognition module 32 .
  • the front-end processing module 31 obtains the voice command input by the user in a compressed audio format and decodes it into an audio signal in PCM format; it performs separation, denoising and feature extraction on the audio signal using voiceprint or other features, and obtains mel-frequency cepstral coefficient (MFCC) or filter-bank audio feature vectors through audio processing algorithms such as framing, windowing, and the short-time Fourier transform.
  • the front-end processing module 31 is generally disposed on the terminal side.
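The framing, windowing, short-time Fourier transform, and mel filter-bank steps above can be sketched numerically as follows. Frame sizes, hop length, FFT size and the mel-bank parameters are common defaults for 16 kHz speech, not values specified by the patent, and the helper names are assumptions.

```python
import numpy as np


def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D PCM signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])


def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filter_bank(n_filters=26, n_fft=512, sr=16000):
    """Triangular mel filters mapping an FFT power spectrum to mel bands."""
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        fb[i - 1, bins[i - 1]:bins[i]] = np.linspace(
            0, 1, bins[i] - bins[i - 1], endpoint=False)
        fb[i - 1, bins[i]:bins[i + 1]] = np.linspace(
            1, 0, bins[i + 1] - bins[i], endpoint=False)
    return fb


def filterbank_features(signal, n_fft=512):
    """Log mel filter-bank features: framing, windowing, STFT, mel bank, log."""
    frames = frame_signal(signal) * np.hamming(400)           # framing + windowing
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft   # short-time spectrum
    feats = power @ mel_filter_bank(n_fft=n_fft).T            # mel filter bank
    return np.log(feats + 1e-10)
```

Applying a discrete cosine transform to these log filter-bank features would yield the MFCCs mentioned in the passage.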
  • the speech recognition (automatic speech recognition, ASR) module 32 is used to obtain the audio feature vectors produced by the front-end processing module 31 and convert them into text through an acoustic model and a language model for the semantic understanding module 33 to understand.
  • the acoustic model is used to classify the acoustic features and correspond to (decode) phonemes or words
  • the language model is used to decode phonemes or words into a complete sentence.
  • the acoustic model and the language model process the audio feature vectors in series: the audio feature vectors are converted into phonemes or words through the acoustic model, the phonemes or words are then converted into a text sequence through the language model, and the text corresponding to the user's speech is output.
  • the ASR module 32 can be implemented in an end-to-end manner, wherein the acoustic model and the language model adopt a neural network structure, and the acoustic model and the language model are jointly trained so that the result of the training is to output Chinese characters corresponding to the user's voice sequence.
  • the acoustic model may be modeled using a Hidden Markov Model (HMM), and the language model may be an n-gram model.
  • the semantic understanding (natural language understanding, NLU) module 33 is used to convert the text or Chinese character sequence corresponding to the user's voice into structured information, wherein the structured information includes machine-executable intention information and recognizable slot information. Its purpose is to obtain the semantic representation of natural language through the analysis of syntax, semantics and pragmatics.
  • the intent information refers to the task that needs to be performed by the voice command issued by the user;
  • the slot information refers to the parameter information that needs to be determined to perform the task.
  • the user asks the voice assistant 201 "What's the temperature in Nanjing today?"
  • the NLU module 33 understands the text corresponding to the voice command, and obtains the intent of the voice command as "check the weather", with the slots "location: Nanjing" and "time: today".
  • the NLU module 33 can use a classifier to classify the text corresponding to the voice instruction into the intent information that the voice assistant 201 can support, and then use the sequence labeling model to label the slot information in the text.
  • the classifier can be a model usable for classification in traditional machine learning algorithms, for example, a naive Bayes (NB) model, a random forest (RF) model, an SVM classification model, a KNN classification model, etc.; it can also be a deep learning text classification model, for example, the FastText model, TextCNN, etc.
  • the sequence labeling model is used to label each element in the text or Chinese character sequence and output a tag sequence, which can be used to indicate the beginning, end and type of each slot.
  • the sequence labeling model can be one of the following models: linear model, hidden Markov model, maximum entropy Markov model, conditional random field, etc.
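Decoding slots from such a tag sequence can be sketched as follows; the BIO tag scheme and the example tokens are illustrative assumptions (the embodiment does not fix a tag format):

```python
def decode_slots(tokens, tags):
    """Recover slot values from a BIO tag sequence produced by a sequence labeling model."""
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):                            # beginning of a slot
            current = tag[2:]
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:   # continuation of the slot
            slots[current] += token
        else:                                               # "O" closes any open slot
            current = None
    return slots

# "What's the temperature in Nanjing today?"
tokens = ["What's", "the", "temperature", "in", "Nan", "jing", "today"]
tags   = ["O", "O", "O", "O", "B-location", "I-location", "B-time"]
print(decode_slots(tokens, tags))  # {'location': 'Nanjing', 'time': 'today'}
```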
  • the NLU module 33 may also use an end-to-end model to simultaneously output intent information and slot information.
  • the dialog state tracking (dialog state tracking, DST) module 34 is used to manage the dialog state of the voice assistant 201.
  • the DST module 34 uses the intent information and slot information of the current round of dialogue output by the NLU module 33 to maintain the current round of dialogue intention, filled slots and dialogue status in the multi-round dialogue scene.
  • the input of the DST module 34 is the last round of dialogue state, the broadcast content returned by the last round of third-party applications, and the intent information and slot information of the current round of dialogue, and the output is the current round of dialogue state.
  • the DST module 34 records the dialog history and dialog state of the voice assistant 201, and helps the voice assistant 201 understand the user's voice instruction in the current round of dialogue in combination with the dialog history recorded by the context manager (that is, the database 230 in FIG. 2), so as to give appropriate feedback.
  • the NLU module 33 outputs the intent of the current round of dialogue as "check the weather", with the slots "location: there" and "time:". Because the DST module 34 has recorded the first round of dialogue state, the system determines, in combination with the dialog history recorded by the context manager, that "there" in the slot "location: there" refers to "Nanjing", and fills "Nanjing" into the location slot.
  • the DST module 34 outputs the dialogue state information of the current round, including intent information (check the weather), filled slots (Nanjing) and unfilled slots (time:).
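The slot-carrying behavior described above can be sketched as a simple state-merge function; the dictionary shapes here are illustrative assumptions, not the module's actual data structures:

```python
def update_dialog_state(prev_state, nlu_output):
    """Merge the current turn's intent/slots into the tracked dialog state;
    slots left empty this turn keep their value from earlier turns."""
    state = {
        "intent": nlu_output["intent"] or prev_state.get("intent"),
        "slots": dict(prev_state.get("slots", {})),
    }
    for name, value in nlu_output["slots"].items():
        if value:                                  # only overwrite with concrete values
            state["slots"][name] = value
        state["slots"].setdefault(name, None)      # record unfilled slots too
    return state

prev = {"intent": "check_weather", "slots": {"location": "Nanjing"}}
turn = {"intent": "check_weather", "slots": {"location": "", "time": "today"}}
print(update_dialog_state(prev, turn))
```

Here the empty "location" slot of the current turn is filled with "Nanjing" carried over from the previous round, matching the "there" → "Nanjing" example above.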
  • the dialog policy learning (dialog policy learning, DPL) module 35 is used to determine the next action performed by the voice assistant 201, including asking the user, executing the user's instruction, recommending other user instructions, and generating a reply.
  • the DPL module 35 uses the dialog state information output by the DST module 34 to determine the next execution action.
  • the DPL module 35 may determine, according to the state of the current round of dialogue, that the next action to perform is to generate broadcast content asking the user.
  • For example, when the dialogue state information of the current round output by the DST module 34 contains an unfilled slot (time:), the DPL module 35 can determine that the next action is to ask the user "Which day?", maintaining the control logic of the dialogue system and ensuring that the dialogue can continue.
  • the execution action information is an action tag or structured information, such as "REQUEST-SLOT: date", indicating that the date should be queried from the user next.
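A minimal rule-based sketch of such a policy decision (the action-tag strings follow the "REQUEST-SLOT" example above; the overall rule is an illustrative assumption):

```python
def next_action(state):
    """Rule-based policy: ask for any unfilled slot, otherwise execute the intent."""
    for name, value in state["slots"].items():
        if value is None:
            return {"action": "REQUEST-SLOT", "slot": name}
    return {"action": "EXECUTE", "intent": state["intent"], "slots": state["slots"]}

state = {"intent": "check_weather", "slots": {"location": "Nanjing", "time": None}}
print(next_action(state))  # {'action': 'REQUEST-SLOT', 'slot': 'time'}
```

A learned DPL module would replace this rule with a trained model, but the output contract (an action tag plus parameters) is the same.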
  • the DPL module 35 can also determine, according to the current round of dialogue state, that the next action is to select an appropriate third-party application (app) to execute the voice command, send the intent and slot information to the selected third-party application, and obtain the execution result returned by the third-party application, where the execution result is the broadcast content corresponding to the voice command.
  • a third-party application is an application that can execute or fulfill the intent of the voice command according to the slot information and return the broadcast content, such as an app that can query the weather, an app that can provide product information, an app that can provide navigation or positioning information, etc.
  • the broadcast content determined by the DPL module 35 according to the current round of dialogue state, or returned by the third-party application (app) or server after executing the voice command according to the intent and slot information, can be used as an input parameter of the next round of dialogue state of the DST module 34, and can also be used as an input parameter of the NLG module 36.
  • the natural language generation (NLG) module 36 is a translator that converts structured information into natural language expressions, and is currently widely used in voice assistants.
  • the NLG module 36 is used to obtain the current dialogue state maintained by the DST module 34, the next action determined by the DPL module 35, and/or the broadcast content returned by the third-party application (app), and, combined with user information, device information and/or scene information, generate a target broadcast text of personalized length.
  • For example, the current dialog state maintained by the DST module 34 is intent information (check the weather), a filled slot (Nanjing) and an unfilled slot (time:), the next action determined by the DPL module 35 is to ask the user, and the broadcast text generated by the NLG module 36 is "Which day do you need to inquire about?"
  • the NLG module 36 inputs the current dialogue state and the broadcast content returned by the third-party application into a template matching the current intent, device or scene, and outputs a target broadcast text of the length configured by the template.
  • the NLG module 36 can also use a model-based (black-box) approach to output a target broadcast text of personalized length.
  • the user portrait (user profile, UP) module 213 is used to obtain user information by querying the data in the database 230 shown in FIG. 2, and to record in the user information information such as the user's historical listening duration of voice assistant broadcasts.
  • user information, also known as a user portrait, describes the user's usage habits by collecting data in dimensions such as the user's social attributes, consumption habits, preference characteristics, and system usage behavior, and analyzes these characteristics statistically to tap potential value information, so as to abstract a whole picture of the user, which can be used to recommend personalized content to the user or to provide services in line with the user's habits.
  • a device profile (device profile, DP) module 214 is used to obtain device information of the client device 240 shown in FIG. 2.
  • the scene perception (context awareness, CA) module 215 is used to obtain the current scene information through the data acquisition device 260 shown in FIG. 2, where the scene information includes the room category, background noise level, the user's current motion state, etc.
  • the CA module 215 , the DP module 214 , and the UP module 213 may also be external modules relative to the voice assistant 201 , which are not specifically limited here.
  • the voice assistant understands the user's voice command through the natural language understanding (NLU) module 33 and sends it to the corresponding third-party application (app) for execution; it can obtain the structured broadcast content returned by the third-party application, and uses the NLG module 36 to convert the returned structured broadcast content into a broadcast text for the TTS module to generate a broadcast voice informing the user of the broadcast content.
  • the speech synthesis (Text-to-Speech, TTS) module 37 is used to control the broadcast speed of the target broadcast text according to the broadcast length parameter, and the broadcast length parameter indicates historical listening duration information.
  • when the TTS module 37 converts the target broadcast text into broadcast voice, it introduces the broadcast length parameter and, combined with user information, device information and/or scene information, controls the speech rate of the broadcast, thereby limiting the broadcast duration of the target broadcast text; while ensuring the accuracy of speech generation, it also controls characteristics of the generated speech such as speech rate, timbre and volume.
  • the dialog output module 38 is configured to generate a corresponding broadcast card according to the target broadcast voice, and then present it to the user.
  • the embodiment of the present application proposes a method for generating broadcast text. The method is applied to a voice assistant: by receiving the user's voice command, the broadcast content corresponding to the voice command is obtained, and the target broadcast text is generated according to the broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
  • FIG. 4 is a flow chart of a method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in Figure 4, the voice assistant performs the following steps S401-S404.
  • the voice assistant 201 receives a user's voice instruction.
  • the voice assistant 201 performs front-end processing on the voice command "What's the temperature in Nanjing today?" to obtain an audio feature vector; recognizes the audio feature vector as text through an acoustic model and a language model; understands the text, obtaining the intent corresponding to the voice command as "check the weather" and the slots as "location: Nanjing" and "time: today"; and manages the dialog state, obtaining the current dialog state, including intent information, filled slots, and unfilled slots, according to the last round of dialog state, the content of the last round of broadcast, and the intent information and slot information corresponding to the current voice command, so as to determine whether the voice command can be executed.
  • the voice assistant 201 can determine, according to the current dialog state, that the voice command is executable and select the third-party application that executes the intent information; send the intent information and slot information corresponding to the voice command to the third-party application; and obtain the execution result returned by the third-party application (app) or server, where the execution result is the broadcast content corresponding to the current voice command.
  • For example, the user sends the voice command "What's the temperature in Nanjing today?" to the voice assistant 201; the voice assistant 201 selects an appropriate third-party application (app) to execute the voice command in combination with the intent information and slot information related to the user request, and the execution result related to the user request returned by the third-party application (app) is the structured broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"}.
  • the voice assistant 201 may generate the broadcast content according to the dialogue state.
  • the voice assistant 201 acquires the next action information determined by the DPL module 35, where the action information is an action tag or structured information, and determines the broadcast content to be "REQUEST-SLOT: date", indicating that the time should be asked of the user next.
  • the NLG module 36 can be a generative model: the broadcast content is used as the input of the generative model, the user's broadcast length parameter is used as an additional parameter, and the length of the output broadcast text is implicitly limited by the training data, so as to generate the target broadcast text, namely a broadcast text whose duration matches the broadcast length parameter.
  • the length or length range of the text generated by the generative model can be limited by the input broadcast length parameter: the broadcast content and the broadcast length parameter are used as the input of the model, and a target broadcast text of the limited length is output.
  • the NLG module 36 can be a retrieval model, which takes the broadcast content as input to the retrieval model, retrieves the corresponding template according to the broadcast content, and generates the target broadcast text through the retrieved template.
  • the broadcast content and the user's broadcast length parameter are used as the input of the retrieval model; the template corresponding to the broadcast content is retrieved in the predefined template library according to the length defined by the broadcast length parameter, and the retrieved template outputs the target broadcast text.
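A minimal sketch of such length-aware template retrieval; the template library, length-class threshold and placeholder names are illustrative assumptions, not the embodiment's actual templates:

```python
TEMPLATES = {
    ("check_weather", "concise"):  "{location}, {low} to {high} degrees {unit}.",
    ("check_weather", "moderate"): "{location} today: a low of {low} and a high of "
                                   "{high} degrees {unit}.",
}

def retrieve_template(intent, length_param):
    """Pick the template whose length class matches the broadcast length parameter."""
    length_class = "concise" if length_param <= 10 else "moderate"
    return TEMPLATES[(intent, length_class)]

content = {"temperature": "15-23", "unit": "C", "location": "Nanjing"}
low, high = content["temperature"].split("-")
text = retrieve_template("check_weather", 20).format(
    location=content["location"], low=low, high=high, unit="Celsius")
print(text)  # Nanjing today: a low of 15 and a high of 23 degrees Celsius.
```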
  • the broadcast length parameter may be determined according to an average value or a weighted average value of at least one piece of historical listening duration information.
  • the user portrait (UP) module 213 obtains the user information, obtains the historical listening duration information of each time the user listens to a voice broadcast, and obtains the broadcast length parameter from the statistical average or weighted average of the historical listening duration information; the minimum/maximum value or the latest value of the historical listening duration can also be used as the broadcast length parameter.
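The average and weighted-average options above can be sketched as follows (the weight vector favoring recent sessions is an illustrative assumption):

```python
def broadcast_length_param(history, weights=None):
    """Derive the broadcast length parameter from historical listening durations
    (seconds): plain average, or a weighted average, e.g. favouring recent sessions."""
    if weights is None:
        return sum(history) / len(history)
    return sum(w * h for w, h in zip(weights, history)) / sum(weights)

history = [4.0, 5.0, 6.0]                          # three recorded listening durations
print(broadcast_length_param(history))             # 5.0
print(broadcast_length_param(history, [1, 2, 3]))  # recent sessions weigh more
```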
  • the NLG module 36, according to the returned broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"} and the broadcast length parameter 20, generates a target broadcast text of about 20 characters: "Nanjing is sunny today, with a minimum of 15 degrees Celsius and a maximum of 23 degrees Celsius."
  • the value of the historical listening time may be an initial value.
  • the value can be a precise numerical record, such as "5 seconds" or "20 words", or an identifier mapped to a certain duration range, such as "medium" or "concise"; the initial value can also be configured by other means, and the embodiment of the present application does not limit the initial value of the historical listening duration.
  • each time the user listens to a voice broadcast, the voice assistant 201 records the listening duration of the broadcast and collects the duration information of each listened broadcast in the user portrait, obtaining multiple pieces of historical listening duration information.
  • the recording of the listening duration can be timed from the moment when the broadcast starts, and ends when one of the following situations occurs: the broadcast is completed, the broadcast is interrupted, or the program is closed or switched to another program.
  • the listening duration is the time interval from the start of the timer to the end of the timer.
  • the TTS module 37 controls the speech rate of the broadcast voice, using the broadcast length parameter as the speech rate limit of the broadcast voice, and converts the broadcast text into a broadcast voice that conforms to the current user's historical listening habits.
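A minimal sketch of choosing the speech rate from the broadcast length parameter; the characters-per-second units and the clamping range keeping the voice intelligible are illustrative assumptions:

```python
def speech_rate(text, target_seconds, min_rate=2.0, max_rate=8.0):
    """Characters per second needed to finish `text` within the target listening
    duration, clamped so the synthesized voice stays natural."""
    rate = len(text) / target_seconds
    return min(max(rate, min_rate), max_rate)

print(speech_rate("x" * 20, 5.0))   # 4.0 chars/s for a 20-char text in 5 s
print(speech_rate("x" * 100, 5.0))  # clamped at the 8.0 chars/s ceiling
```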
  • FIG. 5 is a schematic diagram of the application of the method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in Figure 5, user A wakes up the voice assistant 201 and asks "how is the temperature in Nanjing today?"
  • the front-end processing module 31 performs audio decoding on the voice command "What's the temperature in Nanjing today?" input by user A, and decodes it into an audio signal in PCM format; it separates, denoises and performs feature extraction on the audio signal using voiceprint or other features, and obtains audio feature vectors through audio processing algorithms such as framing, windowing, and short-time Fourier transform.
  • the ASR module 32 converts audio feature vectors into text through an acoustic model and a language model. Specifically, the acoustic features in the audio feature vector are converted into phonemes or words through the acoustic model, and then the phonemes or words are converted into text sequences through the language model, and the text corresponding to the voice command of user A is output.
  • the NLU module 33 understands the text, and obtains that the user's intent is "check the weather" and the slot is "location: Nanjing".
  • the DST module 34 uses the intent "check the weather" of the current round of dialogue output by the NLU module 33 and the slot "location: Nanjing", and outputs the dialogue state information of the current round, comprising intent information (check the weather), a filled slot (Nanjing) and (time: today).
  • the DPL module 35 uses the dialogue state information output by the DST module 34 to determine that the next action is to execute the instruction; using the slot information as parameters, the DPL module 35 selects a suitable third-party service or application (app) according to the intent information to execute the user's voice command, and sends "check the weather" to the corresponding third-party application (service provider W).
  • the NLG module 36 acquires the returned broadcast content as the structured information {"temperature": "15-23", "unit": "C", "location": "Nanjing"}.
  • the historical listening duration t_A = 5 s of user A is acquired through the UP module 213, and the character length of the generated broadcast text is determined to be 20 after conversion through the mapping table, so the broadcast length parameter is 20.
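The mapping table is not given in this excerpt; as a hedged sketch, a constant broadcast speed of about 4 characters per second (an assumption consistent with the 5 s → 20 characters example above) reproduces the conversion:

```python
# Hypothetical mapping: ~4 characters per second of broadcast speech.
CHARS_PER_SECOND = 4

def duration_to_char_length(listening_seconds):
    """Map a historical listening duration to a broadcast length parameter."""
    return round(listening_seconds * CHARS_PER_SECOND)

print(duration_to_char_length(5))    # user A: 5 s  -> 20-character broadcast text
print(duration_to_char_length(2.5))  # user B: 2.5 s -> 10-character broadcast text
```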
  • the NLG module 36 generates, according to the broadcast content returned above and the broadcast length parameter, a target broadcast text of about 20 characters: "Nanjing is sunny today, the lowest is 15 degrees Celsius, and the highest is 23 degrees Celsius".
  • the voice assistant 201 sends the user's listening time to the UP module 213, and the UP module 213 records the user A's listening time to the broadcast.
  • the voice command input by user B is the same as that of user A, and the process of DPL module 35 obtaining the returned result is the same as that of user A.
  • the character length of the generated broadcast text is about 10 characters, and the broadcast length parameter is 10; the target broadcast text generated by the NLG module 36 is "Sunny, 15 to 23 degrees Celsius", and the TTS module 37 generates a broadcast voice with a duration of 1.5-2.5s.
  • the method for generating the broadcast text proposed in the embodiment of the present application can generate broadcast texts of different lengths for the same voice command, so that the voice assistant can generate personalized broadcast texts according to the user's usage habits, and then perform personalized broadcasts according to the personalized broadcast texts.
  • the method for generating the broadcast text proposed in the embodiment of the present application introduces user information in the generation stage of the broadcast text and broadcast voice, and controls the level of detail of the target broadcast text according to the user's historical listening duration recorded in the user information, providing each user with a personalized interactive experience.
  • a method for generating broadcast text proposed in the embodiment of the present application, on the basis of Embodiment 1, imports user information, device information and/or current scene information; the user's voice instruction is combined with the user's historical listening duration, device information and/or current scene information to generate a broadcast text whose length matches the user's historical listening habits, broadcast at a corresponding speech rate to provide a personalized broadcast experience.
  • user information includes the user's historical listening time; device information includes configuration information such as display resolution, size, and broadcast device type of the broadcast device; scene information includes information such as room type, background noise level, and the user's current exercise status.
  • the voice assistant obtains the device information of the broadcasting device in use through the DP module 214, and obtains the current scene information through the CA module 215; the UP module 213 uses the device information and scene information as indexes to search the database 230 to obtain the finest-grained historical listening duration information that meets the threshold requirements, as shown in the list in Table 2.
  • the historical listening duration of the user is divided into three levels and calculated according to the device information and the current scene, and the broadcast text is broadcast at a corresponding speaking speed.
  • the broadcast length parameters are calculated according to the three-level listening duration.
  • In Table 2 there are mainly the following available broadcast length parameters: the overall listening duration t_total, the mobile phone listening duration t_d1, the TV listening duration t_d2, the in-vehicle listening duration t_e1, the living-room listening duration t_e2, and the listening duration t_d1e1 of the mobile phone in the vehicle. According to the data in Table 2, it can be obtained:
  • where average() is the mean function, and the index values in brackets denote: d1 the mobile phone, d2 the TV, e1 the vehicle, e2 the living room, and d1e1 the mobile phone in the vehicle.
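The three levels of averages (overall, per device, per scene, and per device-scene pair) can be sketched from raw listening records; the record tuples and example values here are illustrative assumptions standing in for the data of Table 2:

```python
def level_averages(records):
    """Compute listening statistics from (device, scene, duration) records:
    overall t_total, per-device t_d, per-scene t_e, per-pair t_de."""
    def avg(values):
        return sum(values) / len(values)
    t_total = avg([r[2] for r in records])
    t_d = {d: avg([r[2] for r in records if r[0] == d]) for d, _, _ in records}
    t_e = {e: avg([r[2] for r in records if r[1] == e]) for _, e, _ in records}
    t_de = {(d, e): avg([r[2] for r in records if (r[0], r[1]) == (d, e)])
            for d, e, _ in records}
    return t_total, t_d, t_e, t_de

records = [("phone", "vehicle", 4.0), ("phone", "living_room", 6.0),
           ("tv", "living_room", 8.0)]
t_total, t_d, t_e, t_de = level_averages(records)
print(t_total)                      # 6.0
print(t_d["phone"])                 # 5.0
print(t_de[("phone", "vehicle")])   # 4.0
```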
  • multiple pieces of historical listening duration information may be determined according to device information or scene information; and the broadcast length parameter may be determined according to the average or weighted average of multiple pieces of historical listening duration information.
  • the broadcast length parameter determined according to the device information is recorded as the first broadcast length parameter; the broadcast length parameter determined according to the scene information is recorded as the second broadcast length parameter.
  • the voice assistant uses the broadcast length parameter obtained through the first-level calculation.
  • the calculation of the first level is to calculate the overall listening time t_total.
  • the overall listening duration t_total is consistent with the user's historical listening duration defined in Embodiment 1, which is the average or weighted average of multiple pieces of historical listening duration information, and the overall listening duration t_total is used as the broadcast length parameter.
  • the voice assistant can determine the user's overall listening duration t_total based on the statistical average or weighted average of the user's listening history, and the overall listening duration t_total is used as the broadcast length parameter.
  • the calculation of the second level is to count the multiple pieces of historical listening duration information on the corresponding device according to the listening duration t_d under the device information, or count the multiple historical listening duration information under the corresponding scene according to the listening duration t_e under the scene information.
  • the device corresponding to the device information in Table 2 may be a smart terminal such as a mobile phone or a TV; the scene corresponding to the scene information may be a place such as a vehicle, a bedroom or a living room, and a state of motion such as exercising or resting.
  • the voice assistant can obtain user A's broadcast length parameter under the mobile terminal according to the statistical average or weighted average of each piece of historical listening duration information recorded on the mobile terminal.
  • the voice assistant logged in by user B can, according to the statistical average or weighted average of each recorded historical duration of listening to broadcasts in the living room, obtain the user's broadcast length parameter in the same scene across different smart terminals.
  • At least one piece of historical listening duration information can be determined according to device information and scene information; and the broadcast length parameter can be determined according to the average or weighted average of multiple pieces of historical listening duration information.
  • the historical listening duration information obtained through the third-level calculation is used.
  • the broadcast length parameter determined according to the combination of device information and scene information is recorded as the third broadcast length parameter.
  • the calculation of the third level is to count the historical listening duration of the user of the current device d in the current scene e according to the listening duration t_de of the device scene.
  • the voice assistant logged in by user C can, according to the statistical average or weighted average of each recorded piece of historical listening duration information for listening to the weather broadcast in the vehicle, obtain the broadcast length parameter for user C listening to broadcasts in the vehicle through the mobile terminal.
  • after the voice assistant completes a broadcast text listening event, it sends the user's listening duration to the UP module 213, and the UP module 213 records the listening duration and time at the corresponding level in the three-level historical listening duration information list shown in Table 2.
  • Embodiment 2 of the present application is aimed at users with different historical listening durations on different devices and in different scenarios; the voice assistant 201 can generate target broadcast texts of different lengths to provide users with a more refined personalized interactive experience.
  • Embodiment 2 of the present application conducts refined statistics on the user's historical listening time according to the type of device and the scene in which it is located, so as to provide a personalized broadcast voice interaction experience that is more suitable for the user's usage scene.
  • during the broadcast text generation process, the dialog system of the voice assistant 201 can, in combination with the user's historical listening duration information, device-related parameters and/or current scene information, provide the user with a broadcast voice whose length and speech rate conform to the current user's listening history and suit the user's device information and scene information, thereby improving the naturalness of voice interaction and greatly improving the user experience.
  • a method for generating a broadcast text proposed in the embodiment of the present application can, on the basis of Embodiments 1 and 2, obtain the broadcast length parameter through a machine learning model. The machine learning model can be implemented based on a random forest and trained on the user's historical listening duration, screen size, screen resolution and/or the noise level of the environment in which the user listens to broadcasts, and the room type; these features are input into the machine learning model, which outputs the broadcast length parameter. The target broadcast text is generated according to the broadcast length parameter and the broadcast content, and broadcast at the corresponding speech rate, providing a personalized broadcast experience.
  • FIG. 6 is a schematic structural diagram of a random forest-based machine learning model of a broadcast text generation method proposed in Embodiment 3 of the present application. As shown in Figure 6, x in the figure is the input feature of the machine learning model, and the broadcast length parameter y is output.
  • the input feature x includes data such as user information, device information, and/or scene information; wherein, the user information includes the user's historical listening time; the device information includes the screen size and screen resolution of the current broadcasting device; the scene information is related to The data includes the level of ambient noise, the type of room it is in, and so on.
  • the broadcast length parameter y includes classification results such as "concise” or “moderate”, or a predicted length limit value L of the broadcast text.
  • the machine learning model may be a classification model: feature data such as user information, device information, and/or scene information are input, and the output broadcast length parameter y is the length classification result of the target broadcast text, denoted as the fourth broadcast length parameter, e.g. concise, moderate, or detailed.
  • the classification model can be trained using a standard random forest classifier.
  • the machine learning model can be a regression model, which inputs feature data such as user information, device information, and/or scene information, and outputs the broadcast length parameter y, which is the length limit value L of the target broadcast text, denoted as the fifth broadcast length parameter; the regression model can be trained using a standard random forest regressor.
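As a sketch of the regression variant, the following uses scikit-learn's RandomForestRegressor; the feature layout (duration, screen size, resolution, noise level, room type id), the training values and the labels are all illustrative assumptions, not data from the embodiment:

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical feature vectors: [historical listening duration (s), screen size
# (inches), screen resolution (megapixels), ambient noise (dB), room type id].
X = [[5.0, 6.1, 2.1, 40, 0],
     [2.5, 6.1, 2.1, 70, 1],
     [8.0, 55.0, 8.3, 35, 2],
     [3.0, 10.0, 4.1, 60, 1]]
y = [20, 10, 32, 12]   # expected broadcast length parameter (characters)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred = model.predict([[5.0, 6.1, 2.1, 45, 0]])[0]
print(round(pred))     # a length limit between the training labels
```

The classification variant would swap in RandomForestClassifier with class labels such as "concise" / "moderate" / "detailed".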
  • each initial model of the above machine learning models is obtained through offline training; the model then continuously collects the user's historical listening duration under conditions of a specific screen size, screen resolution, and/or environmental noise level and room category, and keeps learning to provide broadcast length parameters adapted to the user's historical listening habits.
  • the training data of the machine learning model includes the user's historical listening time and/or device information, such as the screen size and/or screen resolution of the current broadcast device, and scene information, such as the level of ambient noise and/or the type of room in which it is located, etc.
  • the label of each piece of training data is the expected broadcast length parameter.
  • Each piece of training data can be obtained through the steps corresponding to Embodiment 1 and Embodiment 2 above, or collected from the network environment in combination with user feedback, which is not limited here.
  • the NLG module 36 uses the broadcast length parameter output by the above machine learning model to control the length of the generated target broadcast text.
  • the TTS module 37 uses the broadcast length parameter output by the machine learning model to control the speech rate of the broadcast voice and broadcasts at the corresponding rate.
  • the method for generating broadcast text proposed in this embodiment of the present application introduces a machine learning model that obtains the broadcast length parameter from the user's historical listening duration, device information, and/or scene information, limits the length of the broadcast text and broadcast voice according to that parameter, and, through an online learning mechanism, keeps the machine learning model learning continuously and updates the personalized broadcast length parameters to match the user.
  • the personalized experience provided by voice assistant 201 using the broadcast text generation method of Embodiment 3 of the present application becomes more accurate with use.
  • the broadcast text generation method proposed in Embodiment 3 of the present application learns the mapping from the user's historical listening duration to the expected broadcast text length and broadcast voice duration through a machine learning model, and provides a more accurate personalized experience through online learning.
  • by contrast, Embodiment 1 uses rule-based mapping.
  • a method for generating broadcast text proposed in an embodiment of this application can use pre-trained language models, such as the BERT language model or the GPT-2 language model, integrate the broadcast length parameters into the controllable NLG module 36/TTS module 37, and generate broadcast text or voice end-to-end.
  • FIG. 7 is a schematic structural diagram of a typical pre-trained language model in the broadcast text generation method proposed in Embodiment 4 of the present application.
  • this module uses linear encoders (linear) to encode the different types of user information, device information, and/or scene information, and obtains the representation vector of the broadcast length parameter through the fusion module (fusion), denoted as the sixth broadcast length parameter;
  • the sixth broadcast length parameter, together with the broadcast content of the current user's voice command output by the DST module 34 and the current-round dialogue state output by the DPL module 35, is input into the GPT-2 language model, which generates a target broadcast text whose length matches the user's listening history.
  • the NLG module 36 first pre-trains the GPT-2 language model on unlabeled text data to obtain language feature information. It then fine-tunes on broadcast content information (the broadcast content, dialogue state, and corresponding user information, device information, and/or scene information) together with broadcast results that received positive user feedback, learning the encoder parameters for each input and adjusting the output-layer parameters of the pre-trained GPT-2 model, so that the model adapts to the task of generating target broadcast texts whose length matches the user's listening history.
  • the method for generating broadcast text proposed in the embodiments of the present application introduces not only user information but also device information and/or scene information when generating the broadcast text, producing broadcast texts of different lengths. It collects, via the user information, the historical durations for which users listen to broadcast texts, and stores those listening durations together with the environment and/or the device used.
  • guiding broadcast text generation with these parameters can produce target broadcast texts that match user habits, adapt to device information and/or usage scenarios, improve interaction experience and efficiency, and provide a personalized voice assistant 201 that better understands the user.
  • for broadcast text or voice that the voice assistant 201 produces proactively, such as welcome words or the text or voice generated when the system is turned on or off, the methods of the above embodiments of the present application can likewise be used to generate broadcast text or voice matching the user's personalized usage records, device information, and/or scene information.
  • the embodiment of this application proposes a method for broadcasting text that can generate broadcast voice according to a user request, introduce user information at the broadcast voice generation stage, and control the speech rate of the target broadcast voice according to the user's historical listening duration recorded in the user information, so as to provide each user with a personalized interactive experience with the voice assistant.
  • the embodiment of the present application proposes a method for broadcasting text, including: receiving a user's voice command; generating a target broadcast text corresponding to the voice command; controlling the broadcast speed of the target broadcast text according to the broadcast length parameter, and the broadcast length parameter indicates the historical listening duration information.
  • the voice assistant can determine the broadcast length parameter based on the average or weighted average of multiple pieces of historical listening duration information. For details, reference may be made to the implementation manner related to determining the broadcast length parameter in Embodiment 1, which will not be repeated here.
  • the broadcast length parameter is associated with device information
  • the first broadcast length parameter can be determined according to the device information
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter; the first broadcast length parameter indicates the first historical listening duration information associated with the device information.
  • the broadcast length parameter is associated with the scene information
  • the second broadcast length parameter can be determined according to the scene information
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter; the second broadcast length parameter indicates the second historical listening duration information associated with the scene information.
  • the broadcast length parameter is associated with device information and scene information
  • the third broadcast length parameter can be determined according to the device information and scene information
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter; the third broadcast length parameter indicates the third historical listening duration information associated with the device information and scene information. For details, refer to the implementation related to the third broadcast length parameter in Embodiment 2, which is not repeated here.
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting the historical listening duration information, device information, and/or scene information into the classification model; outputting the fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
  • controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting the historical listening duration information, device information, and/or scene information into the regression model; outputting the fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
  • an embodiment of the present application provides an electronic device, including: at least one memory for storing programs; and at least one processor for executing the programs stored in the memory; when the programs stored in the memory are executed, the processor is configured to perform the method of any one of the above embodiments.
  • an embodiment of the present application provides a storage medium storing an instruction; when the instruction is run on a terminal, the terminal is caused to execute the method of any one of the foregoing embodiments.
  • the broadcast text listening duration defined in the embodiments of the present application may also be converted into an equivalent index, such as the time the user spends viewing the broadcast text in a plain-text generation scenario.
  • computer-readable media may include, but are not limited to: magnetic storage devices (e.g., hard disks, floppy disks, or tapes, etc.), optical disks (e.g., compact discs (compact discs, CDs), digital versatile discs (digital versatile discs, DVDs), etc.), smart cards and flash memory devices (for example, erasable programmable read-only memory (EPROM), card, stick or key drive, etc.).
  • various storage media described herein can represent one or more devices and/or other machine-readable media for storing information.
  • the term "machine-readable medium” may include, but is not limited to, wireless channels and various other media capable of storing, including and/or carrying instructions and/or data.
  • the sequence numbers of the above processes do not imply an order of execution; the order of execution of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the storage medium includes several instructions to enable a computer device (which may be a personal computer, a server, or an access network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the embodiments of the present application.
  • the aforementioned storage media include: USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical discs, and other media that can store program code.

Abstract

The present application relates to the field of artificial intelligence (AI), and provides a broadcast text generation method applied to a speech assistant. The method comprises: receiving a speech instruction of a user; acquiring broadcast content corresponding to the speech instruction; and generating target broadcast text according to a broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening duration information. In the present application, broadcast text is differentiated based on the historical duration for which a user listens to broadcast text, in combination with the scenario the user is in and the device used; in a specific scenario, the historical listening duration guides the generation of broadcast text, and the broadcast speed of the target broadcast text is further controlled according to the broadcast length parameter, so as to obtain target broadcast speech that matches the user's historical usage habits and adapts to the device information and usage scenario, thereby improving interaction experience and efficiency and providing a personalized speech assistant that better understands the user.

Description

A method, apparatus, and electronic device for generating broadcast text
This application claims priority to Chinese patent application No. 202110741280.1, filed with the China Patent Office on June 30, 2021 and entitled "A Method, Apparatus, and Electronic Device for Generating Broadcast Text", and to international application No. PCT/CN2022/084068, filed on March 30, 2022. The entire contents of both applications are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence (AI), and in particular to a method, apparatus, and electronic device for generating broadcast text.
Background
A voice assistant or virtual assistant is agent software that can perform tasks or services on behalf of an individual, and is widely used in devices such as smartphones, smart speakers, and smart in-vehicle terminals (electronic control units, ECUs). A voice assistant or virtual assistant provides a voice user interface (VUI) and completes corresponding tasks or provides related services according to the user's voice command input. After executing a voice command issued by the user, the voice assistant generates broadcast text and produces the corresponding broadcast voice through a text-to-speech (TTS) module, informing the user of the broadcast content and guiding the user to continue using the device.
Current voice assistants generally broadcast text in a fixed manner; when interacting with different users, the broadcast voice/broadcast text does not differ. How to provide users with broadcasts that match their personal usage habits and improve the naturalness of interaction is an urgent problem to be solved.
Summary of the Invention
To solve the above problems, the embodiments of the present application provide a method, apparatus, terminal device, and system for generating broadcast text.
In a first aspect, an embodiment of the present application provides a method for generating broadcast text, the method including: receiving a user's voice command; acquiring broadcast content corresponding to the voice command; and generating a target broadcast text according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information. This provides users with voice assistant broadcasts that match their personal historical usage habits, offers each user a personalized broadcast experience, and improves the naturalness of voice assistant interaction.
In a possible implementation, generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as inputs to a model, where the model outputs the target broadcast text, a broadcast text whose duration matches the broadcast length parameter. In this way, the model can, according to the broadcast length parameter, provide users with voice assistant broadcast text that matches their personal historical usage habits, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the model is a generative model or a retrieval model. Generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as inputs to a generative model, where the generative model outputs the target broadcast text, a broadcast text whose duration matches the broadcast length parameter; or using the broadcast content and the broadcast length parameter as inputs to a retrieval model, where the retrieval model retrieves a text template of limited length from a predefined template library according to the broadcast length parameter and outputs, via the retrieved limited-length template, the target broadcast text, a broadcast text whose duration matches the historical listening duration information. In this way, a generative or retrieval model can provide users with voice assistant broadcast text that matches their personal historical usage habits, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
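The retrieval-style generation described above can be sketched in a few lines. This is an illustrative assumption, not the patent's implementation: the template texts, the `{time}` slot, and the "longest template that fits the length limit" selection rule are all invented for the example.

```python
# Hypothetical template library; a real library would be far larger and
# organized per intent. Templates of increasing detail for one intent:
TEMPLATE_LIBRARY = [
    "Done.",
    "Alarm set for {time}.",
    "OK, I have set an alarm for {time} as you requested.",
]

def retrieve_template(length_limit: int) -> str:
    """Return the most detailed template that fits within the length limit L."""
    fitting = [t for t in TEMPLATE_LIBRARY if len(t) <= length_limit]
    # Fall back to the shortest template if nothing fits the limit.
    return max(fitting, key=len) if fitting else min(TEMPLATE_LIBRARY, key=len)

def generate_target_text(length_limit: int, slots: dict) -> str:
    """Fill the retrieved limited-length template with the slot values."""
    return retrieve_template(length_limit).format(**slots)

print(generate_target_text(30, {"time": "7:00"}))   # a medium template fits
print(generate_target_text(100, {"time": "7:00"}))  # the detailed template fits
```

A tight length limit selects the terse acknowledgement, while a generous limit selects the detailed confirmation, which is the behavior the broadcast length parameter is meant to induce.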
In a possible implementation, the broadcast length parameter is associated with device information, and a first broadcast length parameter is determined according to the device information. Generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a first target broadcast text according to the first broadcast length parameter and the broadcast content, where the first broadcast length parameter indicates first historical listening duration information associated with the device information. This provides users with voice assistant broadcasts that match personal historical usage habits and adapt to the device, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with scene information, and a second broadcast length parameter is determined according to the scene information. Generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a second target broadcast text according to the second broadcast length parameter and the broadcast content, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information. This provides users with voice assistant broadcasts that match personal historical usage habits and the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with device information and scene information, and a third broadcast length parameter is determined according to the device information and scene information. Generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a third target broadcast text according to the third broadcast length parameter and the broadcast content, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information. This provides users with voice assistant broadcasts that match personal historical usage habits and adapt to both the device and the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information, and/or scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content. In this way, the broadcast length parameter obtained through the classification model yields voice assistant broadcasts that match personal historical usage habits and adapt to the device and/or the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In a possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information, and/or scene information into the regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content. In this way, the regression model generates voice assistant broadcasts that match personal historical usage habits and adapt to the device and/or the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
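The classification and regression models above can be sketched with the random forest learners the earlier embodiment mentions. This is a minimal illustration under assumptions: the feature layout, toy training rows, category labels, and length budgets are all invented, not taken from the patent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Assumed feature layout per row: [historical listening duration (s),
# screen size (in), screen resolution (kilo-pixels), ambient noise (dB),
# room category id]. The rows themselves are toy data.
X = np.array([
    [4.0,  6.1,  2500, 35, 0],
    [12.0, 10.5, 4000, 40, 1],
    [25.0, 55.0, 8300, 30, 2],
    [5.5,  6.1,  2500, 70, 3],
])

# Classification model: outputs a length category (0 = concise, 1 = moderate,
# 2 = detailed), i.e. the "fourth broadcast length parameter".
y_category = np.array([0, 1, 2, 0])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_category)

# Regression model: outputs a length limit value L (here a character budget),
# i.e. the "fifth broadcast length parameter".
y_limit = np.array([20.0, 60.0, 140.0, 25.0])
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_limit)

features = np.array([[6.0, 6.1, 2500, 38, 0]])
category = int(clf.predict(features)[0])        # fourth broadcast length parameter
length_limit = float(reg.predict(features)[0])  # fifth broadcast length parameter
print(category, length_limit)
```

Online learning, as described above, would correspond to periodically refitting these models as new (features, observed listening duration) pairs accumulate.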
In a possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: linearly encoding the device information, scene information, and/or historical listening duration information separately and then fusing them to obtain a sixth broadcast length parameter, where the sixth broadcast length parameter is a representation vector of the broadcast length parameter; and using the sixth broadcast length parameter, the broadcast content, and whether the voice command is executable or not as inputs to a pre-trained language model, which outputs a sixth target broadcast text. In this way, the pre-trained language model generates voice assistant broadcasts that match personal historical usage habits and adapt to the device and/or the current scene, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
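The encode-then-fuse step above can be sketched numerically. This is an assumed, minimal NumPy version: the hidden size, random weights, and concatenate-then-project fusion are illustrative choices, not the patent's architecture, and a real system would learn these weights during fine-tuning.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 8  # small for illustration; a real model might match GPT-2's hidden size

def make_linear_encoder(in_dim, out_dim=HIDDEN):
    """One linear encoder per information type (user / device / scene)."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    b = np.zeros(out_dim)
    return lambda x: x @ W + b

user_enc = make_linear_encoder(4)    # e.g. historical listening durations
device_enc = make_linear_encoder(3)  # e.g. screen size / resolution
scene_enc = make_linear_encoder(3)   # e.g. noise level / room category
# Fusion module: concatenate the per-type encodings, then project down.
fusion_W = rng.standard_normal((3 * HIDDEN, HIDDEN)) * 0.1

def sixth_broadcast_length_parameter(user, device, scene):
    parts = np.concatenate([user_enc(user), device_enc(device), scene_enc(scene)])
    return parts @ fusion_W  # representation vector of the broadcast length parameter

vec = sixth_broadcast_length_parameter(np.ones(4), np.ones(3), np.ones(3))
print(vec.shape)
# This vector would be fed to the GPT-2 language model together with the
# broadcast content and the current-round dialogue state.
```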
In a possible implementation, acquiring the broadcast content corresponding to the voice command includes: acquiring intent and slot information according to the voice command; determining, according to the intent and slot information, whether the voice command is executable; and, when the voice command is not executable, generating the broadcast content, where the broadcast content is inquiry information. In this way, when the voice command cannot be executed, the broadcast content with which the voice assistant asks the user can be obtained.
In a possible implementation, determining the broadcast content according to the dialogue state includes: acquiring intent and slot information according to the voice command; determining, according to the intent and slot information, whether the voice command is executable; when the voice command is executable, determining the third-party service that executes the intent; and acquiring the broadcast content from the third-party service, where the broadcast content is result information corresponding to the voice command. In this way, when the voice command is executable, the broadcast content returned after the third-party service executes the voice command is obtained.
In a possible implementation, the method further includes: controlling the broadcast speed of the target broadcast text according to the broadcast length parameter. In this way, voice that matches personal historical usage habits and adapts to the device and/or the current scene can be generated, offering each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
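One hypothetical way to realize this speed control is to map the broadcast length parameter to a TTS speech-rate multiplier, speeding up when the target broadcast text exceeds the length limit L so the broadcast still fits the user's typical listening duration. The category names, base rates, and cap below are invented for illustration, not values from the patent.

```python
# Assumed base rates per length category (the fourth broadcast length parameter).
RATE_BY_CATEGORY = {"concise": 1.0, "moderate": 1.1, "detailed": 1.25}

def speech_rate(category, text_len, length_limit=0):
    """Pick a TTS rate multiplier from the length category, speeding up
    (with a cap) when the target broadcast text exceeds the length limit L
    (the fifth broadcast length parameter)."""
    rate = RATE_BY_CATEGORY.get(category, 1.0)
    if length_limit and text_len > length_limit:
        rate *= min(text_len / length_limit, 1.5)  # cap the speed-up at 1.5x
    return round(rate, 2)

print(speech_rate("concise", 40))       # text within limit: base rate
print(speech_rate("moderate", 90, 60))  # text over limit: rate is raised
```

The resulting multiplier would then be passed to the TTS module 37 as its speaking-rate setting.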
In a possible implementation, the method further includes: recording the broadcast duration of the current target broadcast text to obtain the historical listening duration information. In this way, a personalized broadcast experience matching personal historical usage habits can be obtained, improving the naturalness of voice assistant interaction.
In a second aspect, an embodiment of the present application provides a method for broadcasting text, the method including: receiving a user's voice command; generating a target broadcast text corresponding to the voice command; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information. The beneficial effects of controlling the broadcast speed of the target broadcast text according to the broadcast length parameter are the same as those of the embodiments of the first aspect in which the broadcast length parameter is used to generate the target broadcast text, and are not repeated below.
In a possible implementation, the broadcast length parameter is associated with device information, and a first broadcast length parameter is determined according to the device information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, where the first broadcast length parameter indicates first historical listening duration information associated with the device information.
In a possible implementation, the broadcast length parameter is associated with scene information, and a second broadcast length parameter is determined according to the scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
在一种可能的实现方式中,所述所述播报长度参数与场设备信息和场景信息关联,根据所述设备信息和场景信息确定第三播报长度参数,所述所述根据播报长度参数对所述目标播报文本的播报速度进行控制,包括:根据所述第三播报长度参数对所述目标播报文本的播报速度进行控制;所述第三播报长度参数指示与所述设备信息关联的第三历史收听时长信息。In a possible implementation manner, the broadcast length parameter is associated with field device information and scene information, and a third broadcast length parameter is determined according to the device information and scene information, and the broadcast length parameter is used to determine the third broadcast length parameter. Controlling the broadcast speed of the target broadcast text includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter; the third broadcast length parameter indicates the third history associated with the device information Listen to duration information.
In a possible implementation, the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, the device information and/or the scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of a set of length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.

In a possible implementation, the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, the device information and/or the scene information into a regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
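The distinction between the two implementations above (a classification model outputting a length *category* versus a regression model outputting a length *limit value*) can be illustrated with a minimal sketch. The feature names, thresholds, and weights below are hypothetical placeholders standing in for trained models, not the actual models of this application:

```python
def classify_length(avg_listen_s: float, device: str, scene: str) -> str:
    """Classification-model stand-in: outputs a length category
    (the 'fourth broadcast length parameter' above)."""
    score = avg_listen_s
    if device == "smart_speaker":   # assumed: screenless devices favor longer speech
        score += 5.0
    if scene == "driving":          # assumed: driving favors shorter broadcasts
        score -= 10.0
    if score < 10.0:
        return "short"
    if score < 30.0:
        return "medium"
    return "long"


def regress_length_limit(avg_listen_s: float, device: str, scene: str) -> float:
    """Regression-model stand-in: outputs a numeric length limit in seconds
    (the 'fifth broadcast length parameter' above)."""
    weights = {"smart_speaker": 1.2, "phone": 1.0, "car_terminal": 0.7}
    limit = avg_listen_s * weights.get(device, 1.0)
    if scene == "driving":
        limit *= 0.5
    return max(5.0, min(limit, 60.0))  # clamp to a plausible range of seconds
```

Either output can then be used to pick or trim the target broadcast text and set the TTS speech rate.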
In a third aspect, embodiments of this application provide an electronic device, including: at least one memory configured to store a program; and at least one processor configured to execute the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to perform the method described in any one of the foregoing embodiments.

In a fourth aspect, embodiments of this application provide a storage medium storing instructions that, when run on a terminal, cause the terminal to perform the method described in any one of the foregoing embodiments.
Description of Drawings
To describe the technical solutions of the embodiments disclosed in this specification more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Clearly, the accompanying drawings in the following description show merely some embodiments disclosed in this specification, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.

The accompanying drawings used in the description of the embodiments or the prior art are briefly introduced below.
FIG. 1 is a schematic diagram of an artificial intelligence framework;

FIG. 2 is a schematic diagram of an application system of the voice assistant according to an embodiment of this application;

FIG. 3 is a functional architecture diagram of the voice assistant in an embodiment of this application;

FIG. 4 is a flowchart of a broadcast text generation method according to Embodiment 1 of this application;

FIG. 5 is a schematic diagram of an application of the broadcast text generation method according to Embodiment 1 of this application;

FIG. 6 is a schematic structural diagram of a random forest-based machine learning model used in the broadcast text generation method according to Embodiment 3 of this application;

FIG. 7 is a schematic diagram of a typical pre-trained language model structure used in the broadcast text generation method according to Embodiment 4 of this application.
Detailed Description
In the following description, "some embodiments" describes a subset of all possible embodiments; it may be the same subset or different subsets of all possible embodiments, and the subsets may be combined with one another when there is no conflict.

In the following description, the terms "first", "second", "third", and the like, or module A, module B, module C, and the like, are used merely to distinguish between similar objects and do not represent a specific ordering of the objects. It can be understood that, where permitted, a specific order or sequence may be interchanged, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

In the following description, the reference numerals denoting steps, such as S110 and S120, do not mean that the steps must be performed in that order; where permitted, the order of the steps may be interchanged, or the steps may be performed simultaneously.

Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by a person skilled in the technical field of this application. The terms used herein are intended only to describe the embodiments of this application and are not intended to limit this application.
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
Natural language generation (NLG) is a part of natural language processing that generates natural language from a machine representation system such as a knowledge base or a logical form. NLG can be regarded as the reverse of natural language understanding (NLU): NLU must clarify the meaning of the input language and produce a machine representation language, whereas NLG must decide how to convert the conceptualized machine representation language into natural language that the user can receive.

In one possible solution, the user wakes up the voice assistant and issues a voice instruction related to querying the weather. Using its natural language understanding (NLU) capability, the voice assistant understands the weather-related voice instruction and classifies it according to a natural language classification system similar to Table 1; it queries the weather according to the classification result, and then either selects a corresponding template according to the weather query result to generate a broadcast text for the weather, or generates a broadcast text corresponding to the weather information category and its associated attributes. The broadcast text content matches the category to which the voice instruction belongs.
Table 1
Figure PCTCN2022095805-appb-000001
This solution generates different categories of broadcast text according to the different voice instructions input by the user, but the content of the broadcast text is related only to the category of the voice instruction; it does not take into account the user's personal usage habits, device differences, or differences in the user's scene, and therefore cannot provide a weather broadcast experience personalized to each user.

Embodiments of this application propose a broadcast text generation method, which relates to the AI field and is applicable to a voice assistant. By introducing user information, device information and/or scene information, the voice assistant can generate a broadcast text of personalized length according to the user's personal usage habits, device differences and/or the user's environment, generate broadcast voice information at a corresponding speech rate through TTS, inform the user of the broadcast content, and guide the user to continue using the device.
FIG. 1 is a schematic diagram of an artificial intelligence framework, which describes the overall workflow of an artificial intelligence system and is applicable to general requirements of the artificial intelligence field. Based on the framework shown in FIG. 1, the framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).

The "intelligent information chain" reflects a series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a refinement process of "data - information - knowledge - wisdom".

The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and the technical realization of information provision and processing, to the industrial ecology of the system.
(1) Infrastructure 10:

The infrastructure 10 provides computing capability support for the artificial intelligence system, enables communication with the outside world, and is supported by a basic platform. Sensors are used to communicate with the outside to obtain data streams; smart chips (hardware acceleration chips such as a CPU, an NPU, a GPU, an ASIC, or an FPGA) provide training, computing, and execution capabilities; the basic platform provides platform assurance and support such as cloud storage, cloud computing, and network interconnection, including a distributed computing framework and networks.
(2) Data 11

The data 11 at the layer above the infrastructure 10 represents the data sources in the artificial intelligence field.

In the broadcast text generation method proposed in the embodiments of this application, the data 11 at the layer above the infrastructure 10 comes from voice instructions acquired on the terminal side, device information of the terminal used, and scene information obtained through sensor communication with the outside.
(3) Data processing 12

Data processing usually includes data training, machine learning, deep learning, searching, reasoning, decision-making, and the like.

Machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, and training on data.

Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, using formalized information to carry out machine thinking and solve problems according to a reasoning control strategy; its typical functions are searching and matching.

Decision-making refers to the process of making decisions after reasoning on intelligent information, and usually provides functions such as classification, ranking, and prediction.

In the broadcast text generation method proposed in the embodiments of this application, the data processing process includes front-end processing, automatic speech recognition (ASR), natural language understanding (NLU), dialog management (DM), natural language generation (NLG), speech synthesis (TTS), and other processing of the received user voice instruction.
(4) General capabilities 13

After the data undergoes the data processing mentioned above, some general capabilities can further be formed based on the data processing results, for example, an algorithm or a general system.

In the embodiments of this application, after the voice instructions input by the user, the device information of the terminal used, and the scene information obtained through sensor communication with the outside undergo the above data processing, a broadcast text of personalized length can be generated based on the data processing results, and broadcast voice at a corresponding speech rate can be generated, providing a broadcast experience personalized to each user.
(5) Smart products and industry applications 14

Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution, productize intelligent information decision-making, and realize practical applications. The application fields mainly include smart manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, and smart terminals.

The broadcast text generation method proposed in the embodiments of this application can be applied to voice assistants of smart devices in fields such as smart terminals, smart home, smart security, and autonomous driving, providing a voice user interface (VUI) on smartphones, speakers, and smart vehicle-mounted terminals (electronic control unit, ECU), and completing corresponding tasks or providing related services according to the voice instructions input by the user.

For example, smart devices include smart TVs, smart speakers, robots, smart air conditioners, smart smoke alarms, smart fire extinguishers, smart vehicle-mounted terminals, mobile phones, tablets, laptops, desktop computers, and all-in-one machines.
FIG. 2 is a schematic diagram of an application system of the voice assistant according to an embodiment of this application. As shown in FIG. 2, in the system diagram 200, the data collection device 260 is configured to collect information such as user information, device information, scene information and/or historical listening duration, and store this information in the database 230. The data collection device 260 corresponds to the sensors of the infrastructure in FIG. 1 and includes apparatuses communicatively connected to the smart device, such as motion sensors, displacement sensors, and infrared sensors, which collect the user's current scene information, for example exercising, in a meeting, resting, or chatting.

The data collection device 260 also includes apparatuses communicatively connected to the smart device such as a camera and a GPS, which collect scene information about the user's current location or place, for example in a vehicle, a living room, or a bedroom.

The data collection device 260 also includes a timer, configured to record the start time, end time, and duration of the broadcast voice. The broadcast duration is recorded in the user information as the user's historical listening duration.
The client device 240 corresponds to the basic platform of the infrastructure in FIG. 1 and is configured to interact with the user: acquiring the voice instruction issued by the user, broadcasting the broadcast content corresponding to the voice instruction, presenting the broadcast content to the user, and storing this information in the database 230. The client device 240 includes the display screens, microphones, speakers, buttons, Bluetooth headset microphones, and the like of devices that provide a voice user interface (VUI), such as smartphones and smart vehicle-mounted terminals.

The microphone may be a sound pickup device, including an integrated microphone, a microphone or microphone array connected to the smart device, or a microphone or microphone array communicatively connected to the smart device through a short-range connection technology, configured to collect the voice instructions issued by the user.
The training device 220 corresponds to the smart chip of the infrastructure in FIG. 1 and trains the voice assistant 201 based on data maintained in the database 230, such as user information, device information, scene information and/or historical broadcast duration. The voice assistant 201 can provide a broadcast text of personalized length in a voice dialog scene between the user and the client device 240, generate broadcast voice at a corresponding speech rate, inform the user of the broadcast content, and guide the user to continue using the client device 240.
In FIG. 2, the execution device 210 corresponds to the smart chip of the infrastructure in FIG. 1 and is provided with an I/O interface 212 for data interaction with the client device 240. The execution device 210 acquires, through the I/O interface 212, the voice instruction information input by the user through the client device 240, and outputs the broadcast content to the client device 240 through the I/O interface 212, for example, broadcasting the content through a speaker, or presenting the broadcast content through the voice user interface (VUI) on the display screen of a smartphone, a smart vehicle-mounted terminal, or the like.

The execution device 210 may call data, code, and the like in the data storage system 250, and may also store data, code instructions, and the like in the data storage system 250.

The training device 220 and the execution device 210 may be the same smart chip or different smart chips.
The database 230 is a data collection of user information, device information and/or scene information stored on a storage medium.

The voice assistant 201 is agent software for executing voice instructions or services. The execution device 210 runs the voice assistant 201; after acquiring the voice instruction issued by the user, the voice assistant generates a target broadcast text of personalized length according to the user information, the device information and/or the scene information, controls the speech rate of the broadcast voice, informs the user of the broadcast content, and guides the user to continue using the device.

Finally, the I/O interface 212 returns the target broadcast text of personalized length generated by the voice assistant 201 to the client device 240 as output data, and the client device 240 displays the broadcast text and broadcasts it to the user at the corresponding speech rate.
More deeply, the training device 220 acquires the training data and corpora stored in the database 230 and trains the voice assistant 201 based on historical data such as user information, device information and/or scene information, with the training objective of outputting broadcast text whose length matches the user's historical listening records, so that the voice assistant outputs a better target broadcast text.

In the situation shown in FIG. 2, the user can input voice instruction information to the execution device 210, for example, by operating in the voice user interface (VUI) provided by the client device 240. Alternatively, the client device 240 can automatically input instructions to the I/O interface 212 and obtain the broadcast content; if automatically inputting instruction information requires the user's authorization, the user can set the corresponding permission in the client device 240. The user can view or listen to the broadcast content output by the execution device 210 on the client device 240; the specific presentation form may be display, a wake-up sound, broadcasting, or the like. The client device 240 can also serve as a voice data collection terminal and store the collected wake-up sound or voiceprint data of the user in the database 230.

It should be noted that FIG. 2 is merely a schematic diagram of a system application scenario provided by an embodiment of this application, and the positional relationships among the devices, components, and modules shown in the figure do not constitute any limitation. The system in FIG. 2 may correspond to one or more device entities; for example, in FIG. 2 the data storage system 250 is an external memory relative to the execution device 210, but in other cases the data storage system 250 may also be placed inside the execution device 210.
FIG. 3 is a functional architecture diagram of the voice assistant in an embodiment of this application. The functional modules of the voice assistant 201 are described below. As shown in FIG. 3, the voice assistant 201 includes a front-end processing module 31, a speech recognition module 32, a semantic understanding module 33, a dialog state module 34, a dialog policy learning module 35, a natural language generation module 36, a speech synthesis module 37, and a dialog output module 38.

The front-end processing module 31 processes the voice instruction input by the user into the data format required by the network model, for use by the speech recognition module 32.

For example, the front-end processing module 31 acquires a voice instruction input by the user in the opus compression format and decodes it into an audio signal in the pcm format; it performs separation, noise reduction, and feature extraction on the audio signal using voiceprints or other features, and obtains the audio feature vector of a mel-frequency cepstral coefficients (MFCC) filter bank through audio processing algorithms such as framing, windowing, and short-time Fourier transform. The front-end processing module 31 is generally disposed on the terminal side.
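The framing and windowing steps mentioned above can be sketched minimally as follows. The frame length and hop size are illustrative values (25 ms frames with a 10 ms hop at a 16 kHz sampling rate), not parameters specified by this application:

```python
import math


def frame_signal(samples, frame_len=400, hop=160):
    """Split a decoded PCM sample sequence into overlapping frames,
    e.g. 25 ms frames with a 10 ms hop at 16 kHz."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def hann_window(frame):
    """Apply a Hann window to one frame before the short-time Fourier
    transform, reducing spectral leakage at the frame edges."""
    n = len(frame)
    return [x * 0.5 * (1 - math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]
```

Each windowed frame would then go through the short-time Fourier transform and mel filter bank to produce the MFCC feature vectors consumed by the ASR module.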
The automatic speech recognition (ASR) module 32 acquires the audio feature vector obtained by the front-end processing module 31 and converts it into text through an acoustic model and a language model, for the semantic understanding module 33 to understand. The acoustic model classifies (decodes) acoustic features into phonemes or words, and the language model decodes the phonemes or words into a complete sentence.

For example, the acoustic model and the language model process the audio feature vector in series: the acoustic model converts the audio feature vector into phonemes or words, and the language model then converts the phonemes or words into a character sequence, outputting the text corresponding to the user's speech.

For example, the ASR module 32 may adopt an end-to-end implementation, in which the acoustic model and the language model use neural network structures and are jointly trained so that the training result outputs a Chinese character sequence corresponding to the user's speech. For example, the acoustic model may be modeled using a hidden Markov model (HMM), and the language model may be an n-gram model.
The natural language understanding (NLU) module 33 converts the text or Chinese character sequence corresponding to the user's speech into structured information, where the structured information includes machine-executable intent information and recognizable slot information. Its purpose is to obtain a semantic representation of natural language through syntactic, semantic, and pragmatic analysis.

It can be understood that the intent information refers to the task to be performed by the voice instruction issued by the user, and the slot information refers to the parameter information that must be determined to perform the task.

For example, the user asks the voice assistant 201, "What's the temperature in Nanjing today?" The NLU module 33 understands the text corresponding to this voice instruction and obtains the intent "query weather" with the slots "location: Nanjing" and "time: today".
The NLU module 33 may use a classifier to classify the text corresponding to the voice instruction into intent information supported by the voice assistant 201, and then use a sequence labeling model to label the slot information in the text.

The classifier may be a model usable for classification in traditional machine learning, for example, an NB model, a random forest (RF) model, an SVM classification model, or a KNN classification model; it may also be a deep learning text classification model, for example, a FastText model or TextCNN.

The sequence labeling model labels each element in the text information or Chinese character sequence and outputs a label sequence; these labels can indicate the start, end, and type of a slot. The sequence labeling model may be one of the following: a linear model, a hidden Markov model, a maximum entropy Markov model, a conditional random field, or the like.

The NLU module 33 may also use an end-to-end model to output the intent information and the slot information simultaneously.
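To illustrate how such a label sequence marks the start, end, and type of each slot, the sketch below decodes a BIO-style tag sequence (a common output convention for sequence labeling models such as CRFs; the tag names are hypothetical) into slot values, using the running "南京今天气温如何" example:

```python
def slots_from_bio(tokens, tags):
    """Collect slots from per-token BIO tags: 'B-xxx' opens a slot of
    type xxx, 'I-xxx' continues it, and 'O' is outside any slot."""
    slots, current_type, current_tokens = {}, None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_type:
                slots[current_type] = "".join(current_tokens)
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_type:
                slots[current_type] = "".join(current_tokens)
            current_type, current_tokens = None, []
    if current_type:
        slots[current_type] = "".join(current_tokens)
    return slots


# "南京今天气温如何" -> {"location": "南京", "time": "今天"}
result = slots_from_bio(
    ["南", "京", "今", "天", "气", "温", "如", "何"],
    ["B-location", "I-location", "B-time", "I-time", "O", "O", "O", "O"],
)
```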
The dialog state tracking (DST) module 34 manages the dialog state of the voice assistant 201. Using the intent information and slot information of the current dialog turn output by the NLU module 33, the DST module 34 maintains the current turn's dialog intent, the filled slots, and the dialog state in a multi-turn dialog scene.

The input of the DST module 34 is the dialog state of the previous turn, the broadcast content returned by the third-party application in the previous turn, and the intent information and slot information of the current turn; the output is the dialog state of the current turn.

The DST module 34 records the dialog history and dialog state of the voice assistant 201, helping the voice assistant 201 understand the instruction in the user's speech for the current turn in combination with the dialog history recorded by the context manager (that is, the database 230 in FIG. 2), and give appropriate feedback.

For example, in the first dialog turn, user A asks the voice assistant 201 to "book a flight ticket to Nanjing"; in the second turn, user A asks the voice assistant 201, "How is the weather there?" The NLU module 33 outputs the intent of the current turn as "query weather" with the slots "location: there" and "time: ". Because the DST module 34 has recorded the state of the first turn, the system, in combination with the dialog history recorded by the context manager, understands that "there" in the slot "location: there" refers to "Nanjing" and fills "Nanjing" into the location slot. The DST module 34 then outputs the dialog state information of the current turn, including the intent information (query weather), the filled slot (Nanjing), and the unfilled slot (time: ).
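The two-turn example above can be sketched as a minimal state update. The anaphora-resolution rule (literally matching "那里"/"there" against a slot remembered from an earlier turn) is a deliberately simplified placeholder for the real DST model:

```python
def update_state(prev_state, nlu_intent, nlu_slots):
    """Merge the current turn's NLU output into a new dialog state,
    resolving referring expressions against the previous state."""
    state = {"intent": nlu_intent, "slots": dict(nlu_slots)}
    for name, value in state["slots"].items():
        # Hypothetical rule: '那里' ('there') refers back to the value
        # the same slot held in an earlier turn.
        if value == "那里" and name in prev_state.get("slots", {}):
            state["slots"][name] = prev_state["slots"][name]
    return state


# Turn 1: "book a flight ticket to Nanjing"
turn1 = update_state({}, "book_flight", {"location": "南京"})
# Turn 2: "How is the weather there?" -- location resolves to 南京,
# while the time slot remains unfilled for the dialog policy to request.
turn2 = update_state(turn1, "query_weather", {"location": "那里", "time": None})
```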
The dialog policy learning (DPL) module 35 is used to decide the next action to be performed by the voice assistant 201, including asking the user a question, executing the user's instruction, recommending other instructions to the user, and generating a reply.
The DPL module 35 uses the dialog state information output by the DST module 34 to determine the next action.
In one implementable embodiment, the DPL module 35 may determine, according to the current dialog state, that the next action is to generate broadcast content that asks the user a question.
For example, following the example above, the dialog state information of the current turn output by the DST module 34 contains an unfilled slot (time: ), so the DPL module 35 may determine that the next action is to ask the user "Which day?", in order to maintain the control logic of the dialog system and ensure that the dialogue can continue. The action information is an action tag or structured information, such as "REQUEST-SLOT: date", indicating that the user is to be asked for the date next.
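A rule-style policy of this kind can be sketched as follows; the action names and the function are hypothetical illustrations, since the patent only specifies the "REQUEST-SLOT" tag format:

```python
def next_action(state):
    """Decide the assistant's next action from the current dialog state.

    Returns an action tag such as "REQUEST-SLOT:<name>" when a required
    slot is missing, otherwise "EXECUTE" (assumed name) to dispatch the
    instruction to a third-party app.
    """
    if state.get("unfilled"):
        # Ask the user for the first missing slot, e.g. REQUEST-SLOT:date.
        return "REQUEST-SLOT:" + state["unfilled"][0]
    return "EXECUTE"

print(next_action({"intent": "check_weather",
                   "slots": {"location": "Nanjing"},
                   "unfilled": ["date"]}))   # REQUEST-SLOT:date
print(next_action({"intent": "check_weather",
                   "slots": {"location": "Nanjing", "date": "today"},
                   "unfilled": []}))         # EXECUTE
```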
In one implementable embodiment, the DPL module 35 may determine, according to the current dialog state, that the next action is to select an appropriate third-party application (app) to execute the voice instruction, and send the intent and slot information to the selected third-party application; it then obtains the execution result returned by the third-party application, the execution result being the broadcast content corresponding to the voice instruction.
A third-party application (app) is an application that can execute or satisfy the intent of the voice instruction according to the slot information and return broadcast content, for example an app that can query the weather, an app that can provide product information, or an app that can provide navigation or positioning information.
The broadcast content determined by the DPL module 35 according to the current dialog state, or the broadcast content returned by a third-party application (app) or server after executing the voice instruction according to the intent and slot information, can serve as an input parameter of the DST module 34 for the dialog state of the next turn, and can also serve as an input parameter of the NLG module 36.
The natural language generation (NLG) module 36 is a translator that converts structured information into natural language expressions, and is currently widely used in voice assistants. When generating the voice assistant's broadcast utterance, because layouts and sizes differ across devices and webpage positions, a duration limit parameter needs to be introduced to limit the length of the generated text, so as to adaptively match the requirements on broadcast content and broadcast duration of different users, different devices, and different scenarios.
In the embodiments of the present application, the NLG module 36 is used to obtain the current dialog state maintained by the DST module 34, the next action determined by the DPL module 35, and/or the broadcast content returned by the third-party application (app), and to generate a target broadcast text of personalized length in combination with user information, device information, and/or scene information.
For example, when the current dialog state maintained by the DST module 34 is the intent information (check the weather), the filled slot (Nanjing), and the unfilled slot (time: ), and the next action determined by the DPL module 35 is to ask the user, the broadcast text generated by the NLG module 36 is "Which day would you like to query?"
For example, the NLG module 36 inputs the current dialog state and the broadcast content returned by the third-party application into a template matching the current intent, device, or scene, and outputs a target broadcast text of the length configured by the template. The NLG module 36 may also use a model-based black box to output a target broadcast text of personalized length.
The user profile (UP) module 213 is used to obtain user information by querying the data in the database 230 shown in FIG. 2. The user information records information such as the user's historical listening durations for voice assistant broadcasts.
User information, also called a user portrait, describes the user's usage habits by collecting data in various dimensions such as the user's social attributes, consumption habits, preference characteristics, and behavior when using the system, and analyzes and aggregates these characteristics to mine potential value information, thereby abstracting a full picture of the user information, which is used to recommend personalized content to the user or to provide services that match the user's habits.
The device profile (DP) module 214 is used to obtain device information of the client device 240 shown in FIG. 2, including the display resolution, size, and category, the speaker volume and timbre, and the like.
The context awareness (CA) module 215 is used to obtain current scene information through the data acquisition device 260 shown in FIG. 2. The scene information includes the room category, the background noise level, the user's current motion state, and the like.
The CA module 215, the DP module 214, and the UP module 213 may also be modules external to the voice assistant 201, which is not specifically limited here.
In the embodiments of the present application, after the voice assistant understands the user's voice instruction through the NLU module 33 and sends it to the corresponding third-party application (app) for execution, it can obtain the structured broadcast content returned by the third-party application and use the NLG module 36 to convert the returned structured broadcast content into broadcast text, for the TTS module to generate broadcast speech and inform the user of the broadcast content.
The text-to-speech (TTS) module 37 is used to control the broadcast speed of the target broadcast text according to the broadcast length parameter, where the broadcast length parameter indicates historical listening duration information.
In the embodiments of the present application, when converting the target broadcast text into broadcast speech, the TTS module 37 introduces the broadcast length parameter and controls the speech rate of the broadcast in combination with the user information, device information, and/or scene information, thereby limiting the broadcast duration of the target broadcast text. While ensuring the accuracy of speech generation, it also controls characteristics of the generated speech such as speech rate, timbre, and volume.
The dialog output module 38 is configured to generate a corresponding broadcast card according to the target broadcast speech and then present it to the user.
Embodiment 1
The embodiments of the present application propose a method for generating broadcast text. The method is applied to a voice assistant: a user's voice instruction is received, the broadcast content corresponding to the voice instruction is obtained, and a target broadcast text is generated according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
FIG. 4 is a flowchart of the method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in FIG. 4, the voice assistant performs the following steps S401 to S404.
S401: Receive a user's voice instruction.
The voice assistant 201 receives the user's voice instruction.
For example, after waking up the voice assistant 201, user A issues the voice instruction "What's the temperature in Nanjing today?" to the voice assistant 201.
S402: Obtain the broadcast content corresponding to the voice instruction.
The voice assistant 201 performs front-end processing on the voice instruction "What's the temperature in Nanjing today" to obtain an audio feature vector; recognizes the audio feature vector as text through an acoustic model and a language model; understands the text to obtain the intent corresponding to the voice instruction, "check the weather", with the slots "location: Nanjing" and "time: today"; and manages the dialog state: according to the dialog state of the previous turn, the broadcast content of the previous turn, and the intent information and slot information corresponding to the current voice instruction, it obtains the current dialog state, including the intent information, the filled slots, and the unfilled slots, and determines whether the voice instruction is executable.
In one implementable implementation, when the current dialog state is executable, the voice assistant 201 may determine the third-party application that executes the intent information; send the intent information and slot information corresponding to the voice instruction to that third-party application; and obtain the execution result returned by the third-party application (app) or server, the execution result being the broadcast content corresponding to the current voice instruction.
For example, the user issues the voice instruction "What's the temperature in Nanjing today" to the voice assistant 201. Combining the intent information and slot information related to the user request, the voice assistant 201 selects an appropriate third-party application (app) to execute the voice instruction, and outputs the execution result related to the user request returned by the third-party application (app). The execution result is the structured broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"}.
In one implementable implementation, when there is an unfilled slot in the current dialog state, it is determined that the voice instruction is not executable, and the voice assistant 201 may generate the broadcast content according to the dialog state.
For example, when there is an unfilled slot in the current dialog state, the voice assistant 201 obtains the next action information determined by the DPL module 35. The action information is an action tag or structured information, and the broadcast content is determined to be "REQUEST-SLOT: date", indicating that the user is to be asked for the date next.
S403: Generate a target broadcast text according to the broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information.
In one implementable implementation, the NLG module 36 may be a generative model. The broadcast content is used as the input of the generative model, and the user's broadcast length parameter is used as an additional parameter whose effect on the length of the output broadcast text is learned implicitly from the training data, generating the target broadcast text, i.e., a broadcast text whose duration matches the broadcast length parameter.
In another implementable implementation, the length or length range of the text generated by the generative model may be explicitly limited by the input broadcast length parameter: the broadcast content and the broadcast length parameter are used as the input of the model, and a target broadcast text of limited length is output.
In one implementable implementation, the NLG module 36 may be a retrieval model. The broadcast content is used as the input of the retrieval model, the corresponding template is retrieved according to the broadcast content, and the target broadcast text is generated through the retrieved template.
In one implementable implementation, the broadcast content and the user's broadcast length parameter are used as the input of the retrieval model, a template corresponding to the broadcast content is retrieved from a predefined template library according to the length limited by the broadcast length parameter, and the target broadcast text is output through the retrieved template.
In one implementable implementation, the broadcast length parameter may be determined according to the average or weighted average of at least one piece of historical listening duration information. For example, the UP module 213 obtains the user information, obtains the historical listening duration information of each time the user listened to a voice broadcast, and computes a statistical average or weighted average of these historical listening durations to obtain the broadcast length parameter; the minimum/maximum value or the most recent value of the historical listening durations may also be used as the broadcast length parameter.
For example, the UP module 213 obtains the user's historical listening duration t = 5 s, and after conversion through a mapping table it is determined that the character length of the broadcast text to be generated is 20, so the broadcast length parameter is 20. According to the returned broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"} and the broadcast length parameter 20, the NLG module 36 generates a target broadcast text of about 20 characters: "Nanjing is sunny today, with a low of 15 degrees Celsius and a high of 23 degrees Celsius".
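The conversion from historical listening duration to a character-length budget can be sketched with a hypothetical mapping table. The bucket boundaries below are assumptions for illustration; the text only states that a 5 s duration maps to a 20-character broadcast length parameter:

```python
# Hypothetical duration -> character-length mapping table (assumed values).
# The example in the text maps a 5 s historical listening duration to a
# 20-character broadcast length parameter, i.e. roughly 4 chars/second.

MAPPING_TABLE = [  # (max average duration in seconds, character budget)
    (2.0, 10),
    (5.0, 20),
    (10.0, 40),
]

def broadcast_length_param(history_durations):
    """Average the historical listening durations, then map to a length."""
    avg = sum(history_durations) / len(history_durations)
    for max_duration, chars in MAPPING_TABLE:
        if avg <= max_duration:
            return chars
    return MAPPING_TABLE[-1][1]  # cap very long listeners at the largest budget

print(broadcast_length_param([5.0, 4.8, 5.2]))  # 20
print(broadcast_length_param([2.1, 1.9]))       # 10
```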
Before the voice assistant of the smart device is enabled, the historical listening duration may take an initial value. The value may be a precise numerical record, such as "5 seconds" or "20 characters", or an identifier mapped to a certain duration range, such as "medium" or "concise". The initial value may also be the average listening duration obtained by the smart device manufacturer through user research, or the average listening duration of the group to which the user belongs. The embodiments of the present application do not limit the initial value of the historical listening duration.
Each time the user listens to a voice broadcast, the voice assistant 201 continuously records the listening duration of the broadcast and collects the duration information of each listened broadcast in the user portrait, obtaining multiple pieces of historical listening duration information.
In one implementable implementation, the recording of the listening duration may start timing from the moment the broadcast starts and end timing when one of the following occurs: the broadcast finishes, the broadcast is interrupted, or the user closes the program or switches to another program. The listening duration is the time interval from the start of timing to the end of timing.
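The timing rule above can be sketched as a small recorder. The event names and the class interface are assumptions for illustration:

```python
import time

# Assumed names for the three end conditions described in the text:
# broadcast finished, broadcast interrupted, program closed/switched.
END_EVENTS = {"finished", "interrupted", "closed", "switched"}

class ListeningRecorder:
    """Record one listening duration per broadcast: start timing when the
    broadcast starts, stop on the first end event, and append the interval
    to the user's listening history."""

    def __init__(self):
        self.history = []   # collected historical listening durations (s)
        self._start = None

    def broadcast_started(self):
        self._start = time.monotonic()

    def on_event(self, event):
        if event in END_EVENTS and self._start is not None:
            self.history.append(time.monotonic() - self._start)
            self._start = None

rec = ListeningRecorder()
rec.broadcast_started()
rec.on_event("interrupted")   # user cut the broadcast short
print(len(rec.history))       # 1
```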
S404: Control the broadcast speed of the target broadcast text according to the broadcast length parameter.
In one implementable implementation, after obtaining the broadcast text, the TTS module 37 uses the broadcast length parameter as a constraint on the speech rate of the broadcast speech, and converts the broadcast text into broadcast speech that matches the current user's historical listening habits.
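Using the broadcast length parameter as a speech-rate constraint can be sketched as follows. The clamping bounds are assumptions; the FIG. 5 example targets a 4.5 s to 5.5 s utterance for a 20-character text and a 5 s listening habit:

```python
def speech_rate(char_count, target_duration, min_rate=2.0, max_rate=8.0):
    """Characters per second needed to fit the text into the target
    duration, clamped to an assumed plausible TTS range."""
    rate = char_count / target_duration
    return max(min_rate, min(max_rate, rate))

# FIG. 5 example: a 20-character text for a user with a 5 s listening habit.
rate = speech_rate(20, 5.0)
print(rate)       # 4.0 characters per second
print(20 / rate)  # 5.0 s, within the 4.5-5.5 s window
```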
FIG. 5 is a schematic diagram of an application of the method for generating broadcast text proposed in Embodiment 1 of the present application. As shown in FIG. 5, user A wakes up the voice assistant 201 and asks "What's the temperature in Nanjing today?"
The front-end processing module 31 performs audio decoding on the voice instruction "What's the temperature in Nanjing today?" input by user A, decoding it into an audio signal in PCM format; it separates the audio signal, denoises it, and extracts features using voiceprint or other characteristics, and obtains an audio feature vector through audio processing algorithms such as framing, windowing, and short-time Fourier transform.
The ASR module 32 converts the audio feature vector into text through an acoustic model and a language model. Specifically, the acoustic model converts the acoustic features in the audio feature vector into phonemes or words, the language model then converts the phonemes or words into a text sequence, and the text corresponding to user A's voice instruction is output.
The NLU module 33 understands the text and obtains the user's intent, "check the weather", with the slot "location: Nanjing".
The DST module 34 uses the "check the weather" intent of the current turn output by the NLU module 33 and the slot "location: Nanjing", and outputs the dialog state information of the current turn, including the intent information (check the weather) and the filled slots (Nanjing) and (time: today).
The DPL module 35 uses the dialog state information output by the DST module 34 to determine that the next action is to execute the instruction. Using the slot information as parameters, the DPL module 35 selects an appropriate third-party service or application (app) according to the intent information to execute the user's voice instruction, and sends the "check the weather" request to the corresponding third-party application (service provider W).
The NLG module 36 obtains the returned broadcast content as the structured information {"temperature": "15-23", "unit": "C", "location": "Nanjing"}. At the same time, it obtains user A's historical listening duration t_A = 5 s through the UP module 213, and after conversion through the mapping table it is determined that the character length of the broadcast text to be generated is 20, so the broadcast length parameter is 20.
According to the returned broadcast content and the broadcast length parameter, the NLG module 36 generates a target broadcast text of about 20 characters: "Nanjing is sunny today, with a low of 15 degrees Celsius and a high of 23 degrees Celsius".
The TTS module 37 performs speed control according to the target broadcast text and the listening duration t = 5 s, and generates broadcast speech with a length of 4.5 s to 5.5 s for broadcasting.
After the broadcast is completed, the voice assistant 201 sends the user's listening duration for this broadcast to the UP module 213, and the UP module 213 records the duration for which user A listened to this broadcast.
User B wakes up the voice assistant 201. The voice instruction input by user B is the same as user A's, and the process by which the DPL module 35 obtains the returned result is the same as for user A. At the same time, user B's historical listening duration t_B = 2 s is obtained through the UP module 213; after conversion it is determined that the character length of the broadcast text to be generated is about 10 characters, so the broadcast length parameter is 10. The target broadcast text generated by the NLG module 36 is "Sunny, 15 to 23 degrees Celsius", and the TTS module 37 generates broadcast speech with a duration of 1.5 s to 2.5 s.
As can be seen from the embodiment shown in FIG. 5, for different users A and B, according to the personalized differences in their historical listening durations, the method for generating broadcast text proposed in the embodiments of the present application can generate broadcast texts of different lengths for the same voice instruction, so that the voice assistant can generate personalized broadcast text according to the user's usage habits and then perform personalized broadcasting based on it.
The method for generating broadcast text proposed in the embodiments of the present application introduces user information in the generation stage of the broadcast text and broadcast speech, and controls the level of detail of the target broadcast text according to the user's historical listening durations recorded in the user information, providing a personalized interaction experience, tailored to each individual, between the user and the voice assistant.
Embodiment 2
On the basis of Embodiment 1, the method for generating broadcast text proposed in the embodiments of the present application introduces data of user information, device information, and/or current scene information, combines the user's voice instruction with the user's historical listening durations, device information, and/or current scene information to generate a broadcast text whose length matches the user's historical listening habits, and broadcasts it at a corresponding speech rate, providing a personalized broadcast experience. The user information includes the user's historical listening durations; the device information includes configuration information such as the display resolution and size of the broadcast device and the broadcast device category; the scene information includes information such as the room category, the background noise level, and the user's current motion state.
The voice assistant obtains the device information of the broadcast device in use through the DP module 214 and the current scene information through the CA module 215. The UP module 213 searches the database 230 using the device information and the scene information as indexes, and obtains the most fine-grained list of historical listening duration information that meets the threshold requirement, as shown in Table 2.
In the dialog system of the voice assistant 201, the user's historical listening duration is divided into three levels according to the device information and the current scene and is calculated separately for each level. The listening duration calculated at the finest-grained level currently available is used as the broadcast length parameter; step S403 of Embodiment 1 is performed to generate the target broadcast text, and step S404 of Embodiment 1 is performed for speed control, so that the broadcast text is broadcast at a speech rate matching the current user's listening habits.
After the user completes a broadcast-text listening event, the historical listening duration of the corresponding level is updated based on the three-level index structure. The three-level historical listening duration information list is shown in Table 2:
Table 2

Time      Device d      Scene e       Listening duration t
20:00:03  mobile phone  vehicle       1.7 s
22:05:03  TV            bedroom       8.1 s
12:20:10  TV            living room   5.2 s
19:05:54  TV            living room   7.1 s
08:03:03  mobile phone  bedroom       2.5 s
08:30:45  mobile phone  living room   3 s
17:35:04  mobile phone  vehicle       1.5 s
18:30:08  mobile phone  vehicle       1.9 s
For example, the broadcast length parameters are calculated according to the three-level listening durations. According to Table 2, the following broadcast length parameters are available: the overall listening duration t_total, the mobile phone listening duration t_d1, the TV listening duration t_d2, the vehicle listening duration t_e1, the living-room listening duration t_e2, and the listening duration of the mobile phone in the vehicle. From the data in Table 2:
t_total = average(all) = 3.875 s;
t_d1 = average(d1) = 2.12 s;
t_d2 = average(d2) = 6.8 s;
t_e1 = average(e1) = 1.7 s;
t_e2 = average(e2) = 5.1 s;
t_d1e1 = average(d1e1) = 1.7 s;
In the above formulas, average() is the mean function; among the index values in parentheses, d1 is the mobile phone, d2 is the TV, e1 is the vehicle, e2 is the living room, and d1e1 is the mobile phone in the vehicle.
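The per-level averages above follow directly from grouping the Table 2 records, which can be checked with a short script:

```python
# Recompute the three-level averages from the Table 2 records.
records = [  # (device, scene, listening duration in seconds)
    ("phone", "vehicle", 1.7), ("tv", "bedroom", 8.1),
    ("tv", "living_room", 5.2), ("tv", "living_room", 7.1),
    ("phone", "bedroom", 2.5), ("phone", "living_room", 3.0),
    ("phone", "vehicle", 1.5), ("phone", "vehicle", 1.9),
]

def avg(values):
    return sum(values) / len(values)

t_total = avg([t for _, _, t in records])                               # all records
t_d1 = avg([t for d, _, t in records if d == "phone"])                  # device level
t_d2 = avg([t for d, _, t in records if d == "tv"])
t_e1 = avg([t for _, e, t in records if e == "vehicle"])                # scene level
t_e2 = avg([t for _, e, t in records if e == "living_room"])
t_d1e1 = avg([t for d, e, t in records if d == "phone" and e == "vehicle"])

print(t_total, t_d1, t_d2, t_e1, t_e2, t_d1e1)
# 3.875, 2.12, 6.8, 1.7, 5.1, 1.7 (up to floating-point rounding)
```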
In one implementable implementation, multiple pieces of historical listening duration information may be determined according to the device information or the scene information, and the broadcast length parameter may be determined according to the average or weighted average of the multiple pieces of historical listening duration information. The broadcast length parameter determined according to the device information is denoted as the first broadcast length parameter; the broadcast length parameter determined according to the scene information is denoted as the second broadcast length parameter.
For example, when the number of historical listening duration records collected under the device information or the scene information is less than a threshold, the voice assistant uses the broadcast length parameter obtained through the first-level calculation.
The first-level calculation computes the overall listening duration t_total. The overall listening duration t_total is consistent with the user's historical listening duration defined in Embodiment 1, i.e., the average or weighted average of multiple pieces of historical listening duration information, and t_total is used as the broadcast length parameter.
For example, when the threshold is set such that only three or more records are valid, and the user has fewer than three listening records, the voice assistant may compute a statistical average or weighted average of the historical duration information of each of the user's voice broadcast listening events, determine the user's overall listening duration t_total, and use t_total as the broadcast length parameter.
For example, when the number of historical listening duration records collected under the device information or the scene information is greater than the threshold, the historical listening duration information obtained through the second-level calculation is used, and the broadcast length parameter is determined according to the average or weighted average of the multiple pieces of historical listening duration information.
The second-level calculation counts, for the listening duration t_d under the device information, the multiple pieces of historical listening duration information on the corresponding device, or, for the listening duration t_e under the scene information, the multiple historical listening durations in the corresponding scene.
For example, the device corresponding to the device information in Table 2 may be a smart terminal such as a mobile phone or a TV; the scene corresponding to the scene information may be a place such as a vehicle, a bedroom, or a living room, or a motion state such as exercising or resting.
示例性地,当用户A通过登录手机终端的语音助手收听天气播报的历史收听时长记录为5条,超过系统设定的阈值3条时,语音助手可以根据手机终端记录的每一条历史收听时长信息统计平均值或加权平均值,获得用户A在手机终端下的报播报长度参数。For example, when user A logs in the voice assistant of the mobile terminal to listen to the historical listening time record of the weather broadcast is 5, which exceeds the threshold of 3 set by the system, the voice assistant can record each piece of historical listening time information according to the mobile terminal Statistical average value or weighted average value to obtain user A's report broadcast length parameter under the mobile terminal.
示例性地,当用户B通过登录手机终端的语音助手在客厅收听天气播报的记录为1条,通过登录智能电视的语音助手在客厅收听天气播报的记录为2条,用户B通过语音助手在客厅收听天气播报的记录达到对话系统设定的阈值3条时,用户B登录的语音助手可以根据每一条在客厅收听播报的历史时长记录统计平均值或加权平均值,获得用户通过不同的智能终端在同一场景下的播报长度参数。For example, when user B logs into the voice assistant of the mobile terminal to listen to the weather broadcast in the living room, there is 1 record, and the record of listening to the weather broadcast in the living room through the voice assistant of the smart TV is 2, and user B uses the voice assistant in the living room. When the record of listening to the weather report reaches the threshold set by the dialog system of 3, the voice assistant logged in by user B can record statistical average or weighted average according to the historical duration of each record of listening to the broadcast in the living room, and obtain the user through different smart terminals. The broadcast length parameter in the same scene.
In a possible implementation, at least one piece of historical listening duration information may be determined according to the device information and the scene information, and the broadcast length parameter may be determined according to the average or weighted average of multiple pieces of historical listening duration information.
Exemplarily, when the number of historical listening duration records collected under the combination of device information and scene information is greater than the threshold, the historical listening duration information obtained through the third-level calculation is used. The broadcast length parameter determined according to the combination of device information and scene information is denoted as the third broadcast length parameter.
The third-level calculation counts the user's historical listening durations on the current device d in the current scene e according to the device-scene listening duration t_de.
Exemplarily, when user C has 3 historical listening duration records for weather broadcasts heard in a vehicle through a mobile phone terminal, reaching the threshold set by the dialogue system, the voice assistant user C is logged into may compute the average or weighted average of each recorded historical duration of listening to broadcasts in the vehicle, obtaining the broadcast length parameter for user C listening to broadcasts in the vehicle through the mobile phone terminal.
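The three-level fallback described above (device+scene, then device or scene alone, then the overall t_total) can be sketched as follows (an illustrative sketch only; the record layout, the threshold value of 3, and the function names are assumptions, not part of this application):

```python
THRESHOLD = 3  # minimum number of records for a level-specific average

def average(durations):
    return sum(durations) / len(durations)

def broadcast_length_parameter(history, device=None, scene=None):
    """Pick the most specific level with enough records.

    history: list of (device, scene, duration_seconds) tuples.
    Third level: device+scene (t_de); second level: device (t_d) or
    scene (t_e) alone; first level: the overall average t_total.
    """
    t_de = [t for d, e, t in history if d == device and e == scene]
    if len(t_de) >= THRESHOLD:
        return average(t_de)                    # third level: t_de
    t_d = [t for d, _, t in history if d == device]
    if len(t_d) >= THRESHOLD:
        return average(t_d)                     # second level: t_d
    t_e = [t for _, e, t in history if e == scene]
    if len(t_e) >= THRESHOLD:
        return average(t_e)                     # second level: t_e
    return average([t for _, _, t in history])  # first level: t_total
```

In the user B example, the living-room records from the mobile phone and the smart TV together reach the threshold, so the scene-level average t_e is used even though neither single device has enough records.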
Meanwhile, after the voice assistant completes a broadcast text listening event, it sends the user's listening duration for that event to the UP module 213, and the UP module 213 records the historical listening duration and its timestamp at the corresponding level in the three-level historical listening duration information list shown in Table 2.
With the broadcast text generation method proposed in Embodiment 2 of this application, for users whose historical listening durations differ across devices and scenes, the voice assistant 201 can generate target broadcast texts of different lengths for the same voice command, providing users with a more finely grained personalized interaction experience. Embodiment 2 performs refined statistics on the user's historical listening duration by device type and by scene, providing a personalized broadcast voice interaction experience better adapted to the user's usage scenario.
By combining the user's historical listening duration information, the device's parameters, and/or the current scene information in the broadcast text generation process, the dialogue system of the voice assistant 201 can provide the user with broadcast speech whose length and speech rate match the user's listening history and fit the device information and scene information, thereby improving the naturalness of voice interaction and greatly improving the user experience.
Embodiment 3
On the basis of Embodiments 1 and 2, the broadcast text generation method proposed in this embodiment of the application may obtain the broadcast length parameter through a machine learning model. The machine learning model may be implemented based on a random forest and is trained on the user's historical durations of listening to broadcasts, the screen size, the screen resolution, and/or the ambient noise level and the room category. These features are input into the machine learning model, which outputs the broadcast length parameter; the target broadcast text is generated according to the broadcast length parameter and the broadcast content and is broadcast at the corresponding speech rate, providing a personalized broadcast experience.
FIG. 6 is a schematic structural diagram of the random-forest-based machine learning model used in the broadcast text generation method proposed in Embodiment 3 of this application. As shown in FIG. 6, x is the input feature of the machine learning model, and the broadcast length parameter y is the output.
Exemplarily, the input feature x includes data such as user information, device information, and/or scene information, where the user information includes the user's historical listening durations; the device information includes the screen size, screen resolution, and the like of the current broadcast device; and the scene-related data includes the ambient noise level, the room category, and the like.
Exemplarily, the broadcast length parameter y includes a classification result such as "concise" or "moderate", or a predicted length limit value L of the broadcast text.
In a possible implementation, the machine learning model is a classification model: feature data such as user information, device information, and/or scene information is input, and the output broadcast length parameter y is the length classification result of the target broadcast text, denoted as the fourth broadcast length parameter, for example concise, moderate, or detailed. The classification model may be trained using a standard random forest classifier.
In a possible implementation, the machine learning model may be a regression model: feature data such as user information, device information, and/or scene information is input, and the output broadcast length parameter y is the length limit value L of the target broadcast text, denoted as the fifth broadcast length parameter. The regression model may be trained using a standard random forest regressor.
Each initial version of the above machine learning model is obtained through offline training. The system then continuously collects the user's historical listening durations under specific conditions of screen size, screen resolution, and/or ambient noise level and room category for online learning, providing broadcast length parameters that adapt to the user's historical listening habits.
The training data of the machine learning model includes the user's historical listening durations and/or device information, for example, the screen size and/or screen resolution of the current broadcast device, as well as scene information, for example, the ambient noise level and/or the room category; the label of each piece of training data is the broadcast length parameter expected to be generated. Each piece of training data may be obtained through the steps corresponding to Embodiments 1 and 2 above, or collected from the network environment in combination with user feedback, which is not limited here.
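The classifier (fourth broadcast length parameter) and regressor (fifth broadcast length parameter) variants can be sketched as follows. This is a minimal illustration under stated assumptions: the use of scikit-learn, the feature layout, and the tiny synthetic training set are all choices made here for demonstration and are not specified by this application.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Features x: [historical listening duration (s), screen size (in),
#              screen resolution width (px), ambient noise (dB)]
X = np.array([
    [5.0, 6.1, 1080, 60.0],    # short listener, noisy phone scene
    [6.0, 6.1, 1080, 65.0],
    [25.0, 55.0, 3840, 35.0],  # long listener, quiet TV scene
    [30.0, 55.0, 3840, 30.0],
])

# Fourth broadcast length parameter: a length category label per sample
y_class = ["concise", "concise", "detailed", "detailed"]
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y_class)

# Fifth broadcast length parameter: a length limit value L (characters)
y_limit = [40.0, 45.0, 200.0, 220.0]
reg = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y_limit)

label = clf.predict([[5.5, 6.1, 1080, 62.0]])[0]
limit = reg.predict([[28.0, 55.0, 3840, 32.0]])[0]
```

Online learning as described above would periodically refit these models (or use an incrementally trainable learner) as new listening records arrive; the offline-trained model here only stands in for the initial model.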
The NLG module 36 uses the broadcast length parameter output by the above machine learning model to control the generated length of the target broadcast text.
The TTS module 37 uses the broadcast length parameter output by the machine learning model to control the speech rate of the broadcast speech and performs the broadcast at the corresponding speech rate.
Compared with Embodiment 2, the broadcast text generation method proposed in this embodiment of the application introduces a machine learning model, obtains the broadcast length parameter according to the user's historical listening durations, the device information, and/or the scene information, limits the length of the broadcast text and the broadcast speech according to the broadcast length parameter, and keeps the machine learning model learning continuously through an online learning mechanism, updating the broadcast length parameter to match the user's personalization. With the broadcast text generation method of Embodiment 3, the personalized experience of broadcasts generated by the voice assistant 201 becomes more accurate with use.
The broadcast text generation method proposed in Embodiment 3 of this application uses a machine learning model to learn the mapping from the user's historical listening durations to the expected broadcast text length and broadcast speech duration, and provides, through online learning, a personalized experience that becomes more accurate with use; Embodiment 1, by contrast, uses a rule-based mapping.
Embodiment 4
Owing to the development of pre-trained language models, many current NLP tasks can achieve significant metric improvements through this paradigm. The broadcast text generation method proposed in this embodiment of the application may use a pre-trained language model, such as the BERT language model or the GPT-2 language model, to integrate the broadcast length parameter into the controllable NLG module 36/TTS module 37 and generate the broadcast text or speech end to end.
FIG. 7 is a schematic diagram of the structure of the broadcast text generation method proposed in Embodiment 4 of this application based on a typical pre-trained language model. As shown in FIG. 7, the module encodes each category of user information, device information, and/or scene information with a linear encoder (linear), and then obtains the representation vector of the broadcast length parameter through a fusion module (fusion), denoted as the sixth broadcast length parameter. The sixth broadcast length parameter is input into the GPT-2 language model together with the current-round dialogue state output by the DST module 34 and the broadcast content of the current user voice command output by the DPL module 35, to generate a target broadcast text whose length matches the user's listening history.
In a possible implementation, the NLG module 36 first pre-trains the GPT-2 language model with unlabeled text data to acquire linguistic feature information. It then fine-tunes the model using broadcast content information that includes the broadcast content, the dialogue state, the corresponding user information, device information, and/or scene information, together with broadcast results that received positive user feedback, learning the encoder parameters corresponding to each category and adjusting the parameters of the output layer of the pre-trained GPT-2 model, so as to generate a target broadcast text whose length matches the user's listening history and adapt the model to this generation task.
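The per-category linear encoders and the fusion module producing the sixth broadcast length parameter can be sketched numerically as follows (a shape-level illustration only; the dimensions, the concatenate-then-project fusion scheme, and the random stand-in weights — which would in practice be learned during fine-tuning — are assumptions made here, not part of this application):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # representation size fed to the language model (illustrative)

# One linear encoder per information category; random values merely
# stand in for weights that fine-tuning would learn.
W_user = rng.standard_normal((d_model, 2))   # e.g. [t_total, t_de]
W_dev = rng.standard_normal((d_model, 2))    # e.g. [screen size, resolution]
W_scene = rng.standard_normal((d_model, 2))  # e.g. [noise dB, room id]

def encode(W, x):
    return W @ np.asarray(x, dtype=float)

# Fusion: concatenate the per-category encodings and project them down
W_fuse = rng.standard_normal((d_model, 3 * d_model))

def sixth_broadcast_length_parameter(user_x, dev_x, scene_x):
    fused = np.concatenate([encode(W_user, user_x),
                            encode(W_dev, dev_x),
                            encode(W_scene, scene_x)])
    return W_fuse @ fused  # representation vector passed to the LM input

v = sixth_broadcast_length_parameter([15.0, 12.0], [6.1, 1080.0], [60.0, 1.0])
```

The resulting vector v would be prepended to (or otherwise combined with) the dialogue-state and broadcast-content inputs of the GPT-2 model, conditioning generation on the desired length.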
The broadcast text generation method proposed in this embodiment of the application introduces, during broadcast text generation, device information and/or scene information in addition to user information, generating broadcast texts of different lengths. The historical durations of the user listening to broadcast texts are collected through the user information and stored differentially according to the environment and/or the device used; when a broadcast text is generated in a specific scene, the broadcast length parameter for that scene guides the broadcast text generation, producing a target broadcast text that matches the user's habits and fits the device information and/or the usage scene, improving the interaction experience and efficiency and providing a personalized voice assistant 201 that better understands the user.
In addition to generating broadcast text or speech according to a user request, the methods of the above embodiments of this application may also be used when the voice assistant 201 proactively issues a greeting, generates broadcast text or speech when the system is powered on or off, or generates broadcast text or speech in other scenarios that may match the user's personalized usage records, device information, and/or scene information.
Embodiment 5
This embodiment of the application proposes a text broadcasting method that can generate broadcast speech according to a user request, introduce user information in the broadcast speech generation stage, and control the speech rate of the target broadcast speech according to the user's historical listening durations recorded in the user information, providing a personalized interaction experience, distinct for every user, between the user and the voice assistant.
This embodiment of the application proposes a text broadcasting method, including: receiving a voice command of a user; generating a target broadcast text corresponding to the voice command; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information.
The voice assistant may determine the broadcast length parameter according to the average or weighted average of multiple pieces of historical listening duration information. For details, refer to the implementations related to determining the broadcast length parameter in Embodiment 1, which are not repeated here.
In some possible implementations, the broadcast length parameter is associated with device information, and a first broadcast length parameter may be determined according to the device information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, where the first broadcast length parameter indicates first historical listening duration information associated with the device information. For details, refer to the implementations related to the first broadcast length parameter in Embodiment 2, which are not repeated here.
In some possible implementations, the broadcast length parameter is associated with scene information, and a second broadcast length parameter may be determined according to the scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information. For details, refer to the implementations related to the second broadcast length parameter in Embodiment 2, which are not repeated here.
In some possible implementations, the broadcast length parameter is associated with device information and scene information, and a third broadcast length parameter may be determined according to the device information and the scene information. Controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information. For details, refer to the implementations related to the third broadcast length parameter in Embodiment 2, which are not repeated here.
In some possible implementations, controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting historical listening duration information, device information, and/or scene information into a classification model; outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter. For details, refer to the implementations related to obtaining the fourth broadcast length parameter through a classification model in Embodiment 3, which are not repeated here.
In a possible implementation, controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may include: inputting historical listening duration information, device information, and/or scene information into a regression model; outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter. For details, refer to the implementations related to obtaining the fifth broadcast length parameter through a regression model in Embodiment 3, which are not repeated here.
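One way the broadcast length parameter could drive the broadcast speed is sketched below (an illustrative mapping only; the category scale factors, the characters-per-second base rate, and the clamping range are assumptions made here, not values specified by this application):

```python
def speech_rate_for(length_param, text_chars=0, base_rate=4.0):
    """Map a broadcast length parameter to a TTS speech rate.

    length_param: either a length category ("concise"/"moderate"/"detailed"),
    as output by the classification model, or a target listening duration in
    seconds. base_rate is a nominal rate in characters per second.
    """
    category_scale = {"concise": 1.25, "moderate": 1.0, "detailed": 0.85}
    if isinstance(length_param, str):
        # categorical parameter: speak faster for concise, slower for detailed
        return base_rate * category_scale[length_param]
    # numeric parameter: fit text_chars characters into length_param seconds,
    # clamped to a comfortable range around the base rate
    rate = text_chars / length_param
    return min(max(rate, 0.5 * base_rate), 2.0 * base_rate)

rate = speech_rate_for(10.0, text_chars=60)  # target ~10 s for 60 characters
```

The clamp keeps the synthesized speech within an intelligible range even when the target duration is much shorter or longer than the text would naturally take.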
It can be understood that the embodiments of this application are not isolated embodiments; a person skilled in the art may associate or combine the embodiments, and the associated and combined solutions all fall within the protection scope of the embodiments of this application.
An embodiment of this application provides an electronic device, including: at least one memory configured to store a program; and at least one processor configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method of any one of the above embodiments.
An embodiment of this application provides a storage medium storing instructions that, when run on a terminal, cause the terminal to perform the method of any one of the above embodiments.
In a plain-text generation scenario, the broadcast text listening duration defined in the embodiments of this application may also be converted into an equivalent metric, such as the time the user spends viewing the broadcast text.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed by hardware or software depends on the particular application and design constraints of the technical solution. A person skilled in the art may use different methods to implement the described functions for each particular application, but such implementation shall not be considered beyond the scope of the embodiments of this application.
In addition, various aspects or features of the embodiments of this application may be implemented as a method, an apparatus, or an article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" used in this application covers a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media may include but are not limited to: magnetic storage devices (for example, a hard disk, a floppy disk, or a magnetic tape), optical discs (for example, a compact disc (CD) or a digital versatile disc (DVD)), smart cards, and flash memory devices (for example, an erasable programmable read-only memory (EPROM), a card, a stick, or a key drive). In addition, the various storage media described herein may represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable media" may include but is not limited to wireless channels and various other media capable of storing, including, and/or carrying instructions and/or data.
It should be understood that, in the various embodiments of this application, the sequence numbers of the above processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and shall not constitute any limitation on the implementation processes of the embodiments of this application.
A person skilled in the art may clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application essentially, or the part contributing to the prior art, or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, an access network device, or the like) to perform all or some of the steps of the methods described in the various embodiments of this application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above descriptions are merely specific implementations of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

  1. A broadcast text generation method, characterized in that the method comprises:
    receiving a voice command of a user;
    obtaining broadcast content corresponding to the voice command; and
    generating a target broadcast text according to a broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening duration information.
  2. The broadcast text generation method according to claim 1, characterized in that the generating a target broadcast text according to a broadcast length parameter and the broadcast content comprises:
    using the broadcast content and the broadcast length parameter as input to a model, the model outputting a target broadcast text, wherein the target broadcast text is a broadcast text whose duration matches the broadcast length parameter.
  3. The method for generating broadcast text according to claim 2, wherein the model is a generative model or a retrieval model; and
    the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    using the broadcast content and the broadcast length parameter as input to the generative model, wherein the generative model generates and outputs the target broadcast text; or
    using the broadcast content and the broadcast length parameter as input to the retrieval model, wherein the retrieval model retrieves a length-limited text template from a predefined template library according to the broadcast length parameter, and the target broadcast text is output through the retrieved length-limited text template, the target broadcast text being a broadcast text whose duration matches the historical listening duration information.
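For illustration only, the retrieval branch of claim 3 can be sketched as selecting, from a template library, the template whose estimated spoken duration is closest to the broadcast length parameter. The template texts, the assumed speaking rate, and the library structure below are the editor's assumptions, not part of the claimed method.

```python
# Hypothetical sketch: pick the template whose estimated spoken duration
# best matches the broadcast length parameter (in seconds).

WORDS_PER_SECOND = 2.5  # assumed average speaking rate

TEMPLATE_LIBRARY = {
    "weather": [
        "Sunny, 25 degrees.",
        "Today will be sunny with a high of 25 degrees and light wind.",
        "Good morning! Today's forecast is sunny with a high of 25 degrees, "
        "a low of 18 degrees, and a gentle breeze from the southeast.",
    ],
}

def estimate_duration(text: str) -> float:
    """Estimate spoken duration in seconds from the word count."""
    return len(text.split()) / WORDS_PER_SECOND

def retrieve_template(topic: str, target_seconds: float) -> str:
    """Return the template whose estimated duration is closest to the target."""
    candidates = TEMPLATE_LIBRARY[topic]
    return min(candidates, key=lambda t: abs(estimate_duration(t) - target_seconds))
```

With a short target the shortest template wins, and a longer target selects a more detailed one.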
  4. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with device information, a first broadcast length parameter is determined according to the device information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically comprises:
    generating a first target broadcast text according to the first broadcast length parameter and the broadcast content, wherein the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  5. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically comprises:
    generating a second target broadcast text according to the second broadcast length parameter and the broadcast content, wherein the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  6. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically comprises:
    generating a third target broadcast text according to the third broadcast length parameter and the broadcast content, wherein the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  7. The method for generating broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    inputting the historical listening duration information, the device information and/or the scene information into a classification model, and outputting a fourth broadcast length parameter, wherein the fourth broadcast length parameter is one of different length categories; and
    generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content.
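For illustration only, the classification model of claim 7 can be mimicked by a rule-based stand-in that maps the historical listening duration, the device, and the scene to a discrete length category. The thresholds, device names, and category labels are the editor's assumptions, not the claimed model.

```python
# Hypothetical rule-based stand-in for a length-category classifier.

def classify_length(history_seconds: float, device: str, scene: str) -> str:
    """Return a discrete broadcast-length category: short, medium, or long."""
    # Driving scenes and small-screen devices favour shorter broadcasts.
    if scene == "driving" or device == "watch":
        return "short"
    if history_seconds < 5:
        return "short"
    if history_seconds < 15:
        return "medium"
    return "long"
```

A trained classifier would learn these boundaries from logged interactions instead of hard-coding them.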
  8. The method for generating broadcast text according to claim 1, wherein the broadcast length parameter is associated with device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    inputting the historical listening duration information, the device information and/or the scene information into a regression model, and outputting a fifth broadcast length parameter, wherein the fifth broadcast length parameter is a length limit value; and
    generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content.
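For illustration only, the regression model of claim 8 differs from claim 7's classifier in that it outputs a continuous length limit rather than a category. A minimal linear sketch is shown below; the weights and feature encodings are the editor's assumptions, not the claimed model.

```python
# Hypothetical linear-regression stand-in producing a length limit in seconds.

DEVICE_WEIGHT = {"watch": 0.5, "phone": 1.0, "speaker": 1.5}
SCENE_WEIGHT = {"driving": 0.6, "home": 1.2}

def predict_length_limit(history_seconds: float, device: str, scene: str) -> float:
    """Predict a broadcast length limit (seconds) from simple weighted features."""
    base = 0.8 * history_seconds  # lean slightly below past listening time
    return base * DEVICE_WEIGHT.get(device, 1.0) * SCENE_WEIGHT.get(scene, 1.0)
```

In a real system the weights would be fit to logged listening data rather than fixed by hand.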
  9. The method for generating broadcast text according to claim 1, wherein the broadcast length parameter is associated with device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    performing linear encoding on the device information, the scene information and/or the historical listening duration information respectively, and fusing the encoded results to obtain a sixth broadcast length parameter, wherein the sixth broadcast length parameter is a representation vector of the broadcast length parameter; and
    using the sixth broadcast length parameter, the broadcast content, and whether the voice instruction is executable or not executable as input to a pre-trained language model, and outputting a sixth target broadcast text.
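For illustration only, the encode-and-fuse step of claim 9 can be sketched with one-hot encodings fused by concatenation into a single representation vector. The vocabularies, dimensions, and scaling are the editor's assumptions; the claim does not specify them.

```python
# Hypothetical sketch: linearly encode each signal, then fuse by concatenation.

DEVICE_VOCAB = {"phone": 0, "speaker": 1, "watch": 2}
SCENE_VOCAB = {"home": 0, "driving": 1}

def one_hot(index: int, size: int) -> list[float]:
    """Return a one-hot vector of the given size."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

def encode_length_parameter(device: str, scene: str, history_seconds: float) -> list[float]:
    """Fuse linearly encoded features into one representation vector."""
    device_vec = one_hot(DEVICE_VOCAB[device], len(DEVICE_VOCAB))
    scene_vec = one_hot(SCENE_VOCAB[scene], len(SCENE_VOCAB))
    history_vec = [history_seconds / 60.0]  # scale to roughly [0, 1]
    return device_vec + scene_vec + history_vec  # concatenation as fusion
```

The resulting vector is what a pre-trained language model could consume, alongside the broadcast content, as a conditioning input.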
  10. The method for generating broadcast text according to any one of claims 1-9, wherein the obtaining broadcast content corresponding to the voice instruction comprises:
    obtaining intent and slot information according to the voice instruction;
    determining, according to the intent and slot information, whether the voice instruction is executable; and
    generating the broadcast content when the voice instruction is not executable, wherein the broadcast content is inquiry information.
  11. The method for generating broadcast text according to any one of claims 1-9, wherein the obtaining broadcast content corresponding to the voice instruction comprises:
    obtaining intent and slot information according to the voice instruction;
    determining, according to the intent and slot information, whether the voice instruction is executable;
    determining, when the voice instruction is executable, a third-party service for executing the intent; and
    obtaining the broadcast content from the third-party service, wherein the broadcast content is result information corresponding to the voice instruction.
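For illustration only, the branching logic of claims 10 and 11 can be sketched as: a command is executable only when all required slots for its intent are filled; otherwise the broadcast content becomes a follow-up question. The intent schema and wording are the editor's assumptions.

```python
# Hypothetical sketch of the executability check and resulting broadcast content.

REQUIRED_SLOTS = {"set_alarm": ["time"], "play_music": ["song"]}

def build_broadcast_content(intent: str, slots: dict) -> dict:
    """Decide whether the instruction is executable and produce broadcast content."""
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    if missing:
        # Not executable: ask the user for the missing information (claim 10).
        return {"executable": False, "content": f"Which {missing[0]} would you like?"}
    # Executable: a real system would now invoke the third-party service (claim 11).
    return {"executable": True, "content": f"OK, handling {intent}."}
```

In a full dialogue system the second branch would replace the placeholder string with result information returned by the service.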
  12. The method for generating broadcast text according to any one of claims 1-11, wherein the method further comprises:
    controlling a broadcast speed of the target broadcast text according to the broadcast length parameter.
  13. The method for generating broadcast text according to any one of claims 1-12, wherein the method further comprises:
    recording a broadcast duration of the current target broadcast text to obtain the historical listening duration information.
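For illustration only, the recording step of claim 13 can be sketched as a small history store that logs each broadcast's duration and exposes the historical listening duration as a running mean. The storage shape and the choice of a simple mean are the editor's assumptions.

```python
# Hypothetical sketch: accumulate broadcast durations into listening history.

class ListeningHistory:
    def __init__(self) -> None:
        self.durations: list[float] = []

    def record(self, seconds: float) -> None:
        """Record the duration of the broadcast just played."""
        self.durations.append(seconds)

    def average_seconds(self) -> float:
        """Historical listening duration as a simple mean (0.0 if empty)."""
        if not self.durations:
            return 0.0
        return sum(self.durations) / len(self.durations)
```

The value returned by `average_seconds` is one plausible source for the broadcast length parameter used in claim 1.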
  14. A method for broadcasting text, applied to a voice assistant, wherein the method comprises:
    receiving a voice instruction of a user;
    generating a target broadcast text corresponding to the voice instruction; and
    controlling a broadcast speed of the target broadcast text according to a broadcast length parameter, wherein the broadcast length parameter indicates historical listening duration information.
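For illustration only, the speed control of claim 14 can be sketched as raising or lowering the speech rate just enough for a fixed text to fit the broadcast length parameter, clamped to a comfortable range. The baseline rate and clamping bounds are the editor's assumptions.

```python
# Hypothetical sketch: choose a speech rate so the text fits the target duration.

BASE_WORDS_PER_SECOND = 2.5  # assumed baseline speaking rate

def speech_rate_for(text: str, target_seconds: float) -> float:
    """Return a words-per-second rate so the text fits the target duration."""
    words = len(text.split())
    needed = words / target_seconds
    # Clamp to a comfortable range around the baseline (80% to 150%).
    return min(max(needed, BASE_WORDS_PER_SECOND * 0.8), BASE_WORDS_PER_SECOND * 1.5)
```

A text-to-speech engine would then be driven at the returned rate; texts far too long for the target are capped rather than played unintelligibly fast.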
  15. The method for broadcasting text according to claim 14, wherein the broadcast length parameter is associated with device information, a first broadcast length parameter is determined according to the device information, and the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, wherein the first broadcast length parameter indicates first historical listening duration information associated with the device information.
  16. The method for broadcasting text according to claim 14, wherein the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, wherein the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
  17. The method for broadcasting text according to claim 14, wherein the broadcast length parameter is associated with device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter, wherein the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
  18. The method for broadcasting text according to claim 14, wherein the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    inputting the historical listening duration information, device information and/or scene information into a classification model, and outputting a fourth broadcast length parameter, wherein the fourth broadcast length parameter is one of different length categories; and
    controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
  19. The method for broadcasting text according to claim 14, wherein the controlling a broadcast speed of the target broadcast text according to the broadcast length parameter comprises:
    inputting the historical listening duration information, device information and/or scene information into a regression model, and outputting a fifth broadcast length parameter, wherein the fifth broadcast length parameter is a length limit value; and
    controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
  20. An electronic device, comprising:
    at least one memory configured to store a program; and
    at least one processor configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1-19.
  21. A storage medium, wherein the storage medium stores instructions, and when the instructions are run on a terminal, the terminal is caused to perform the method according to any one of claims 1-19.
PCT/CN2022/095805 2021-06-30 2022-05-28 Broadcasting text generation method and apparatus, and electronic device WO2023273749A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202280029750.4A CN117203703A (en) 2021-06-30 2022-05-28 Method and device for generating broadcast text and electronic equipment

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110741280.1 2021-06-30
CN202110741280 2021-06-30
CNPCT/CN2022/084068 2022-03-30
CN2022084068 2022-03-30

Publications (1)

Publication Number Publication Date
WO2023273749A1 true WO2023273749A1 (en) 2023-01-05

Family

ID=84692502

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/095805 WO2023273749A1 (en) 2021-06-30 2022-05-28 Broadcasting text generation method and apparatus, and electronic device

Country Status (2)

Country Link
CN (1) CN117203703A (en)
WO (1) WO2023273749A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156275A (en) * 2023-04-19 2023-05-23 江西省气象服务中心(江西省专业气象台、江西省气象宣传与科普中心) Meteorological information broadcasting method and system
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766428A (en) * 2018-06-01 2018-11-06 安徽江淮汽车集团股份有限公司 A kind of voice broadcast control method and system
US20180332357A1 (en) * 2015-11-30 2018-11-15 Sony Corporation Information processing apparatus, information processing method, and program
CN108846054A (en) * 2018-05-31 2018-11-20 出门问问信息科技有限公司 A kind of audio data continuous playing method and device
US20180350366A1 (en) * 2017-05-30 2018-12-06 Hyundai Motor Company Situation-based conversation initiating apparatus, system, vehicle and method
CN110136705A (en) * 2019-04-10 2019-08-16 华为技术有限公司 A kind of method and electronic equipment of human-computer interaction
CN111081244A (en) * 2019-12-23 2020-04-28 广州小鹏汽车科技有限公司 Voice interaction method and device
CN112071313A (en) * 2020-07-22 2020-12-11 特斯联科技集团有限公司 Voice broadcasting method and device, electronic equipment and medium
CN112700775A (en) * 2020-12-29 2021-04-23 维沃移动通信有限公司 Method and device for updating voice receiving period and electronic equipment
CN112820289A (en) * 2020-12-31 2021-05-18 广东美的厨房电器制造有限公司 Voice playing method, voice playing system, electric appliance and readable storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156275A (en) * 2023-04-19 2023-05-23 江西省气象服务中心(江西省专业气象台、江西省气象宣传与科普中心) Meteorological information broadcasting method and system
CN117789680A (en) * 2024-02-23 2024-03-29 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model
CN117789680B (en) * 2024-02-23 2024-05-24 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Also Published As

Publication number Publication date
CN117203703A (en) 2023-12-08

Similar Documents

Publication Publication Date Title
US11676575B2 (en) On-device learning in a hybrid speech processing system
US20210142794A1 (en) Speech processing dialog management
Latif et al. A survey on deep reinforcement learning for audio-based applications
WO2023273749A1 (en) Broadcasting text generation method and apparatus, and electronic device
US11189277B2 (en) Dynamic gazetteers for personalized entity recognition
US20240153525A1 (en) Alternate response generation
US11250857B1 (en) Polling with a natural language interface
US11580982B1 (en) Receiving voice samples from listeners of media programs
US11276403B2 (en) Natural language speech processing application selection
US11687526B1 (en) Identifying user content
US11070644B1 (en) Resource grouped architecture for profile switching
US11348601B1 (en) Natural language understanding using voice characteristics
US11605376B1 (en) Processing orchestration for systems including machine-learned components
CN114051639A (en) Emotion detection using speaker baseline
US10600419B1 (en) System command processing
US11132994B1 (en) Multi-domain dialog state tracking
US11257482B2 (en) Electronic device and control method
US20230377574A1 (en) Word selection for natural language interface
US11893310B2 (en) System command processing
US11335346B1 (en) Natural language understanding processing
US11996081B2 (en) Visual responses to user inputs
US11288513B1 (en) Predictive image analysis
CN113393841A (en) Training method, device and equipment of speech recognition model and storage medium
US20210337274A1 (en) Artificial intelligence apparatus and method for providing visual information
US11430435B1 (en) Prompts for user feedback

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22831571

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202280029750.4

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE