CN117203703A - Method and device for generating broadcast text and electronic equipment

Info

Publication number: CN117203703A
Application number: CN202280029750.4A
Other languages: Chinese (zh)
Inventor: 陈开济
Applicant and current assignee: Huawei Technologies Co Ltd
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue


Abstract

The application provides a method for generating broadcast text, relating to the field of artificial intelligence (AI) and applied to a voice assistant. The method includes the following steps: receiving a voice instruction of a user; acquiring broadcast content corresponding to the voice instruction; and generating a target broadcast text according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information. The method uses the historical duration for which the user has listened to broadcast text, differentiates the broadcast text according to the situation and the device in use, and guides text generation with the historical listening duration in a given scenario. The broadcast speed of the target broadcast text is further controlled according to the broadcast length parameter, yielding target broadcast speech that matches the user's historical usage habits and adapts to the device information and usage scenario, improving interaction experience and efficiency and providing the user with a personalized voice assistant.

Description

Method and device for generating broadcast text and electronic equipment
The present application claims priority to Chinese patent application No. 202110741280.1, filed with the Chinese Patent Office on 30 March 2021 and entitled "Method, apparatus and electronic device for generating broadcast text", and to international application No. PCT/CN2022/084068, filed on 30 March 2022. The entire contents of both applications are incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the field of artificial intelligence (AI), and in particular to a method and an apparatus for generating broadcast text, and an electronic device.
Background
A voice assistant, or virtual assistant, is agent software that can execute tasks or services on a person's behalf, and is widely used in devices such as smartphones, smart speakers, and intelligent vehicle-mounted terminals (electronic control units, ECUs). The voice assistant or virtual assistant provides a voice user interface (VUI) and performs a corresponding task or provides a related service according to the user's voice instruction. After executing the user's voice instruction, the voice assistant generates a broadcast text and converts it into corresponding broadcast speech through a text-to-speech (TTS) module, informing the user of the broadcast content and guiding the user to continue using the device.
Current voice assistants generally generate broadcast text in a fixed pattern: when interacting with different users, the broadcast speech/text is identical. How to provide broadcasts that match each user's personal usage habits and improve the naturalness of interaction is a problem to be solved urgently.
Disclosure of Invention
To solve the above problems, the embodiments of the present application provide a method, an apparatus, a terminal device and a system for generating broadcast text.
In a first aspect, an embodiment of the present application provides a method for generating broadcast text, the method including: receiving a voice instruction of a user; acquiring broadcast content corresponding to the voice instruction; and generating a target broadcast text according to a broadcast length parameter and the broadcast content, where the broadcast length parameter indicates historical listening duration information. In this way, voice assistant broadcasts that match the user's personal historical usage habits can be provided, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as inputs of a model, the model outputting a target broadcast text, where the target broadcast text is a broadcast text whose duration matches the broadcast length parameter. In this way, voice assistant broadcast text that matches the user's personal historical usage habits can be provided through the model according to the broadcast length parameter, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the model is a generative model or a retrieval model, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: using the broadcast content and the broadcast length parameter as inputs of a generative model, the generative model outputting a target broadcast text, where the target broadcast text is a broadcast text whose duration matches the broadcast length parameter; or
using the broadcast content and the broadcast length parameter as inputs of a retrieval model, the retrieval model retrieving a length-limited text template from a predefined template library according to the broadcast length parameter, and outputting a target broadcast text through the retrieved length-limited text template, where the target broadcast text is a broadcast text whose duration matches the historical listening duration information. In this way, voice assistant broadcast text that matches the user's personal historical usage habits can be provided through the generative model or the retrieval model according to the broadcast length parameter, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
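For illustration only, a minimal Python sketch of how such a retrieval model could work is given below; the template library, the per-template length estimates, and all names in it are assumptions of this sketch, not the implementation claimed by the application.

```python
# Hedged sketch of a retrieval model: pick the longest predefined template
# whose estimated character length fits the broadcast length parameter.
# TEMPLATE_LIBRARY and all names here are hypothetical.
TEMPLATE_LIBRARY = {
    "check_weather": [
        # (estimated character length, template)
        (15, "{location}: {low}-{high}°C"),
        (35, "{condition} in {location} today, {low}-{high}°C"),
        (60, "It is {condition} in {location} today, with a low of {low}°C and a high of {high}°C."),
    ],
}

def retrieve_template(intent: str, length_limit: int) -> str:
    """Return the longest template whose estimated length fits the limit."""
    fitting = [t for est, t in TEMPLATE_LIBRARY[intent] if est <= length_limit]
    return fitting[-1] if fitting else TEMPLATE_LIBRARY[intent][0][1]

def generate_target_text(intent: str, slots: dict, length_limit: int) -> str:
    return retrieve_template(intent, length_limit).format(**slots)

slots = {"location": "Nanjing", "condition": "sunny", "low": 15, "high": 23}
print(generate_target_text("check_weather", slots, length_limit=40))
# -> "sunny in Nanjing today, 15-23°C"
```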
In one possible implementation, the broadcast length parameter is associated with device information, a first broadcast length parameter is determined according to the device information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a first target broadcast text according to the first broadcast length parameter and the broadcast content, where the first broadcast length parameter indicates first historical listening duration information associated with the device information. In this way, voice assistant broadcasts that match the user's personal historical usage habits and adapt to the device can be provided, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a second target broadcast text according to the second broadcast length parameter and the broadcast content, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information. In this way, voice assistant broadcasts that match the user's personal historical usage habits and the current scene can be provided, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the broadcast length parameter is associated with device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content specifically includes: generating a third target broadcast text according to the third broadcast length parameter and the broadcast content, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information. In this way, voice assistant broadcasts that match the user's personal historical usage habits, adapt to the device, and suit the current scene can be provided, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, the device information and/or the scene information into a classification model, and outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is a length type; and generating a fourth target broadcast text according to the fourth broadcast length parameter and the broadcast content. In this way, the broadcast length parameter obtained through the classification model yields voice assistant broadcasts that match the user's personal historical usage habits and adapt to the device and/or the current scene, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: inputting the historical listening duration information, device information and/or scene information into a regression model, and outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and generating a fifth target broadcast text according to the fifth broadcast length parameter and the broadcast content. In this way, voice assistant broadcasts that match the user's personal historical usage habits and adapt to the device and/or the current scene can be generated through the regression model, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
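As a hedged illustration of the classification and regression variants above, the sketch below predicts a length type and a length limit from a toy feature vector of historical listening duration, device type and scene; the features, training data and random-forest choice are assumptions of this sketch (a random-forest-based model is also discussed in the third embodiment).

```python
# Hedged sketch: predict the broadcast length parameter from historical
# listening duration, device info and scene info. Feature encoding and
# the toy training data are invented for illustration.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# toy features: [historical listening duration (s), device type id, scene id]
X = [[3.0, 0, 0], [5.0, 0, 1], [12.0, 1, 0], [20.0, 1, 1]]
y_type = ["short", "short", "medium", "long"]  # length type (classification)
y_limit = [10, 18, 40, 70]                     # character limit (regression)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y_type)
reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y_limit)

features = [[6.0, 0, 1]]                 # current user/device/scene
print(clf.predict(features)[0])          # fourth-style parameter: a length type
print(int(reg.predict(features)[0]))     # fifth-style parameter: a length limit
```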
In one possible implementation, the broadcast length parameter is associated with device information and/or scene information, and generating the target broadcast text according to the broadcast length parameter and the broadcast content includes: linearly encoding the device information, the scene information and/or the historical listening duration information respectively and then fusing them to obtain a sixth broadcast length parameter, where the sixth broadcast length parameter is a characterization vector of the broadcast length parameter; and using the sixth broadcast length parameter, the broadcast content and the voice instruction as inputs of a pre-trained language model, and outputting a sixth target broadcast text. In this way, voice assistant broadcasts that match the user's personal historical usage habits and adapt to the device and/or the current scene can be generated through the pre-trained language model, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
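The encode-and-fuse step could look like the following sketch; the embedding width, the random projection matrices, and the fusion by summation are all assumptions made here for illustration, not the claimed implementation.

```python
# Hedged sketch: device, scene and historical-duration features are each
# linearly encoded and fused into one characterization vector, which a
# pre-trained language model would consume alongside the broadcast content.
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding width, chosen arbitrarily

W_dev, W_scene, W_hist = (rng.normal(size=(d, n)) for n in (4, 3, 1))

device_feat = np.array([1.0, 0.0, 0.0, 1.0])  # e.g. one-hot device type + has-screen flag
scene_feat = np.array([0.0, 1.0, 0.0])        # e.g. one-hot scene (driving)
hist_feat = np.array([6.5])                   # mean historical listening duration (s)

# linear encodings, fused by summation into the sixth broadcast length parameter
length_param_vec = W_dev @ device_feat + W_scene @ scene_feat + W_hist @ hist_feat
print(length_param_vec.shape)  # (8,) - prepended to the language model's inputs
```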
In one possible implementation, the acquiring the broadcast content corresponding to the voice instruction includes: acquiring intent and slot information according to the voice instruction; determining, according to the intent and slot information, whether the voice instruction is executable; and generating broadcast content that is inquiry information when the voice instruction is not executable. In this way, the broadcast content with which the voice assistant queries the user can be obtained when the voice instruction is not executable.
In one possible implementation, the determining the broadcast content according to the dialogue state includes: acquiring intent and slot information according to the voice instruction; determining, according to the intent and slot information, whether the voice instruction is executable; determining, when the voice instruction is executable, a third-party service that performs the intent; and acquiring the broadcast content from the third-party service, the broadcast content being the result information corresponding to the voice instruction. In this way, the broadcast content returned after the third-party service executes the voice instruction is obtained when the voice instruction is executable.
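A minimal sketch of this executability logic, assuming a hypothetical slot schema and a stubbed third-party service, is given below.

```python
# Hedged sketch: an instruction is executable only when the intent is known
# and all required slots are filled; otherwise the broadcast content is an
# inquiry back to the user. REQUIRED_SLOTS and the stub are hypothetical.
REQUIRED_SLOTS = {"check_weather": ["location", "date"]}

def call_third_party(intent: str, slots: dict) -> dict:
    # stand-in for the real third-party service call
    return {"temperature": "15-23", "unit": "°C", "location": slots["location"]}

def broadcast_content(intent: str, slots: dict) -> dict:
    missing = [s for s in REQUIRED_SLOTS.get(intent, []) if s not in slots]
    if missing:  # not executable: ask the user for the missing slot
        return {"action": f"REQUEST-SLOT:{missing[0]}"}
    return {"action": "INFORM", "result": call_third_party(intent, slots)}

print(broadcast_content("check_weather", {"location": "Nanjing"}))
# -> {'action': 'REQUEST-SLOT:date'}  (asks the user which day)
```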
In one possible implementation, the method further includes controlling the broadcast speed of the target broadcast text according to the broadcast length parameter. In this way, speech that matches the user's personal historical usage habits and adapts to the device and/or the current scene can be generated, giving each user a personalized broadcast experience and improving the naturalness of voice assistant interaction.
In one possible implementation, the method further includes: recording the broadcast duration of the current target broadcast text to obtain the historical listening duration information. In this way, a personalized broadcast experience matching the user's personal historical usage habits can be obtained, improving the naturalness of voice assistant interaction.
In a second aspect, an embodiment of the present application provides a method for broadcasting text, the method including: receiving a voice instruction of a user; generating a target broadcast text corresponding to the voice instruction; and controlling the broadcast speed of the target broadcast text according to a broadcast length parameter, where the broadcast length parameter indicates historical listening duration information. The beneficial effects of controlling the broadcast speed of the target broadcast text according to the broadcast length parameter are the same as those of the embodiments of the first aspect that generate the target broadcast text according to the broadcast length parameter, and are not repeated here.
In one possible implementation manner, the broadcast length parameter is associated with equipment information, a first broadcast length parameter is determined according to the equipment information, and the broadcast speed of the target broadcast text is controlled according to the broadcast length parameter, including: controlling the broadcasting speed of the target broadcasting text according to the first broadcasting length parameter; the first broadcast length parameter indicates first historical listening period information associated with the device information.
In one possible implementation, the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information.
In one possible implementation, the broadcast length parameter is associated with device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information.
In one possible implementation, the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, the device information and/or the scene information into a classification model, and outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is a length type; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter.
In one possible implementation, the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes: inputting the historical listening duration information, device information and/or scene information into a regression model, and outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, the processor being configured to perform the method according to any one of the embodiments described above when the program stored in the memory is executed.
In a fourth aspect, an embodiment of the present application provides a storage medium having instructions stored therein which, when executed on a terminal, cause the terminal to perform the method according to any one of the embodiments described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an artificial intelligence subject framework;
FIG. 2 is a schematic diagram of an application system of a voice assistant according to an embodiment of the present application;
FIG. 3 is a functional architecture diagram of a voice assistant in an embodiment of the present application;
FIG. 4 is a flowchart of a method for generating broadcast text according to the first embodiment of the present application;
FIG. 5 is an application schematic diagram of a method for generating broadcast text according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a random-forest-based machine learning model in a method for generating broadcast text according to the third embodiment of the present application;
FIG. 7 is a schematic diagram of a method for generating broadcast text according to the fourth embodiment of the present application, based on a typical pre-trained language model structure.
Detailed Description
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first/second/third" and the like, or module A, module B, module C and the like, are merely used to distinguish similar objects and do not represent a particular ordering of the objects; it should be understood that particular orders or precedences may be interchanged where permitted, so that the embodiments of the application described herein can be practiced in orders other than those illustrated or described herein.
In the following description, reference numerals indicating steps such as S110, S120, … …, etc. do not necessarily indicate that the steps are performed in this order, and the order of the steps may be interchanged or performed simultaneously as allowed.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application.
Natural language generation (natural language generation, NLG) is part of natural language processing: the generation of natural language from a knowledge base or from a machine representation in logical form. NLG can be viewed as the inverse of natural language understanding (natural language understanding, NLU): NLU must clarify the meaning of the input language and produce a machine representation, whereas NLG must decide how to convert the conceptual machine representation into natural language the user can receive.
In one possible scheme, a user wakes up the voice assistant and issues a voice instruction about querying the weather. The voice assistant uses its natural language understanding (NLU) capability to understand the weather-related voice instruction, classifies it according to a natural language classification system similar to Table 1, queries the weather according to the classification result, and then either selects a corresponding template based on the weather query result to generate the broadcast text, or generates a broadcast text corresponding to the weather information category and its associated attributes, the broadcast text content matching the category to which the voice instruction belongs.
TABLE 1
In this scheme, different types of broadcast text are generated according to the different voice instructions input by the user, but the content of the broadcast text is related only to the category of the input voice instruction; the user's personal usage habits, device differences and scene differences are not considered, so a personalized weather broadcast experience cannot be provided for each user.
The embodiment of the application provides a method for generating broadcast text. The method relates to the AI field and is suitable for a voice assistant. By introducing user information, device information and/or scene information, it can generate broadcast text of personalized duration according to the user's personal usage habits, device differences and/or environment, and generate broadcast speech at the corresponding speech rate through TTS, informing the user of the broadcast content and guiding the user to continue using the device.
FIG. 1 illustrates a schematic diagram of an artificial intelligence framework that describes the overall workflow of an artificial intelligence system, applicable to general artificial intelligence field requirements. The artificial intelligence subject framework is described below based on the two dimensions of the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis) shown in fig. 1.
The "intelligent information chain" reflects a list of processes from the acquisition of data to the processing. For example, there may be general procedures of intelligent information awareness, intelligent information representation and formation, intelligent reasoning, intelligent decision making, intelligent execution and output. In this process, the data undergoes a "data-information-knowledge-wisdom" gel process.
The "IT value chain" reflects the value of artificial intelligence brought to the information technology industry from the underlying infrastructure, information providing and processing technology implementation of artificial intelligence to the industrial ecological process of the system.
(1) Infrastructure 10:
Infrastructure 10 provides computing power support for the artificial intelligence system, enables communication with the outside world, and provides support through the base platform. Sensors communicate with the outside to obtain data streams; smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs) provide training, computing and execution capabilities; the base platform provides cloud storage, cloud computing, network interconnection and other related platform guarantees and support, including distributed computing frameworks and networks.
(2) Data 11
The data 11 of the upper layer of the infrastructure 10 is used to represent a data source in the field of artificial intelligence.
In the method for generating the broadcast text according to the embodiment of the present application, the data 11 of the upper layer of the infrastructure 10 is derived from the voice command acquired at the terminal side, the equipment information of the terminal used, and the scene information acquired by communicating with the outside through the sensor.
(3) Data processing 12
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
Wherein machine learning and deep learning can perform symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Reasoning refers to the process of simulating human intelligent reasoning modes in a computer or an intelligent system, and carrying out machine thinking and problem solving by using formal information according to a reasoning control strategy, and typical functions are searching and matching.
Decision making refers to the process of making decisions after intelligent information is inferred, and generally provides functions of classification, sequencing, prediction and the like.
In the method for generating broadcast text provided by the embodiment of the application, the data processing process includes performing front-end processing, speech recognition (ASR), semantic understanding (NLU), dialogue management (DM), natural language generation (NLG), speech synthesis (TTS) and the like on the received voice instruction of the user.
(4) General capability 13
After the data has been processed as described above, some general-purpose capabilities may be formed, such as algorithms or a general-purpose system, based further on the results of the data processing.
In the embodiment of the application, after the above data processing has been performed on the voice instruction input by the user, the device information of the terminal in use, and the scene information obtained by the sensors communicating with the outside, a broadcast text of personalized duration can be generated based on the result of the data processing, and broadcast speech at the corresponding speech rate can be generated, providing each user with a personalized broadcast experience.
(5) Intelligent product and industry application 14
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution and realize practical deployment through intelligent information decision making. The application fields mainly include intelligent manufacturing, intelligent transportation, smart home, intelligent medical treatment, intelligent security, automatic driving, intelligent terminals and the like.
The method for generating broadcast text provided by the embodiment of the application can be applied to the voice assistants of intelligent devices in fields such as intelligent terminals, smart home, intelligent security and automatic driving: a voice user interface (VUI) is provided on smartphones, speakers and intelligent vehicle-mounted terminals (electronic control units, ECUs), and corresponding tasks or related services are completed according to the voice instructions input by the user.
Illustratively, the intelligent devices include intelligent televisions, intelligent speakers, robots, intelligent air conditioners, intelligent smoke alarms, intelligent fire extinguishers, intelligent vehicle terminals, cell phones, tablets, notebook computers, desktop computers, all-in-one machines, and the like.
Fig. 2 is a schematic diagram of an application system of a voice assistant according to an embodiment of the present application. As shown in fig. 2, in the system diagram 200, the data collection device 260 is configured to collect user information, device information, scene information, and/or historical listening periods, and store these information in the database 230. The data acquisition device 260 corresponds to the infrastructure sensor of fig. 1, and includes a motion sensor, a displacement sensor, an infrared sensor, etc. communicatively connected to the smart device for collecting current scene information of the user, such as sports, meetings, rest, chat, etc.
The data acquisition device 260 further includes a camera device, a GPS, etc. communicatively connected to the smart device for collecting scene information of a location or a place where the user is currently located, for example, in a vehicle, a living room, a bedroom, etc.
The data collection device 260 further includes a timer for recording a start time, an end time, and a broadcast duration of the broadcast voice. And recording the broadcasting time length as the historical listening time length of the user in the user information.
The client device 240 corresponds to the basic platform of the infrastructure in fig. 1, and is configured to interact with a user, obtain a voice command sent by the user, broadcast the broadcast content of the voice command, display the broadcast content to the user, and store the information in the database 230; the client device 240 includes a display screen and microphone of a smart phone, a smart car terminal, etc. providing a Voice User Interface (VUI), a speaker, keys, a bluetooth headset microphone, etc.
The microphone can be a sound receiving device, and comprises an integrated microphone, a microphone or a microphone array connected with the intelligent device, or a microphone array connected with the intelligent device in a communication way through a short-distance connection technology, and the like, and is used for collecting voice instructions sent by a user.
Training device 220 corresponds to the intelligent chip of the infrastructure of fig. 1, and trains voice assistant 201 based on user information, device information, scene information, and/or historical broadcast duration, etc., maintained in database 230. The voice assistant 201 can provide a broadcasting text with personalized duration in a voice dialogue scene between the user and the client device 240, and generate a broadcasting voice corresponding to the voice speed, inform the user of broadcasting content and guide the user to continue using the client device 240.
In fig. 2, the execution device 210 corresponds to the smart chip of the infrastructure in fig. 1 and is configured with an I/O interface 212 for data interaction with the client device 240: the execution device 210 obtains the voice instruction information input by the user on the client device 240 through the I/O interface 212, and outputs the broadcast content to the client device 240 through the I/O interface 212, for example, broadcasting it through a speaker, or presenting it on the display screen of a smartphone, intelligent vehicle-mounted terminal or the like through the voice user interface (VUI).
The execution device 210 may call data, code, etc. in the data storage system 250, or may store data, code instructions, etc. in the data storage system 250.
The training device 220 and the execution device 210 may be the same smart chip or may be different smart chips.
Database 230 is a collection of data of user information, device information, and/or scene information stored on a storage medium.
The voice assistant 201 is agent software for executing voice instructions or services, and the executing device 210 executes the voice assistant 201, and after obtaining a voice instruction sent by a user, generates a target broadcast text with personalized length according to user information, device information and/or scene information, controls the speech speed of the broadcast voice, informs the user of the broadcast content, and guides the user to continue using the device.
Finally, the I/O interface 212 returns the target broadcast text with the personalized length generated by the voice assistant 201 as output data to the client device 240, and the client device 240 displays the broadcast text and broadcasts the broadcast text to the user at the corresponding speech rate.
Further, the training device 220 acquires the training data and corpus stored in the database 230 and, based on the acquired historical user information, device information and/or scene information, trains the voice assistant 201 with broadcast text whose length matches the user's listening history as the training target, so that the assistant outputs better target broadcast text.
In the case shown in fig. 2, the user may input voice instruction information to the execution device 210, for example, may operate in a Voice User Interface (VUI) provided by the client device 240. In another case, the client device 240 may automatically input an instruction to the I/O interface 212 and obtain the broadcast content, and if the client device 240 automatically inputs instruction information to obtain the authorization of the user, the user may set the corresponding authority in the client device 240. The user may view or listen to the broadcast content output by the execution device 210 at the client device 240, and the specific presentation form may be a specific manner such as display, wake-up sound, broadcast, etc. The client device 240 may also be used as a voice data collection terminal to store the collected wake-up sound or voiceprint data of the user into the database 230.
It should be noted that fig. 2 is only a schematic view of a scenario of a system application provided by an embodiment of the present application, where a positional relationship between devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, and a system of fig. 2 may correspond to one or more device entities, for example, in fig. 2, the data storage system 250 is an external memory with respect to the execution device 210, and in other cases, the data storage system 250 may also be disposed in the execution device 210.
Fig. 3 is a functional architecture diagram of a voice assistant according to an embodiment of the present application. The following describes the functional modules in the voice assistant 201, and as shown in fig. 3, the voice assistant 201 includes a front-end processing module 31, a voice recognition module 32, a semantic understanding module 33, a dialogue state module 34, a dialogue strategy learning module 35, a natural language generation module 36, a voice synthesis module 37, and a dialogue output module 38.
The front-end processing module 31 is configured to process a voice command input by a user to obtain a data format required by the network model for the voice recognition module 32.
Illustratively, the front-end processing module 31 obtains a voice instruction in opus compression format input by the user and audio-decodes it into a PCM-format audio signal; the audio signal is separated, noise-reduced and feature-extracted using voiceprint or other features, and audio feature vectors such as mel-frequency cepstral coefficient (MFCC) or filter bank features are obtained through audio processing algorithms such as framing, windowing and short-time Fourier transform. The front-end processing module 31 is generally provided on the terminal side.
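For illustration, a simplified framing/windowing/short-time-Fourier-transform front end might look like the sketch below; the frame and hop sizes are typical values assumed here, and a real pipeline would add mel filter banks or MFCC extraction on top of the magnitude spectrum.

```python
# Hedged sketch of front-end feature extraction from a PCM signal:
# framing, Hann windowing and a short-time Fourier transform.
import numpy as np

def frame_features(pcm: np.ndarray, sr=16000, frame_ms=25, hop_ms=10):
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    n = 1 + max(0, (len(pcm) - frame) // hop)
    frames = np.stack([pcm[i * hop : i * hop + frame] * window for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=1))  # magnitude spectrum per frame

pcm = np.random.default_rng(0).normal(size=16000).astype(np.float32)  # 1 s of fake audio
print(frame_features(pcm).shape)  # (98, 201): frames x frequency bins
```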
The speech recognition (automatic speech recognition, ASR) module 32 is configured to obtain the audio feature vector processed by the front-end processing module 31, and convert the audio feature vector into text through an acoustic model and a language model for understanding by the semantic understanding module 33. Wherein the acoustic model is used to correspond (decode) the acoustic feature class to the phonemes or words and the language model is used to decode the phonemes or words into a complete sentence.
The acoustic model and the language model are used for processing the audio feature vector in a serial mode, converting the audio feature vector into phonemes or words through the acoustic model, converting the phonemes or words into a text sequence through the language model, and outputting a text corresponding to the user voice.
Illustratively, the ASR module 32 may adopt an end-to-end implementation in which the acoustic model and the language model use neural network structures and are trained jointly, the training result being a model that outputs the character sequence corresponding to the user's speech. Alternatively, the acoustic model may be modeled with a hidden Markov model (HMM), and the language model may be an n-gram model.
The semantic understanding (natural language understanding, NLU) module 33 is configured to convert text or a chinese character sequence corresponding to a user's voice into structured information, wherein the structured information includes machine-executable intent information and recognizable slot information. The purpose is to obtain semantic representation of natural language through analysis of grammar, semantics and language.
It can be understood that the intention information refers to a task that a voice instruction issued by a user needs to execute; the slot information refers to parameter information which needs to be determined for executing tasks.
Illustratively, the user asks the voice assistant 201 "How is the air temperature in Nanjing today?" The NLU module 33 understands the text corresponding to the voice instruction, obtaining the intent "check weather" and the slots "place: Nanjing" and "time: today".
The NLU module 33 may classify the text corresponding to the voice command into intention information supportable by the voice assistant 201 through a classifier, and then annotate the slot information in the text using a sequence annotation model.
The classifier may be a traditional machine learning classification model, such as a naive Bayes (NB) model, a random forest (RF) model, an SVM classification model or a KNN classification model; it may also be a deep-learning text classification model, such as FastText or TextCNN.
The sequence labeling model is used to label each element in the text or character sequence and output a labeled sequence that indicates the start, end and type of each slot. The sequence labeling model may be one of the following: a linear model, a hidden Markov model, a maximum entropy Markov model, a conditional random field, and the like.
The NLU module 33 may also use an end-to-end model to output both intent information and slot information.
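The two NLU outputs can be pictured with the toy sketch below, where simple keyword rules stand in for the intent classifier and the sequence labeling model described above; the token lists and tag names are assumptions of this sketch.

```python
# Hedged sketch of NLU output: an intent label plus BIO slot tags over the
# token sequence, with keyword rules standing in for the trained models.
def nlu(tokens):
    intent = "check_weather" if "temperature" in tokens or "weather" in tokens else "unknown"
    tags, slots = [], {}
    for tok in tokens:
        if tok in ("Nanjing", "Beijing"):
            tags.append("B-location"); slots["location"] = tok
        elif tok in ("today", "tomorrow"):
            tags.append("B-date"); slots["date"] = tok
        else:
            tags.append("O")
    return intent, tags, slots

print(nlu("how is the temperature in Nanjing today".split()))
# -> ('check_weather', ['O', 'O', 'O', 'O', 'O', 'B-location', 'B-date'],
#     {'location': 'Nanjing', 'date': 'today'})
```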
The dialog state tracking (dialog state tracking, DST) module 34 is used to manage the dialog state of the voice assistant 201. DST module 34 maintains current wheel dialog intents, filled slots, and dialog states in the multi-wheel dialog scene using the intent information and slot information of the current wheel dialog output by NLU module 33.
The inputs of DST module 34 are the last round of dialogue state, the broadcast content returned by the last round of third party application, and the intention information and slot information of the current round of dialogue, and the output is the current round of dialogue state.
DST module 34 records the dialogue history and dialogue state of the voice assistant 201, assists the voice assistant 201 in understanding the user's voice instruction in the current round of dialogue in combination with the dialogue history recorded by the context manager (i.e., database 230 in fig. 2), and gives appropriate feedback.
Illustratively, in the first round of dialogue, user A requests "book a ticket to Nanjing" from the voice assistant 201; in the second round, user A asks the voice assistant 201 "What is the weather there?". The NLU module 33 outputs the intent of the current round as "check weather" and the slots as "place: there" and "time:". Since DST module 34 recorded the first-round dialogue state, the system, combining the dialogue history recorded by the context manager, understands that the slot "place: there" refers to "Nanjing" and fills "Nanjing" into the place slot. DST module 34 outputs the dialogue state information of the current round, including intent information (check weather), a filled slot (place: Nanjing) and an unfilled slot (time:).
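A minimal sketch of this slot carry-over, under the assumption that the dialogue state is a plain dictionary, is shown below.

```python
# Hedged sketch of DST slot carry-over: a location filled in an earlier
# turn resolves an anaphoric reference ("there") in the current turn.
def update_state(prev_state: dict, intent: str, slots: dict) -> dict:
    state = {"intent": intent, "slots": dict(slots)}
    if "location" not in state["slots"] and "location" in prev_state.get("slots", {}):
        state["slots"]["location"] = prev_state["slots"]["location"]
    return state

turn1 = update_state({}, "book_ticket", {"location": "Nanjing"})
turn2 = update_state(turn1, "check_weather", {})  # "What is the weather there?"
print(turn2)  # {'intent': 'check_weather', 'slots': {'location': 'Nanjing'}}
```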
The dialogue policy learning (dialog policy learning, DPL) module 35 is used to determine actions to be performed by the voice assistant 201 next, including querying the user, executing instructions from the user, recommending other instructions from the user, generating replies, etc.
The DPL module 35 uses the dialog state information output by the DST module 34 to determine the next execution action.
In one implementation, the DPL module 35 may determine that the next execution action information is to generate a broadcast content that asks the user based on the current wheel session status.
For example, following the previous example, DST module 34 outputs that the current-round dialogue state has an unfilled slot (time:), and the DPL module 35 may determine that the next action is to ask the user "Which day?" so as to maintain the control logic of the dialogue system and ensure that the dialogue can continue. The execution action information is an action tag or structured information, such as "REQUEST-SLOT: date", indicating that the user is to be asked next for the time.
In one implementation, the DPL module 35 may determine, based on the current wheel dialog status, that the next execution action is to select an appropriate third party application (app) to execute the voice instruction, and send the intent and slot information to the selected third party application; and acquiring an execution result returned by the third party application, wherein the execution result is broadcast content corresponding to the voice instruction.
A third party application (app) is an application capable of executing or satisfying the intention of the voice instruction according to the slot information and returning the broadcast content, for example, an app capable of inquiring weather, an app capable of providing commodity information, an app capable of providing navigation or positioning information, and the like.
The broadcast content determined by the DPL module 35 according to the current dialog state of the round or the broadcast content returned after the third party application (app) or the server executes the voice command according to the intention and the slot information can be used as the input parameter of the next dialog state of the DST module 34 or the input parameter of the NLG module 36.
The natural language generation (natural language generation, NLG) module 36 is a translator that converts structured information into natural language expressions and is currently widely used in voice assistants. When generating the voice assistant's broadcast language, because devices and display positions differ, a duration-limiting parameter needs to be introduced to limit the length of the generated text, so as to adaptively match the requirements of different users, devices and scenes on broadcast content and broadcast duration.
In the embodiment of the present application, the NLG module 36 is configured to obtain the current dialogue state maintained by the DST module 34, the next execution action determined by the DPL module 35, and/or the broadcast content returned by the third party application (app), and generate the target broadcast text with personalized length in combination with the user information, the device information, and/or the scene information.
Illustratively, when the current dialogue state maintained by DST module 34 is intent information (check weather), a filled slot (Nanjing) and an unfilled slot (time:), and the next action determined by the DPL module 35 is to query the user, the broadcast text generated by the NLG module 36 is "May I ask which day you want to query?"
Illustratively, the NLG module 36 inputs the current dialogue state and the broadcast content returned by the third-party application into a template matching the current intent, device or scene, and outputs target broadcast text of the length configured by the template. The NLG module 36 may also output target broadcast text of personalized length using a model-based (black-box) approach.
A User Profile (UP) module 213, configured to obtain user information by querying data in the database 230 shown in fig. 2, where information such as a historical listening time period for a user to listen to a voice assistant broadcast is recorded in the user information.
User information, also called a user portrait, is built by collecting data of various dimensions such as the user's social attributes, consumption habits, preference characteristics and system usage behavior, analyzing and aggregating these characteristics, and mining potential value information, so as to abstract a full view of the user for recommending personalized content or providing services that match the user's usage habits.
A Device Profile (DP) module 214 is configured to obtain device information of the client device 240 shown in fig. 2, including display resolution, size, category, volume of speaker, tone, etc.
The scene perception (context awareness, CA) module 215 is configured to obtain current scene information, including a room category, a background noise level, a current motion state of a user, and the like, through the data acquisition device 260 shown in fig. 2.
The CA module 215, the DP module 214, and the UP module 213 may be external modules with respect to the voice assistant 201, and are not specifically limited herein.
In the embodiment of the present application, after the voice assistant understands the user's voice instruction through the natural language understanding (NLU) module 33 and sends it to the corresponding third-party application (app) for execution, the voice assistant may obtain the structured broadcast content returned by the third-party application and use the NLG module 36 to convert it into broadcast text, so that the TTS module can generate broadcast speech and inform the user of the broadcast content.
A Text-to-Speech (TTS) module 37 is configured to control a broadcasting speed of the target broadcasting Text according to a broadcasting length parameter, where the broadcasting length parameter indicates historical listening time information.
In the embodiment of the application, when the TTS module 37 converts the target broadcast text into the broadcast voice, the broadcast length parameter is introduced, and the broadcast voice speed is controlled by combining the user information, the equipment information and/or the scene information, so that the broadcast time length of the target broadcast text is limited, and the characteristics of the generated voice such as the voice speed, the tone quality, the volume and the like are controlled while the accuracy of voice generation is ensured.
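One way such speech-rate control could be computed is sketched below; the baseline of 5 characters per second and the clamping range are assumptions of this sketch rather than values from the application.

```python
# Hedged sketch of speech-rate control: derive a TTS rate multiplier so the
# target text roughly fits the duration implied by the length parameter.
BASELINE_CHARS_PER_SEC = 5.0  # assumed natural reading speed at rate 1.0

def tts_rate(text: str, target_duration_s: float, lo=0.75, hi=1.5) -> float:
    natural_s = len(text) / BASELINE_CHARS_PER_SEC  # duration at rate 1.0
    rate = natural_s / target_duration_s            # > 1.0 means speak faster
    return min(hi, max(lo, rate))                   # clamp to a natural range

text = "Sunny in Nanjing today, a low of 15 and a high of 23 degrees Celsius."
print(tts_rate(text, target_duration_s=8.0))  # -> 1.5: speed up, clamped
```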
The dialogue output module 38 is used for generating a corresponding broadcast card according to the target broadcast speech and displaying it to the user.
Example 1
The embodiment of the application provides a method for generating a broadcast text, which is applied to a voice assistant, and is used for acquiring broadcast content corresponding to a voice instruction by receiving the voice instruction of a user and generating a target broadcast text according to a broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening time information.
Fig. 4 is a flowchart of a method for generating a broadcast text according to an embodiment of the present application. As shown in fig. 4, the voice assistant performs the following steps S401-S404.
S401, receiving a voice instruction of a user.
The voice assistant 201 receives the voice instruction of the user.
Illustratively, after user A wakes up the voice assistant 201, user A issues the voice instruction "How is the temperature in Nanjing today?".
S402, acquiring broadcasting content corresponding to the voice command.
The voice assistant 201 performs front-end processing on the voice instruction "How is the temperature in Nanjing today?" to obtain audio feature vectors; recognizes the audio feature vectors as text through the acoustic model and the language model; understands the text, obtaining the intent "check weather" and the slots "place: Nanjing" and "time: today" corresponding to the voice instruction; and manages the dialogue state, obtaining the current dialogue state from the previous dialogue state, the previous broadcast content, and the intent and slot information corresponding to the current voice instruction, where the current dialogue state includes the intent information, filled slots and unfilled slots, and determines whether the voice instruction is executable.
In one implementation, the voice assistant 201 may determine, when the current dialogue state is executable, a third-party application that executes the intent; send the intent information and slot information corresponding to the voice instruction to that third-party application; and acquire the execution result returned by the third-party application (app) or server, the execution result being the broadcast content corresponding to the current voice instruction.
Illustratively, the user sends the voice instruction "How is the temperature in Nanjing today?" to the voice assistant 201. The voice assistant 201 selects a suitable third-party application (app) to execute the voice instruction in combination with the intent information and slot information of the user request, and outputs the execution result returned by the third-party application (app), which is the structured broadcast content {"temperature": "15-23", "unit": "°C", "location": "Nanjing"}.
In one implementation, when there are unfilled slots in the current dialogue state, the voice assistant 201 determines that the voice instruction is not executable and may generate the broadcast content based on the dialogue state.
Illustratively, when there are unfilled slots in the current dialogue state, the voice assistant 201 obtains the next-action information determined by the DPL module 35, which is an action tag or structured information, and determines that the broadcast content is "REQUEST-SLOT: date", indicating that the user is to be asked next for the time.
S403, generating a target broadcast text according to the broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening time information.
In one implementation, the NLG module 36 may be a generative model: the broadcast content is used as input of the generative model, the user's broadcast length parameter is used as an additional parameter, the length of the output broadcast text is implicitly defined through the training data, and a target broadcast text is generated, where the target broadcast text is a broadcast text whose duration matches the broadcast length parameter.
In another implementation, the length or length range of the text produced by the generative model can be defined by the input broadcast length parameter: the broadcast content and the broadcast length parameter are used as inputs of the model, and a target broadcast text of the defined length is output.
In one implementation, the NLG module 36 may be a retrieval model that takes the broadcast content as input, retrieves a corresponding template according to the broadcast content, and generates the target broadcast text from the retrieved template.
In one implementation, the broadcast content and the user's broadcast length parameter are used as inputs of the retrieval model; a template corresponding to the broadcast content is retrieved from a predefined template library according to the length constrained by the broadcast length parameter, and the target broadcast text is output through the retrieved template, as the sketch below illustrates.
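As a rough illustration of this retrieval path, the sketch below picks a predefined template whose length class matches the broadcast length parameter and fills it with the structured broadcast content; the template strings and the length buckets are invented for illustration.

```python
TEMPLATES = {
    "short":  "{location}: {temperature} {unit}",
    "medium": "{location} today, {temperature} degrees {unit}",
}

def retrieve_template(broadcast_length_param: int) -> str:
    # Map the character-length limit to a length bucket (assumed rule).
    return TEMPLATES["short"] if broadcast_length_param <= 12 else TEMPLATES["medium"]

content = {"temperature": "15-23", "unit": "C", "location": "Nanjing"}
text = retrieve_template(20).format(**content)
print(text)  # -> "Nanjing today, 15-23 degrees C"
```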
In one implementation, the broadcast length parameter may be determined from an average or weighted average of at least one piece of historical listening duration information. Illustratively, the device portrait module 213 obtains user information, including the historical listening duration of each voice broadcast the user has listened to, and derives the broadcast length parameter from the statistical average or weighted average of those durations; the minimum, maximum, or most recent value of the historical listening duration may also be used as the broadcast length parameter.
For example, the device portrait module 213 obtains the user's historical listening duration t = 5 s and, after conversion through a mapping table, determines that the generated broadcast text should be 20 characters long, so the broadcast length parameter is 20. The NLG module 36 then generates, from the returned broadcast content {"temperature": "15-23", "unit": "C", "location": "Nanjing"} and the broadcast length parameter 20, the target broadcast text "Nanjing today sunny, 15 degrees Celsius minimum, 23 degrees Celsius maximum" of about 20 characters.
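A possible reading of the mapping-table conversion, as a sketch: average the historical durations and scale by an assumed characters-per-second ratio. The ratio of 4 characters per second is an assumption chosen only so that t = 5 s maps to the length 20 used in this example.

```python
def broadcast_length_param(history_durations, chars_per_second=4.0):
    # Statistical average of the historical listening durations.
    t = sum(history_durations) / len(history_durations)
    return round(t * chars_per_second)

print(broadcast_length_param([5.0]))  # -> 20
```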
Before the voice assistant of the smart device is first enabled, the historical listening duration may take an initial value. The value can be an exact numerical record, such as "5 seconds" or "20 characters", or an identifier mapped to a duration range, such as "medium" or "concise"; the initial value may also be the average listening duration obtained by the device manufacturer through user surveys, or the average listening duration of the group to which the user belongs. The embodiments of the application do not limit the initial value of the historical listening duration.
Each time the user listens to a voice broadcast, the voice assistant 201 records the listening duration and collects it into the user portrait, thereby accumulating multiple pieces of historical listening duration information.
In one implementation, the recording of the listening duration may start when the broadcast is initiated and end when one of the following occurs: the broadcast finishes, the broadcast is interrupted, or the broadcast is closed or switched to another program. The listening duration is the time interval between the start and the end of timing, as sketched below.
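A minimal sketch of such a recorder, assuming a monotonic clock and that the caller reports which of the end events occurred:

```python
import time

class ListeningTimer:
    def on_broadcast_start(self):
        self._start = time.monotonic()

    def on_broadcast_end(self, reason: str) -> float:
        # reason: "finished", "interrupted", "closed", or "switched"
        duration = time.monotonic() - self._start
        return duration  # appended to the user-portrait history elsewhere
```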
S404, controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter.
In one implementation, after the TTS module 37 obtains the broadcast text, it uses the broadcast length parameter as a constraint on the speech rate of the broadcast voice, controlling the rate and converting the broadcast text into broadcast voice that matches the user's historical listening habits. A possible rate computation is sketched below.
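One plausible way to turn the length parameter into a rate constraint, assuming a default speaking rate in characters per second (the value 4 is an assumption carried over from the earlier mapping example):

```python
def speech_rate(text: str, target_seconds: float, default_chars_per_sec=4.0):
    # Duration the text would take at the default rate.
    natural_seconds = len(text) / default_chars_per_sec
    # Rate multiplier passed to TTS: >1 speeds up, <1 slows down.
    return natural_seconds / target_seconds

print(speech_rate("Nanjing today, 15-23 C", 5.0))
```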
Fig. 5 is an application schematic diagram of a method for generating a broadcast text according to an embodiment of the present application. As shown in fig. 5, user A wakes up the voice assistant 201 and asks "what is the temperature in Nanjing today?"
The front-end processing module 31 performs audio decoding on the input voice instruction "what is the temperature in Nanjing today?", producing an audio signal in PCM format; the audio signal is separated, noise-reduced, and feature-extracted using voiceprint or other features, and an audio feature vector is obtained through audio processing algorithms such as framing, windowing, and short-time Fourier transform.
The ASR module 32 converts the audio feature vector into text via an acoustic model and a language model. Specifically, acoustic features in the audio feature vector are converted into phonemes or words through the acoustic model, the phonemes or words are converted into a text sequence through the language model, and the text corresponding to user A's voice instruction is output.
The NLU module 33 understands the text and obtains the user's intention as "check weather" and the slot "place: Nanjing".
The DST module 34 uses the intention "check weather" and slot "place: Nanjing" output by the NLU module 33 for the current turn, and outputs the dialogue state information of the current turn, including the intention information (check weather) and the filled slots (place: Nanjing) and (time: today).
The DPL module 35 uses the dialogue state information output by the DST module 34 to determine that the next action is to execute the instruction. It selects a suitable third-party service or application (app) according to the intention information, using the slot information as parameters, and dispatches the "check weather" request to the corresponding third-party application (service provider W).
The NLG module 36 obtains the returned broadcast content as the structured information {"temperature": "15-23", "unit": "C", "location": "Nanjing"}. At the same time, the UP module 213 obtains user A's historical listening duration t_A = 5 s; after mapping-table conversion, the character length of the broadcast text to generate is determined to be 20, so the broadcast length parameter is 20.
The NLG module 36 generates, from the returned broadcast content and the broadcast length parameter, the target broadcast text "Nanjing today sunny, 15 degrees Celsius minimum, 23 degrees Celsius maximum" of about 20 characters.
The TTS module 37 performs speech-rate control according to the target broadcast text and the listening duration t = 5 s, and generates broadcast voice of 4.5 s to 5.5 s for broadcasting.
After the broadcast completes, the voice assistant 201 sends the user's listening duration to the UP module 213, which records the listening duration for user A.
User B wakes up the voice assistant 201 with the same voice instruction as user A, and the process up to the DPL module 35 obtaining the returned result is the same as for user A. The UP module 213 provides user B's historical listening duration t_B; after conversion, the character length of the broadcast text to generate is about 10 characters, so the broadcast length parameter is 10. The target broadcast text generated by the NLG module 36 is "Sunny, 15-23 ℃", and the broadcast voice generated by the TTS module 37 is 1.5 s to 2.5 s long.
As the embodiment of fig. 5 shows, for two users A and B whose historical listening durations differ, the method can generate broadcast texts of different lengths for the same voice instruction, so the voice assistant produces a personalized broadcast text matched to each user's habits and broadcasts it accordingly.
The method introduces user information into the generation stage of the broadcast text and broadcast voice, and controls the level of detail of the target broadcast text according to the historical listening duration recorded in the user information, giving each user a personalized interaction experience with the voice assistant.
Embodiment 2
On the basis of Embodiment 1, the method for generating a broadcast text in this embodiment combines the user's voice instruction with the user's historical listening duration, the device information, and/or the current scene information, generating a broadcast text whose length matches the user's listening habits and broadcasting it at a corresponding speech rate, thereby providing a personalized broadcast experience. The user information includes the user's historical listening duration; the device information includes configuration information of the broadcast device such as display resolution, size, and device category; the scene information includes the room category, the background noise level, the user's current motion state, and so on.
The voice assistant obtains the device information of the broadcast device through the DP module 214 and the current scene information through the CA module 213, and queries the database 213 using the device information and scene information as indexes to obtain, from the historical listening duration list, the finest-granularity records that meet the threshold requirement, as shown in Table 2.
In the dialogue system of the voice assistant 201, the user's historical listening duration is computed at three levels according to the device information and the current scene. The listening duration computed at the finest currently available granularity serves as the broadcast length parameter; step S403 of Embodiment 1 is executed to generate the target broadcast text, and step S404 of Embodiment 1 is executed to control the speech rate so the broadcast text is spoken at a rate matching the user's listening history.
After each broadcast-text listening event, the historical listening duration at the corresponding level is updated in the three-level index structure. The three-level historical listening duration list is shown in Table 2:
TABLE 2
Time     | Device d     | Scene e     | Listening time t
20:00:03 | Mobile phone | Vehicle     | 1.7 s
22:05:03 | Television   | Bedroom     | 8.1 s
12:20:10 | Television   | Living room | 5.2 s
19:05:54 | Television   | Living room | 7.1 s
08:03:03 | Mobile phone | Bedroom     | 2.5 s
08:30:45 | Mobile phone | Living room | 3 s
17:35:04 | Mobile phone | Vehicle     | 1.5 s
18:30:08 | Mobile phone | Vehicle     | 1.9 s
Illustratively, the broadcast length parameters are computed from the listening durations at three levels. From Table 2, the main available broadcast length parameters are: the overall listening duration t_total, the mobile-phone listening duration t_d1, the television listening duration t_d2, the in-vehicle listening duration t_e1, the living-room listening duration t_e2, and the listening duration of the mobile phone in the vehicle, t_d1e1. From the data in Table 2:

t_total = average(all) = 3.875 s;
t_d1 = average(d1) = 2.12 s;
t_d2 = average(d2) = 6.8 s;
t_e1 = average(e1) = 1.7 s;
t_e2 = average(e2) = 5.1 s;
t_d1e1 = average(d1e1) = 1.7 s;

where average() is the mean function, and the index values are: d1 the mobile phone, d2 the television, e1 the vehicle, e2 the living room, and d1e1 the mobile phone in the vehicle.
In one implementation, several pieces of historical listening duration information may be selected according to the device information or the scene information, and the broadcast length parameter determined from their average or weighted average. The broadcast length parameter determined from the device information is denoted the first broadcast length parameter; the one determined from the scene information is denoted the second broadcast length parameter.
Illustratively, when the number of historical listening records collected under the device information or scene information is below a threshold, the voice assistant uses the broadcast length parameter computed at the first level.
The first-level computation yields the overall listening duration t_total, which coincides with the user's historical listening duration as defined in Embodiment 1: an average or weighted average over all historical listening duration records, used as the broadcast length parameter.
For example, with the threshold set to require more than 3 records, when the user has fewer than 3 listening records the voice assistant determines the user's overall listening duration t_total from the statistical average or weighted average of all of the user's historical listening durations and uses it as the broadcast length parameter.
When the number of historical listening records collected under the device information or scene information exceeds the threshold, the historical listening duration computed at the second level is used, and the broadcast length parameter is determined from the average or weighted average of those records.
The second-level computation aggregates the historical listening durations on the corresponding device as t_d (per the device information), or the historical listening durations in the corresponding scene as t_e (per the scene information).
For example, the devices in Table 2 may be smart terminals such as mobile phones and televisions; the scenes may be places such as vehicles, bedrooms, and living rooms, or motion states such as exercising and resting.
Illustratively, when user A has 5 recorded historical listening durations for weather broadcasts through the voice assistant on a mobile phone, exceeding the system threshold of 3, the voice assistant may derive user A's broadcast length parameter on the mobile phone from the statistical average or weighted average of the listening durations recorded on that device.
For example, suppose user B has 1 record of listening to a weather broadcast in the living room through the voice assistant on a mobile phone and 2 records through the voice assistant on a smart television, so that the living-room records together reach the dialogue system's threshold of 3. The voice assistant user B is logged into can then derive the broadcast length parameter for that scene, across different smart terminals, from the statistical average or weighted average of the living-room listening records.
In one implementation, at least one piece of historical listening duration information may be selected according to both the device information and the scene information, and the broadcast length parameter determined from the average or weighted average of those records.
For example, when the number of historical listening records collected under a combination of device information and scene information exceeds the threshold, the historical listening duration computed at the third level is used. The broadcast length parameter determined from the combination of device information and scene information is denoted the third broadcast length parameter.
The third-level computation aggregates, as the device-scene listening duration t_de, the user's historical listening durations on the current device d in the current scene e.
For example, when user C has 3 recorded historical listening durations for weather broadcasts received in a vehicle through a mobile phone, reaching the dialogue system's threshold, the voice assistant user C is logged into may derive the broadcast length parameter for that device-scene combination from the statistical average or weighted average of those in-vehicle listening records. The sketch below illustrates the finest-granularity-first selection.
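A sketch of the finest-granularity-first selection with the threshold fallback described in this embodiment; treating the per-device level as taking precedence over the per-scene level within the second level is an assumption, since the text does not fix that order.

```python
def pick_length_param(records, device, scene, threshold=3):
    # Candidate record sets from finest to coarsest granularity.
    levels = [
        [r for r in records if r[:2] == (device, scene)],  # level 3: device+scene
        [r for r in records if r[0] == device],            # level 2: device
        [r for r in records if r[1] == scene],             # level 2: scene
        records,                                           # level 1: overall
    ]
    for rows in levels:
        if len(rows) >= threshold:
            return sum(t for *_, t in rows) / len(rows)
    # Too few records overall: fall back to the plain average (t_total).
    return sum(t for *_, t in records) / len(records)
```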
Meanwhile, after each broadcast-text listening event completes, the voice assistant sends the user's listening duration to the UP module 213, and the UP module 213 records the listening duration and time at the corresponding level of the three-level list shown in Table 2.
With the method of Embodiment 2, for users whose historical listening durations differ across devices and scenes, the voice assistant 201 can generate target broadcast texts of different lengths for the same voice instruction, providing a finer-grained personalized interaction experience. By breaking the user's historical listening duration down by device type and scene, Embodiment 2 provides personalized broadcast voice interaction better suited to the user's current context.
By combining the user's historical listening duration information, the device parameters, and/or the current scene information in the broadcast-text generation flow, the dialogue system of the voice assistant 201 can deliver broadcast voice whose length and speech rate match the user's listening history and adapt to the device and scene, improving the naturalness of voice interaction and substantially improving the user experience.
Embodiment 3
On the basis of Embodiments 1 and 2, the broadcast length parameter in this embodiment can be obtained through a machine learning model, which may be implemented with a random forest. The model is trained on the user's historical listening durations together with the screen size, the screen resolution, and/or the ambient noise level and room category; at inference time these features are input to the model and the broadcast length parameter is output. The target broadcast text is then generated from the broadcast length parameter and the broadcast content and broadcast at the corresponding speech rate, providing a personalized broadcast experience.
Fig. 6 is a schematic structural diagram of the random-forest-based machine learning model used by the broadcast text generation method of Embodiment 3. As shown in fig. 6, x is the input feature of the machine learning model and the broadcast length parameter y is the output.
Illustratively, the input feature x includes user information, device information, and/or scene information; the user information includes the user's historical listening duration; the device information includes the screen size and screen resolution of the current broadcast device; the scene-related data include the ambient noise level, the category of the room the user is in, and so on.
Illustratively, the broadcast length parameter y is either a category classification result such as "concise" or "moderate", or a predicted broadcast text length limit L.
In one implementation, the machine learning model is a classification model: it takes feature data such as user information, device information, and scene information as input, and outputs the broadcast length parameter y as a length category for the target broadcast text, denoted the fourth broadcast length parameter, e.g., concise, moderate, or detailed. The classification model may be trained with a standard random forest classifier.
In one implementation, the machine learning model may be a regression model: it takes feature data such as user information, device information, and/or scene information as input, and outputs the broadcast length parameter y as a length limit L for the target broadcast text, denoted the fifth broadcast length parameter. The regression model may be trained with a standard random forest regressor.
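A hedged sketch of both model variants using scikit-learn's random forest learners; the feature layout (historical duration, screen size, resolution, noise level, room category id) and the tiny synthetic training set are assumptions for illustration only.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Features: [historical duration (s), screen size (in), resolution (px),
#            noise level (0-1), room category id]
X = [[5.0, 6.1, 1080, 0.2, 1],   # phone, quiet room
     [2.0, 6.1, 1080, 0.8, 0],   # phone, noisy vehicle
     [8.0, 55.0, 2160, 0.1, 2]]  # TV, living room
y_class = ["moderate", "concise", "detailed"]  # fourth length parameter
y_limit = [20, 10, 40]                         # fifth length parameter (chars)

clf = RandomForestClassifier(n_estimators=100).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=100).fit(X, y_limit)
query = [[4.5, 6.1, 1080, 0.3, 1]]
print(clf.predict(query), reg.predict(query))
```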
The initial models are obtained through offline training; the user's historical listening durations under specific screen sizes, screen resolutions, and/or ambient noise levels and room categories are continuously collected for online learning, so the broadcast length parameter adapts to the user's listening habits.
The training data of the machine learning model include the user's historical listening duration, device information such as the screen size and/or screen resolution of the current broadcast device, and scene information such as the ambient noise level and/or room category; each training sample is labeled with the broadcast length parameter it is expected to produce. Training data may be obtained through the steps of Embodiments 1 and 2, or collected from the network environment together with user feedback; no limitation is imposed here.
The NLG module 36 uses the broadcast length parameters output by the machine learning model to control the generation length of the target broadcast text.
The TTS module 37 uses the broadcast length parameter output by the machine learning model to control the speech rate of the broadcast voice and broadcasts at that rate.
Compared with Embodiment 2, this method introduces a machine learning model that derives the broadcast length parameter from the user's historical listening duration, device information, and/or scene information, constrains the lengths of the broadcast text and broadcast voice accordingly, and keeps learning through an online learning mechanism so the personalized broadcast length parameter stays up to date. The longer the voice assistant 201 applies the method of Embodiment 3, the more accurate its personalized broadcasts become.
In this embodiment, the mapping between the user's historical listening duration and the expected broadcast text length and broadcast voice duration is learned by a machine learning model and refined through online learning, providing increasingly accurate personalization, whereas Embodiment 1 uses a rule-based mapping.
Embodiment 4
Owing to the development of pre-trained language models, a great many NLP tasks obtain large metric improvements through this paradigm. The method for generating a broadcast text in this embodiment can therefore use a pre-trained language model, such as a BERT or GPT-2 language model, to fold the broadcast length parameter into a controllable NLG module 36/TTS module 37 and generate the broadcast text or voice end to end.
Fig. 7 is a schematic diagram of the method of Embodiment 4 based on a typical pre-trained language model structure. As shown in fig. 7, the module encodes the different types of user information, device information, and/or scene information with linear encoders, then obtains a characterization vector of the broadcast length parameter through a fusion module, denoted the sixth broadcast length parameter. The sixth broadcast length parameter is input into the GPT-2 language model together with the current-turn dialogue state output by the DST module 34 and the broadcast content for the current voice instruction output by the DPL module 35, generating a target broadcast text whose length matches the user's listening history.
In one implementation, the NLG module 36 first pre-trains the GPT-2 language model on unlabeled text data to obtain language representations. It is then fine-tuned on broadcast records comprising the broadcast content, dialogue state, corresponding user information, device information, and/or scene information, together with broadcasts that received positive user feedback; the encoder parameters for these inputs are learned and the output layer of the pre-trained GPT-2 model is adjusted so it generates target broadcast texts whose length matches the user's listening history, adapting the model to the generation task. A heavily hedged sketch of this conditioning path follows.
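A heavily hedged PyTorch/transformers sketch of the conditioning path: linear encoders per information type, a fusion layer producing the prefix vector (the sixth broadcast length parameter), and the prefix prepended to the token embeddings before GPT-2. All dimensions, the concatenation-based fusion, and the textual serialization of the dialogue state are assumptions, not the patent's specification.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
d = gpt2.config.n_embd

enc_user = torch.nn.Linear(1, d)    # historical listening duration
enc_dev = torch.nn.Linear(2, d)     # screen size, resolution
fuse = torch.nn.Linear(2 * d, d)    # fusion module -> sixth length parameter

prefix = fuse(torch.cat([enc_user(torch.tensor([[5.0]])),
                         enc_dev(torch.tensor([[6.1, 1080.0]]))], dim=-1))

text = "state: check_weather | content: Nanjing 15-23 C |"
ids = tok(text, return_tensors="pt").input_ids
emb = gpt2.transformer.wte(ids)                       # token embeddings
inputs = torch.cat([prefix.unsqueeze(1), emb], dim=1)  # prepend length prefix
out = gpt2(inputs_embeds=inputs)                      # next-token logits
```

During fine-tuning, the encoder and fusion weights would be trained jointly with the adjusted GPT-2 output layer on the labeled broadcast records described above.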
When generating the broadcast text, this method introduces device information and/or scene information in addition to user information and generates broadcast texts of different lengths. The historical durations of listening to broadcast texts are collected through the user information and stored differentiated by environment and/or device; when a broadcast text is generated in a specific scene, the broadcast length parameter for that scene guides the generation, producing a target broadcast text that matches the user's habits and adapts to the device information and/or usage scene, improving interaction experience and efficiency and providing a voice assistant 201 personalized for the user.
Besides generating broadcast text or voice in response to a user request, the methods of the above embodiments may be used by the voice assistant 201 to proactively send a welcome message, to generate broadcast text or voice when the system is turned on or off, and in other situations where the broadcast should match the user's personalized usage record, device information, and/or scene information.
Embodiment 5
This embodiment provides a method for broadcasting text that generates broadcast voice in response to a user request, introduces user information in the voice generation stage, and controls the speech rate of the target broadcast voice according to the historical listening duration recorded in the user information, giving each user a personalized interaction experience with the voice assistant.
The embodiment of the application provides a text broadcasting method, which comprises the following steps: receiving a voice instruction of a user; generating a target broadcasting text corresponding to the voice instruction; and controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter, wherein the broadcasting length parameter indicates the historical listening time length information.
The voice assistant may determine the broadcast length parameter from an average or weighted average of multiple pieces of historical listening duration information; see the related implementations in Embodiment 1, not repeated here.
In some possible implementations, the broadcast length parameter is associated with the device information; a first broadcast length parameter may be determined according to the device information, and controlling the broadcast speed of the target broadcast text according to the broadcast length parameter comprises: controlling the broadcast speed of the target broadcast text according to the first broadcast length parameter, where the first broadcast length parameter indicates first historical listening duration information associated with the device information. See the implementations related to the first broadcast length parameter in Embodiment 2, not repeated here.
In some possible implementations, the broadcast length parameter is associated with the scene information; a second broadcast length parameter may be determined according to the scene information, and controlling the broadcast speed of the target broadcast text according to the broadcast length parameter comprises: controlling the broadcast speed of the target broadcast text according to the second broadcast length parameter, where the second broadcast length parameter indicates second historical listening duration information associated with the scene information. See the implementations related to the second broadcast length parameter in Embodiment 2, not repeated here.
In some possible implementations, the broadcast length parameter is associated with both the device information and the scene information; a third broadcast length parameter may be determined according to the device information and the scene information, and controlling the broadcast speed of the target broadcast text according to the broadcast length parameter comprises: controlling the broadcast speed of the target broadcast text according to the third broadcast length parameter, where the third broadcast length parameter indicates third historical listening duration information associated with the device information and the scene information. See the implementations related to the third broadcast length parameter in Embodiment 2, not repeated here.
In some possible implementations, controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may comprise: inputting the historical listening duration information, device information, and/or scene information into a classification model and outputting a fourth broadcast length parameter, where the fourth broadcast length parameter is one of several length categories; and controlling the broadcast speed of the target broadcast text according to the fourth broadcast length parameter. See the implementation of obtaining the fourth broadcast length parameter through the classification model in Embodiment 3, not repeated here.
In one possible implementation, controlling the broadcast speed of the target broadcast text according to the broadcast length parameter may comprise: inputting the historical listening duration information, device information, and/or scene information into a regression model and outputting a fifth broadcast length parameter, where the fifth broadcast length parameter is a length limit value; and controlling the broadcast speed of the target broadcast text according to the fifth broadcast length parameter. See the implementation of obtaining the fifth broadcast length parameter through the regression model in Embodiment 3, not repeated here.
It is understood that each embodiment of the present application is not an isolated embodiment, and those skilled in the art may associate or combine each embodiment, and the association and combination schemes thereof are all within the protection scope of the embodiments of the present application.
An embodiment of the present application provides an electronic apparatus including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, the processor for performing the method of any of the embodiments described above when the program stored in the memory is executed.
An embodiment of the present application provides a storage medium having stored therein instructions that, when executed on a terminal, cause the terminal to perform the method of any of the embodiments described above.
In plain-text generation scenarios, the listening duration of the broadcast text defined in the embodiments of the application can be converted into equivalent cost metrics, such as the time the user spends viewing the broadcast text.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
Furthermore, various aspects or features of embodiments of the application may be implemented as a method, apparatus, or article of manufacture using standard programming and/or engineering techniques. The term "article of manufacture" as used herein encompasses a computer program accessible from any computer-readable device, carrier, or medium. For example, computer-readable media can include, but are not limited to, magnetic storage devices (e.g., hard disks, floppy disks, or magnetic strips), optical discs (e.g., compact disc (CD), digital versatile disc (DVD)), smart cards, and flash memory devices (e.g., erasable programmable read-only memory (EPROM), cards, sticks, or key drives). Additionally, various storage media described herein can represent one or more devices and/or other machine-readable media for storing information. The term "machine-readable medium" can include, without being limited to, wireless channels and various other media capable of storing, including, and/or carrying instruction(s) and/or data.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application may be embodied essentially, or in the part contributing to the prior art, in the form of a software product stored in a storage medium, comprising instructions for causing a computer device (which may be a personal computer, a server, or an access network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (21)

  1. A method for generating a broadcast text, the method comprising:
    receiving a voice instruction of a user;
    acquiring broadcasting content corresponding to the voice instruction;
    and generating a target broadcast text according to the broadcast length parameter and the broadcast content, wherein the broadcast length parameter indicates historical listening time information.
  2. The method for generating a broadcast text according to claim 1, wherein generating the target broadcast text according to the broadcast length parameter and the broadcast content comprises:
    and taking the broadcasting content and the broadcasting length parameter as inputs of a model, outputting a target broadcasting text by the model, wherein the target broadcasting text is a broadcasting text with the duration matched with the broadcasting length parameter.
  3. The broadcast text generation method according to claim 2, wherein the model is a generative model or a retrieval model;
    generating the target broadcast text according to the broadcast length parameter and the broadcast content comprises the following steps:
    taking the broadcast content and the broadcast length parameter as input of a generative model, and generating and outputting a target broadcast text by the generative model; or
    taking the broadcast content and the broadcast length parameter as input of a retrieval model, and retrieving, by the retrieval model, a length-constrained text template from a predefined template library according to the broadcast length parameter; outputting a target broadcast text through the retrieved length-constrained text template, wherein the target broadcast text is a broadcast text whose duration matches the historical listening duration information.
  4. A method for generating a broadcast text according to any one of claims 1 to 3, wherein the broadcast length parameter is associated with device information, a first broadcast length parameter is determined according to the device information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically includes:
    generating a first target broadcasting text according to the first broadcasting length parameter and the broadcasting content; the first broadcast length parameter indicates first historical listening period information associated with the device information.
  5. A method for generating a broadcast text according to any one of claims 1 to 3, wherein the broadcast length parameter is associated with scene information, a second broadcast length parameter is determined according to the scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically includes:
    generating a second target broadcasting text according to the second broadcasting length parameter and the broadcasting content; the second broadcast length parameter indicates second historical listening period information associated with the scene information.
  6. A method for generating a broadcast text according to any one of claims 1 to 3, wherein the broadcast length parameter is associated with the device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content specifically includes:
    generating a third target broadcasting text according to the third broadcasting length parameter and the broadcasting content; the third broadcast length parameter indicates third historical listening period information associated with the device information and the scene information.
  7. A method of generating a broadcast text according to any one of claims 1-3, wherein the broadcast length parameter is associated with the device information and/or scene information, and wherein generating a target broadcast text based on the broadcast length parameter and the broadcast content comprises:
    Inputting the historical listening period information, the equipment information and/or the scene information into a classification model; outputting a fourth broadcast length parameter; the fourth broadcast length parameter is of different length types;
    and generating a fourth target broadcasting text according to the fourth broadcasting length parameter and the broadcasting content.
  8. The method according to claim 1, wherein the broadcast length parameter is associated with the device information and/or scene information, and the generating the target broadcast text according to the broadcast length parameter and the broadcast content includes:
    inputting the historical listening duration information, device information, and/or scene information into a regression model; outputting a fifth broadcast length parameter, wherein the fifth broadcast length parameter is a length limit value;
    and generating a fifth target broadcasting text according to the fifth broadcasting length parameter and the broadcasting content.
  9. The method for generating a broadcast text according to claim 1, wherein the broadcast length parameter is associated with the device information and/or scene information, and the generating a target broadcast text according to the broadcast length parameter and the broadcast content includes:
    respectively performing linear encoding on the equipment information, the scene information and/or the historical listening time length information, and then fusing to obtain a sixth broadcasting length parameter; the sixth broadcast length parameter is a characterization vector of the broadcast length parameter;
    and taking the sixth broadcast length parameter, the broadcast content, and the voice instruction as inputs of a pre-trained language model, and outputting a sixth target broadcast text.
  10. The method for generating a broadcast text according to one of claims 1 to 9, wherein the obtaining the broadcast content corresponding to the voice command includes:
    acquiring intention and slot position information according to the voice instruction;
    determining whether the voice instruction is executable according to the intention and the slot position information;
    and generating broadcasting content which is inquiry information under the condition that the voice instruction is not executable.
  11. The broadcast text generation method according to one of claims 1 to 9, wherein the obtaining the broadcast content corresponding to the voice instruction includes:
    acquiring intention and slot position information according to the voice instruction;
    determining whether the voice instruction is executable according to the intention and the slot position information;
    determining, if the voice instructions are executable, a third party service that performs the intent;
    and acquiring the broadcasting content from the third party service, wherein the broadcasting content is result information corresponding to the voice instruction.
  12. The broadcast text generation method according to one of claims 1 to 11, characterized in that the method further comprises:
    and controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter.
  13. The broadcast text generation method according to one of claims 1 to 12, characterized in that the method further comprises:
    recording the broadcasting time of the current target broadcasting text, and obtaining the historical listening time information.
  14. A method of broadcasting text, for use with a voice assistant, the method comprising:
    receiving a voice instruction of a user;
    generating a target broadcasting text corresponding to the voice instruction;
    and controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter, wherein the broadcasting length parameter indicates historical listening time information.
  15. The method for broadcasting text according to claim 14, wherein the broadcasting length parameter is associated with equipment information, a first broadcasting length parameter is determined according to the equipment information, and the controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter comprises:
    controlling the broadcasting speed of the target broadcasting text according to the first broadcasting length parameter; the first broadcast length parameter indicates first historical listening period information associated with the device information.
  16. The method for broadcasting text according to claim 14, wherein the broadcasting length parameter is associated with scene information, a second broadcasting length parameter is determined according to the scene information, and the controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter comprises: controlling the broadcasting speed of the target broadcasting text according to the second broadcasting length parameter; the second broadcast length parameter indicates second historical listening period information associated with the scene information.
  17. The method of claim 14, wherein the broadcast length parameter is associated with the device information and scene information, a third broadcast length parameter is determined according to the device information and the scene information, and the controlling the broadcast speed of the target broadcast text according to the broadcast length parameter includes:
    controlling the broadcasting speed of the target broadcasting text according to the third broadcasting length parameter; the third broadcast length parameter indicates third historical listening period information associated with the device information and the scene information.
  18. The method for broadcasting text according to claim 14, wherein the controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter comprises: inputting the historical listening period information, the equipment information and/or the scene information into a classification model; outputting a fourth broadcast length parameter; the fourth broadcast length parameter is of different length types;
    And controlling the broadcasting speed of the target broadcasting text according to the fourth broadcasting length parameter.
  19. The method for broadcasting text according to claim 14, wherein the controlling the broadcasting speed of the target broadcasting text according to the broadcasting length parameter comprises:
    inputting the historical listening period information, equipment information and/or scene information into the regression model; outputting a fifth broadcasting length parameter, wherein the fifth broadcasting length parameter is a length limit value;
    and controlling the broadcasting speed of the target broadcasting text according to the fifth broadcasting length parameter.
  20. An electronic device, comprising:
    at least one memory for storing a program; and
    at least one processor for executing the memory-stored program, which processor is adapted to perform the method of any of claims 1-19 when the memory-stored program is executed.
  21. A storage medium having stored therein instructions which, when executed on a terminal, cause the terminal to perform the method of any of claims 1-19.
CN202280029750.4A 2021-06-30 2022-05-28 Method and device for generating broadcast text and electronic equipment Pending CN117203703A (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
CN202110741280 2021-06-30
CN2021107412801 2021-06-30
CNPCT/CN2022/084068 2022-03-30
CN2022084068 2022-03-30
PCT/CN2022/095805 WO2023273749A1 (en) 2021-06-30 2022-05-28 Broadcasting text generation method and apparatus, and electronic device

Publications (1)

Publication Number Publication Date
CN117203703A (en)

Family

ID=84692502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280029750.4A Pending CN117203703A (en) 2021-06-30 2022-05-28 Method and device for generating broadcast text and electronic equipment

Country Status (2)

Country Link
CN (1) CN117203703A (en)
WO (1) WO2023273749A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116156275B (en) * 2023-04-19 2023-07-07 江西省气象服务中心(江西省专业气象台、江西省气象宣传与科普中心) Meteorological information broadcasting method and system
CN117789680B (en) * 2024-02-23 2024-05-24 青岛海尔科技有限公司 Method, device and storage medium for generating multimedia resources based on large model

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6844545B2 (en) * 2015-11-30 2021-03-17 ソニー株式会社 Information processing equipment, information processing methods, and programs
KR20180130672A (en) * 2017-05-30 2018-12-10 현대자동차주식회사 Apparatus, system, vehicle and method for initiating conversation based on situation
CN108846054A (en) * 2018-05-31 2018-11-20 出门问问信息科技有限公司 A kind of audio data continuous playing method and device
CN108766428A (en) * 2018-06-01 2018-11-06 安徽江淮汽车集团股份有限公司 A kind of voice broadcast control method and system
CN115240664A (en) * 2019-04-10 2022-10-25 华为技术有限公司 Man-machine interaction method and electronic equipment
CN111081244B (en) * 2019-12-23 2022-08-16 广州小鹏汽车科技有限公司 Voice interaction method and device
CN112071313A (en) * 2020-07-22 2020-12-11 特斯联科技集团有限公司 Voice broadcasting method and device, electronic equipment and medium
CN112820289A (en) * 2020-12-31 2021-05-18 广东美的厨房电器制造有限公司 Voice playing method, voice playing system, electric appliance and readable storage medium

Also Published As

Publication number Publication date
WO2023273749A1 (en) 2023-01-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination