WO2022121181A1 - Intelligent news broadcasting method, apparatus and device, and storage medium

Intelligent news broadcasting method, apparatus and device, and storage medium

Info

Publication number
WO2022121181A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic
classification
model
text
news
Prior art date
Application number
PCT/CN2021/084290
Other languages
French (fr)
Chinese (zh)
Inventor
苏雪琦
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022121181A1 publication Critical patent/WO2022121181A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to an intelligent news broadcasting method, apparatus, device and storage medium.
  • Social media has enriched news formats, and financial news has evolved from the traditional news anchor and radio anchor models into engaging forms better suited to new media.
  • financial news and popular-science content in short-video and audio-station scenarios emerge constantly; full-scene development has become the main trend in news media.
  • the core of full-scene broadcasting is support for multi-style speech synthesis, and in the diversified scenarios of the new-media era, emotion synthesis is the key to success.
  • intelligent speech synthesis serves multiple purposes: given input text, it synthesizes speech adapted to the styles of various platforms, reducing dependence on voice actors and improving production efficiency.
  • the main purpose of this application is to solve the problem that news broadcast audio with emotion currently cannot be synthesized.
  • a first aspect of the present application provides a method for intelligent news broadcasting, including:
  • a second aspect of the present application provides a news intelligent broadcasting device, which includes a memory, a processor, and a news intelligent broadcasting program stored in the memory and executable on the processor; when the processor executes the news intelligent broadcasting program, the following steps are implemented:
  • a third aspect of the present application provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer is caused to perform the following steps:
  • a fourth aspect of the present application provides a news intelligent broadcasting device, and the news intelligent broadcasting device includes:
  • the news text acquisition module is used to acquire the news broadcast text to be processed
  • a semantic analysis module for inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector
  • the label generation module is configured to input the semantic vector into a preset semantic classification model for classification, and generate emotional labels corresponding to each sentence in the news broadcast text;
  • the audio synthesis module is used for inputting the news broadcast text and each emotion tag into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  • the semantic vector obtained by semantic prediction is classified, multiple emotion labels are generated according to the classification results, and finally the news broadcast text and the corresponding emotion labels are input into a preset text-to-speech model for audio synthesis.
  • the present application can realize the synthesis of news broadcast audio with emotions.
  • FIG. 1 is a schematic diagram of an embodiment of a method for intelligently broadcasting news in an embodiment of the application
  • FIG. 2 is a schematic diagram of an embodiment of a news intelligent broadcasting device in an embodiment of the application
  • FIG. 3 is a schematic diagram of an embodiment of a news intelligent broadcasting device in an embodiment of the present application.
  • the embodiments of the present application provide a method, device, device and storage medium for intelligent news broadcasting, which can enrich the synthesis effect of emotions and improve the simulation degree of speech.
  • an embodiment of the method for intelligently broadcasting news in the embodiment of the present application includes:
  • the business personnel upload the news scenario text to the script library, and the management personnel can manage the script on the script management page.
  • the management page is divided into two modules, "script inventory" and "role management". Clicking "script inventory" retrieves the scripts uploaded to the script library; selecting a script allows the script content and the script description information to be viewed.
  • the script description information includes the script broadcast type, the script scene and the script word count, for example "single-person broadcast", "TV news scene" and "694 Chinese characters".
  • the news broadcast text contains timestamps. A timestamp is set by selecting text in the script and marking it as a key sentence of the script dialogue; key sentences are displayed in red font.
  • Set the timestamp corresponding to the key statement by typing the time in the timestamp setting field on the script viewing page.
  • the timestamp setting column is divided into start time and end time. All the timestamps set in the script text are displayed in the annotation history. By clicking the timestamp in the annotation history, you can quickly locate the key sentences in the corresponding script text.
  • the function of time stamp setting is to intervene the time position of different segments in the audio after the waveform audio is generated, so as to facilitate the subsequent editing of the business personnel.
  • "Character Management" will upload biography or non-script dialogue independently according to the script role setting, support matching with recorded script audio, and realize audition and review by role.
  • before the above step 101, the method further includes:
  • test sample set is input into the semantic prediction model for model performance test, if the test result is good, the model training ends, otherwise, the model training continues.
  • a large number of training samples are used to train the semantic prediction ability of the neural network model; the training samples include training texts and semantic labels, one training text can correspond to multiple semantic labels, and each training text with its corresponding
  • semantic label forms one training sample. The collected samples are split, one part serving as material for model training and the other part as test material for checking the training effect.
  • the ratio can be set to 9:1 (training samples : test samples): 90% of the samples are used for training to obtain the semantic prediction model, and the remaining 10% are used to verify its performance.
  • the "good" rating can be defined by a prediction success ratio; for example, if a 60% success ratio counts as good, the semantic prediction model performs semantic prediction 10 times, each prediction result is compared with the corresponding semantic label, and the "good" rating is achieved when 6 of the 10 predictions are accurate. If the "good" rating is not reached, the training parameters are re-adjusted and model training continues.
  • before the above step 101, the method further includes:
  • the parameters of the decision tree model are optimized until the decision tree model converges to obtain a semantic classification model.
  • a decision tree model is used for classification, and the decision tree consists of nodes and directed edges.
  • there are two types of nodes, internal nodes and leaf nodes: an internal node represents a feature or attribute, and a leaf node represents a class.
  • a decision tree contains a root node, several internal nodes and several leaf nodes. Leaf nodes correspond to decision results, and each other node corresponds to an attribute test. The sample set contained in each node is divided into child nodes according to the result of the attribute test, the root node contains the full set of samples, and the path from the root node to each leaf node corresponds to a decision test sequence.
  • the purpose of decision tree learning is to generate a decision tree with strong generalization ability, that is, the ability to deal with unseen examples.
  • constructing a decision tree model from the given semantic classification training samples, so that it can classify instances correctly, essentially means inducing a set of classification rules from the training data set. Whether the parameters need to be optimized is determined by computing a loss function: the smaller the loss, the better the generated decision tree. The loss function is usually a regularized maximum likelihood function.
  • the semantic prediction model adopted in this embodiment is the BERT model (Bidirectional Encoder Representation from Transformers), that is, the Encoder of the bidirectional Transformer, because the decoder cannot obtain the information to be predicted.
  • the main innovations of the model are in the pre-train method, which uses Masked LM and Next Sentence Prediction to capture word- and sentence-level representations respectively.
  • intelligent semantic analysis is performed based on the BERT model to determine whether the text carries emotions such as joy, anger, sorrow or happiness.
  • before the above step 101, the method further includes:
  • the word vectors are input into the word vector synthesis network, and the word vectors are weighted and fused according to the semantic weights to output corresponding semantic vectors.
  • word segmentation analyses the preset word segmentation structure of the news broadcast text to obtain multiple word segments with word order, for example a first, a second and a third word segment. The feature recognition network then extracts features from each segment, outputting the text vector α with weight 3 for the first segment, the text vector β with weight 4 for the second segment, and the text vector γ with weight 5 for the third segment; finally the word vector synthesis network fuses these word vectors
  • into one semantic vector, whose weight is computed by a weighting algorithm, i.e. the sum of the weights of vectors α, β and γ is 3 + 4 + 5 = 12. The semantic vector is then input into a preset semantic classification model for classification, and emotion labels corresponding to each sentence in the news broadcast text are generated;
  • each type of semantic vector corresponds to an emotion label, and the two are in a one-to-one relationship, and the semantic classification in this example is implemented by a classification model, which can classify objects with common characteristics.
  • common classification models include Naive Bayes;
  • the two most widely used classification models are the decision tree model and the Naive Bayes model (NBM).
  • the Naive Bayes Classifier or NBC originated from classical mathematical theory, has a solid mathematical foundation, and has stable classification efficiency;
  • the multi-layer perceptron (MLP) is a fully connected neural network in which, except for the input layer, all layers use the sigmoid activation function; the BP algorithm is used to learn the weights, with the signal propagated forward and the error propagated backward.
  • in the traditional boosting algorithm, all samples initially have the same weight; subsequently the weights of misclassified samples are continually increased and the weights of correctly classified samples are decreased.
  • the above step 103 further includes:
  • an emotion label corresponding to each sentence in the news broadcast text is generated.
  • a classification tree is a tree structure that describes the classification of instances.
  • one feature of the semantic classification sample is tested, and according to the test result the sample is assigned to one of the child nodes.
  • each child node corresponds to a value of the feature.
  • the semantic classification samples are tested and assigned one by one recursively until a leaf node is reached.
  • the semantic classification samples are divided into leaf node classes. Each leaf node corresponds to a class of semantic classification samples, and a corresponding sentiment label is generated based on the semantic classification samples of each class.
  • a well-regarded speech synthesis model is WORLD, an open-source speech synthesis system based on the C language. Speech synthesis mainly includes two approaches, waveform concatenation and parametric synthesis; WORLD is a vocoder-based parametric synthesis method whose advantage over STRAIGHT is reduced computational complexity, so it can be applied to real-time speech synthesis. STRAIGHT is not an open-source system, and the WORLD paper shows that WORLD leads STRAIGHT in both synthesized audio quality and synthesis speed. Neural-network-based end-to-end text-to-speech (TTS) technology has developed rapidly; compared with concatenative synthesis and statistical parametric synthesis in traditional speech synthesis, the voice generated by end-to-end speech synthesis usually has better naturalness. However, this technology still faces the following problems:
  • the end-to-end model usually generates a mel-spectrogram in an autoregressive manner and then synthesizes speech through a vocoder; the mel-spectrogram of a piece of speech can contain hundreds to thousands of frames, so synthesis is slow;
  • the synthesized speech is less stable: the end-to-end model usually uses an encoder-attention-decoder mechanism for autoregressive generation, and error propagation in sequence generation and inaccurate attention alignment lead to repeated or missing words;
  • the autoregressive neural network model automatically decides the length of the generated speech and cannot explicitly control the speech rate or prosodic pauses of the generated speech.
  • the Machine Learning Group of Microsoft Research Asia and the speech team of the Microsoft (Asia) Internet Engineering Institute, together with Zhejiang University, proposed FastSpeech, a new Transformer-based feedforward network that can generate high-quality mel-spectrograms and then synthesize the sound in parallel with the help of a vocoder.
  • the preset text-to-speech model adopts the Fast Speech model
  • the full-scene broadcast is a high-fidelity presentation of speech in different scenes.
  • the key lies in prosodic information such as accent pauses, breath strength, pitch strength and emotional fluctuations.
  • this embodiment adopts the Fast Speech model as the direction of the underlying technology for productization.
  • Fast Speech speeds up mel-spectrogram generation by nearly 270 times and end-to-end speech synthesis by 38 times, and its synthesis speed on a single GPU is 30 times faster than real time.
  • Fast Speech removes the attention mechanism, reduces the synthesis failure rate, and can effectively avoid the loss caused by the failure of long text synthesis;
  • Fast Speech is a non-autoregressive model that computes the mel-spectrogram frames of all characters in parallel,
  • avoiding the synthesis speed limit imposed by the recurrent mechanism, although this also weakens the prosodic correlation between frames and reduces the expressiveness of the sound.
  • Fast Speech adopts a new feedforward Transformer network architecture, abandoning the traditional encoder-attention-decoder mechanism. Its main module adopts Transformer's self-attention mechanism (Self-Attention) and one-dimensional convolution network (1D Convolution).
  • the feedforward Transformer stacks multiple FFT blocks for Phoneme to Mel spectrum transformation, with N FFT blocks each on the phoneme side and the Mel spectrum side.
  • there is a length regulator (Length Regulator) in the middle which is used to adjust the length difference between the phoneme sequence and the mel spectrum sequence.
  • the above step 104 further includes:
  • the news broadcast text is divided into sentences to obtain a plurality of sentences with word order;
  • the text is converted into a phoneme sequence, and information such as the start and end time and frequency change of each phoneme is marked.
  • the phoneme sequence provides a reference for prosodic information such as pitch and duration in the Fast Speech model and determines the correct emotion synthesis type, for example situational dialogue A with emotion label a.
  • the Fast Speech model performs prosody prediction according to situational dialogue A and emotion label a;
  • if the prediction result is "anger",
  • the prosody synthesis type information is determined to be "anger type";
  • the prosody synthesis type information "anger type" is input into the speech synthesis network, and the speech synthesis network parses the input information into parameters;
  • the vocoder in the speech synthesis network then synthesizes speech according to the result of the parameter parsing.
  • the method further includes:
  • the audio of the news broadcast is visually edited to obtain emotional voices under various emotions
  • the dialogue of the waveform audio is visualized according to the time stamp, and the dialogue can be auditioned and edited after quickly locating the dialogue.
  • timestamp 1 is 01:08~02:34,
  • timestamp 2 is 02:34~03:28.
  • the entire waveform audio is trimmed to obtain two audio files, which are labelled with buttons, for example button 1 (01:08~02:34) and button 2 (02:34~03:28); when button 1 is clicked, the audio file corresponding to timestamp 1 is played.
  • when button 2 is clicked, the audio file for the time period of timestamp 2 is played.
  • Digital audio production editing software mainly includes recording, sound mixing, post-effect processing, etc.
  • whether the emotional voice matches the emotion label is checked manually: the corresponding emotional voice is played after the button is clicked, and the staff judge its emotional color and compare it with the emotion label bound to that voice. If the emotional color of the emotional voice is consistent with the bound emotion label, the review is passed and the emotional voice is kept as news broadcast audio; if it is inconsistent, intelligent semantic analysis of the situational dialogue is performed again, corresponding emotion labels are regenerated, waveform audio is re-synthesized, the audio clip is supplemented, and manual review is initiated again.
  • the synthesis effect is enriched, the degree of realism is improved, the voice effects are diversified, and more possibilities are provided for product application scenarios; for the full-scene broadcasting of financial news, production is more controllable.
  • An embodiment of the intelligent news broadcasting device in the embodiment of the present application includes:
  • News text acquisition module 201 used to acquire the news broadcast text to be processed
  • Semantic analysis module 202 for inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector
  • the label generation module 203 is configured to input the semantic vector into a preset semantic classification model for classification, and generate emotional labels corresponding to each sentence in the news broadcast text;
  • the audio synthesis module 204 is configured to input the news broadcast text and each emotion tag into a preset text-to-speech model for audio synthesis, and output news broadcast audio with multiple emotions.
  • the news text acquisition module 201 can also be specifically used for:
  • test sample set is input into the semantic prediction model for model performance test, if the test result is good, the model training ends, otherwise, the model training continues.
  • the news text acquisition module 201 can also be specifically used for:
  • the parameters of the decision tree model are optimized until the decision tree model converges to obtain a semantic classification model.
  • the semantic analysis module 202 can also be specifically used for:
  • the word vectors are input into the word vector synthesis network, and the word vectors are weighted and fused according to the semantic weights to output corresponding semantic vectors.
  • the label generation module 203 can also be specifically used for:
  • an emotion label corresponding to each sentence in the news broadcast text is generated.
  • the audio synthesis module 204 can also be specifically used for:
  • the news broadcast text is divided into sentences to obtain a plurality of sentences with word order;
  • FIG. 3 is a schematic structural diagram of a news intelligent broadcasting device provided by an embodiment of the present application.
  • the news intelligent broadcasting device 300 may vary greatly with configuration or performance, and may include one or more processors (central processing units, CPU) 310 (e.g., one or more processors), a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) storing application programs 333 or data 332. The memory 320 and the storage medium 330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the news intelligent broadcasting device 300 .
  • the processor 310 may be configured to communicate with the storage medium 330 to execute a series of instruction operations in the storage medium 330 on the news intelligent broadcasting device 300 .
  • the news intelligent broadcasting device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input and output interfaces 360, and/or one or more operating systems 331, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • the present application also provides a news intelligent broadcasting device, the news intelligent broadcasting device includes a memory and a processor, the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor is made to execute the above embodiments.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are executed on the computer, make the computer execute the steps of the method for intelligent news broadcasting.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of artificial intelligence. Disclosed are an intelligent news broadcasting method, apparatus and device, and a storage medium. The intelligent news broadcasting method comprises: acquiring news broadcasting text to be processed; inputting the news broadcasting text into a preset semantic prediction model for semantic prediction, so as to obtain a corresponding semantic vector; inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion tag corresponding to each sentence in the news broadcasting text; and inputting the news broadcasting text and each emotion tag into a preset text-to-speech model for audio synthesis, and outputting a news broadcasting audio with a plurality of emotions. According to the present application, a news broadcasting audio with emotions can be synthesized.

Description

Intelligent news broadcasting method, apparatus and device, and storage medium
This application claims priority to the Chinese patent application No. 202011432581.8, entitled "News Intelligent Broadcasting Method, Apparatus, Device and Storage Medium", filed with the China Patent Office on December 10, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to an intelligent news broadcasting method, apparatus, device and storage medium.
Background Art
Social media has enriched news formats, and financial news has evolved from the traditional news anchor and radio anchor models into engaging forms better suited to new media. For example, financial news and popular-science content in short-video and audio-station scenarios emerge constantly, and full-scene development has become the main trend in news media. The core of full-scene broadcasting is support for multi-style speech synthesis, and in the diversified scenarios of the new-media era, emotion synthesis is the key to success. Intelligent speech synthesis serves multiple purposes: given input text, it synthesizes speech adapted to the styles of various platforms, reducing dependence on voice actors and improving production efficiency.
In the prior art, the inventors realized that there are few technical achievements regarding the emotional expressiveness of voices, and the emotional part of speech synthesis has not yet reached realistic anthropomorphism, so news broadcast audio with emotion currently cannot be synthesized.
Summary of the Invention
The main purpose of this application is to solve the problem that news broadcast audio with emotion currently cannot be synthesized.
A first aspect of the present application provides an intelligent news broadcasting method, including:
acquiring news broadcast text to be processed;
inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
A second aspect of the present application provides a news intelligent broadcasting device, which includes a memory, a processor, and a news intelligent broadcasting program stored in the memory and executable on the processor; when the processor executes the news intelligent broadcasting program, the following steps are implemented:
acquiring news broadcast text to be processed;
inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
A third aspect of the present application provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer is caused to perform the following steps:
acquiring news broadcast text to be processed;
inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
A fourth aspect of the present application provides a news intelligent broadcasting apparatus, which includes:
a news text acquisition module, used to acquire the news broadcast text to be processed;
a semantic analysis module, used to input the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
a label generation module, used to input the semantic vector into a preset semantic classification model for classification and to generate an emotion label corresponding to each sentence in the news broadcast text;
an audio synthesis module, used to input the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and to output news broadcast audio with multiple emotions.
In the technical solution provided by this application, semantic prediction is performed on the acquired news broadcast text, the resulting semantic vector is classified, multiple emotion labels are generated according to the classification results, and finally the news broadcast text and the corresponding emotion labels are input into a preset text-to-speech model for audio synthesis. The present application can thus synthesize news broadcast audio with emotion.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an embodiment of the intelligent news broadcasting method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of the news intelligent broadcasting apparatus in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of the news intelligent broadcasting device in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide an intelligent news broadcasting method, apparatus, device and storage medium, which can enrich the synthesis of emotions and improve the realism of the speech.
The terms "first", "second", "third", "fourth", etc. (if any) in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used can be interchanged under appropriate circumstances, so that the embodiments described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products or devices.
For ease of understanding, the specific flow of the embodiment of the present application is described below. Referring to FIG. 1, an embodiment of the intelligent news broadcasting method in the embodiment of the present application includes:
101. Acquire the news broadcast text to be processed.
In this embodiment, business personnel upload the news scenario text to a script library, and managers can manage the scripts on a script management page. The management page is divided into two modules, "script inventory" and "role management". Clicking "script inventory" retrieves the scripts uploaded to the script library; selecting a script allows the script content and its description information to be viewed. The script description information includes the script broadcast type, the script scene and the script word count, for example "single-person broadcast", "TV news scene" and "694 Chinese characters".
The news broadcast text contains timestamps. A timestamp is set by selecting text in the script and marking it as a key sentence of the script dialogue; key sentences are displayed in red font. The timestamp corresponding to a key sentence is set by typing the time into the timestamp setting field on the script viewing page. The timestamp setting field is divided into a start time and an end time. All timestamps set in the script text are displayed in the annotation history, and clicking a timestamp in the annotation history quickly locates the corresponding key sentence in the script text. The purpose of timestamp setting is to control the time position of different segments in the audio after the waveform audio is generated, so as to facilitate subsequent editing by business personnel. "Role management" allows character biographies or off-script dialogue to be uploaded separately according to the script's role settings, supports matching with recorded script audio, and enables audition and review by role.
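For illustration only, the timestamp annotation described above can be viewed as a start/end record attached to each key sentence. The following Python sketch is a non-authoritative example (the class and field names are hypothetical, not part of the application) of one possible representation, together with a lookup that mimics clicking an entry in the annotation history:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TimestampAnnotation:
        """One key-sentence annotation in the script text (hypothetical structure)."""
        sentence: str   # the selected key sentence, displayed in red on the page
        start: float    # start time in seconds, from the "start time" field
        end: float      # end time in seconds, from the "end time" field

    def locate_key_sentence(script: str, annotations: List[TimestampAnnotation], start: float) -> int:
        """Return the character offset of the key sentence whose start time matches,
        mimicking a click on that timestamp in the annotation history."""
        for annotation in annotations:
            if abs(annotation.start - start) < 1e-6:
                return script.find(annotation.sentence)
        return -1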
In this embodiment, before the above step 101, the method further includes:
acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
inputting the training sample set into a preset neural network model for semantic prediction training to obtain a semantic prediction model;
inputting the test sample set into the semantic prediction model for a model performance test; if the test result is good, model training ends, otherwise model training continues.
In this embodiment, a large number of training samples are used to train the semantic prediction ability of the neural network model. The training samples include training texts and semantic labels; one training text can correspond to multiple semantic labels, and each training text with its corresponding semantic label forms one training sample. The collected samples are split: one part serves as material for model training and the other part as test material for checking the training effect. The ratio can be set to 9:1 (training samples : test samples), i.e. 90% of the samples are used for training to obtain the semantic prediction model and the remaining 10% are used to verify its performance. If the test result reaches a preset "good" rating, model training ends. The "good" rating can be defined by a prediction success ratio; for example, if a 60% success ratio counts as good, the semantic prediction model performs semantic prediction 10 times, each prediction result is compared with the corresponding semantic label, and the "good" rating is achieved when 6 of the 10 predictions are accurate. If the rating is not reached, the training parameters are re-adjusted and model training continues.
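A minimal sketch of the 9:1 split and the "good" rating check described above, assuming a trained model object with a predict method (the model interface here is a placeholder, not the application's actual implementation):

    import random

    def split_samples(samples, ratio=0.9, seed=0):
        """Split (training_text, semantic_label) pairs into 90% training and 10% test material."""
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        cut = int(len(samples) * ratio)
        return samples[:cut], samples[cut:]

    def is_good(model, test_samples, threshold=0.6):
        """Rate the model "good" when the prediction success ratio reaches the threshold,
        e.g. 6 accurate predictions out of 10 test samples."""
        correct = sum(1 for text, label in test_samples if model.predict(text) == label)
        return correct / max(len(test_samples), 1) >= threshold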
In this embodiment, before the above step 101, the method further includes:
acquiring semantic classification samples and adding classification label information to the semantic classification samples;
initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
optimizing the parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
In this embodiment, a decision tree model is used for classification. A decision tree consists of nodes and directed edges. There are two types of nodes, internal nodes and leaf nodes: an internal node represents a feature or attribute, and a leaf node represents a class. In general, a decision tree contains one root node, several internal nodes and several leaf nodes. Leaf nodes correspond to decision results, and every other node corresponds to an attribute test. The sample set contained in each node is divided into child nodes according to the result of the attribute test, the root node contains the full sample set, and the path from the root node to each leaf node corresponds to a sequence of decision tests. The purpose of decision tree learning is to produce a decision tree with strong generalization ability, that is, a strong ability to handle unseen examples.
Constructing a decision tree model from the given semantic classification training samples, so that it can classify instances correctly, essentially means inducing a set of classification rules from the training data set. Whether the parameters need to be optimized is determined by computing a loss function: the smaller the loss, the better the generated decision tree. The loss function is usually a regularized maximum likelihood function.
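The application does not name a particular implementation of the decision tree. Purely as an illustration, a pruned decision tree classifier could be fitted with scikit-learn, where the pruning strength plays a role analogous to the regularization term in the loss:

    from sklearn.tree import DecisionTreeClassifier

    def train_semantic_classifier(features, labels, ccp_alpha=0.001):
        """Fit a decision tree on semantic classification samples (feature vectors)
        and their classification label information. ccp_alpha controls cost-complexity
        pruning, trading training fit for generalization (an illustrative stand-in for
        the regularized maximum likelihood criterion described above)."""
        classifier = DecisionTreeClassifier(criterion="entropy", ccp_alpha=ccp_alpha)
        classifier.fit(features, labels)
        return classifier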
102. Input the news broadcast text into the preset semantic prediction model for semantic prediction to obtain the corresponding semantic vector.
The semantic prediction model adopted in this embodiment is the BERT model (Bidirectional Encoder Representations from Transformers), i.e. the encoder of a bidirectional Transformer (a decoder cannot see the information to be predicted). The main innovations of the model lie in its pre-training method, which uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations respectively. In this embodiment, intelligent semantic analysis is performed based on the BERT model to determine whether the text carries emotions such as joy, anger, sorrow or happiness.
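A minimal sketch of extracting a sentence-level semantic vector with a BERT encoder, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (neither is specified by the application):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
    encoder = BertModel.from_pretrained("bert-base-chinese")

    def semantic_vector(sentence: str) -> torch.Tensor:
        """Encode one newscast sentence and use the [CLS] hidden state as its semantic vector."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = encoder(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0)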
In this embodiment, before the above step 101, the method further includes:
performing word segmentation on the news broadcast text to obtain multiple word segments with word order;
inputting each word segment in turn into the feature recognition network for feature extraction, and outputting the word vector and semantic weight corresponding to each word segment;
inputting the word vectors into the word vector synthesis network, weighting and fusing the word vectors according to their semantic weights, and outputting the corresponding semantic vector.
In this embodiment, word segmentation analyses the preset word segmentation structure of the news broadcast text to obtain multiple word segments with word order, for example a first, a second and a third word segment. The feature recognition network then extracts features from each segment, outputting the text vector α with weight 3 for the first segment, the text vector β with weight 4 for the second segment, and the text vector γ with weight 5 for the third segment. Finally, the word vector synthesis network fuses these word vectors into one semantic vector, whose weight is computed by a weighting algorithm, i.e. the sum of the weights of vectors α, β and γ is 3 + 4 + 5 = 12. The semantic vector is then input into the preset semantic classification model for classification, and emotion labels corresponding to each sentence in the news broadcast text are generated.
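A minimal sketch of the weighted fusion in the example above, with random stand-ins for the word vectors α, β and γ (the 768-dimensional size is an assumption, not stated by the application):

    import torch

    def fuse_word_vectors(vectors, weights):
        """Weighted fusion of per-word vectors into one semantic vector.
        With weights 3, 4 and 5 the normalizer is 3 + 4 + 5 = 12, as in the example."""
        total = float(sum(weights))                            # e.g. 12
        stacked = torch.stack(vectors)                         # (num_words, dim)
        w = torch.tensor(weights, dtype=stacked.dtype).unsqueeze(1) / total
        return (stacked * w).sum(dim=0)                        # (dim,)

    alpha, beta, gamma = (torch.randn(768) for _ in range(3))  # stand-in word vectors
    semantic_vec = fuse_word_vectors([alpha, beta, gamma], [3, 4, 5])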
103. Input the semantic vector into the preset semantic classification model for classification, and generate the emotion label corresponding to each sentence in the news broadcast text.
In this embodiment, each type of semantic vector corresponds to one emotion label, in a one-to-one relationship. Semantic classification in this example is implemented by a classification model, which can group objects with common characteristics. Common classification models include Naive Bayes; the two most widely used are the decision tree model and the Naive Bayes model (NBM). Compared with the decision tree model, the Naive Bayes classifier (NBC) originates from classical mathematical theory, has a solid mathematical foundation and offers stable classification efficiency.
Logistic regression uses the function y = sigmoid(wx) and divides classes according to a probability threshold; SVM assumes there is a hyperplane that can separate all samples. The multi-layer perceptron (MLP) is a fully connected neural network in which, except for the input layer, all layers use the sigmoid activation function; the BP algorithm is used to learn the weights, with the signal propagated forward and the error propagated backward. In the traditional boosting algorithm, all samples initially have the same weight; subsequently the weights of misclassified samples are continually increased and the weights of correctly classified samples are decreased.
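A one-line illustration of the logistic-regression rule y = sigmoid(wx) with a probability threshold (the threshold value below is arbitrary, chosen only for the example):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_classify(w, x, threshold=0.5):
        """Assign the positive class when sigmoid(w.x) exceeds the chosen probability threshold."""
        return int(sigmoid(np.dot(w, x)) >= threshold)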
In this embodiment, the above step 103 further includes:
inputting the semantic vector into the feature extraction network for feature extraction, and outputting multiple corresponding features;
inputting the multiple corresponding features into the feature recognition network for feature testing, and outputting test results;
inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting the classification tree of the semantic vector;
generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
A classification tree is a tree structure that describes how instances are classified. When classifying with a classification tree, starting from the root node, one feature of the semantic classification sample is tested, and according to the test result the sample is assigned to one of the child nodes; each child node corresponds to one value of that feature. The semantic classification samples are recursively tested and assigned in this way until a leaf node is reached, and each sample is then placed in the class of its leaf node. Each leaf node corresponds to one class of semantic classification samples, and a corresponding emotion label is generated for the semantic classification samples of each class.
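A minimal sketch of the recursive classification-tree traversal described above; the node structure is illustrative only, not the application's actual data format:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class TreeNode:
        feature: Optional[str] = None                                      # internal node: feature to test
        children: Dict[object, "TreeNode"] = field(default_factory=dict)   # feature value -> child node
        emotion_label: Optional[str] = None                                # leaf node: emotion class

    def classify(node: TreeNode, sample: Dict[str, object]) -> str:
        """Test one feature per internal node and descend until a leaf is reached;
        the leaf's class becomes the sentence's emotion label."""
        if node.emotion_label is not None:                 # leaf node reached
            return node.emotion_label
        value = sample[node.feature]                       # attribute test at this node
        return classify(node.children[value], sample)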
104. Input the news broadcast text and each emotion label into the preset text-to-speech model for audio synthesis, and output news broadcast audio with multiple emotions.
A well-regarded speech synthesis model is WORLD, an open-source speech synthesis system based on the C language. Speech synthesis mainly includes two approaches, waveform concatenation and parametric synthesis; WORLD is a vocoder-based parametric synthesis method whose advantage over STRAIGHT is reduced computational complexity, so it can be applied to real-time speech synthesis. STRAIGHT is not an open-source system, and the WORLD paper shows that WORLD leads STRAIGHT in both synthesized audio quality and synthesis speed. Neural-network-based end-to-end text-to-speech (TTS) technology has developed rapidly; compared with concatenative synthesis and statistical parametric synthesis in traditional speech synthesis, the voice generated by end-to-end speech synthesis usually has better naturalness. However, this technology still faces the following problems:
Slow synthesis speed: the end-to-end model usually generates a mel-spectrogram in an autoregressive manner and then synthesizes speech through a vocoder; the mel-spectrogram of a piece of speech can contain hundreds to thousands of frames, so synthesis is slow.
Poor stability of the synthesized speech: the end-to-end model usually uses an encoder-attention-decoder mechanism for autoregressive generation; error propagation in sequence generation and inaccurate attention alignment lead to repeated or missing words.
Lack of controllability: an autoregressive neural network model automatically decides the length of the generated speech and cannot explicitly control the speech rate or prosodic pauses of the generated speech. To solve this series of problems, the Machine Learning Group of Microsoft Research Asia and the speech team of the Microsoft (Asia) Internet Engineering Institute, together with Zhejiang University, proposed FastSpeech, a new Transformer-based feedforward network that can generate high-quality mel-spectrograms in a parallel, stable and controllable way and then synthesize the sound in parallel with the help of a vocoder.
In this embodiment, the preset text-to-speech model is a FastSpeech model. Full-scene broadcasting requires high-fidelity rendering of speech in different scenes, and the key lies in prosodic information such as stress and pauses, breath strength, pitch strength and emotional fluctuation. Because high expressiveness is needed and the texts are long, this embodiment takes the FastSpeech model as the underlying technology for productization. Compared with the autoregressive Transformer TTS, FastSpeech speeds up mel-spectrogram generation by nearly 270 times and end-to-end speech synthesis by 38 times, and its synthesis speed on a single GPU reaches 30 times real time. Moreover, FastSpeech removes the attention mechanism, which lowers the synthesis failure rate and effectively avoids the losses caused by failed synthesis of long texts. Unlike the Tacotron 2 model, FastSpeech is a non-autoregressive model that computes the mel-spectrogram frames of every character in parallel, avoiding the speed limit imposed by a recurrent mechanism; however, this also weakens the prosodic correlation between frames and thus reduces expressiveness. It is therefore recommended to introduce a variance adaptor mechanism to predict prosodic information such as pitch and duration, improving the phoneme duration, pitch strength and stress volume of the synthesized speech, so that synthesis is both fast and of high quality.
FastSpeech adopts a novel feed-forward Transformer architecture and discards the traditional encoder-attention-decoder mechanism. Its main modules are the Transformer self-attention mechanism and one-dimensional convolution (1D convolution). The feed-forward Transformer stacks multiple FFT blocks for the phoneme-to-mel-spectrogram transformation, with N FFT blocks on the phoneme side and N on the mel-spectrogram side. Notably, a length regulator sits between the two sides to bridge the length difference between the phoneme sequence and the mel-spectrogram sequence.
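As a rough, non-limiting reading of the length regulator described above, the sketch below repeats each phoneme's hidden representation according to a predicted duration so that the expanded sequence matches the number of mel-spectrogram frames; the hidden states and durations shown are placeholder values, not FastSpeech's reference implementation.

```python
# Hedged sketch of a FastSpeech-style length regulator: each phoneme hidden
# state is repeated `duration` times so the expanded sequence aligns with the
# mel-spectrogram frames. Shapes and values are illustrative only.
from typing import List

def length_regulate(phoneme_hidden: List[List[float]],
                    durations: List[int]) -> List[List[float]]:
    expanded = []
    for hidden, dur in zip(phoneme_hidden, durations):
        expanded.extend([hidden] * dur)   # repeat the state for dur mel frames
    return expanded

# Three phonemes with predicted durations of 2, 3 and 1 mel frames.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mel_aligned = length_regulate(hidden_states, durations=[2, 3, 1])
print(len(mel_aligned))   # 6 frames in total
```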
In this embodiment, the above step 104 further includes:
dividing the news broadcast text into sentences to obtain a plurality of sentences in order;
inputting each sentence and the emotion label corresponding to that sentence into the text preprocessing network for phoneme serialization, and outputting a phoneme sequence;
inputting the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
inputting the prosody synthesis type information into the speech synthesis network for waveform generation, and outputting news broadcast audio with multiple emotions.
In this embodiment, the text is converted into a phoneme sequence, and information such as the start and end time and the frequency variation of each phoneme is marked. Although its importance is often overlooked because it is only a preprocessing step, it involves many issues worth studying, such as distinguishing words that are spelled the same but pronounced differently, handling abbreviations, and determining pause positions. The phoneme sequence provides the reference for prosodic information such as pitch and duration in the FastSpeech model and determines the correct emotion synthesis type. For example, given situational dialogue A and emotion label a, the FastSpeech model predicts the prosodic information from dialogue A and label a; if the prediction result is "anger", the prosody synthesis type information is determined to be the "anger type". This "anger type" information is then fed into the speech synthesis network, which parses the parameters of the input, and the vocoder in the speech synthesis network synthesizes speech according to the parsing result.
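By way of a non-limiting illustration of this preprocessing step, the sketch below turns a sentence and its emotion label into a phoneme sequence and a synthesis type for the prosody prediction stage; the toy lexicon and the emotion-to-type mapping are assumptions made only for the example.

```python
# Hedged sketch of the preprocessing step: a sentence and its emotion label
# are turned into a phoneme sequence, and a synthesis type is chosen for the
# prosody prediction network. Lexicon and mapping are made up for illustration.

TOY_LEXICON = {"news": ["n", "uw", "z"], "today": ["t", "ah", "d", "ey"]}
EMOTION_TO_SYNTHESIS_TYPE = {"anger": "anger type", "joy": "joy type",
                             "neutral": "neutral type"}

def preprocess(sentence: str, emotion_label: str):
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, list(word)))  # fall back to letters
    synthesis_type = EMOTION_TO_SYNTHESIS_TYPE.get(emotion_label, "neutral type")
    return phonemes, synthesis_type

print(preprocess("News today", "anger"))
# (['n', 'uw', 'z', 't', 'ah', 'd', 'ey'], 'anger type')
```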
In this embodiment, after the news broadcast text and the emotion labels are input into the preset text-to-speech model for audio synthesis and news broadcast audio with multiple emotions is output, the method further includes:
visually editing the news broadcast audio according to preset timestamps to obtain emotional speech under a plurality of different emotions;
submitting each piece of emotional speech and the emotion label corresponding to each piece of emotional speech for manual review.
In this embodiment, the dialogue in the waveform audio is visualized according to the timestamps, so that it can be located quickly and then auditioned and edited. For example, if timestamp 1 is 01:08-02:34 and timestamp 2 is 02:34-03:28, the whole waveform audio is cut according to these two periods to obtain two audio files, which are marked with label buttons, for example button 1 (01:08-02:34) and button 2 (02:34-03:28). When button 1 is clicked, the audio file for the period of timestamp 1 is played; when button 2 is clicked, the audio file for the period of timestamp 2 is played.
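Assuming the broadcast audio is stored as a PCM WAV file, the following sketch cuts it into clips along the preset timestamps using Python's standard wave module; the file names and the 01:08-02:34 / 02:34-03:28 periods mirror the example above and are otherwise arbitrary.

```python
# Hedged sketch: cut a WAV news broadcast into clips along preset timestamps.
import wave

def clip(src_path: str, dst_path: str, start_s: float, end_s: float) -> None:
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start_s * rate))                     # jump to clip start
        frames = src.readframes(int((end_s - start_s) * rate))
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)           # frame count in the header is fixed on close
        dst.writeframes(frames)

clip("broadcast.wav", "clip_button1.wav", 68, 154)    # 01:08 - 02:34
clip("broadcast.wav", "clip_button2.wav", 154, 208)   # 02:34 - 03:28
```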
If an audio file needs to be modified and edited, the corresponding audio is located quickly through its timestamp and the file is downloaded from the system; digital audio editing software is then used to apply routine waveform processing such as invert, mute, amplify, boost, attenuate, fade in, fade out and normalize, routine operations such as cut, copy, paste, multi-file merging and mixing, and filtering with notch, band-pass, high-pass, low-pass, high-frequency and FFT filters. Digital audio editing software mainly covers recording, mixing and post-effect processing. It is powerful digital audio editing software centred on audio processing that integrates sound recording, playback, editing, processing and conversion, offers the rich effects and editing functions needed for professional sound design, and supports all kinds of complex and fine-grained professional audio editing, including frequency equalization, effect processing and noise reduction. For clipping, the audio file of the corresponding timestamp is opened in the digital audio editing software, the file is segmented and then spliced, modified and otherwise processed, and the processed audio file is uploaded back to the system.
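A few of the routine waveform operations listed above (invert, amplify or attenuate, fade in, fade out) can be sketched directly on a mono floating-point signal; the gain, fade lengths and test tone below are illustrative only and do not represent any particular editing software.

```python
# Hedged sketch of basic waveform operations on a mono float signal.
import numpy as np

def amplify(signal: np.ndarray, gain: float) -> np.ndarray:
    return np.clip(signal * gain, -1.0, 1.0)     # boost (gain > 1) or attenuate (gain < 1)

def invert(signal: np.ndarray) -> np.ndarray:
    return -signal                               # phase inversion

def fade(signal: np.ndarray, fade_in: int, fade_out: int) -> np.ndarray:
    out = signal.copy()
    out[:fade_in] *= np.linspace(0.0, 1.0, fade_in)      # fade in
    out[-fade_out:] *= np.linspace(1.0, 0.0, fade_out)   # fade out
    return out

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of 440 Hz at 16 kHz
edited = fade(amplify(invert(tone), 0.8), fade_in=1600, fade_out=1600)
```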
In this embodiment, whether the emotional speech matches its emotion label is reviewed manually. The corresponding emotional speech is played after its button is clicked, and a reviewer judges the emotional colour of the speech and compares it with the emotion label bound to it. If the emotional colour of the speech is consistent with the bound emotion label, the review passes and the emotional speech is marked as news; if they are inconsistent, intelligent semantic analysis is performed on the situational dialogue again, the corresponding emotion label is regenerated, the waveform audio is re-synthesized, the audio clips are supplemented, and manual review is initiated once more.
This embodiment enriches the synthesis effects, improves realism and diversifies the speech effects, offering more possibilities for product application scenarios; being designed for full-scene broadcasting of financial news, it handles such content with greater ease.
The intelligent news broadcasting method in the embodiments of the present application has been described above. The intelligent news broadcasting apparatus in the embodiments of the present application is described below. Referring to FIG. 2, an embodiment of the intelligent news broadcasting apparatus includes:
a news text acquisition module 201, configured to acquire the news broadcast text to be processed;
a semantic analysis module 202, configured to input the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
a label generation module 203, configured to input the semantic vector into a preset semantic classification model for classification and generate the emotion label corresponding to each sentence in the news broadcast text;
an audio synthesis module 204, configured to input the news broadcast text and the emotion labels into a preset text-to-speech model for audio synthesis and output news broadcast audio with multiple emotions.
Optionally, the news text acquisition module 201 may be further specifically configured to:
acquire a semantic prediction training sample set and a semantic label set, and establish the matching relationship between samples and labels;
split the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
input the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
input the test sample set into the semantic prediction model for a model performance test; if the test result is good, model training ends, otherwise model training continues (a non-limiting sketch of this workflow is given below).
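In the sketch below, scikit-learn stands in for the preset neural network model; the random data, the 0.9 "good" threshold and the network size are placeholders rather than the semantic prediction training set of this application.

```python
# Hedged sketch: split the sample/label sets, train a stand-in network,
# and keep training until the test score is "good".
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X = np.random.rand(200, 32)             # placeholder semantic training samples
y = np.random.randint(0, 4, size=200)   # placeholder semantic labels matched to samples

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50, warm_start=True)
for _ in range(10):                          # continue training round by round
    model.fit(X_train, y_train)              # warm_start=True resumes from previous weights
    if model.score(X_test, y_test) >= 0.9:   # a "good" test result ends training
        break
```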
Optionally, the news text acquisition module 201 may be further specifically configured to:
acquire semantic classification samples, and add classification label information to the semantic classification samples;
initialize a preset decision tree model, and input the semantic classification samples and the corresponding classification label information into the decision tree model;
process the semantic classification samples through the decision tree model to obtain classification prediction results for the semantic classification samples;
optimize the parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, obtaining the semantic classification model (a non-limiting sketch of this step is given below).
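The corresponding sketch fits a decision tree on labelled semantic classification samples, with scikit-learn's DecisionTreeClassifier standing in for the preset decision tree model; the sample data and label names are placeholders.

```python
# Hedged sketch: fit a decision tree on labelled semantic classification samples
# and read out classification prediction results.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

samples = np.random.rand(300, 16)                               # semantic classification samples
labels = np.random.choice(["anger", "joy", "calm"], size=300)   # classification label information

tree_model = DecisionTreeClassifier(max_depth=5)   # initialise the stand-in decision tree model
tree_model.fit(samples, labels)                    # grow the tree on samples and labels
predictions = tree_model.predict(samples[:5])      # classification prediction results
print(predictions)
```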
Optionally, the semantic analysis module 202 may be further specifically configured to:
segment the news broadcast text into words to obtain a plurality of segmented words in word order;
input each segmented word into the feature recognition network in turn for feature extraction, and output the word vector and semantic weight corresponding to each segmented word;
input the word vectors into the word vector synthesis network, perform weighted fusion of the word vectors according to the semantic weights, and output the corresponding semantic vector (a non-limiting sketch of this fusion is given below).
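In the fusion sketch below, each word vector is scaled by its semantic weight and the scaled vectors are summed into one sentence-level semantic vector; the vectors and weights shown are illustrative values only.

```python
# Hedged sketch: weighted fusion of word vectors into a single semantic vector.
import numpy as np

word_vectors = np.array([[0.2, 0.1, 0.7],      # one row per segmented word
                         [0.9, 0.3, 0.1],
                         [0.4, 0.4, 0.2]])
semantic_weights = np.array([0.5, 0.3, 0.2])   # semantic weight of each word

semantic_vector = (semantic_weights[:, None] * word_vectors).sum(axis=0)
print(semantic_vector)                          # fused sentence-level semantic vector
```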
Optionally, the label generation module 203 may be further specifically configured to:
input the semantic vector into the feature extraction network for feature extraction and output a plurality of corresponding features;
input the plurality of corresponding features into the feature recognition network for feature testing and output test results;
input the test results into the classification network, assign the semantic vector to nodes according to the test results, and output the classification tree of the semantic vector;
generate, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
Optionally, the audio synthesis module 204 may be further specifically configured to:
divide the news broadcast text into sentences to obtain a plurality of sentences in order;
input each sentence and the emotion label corresponding to that sentence into the text preprocessing network for phoneme serialization and output a phoneme sequence;
input the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
input the prosody synthesis type information into the speech synthesis network for waveform generation and output news broadcast audio with multiple emotions.
FIG. 1 and FIG. 2 above describe the intelligent news broadcasting apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the intelligent news broadcasting device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 3 is a schematic structural diagram of an intelligent news broadcasting device provided by an embodiment of the present application. The intelligent news broadcasting device 300 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 310 (for example, one or more processors), a memory 320, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 333 or data 332. The memory 320 and the storage medium 330 may be transient storage or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the intelligent news broadcasting device 300. Furthermore, the processor 310 may be configured to communicate with the storage medium 330 and execute, on the intelligent news broadcasting device 300, the series of instruction operations in the storage medium 330.
The intelligent news broadcasting device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input/output interfaces 360, and/or one or more operating systems 331 such as Windows Server, Mac OS X, Unix, Linux or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 3 does not constitute a limitation on the intelligent news broadcasting device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application further provides an intelligent news broadcasting device. The intelligent news broadcasting device includes a memory and a processor, and computer-readable instructions are stored in the memory; when the computer-readable instructions are executed by the processor, the processor is caused to perform the steps of the intelligent news broadcasting method in the above embodiments.
The present application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to perform the steps of the intelligent news broadcasting method.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
If implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features therein; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. An intelligent news broadcasting method, wherein the intelligent news broadcasting method comprises:
    acquiring a news broadcast text to be processed;
    inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
    inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  2. The intelligent news broadcasting method according to claim 1, wherein before the acquiring the news broadcast text to be processed, the method further comprises:
    acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
    splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
    inputting the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
    inputting the test sample set into the semantic prediction model for a model performance test, wherein if the test result is good, model training ends, otherwise model training continues.
  3. The intelligent news broadcasting method according to claim 1 or 2, wherein the semantic prediction model comprises, in sequence, a feature recognition network and a word vector synthesis network, and the inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector comprises:
    performing word segmentation on the news broadcast text to obtain a plurality of segmented words in word order;
    inputting each segmented word into the feature recognition network in turn for feature extraction, and outputting a word vector and a semantic weight corresponding to each segmented word;
    inputting the word vectors into the word vector synthesis network, performing weighted fusion of the word vectors according to the semantic weights, and outputting the corresponding semantic vector.
  4. The intelligent news broadcasting method according to claim 1, wherein before the acquiring the news broadcast text to be processed, the method further comprises:
    acquiring semantic classification samples, and adding classification label information to the semantic classification samples;
    initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
    processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
    optimizing parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
  5. The intelligent news broadcasting method according to claim 1 or 4, wherein the semantic classification model comprises, in sequence, a feature extraction network, a feature recognition network and a classification network, and the inputting the semantic vector into a preset semantic classification model for classification and generating an emotion label corresponding to each sentence in the news broadcast text comprises:
    inputting the semantic vector into the feature extraction network for feature extraction, and outputting a plurality of corresponding features;
    inputting the plurality of corresponding features into the feature recognition network for feature testing, and outputting test results;
    inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting a classification tree of the semantic vector;
    generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
  6. The intelligent news broadcasting method according to claim 1, wherein the text-to-speech model comprises, in sequence, a text preprocessing network, a prosody prediction network and a speech synthesis network, and the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions comprises:
    dividing the news broadcast text into sentences to obtain a plurality of sentences in order;
    inputting each sentence and the emotion label corresponding to each sentence into the text preprocessing network for phoneme serialization, and outputting a phoneme sequence;
    inputting the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
    inputting the prosody synthesis type information into the speech synthesis network for waveform generation, and outputting news broadcast audio with multiple emotions.
  7. The intelligent news broadcasting method according to claim 1, wherein after the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions, the method further comprises:
    visually editing the news broadcast audio according to preset timestamps to obtain emotional speech under a plurality of different emotions;
    submitting each piece of emotional speech and the emotion label corresponding to each piece of emotional speech for manual review.
  8. An intelligent news broadcasting device, wherein the intelligent news broadcasting device comprises a memory, a processor, and an intelligent news broadcasting program stored on the memory and executable on the processor, and the processor implements the following steps when executing the intelligent news broadcasting program:
    acquiring a news broadcast text to be processed;
    inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
    inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  9. The intelligent news broadcasting device according to claim 8, wherein before the acquiring the news broadcast text to be processed, the processor executing the intelligent news broadcasting program further implements:
    acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
    splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
    inputting the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
    inputting the test sample set into the semantic prediction model for a model performance test, wherein if the test result is good, model training ends, otherwise model training continues.
  10. The intelligent news broadcasting device according to claim 8 or 9, wherein when the processor executes the intelligent news broadcasting program, the semantic prediction model comprises, in sequence, a feature recognition network and a word vector synthesis network, and the inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector comprises:
    performing word segmentation on the news broadcast text to obtain a plurality of segmented words in word order;
    inputting each segmented word into the feature recognition network in turn for feature extraction, and outputting a word vector and a semantic weight corresponding to each segmented word;
    inputting the word vectors into the word vector synthesis network, performing weighted fusion of the word vectors according to the semantic weights, and outputting the corresponding semantic vector.
  11. The intelligent news broadcasting device according to claim 8, wherein before the acquiring the news broadcast text to be processed, the processor executing the intelligent news broadcasting program further implements:
    acquiring semantic classification samples, and adding classification label information to the semantic classification samples;
    initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
    processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
    optimizing parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
  12. The intelligent news broadcasting device according to claim 8 or 11, wherein when the processor executes the intelligent news broadcasting program, the semantic classification model comprises, in sequence, a feature extraction network, a feature recognition network and a classification network, and the inputting the semantic vector into a preset semantic classification model for classification and generating an emotion label corresponding to each sentence in the news broadcast text comprises:
    inputting the semantic vector into the feature extraction network for feature extraction, and outputting a plurality of corresponding features;
    inputting the plurality of corresponding features into the feature recognition network for feature testing, and outputting test results;
    inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting a classification tree of the semantic vector;
    generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
  13. The intelligent news broadcasting device according to claim 8, wherein when the processor executes the intelligent news broadcasting program, the text-to-speech model comprises, in sequence, a text preprocessing network, a prosody prediction network and a speech synthesis network, and the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions comprises:
    dividing the news broadcast text into sentences to obtain a plurality of sentences in order;
    inputting each sentence and the emotion label corresponding to each sentence into the text preprocessing network for phoneme serialization, and outputting a phoneme sequence;
    inputting the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
    inputting the prosody synthesis type information into the speech synthesis network for waveform generation, and outputting news broadcast audio with multiple emotions.
  14. The intelligent news broadcasting device according to claim 8, wherein after the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions, the processor executing the intelligent news broadcasting program further implements:
    visually editing the news broadcast audio according to preset timestamps to obtain emotional speech under a plurality of different emotions;
    submitting each piece of emotional speech and the emotion label corresponding to each piece of emotional speech for manual review.
  15. A computer-readable storage medium storing computer instructions, wherein when the computer instructions are run on a computer, the computer is caused to perform the following steps:
    acquiring a news broadcast text to be processed;
    inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
    inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  16. The computer-readable storage medium according to claim 15, wherein before the acquiring the news broadcast text to be processed, the computer instructions, when executed, further cause the computer to perform:
    acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
    splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
    inputting the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
    inputting the test sample set into the semantic prediction model for a model performance test, wherein if the test result is good, model training ends, otherwise model training continues.
  17. The computer-readable storage medium according to claim 15 or 16, wherein when the computer instructions are executed, the semantic prediction model comprises, in sequence, a feature recognition network and a word vector synthesis network, and the inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector comprises:
    performing word segmentation on the news broadcast text to obtain a plurality of segmented words in word order;
    inputting each segmented word into the feature recognition network in turn for feature extraction, and outputting a word vector and a semantic weight corresponding to each segmented word;
    inputting the word vectors into the word vector synthesis network, performing weighted fusion of the word vectors according to the semantic weights, and outputting the corresponding semantic vector.
  18. The computer-readable storage medium according to claim 15, wherein before the acquiring the news broadcast text to be processed, the computer instructions, when executed, further cause the computer to perform:
    acquiring semantic classification samples, and adding classification label information to the semantic classification samples;
    initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
    processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
    optimizing parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
  19. The computer-readable storage medium according to claim 15 or 18, wherein when the computer instructions are executed, the semantic classification model comprises, in sequence, a feature extraction network, a feature recognition network and a classification network, and the inputting the semantic vector into a preset semantic classification model for classification and generating an emotion label corresponding to each sentence in the news broadcast text comprises:
    inputting the semantic vector into the feature extraction network for feature extraction, and outputting a plurality of corresponding features;
    inputting the plurality of corresponding features into the feature recognition network for feature testing, and outputting test results;
    inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting a classification tree of the semantic vector;
    generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
  20. An intelligent news broadcasting apparatus, wherein the intelligent news broadcasting apparatus comprises:
    a news text acquisition module, configured to acquire a news broadcast text to be processed;
    a semantic analysis module, configured to input the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    a label generation module, configured to input the semantic vector into a preset semantic classification model for classification, and generate an emotion label corresponding to each sentence in the news broadcast text;
    an audio synthesis module, configured to input the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and output news broadcast audio with multiple emotions.
PCT/CN2021/084290 2020-12-10 2021-03-31 Intelligent news broadcasting method, apparatus and device, and storage medium WO2022121181A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011432581.8 2020-12-10
CN202011432581.8A CN112541078A (en) 2020-12-10 2020-12-10 Intelligent news broadcasting method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022121181A1 true WO2022121181A1 (en) 2022-06-16

Family

ID=75019847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084290 WO2022121181A1 (en) 2020-12-10 2021-03-31 Intelligent news broadcasting method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112541078A (en)
WO (1) WO2022121181A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033701A (en) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 Text vector generation model training method, text classification method and related device
CN115130613A (en) * 2022-07-26 2022-09-30 西北工业大学 False news identification model construction method, false news identification method and device
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
CN115827854A (en) * 2022-12-28 2023-03-21 数据堂(北京)科技股份有限公司 Voice abstract generation model training method, voice abstract generation method and device
CN116166827A (en) * 2023-04-24 2023-05-26 北京百度网讯科技有限公司 Training of semantic tag extraction model and semantic tag extraction method and device
CN117558259A (en) * 2023-11-22 2024-02-13 北京风平智能科技有限公司 Digital man broadcasting style control method and device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541078A (en) * 2020-12-10 2021-03-23 平安科技(深圳)有限公司 Intelligent news broadcasting method, device, equipment and storage medium
CN113838452B (en) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
CN113850083A (en) * 2021-08-17 2021-12-28 北京百度网讯科技有限公司 Method, device and equipment for determining broadcast style and computer storage medium
CN113761940B (en) * 2021-09-09 2023-08-11 杭州隆埠科技有限公司 News main body judging method, equipment and computer readable medium
CN115083428B (en) * 2022-05-30 2023-05-30 湖南中周至尚信息技术有限公司 Voice model recognition device for news broadcasting assistance and control method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169430A (en) * 2017-05-02 2017-09-15 哈尔滨工业大学深圳研究生院 Reading environment audio strengthening system and method based on image procossing semantic analysis
CN110276076A (en) * 2019-06-25 2019-09-24 北京奇艺世纪科技有限公司 A kind of text mood analysis method, device and equipment
CN110941954A (en) * 2019-12-04 2020-03-31 深圳追一科技有限公司 Text broadcasting method and device, electronic equipment and storage medium
CN111128118A (en) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 Speech synthesis method, related device and readable storage medium
CN112541078A (en) * 2020-12-10 2021-03-23 平安科技(深圳)有限公司 Intelligent news broadcasting method, device, equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130613A (en) * 2022-07-26 2022-09-30 西北工业大学 False news identification model construction method, false news identification method and device
CN115130613B (en) * 2022-07-26 2024-03-15 西北工业大学 False news identification model construction method, false news identification method and device
CN115033701A (en) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 Text vector generation model training method, text classification method and related device
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
CN115827854A (en) * 2022-12-28 2023-03-21 数据堂(北京)科技股份有限公司 Voice abstract generation model training method, voice abstract generation method and device
CN115827854B (en) * 2022-12-28 2023-08-11 数据堂(北京)科技股份有限公司 Speech abstract generation model training method, speech abstract generation method and device
CN116166827A (en) * 2023-04-24 2023-05-26 北京百度网讯科技有限公司 Training of semantic tag extraction model and semantic tag extraction method and device
CN116166827B (en) * 2023-04-24 2023-12-15 北京百度网讯科技有限公司 Training of semantic tag extraction model and semantic tag extraction method and device
CN117558259A (en) * 2023-11-22 2024-02-13 北京风平智能科技有限公司 Digital man broadcasting style control method and device

Also Published As

Publication number Publication date
CN112541078A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
WO2022121181A1 (en) Intelligent news broadcasting method, apparatus and device, and storage medium
Sun End-to-end speech emotion recognition with gender information
WO2022048405A1 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
KR20230034423A (en) 2-level speech rhyme transmission
US10453434B1 (en) System for synthesizing sounds from prototypes
Haag et al. Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis
WO2022184055A1 (en) Speech playing method and apparatus for article, and device, storage medium and program product
Casale et al. Multistyle classification of speech under stress using feature subset selection based on genetic algorithms
JP2021530726A (en) Methods and systems for creating object-based audio content
Mu et al. Review of end-to-end speech synthesis technology based on deep learning
CN110782875B (en) Voice rhythm processing method and device based on artificial intelligence
Wang et al. Comic-guided speech synthesis
Wenner et al. Scalable music: Automatic music retargeting and synthesis
CN115147521A (en) Method for generating character expression animation based on artificial intelligence semantic analysis
US20190088258A1 (en) Voice recognition device, voice recognition method, and computer program product
CN113178182A (en) Information processing method, information processing device, electronic equipment and storage medium
CN111402919B (en) Method for identifying style of playing cavity based on multi-scale and multi-view
Gamage et al. Modeling variable length phoneme sequences—A step towards linguistic information for speech emotion recognition in wider world
Tran et al. Naturalness improvement of vietnamese text-to-speech system using diffusion probabilistic modelling and unsupervised data enrichment
Anumanchipalli Intra-lingual and cross-lingual prosody modelling
Chen et al. A new learning scheme of emotion recognition from speech by using mean fourier parameters
Tong Speech to text with emoji
Tu et al. Contextual expressive text-to-speech
US20230386475A1 (en) Systems and methods of text to audio conversion
Magdin et al. Case study of features extraction and real time classification of emotion from speech on the basis with using neural nets

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901897

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901897

Country of ref document: EP

Kind code of ref document: A1