WO2022121181A1 - Intelligent news broadcasting method, apparatus and device, and storage medium

Intelligent news broadcasting method, apparatus and device, and storage medium

Info

Publication number
WO2022121181A1
Authority
WO
WIPO (PCT)
Prior art keywords
semantic
classification
model
text
news
Prior art date
Application number
PCT/CN2021/084290
Other languages
French (fr)
Chinese (zh)
Inventor
苏雪琦
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022121181A1 publication Critical patent/WO2022121181A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to an intelligent news broadcasting method, apparatus, device and storage medium.
  • Social media has enriched news formats, and financial news has evolved from the traditional news anchor and radio anchor models into engaging forms better suited to new media.
  • financial news and popular-science content in short-video and audio-station scenarios emerge constantly; full-scene development has become the main trend in news media.
  • the core of full-scene broadcasting is support for multi-style speech synthesis, and in the diversified scenarios of the new-media era, emotion synthesis is the key to success.
  • intelligent speech synthesis serves multiple purposes: given input text, it synthesizes speech adapted to the styles of various platforms, reducing dependence on voice actors and improving production efficiency.
  • the main purpose of this application is to solve the problem that news broadcast audio with emotion currently cannot be synthesized.
  • a first aspect of the present application provides a method for intelligent news broadcasting, including:
  • a second aspect of the present application provides a news intelligent broadcasting device, which includes a memory, a processor, and a news intelligent broadcasting program stored in the memory and executable on the processor; when the processor executes the news intelligent broadcasting program, the following steps are implemented:
  • a third aspect of the present application provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer is caused to perform the following steps:
  • a fourth aspect of the present application provides a news intelligent broadcasting device, and the news intelligent broadcasting device includes:
  • the news text acquisition module is used to acquire the news broadcast text to be processed
  • a semantic analysis module for inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector
  • the label generation module is configured to input the semantic vector into a preset semantic classification model for classification, and generate emotional labels corresponding to each sentence in the news broadcast text;
  • the audio synthesis module is used for inputting the news broadcast text and each emotion tag into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  • the semantic vector obtained by semantic prediction is classified, multiple emotion labels are generated according to the classification results, and finally the news broadcast text and the corresponding emotion labels are input into a preset text-to-speech model for audio synthesis.
  • the present application can realize the synthesis of news broadcast audio with emotions.
  • FIG. 1 is a schematic diagram of an embodiment of a method for intelligently broadcasting news in an embodiment of the application
  • FIG. 2 is a schematic diagram of an embodiment of a news intelligent broadcasting device in an embodiment of the application
  • FIG. 3 is a schematic diagram of an embodiment of a news intelligent broadcasting device in an embodiment of the present application.
  • the embodiments of the present application provide a method, device, device and storage medium for intelligent news broadcasting, which can enrich the synthesis effect of emotions and improve the simulation degree of speech.
  • an embodiment of the method for intelligently broadcasting news in the embodiment of the present application includes:
  • the business personnel upload the news scenario text to the script library, and the management personnel can manage the script on the script management page.
  • the management page is divided into two modules, "script inventory" and "role management". Clicking "script inventory" retrieves the scripts uploaded to the script library; selecting a script allows the script content and the script description information to be viewed.
  • the script description information includes the script broadcast type, the script scene and the script word count, for example "single-person broadcast", "TV news scene" and "694 Chinese characters".
  • the news broadcast text contains timestamps. A timestamp is set by selecting text in the script and marking it as a key sentence of the script dialogue; key sentences are displayed in red font.
  • Set the timestamp corresponding to the key statement by typing the time in the timestamp setting field on the script viewing page.
  • the timestamp setting column is divided into start time and end time. All the timestamps set in the script text are displayed in the annotation history. By clicking the timestamp in the annotation history, you can quickly locate the key sentences in the corresponding script text.
  • the function of time stamp setting is to intervene the time position of different segments in the audio after the waveform audio is generated, so as to facilitate the subsequent editing of the business personnel.
  • "Character Management" will upload biography or non-script dialogue independently according to the script role setting, support matching with recorded script audio, and realize audition and review by role.
  • before the above step 101, the method further includes:
  • test sample set is input into the semantic prediction model for model performance test, if the test result is good, the model training ends, otherwise, the model training continues.
  • a large number of training samples are used to train the semantic prediction ability of the neural network model; the training samples include training texts and semantic labels, one training text can correspond to multiple semantic labels, and each training text with its corresponding
  • semantic label forms one training sample. The collected samples are split, one part serving as material for model training and the other part as test material for checking the training effect.
  • the ratio can be set to 9:1 (training samples : test samples): 90% of the samples are used for training to obtain the semantic prediction model, and the remaining 10% are used to verify its performance.
  • the "good" rating can be defined by a prediction success ratio; for example, if a 60% success ratio counts as good, the semantic prediction model performs semantic prediction 10 times, each prediction result is compared with the corresponding semantic label, and the "good" rating is achieved when 6 of the 10 predictions are accurate. If the "good" rating is not reached, the training parameters are re-adjusted and model training continues.
  • before the above step 101, the method further includes:
  • the parameters of the decision tree model are optimized until the decision tree model converges to obtain a semantic classification model.
  • a decision tree model is used for classification, and the decision tree consists of nodes and directed edges.
  • there are two types of nodes, internal nodes and leaf nodes: an internal node represents a feature or attribute, and a leaf node represents a class.
  • a decision tree contains a root node, several internal nodes and several leaf nodes. Leaf nodes correspond to decision results, and each other node corresponds to an attribute test. The sample set contained in each node is divided into child nodes according to the result of the attribute test, the root node contains the full set of samples, and the path from the root node to each leaf node corresponds to a decision test sequence.
  • the purpose of decision tree learning is to generate a decision tree with strong generalization ability, that is, the ability to deal with unseen examples.
  • constructing a decision tree model from the given semantic classification training samples, so that it can classify instances correctly, essentially means inducing a set of classification rules from the training data set. Whether the parameters need to be optimized is determined by computing a loss function: the smaller the loss, the better the generated decision tree. The loss function is usually a regularized maximum likelihood function.
  • the semantic prediction model adopted in this embodiment is the BERT model (Bidirectional Encoder Representation from Transformers), that is, the Encoder of the bidirectional Transformer, because the decoder cannot obtain the information to be predicted.
  • the main innovations of the model are in the pre-train method, which uses Masked LM and Next Sentence Prediction to capture word- and sentence-level representations respectively.
  • intelligent semantic analysis is performed based on the BERT model to determine whether the text carries emotions such as joy, anger, sorrow or happiness.
  • before the above step 101, the method further includes:
  • the word vectors are input into the word vector synthesis network, and the word vectors are weighted and fused according to the semantic weights to output corresponding semantic vectors.
  • word segmentation analyses the preset word segmentation structure of the news broadcast text to obtain multiple word segments with word order, for example a first, a second and a third word segment. The feature recognition network then extracts features from each segment, outputting the text vector α with weight 3 for the first segment, the text vector β with weight 4 for the second segment, and the text vector γ with weight 5 for the third segment; finally the word vector synthesis network fuses these word vectors
  • into one semantic vector, whose weight is computed by a weighting algorithm, i.e. the sum of the weights of vectors α, β and γ is 3 + 4 + 5 = 12. The semantic vector is then input into a preset semantic classification model for classification, and emotion labels corresponding to each sentence in the news broadcast text are generated;
  • each type of semantic vector corresponds to an emotion label, and the two are in a one-to-one relationship, and the semantic classification in this example is implemented by a classification model, which can classify objects with common characteristics.
  • common classification models include Naive Bayes;
  • the two most widely used classification models are the decision tree model and the Naive Bayes model (NBM).
  • the Naive Bayes Classifier or NBC originated from classical mathematical theory, has a solid mathematical foundation, and has stable classification efficiency;
  • the multi-layer perceptron (MLP) is a fully connected neural network in which, except for the input layer, all layers use the sigmoid activation function; the BP algorithm is used to learn the weights, with the signal propagated forward and the error propagated backward.
  • in the traditional boosting algorithm, all samples initially have the same weight; subsequently the weights of misclassified samples are continually increased and the weights of correctly classified samples are decreased.
  • the above step 103 further includes:
  • an emotion label corresponding to each sentence in the news broadcast text is generated.
  • a classification tree is a tree structure that describes the classification of instances.
  • one feature of the semantic classification sample is tested, and according to the test result the sample is assigned to one of the child nodes.
  • each child node corresponds to a value of the feature.
  • the semantic classification samples are tested and assigned one by one recursively until a leaf node is reached.
  • the semantic classification samples are divided into leaf node classes. Each leaf node corresponds to a class of semantic classification samples, and a corresponding sentiment label is generated based on the semantic classification samples of each class.
  • a well-regarded speech synthesis model is WORLD, an open-source speech synthesis system based on the C language. Speech synthesis mainly includes two approaches, waveform concatenation and parametric synthesis; WORLD is a vocoder-based parametric synthesis method whose advantage over STRAIGHT is reduced computational complexity, so it can be applied to real-time speech synthesis. STRAIGHT is not an open-source system, and the WORLD paper shows that WORLD leads STRAIGHT in both synthesized audio quality and synthesis speed. Neural-network-based end-to-end text-to-speech (TTS) technology has developed rapidly; compared with concatenative synthesis and statistical parametric synthesis in traditional speech synthesis, the voice generated by end-to-end speech synthesis usually has better naturalness. However, this technology still faces the following problems:
  • the end-to-end model usually generates a mel-spectrogram in an autoregressive manner and then synthesizes speech through a vocoder; the mel-spectrogram of a piece of speech can contain hundreds to thousands of frames, so synthesis is slow;
  • the synthesized speech is less stable: the end-to-end model usually uses an encoder-attention-decoder mechanism for autoregressive generation, and error propagation in sequence generation and inaccurate attention alignment lead to repeated or missing words;
  • the autoregressive neural network model automatically decides the length of the generated speech and cannot explicitly control the speech rate or prosodic pauses of the generated speech.
  • the Machine Learning Group of Microsoft Research Asia and the speech team of the Microsoft (Asia) Internet Engineering Institute, together with Zhejiang University, proposed FastSpeech, a new Transformer-based feedforward network that can generate high-quality mel-spectrograms and then synthesize the sound in parallel with the help of a vocoder.
  • the preset text-to-speech model adopts the Fast Speech model
  • the full-scene broadcast is a high-fidelity presentation of speech in different scenes.
  • the key lies in prosodic information such as accent pauses, breath strength, pitch strength and emotional fluctuations.
  • this embodiment adopts the Fast Speech model as the direction of the underlying technology for productization.
  • Fast Speech speeds up mel-spectrogram generation by nearly 270 times and end-to-end speech synthesis by 38 times, and its synthesis speed on a single GPU is 30 times faster than real time.
  • Fast Speech removes the attention mechanism, reduces the synthesis failure rate, and can effectively avoid the loss caused by the failure of long text synthesis;
  • Fast Speech is a non-autoregressive model that computes the mel-spectrogram frames of all characters in parallel,
  • avoiding the synthesis speed limit imposed by the recurrent mechanism, although this also weakens the prosodic correlation between frames and reduces the expressiveness of the sound.
  • Fast Speech adopts a new feedforward Transformer network architecture, abandoning the traditional encoder-attention-decoder mechanism. Its main module adopts Transformer's self-attention mechanism (Self-Attention) and one-dimensional convolution network (1D Convolution).
  • the feedforward Transformer stacks multiple FFT blocks for Phoneme to Mel spectrum transformation, with N FFT blocks each on the phoneme side and the Mel spectrum side.
  • there is a length regulator (Length Regulator) in the middle which is used to adjust the length difference between the phoneme sequence and the mel spectrum sequence.
  • the above step 104 further includes:
  • the news broadcast text is divided into sentences to obtain a plurality of sentences with word order;
  • the text is converted into a phoneme sequence, and information such as the start and end time and frequency change of each phoneme is marked.
  • the phoneme sequence provides a reference for prosodic information such as pitch and duration in the Fast Speech model and determines the correct emotion synthesis type, for example situational dialogue A with emotion label a.
  • the Fast Speech model performs prosody prediction according to situational dialogue A and emotion label a;
  • if the prediction result is "anger",
  • the prosody synthesis type information is determined to be "anger type";
  • the prosody synthesis type information "anger type" is input into the speech synthesis network, and the speech synthesis network parses the input information into parameters;
  • the vocoder in the speech synthesis network then synthesizes speech according to the result of the parameter parsing.
  • the method further includes:
  • the audio of the news broadcast is visually edited to obtain emotional voices under various emotions
  • the dialogue of the waveform audio is visualized according to the time stamp, and the dialogue can be auditioned and edited after quickly locating the dialogue.
  • timestamp 1 is 01:08~02:34,
  • timestamp 2 is 02:34~03:28.
  • the entire waveform audio is trimmed to obtain two audio files, which are labelled with buttons, for example button 1 (01:08~02:34) and button 2 (02:34~03:28); when button 1 is clicked, the audio file corresponding to timestamp 1 is played.
  • when button 2 is clicked, the audio file for the time period of timestamp 2 is played.
  • Digital audio production editing software mainly includes recording, sound mixing, post-effect processing, etc.
  • whether the emotional voice matches the emotion label is checked manually: the corresponding emotional voice is played after the button is clicked, and the staff judge its emotional color and compare it with the emotion label bound to that voice. If the emotional color of the emotional voice is consistent with the bound emotion label, the review is passed and the emotional voice is kept as news broadcast audio; if it is inconsistent, intelligent semantic analysis of the situational dialogue is performed again, corresponding emotion labels are regenerated, waveform audio is re-synthesized, the audio clip is supplemented, and manual review is initiated again.
  • the synthesis effect is enriched, the degree of realism is improved, the voice effects are diversified, and more possibilities are provided for product application scenarios; for the full-scene broadcasting of financial news, production is more controllable.
  • An embodiment of the intelligent news broadcasting device in the embodiment of the present application includes:
  • News text acquisition module 201 used to acquire the news broadcast text to be processed
  • Semantic analysis module 202 for inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector
  • the label generation module 203 is configured to input the semantic vector into a preset semantic classification model for classification, and generate emotional labels corresponding to each sentence in the news broadcast text;
  • the audio synthesis module 204 is configured to input the news broadcast text and each emotion tag into a preset text-to-speech model for audio synthesis, and output news broadcast audio with multiple emotions.
  • the news text acquisition module 201 can also be specifically used for:
  • test sample set is input into the semantic prediction model for model performance test, if the test result is good, the model training ends, otherwise, the model training continues.
  • the news text acquisition module 201 can also be specifically used for:
  • the parameters of the decision tree model are optimized until the decision tree model converges to obtain a semantic classification model.
  • the semantic analysis module 202 can also be specifically used for:
  • the word vectors are input into the word vector synthesis network, and the word vectors are weighted and fused according to the semantic weights to output corresponding semantic vectors.
  • the label generation module 203 can also be specifically used for:
  • an emotion label corresponding to each sentence in the news broadcast text is generated.
  • the audio synthesis module 204 can also be specifically used for:
  • the news broadcast text is divided into sentences to obtain a plurality of sentences with word order;
  • FIG. 3 is a schematic structural diagram of a news intelligent broadcasting device provided by an embodiment of the present application.
  • the news intelligent broadcasting device 300 may vary greatly with configuration or performance, and may include one or more processors (central processing units, CPU) 310 (e.g., one or more processors), a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) storing application programs 333 or data 332. The memory 320 and the storage medium 330 may be short-term storage or persistent storage.
  • the program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the news intelligent broadcasting device 300 .
  • the processor 310 may be configured to communicate with the storage medium 330 to execute a series of instruction operations in the storage medium 330 on the news intelligent broadcasting device 300 .
  • the news intelligent broadcasting device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input and output interfaces 360, and/or one or more operating systems 331, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • the present application also provides a news intelligent broadcasting device, the news intelligent broadcasting device includes a memory and a processor, the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the processor, the processor is made to execute the above embodiments.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are executed on the computer, make the computer execute the steps of the method for intelligent news broadcasting.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of artificial intelligence. Disclosed are an intelligent news broadcasting method, apparatus and device, and a storage medium. The intelligent news broadcasting method comprises: acquiring news broadcasting text to be processed; inputting the news broadcasting text into a preset semantic prediction model for semantic prediction, so as to obtain a corresponding semantic vector; inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion tag corresponding to each sentence in the news broadcasting text; and inputting the news broadcasting text and each emotion tag into a preset text-to-speech model for audio synthesis, and outputting a news broadcasting audio with a plurality of emotions. According to the present application, a news broadcasting audio with emotions can be synthesized.

Description

Intelligent news broadcasting method, apparatus and device, and storage medium
This application claims priority to the Chinese patent application No. 202011432581.8, entitled "News Intelligent Broadcasting Method, Apparatus, Device and Storage Medium", filed with the China Patent Office on December 10, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to an intelligent news broadcasting method, apparatus, device and storage medium.
Background Art
Social media has enriched news formats, and financial news has evolved from the traditional news anchor and radio anchor models into engaging forms better suited to new media. For example, financial news and popular-science content in short-video and audio-station scenarios emerge constantly, and full-scene development has become the main trend in news media. The core of full-scene broadcasting is support for multi-style speech synthesis, and in the diversified scenarios of the new-media era, emotion synthesis is the key to success. Intelligent speech synthesis serves multiple purposes: given input text, it synthesizes speech adapted to the styles of various platforms, reducing dependence on voice actors and improving production efficiency.
In the prior art, the inventors realized that there are few technical achievements regarding the emotional expressiveness of voices, and the emotional part of speech synthesis has not yet reached realistic anthropomorphism, so news broadcast audio with emotion currently cannot be synthesized.
Summary of the Invention
The main purpose of this application is to solve the problem that news broadcast audio with emotion currently cannot be synthesized.
A first aspect of the present application provides an intelligent news broadcasting method, including:
acquiring news broadcast text to be processed;
inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
A second aspect of the present application provides a news intelligent broadcasting device, which includes a memory, a processor, and a news intelligent broadcasting program stored in the memory and executable on the processor; when the processor executes the news intelligent broadcasting program, the following steps are implemented:
acquiring news broadcast text to be processed;
inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
A third aspect of the present application provides a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer is caused to perform the following steps:
acquiring news broadcast text to be processed;
inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
A fourth aspect of the present application provides a news intelligent broadcasting apparatus, which includes:
a news text acquisition module, used to acquire the news broadcast text to be processed;
a semantic analysis module, used to input the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
a label generation module, used to input the semantic vector into a preset semantic classification model for classification and to generate an emotion label corresponding to each sentence in the news broadcast text;
an audio synthesis module, used to input the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and to output news broadcast audio with multiple emotions.
In the technical solution provided by this application, semantic prediction is performed on the acquired news broadcast text, the resulting semantic vector is classified, multiple emotion labels are generated according to the classification results, and finally the news broadcast text and the corresponding emotion labels are input into a preset text-to-speech model for audio synthesis. The present application can thus synthesize news broadcast audio with emotion.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an embodiment of the intelligent news broadcasting method in an embodiment of the present application;
FIG. 2 is a schematic diagram of an embodiment of the news intelligent broadcasting apparatus in an embodiment of the present application;
FIG. 3 is a schematic diagram of an embodiment of the news intelligent broadcasting device in an embodiment of the present application.
Detailed Description
The embodiments of the present application provide an intelligent news broadcasting method, apparatus, device and storage medium, which can enrich the synthesis of emotions and improve the realism of the speech.
The terms "first", "second", "third", "fourth", etc. (if any) in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used can be interchanged under appropriate circumstances, so that the embodiments described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprising" or "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products or devices.
For ease of understanding, the specific flow of the embodiment of the present application is described below. Referring to FIG. 1, an embodiment of the intelligent news broadcasting method in the embodiment of the present application includes:
101. Acquire the news broadcast text to be processed.
In this embodiment, business personnel upload the news scenario text to a script library, and managers can manage the scripts on a script management page. The management page is divided into two modules, "script inventory" and "role management". Clicking "script inventory" retrieves the scripts uploaded to the script library; selecting a script allows the script content and its description information to be viewed. The script description information includes the script broadcast type, the script scene and the script word count, for example "single-person broadcast", "TV news scene" and "694 Chinese characters".
The news broadcast text contains timestamps. A timestamp is set by selecting text in the script and marking it as a key sentence of the script dialogue; key sentences are displayed in red font. The timestamp corresponding to a key sentence is set by typing the time into the timestamp setting field on the script viewing page. The timestamp setting field is divided into a start time and an end time. All timestamps set in the script text are displayed in the annotation history, and clicking a timestamp in the annotation history quickly locates the corresponding key sentence in the script text. The purpose of timestamp setting is to control the time position of different segments in the audio after the waveform audio is generated, so as to facilitate subsequent editing by business personnel. "Role management" allows character biographies or off-script dialogue to be uploaded separately according to the script's role settings, supports matching with recorded script audio, and enables audition and review by role.
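For illustration only, the timestamp annotation described above can be viewed as a start/end record attached to each key sentence. The following Python sketch is a non-authoritative example (the class and field names are hypothetical, not part of the application) of one possible representation, together with a lookup that mimics clicking an entry in the annotation history:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TimestampAnnotation:
        """One key-sentence annotation in the script text (hypothetical structure)."""
        sentence: str   # the selected key sentence, displayed in red on the page
        start: float    # start time in seconds, from the "start time" field
        end: float      # end time in seconds, from the "end time" field

    def locate_key_sentence(script: str, annotations: List[TimestampAnnotation], start: float) -> int:
        """Return the character offset of the key sentence whose start time matches,
        mimicking a click on that timestamp in the annotation history."""
        for annotation in annotations:
            if abs(annotation.start - start) < 1e-6:
                return script.find(annotation.sentence)
        return -1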
In this embodiment, before the above step 101, the method further includes:
acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
inputting the training sample set into a preset neural network model for semantic prediction training to obtain a semantic prediction model;
inputting the test sample set into the semantic prediction model for a model performance test; if the test result is good, model training ends, otherwise model training continues.
In this embodiment, a large number of training samples are used to train the semantic prediction ability of the neural network model. The training samples include training texts and semantic labels; one training text can correspond to multiple semantic labels, and each training text with its corresponding semantic label forms one training sample. The collected samples are split: one part serves as material for model training and the other part as test material for checking the training effect. The ratio can be set to 9:1 (training samples : test samples), i.e. 90% of the samples are used for training to obtain the semantic prediction model and the remaining 10% are used to verify its performance. If the test result reaches a preset "good" rating, model training ends. The "good" rating can be defined by a prediction success ratio; for example, if a 60% success ratio counts as good, the semantic prediction model performs semantic prediction 10 times, each prediction result is compared with the corresponding semantic label, and the "good" rating is achieved when 6 of the 10 predictions are accurate. If the rating is not reached, the training parameters are re-adjusted and model training continues.
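A minimal sketch of the 9:1 split and the "good" rating check described above, assuming a trained model object with a predict method (the model interface here is a placeholder, not the application's actual implementation):

    import random

    def split_samples(samples, ratio=0.9, seed=0):
        """Split (training_text, semantic_label) pairs into 90% training and 10% test material."""
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        cut = int(len(samples) * ratio)
        return samples[:cut], samples[cut:]

    def is_good(model, test_samples, threshold=0.6):
        """Rate the model "good" when the prediction success ratio reaches the threshold,
        e.g. 6 accurate predictions out of 10 test samples."""
        correct = sum(1 for text, label in test_samples if model.predict(text) == label)
        return correct / max(len(test_samples), 1) >= threshold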
In this embodiment, before the above step 101, the method further includes:
acquiring semantic classification samples and adding classification label information to the semantic classification samples;
initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
optimizing the parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
In this embodiment, a decision tree model is used for classification. A decision tree consists of nodes and directed edges. There are two types of nodes, internal nodes and leaf nodes: an internal node represents a feature or attribute, and a leaf node represents a class. In general, a decision tree contains one root node, several internal nodes and several leaf nodes. Leaf nodes correspond to decision results, and every other node corresponds to an attribute test. The sample set contained in each node is divided into child nodes according to the result of the attribute test, the root node contains the full sample set, and the path from the root node to each leaf node corresponds to a sequence of decision tests. The purpose of decision tree learning is to produce a decision tree with strong generalization ability, that is, a strong ability to handle unseen examples.
Constructing a decision tree model from the given semantic classification training samples, so that it can classify instances correctly, essentially means inducing a set of classification rules from the training data set. Whether the parameters need to be optimized is determined by computing a loss function: the smaller the loss, the better the generated decision tree. The loss function is usually a regularized maximum likelihood function.
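The application does not name a particular implementation of the decision tree. Purely as an illustration, a pruned decision tree classifier could be fitted with scikit-learn, where the pruning strength plays a role analogous to the regularization term in the loss:

    from sklearn.tree import DecisionTreeClassifier

    def train_semantic_classifier(features, labels, ccp_alpha=0.001):
        """Fit a decision tree on semantic classification samples (feature vectors)
        and their classification label information. ccp_alpha controls cost-complexity
        pruning, trading training fit for generalization (an illustrative stand-in for
        the regularized maximum likelihood criterion described above)."""
        classifier = DecisionTreeClassifier(criterion="entropy", ccp_alpha=ccp_alpha)
        classifier.fit(features, labels)
        return classifier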
102. Input the news broadcast text into the preset semantic prediction model for semantic prediction to obtain the corresponding semantic vector.
The semantic prediction model adopted in this embodiment is the BERT model (Bidirectional Encoder Representations from Transformers), i.e. the encoder of a bidirectional Transformer (a decoder cannot see the information to be predicted). The main innovations of the model lie in its pre-training method, which uses Masked LM and Next Sentence Prediction to capture word-level and sentence-level representations respectively. In this embodiment, intelligent semantic analysis is performed based on the BERT model to determine whether the text carries emotions such as joy, anger, sorrow or happiness.
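A minimal sketch of extracting a sentence-level semantic vector with a BERT encoder, assuming the Hugging Face transformers library and the bert-base-chinese checkpoint (neither is specified by the application):

    import torch
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
    encoder = BertModel.from_pretrained("bert-base-chinese")

    def semantic_vector(sentence: str) -> torch.Tensor:
        """Encode one newscast sentence and use the [CLS] hidden state as its semantic vector."""
        inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            outputs = encoder(**inputs)
        return outputs.last_hidden_state[:, 0, :].squeeze(0)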
In this embodiment, before the above step 101, the method further includes:
performing word segmentation on the news broadcast text to obtain multiple word segments with word order;
inputting each word segment in turn into the feature recognition network for feature extraction, and outputting the word vector and semantic weight corresponding to each word segment;
inputting the word vectors into the word vector synthesis network, weighting and fusing the word vectors according to their semantic weights, and outputting the corresponding semantic vector.
In this embodiment, word segmentation analyses the preset word segmentation structure of the news broadcast text to obtain multiple word segments with word order, for example a first, a second and a third word segment. The feature recognition network then extracts features from each segment, outputting the text vector α with weight 3 for the first segment, the text vector β with weight 4 for the second segment, and the text vector γ with weight 5 for the third segment. Finally, the word vector synthesis network fuses these word vectors into one semantic vector, whose weight is computed by a weighting algorithm, i.e. the sum of the weights of vectors α, β and γ is 3 + 4 + 5 = 12. The semantic vector is then input into the preset semantic classification model for classification, and emotion labels corresponding to each sentence in the news broadcast text are generated.
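A minimal sketch of the weighted fusion in the example above, with random stand-ins for the word vectors α, β and γ (the 768-dimensional size is an assumption, not stated by the application):

    import torch

    def fuse_word_vectors(vectors, weights):
        """Weighted fusion of per-word vectors into one semantic vector.
        With weights 3, 4 and 5 the normalizer is 3 + 4 + 5 = 12, as in the example."""
        total = float(sum(weights))                            # e.g. 12
        stacked = torch.stack(vectors)                         # (num_words, dim)
        w = torch.tensor(weights, dtype=stacked.dtype).unsqueeze(1) / total
        return (stacked * w).sum(dim=0)                        # (dim,)

    alpha, beta, gamma = (torch.randn(768) for _ in range(3))  # stand-in word vectors
    semantic_vec = fuse_word_vectors([alpha, beta, gamma], [3, 4, 5])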
103. Input the semantic vector into the preset semantic classification model for classification, and generate the emotion label corresponding to each sentence in the news broadcast text.
In this embodiment, each type of semantic vector corresponds to one emotion label, in a one-to-one relationship. Semantic classification in this example is implemented by a classification model, which can group objects with common characteristics. Common classification models include Naive Bayes; the two most widely used are the decision tree model and the Naive Bayes model (NBM). Compared with the decision tree model, the Naive Bayes classifier (NBC) originates from classical mathematical theory, has a solid mathematical foundation and offers stable classification efficiency.
Logistic regression uses the function y = sigmoid(wx) and divides classes according to a probability threshold; SVM assumes there is a hyperplane that can separate all samples. The multi-layer perceptron (MLP) is a fully connected neural network in which, except for the input layer, all layers use the sigmoid activation function; the BP algorithm is used to learn the weights, with the signal propagated forward and the error propagated backward. In the traditional boosting algorithm, all samples initially have the same weight; subsequently the weights of misclassified samples are continually increased and the weights of correctly classified samples are decreased.
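A one-line illustration of the logistic-regression rule y = sigmoid(wx) with a probability threshold (the threshold value below is arbitrary, chosen only for the example):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_classify(w, x, threshold=0.5):
        """Assign the positive class when sigmoid(w.x) exceeds the chosen probability threshold."""
        return int(sigmoid(np.dot(w, x)) >= threshold)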
In this embodiment, the above step 103 further includes:
inputting the semantic vector into the feature extraction network for feature extraction, and outputting multiple corresponding features;
inputting the multiple corresponding features into the feature recognition network for feature testing, and outputting test results;
inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting the classification tree of the semantic vector;
generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
A classification tree is a tree structure that describes how instances are classified. When classifying with a classification tree, starting from the root node, one feature of the semantic classification sample is tested, and according to the test result the sample is assigned to one of the child nodes; each child node corresponds to one value of that feature. The semantic classification samples are recursively tested and assigned in this way until a leaf node is reached, and each sample is then placed in the class of its leaf node. Each leaf node corresponds to one class of semantic classification samples, and a corresponding emotion label is generated for the semantic classification samples of each class.
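A minimal sketch of the recursive classification-tree traversal described above; the node structure is illustrative only, not the application's actual data format:

    from dataclasses import dataclass, field
    from typing import Dict, Optional

    @dataclass
    class TreeNode:
        feature: Optional[str] = None                                      # internal node: feature to test
        children: Dict[object, "TreeNode"] = field(default_factory=dict)   # feature value -> child node
        emotion_label: Optional[str] = None                                # leaf node: emotion class

    def classify(node: TreeNode, sample: Dict[str, object]) -> str:
        """Test one feature per internal node and descend until a leaf is reached;
        the leaf's class becomes the sentence's emotion label."""
        if node.emotion_label is not None:                 # leaf node reached
            return node.emotion_label
        value = sample[node.feature]                       # attribute test at this node
        return classify(node.children[value], sample)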
104. Input the news broadcast text and each emotion label into the preset text-to-speech model for audio synthesis, and output news broadcast audio with multiple emotions.
A well-regarded speech synthesis model is WORLD, an open-source speech synthesis system based on the C language. Speech synthesis mainly includes two approaches, waveform concatenation and parametric synthesis; WORLD is a vocoder-based parametric synthesis method whose advantage over STRAIGHT is reduced computational complexity, so it can be applied to real-time speech synthesis. STRAIGHT is not an open-source system, and the WORLD paper shows that WORLD leads STRAIGHT in both synthesized audio quality and synthesis speed. Neural-network-based end-to-end text-to-speech (TTS) technology has developed rapidly; compared with concatenative synthesis and statistical parametric synthesis in traditional speech synthesis, the voice generated by end-to-end speech synthesis usually has better naturalness. However, this technology still faces the following problems:
Slow synthesis speed: the end-to-end model usually generates a mel-spectrogram in an autoregressive manner and then synthesizes speech through a vocoder; the mel-spectrogram of a piece of speech can contain hundreds to thousands of frames, so synthesis is slow.
Poor stability of the synthesized speech: the end-to-end model usually uses an encoder-attention-decoder mechanism for autoregressive generation; error propagation in sequence generation and inaccurate attention alignment lead to repeated or missing words.
Lack of controllability: an autoregressive neural network model automatically decides the length of the generated speech and cannot explicitly control the speech rate or prosodic pauses of the generated speech. To solve this series of problems, the Machine Learning Group of Microsoft Research Asia and the speech team of the Microsoft (Asia) Internet Engineering Institute, together with Zhejiang University, proposed FastSpeech, a new Transformer-based feedforward network that can generate high-quality mel-spectrograms in a parallel, stable and controllable way and then synthesize the sound in parallel with the help of a vocoder.
In this embodiment, the preset text-to-speech model is a FastSpeech model. Full-scene broadcasting requires high-fidelity rendering of speech in different scenes, and the key lies in prosodic information such as stress and pauses, breath strength, pitch strength and emotional fluctuation. Because high expressiveness is needed and the texts are long, this embodiment takes the FastSpeech model as the underlying technology for productization. Compared with the autoregressive Transformer TTS, FastSpeech speeds up mel-spectrogram generation by nearly 270 times and end-to-end speech synthesis by 38 times, and its synthesis speed on a single GPU reaches 30 times real time. Moreover, FastSpeech removes the attention mechanism, which lowers the synthesis failure rate and effectively avoids the losses caused by failed synthesis of long texts. Unlike the Tacotron 2 model, FastSpeech is a non-autoregressive model that computes the mel-spectrogram frames of every character in parallel, avoiding the speed limit imposed by a recurrent mechanism; however, this also weakens the prosodic correlation between frames and thus reduces expressiveness. It is therefore recommended to introduce a variance adaptor mechanism to predict prosodic information such as pitch and duration, improving the phoneme duration, pitch strength and stress volume of the synthesized speech, so that synthesis is both fast and of high quality.
FastSpeech adopts a novel feed-forward Transformer architecture and discards the traditional encoder-attention-decoder mechanism. Its main modules are the Transformer self-attention mechanism and one-dimensional convolution (1D convolution). The feed-forward Transformer stacks multiple FFT blocks for the phoneme-to-mel-spectrogram transformation, with N FFT blocks on the phoneme side and N on the mel-spectrogram side. Notably, a length regulator sits between the two sides to bridge the length difference between the phoneme sequence and the mel-spectrogram sequence.
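As a rough, non-limiting reading of the length regulator described above, the sketch below repeats each phoneme's hidden representation according to a predicted duration so that the expanded sequence matches the number of mel-spectrogram frames; the hidden states and durations shown are placeholder values, not FastSpeech's reference implementation.

```python
# Hedged sketch of a FastSpeech-style length regulator: each phoneme hidden
# state is repeated `duration` times so the expanded sequence aligns with the
# mel-spectrogram frames. Shapes and values are illustrative only.
from typing import List

def length_regulate(phoneme_hidden: List[List[float]],
                    durations: List[int]) -> List[List[float]]:
    expanded = []
    for hidden, dur in zip(phoneme_hidden, durations):
        expanded.extend([hidden] * dur)   # repeat the state for dur mel frames
    return expanded

# Three phonemes with predicted durations of 2, 3 and 1 mel frames.
hidden_states = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
mel_aligned = length_regulate(hidden_states, durations=[2, 3, 1])
print(len(mel_aligned))   # 6 frames in total
```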
In this embodiment, the above step 104 further includes:
dividing the news broadcast text into sentences to obtain a plurality of sentences in order;
inputting each sentence and the emotion label corresponding to that sentence into the text preprocessing network for phoneme serialization, and outputting a phoneme sequence;
inputting the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
inputting the prosody synthesis type information into the speech synthesis network for waveform generation, and outputting news broadcast audio with multiple emotions.
In this embodiment, the text is converted into a phoneme sequence, and information such as the start and end time and the frequency variation of each phoneme is marked. Although its importance is often overlooked because it is only a preprocessing step, it involves many issues worth studying, such as distinguishing words that are spelled the same but pronounced differently, handling abbreviations, and determining pause positions. The phoneme sequence provides the reference for prosodic information such as pitch and duration in the FastSpeech model and determines the correct emotion synthesis type. For example, given situational dialogue A and emotion label a, the FastSpeech model predicts the prosodic information from dialogue A and label a; if the prediction result is "anger", the prosody synthesis type information is determined to be the "anger type". This "anger type" information is then fed into the speech synthesis network, which parses the parameters of the input, and the vocoder in the speech synthesis network synthesizes speech according to the parsing result.
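By way of a non-limiting illustration of this preprocessing step, the sketch below turns a sentence and its emotion label into a phoneme sequence and a synthesis type for the prosody prediction stage; the toy lexicon and the emotion-to-type mapping are assumptions made only for the example.

```python
# Hedged sketch of the preprocessing step: a sentence and its emotion label
# are turned into a phoneme sequence, and a synthesis type is chosen for the
# prosody prediction network. Lexicon and mapping are made up for illustration.

TOY_LEXICON = {"news": ["n", "uw", "z"], "today": ["t", "ah", "d", "ey"]}
EMOTION_TO_SYNTHESIS_TYPE = {"anger": "anger type", "joy": "joy type",
                             "neutral": "neutral type"}

def preprocess(sentence: str, emotion_label: str):
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, list(word)))  # fall back to letters
    synthesis_type = EMOTION_TO_SYNTHESIS_TYPE.get(emotion_label, "neutral type")
    return phonemes, synthesis_type

print(preprocess("News today", "anger"))
# (['n', 'uw', 'z', 't', 'ah', 'd', 'ey'], 'anger type')
```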
In this embodiment, after the news broadcast text and the emotion labels are input into the preset text-to-speech model for audio synthesis and news broadcast audio with multiple emotions is output, the method further includes:
visually editing the news broadcast audio according to preset timestamps to obtain emotional speech under a plurality of different emotions;
submitting each piece of emotional speech and the emotion label corresponding to each piece of emotional speech for manual review.
In this embodiment, the dialogue in the waveform audio is visualized according to the timestamps, so that it can be located quickly and then auditioned and edited. For example, if timestamp 1 is 01:08-02:34 and timestamp 2 is 02:34-03:28, the whole waveform audio is cut according to these two periods to obtain two audio files, which are marked with label buttons, for example button 1 (01:08-02:34) and button 2 (02:34-03:28). When button 1 is clicked, the audio file for the period of timestamp 1 is played; when button 2 is clicked, the audio file for the period of timestamp 2 is played.
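Assuming the broadcast audio is stored as a PCM WAV file, the following sketch cuts it into clips along the preset timestamps using Python's standard wave module; the file names and the 01:08-02:34 / 02:34-03:28 periods mirror the example above and are otherwise arbitrary.

```python
# Hedged sketch: cut a WAV news broadcast into clips along preset timestamps.
import wave

def clip(src_path: str, dst_path: str, start_s: float, end_s: float) -> None:
    with wave.open(src_path, "rb") as src:
        rate = src.getframerate()
        src.setpos(int(start_s * rate))                     # jump to clip start
        frames = src.readframes(int((end_s - start_s) * rate))
        params = src.getparams()
    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)           # frame count in the header is fixed on close
        dst.writeframes(frames)

clip("broadcast.wav", "clip_button1.wav", 68, 154)    # 01:08 - 02:34
clip("broadcast.wav", "clip_button2.wav", 154, 208)   # 02:34 - 03:28
```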
If an audio file needs to be modified and edited, the corresponding audio is located quickly through its timestamp and the file is downloaded from the system; digital audio editing software is then used to apply routine waveform processing such as invert, mute, amplify, boost, attenuate, fade in, fade out and normalize, routine operations such as cut, copy, paste, multi-file merging and mixing, and filtering with notch, band-pass, high-pass, low-pass, high-frequency and FFT filters. Digital audio editing software mainly covers recording, mixing and post-effect processing. It is powerful digital audio editing software centred on audio processing that integrates sound recording, playback, editing, processing and conversion, offers the rich effects and editing functions needed for professional sound design, and supports all kinds of complex and fine-grained professional audio editing, including frequency equalization, effect processing and noise reduction. For clipping, the audio file of the corresponding timestamp is opened in the digital audio editing software, the file is segmented and then spliced, modified and otherwise processed, and the processed audio file is uploaded back to the system.
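A few of the routine waveform operations listed above (invert, amplify or attenuate, fade in, fade out) can be sketched directly on a mono floating-point signal; the gain, fade lengths and test tone below are illustrative only and do not represent any particular editing software.

```python
# Hedged sketch of basic waveform operations on a mono float signal.
import numpy as np

def amplify(signal: np.ndarray, gain: float) -> np.ndarray:
    return np.clip(signal * gain, -1.0, 1.0)     # boost (gain > 1) or attenuate (gain < 1)

def invert(signal: np.ndarray) -> np.ndarray:
    return -signal                               # phase inversion

def fade(signal: np.ndarray, fade_in: int, fade_out: int) -> np.ndarray:
    out = signal.copy()
    out[:fade_in] *= np.linspace(0.0, 1.0, fade_in)      # fade in
    out[-fade_out:] *= np.linspace(1.0, 0.0, fade_out)   # fade out
    return out

tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)   # 1 s of 440 Hz at 16 kHz
edited = fade(amplify(invert(tone), 0.8), fade_in=1600, fade_out=1600)
```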
In this embodiment, whether the emotional speech matches its emotion label is reviewed manually. The corresponding emotional speech is played after its button is clicked, and a reviewer judges the emotional colour of the speech and compares it with the emotion label bound to it. If the emotional colour of the speech is consistent with the bound emotion label, the review passes and the emotional speech is marked as news; if they are inconsistent, intelligent semantic analysis is performed on the situational dialogue again, the corresponding emotion label is regenerated, the waveform audio is re-synthesized, the audio clips are supplemented, and manual review is initiated once more.
This embodiment enriches the synthesis effects, improves realism and diversifies the speech effects, offering more possibilities for product application scenarios; being designed for full-scene broadcasting of financial news, it handles such content with greater ease.
The intelligent news broadcasting method in the embodiments of the present application has been described above. The intelligent news broadcasting apparatus in the embodiments of the present application is described below. Referring to FIG. 2, an embodiment of the intelligent news broadcasting apparatus includes:
a news text acquisition module 201, configured to acquire the news broadcast text to be processed;
a semantic analysis module 202, configured to input the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
a label generation module 203, configured to input the semantic vector into a preset semantic classification model for classification and generate the emotion label corresponding to each sentence in the news broadcast text;
an audio synthesis module 204, configured to input the news broadcast text and the emotion labels into a preset text-to-speech model for audio synthesis and output news broadcast audio with multiple emotions.
Optionally, the news text acquisition module 201 may be further specifically configured to:
acquire a semantic prediction training sample set and a semantic label set, and establish the matching relationship between samples and labels;
split the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
input the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
input the test sample set into the semantic prediction model for a model performance test; if the test result is good, model training ends, otherwise model training continues (a non-limiting sketch of this workflow is given below).
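In the sketch below, scikit-learn stands in for the preset neural network model; the random data, the 0.9 "good" threshold and the network size are placeholders rather than the semantic prediction training set of this application.

```python
# Hedged sketch: split the sample/label sets, train a stand-in network,
# and keep training until the test score is "good".
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X = np.random.rand(200, 32)             # placeholder semantic training samples
y = np.random.randint(0, 4, size=200)   # placeholder semantic labels matched to samples

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=50, warm_start=True)
for _ in range(10):                          # continue training round by round
    model.fit(X_train, y_train)              # warm_start=True resumes from previous weights
    if model.score(X_test, y_test) >= 0.9:   # a "good" test result ends training
        break
```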
Optionally, the news text acquisition module 201 may be further specifically configured to:
acquire semantic classification samples, and add classification label information to the semantic classification samples;
initialize a preset decision tree model, and input the semantic classification samples and the corresponding classification label information into the decision tree model;
process the semantic classification samples through the decision tree model to obtain classification prediction results for the semantic classification samples;
optimize the parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, obtaining the semantic classification model (a non-limiting sketch of this step is given below).
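The corresponding sketch fits a decision tree on labelled semantic classification samples, with scikit-learn's DecisionTreeClassifier standing in for the preset decision tree model; the sample data and label names are placeholders.

```python
# Hedged sketch: fit a decision tree on labelled semantic classification samples
# and read out classification prediction results.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

samples = np.random.rand(300, 16)                               # semantic classification samples
labels = np.random.choice(["anger", "joy", "calm"], size=300)   # classification label information

tree_model = DecisionTreeClassifier(max_depth=5)   # initialise the stand-in decision tree model
tree_model.fit(samples, labels)                    # grow the tree on samples and labels
predictions = tree_model.predict(samples[:5])      # classification prediction results
print(predictions)
```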
Optionally, the semantic analysis module 202 may be further specifically configured to:
segment the news broadcast text into words to obtain a plurality of segmented words in word order;
input each segmented word into the feature recognition network in turn for feature extraction, and output the word vector and semantic weight corresponding to each segmented word;
input the word vectors into the word vector synthesis network, perform weighted fusion of the word vectors according to the semantic weights, and output the corresponding semantic vector (a non-limiting sketch of this fusion is given below).
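In the fusion sketch below, each word vector is scaled by its semantic weight and the scaled vectors are summed into one sentence-level semantic vector; the vectors and weights shown are illustrative values only.

```python
# Hedged sketch: weighted fusion of word vectors into a single semantic vector.
import numpy as np

word_vectors = np.array([[0.2, 0.1, 0.7],      # one row per segmented word
                         [0.9, 0.3, 0.1],
                         [0.4, 0.4, 0.2]])
semantic_weights = np.array([0.5, 0.3, 0.2])   # semantic weight of each word

semantic_vector = (semantic_weights[:, None] * word_vectors).sum(axis=0)
print(semantic_vector)                          # fused sentence-level semantic vector
```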
Optionally, the label generation module 203 may be further specifically configured to:
input the semantic vector into the feature extraction network for feature extraction and output a plurality of corresponding features;
input the plurality of corresponding features into the feature recognition network for feature testing and output test results;
input the test results into the classification network, assign the semantic vector to nodes according to the test results, and output the classification tree of the semantic vector;
generate, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
Optionally, the audio synthesis module 204 may be further specifically configured to:
divide the news broadcast text into sentences to obtain a plurality of sentences in order;
input each sentence and the emotion label corresponding to that sentence into the text preprocessing network for phoneme serialization and output a phoneme sequence;
input the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
input the prosody synthesis type information into the speech synthesis network for waveform generation and output news broadcast audio with multiple emotions.
FIG. 1 and FIG. 2 above describe the intelligent news broadcasting apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the intelligent news broadcasting device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 3 is a schematic structural diagram of an intelligent news broadcasting device provided by an embodiment of the present application. The intelligent news broadcasting device 300 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 310 (for example, one or more processors), a memory 320, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 333 or data 332. The memory 320 and the storage medium 330 may be transient storage or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the intelligent news broadcasting device 300. Furthermore, the processor 310 may be configured to communicate with the storage medium 330 and execute, on the intelligent news broadcasting device 300, the series of instruction operations in the storage medium 330.
The intelligent news broadcasting device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input/output interfaces 360, and/or one or more operating systems 331 such as Windows Server, Mac OS X, Unix, Linux or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 3 does not constitute a limitation on the intelligent news broadcasting device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application further provides an intelligent news broadcasting device. The intelligent news broadcasting device includes a memory and a processor, and computer-readable instructions are stored in the memory; when the computer-readable instructions are executed by the processor, the processor is caused to perform the steps of the intelligent news broadcasting method in the above embodiments.
The present application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer is caused to perform the steps of the intelligent news broadcasting method.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
If implemented in the form of a software functional unit and sold or used as an independent product, the integrated unit may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The above embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some technical features therein; and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. An intelligent news broadcasting method, wherein the intelligent news broadcasting method comprises:
    acquiring a news broadcast text to be processed;
    inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
    inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  2. The intelligent news broadcasting method according to claim 1, wherein before the acquiring the news broadcast text to be processed, the method further comprises:
    acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
    splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
    inputting the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
    inputting the test sample set into the semantic prediction model for a model performance test, wherein if the test result is good, model training ends, otherwise model training continues.
  3. The intelligent news broadcasting method according to claim 1 or 2, wherein the semantic prediction model comprises, in sequence, a feature recognition network and a word vector synthesis network, and the inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector comprises:
    performing word segmentation on the news broadcast text to obtain a plurality of segmented words in word order;
    inputting each segmented word into the feature recognition network in turn for feature extraction, and outputting a word vector and a semantic weight corresponding to each segmented word;
    inputting the word vectors into the word vector synthesis network, performing weighted fusion of the word vectors according to the semantic weights, and outputting the corresponding semantic vector.
  4. The intelligent news broadcasting method according to claim 1, wherein before the acquiring the news broadcast text to be processed, the method further comprises:
    acquiring semantic classification samples, and adding classification label information to the semantic classification samples;
    initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
    processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
    optimizing parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
  5. The intelligent news broadcasting method according to claim 1 or 4, wherein the semantic classification model comprises, in sequence, a feature extraction network, a feature recognition network and a classification network, and the inputting the semantic vector into a preset semantic classification model for classification and generating an emotion label corresponding to each sentence in the news broadcast text comprises:
    inputting the semantic vector into the feature extraction network for feature extraction, and outputting a plurality of corresponding features;
    inputting the plurality of corresponding features into the feature recognition network for feature testing, and outputting test results;
    inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting a classification tree of the semantic vector;
    generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
  6. The intelligent news broadcasting method according to claim 1, wherein the text-to-speech model comprises, in sequence, a text preprocessing network, a prosody prediction network and a speech synthesis network, and the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions comprises:
    dividing the news broadcast text into sentences to obtain a plurality of sentences in order;
    inputting each sentence and the emotion label corresponding to each sentence into the text preprocessing network for phoneme serialization, and outputting a phoneme sequence;
    inputting the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
    inputting the prosody synthesis type information into the speech synthesis network for waveform generation, and outputting news broadcast audio with multiple emotions.
  7. The intelligent news broadcasting method according to claim 1, wherein after the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions, the method further comprises:
    visually editing the news broadcast audio according to preset timestamps to obtain emotional speech under a plurality of different emotions;
    submitting each piece of emotional speech and the emotion label corresponding to each piece of emotional speech for manual review.
  8. An intelligent news broadcasting device, wherein the intelligent news broadcasting device comprises a memory, a processor, and an intelligent news broadcasting program stored on the memory and executable on the processor, and the processor implements the following steps when executing the intelligent news broadcasting program:
    acquiring a news broadcast text to be processed;
    inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
    inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  9. The intelligent news broadcasting device according to claim 8, wherein before the acquiring the news broadcast text to be processed, the processor executing the intelligent news broadcasting program further implements:
    acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
    splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
    inputting the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
    inputting the test sample set into the semantic prediction model for a model performance test, wherein if the test result is good, model training ends, otherwise model training continues.
  10. The intelligent news broadcasting device according to claim 8 or 9, wherein when the processor executes the intelligent news broadcasting program, the semantic prediction model comprises, in sequence, a feature recognition network and a word vector synthesis network, and the inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector comprises:
    performing word segmentation on the news broadcast text to obtain a plurality of segmented words in word order;
    inputting each segmented word into the feature recognition network in turn for feature extraction, and outputting a word vector and a semantic weight corresponding to each segmented word;
    inputting the word vectors into the word vector synthesis network, performing weighted fusion of the word vectors according to the semantic weights, and outputting the corresponding semantic vector.
  11. The intelligent news broadcasting device according to claim 8, wherein before the acquiring the news broadcast text to be processed, the processor executing the intelligent news broadcasting program further implements:
    acquiring semantic classification samples, and adding classification label information to the semantic classification samples;
    initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
    processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
    optimizing parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
  12. The intelligent news broadcasting device according to claim 8 or 11, wherein when the processor executes the intelligent news broadcasting program, the semantic classification model comprises, in sequence, a feature extraction network, a feature recognition network and a classification network, and the inputting the semantic vector into a preset semantic classification model for classification and generating an emotion label corresponding to each sentence in the news broadcast text comprises:
    inputting the semantic vector into the feature extraction network for feature extraction, and outputting a plurality of corresponding features;
    inputting the plurality of corresponding features into the feature recognition network for feature testing, and outputting test results;
    inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting a classification tree of the semantic vector;
    generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
  13. The intelligent news broadcasting device according to claim 8, wherein when the processor executes the intelligent news broadcasting program, the text-to-speech model comprises, in sequence, a text preprocessing network, a prosody prediction network and a speech synthesis network, and the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions comprises:
    dividing the news broadcast text into sentences to obtain a plurality of sentences in order;
    inputting each sentence and the emotion label corresponding to each sentence into the text preprocessing network for phoneme serialization, and outputting a phoneme sequence;
    inputting the phoneme sequence into the prosody prediction network for prosody prediction to obtain prosody synthesis type information;
    inputting the prosody synthesis type information into the speech synthesis network for waveform generation, and outputting news broadcast audio with multiple emotions.
  14. The intelligent news broadcasting device according to claim 8, wherein after the inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis and outputting news broadcast audio with multiple emotions, the processor executing the intelligent news broadcasting program further implements:
    visually editing the news broadcast audio according to preset timestamps to obtain emotional speech under a plurality of different emotions;
    submitting each piece of emotional speech and the emotion label corresponding to each piece of emotional speech for manual review.
  15. A computer-readable storage medium storing computer instructions, wherein when the computer instructions are run on a computer, the computer is caused to perform the following steps:
    acquiring a news broadcast text to be processed;
    inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    inputting the semantic vector into a preset semantic classification model for classification, and generating an emotion label corresponding to each sentence in the news broadcast text;
    inputting the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and outputting news broadcast audio with multiple emotions.
  16. The computer-readable storage medium according to claim 15, wherein before the acquiring the news broadcast text to be processed, the computer instructions, when executed, further cause the computer to perform:
    acquiring a semantic prediction training sample set and a semantic label set, and establishing a matching relationship between samples and labels;
    splitting the prediction training sample set and the semantic label set to obtain a training sample set and a test sample set;
    inputting the training sample set into a preset neural network model for semantic prediction training to obtain the semantic prediction model;
    inputting the test sample set into the semantic prediction model for a model performance test, wherein if the test result is good, model training ends, otherwise model training continues.
  17. The computer-readable storage medium according to claim 15 or 16, wherein when the computer instructions are executed, the semantic prediction model comprises, in sequence, a feature recognition network and a word vector synthesis network, and the inputting the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector comprises:
    performing word segmentation on the news broadcast text to obtain a plurality of segmented words in word order;
    inputting each segmented word into the feature recognition network in turn for feature extraction, and outputting a word vector and a semantic weight corresponding to each segmented word;
    inputting the word vectors into the word vector synthesis network, performing weighted fusion of the word vectors according to the semantic weights, and outputting the corresponding semantic vector.
  18. The computer-readable storage medium according to claim 15, wherein before the acquiring the news broadcast text to be processed, the computer instructions, when executed, further cause the computer to perform:
    acquiring semantic classification samples, and adding classification label information to the semantic classification samples;
    initializing a preset decision tree model, and inputting the semantic classification samples and the corresponding classification label information into the decision tree model;
    processing the semantic classification samples through the decision tree model to obtain classification prediction results of the semantic classification samples;
    optimizing parameters of the decision tree model according to the classification prediction results and the classification label information until the decision tree model converges, to obtain the semantic classification model.
  19. The computer-readable storage medium according to claim 15 or 18, wherein when the computer instructions are executed, the semantic classification model comprises, in sequence, a feature extraction network, a feature recognition network and a classification network, and the inputting the semantic vector into a preset semantic classification model for classification and generating an emotion label corresponding to each sentence in the news broadcast text comprises:
    inputting the semantic vector into the feature extraction network for feature extraction, and outputting a plurality of corresponding features;
    inputting the plurality of corresponding features into the feature recognition network for feature testing, and outputting test results;
    inputting the test results into the classification network, assigning the semantic vector to nodes according to the test results, and outputting a classification tree of the semantic vector;
    generating, based on the classification tree of the semantic vector, the emotion label corresponding to each sentence in the news broadcast text.
  20. An intelligent news broadcasting apparatus, wherein the intelligent news broadcasting apparatus comprises:
    a news text acquisition module, configured to acquire a news broadcast text to be processed;
    a semantic analysis module, configured to input the news broadcast text into a preset semantic prediction model for semantic prediction to obtain a corresponding semantic vector;
    a label generation module, configured to input the semantic vector into a preset semantic classification model for classification, and generate an emotion label corresponding to each sentence in the news broadcast text;
    an audio synthesis module, configured to input the news broadcast text and each emotion label into a preset text-to-speech model for audio synthesis, and output news broadcast audio with multiple emotions.
PCT/CN2021/084290 2020-12-10 2021-03-31 Intelligent news broadcasting method, apparatus and device, and storage medium WO2022121181A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011432581.8 2020-12-10
CN202011432581.8A CN112541078A (en) 2020-12-10 2020-12-10 Intelligent news broadcasting method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022121181A1 true WO2022121181A1 (en) 2022-06-16

Family

ID=75019847

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084290 WO2022121181A1 (en) 2020-12-10 2021-03-31 Intelligent news broadcasting method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN112541078A (en)
WO (1) WO2022121181A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115033701A (en) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 Text vector generation model training method, text classification method and related device
CN115130613A (en) * 2022-07-26 2022-09-30 西北工业大学 False news identification model construction method, false news identification method and device
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
CN115827854A (en) * 2022-12-28 2023-03-21 数据堂(北京)科技股份有限公司 Voice abstract generation model training method, voice abstract generation method and device
CN116166827A (en) * 2023-04-24 2023-05-26 北京百度网讯科技有限公司 Training of semantic tag extraction model and semantic tag extraction method and device
CN117558259A (en) * 2023-11-22 2024-02-13 北京风平智能科技有限公司 Digital man broadcasting style control method and device

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541078A (en) * 2020-12-10 2021-03-23 平安科技(深圳)有限公司 Intelligent news broadcasting method, device, equipment and storage medium
CN113838452B (en) * 2021-08-17 2022-08-23 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and computer storage medium
CN113850083A (en) * 2021-08-17 2021-12-28 北京百度网讯科技有限公司 Method, device and equipment for determining broadcast style and computer storage medium
CN113761940B (en) * 2021-09-09 2023-08-11 杭州隆埠科技有限公司 News main body judging method, equipment and computer readable medium
CN115083428B (en) * 2022-05-30 2023-05-30 湖南中周至尚信息技术有限公司 Voice model recognition device for news broadcasting assistance and control method thereof

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169430A (en) * 2017-05-02 2017-09-15 哈尔滨工业大学深圳研究生院 Reading environment audio strengthening system and method based on image procossing semantic analysis
CN110276076A (en) * 2019-06-25 2019-09-24 北京奇艺世纪科技有限公司 A kind of text mood analysis method, device and equipment
CN110941954A (en) * 2019-12-04 2020-03-31 深圳追一科技有限公司 Text broadcasting method and device, electronic equipment and storage medium
CN111128118A (en) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 Speech synthesis method, related device and readable storage medium
CN112541078A (en) * 2020-12-10 2021-03-23 平安科技(深圳)有限公司 Intelligent news broadcasting method, device, equipment and storage medium

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115130613A (en) * 2022-07-26 2022-09-30 西北工业大学 False news identification model construction method, false news identification method and device
CN115130613B (en) * 2022-07-26 2024-03-15 西北工业大学 False news identification model construction method, false news identification method and device
CN115033701A (en) * 2022-08-12 2022-09-09 北京百度网讯科技有限公司 Text vector generation model training method, text classification method and related device
CN115662435A (en) * 2022-10-24 2023-01-31 福建网龙计算机网络信息技术有限公司 Virtual teacher simulation voice generation method and terminal
CN115827854A (en) * 2022-12-28 2023-03-21 数据堂(北京)科技股份有限公司 Voice abstract generation model training method, voice abstract generation method and device
CN115827854B (en) * 2022-12-28 2023-08-11 数据堂(北京)科技股份有限公司 Speech abstract generation model training method, speech abstract generation method and device
CN116166827A (en) * 2023-04-24 2023-05-26 北京百度网讯科技有限公司 Training of semantic tag extraction model and semantic tag extraction method and device
CN116166827B (en) * 2023-04-24 2023-12-15 北京百度网讯科技有限公司 Training of semantic tag extraction model and semantic tag extraction method and device
CN117558259A (en) * 2023-11-22 2024-02-13 北京风平智能科技有限公司 Digital man broadcasting style control method and device

Also Published As

Publication number Publication date
CN112541078A (en) 2021-03-23

Similar Documents

Publication Publication Date Title
WO2022121181A1 (en) Intelligent news broadcasting method, apparatus and device, and storage medium
Sun End-to-end speech emotion recognition with gender information
WO2022048405A1 (en) Text-based virtual object animation generation method, apparatus, storage medium, and terminal
KR20230034423A (en) 2-level speech rhyme transmission
US10453434B1 (en) System for synthesizing sounds from prototypes
Haag et al. Bidirectional LSTM networks employing stacked bottleneck features for expressive speech-driven head motion synthesis
WO2022184055A1 (en) Speech playing method and apparatus for article, and device, storage medium and program product
Casale et al. Multistyle classification of speech under stress using feature subset selection based on genetic algorithms
JP2021530726A (en) Methods and systems for creating object-based audio content
Mu et al. Review of end-to-end speech synthesis technology based on deep learning
CN110782875B (en) Voice rhythm processing method and device based on artificial intelligence
Wang et al. Comic-guided speech synthesis
Wenner et al. Scalable music: Automatic music retargeting and synthesis
CN115147521A (en) Method for generating character expression animation based on artificial intelligence semantic analysis
US20190088258A1 (en) Voice recognition device, voice recognition method, and computer program product
CN113178182A (en) Information processing method, information processing device, electronic equipment and storage medium
CN111402919B (en) Method for identifying style of playing cavity based on multi-scale and multi-view
Gamage et al. Modeling variable length phoneme sequences—A step towards linguistic information for speech emotion recognition in wider world
Tran et al. Naturalness improvement of vietnamese text-to-speech system using diffusion probabilistic modelling and unsupervised data enrichment
Anumanchipalli Intra-lingual and cross-lingual prosody modelling
Chen et al. A new learning scheme of emotion recognition from speech by using mean fourier parameters
Tong Speech to text with emoji
Tu et al. Contextual expressive text-to-speech
US20230386475A1 (en) Systems and methods of text to audio conversion
Magdin et al. Case study of features extraction and real time classification of emotion from speech on the basis with using neural nets

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901897

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901897

Country of ref document: EP

Kind code of ref document: A1