CN116865887B - Emotion classification broadcasting system and method based on knowledge distillation - Google Patents

Emotion classification broadcasting system and method based on knowledge distillation

Info

Publication number
CN116865887B
CN116865887B (application CN202310828957.4A)
Authority
CN
China
Prior art keywords
program
broadcasting
emotion
emotion classification
text content
Prior art date
Legal status
Active
Application number
CN202310828957.4A
Other languages
Chinese (zh)
Other versions
CN116865887A (en)
Inventor
刘海章
王祥
张长娟
田才林
黄大池
朱静宁
赵开宇
杜限
黄河
靳晶晶
王佩
邹雪
Current Assignee
Sichuan Institute Of Radio And Television Science And Technology
Original Assignee
Sichuan Institute Of Radio And Television Science And Technology
Priority date
Filing date
Publication date
Application filed by Sichuan Institute Of Radio And Television Science And Technology filed Critical Sichuan Institute Of Radio And Television Science And Technology
Priority to CN202310828957.4A
Publication of CN116865887A
Application granted
Publication of CN116865887B
Legal status: Active
Anticipated expiration


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04H: BROADCAST COMMUNICATION
    • H04H 20/00: Arrangements for broadcast or for distribution combined with broadcast
    • H04H 20/53: Arrangements specially adapted for specific applications, e.g. for traffic information or for mobile receivers
    • H04H 20/59: Arrangements specially adapted for specific applications, e.g. for traffic information or for mobile receivers, for emergency or urgency
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor, of unstructured textual data
    • G06F 16/35: Clustering; Classification
    • G06F 16/353: Clustering; Classification into predefined classes
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/217: Validation; Performance evaluation; Active pattern learning techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/096: Transfer learning
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks, in wireless communication networks


Abstract

The invention discloses an emotion classification broadcasting system and method based on knowledge distillation. The system comprises a program broadcast control service subsystem, which aggregates program text content; an AI service subsystem, which performs knowledge distillation on an emotion classification BERT pre-trained large model and trains an edge-side emotion classification small model; and an intelligent broadcast terminal subsystem, which transmits the emotion factor obtained by classification together with the program text to a text-to-speech engine, generates an audio program carrying the emotion factor, and broadcasts it. The invention not only gives the broadcasting system a program broadcast effect with emotional color, but also greatly reduces the amount of transmitted data, because the transmitted program data changes from audio to text; the program transmission time is therefore shorter and the emergency broadcasting capability of the system is stronger.

Description

Emotion classification broadcasting system and method based on knowledge distillation
Technical Field
The invention belongs to the technical field of intelligent broadcasting, and particularly relates to an emotion classification broadcasting system and method based on knowledge distillation.
Background
In current broadcasting systems, broadcast content is collected from safe and compliant data sources by a news collector and a web crawler. The broadcasting system first converts the collected broadcast text into audio programs in the cloud using text-to-speech, and then transmits the audio programs to the broadcast terminals over a streaming-media protocol such as RTMP for playback.
The main problems with this approach are:
(1) Broadcast program content without emotion color
Because the broadcast programs are produced by a text-to-speech (TTS) engine without emotion classification of the program text, all programs share the same speaking rate and tone. The broadcast therefore sounds flat and unaffecting and cannot effectively influence the audience.
(2) The program transmission time is too long
Because the broadcast terminal directly plays the converted audio file, the transmission time is much longer than transmitting and playing text directly; under poor network conditions, transmission-quality problems may even cause playback to fail.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides an emotion classification broadcasting system and method based on knowledge distillation, so as to solve the problems that existing broadcast programs lack emotional color and take too long to transmit.
In order to achieve the above purpose, the invention adopts the following technical scheme:
In a first aspect, an emotion classification broadcasting system based on knowledge distillation comprises:
the program broadcasting control service subsystem is used for carrying out content aggregation on the text content of the program, storing the content into a broadcasting program library and sending the text content of the program to the intelligent broadcasting terminal subsystem according to broadcasting requirements;
the AI service subsystem is used for carrying out knowledge distillation on the emotion classification BERT pre-training large model, training to obtain an edge side emotion classification small model, and storing and transmitting the edge side emotion classification small model to the intelligent broadcasting terminal subsystem;
and the intelligent broadcasting terminal subsystem is used for receiving the edge side emotion classification small model and the program text, performing emotion classification on the program text, transmitting the emotion factors and the program text obtained by classification to the text-to-speech conversion engine, generating an audio program with the emotion factors, and broadcasting the audio program.
In a second aspect, a broadcasting method of an emotion classification broadcasting system based on knowledge distillation is characterized by comprising the following steps:
s1, content aggregation is carried out on the text content of the program, the content is stored in a broadcasting program library, and the text content of the program is sent to an intelligent broadcasting terminal subsystem according to broadcasting requirements;
s2, carrying out knowledge distillation on the emotion classification BERT pre-training large model, training to obtain an edge side emotion classification small model, and storing and transmitting the edge side emotion classification small model to the intelligent broadcast terminal subsystem;
s3, carrying out emotion classification on the program text by adopting an edge side emotion classification small model, transmitting emotion factors and the program text obtained by classification to a text-to-speech conversion engine, generating an audio program with emotion factors, and broadcasting.
Further, step S1 includes:
the method comprises the steps of collecting content through a news collector and a web crawler for emergency broadcasting, geological disaster information, emergency release of other three parties, appointed news information or government notices, cleaning data of collected content, storing aggregated program data in a broadcasting program library, storing the broadcasting program library in a data and file separation mode, and carrying out CDN release on programs of the broadcasting program library based on broadcasting service.
Further, in step S2, knowledge distillation is performed on the emotion classification BERT pre-training large model, and the training is performed to obtain an edge side emotion classification small model, which includes:
s2.1, initializing a temperature parameter T;
s2.2, training a student model by adopting a current temperature parameter T and a loss function;
s2.3, evaluating the performance of the trained student model under the temperature parameter T by using the accuracy on the verification set;
s2.4, if the performance on the verification set does not meet the preset condition, increasing the value of the temperature parameter T, and returning to S2.2 for continuous training; if the performance on the verification set meets the preset condition, the current temperature parameter T is saved, the value of the temperature parameter T is reduced, and then the training is continued by returning to S2.2;
training stops when the value of the temperature parameter T is smaller than the threshold value or the number of training rounds exceeds the maximum number of training rounds.
Further, step S2.2 includes:
A temperature parameter T is introduced and temperature scaling is performed with the softmax function; the output probability distribution $q_i$ of the scaled student model is:

$$q_i = \mathrm{softmax}\!\left(\frac{f_S(x_i;\theta)}{T}\right)$$

where $f_S(x_i;\theta)$ is the output of the student model for input text content sample $x_i$, and $\theta$ are the parameters of the student model;
The temperature-scaled student model is trained with the loss function $L_{S,T}$:

$$L_{S,T} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log q_{ik}$$

where $y_i$ is the true label of the i-th text content training sample, $N$ is the total number of text content training samples, and $K$ is the number of label categories, i.e. the number of emotion classes.
Further, step S2.3 includes:
$$A_T = \frac{1}{N_v}\sum_{i=1}^{N_v}\mathbb{1}\!\left(\arg\max_k q_{ik} = y_i\right)$$

where $N_v$ is the total number of validation samples and $A_T$ is the model accuracy.
Further, step S3 includes:
carrying out emotion classification on the program text with the edge-side emotion classification small model, where the emotion classes of the program text are urgent, pleasant, peaceful, and sad, and the class with the highest probability among the four is taken as the emotion factor of the broadcast program text;
and transmitting the obtained emotion factors and the program text to a text-to-speech conversion engine, performing speech conversion on the program text, generating an audio program with the emotion factors, and broadcasting the audio program.
The emotion classification broadcasting system and method based on knowledge distillation provided by the invention have the following beneficial effects:
according to the invention, the emotion classification method based on knowledge distillation is adopted to carry out emotion analysis on the program text at the edge side, and emotion factors obtained by analysis are input into the text-to-speech conversion engine to carry out audio program conversion, so that the broadcast audio program has emotion factors and has more infectivity. Meanwhile, as the transmitted program data is changed from audio frequency to text, the transmission data volume is greatly reduced, the program transmission time is shorter, and the emergency broadcasting capability of the system is stronger.
The invention not only gives the broadcasting system a program broadcast effect with emotional color; because the program issued by the cloud is text, the compressed data is only a few KB, whereas the files transmitted by existing broadcasting systems are audio files whose encoded size is hundreds of KB or more, so the amount of transmitted program data is reduced by a factor of a hundred or more and the transmission time is greatly shortened. Compared with existing broadcasting systems, the invention greatly reduces the transmitted data volume and transmission time and enhances the emergency broadcasting capability of the system; at the same time, the emotion inference over the full text is performed on the edge-side intelligent terminal, which greatly reduces the computing pressure on the cloud of the broadcasting system.
The core idea of the invention is to determine the value of the temperature parameter T adaptively when distilling the small model. Specifically, when the performance on the validation set does not improve, the temperature parameter can be increased to enlarge the search space of the model, giving a greater chance of finding a better model. When the performance on the validation set improves, the current temperature parameter T is saved and its value is reduced, so that the model focuses on the knowledge of the original pre-trained model. By iterating the training and continuously adjusting the value of T, the algorithm adaptively determines the optimal temperature value and thereby improves the performance of the student model.
Drawings
Fig. 1 is a system block diagram of an emotion classification broadcast system based on knowledge distillation.
FIG. 2 is a flow chart of adaptive temperature scaling knowledge distillation.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but the invention is not limited to the scope of the embodiments; all inventions that make use of the inventive concept fall within the spirit and scope of the invention as defined in the appended claims.
Example 1
This embodiment provides an emotion classification broadcasting system based on knowledge distillation, aimed at the problems of the existing methods such as broadcast programs without emotional color and overly long program transmission times. Knowledge-distillation-based emotion classification is applied to the program text at the edge side, and the emotion factor obtained by the analysis is fed into the text-to-speech engine for audio conversion, so the broadcast audio program carries an emotion factor and is more affecting; meanwhile, because the transmitted program data changes from audio to text, the amount of transmitted data is greatly reduced, the program transmission time is shorter, and the emergency broadcasting capability of the system is stronger. Referring to Fig. 1, the system specifically comprises:
program broadcasting control service subsystem
Used for aggregating program content such as emergency message texts, news information websites, and emergency audio, storing it in the broadcast program library, and issuing the broadcast content to the intelligent broadcast terminal according to broadcast requirements;
specifically, the subsystem collects contents according to set rules through a news collector and a web crawler program on websites such as emergency broadcasting, geological disaster information, emergency release of other three parties, appointed news information, government notices and the like, cleans the collected contents only by data and stores the collected program data in a broadcasting program library. The broadcasting program library is stored in a mode of separating data from files, and the functions of automatic classification, automatic labeling and automatic cleaning of expired data are provided. The broadcasting service distributes the program of the broadcasting program library to CDN. Compared with the prior art, the broadcasting system of the invention changes the form of issuing programs from audio to text;
AI service subsystem
Performs knowledge distillation on the emotion classification BERT (Bidirectional Encoder Representations from Transformers) pre-trained large model, trains an edge-side emotion classification small model that can run on the intelligent broadcast terminal, and stores and issues the edge-side small model as required;
in order for a broadcast system to have emotion broadcasting capability, it is necessary to use AI to perform emotion classification on broadcast text content. In the system, emotion classification is to classify emotion of a program and judge which category of "urgent", "pleasant", "peaceful" and "sad" the program belongs to. Considering that the number of broadcasting terminals is large, if emotion classification is carried out at the cloud, the calculation force requirement is huge and the reasoning response time is too long, so that the invention carries out the emotion classification reasoning task on the intelligent broadcasting terminals. Because the intelligent broadcasting terminal has limited calculation power, the intelligent broadcasting terminal cannot directly lower the estrus classification BERT pre-training large model (comprising a multi-layer encoder), and the large model can be efficiently operated at the edge end only by compressing and cutting;
intelligent broadcasting terminal subsystem
Uses the received emotion classification small model to classify the emotion of the broadcast text, transmits the program text and the emotion factor to the text-to-speech engine, generates an audio program with the emotion factor, and broadcasts it;
specifically, the intelligent broadcasting terminal adopts multi-channel receiving, and can receive 4G/5G, wiFi, bluetooth and other transmission data. And the data analysis module analyzes the received data to obtain a broadcasting program text and an emotion classification BERT small model. The AI emotion analysis module loads an emotion classification BERT small model by using an AI framework such as PyTorch and the like, performs emotion classification reasoning on the broadcast program text, and then takes the category with the highest emotion classification probability of ' urgent ', ' pleasant ', ' mild ' sad ' and the like as an emotion factor of the broadcast program text. Finally, the text-to-speech engine TTS performs speech conversion on the program text according to the input emotion factors, and generates an audio program with emotion colors for broadcasting.
Example 2
This embodiment provides a broadcasting method for the emotion classification broadcasting system based on knowledge distillation. The BERT large model is distilled with an adaptive temperature-scaling knowledge distillation method to obtain a corresponding BERT small model (only 3 encoder layers). This greatly reduces the computation and storage cost of the model while essentially preserving the performance and generalization ability of the large model, so that the small model can perform efficient emotion classification inference on the intelligent broadcast terminal. The adaptive temperature-scaling knowledge distillation is as follows:
the goal of knowledge distillation is to migrate knowledge in a large BERT emotion classification large model teacher network T into a small student model S that will be trained to mimic the behavior of the teacher network; f (f) T And f S Representing the behavioural functions of the teacher network and the student network, respectively, the goal of the behavioural functions is to convert the input of the network into a corresponding information encoded representation, knowledge distillation can be modeled as a minimization process of the following objective functions, namely:
$$L_{S,T} = \sum_{x\in E} \mathrm{Loss}\left(f_T(x),\, f_S(x)\right)$$
the loss (·) is a loss function for measuring the difference between a teacher network and a student network, x is an input sample, and E is a sample set; the focus of knowledge distillation is to select and construct a loss function with which to correlate an effective behavioral function.
In this method, the student model is a BERT model structurally consistent with the teacher model, but with far fewer layers. The output of the prediction layer and the attention weights are chosen as the corresponding behavior functions, and the cross-entropy function is chosen as the basic loss function. The dataset is constructed from the broadcast program library, with emotion divided into four categories: "urgent", "pleasant", "peaceful", "sad". The whole dataset comprises 10000 text content training samples, 5000 text content validation samples, and 5000 text content test samples.
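The generic distillation objective, a sum over samples of a loss between teacher and student outputs with cross-entropy as the basic loss, can be sketched in plain Python as follows. The helper names are illustrative; real teacher and student outputs would come from the prediction layer of each BERT model.

```python
import math

def cross_entropy(p_teacher, q_student):
    """Basic loss between the two networks: H(p, q) = -sum_k p_k * log(q_k)."""
    return -sum(p * math.log(q) for p, q in zip(p_teacher, q_student) if p > 0)

def distill_objective(teacher_outs, student_outs):
    """L_{S,T}: sum over samples x in E of Loss(f_T(x), f_S(x))."""
    return sum(cross_entropy(t, s) for t, s in zip(teacher_outs, student_outs))
```

When the student distribution matches the teacher distribution exactly, the per-sample loss reduces to the entropy of the teacher distribution, its minimum over student outputs.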
The adaptive temperature-scaling method is designed to optimize the whole distillation process and improve the generalization ability of the student model. Specifically, when the performance on the validation set does not improve, the value of the temperature parameter can be increased to enlarge the search space of the model, giving a greater chance of finding a better model. When the performance on the validation set improves, the current temperature parameter T is saved and its value is reduced, so that the model focuses on the knowledge of the original pre-trained model. By iterating the training and continuously adjusting the value of T, the algorithm adaptively determines the optimal temperature value and thereby improves the performance of the student model. Referring to Fig. 2, the method specifically comprises the following steps:
step S1, content aggregation is carried out on the program text content, the content is stored in a broadcasting program library, and the program text content is sent to an intelligent broadcasting terminal subsystem according to broadcasting requirements;
specifically, the news collector and the web crawler collect content of emergency broadcast, geological disaster information, other three-party emergency delivery, appointed news information or government notices, clean the collected content data, store the collected program data in a broadcasting program library, store the broadcasting program library in a mode of separating data from files, and carry out CDN delivery on the programs of the broadcasting program library based on broadcasting service.
S2, carrying out knowledge distillation on the emotion classification BERT pre-training large model, training to obtain an edge side emotion classification small model, and storing and transmitting the edge side emotion classification small model to an intelligent broadcast terminal subsystem, wherein the method specifically comprises the following steps of:
step S2.1, initializing a temperature parameter T, wherein the temperature parameter T is a smaller value;
s2.2, training a student model by adopting a current temperature parameter T and a loss function;
A temperature parameter T is introduced and temperature scaling is performed with the softmax function; the output probability distribution $q_i$ of the scaled student model is:

$$q_i = \mathrm{softmax}\!\left(\frac{f_S(x_i;\theta)}{T}\right)$$

where $f_S(x_i;\theta)$ is the output of the student model for input text content sample $x_i$, and $\theta$ are the parameters of the student model;
The temperature-scaled student model is trained with the loss function $L_{S,T}$:

$$L_{S,T} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} y_{ik}\,\log q_{ik}$$

where $y_i$ is the true label of the i-th text content training sample; $N$ is the total number of text content training samples (10000); and $K$ is the number of label categories (4), i.e. the number of emotion classes: "urgent", "pleasant", "peaceful", "sad";
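A dependency-free sketch of the temperature-scaled softmax and the training loss described above, assuming logits from the student's prediction layer; the function names are illustrative, not from the patent.

```python
import math

def softmax_T(logits, T):
    """q = softmax(f_S(x; theta) / T); a larger T flattens the distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def loss_ST(batch_logits, labels, T):
    """L_{S,T} = -(1/N) * sum_i log q_{i, y_i}: cross-entropy with
    one-hot labels y, averaged over the N samples in the batch."""
    total = 0.0
    for logits, y in zip(batch_logits, labels):
        total -= math.log(softmax_T(logits, T)[y])
    return total / len(batch_logits)
```

Raising T spreads probability mass over all four emotion classes, which is exactly the softening effect the adaptive schedule below exploits.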
s2.3, evaluating the performance of the trained student model under the temperature parameter T by using the accuracy on the verification set;
$$A_T = \frac{1}{N_v}\sum_{i=1}^{N_v}\mathbb{1}\!\left(\arg\max_k q_{ik} = y_i\right)$$

where $N_v$ is the total number of validation samples (5000) and $A_T$ is the model accuracy;
step S2.4, if the performance on the verification set does not meet the preset condition, increasing the value of the temperature parameter T, and returning to the step S2.2 to continue training; if the performance on the verification set meets the preset condition, the current temperature parameter T is saved, the value of the temperature parameter T is reduced, and then the step S2.2 is returned to continue training;
training stops when the value of the temperature parameter T is smaller than the threshold value or the number of training rounds exceeds the maximum number of training rounds.
As shown in Fig. 2, $A_T(l)$ is the model accuracy in training round $l$, $l_{max}$ is the maximum number of training rounds, $T_{min}$ is the set minimum temperature value, and $\alpha$, $\beta$ are constants.
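The adaptive loop of steps S2.1 to S2.4 can be sketched as follows. `train_one_round` and `validate` stand in for the real training and validation-accuracy code, and the multiplicative update by alpha and beta is one plausible reading of the constants in Fig. 2; the patent does not fix their exact role.

```python
def adaptive_temperature_distill(train_one_round, validate,
                                 T=2.0, T_min=0.5, l_max=20,
                                 alpha=1.5, beta=0.8):
    """Adaptive temperature-scaling schedule: shrink T by beta when the
    validation accuracy improves, grow it by alpha when it does not."""
    best_T, best_acc = T, -1.0
    for l in range(l_max):                 # stop at the maximum round count
        train_one_round(T)                 # S2.2: train the student at current T
        acc = validate()                   # S2.3: accuracy A_T on the val set
        if acc > best_acc:                 # S2.4: improved -> save T, reduce it
            best_acc, best_T = acc, T
            T *= beta
        else:                              # not improved -> widen the search
            T *= alpha
        if T < T_min:                      # stop when T falls below threshold
            break
    return best_T, best_acc
```

With a validation curve that improves for two rounds and then plateaus, the loop returns the temperature in force at the best round.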
S3, perform emotion classification on the program text with the edge-side emotion classification small model, where the emotion classes of the program text are urgent, pleasant, peaceful, and sad, and the class with the highest probability among the four is taken as the emotion factor of the broadcast program text;
and transmitting the obtained emotion factors and the program text to a text-to-speech conversion engine, performing speech conversion on the program text, generating an audio program with the emotion factors, and broadcasting the audio program.
Although specific embodiments of the invention have been described in detail with reference to the accompanying drawings, this should not be construed as limiting the scope of protection of the patent. Modifications and variations that those skilled in the art can make without creative effort remain within the scope of the patent as described in the claims.

Claims (1)

1. An emotion classification broadcast system based on knowledge distillation, comprising:
the program broadcasting control service subsystem is used for carrying out content aggregation on the text content of the program, storing the content into a broadcasting program library and sending the text content of the program to the intelligent broadcasting terminal subsystem according to broadcasting requirements;
the AI service subsystem is used for carrying out knowledge distillation on the emotion classification BERT pre-training large model, training to obtain an edge side emotion classification small model, storing the edge side emotion classification small model and transmitting the edge side emotion classification small model to the intelligent broadcasting terminal subsystem;
the intelligent broadcasting terminal subsystem comprises a text-to-speech conversion engine and is used for receiving the edge side emotion classification small model and the program text content, performing emotion classification on the received program text content through the edge side emotion classification small model, transmitting emotion factors and the program text content obtained by classification to the text-to-speech conversion engine, generating an audio program with emotion factors and broadcasting the audio program;
the step of content aggregation of the program text content, storing the content in a broadcasting program library and sending the program text content to an intelligent broadcasting terminal subsystem according to broadcasting requirements comprises the following steps:
collecting program text content, cleaning the collected program text content, storing it in the broadcasting program library, and distributing the program text content of the broadcasting program library via CDN according to broadcasting requirements based on the broadcasting service, so as to send it to the intelligent broadcasting terminal subsystem;
performing knowledge distillation on the emotion classification BERT pre-training large model and training to obtain the edge side emotion classification small model comprises the following steps:
s2.1, initializing a temperature parameter T;
s2.2, training a student model by adopting a current temperature parameter T and a loss function;
s2.3, evaluating the performance of the student model under the current temperature parameter T by adopting the accuracy of the verification set;
s2.4, if the performance of the student model under the current temperature parameter T does not meet the preset condition, increasing the value of the temperature parameter T, and returning to S2.2 for continuous training; if the performance of the student model under the current temperature parameter T meets the preset condition, the current temperature parameter T is saved, the value of the temperature parameter T is reduced, and then the student model returns to S2.2 to continue training;
stopping training until the value of the temperature parameter T is smaller than a threshold value or the training round number is larger than the maximum training round number;
the accuracy of the verification set in S2.3 is obtained by: obtaining a label for each verification sample in the verification set through the student model, and taking the ratio of the number of verification samples whose obtained label is consistent with the corresponding real label to the total number of verification samples in the verification set as the accuracy of the verification set;
the method for receiving the edge side emotion classification small model and the program text content, performing emotion classification on the received program text content through the edge side emotion classification small model, transmitting emotion factors and the program text content obtained by classification to a text-to-speech conversion engine, generating and broadcasting an audio program with emotion factors, and comprises the following steps:
the intelligent broadcasting terminal subsystem further comprises a data analysis module, wherein the data analysis module receives data sent by the program broadcasting control service subsystem and the AI service subsystem through multiple channels (4G, 5G, WiFi or Bluetooth), and analyzes the received data to obtain the program text content and the edge side emotion classification small model;
the intelligent broadcasting terminal subsystem further comprises an AI emotion analysis module, wherein the AI emotion analysis module loads an edge side emotion classification small model by using a PyTorch, performs emotion classification on received program text content through the edge side emotion classification small model, and uses the category with the highest probability of four types of emotion classification as emotion factors of the received program text content, wherein the emotion classification of the program text content comprises urgency, pleasure, peace and sadness;
and transmitting the obtained emotion factors and the program text content to the text-to-speech conversion engine, performing speech conversion on the program text content through the text-to-speech conversion engine, and generating and broadcasting an audio program with the emotion factors.
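The claim leaves the distillation loss of step S2.2 unspecified. A common choice, shown here purely as an assumption rather than the patented method, is the standard Hinton-style combination: cross-entropy on the hard labels plus a T²-scaled KL divergence between the temperature-softened teacher and student distributions.

```python
import math

def softmax(logits, t=1.0):
    """Temperature-softened softmax over a list of logits."""
    m = max(z / t for z in logits)               # shift for numerical stability
    exps = [math.exp(z / t - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, true_idx, t, alpha):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher || student)."""
    hard = -math.log(softmax(student_logits)[true_idx])
    q_teacher = softmax(teacher_logits, t)
    q_student = softmax(student_logits, t)
    soft = sum(qt * math.log(qt / qs)
               for qt, qs in zip(q_teacher, q_student))
    return alpha * hard + (1.0 - alpha) * t * t * soft
```

The T² factor compensates for the 1/T² shrinkage of the soft-target gradients, which is why lowering T during the schedule changes the balance between the hard and soft terms.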
CN202310828957.4A 2023-07-06 2023-07-06 Emotion classification broadcasting system and method based on knowledge distillation Active CN116865887B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310828957.4A CN116865887B (en) 2023-07-06 2023-07-06 Emotion classification broadcasting system and method based on knowledge distillation


Publications (2)

Publication Number Publication Date
CN116865887A CN116865887A (en) 2023-10-10
CN116865887B true CN116865887B (en) 2024-03-01

Family

ID=88235340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310828957.4A Active CN116865887B (en) 2023-07-06 2023-07-06 Emotion classification broadcasting system and method based on knowledge distillation

Country Status (1)

Country Link
CN (1) CN116865887B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109347586A (en) * 2018-11-24 2019-02-15 合肥龙泊信息科技有限公司 An emergency broadcast system with a terminal broadcast speech monitoring function for teletext broadcasting
CN111767740A (en) * 2020-06-23 2020-10-13 北京字节跳动网络技术有限公司 Sound effect adding method and device, storage medium and electronic equipment
CN114241282A (en) * 2021-11-04 2022-03-25 河南工业大学 Knowledge distillation-based edge device scene identification method and device
CN114863226A (en) * 2022-04-26 2022-08-05 江西理工大学 Intrusion detection method for cyber-physical systems
CN116260642A (en) * 2023-02-27 2023-06-13 南京邮电大学 Lightweight IoT malicious traffic identification method based on knowledge-distilled spatio-temporal neural networks




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant