CN115713257A - Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment - Google Patents


Info

Publication number: CN115713257A
Application number: CN202211384376.8A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Prior art keywords: anchor, evaluation information, color value, tone, feature
Inventors: 赵君利, 董黎刚, 蒋献, 邹杭
Current and original assignee: Zhejiang Gongshang University
Application filed by Zhejiang Gongshang University, with priority to CN202211384376.8A

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method, a device and a computing device for evaluating anchor expressive force based on multi-mode fusion, which comprises the steps of extracting color value characteristics of an input video to be evaluated to obtain anchor color value characteristics and color value evaluation information corresponding to the anchor color value characteristics; extracting tone features of a video to be evaluated to obtain anchor tone features and tone evaluation information corresponding to the anchor tone features; extracting content characteristics of a video to be evaluated to obtain live broadcast content characteristics and content evaluation information corresponding to the live broadcast content characteristics; inputting the anchor color value characteristics, the anchor tone color characteristics and the live broadcast content characteristics into an expressive force evaluation model obtained by pre-training to obtain anchor expressive force comprehensive evaluation information; and determining the color value evaluation information, the tone evaluation information, the content evaluation information and the anchor expressiveness comprehensive evaluation information as video comprehensive evaluation information. The method and the device can improve the accuracy of the evaluation result of the anchor expressive force.

Description

Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment
Technical Field
The invention relates to the technical field of video processing, in particular to a method and a device for evaluating the expressive force of an anchor based on multi-mode fusion and to computing equipment.
Background
The current live broadcast industry chain centered on the network anchor is becoming increasingly complete and mature, but the existing algorithms for evaluating network anchors have simple pipelines and poor effects, and an intelligent scoring system that can analyze an anchor in multiple dimensions is lacking. In order to continuously improve live broadcast quality and promote the development of the live broadcast industry, a 3D-CNN, openSMILE and a CNN can be used to extract image, audio and text features from a live broadcast video, and the extracted image, audio and text features can be evaluated through a CAT-LSTM model that incorporates an attention mechanism, so as to obtain separate evaluation results for the different modalities such as image, audio and text. However, it has been found in practice that, although this anchor expressive force evaluation method can capture the information within each modality comprehensively and accurately, its study of the interactivity among the modalities is not deep enough, so the finally obtained evaluation result of the anchor expressive force is not accurate enough.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides an anchor expressive force evaluation method, device and computing equipment based on multi-mode fusion, which can comprehensively evaluate the anchor color value, timbre, live broadcast content and overall performance based on multi-modal characteristics, and improve the accuracy of the evaluation result of the anchor expressive force.
According to an aspect of the embodiments of the present invention, there is provided a method for evaluating anchor expressiveness based on multi-modal fusion, including:
extracting color value characteristics of an input video to be evaluated to obtain anchor color value characteristics and color value evaluation information corresponding to the anchor color value characteristics;
extracting tone color characteristics of the video to be evaluated to obtain anchor tone color characteristics and tone color evaluation information corresponding to the anchor tone color characteristics;
extracting content characteristics of the video to be evaluated to obtain live broadcast content characteristics and content evaluation information corresponding to the live broadcast content characteristics;
inputting the anchor color value characteristics, the anchor tone color characteristics and the live broadcast content characteristics into a performance evaluation model obtained by pre-training to obtain anchor performance comprehensive evaluation information;
and determining the color value evaluation information, the tone evaluation information, the content evaluation information and the anchor expressiveness comprehensive evaluation information as video comprehensive evaluation information.
As an optional implementation manner, the manner of performing color value feature extraction on the input video to be evaluated to obtain anchor color value features and color value evaluation information corresponding to the anchor color value features is specifically as follows:
performing color value feature extraction on an input video to be evaluated through a pre-constructed color value feature extraction model to obtain anchor color value features and color value evaluation information corresponding to the anchor color value features;
the color value feature extraction model comprises an input convolution layer, a maximum pooling layer, a first residual convolution layer, a second residual convolution layer, a third residual convolution layer, a fourth residual convolution layer and an average pooling layer; the convolution kernel size of the input convolution layer is 7 × 7, and the step size of the input convolution layer is 2; the convolution kernel size of the maximum pooling layer is 3 × 3, and the step size of the maximum pooling layer is 2; the convolution kernel sizes of the first, second, third and fourth residual convolution layers are all 3 × 3, and the step sizes of the first, second, third and fourth residual convolution layers are all 1; the convolution kernel size of the average pooling layer is 1 × 1, and the step size of the average pooling layer is 1.
As an optional implementation manner, the extracting the timbre feature of the video to be evaluated to obtain the anchor timbre feature and timbre evaluation information corresponding to the anchor timbre feature includes:
performing audio extraction on the video to be evaluated to obtain voice audio data included in the video to be evaluated;
carrying out pre-emphasis processing on the voice audio data to obtain emphasized audio data;
performing framing processing on the emphasized audio data to obtain framed audio data;
windowing the frame audio data to obtain windowed voice data;
extracting acoustic features of the windowed voice data to obtain acoustic feature parameters;
obtaining a multi-dimensional acoustic feature vector according to the acoustic feature parameters;
extracting tone features of the acoustic feature vectors to obtain anchor tone features;
and evaluating the anchor tone color characteristics to obtain tone color evaluation information corresponding to the anchor tone color characteristics.
As an optional implementation manner, the extracting content features of the video to be evaluated to obtain live content features and content evaluation information corresponding to the live content features includes:
performing character transcription on the video to be evaluated to obtain text information contained in the video to be evaluated;
superposing the training corpus obtained by pre-training with the text information to obtain the live broadcast content characteristics corresponding to the text information; the training corpus is obtained by training a pre-constructed bidirectional coding model;
and evaluating the live content characteristics to obtain content evaluation information corresponding to the live content characteristics.
As an optional implementation manner, the anchor color value characteristic is a characteristic of the color value characteristic type, the anchor tone color characteristic is a characteristic of the tone color characteristic type, and the live broadcast content characteristic is a characteristic of the content characteristic type, and the inputting of the anchor color value characteristic, the anchor tone color characteristic and the live broadcast content characteristic into the expressive force evaluation model obtained through pre-training to obtain the anchor expressive force comprehensive evaluation information includes:
respectively performing long-distance context representation on the anchor color value characteristic, the anchor tone color characteristic and the live broadcast content characteristic to obtain a color value characteristic sequence corresponding to the anchor color value characteristic, a tone color characteristic sequence corresponding to the anchor tone color characteristic and a content characteristic sequence corresponding to the live broadcast content characteristic; the type of the modal information in each characteristic sequence at least comprises an output state type and a characteristic state type;
obtaining a plurality of cross attention value matrixes according to the color value characteristic sequence, the tone characteristic sequence and the content characteristic sequence; wherein each cross attention value matrix comprises two characteristic types, and the two characteristic types included in any two cross attention value matrices are not completely the same;
constructing a multi-tensor fusion network according to the plurality of cross attention value matrixes; the multi-tensor fusion network comprises a plurality of groups of cross attention value matrixes, and each group of cross attention value matrixes comprises three characteristic types;
obtaining a fusion tensor according to the multi-tensor fusion network;
and inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain the comprehensive expressive force evaluation information of the anchor.
As an optional implementation, the obtaining a fusion tensor according to the multi-tensor fusion network includes:
respectively compressing a plurality of groups of cross attention value matrix groups in the multi-tensor fusion network to obtain a plurality of compressed tensors;
and combining the plurality of compression tensors to obtain a fusion tensor.
As an optional implementation manner, the inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain anchor expressive force comprehensive evaluation information includes:
inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain initial evaluation information;
and activating the initial evaluation information based on an activation function and preset parameters to obtain the comprehensive evaluation information of the anchor expressive force.
According to another aspect of the embodiments of the present invention, there is also provided an anchor expressiveness evaluation device based on multi-modal fusion, including:
the color value feature extraction unit is used for extracting color value features of an input video to be evaluated to obtain an anchor color value feature and color value evaluation information corresponding to the anchor color value feature;
a tone characteristic extraction unit, configured to perform tone characteristic extraction on the video to be evaluated to obtain anchor tone characteristics and tone evaluation information corresponding to the anchor tone characteristics;
the content feature extraction unit is used for extracting content features of the video to be evaluated to obtain live broadcast content features and content evaluation information corresponding to the live broadcast content features;
the input unit is used for inputting the anchor color value characteristics, the anchor tone color characteristics and the live content characteristics into an expressive force evaluation model obtained through pre-training to obtain anchor expressive force comprehensive evaluation information;
a determination unit configured to determine the color value evaluation information, the tone color evaluation information, the content evaluation information, and the anchor expressiveness comprehensive evaluation information as video comprehensive evaluation information.
According to still another aspect of the embodiments of the present invention, there is also provided a computing device including: at least one processor, a memory, and an input-output unit; the memory is used for storing a computer program, and the processor is used for calling the computer program stored in the memory to execute the anchor performance evaluation method based on multi-modal fusion.
According to still another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium including instructions that, when executed on a computer, cause the computer to execute the above-described anchor expressiveness evaluation method based on multimodal fusion.
In the embodiment of the invention, the extraction of the anchor color value characteristic, the anchor tone color characteristic and the live broadcast content characteristic is carried out on the input video to be evaluated, the evaluation aiming at the anchor color value, the anchor tone color and the live broadcast content can be obtained, the comprehensive analysis can be carried out on the anchor color value characteristic, the anchor tone color characteristic and the live broadcast content characteristic, the comprehensive evaluation information of the anchor expressive force is obtained, and the accuracy of the evaluation result of the anchor expressive force is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart diagram of an alternative multi-modal fusion-based anchor performance evaluation method according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a multi-modal tensor fusion network according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for acquiring anchor expressiveness comprehensive evaluation information according to an embodiment of the present invention;
FIG. 4 is a block diagram of a bi-directional coding model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an alternative multi-modal fusion-based anchor performance evaluation apparatus according to an embodiment of the present invention;
FIG. 6 schematically illustrates a structural view of a medium according to an embodiment of the present invention;
fig. 7 schematically shows a structural diagram of a computing device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Referring to fig. 1, fig. 1 is a schematic flowchart of a method for evaluating anchor expressiveness based on multi-modal fusion according to an embodiment of the present invention. It should be noted that the embodiments of the present invention can be applied to any applicable scenarios.
Fig. 1 shows a flowchart of a method for evaluating anchor expressiveness based on multi-modal fusion according to an embodiment of the present invention, including:
step S101, extracting color value characteristics of an input video to be evaluated to obtain anchor color value characteristics and color value evaluation information corresponding to the anchor color value characteristics.
In the embodiment of the present invention, a color value feature extraction model obtained through pre-training may be used to extract color value features of an anchor in a video to be evaluated, the color value feature extraction model may be constructed based on a Residual Network (ResNet), and the Residual Network may be a ResNet18, a ResNet34, a ResNet50, a ResNet101, a ResNet152 Network, or the like, which is not limited in the embodiment of the present invention. The color value characteristic extraction model extracts anchor color value characteristics in the video to be evaluated, and can evaluate the anchor color value characteristics to obtain color value evaluation information. The color value evaluation information may include comment information and a score for the anchor color value in the video to be evaluated.
As an optional implementation manner, in step S101, a manner of performing color value feature extraction on an input video to be evaluated to obtain a anchor color value feature and color value evaluation information corresponding to the anchor color value feature may specifically be:
and performing color value feature extraction on the input video to be evaluated through a pre-constructed color value feature extraction model to obtain anchor color value features and color value evaluation information corresponding to the anchor color value features.
The color value feature extraction model comprises an input convolution layer, a maximum pooling layer, a first residual convolution layer, a second residual convolution layer, a third residual convolution layer, a fourth residual convolution layer and an average pooling layer; the convolution kernel size of the input convolution layer is 7 × 7, and the step size of the input convolution layer is 2; the convolution kernel size of the maximum pooling layer is 3 × 3, and the step size of the maximum pooling layer is 2; the convolution kernel sizes of the first, second, third and fourth residual convolution layers are all 3 × 3, and the step sizes of the first, second, third and fourth residual convolution layers are all 1; the convolution kernel size of the average pooling layer is 1 × 1, and the step size of the average pooling layer is 1.
According to the embodiment of the invention, more accurate anchor color value characteristics can be extracted, and more accurate color value evaluation information can be obtained. The network structure of the color value feature extraction model can be seen in table 1:
TABLE 1 network architecture for a color value feature extraction model
Layer                                        Convolution kernel size    Step size
Input convolution layer                      7 × 7                      2
Maximum pooling layer                        3 × 3                      2
First to fourth residual convolution layers  3 × 3                      1
Average pooling layer                        1 × 1                      1
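As an illustration of this layer configuration, a minimal PyTorch sketch is given below; the channel width, padding values, the single residual block per layer and the linear score head are assumptions not specified in the text, so this is only a sketch of the described structure rather than the patented model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with stride 1 and an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)          # residual connection

class ColorValueExtractor(nn.Module):
    """Input conv 7x7/2 -> max pool 3x3/2 -> four residual conv layers 3x3/1 -> avg pool 1x1/1."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.res_layers = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        self.avg_pool = nn.AvgPool2d(kernel_size=1, stride=1)
        self.score_head = nn.Linear(channels, 1)    # hypothetical color value score head

    def forward(self, frames):                      # frames: (N, 3, H, W) anchor face frames
        x = self.pool(self.stem(frames))
        x = self.avg_pool(self.res_layers(x))
        feat = x.mean(dim=(2, 3))                   # anchor color value feature
        score = self.score_head(feat)               # color value evaluation information
        return feat, score
```

With this sketch, a batch of anchor face frames shaped (N, 3, H, W) yields a per-frame feature vector together with a scalar color value score.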
Step S102, extracting tone color characteristics of the video to be evaluated to obtain anchor tone color characteristics and tone color evaluation information corresponding to the anchor tone color characteristics.
In the embodiment of the invention, the timbre characteristics of the anchor in the video to be evaluated can be extracted through a timbre characteristic extraction model obtained through pre-training, and the timbre characteristic extraction model can be constructed based on a Long Short Term Memory network (LSTM). The tone characteristic extraction model extracts anchor tone characteristics in the video to be evaluated, and can evaluate the anchor tone characteristics to obtain tone evaluation information. The tone evaluation information may include comment information and a score for the anchor tone in the video to be evaluated.
As an optional implementation manner, in step S102, performing tone color feature extraction on the video to be evaluated to obtain anchor tone color features and tone color evaluation information corresponding to the anchor tone color features may specifically be:
performing audio extraction on the video to be evaluated to obtain voice audio data included in the video to be evaluated;
pre-emphasis processing is carried out on the voice audio data to obtain emphasized audio data;
performing framing processing on the emphasized audio data to obtain framed audio data;
windowing the frame audio data to obtain windowed voice data;
extracting acoustic features of the windowed voice data to obtain acoustic feature parameters;
obtaining a multi-dimensional acoustic feature vector according to the acoustic feature parameters;
extracting tone features of the acoustic feature vectors to obtain anchor tone features;
and evaluating the anchor tone color characteristics to obtain tone color evaluation information corresponding to the anchor tone color characteristics.
By implementing the implementation mode, the audio data can be optimized through operations of pre-emphasis, framing, windowing and the like, so that the obtained anchor tone color characteristics are more accurate.
In the embodiment of the present invention, pre-processing operations such as pre-emphasis, framing and windowing may be performed on the voice audio data extracted from the video to be evaluated, and then the LibROSA audio toolkit is used to extract acoustic feature parameters, where the acoustic feature parameters may include, but are not limited to, the chromagram, RMS, spectral centroid, spectral bandwidth, spectral slope, zero-crossing rate, Mel-Frequency Cepstral Coefficients (MFCC), and the like, and the extracted acoustic feature parameters may be combined into a 26-dimensional acoustic feature vector; the acoustic feature vectors can then be input into a tone color feature extraction model constructed from an LSTM neural network to further extract the anchor tone color features, and the anchor tone color features can be evaluated to obtain the tone color evaluation information.
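A minimal sketch of this audio pipeline using librosa is shown below; the sampling rate, frame and hop lengths, pre-emphasis coefficient and the exact composition of the 26 dimensions (20 MFCCs plus six spectral statistics, with spectral roll-off standing in for the "spectral slope" above) are assumptions for illustration, and framing/windowing is handled internally by librosa's STFT rather than written out explicitly.

```python
import numpy as np
import librosa

def extract_acoustic_features(wav_path, sr=16000, frame_len=400, hop_len=160):
    """Pre-emphasis, framing/windowing (via librosa's STFT) and a 26-dim feature vector."""
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y, coef=0.97)      # pre-emphasis

    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=frame_len, hop_length=hop_len)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=frame_len, hop_length=hop_len)
    rms = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop_len)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=frame_len, hop_length=hop_len)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr, n_fft=frame_len, hop_length=hop_len)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, n_fft=frame_len, hop_length=hop_len)
    zcr = librosa.feature.zero_crossing_rate(y, frame_length=frame_len, hop_length=hop_len)

    # Average each feature over time: 20 MFCCs + 6 scalar statistics = 26 dimensions.
    scalars = [f.mean() for f in (chroma, rms, centroid, bandwidth, rolloff, zcr)]
    return np.concatenate([mfcc.mean(axis=1), scalars]).astype(np.float32)
```

The resulting per-segment vectors can then be stacked into a sequence and fed to the LSTM-based tone color feature extraction model.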
Step S103, extracting content characteristics of the video to be evaluated to obtain live broadcast content characteristics and content evaluation information corresponding to the live broadcast content characteristics.
In the embodiment of the invention, the live broadcast content features in the video to be evaluated can be extracted through a content feature extraction model obtained by pre-training, and the content feature extraction model can be constructed based on Bidirectional Encoder Representations from Transformers (BERT) and a Bidirectional Long Short-Term Memory (Bi-LSTM) network. The content feature extraction model extracts the live broadcast content features in the video to be evaluated, and can evaluate the live broadcast content features to obtain content evaluation information. The content evaluation information may include comment information and scores for the live broadcast content in the video to be evaluated.
As an optional implementation manner, in step S103, content feature extraction is performed on the video to be evaluated, and a manner of obtaining live content features and content evaluation information corresponding to the live content features may specifically be:
performing character transcription on the video to be evaluated to obtain text information contained in the video to be evaluated;
superposing the training corpus obtained by pre-training with the text information to obtain the live broadcast content characteristics corresponding to the text information; the training corpus is obtained by training a pre-constructed bidirectional coding model;
and evaluating the live broadcast content characteristics to obtain content evaluation information corresponding to the live broadcast content characteristics.
By implementing the implementation mode, the text information can be superimposed through the training corpus obtained by the training of the bidirectional coding model, so that the accuracy of the obtained live broadcast content characteristics is improved.
In the embodiment of the invention, the voice audio data extracted from the video to be evaluated can be transcribed into text by using a speech-to-text technology and stored in a document, and each video segment to be evaluated can correspond to one text file; a BERT model can be selected to perform embedding superposition on the training corpus with sequence labeling, and the processed vector representation is fed into a Bi-LSTM neural network, whose structure is shown in fig. 4, so as to obtain the content evaluation information corresponding to the live broadcast content features.
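A minimal sketch of this text branch is shown below, assuming the Hugging Face transformers package and the bert-base-chinese checkpoint (the patent does not name a specific checkpoint); the hidden size, mean pooling and the linear score head are likewise illustrative assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

class ContentFeatureExtractor(nn.Module):
    """BERT embeddings fed into a Bi-LSTM, as described for the content branch."""
    def __init__(self, bert_name="bert-base-chinese", hidden=128):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained(bert_name)
        self.bert = BertModel.from_pretrained(bert_name)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, hidden,
                              batch_first=True, bidirectional=True)
        self.score_head = nn.Linear(2 * hidden, 1)    # hypothetical content score head

    def forward(self, transcripts):                   # list of transcribed strings
        enc = self.tokenizer(transcripts, padding=True, truncation=True,
                             return_tensors="pt")
        with torch.no_grad():                         # BERT used here as a frozen encoder
            emb = self.bert(**enc).last_hidden_state  # (N, L, 768)
        out, _ = self.bilstm(emb)
        feat = out.mean(dim=1)                        # live broadcast content feature
        return feat, self.score_head(feat)            # content evaluation information
```

Freezing BERT and training only the Bi-LSTM and head is one common choice; the patent does not specify which parameters are fine-tuned.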
And step S104, inputting the anchor color value characteristics, the anchor tone color characteristics and the live broadcast content characteristics into a pre-trained expressive force evaluation model to obtain anchor expressive force comprehensive evaluation information.
In this embodiment of the present invention, the anchor color value feature is a feature of a color value feature type, the anchor tone color feature is a feature of a tone color feature type, and the live content feature is a feature of a content feature type.
In another embodiment of the present invention, in order to obtain more comprehensive anchor expressive force comprehensive evaluation information, the modal information of multiple feature types may be fused through a cross attention mechanism. As shown in fig. 3, the above step S104 may be replaced by the following steps S301 to S305; please also refer to fig. 2:
step S301, performing long-distance context representation on the anchor color value feature, the anchor tone color feature and the live broadcast content feature respectively to obtain a color value feature sequence corresponding to the anchor color value feature, a tone color feature sequence corresponding to the anchor tone color feature and a content feature sequence corresponding to the live broadcast content feature.
In the embodiment of the invention, the video to be evaluated can be a live webcast video. Long-distance context representation can be performed, through an LSTM network, on the anchor color value feature extracted by the appearance feature extraction module, the anchor tone color feature extracted by the audio feature extraction module and the live broadcast content feature extracted by the text feature extraction module, so as to obtain a color value feature sequence corresponding to the anchor color value feature, a tone color feature sequence corresponding to the anchor tone color feature and a content feature sequence corresponding to the live broadcast content feature, where the type of the modal information in each feature sequence at least comprises an output state type and a feature state type. That is, each of the color value feature sequence, the tone color feature sequence and the content feature sequence comprises the modal information O_l of the output state type output by the corresponding feature extraction module and the modal information H_l of the feature state type output by the hidden layer of the LSTM network. Specifically, assuming that the modal information of the output state type of the last layer of the model is O_l, the modal information of the feature state type of the hidden layer is H_l, and the feature sequence of the video modal information is (S_1, S_2, …, S_n), then:

O_l, H_l = LSTM(S_1, S_2, …, S_n),  l ∈ {V, A, T}

where n is the length of the feature sequence, the subscript l denotes the modality of the video, and V, A and T refer to the color value, timbre and text modalities, respectively.
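As a sketch of this step (PyTorch assumed, dimensions illustrative), each single-modal feature sequence is passed through an LSTM whose output sequence plays the role of O_l and whose final hidden state plays the role of H_l:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Long-distance context representation for one modality l in {V, A, T}."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)

    def forward(self, seq):             # seq: (N, n, feat_dim) = (S_1, ..., S_n)
        O_l, (h_n, _) = self.lstm(seq)  # O_l: output-state modal information, (N, n, hidden)
        H_l = h_n[-1]                   # H_l: feature-state modal information from the hidden layer
        return O_l, H_l

# One encoder per modality: V (color value), A (timbre), T (text).
encoders = {m: ContextEncoder(feat_dim=64) for m in ("V", "A", "T")}
```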
Step S302, obtaining a plurality of cross attention value matrixes according to the color value characteristic sequence, the tone characteristic sequence and the content characteristic sequence.
In the embodiment of the invention, each cross attention value matrix comprises two feature types, and the two feature types included in any two cross attention value matrices are not completely the same; that is, two different types of modal information are embedded into a cross attention mechanism matrix, which enhances the effective salient features in the modal interaction and weakens the irrelevant features.
In the embodiment of the invention, a cross-attention mechanism is added in dual-mode embedding and fusion to capture the interaction characteristics of the modal pair.
For example, taking the color value modality and the text modality as an example, the output features of the color value branch of the LSTM are fused with the hidden state of the text branch through a cross attention mechanism, and the cross attention values Attention'_VT and Attention'_TV are then calculated to capture the interactive features of the color value and the text. The calculation process using the cross attention mechanism is as follows:

Attention_VT = Attention'_VT · O_V

Attention_TV = Attention'_TV · O_T

where H_V and H_T are the hidden-layer state output features obtained after the color value branch and the text branch are respectively input into the LSTM, and the cross attention weight matrices Attention'_VT and Attention'_TV are computed from these hidden-layer states and the opposite branch's output features, scaled by a normalization coefficient; O_V and O_T represent the single-modal contexts from the color value and text branch output features, respectively; Attention_VT represents the cross attention mechanism value matrix of the color value modal features embedded with the text context information (O_T); Attention_TV represents the cross attention mechanism value matrix of the text modal features embedded with the color value context information (O_V). By analogy, 6 groups of cross attention value matrices (the 3 single-modal features interacting pairwise) can be obtained, which are not described herein again.
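The weight computation itself appears only as an image in the original, so the sketch below assumes a standard softmax-normalized, scaled dot product between the hidden states of one branch and the output features of the other; only the relation Attention = Attention' · O is taken directly from the text.

```python
import math
import torch

def cross_attention(O_src, H_other):
    """Compute a cross attention value matrix such as Attention_VT = Attention'_VT · O_V.

    O_src:   (N, n, d) single-modal context of the source branch, e.g. O_V.
    H_other: (N, n, d) hidden-layer states of the other branch, e.g. H_T.
    """
    d = O_src.size(-1)
    scores = torch.bmm(H_other, O_src.transpose(1, 2)) / math.sqrt(d)  # assumed scoring rule
    attn = torch.softmax(scores, dim=-1)                               # Attention' weights
    return torch.bmm(attn, O_src)                                      # Attention value matrix

# Applying this to every ordered pair of V, A and T yields the six matrices
# Attention_VT, Attention_TV, Attention_VA, Attention_AV, Attention_AT, Attention_TA.
```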
Step S303, constructing a multi-tensor fusion network according to the plurality of cross attention value matrixes.
In the embodiment of the invention, the multi-tensor fusion network comprises a plurality of groups of cross attention value matrices, and each group of cross attention value matrices covers three feature types. The 6 groups of cross attention value matrices can be divided equally into two groups, where the cross-modal features of each group must simultaneously cover the V, A and T modalities, and a multi-tensor fusion network (namely the multi-modal tensor fusion network TFN-AM) can be obtained through this grouping. The extracted color value feature is denoted as z_v, the audio feature as z_a and the text feature as z_t, and the feature fusion network is defined using the triple Cartesian product as the following vector field:

{ (z_v, z_a, z_t) | z_v ∈ [z_v; 1], z_a ∈ [z_a; 1], z_t ∈ [z_t; 1] }

The extra constant dimension with value 1 generates both the single-modal and the bimodal features, so before the multi-level tensor fusion a "1" needs to be spliced onto the feature vector of each modality so that all modalities can be modeled correctly. The final multi-tensor fusion network is defined by the two operators:

Tensor_bi(X, Y) = [X; 1] ⊗ [Y; 1]

Tensor_tr(X, Y, W) = [X; 1] ⊗ [Y; 1] ⊗ [W; 1]

where ⊗ represents the vector outer product; Tensor_bi is divided into sub-regions whose characterization of the interaction between modality pairs is used to capture the bimodal effects, and Tensor_tr contains a sub-region whose characterization of the interaction among the three modalities is used to capture the trimodal effect.
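Under the reconstruction above, the two fusion operators can be sketched as follows; splicing a constant 1 and taking outer products follows the text, while flattening the result per sample is an implementation choice.

```python
import torch

def pad_one(x):
    """Splice a constant '1' onto each modality's feature vector: (N, d) -> (N, d + 1)."""
    ones = torch.ones(x.size(0), 1, device=x.device, dtype=x.dtype)
    return torch.cat([x, ones], dim=-1)

def tensor_bi(x, y):
    """Bimodal fusion: outer product of the padded vectors, flattened per sample."""
    x1, y1 = pad_one(x), pad_one(y)
    return torch.einsum("bi,bj->bij", x1, y1).flatten(1)

def tensor_tr(x, y, w):
    """Trimodal fusion: triple outer product of the padded vectors, flattened per sample."""
    x1, y1, w1 = pad_one(x), pad_one(y), pad_one(w)
    return torch.einsum("bi,bj,bk->bijk", x1, y1, w1).flatten(1)
```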
Step S304, obtaining a fusion tensor according to the multi-tensor fusion network.
As an optional implementation manner, the manner in which step S304 obtains the fusion tensor according to the multi-tensor fusion network may specifically be:
respectively compressing a plurality of groups of cross attention value matrix groups in the multi-tensor fusion network to obtain a plurality of compressed tensors;
and combining the plurality of compression tensors to obtain a fusion tensor.
By implementing this embodiment, the high-dimensional relationship of the obtained fusion tensor can be improved.
In the embodiment of the present invention, Z_1 and Z_2 obtained by the following formulas are regarded as new feature views; they are compressed and then combined again to realize hierarchical fusion, and finally a multi-modal fusion tensor network with a high-dimensional relationship can be obtained:

Z = Tensor_bi(Z_1, Z_2)

Z_1 = Tensor_tr(Attention_TV, Attention_VA, Attention_AT)

Z_2 = Tensor_tr(Attention_TA, Attention_VT, Attention_AV)
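Continuing the sketch (and reusing tensor_tr and tensor_bi from above), the hierarchical fusion can be written as below; pooling each attention matrix to a vector and compressing Z_1 and Z_2 with a linear layer are assumptions, since the patent does not fix the compression operator.

```python
import torch
import torch.nn as nn

class HierarchicalFusion(nn.Module):
    """Group the six cross attention matrices into Z_1 and Z_2, compress, then fuse into Z."""
    def __init__(self, attn_dim, compressed=64):
        super().__init__()
        tr_dim = (attn_dim + 1) ** 3                    # flattened size produced by tensor_tr
        self.compress = nn.Linear(tr_dim, compressed)   # assumed compression operator

    def forward(self, attn):
        """attn: dict keyed by 'TV', 'VA', 'AT', 'TA', 'VT', 'AV', each (N, n, attn_dim)."""
        v = {k: m.mean(dim=1) for k, m in attn.items()}            # pool each matrix to a vector
        z1 = self.compress(tensor_tr(v["TV"], v["VA"], v["AT"]))   # first group covering V, A, T
        z2 = self.compress(tensor_tr(v["TA"], v["VT"], v["AV"]))   # second group covering V, A, T
        return tensor_bi(z1, z2)                                   # fusion tensor Z
```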
step S305, inputting the fusion tensor and the preset feature weight into a performance evaluation model obtained by pre-training to obtain the comprehensive evaluation information of the anchor performance.
As an optional implementation manner, in step S305, the fusion tensor and the preset feature weight are input to an expressive force evaluation model obtained by pre-training, and a manner of obtaining the anchor expressive force comprehensive evaluation information may specifically be:
inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain initial evaluation information;
and activating the initial evaluation information based on an activation function and preset parameters to obtain the comprehensive evaluation information of the anchor expressive force.
By implementing the implementation mode, the sparse characteristics can be compressed through the expression evaluation model, so that the interactive characteristics are converged in a concentrated manner, and the accuracy of the anchor expression comprehensive evaluation information is improved.
In the embodiment of the invention, the anchor expressive force comprehensive evaluation information is finally obtained by a fully connected neural network (the expressive force evaluation model) and a Sigmoid function. The specific method is defined as follows:

I = 6 * Sigmoid(FC(Z; W_s)) - 3

where I is the anchor expressive force comprehensive evaluation information, Z is the fusion tensor, W_s is the weight, and FC is a fully connected neural network that can compress sparse features so that the interactive features converge; it comprises two dropout layers and two ReLU (Rectified Linear Unit) activation functions.
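A sketch matching I = 6 * Sigmoid(FC(Z; W_s)) - 3, with an FC network containing two dropout layers and two ReLU activations, is given below; the layer widths and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class ExpressivenessHead(nn.Module):
    """Fully connected network plus a Sigmoid rescaled to the range (-3, 3)."""
    def __init__(self, in_dim, hidden=256, p_drop=0.3):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 1),
        )

    def forward(self, Z):
        # I = 6 * Sigmoid(FC(Z; W_s)) - 3: anchor expressive force comprehensive score.
        return 6 * torch.sigmoid(self.fc(Z)) - 3
```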
In the embodiment of the invention, the expressive force evaluation model can be constructed and trained so as to enable the anchor expressive force comprehensive evaluation information output by the expressive force evaluation model to be more accurate.
In the embodiment of the invention, the expressive force evaluation model can be trained on the constructed multi-modal data set LSVideo, and the multi-modal data set LSVideo can be obtained in the following manner: live broadcast videos of the same type are crawled from each major live broadcast website; viewers from the live broadcast industry are invited to give single scores and a comprehensive score for the anchor color value, tone color and live broadcast content of each segment under the condition that video, audio and text are separated; and label values are calculated by combining the manual scores and the anchor popularity according to a formula, so as to produce a multi-modal data set LSVideo dedicated to the anchor scoring task.
Specifically, according to the 80/20 rule, live broadcast videos of the same type whose popularity ranks in the top 20% of each live broadcast platform can be crawled, together with information such as the anchor's number of fans, the number of online viewers, the number of viewers sending gifts, the number of bullet comments and the anchor popularity, where the live broadcast videos only contain the anchor's complete face and the anchor's Mandarin audio. For example, a total of 306 anchor live broadcast videos from different platforms were collected, covering both male and female anchors, with a total live broadcast time of up to 50 hours. The videos were then cut at the frame level to obtain target segments with lengths between 3 seconds and 10 seconds, and 2236 target segments were finally obtained; the detailed information is shown in table 2:
TABLE 2 details of the multimodal dataset LSVideo
In the embodiment of the invention, a number of high-quality viewers from the live broadcast industry can be invited to give a single score (an integer from 1 to 5) to the anchor color value, tone color and live broadcast content of each segment under the condition that video, audio and text are separated. The average values of the color value, tone color, live broadcast content and comprehensive scores are then calculated and recorded as V, A, T and C, and any video segment whose manual scoring result differs from the average score by more than 2 is recorded as abnormal data and eliminated. The scoring results of each dimension of each video and the anchor's own heat can be compared pairwise, and a discrimination matrix is constructed through the analytic hierarchy process and a subjective weighting method; the eigenvalues corresponding to the discrimination matrix are calculated first, the largest eigenvalue is selected, the eigenvector corresponding to this eigenvalue is then calculated (see the sketch after table 3), and finally the weights of the color value, tone color, live broadcast content and popularity dimensions are obtained, denoted W_v, W_a, W_t and W_H respectively; the specific numerical values are shown in table 3.
TABLE 3 evaluation dimension weight values
Evaluation dimension    Weight
Color value             0.6062
Timbre                  0.2081
Live content            0.1302
Popularity               0.0555
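As referenced above, the eigenvector step can be sketched with NumPy as follows; the pairwise discrimination (judgment) matrix shown is hypothetical and only illustrates the procedure of taking the eigenvector of the largest eigenvalue and normalizing it into weights.

```python
import numpy as np

# Hypothetical pairwise discrimination matrix over the four dimensions
# (color value, timbre, live content, popularity); entry [i, j] states how much
# more important dimension i is judged to be than dimension j.
D = np.array([[1.0, 4.0, 5.0, 9.0],
              [1/4, 1.0, 2.0, 4.0],
              [1/5, 1/2, 1.0, 3.0],
              [1/9, 1/4, 1/3, 1.0]])

eigvals, eigvecs = np.linalg.eig(D)
principal = eigvecs[:, np.argmax(eigvals.real)].real   # eigenvector of the largest eigenvalue
weights = principal / principal.sum()                  # normalized dimension weights
print(dict(zip(["color value", "timbre", "live content", "popularity"], weights.round(4))))
```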
In the embodiment of the invention, the manual scores and the anchor popularity can be fused to calculate the labels. After the abnormal values are eliminated, the color value score, the tone color score and the live broadcast content score corresponding to the nth anchor are recorded. The anchor popularity is then calculated from the evaluation elements of the number of fans, the number of online viewers, the number of gifts sent, the number of bullet comments and the anchor's own heat value, with the calculation formula A_H = (F + C + G + D) × A_l. The manual scores and the anchor popularity are then fused to calculate the label value: the single-modal scores and the anchor popularity are multiplied by the corresponding weights W_t, W_a, W_v and W_H and summed, so as to obtain the anchor expressive force label value.
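A short worked example of the label computation is given below; every count and score is hypothetical, the popularity formula A_H = (F + C + G + D) × A_l is taken from the text, and the way A_H is scaled before entering the weighted sum is an assumption since the exact label formula appears only as an image in the original.

```python
# Hypothetical inputs for one anchor, for illustration only.
F, C, G, D = 12000, 850, 300, 4200      # fans, online viewers, gifts sent, bullet comments
A_l = 0.8                               # anchor's own heat value
V_n, A_n, T_n = 4.2, 3.8, 3.5           # averaged color value / tone color / content scores

W_v, W_a, W_t, W_H = 0.6062, 0.2081, 0.1302, 0.0555   # dimension weights from table 3

A_H = (F + C + G + D) * A_l             # anchor popularity
A_H_scaled = min(5.0, A_H / 5000)       # assumed rescaling of popularity into the 1-5 range

# Weighted sum of the single-modal scores and the scaled popularity -> label value.
label = W_v * V_n + W_a * A_n + W_t * T_n + W_H * A_H_scaled
print(round(label, 4))
```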
by implementing the above steps S301 to S305, the modal information of a plurality of feature types can be fused by the cross attention mechanism, so that the obtained anchor expression comprehensive evaluation information is more comprehensive.
Step S105, determining the color value evaluation information, the tone evaluation information, the content evaluation information, and the anchor expression comprehensive evaluation information as video comprehensive evaluation information.
The method and the device can extract the anchor color value characteristic, the anchor tone color characteristic and the live content characteristic of the input video to be evaluated, can obtain the evaluation aiming at the anchor color value, the anchor tone color and the live content, can comprehensively analyze the anchor color value characteristic, the anchor tone color characteristic and the live content characteristic, obtain the comprehensive evaluation information of the anchor expressiveness, and improve the accuracy of the evaluation result of the anchor expressiveness. In addition, the invention can also obtain more accurate color value evaluation information. In addition, the invention can also make the obtained characteristics of the timbre of the anchor more accurate. In addition, the method and the device can also improve the accuracy of the obtained live content characteristics. In addition, the invention can also make the comprehensive evaluation information of the anchor expressive force more comprehensive. In addition, the invention can also improve the high-dimensional relation of the obtained fusion tensor. In addition, the accuracy of the comprehensive evaluation information of the anchor expressive force can be improved.
Having described the method of an exemplary embodiment of the present invention, next, a multi-modal fusion based anchor expression evaluation apparatus of an exemplary embodiment of the present invention will be described with reference to fig. 5, the apparatus including:
a color value feature extraction unit 501, configured to perform color value feature extraction on an input video to be evaluated to obtain a anchor color value feature and color value evaluation information corresponding to the anchor color value feature;
a tone characteristic extraction unit 502, configured to perform tone characteristic extraction on the video to be evaluated to obtain a anchor tone characteristic and tone evaluation information corresponding to the anchor tone characteristic;
a content feature extraction unit 503, configured to perform content feature extraction on the video to be evaluated to obtain live content features and content evaluation information corresponding to the live content features;
an input unit 504, configured to input the anchor color value features obtained by the color value feature extraction unit 501, the anchor tone color features obtained by the tone color feature extraction unit 502, and the live broadcast content features obtained by the content feature extraction unit 503 into a pre-trained expressive force evaluation model, so as to obtain anchor expressive force comprehensive evaluation information;
a determining unit 505, configured to determine the color value evaluation information obtained by the color value feature extracting unit 501, the tone color evaluation information obtained by the tone color feature extracting unit 502, the content evaluation information obtained by the content feature extracting unit 503, and the anchor expression comprehensive evaluation information obtained by the input unit 504 as video comprehensive evaluation information.
As an optional implementation manner, the manner in which the color value feature extraction unit 501 performs color value feature extraction on the input video to be evaluated to obtain the anchor color value feature and the color value evaluation information corresponding to the anchor color value feature may specifically be:
performing color value feature extraction on the input video to be evaluated through a pre-constructed color value feature extraction model to obtain the anchor color value feature and the color value evaluation information corresponding to the anchor color value feature; the color value feature extraction model comprises an input convolution layer, a maximum pooling layer, a first residual convolution layer, a second residual convolution layer, a third residual convolution layer, a fourth residual convolution layer and an average pooling layer; the convolution kernel size of the input convolution layer is 7 × 7, and the step size of the input convolution layer is 2; the convolution kernel size of the maximum pooling layer is 3 × 3, and the step size of the maximum pooling layer is 2; the convolution kernel sizes of the first, second, third and fourth residual convolution layers are all 3 × 3, and the step sizes of the first, second, third and fourth residual convolution layers are all 1; the convolution kernel size of the average pooling layer is 1 × 1, and the step size of the average pooling layer is 1.
By the implementation of the implementation mode, more accurate anchor color value characteristics can be extracted, and more accurate color value evaluation information can be obtained.
As an optional implementation manner, the manner in which the tone color feature extraction unit 502 performs tone color feature extraction on the video to be evaluated to obtain the anchor tone color feature and the tone color evaluation information corresponding to the anchor tone color feature may specifically be:
performing audio extraction on the video to be evaluated to obtain voice audio data included in the video to be evaluated;
pre-emphasis processing is carried out on the voice audio data to obtain emphasized audio data;
performing framing processing on the emphasized audio data to obtain framed audio data;
windowing the frame audio data to obtain windowed voice data;
extracting acoustic features of the windowed voice data to obtain acoustic feature parameters;
obtaining a multi-dimensional acoustic feature vector according to the acoustic feature parameters;
extracting tone features of the acoustic feature vectors to obtain anchor tone features;
and evaluating the anchor tone color characteristics to obtain tone color evaluation information corresponding to the anchor tone color characteristics.
By implementing this implementation manner, the audio data can be optimized through operations such as pre-emphasis, framing and windowing, so that the obtained anchor tone color features are more accurate.
As an optional implementation manner, the manner in which the content feature extraction unit 503 performs content feature extraction on the video to be evaluated to obtain the live broadcast content features and the content evaluation information corresponding to the live broadcast content features may specifically be:
performing character transcription on the video to be evaluated to obtain text information contained in the video to be evaluated;
superposing the training corpus obtained by pre-training with the text information to obtain the live broadcast content characteristics corresponding to the text information; the training corpus is obtained by training a pre-constructed bidirectional coding model;
and evaluating the live broadcast content characteristics to obtain content evaluation information corresponding to the live broadcast content characteristics.
By implementing the implementation mode, the text information can be superimposed through the training corpus obtained by the training of the bidirectional coding model, so that the accuracy of the obtained live broadcast content characteristics is improved.
As an optional implementation manner, the anchor color value feature is a feature of the color value feature type, the anchor tone color feature is a feature of the tone color feature type, and the live broadcast content feature is a feature of the content feature type, and the manner in which the input unit 504 inputs the anchor color value feature, the anchor tone color feature and the live broadcast content feature into the pre-trained expressive force evaluation model to obtain the anchor expressive force comprehensive evaluation information may specifically be:
respectively performing long-distance context representation on the anchor color value characteristic, the anchor tone color characteristic and the live broadcast content characteristic to obtain a color value characteristic sequence corresponding to the anchor color value characteristic, a tone color characteristic sequence corresponding to the anchor tone color characteristic and a content characteristic sequence corresponding to the live broadcast content characteristic; the type of the modal information in each feature sequence at least comprises an output state type and a feature state type;
obtaining a plurality of cross attention value matrixes according to the color value characteristic sequence, the tone characteristic sequence and the content characteristic sequence; wherein each cross attention value matrix comprises two characteristic types, and the two characteristic types included in any two cross attention value matrices are not completely the same;
constructing a multi-tensor fusion network according to the plurality of cross attention value matrixes; the multi-tensor fusion network comprises a plurality of groups of cross attention value matrix groups, and each group of cross attention value matrix groups comprises three characteristic types;
obtaining a fusion tensor according to the multi-tensor fusion network;
and inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain the comprehensive evaluation information of the anchor expressive force.
By implementing the implementation mode, the modal information of various characteristic types can be fused through a cross attention mechanism, so that the obtained comprehensive evaluation information of the anchor expressive force is more comprehensive.
As an optional implementation manner, the manner in which the input unit 504 obtains the fusion tensor according to the multi-tensor fusion network may specifically be:
respectively compressing a plurality of groups of cross attention value matrix groups in the multi-tensor fusion network to obtain a plurality of compression tensors;
and combining the plurality of compression tensors to obtain a fusion tensor.
By implementing this embodiment, the high-dimensional relationship of the obtained fusion tensor can be improved.
As an optional implementation manner, the manner in which the input unit 504 inputs the fusion tensor and the preset feature weight into the expressive force evaluation model obtained by pre-training to obtain the anchor expressive force comprehensive evaluation information may specifically be:
inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain initial evaluation information;
and activating the initial evaluation information based on an activation function and preset parameters to obtain the comprehensive evaluation information of the anchor expressive force.
By implementing the implementation mode, the sparse characteristics can be compressed through the expression evaluation model, so that the interactive characteristics are converged in a concentrated manner, and the accuracy of the anchor expression comprehensive evaluation information is improved.
Having described the method and apparatus of the exemplary embodiment of the present invention, a computer-readable storage medium of the exemplary embodiment of the present invention is next described with reference to fig. 6, which illustrates the computer-readable storage medium as an optical disc 60 having a computer program (i.e., a program product) stored thereon, where the computer program, when executed by a processor, implements the steps described in the above method embodiment, for example, performing color value feature extraction on an input video to be evaluated to obtain an anchor color value feature and color value evaluation information corresponding to the anchor color value feature; extracting tone color features of the video to be evaluated to obtain anchor tone color features and tone color evaluation information corresponding to the anchor tone color features; extracting content features of the video to be evaluated to obtain live broadcast content features and content evaluation information corresponding to the live broadcast content features; inputting the anchor color value features, the anchor tone color features and the live broadcast content features into an expressive force evaluation model obtained by pre-training to obtain anchor expressive force comprehensive evaluation information; and determining the color value evaluation information, the tone color evaluation information, the content evaluation information and the anchor expressive force comprehensive evaluation information as video comprehensive evaluation information; the specific implementation of each step is not repeated here.
It should be noted that examples of the computer-readable storage medium may also include, but are not limited to, a phase change memory (PRAM), a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), other types of Random Access Memories (RAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a flash memory, or other optical and magnetic storage media, which are not described in detail herein.
Having described the methods, media, and apparatus of exemplary embodiments of the invention, a computing device for multi-modal fusion based anchor performance evaluation of exemplary embodiments of the invention is next described with reference to FIG. 7.
FIG. 7 illustrates a block diagram of an exemplary computing device 70, which computing device 70 may be a computer system or server, suitable for use in implementing embodiments of the present invention. The computing device 70 shown in FIG. 7 is only one example and should not be taken to limit the scope of use and functionality of embodiments of the present invention.
As shown in fig. 7, components of computing device 70 may include, but are not limited to: one or more processors or processing units 701, a system memory 702, and a bus 703 that couples various system components including the system memory 702 and the processing unit 701.
Computing device 70 typically includes a variety of computer system readable media. Such media may be any available media that is accessible by computing device 70 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 702 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 7021 and/or cache memory 7022. Computing device 70 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, ROM 7023 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7 and commonly referred to as a "hard drive"). Although not shown in FIG. 7, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In these cases, each drive may be connected to bus 703 via one or more data media interfaces. The system memory 702 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the present invention.
A program/utility 7025 having a set (at least one) of program modules 7024 may be stored, for example, in system memory 702, and such program modules 7024 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment. Program modules 7024 generally perform the functions and/or methodologies of the described embodiments of the invention.
Computing device 70 may also communicate with one or more external devices 704 (e.g., keyboard, pointing device, display, etc.). Such communication may occur via input/output (I/O) interfaces 705. Moreover, computing device 70 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) through network adapter 706. As shown in FIG. 7, network adapter 706 communicates with other modules of computing device 70, such as processing unit 701, via bus 703. It should be appreciated that although not shown in FIG. 7, other hardware and/or software modules may be used in conjunction with computing device 70.
The processing unit 701 executes various functional applications and data processing by running the program stored in the system memory 702, for example: performing color value feature extraction on an input video to be evaluated to obtain an anchor color value feature and color value evaluation information corresponding to the anchor color value feature; extracting tone features of the video to be evaluated to obtain anchor tone features and tone evaluation information corresponding to the anchor tone features; extracting content characteristics of the video to be evaluated to obtain live broadcast content characteristics and content evaluation information corresponding to the live broadcast content characteristics; inputting the anchor color value characteristics, the anchor tone color characteristics and the live broadcast content characteristics into an expressiveness evaluation model obtained by pre-training to obtain anchor expressiveness comprehensive evaluation information; and determining the color value evaluation information, the tone evaluation information, the content evaluation information and the anchor expressiveness comprehensive evaluation information as video comprehensive evaluation information. The specific implementation of each step is not repeated here. It should be noted that although several units/modules or sub-units/sub-modules of the anchor expressiveness evaluation apparatus based on multi-modal fusion are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more of the units/modules described above may be embodied in one unit/module; conversely, the features and functions of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
In the description of the present invention, it should be noted that the terms "first", "second", and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present invention, which are used to illustrate the technical solutions of the present invention rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some of their technical features within the technical scope disclosed by the present invention; such modifications, changes or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Further, while operations of the methods of the invention are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

Claims (10)

1. A multi-modal fusion-based anchor expressive force evaluation method comprises the following steps:
extracting color value characteristics of an input video to be evaluated to obtain anchor color value characteristics and color value evaluation information corresponding to the anchor color value characteristics;
extracting tone features of the video to be evaluated to obtain anchor tone features and tone evaluation information corresponding to the anchor tone features;
extracting content characteristics of the video to be evaluated to obtain live broadcast content characteristics and content evaluation information corresponding to the live broadcast content characteristics;
inputting the anchor color value characteristics, the anchor tone color characteristics and the live broadcast content characteristics into an expressiveness evaluation model obtained by pre-training to obtain anchor expressiveness comprehensive evaluation information;
and determining the color value evaluation information, the tone evaluation information, the content evaluation information and the anchor expressiveness comprehensive evaluation information as video comprehensive evaluation information.
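For illustration only, the following is a minimal sketch of the overall flow recited in claim 1; the extractor and model objects are assumed callables, not components defined by the claim:

```python
def evaluate_anchor(video, color_value_extractor, timbre_extractor,
                    content_extractor, expressiveness_model):
    """Hypothetical top-level pipeline; every argument is an assumed interface."""
    color_feat, color_info = color_value_extractor(video)      # appearance (color value)
    timbre_feat, timbre_info = timbre_extractor(video)         # voice timbre
    content_feat, content_info = content_extractor(video)      # live broadcast content
    overall_info = expressiveness_model(color_feat, timbre_feat, content_feat)
    # The four pieces of evaluation information together form the
    # video comprehensive evaluation information.
    return {
        "color_value": color_info,
        "timbre": timbre_info,
        "content": content_info,
        "overall": overall_info,
    }
```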
2. The anchor expressiveness evaluation method based on multi-modal fusion according to claim 1, wherein the manner of extracting color value features of the input video to be evaluated to obtain anchor color value features and color value evaluation information corresponding to the anchor color value features is specifically as follows:
performing color value feature extraction on an input video to be evaluated through a pre-constructed color value feature extraction model to obtain an anchor color value feature and color value evaluation information corresponding to the anchor color value feature;
the color value feature extraction model comprises an input convolution layer, a maximum pooling layer, a first residual convolution layer, a second residual convolution layer, a third residual convolution layer, a fourth residual convolution layer and an average pooling layer; the convolution kernel size of the input convolution layer is 7 × 7, and the step size of the input convolution layer is 2; the convolution kernel size of the maximum pooling layer is 3 × 3, and the step size of the maximum pooling layer is 2; the convolution kernel sizes of the first, second, third and fourth residual convolution layers are all 3 × 3, and the step sizes of the first, second, third and fourth residual convolution layers are all 1; the convolution kernel size of the average pooling layer is 1 × 1, and the step size of the average pooling layer is 1.
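For illustration only, the following sketch shows one way the layer layout recited above could be realized; it assumes PyTorch and a channel width of 64, neither of which is specified by the claim:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """3 × 3 residual convolution block with step size 1, as recited in claim 2."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)

class ColorValueExtractor(nn.Module):
    """Sketch of the claimed layer layout; the channel width (64) is assumed."""
    def __init__(self, channels=64):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=7, stride=2, padding=3)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(4)])
        self.avg = nn.AvgPool2d(kernel_size=1, stride=1)

    def forward(self, frames):
        x = self.pool(self.stem(frames))   # input convolution + maximum pooling
        x = self.blocks(x)                 # four residual convolution layers
        return self.avg(x)                 # average pooling layer
```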
3. The anchor expressiveness evaluation method based on multi-modal fusion according to claim 1, wherein the extracting timbre features of the video to be evaluated to obtain anchor timbre features and timbre evaluation information corresponding to the anchor timbre features comprises:
performing audio extraction on the video to be evaluated to obtain voice audio data included in the video to be evaluated;
pre-emphasis processing is carried out on the voice audio data to obtain emphasized audio data;
performing framing processing on the emphasized audio data to obtain framed audio data;
windowing the frame audio data to obtain windowed voice data;
extracting acoustic features of the windowed voice data to obtain acoustic feature parameters;
obtaining a multi-dimensional acoustic feature vector according to the acoustic feature parameters;
extracting tone features of the acoustic feature vectors to obtain anchor tone features;
and evaluating the anchor tone color characteristics to obtain tone color evaluation information corresponding to the anchor tone color characteristics.
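For illustration only, the following sketch shows a conventional realization of the audio front end recited above; the 0.97 pre-emphasis coefficient, the 25 ms frame length, the 10 ms hop length and the Hamming window are common assumptions, not values fixed by the claim:

```python
import numpy as np

def acoustic_features(signal, sr, pre_emph=0.97, frame_ms=25, hop_ms=10):
    """Pre-emphasis -> framing -> windowing -> per-frame acoustic features."""
    # Pre-emphasis: y[n] = x[n] - a * x[n-1], boosting high frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # Framing.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop_len)
    frames = np.stack([emphasized[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Windowing with a Hamming window.
    frames *= np.hamming(frame_len)
    # A simple acoustic feature parameter: log power spectrum per frame,
    # yielding multi-dimensional acoustic feature vectors.
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(spectrum + 1e-10)
```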
4. The anchor expressiveness evaluation method based on multi-modal fusion according to claim 1, wherein the content feature extraction of the video to be evaluated to obtain live content features and content evaluation information corresponding to the live content features comprises:
performing character transcription on the video to be evaluated to obtain text information contained in the video to be evaluated;
superposing the training corpus obtained according to pre-training with the text information to obtain live broadcast content characteristics corresponding to the text information; the training corpus is obtained by training a pre-constructed bidirectional coding model;
and evaluating the live content characteristics to obtain content evaluation information corresponding to the live content characteristics.
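For illustration only, the sketch below uses a publicly available BERT model as a stand-in for the pre-constructed bidirectional coding model; the specific model name and the mean-pooling step are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

def content_features(transcript: str) -> torch.Tensor:
    """Embed the transcribed live-stream text into a content feature vector."""
    inputs = tokenizer(transcript, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool the token embeddings as a simple sentence-level feature.
    return outputs.last_hidden_state.mean(dim=1)
```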
5. The anchor expressiveness evaluation method based on multi-modal fusion according to any one of claims 1 to 4, wherein the anchor color value feature is a feature of a color value feature type, the anchor tone color feature is a feature of a tone color feature type, the live content feature is a feature of a content feature type, and the inputting the anchor color value feature, the anchor tone color feature and the live content feature into an expressiveness evaluation model obtained by pre-training to obtain anchor expressiveness comprehensive evaluation information comprises:
respectively performing long-distance context representation on the anchor color value characteristic, the anchor tone color characteristic and the live broadcast content characteristic to obtain a color value characteristic sequence corresponding to the anchor color value characteristic, a tone color characteristic sequence corresponding to the anchor tone color characteristic and a content characteristic sequence corresponding to the live broadcast content characteristic; the type of the modal information in each characteristic sequence at least comprises an output state type and a characteristic state type;
obtaining a plurality of cross attention value matrixes according to the color value characteristic sequence, the tone characteristic sequence and the content characteristic sequence; wherein each cross attention value matrix comprises two characteristic types, and the two characteristic types included in any two cross attention value matrices are not completely the same;
constructing a multi-tensor fusion network according to the plurality of cross attention value matrixes; the multi-tensor fusion network comprises a plurality of groups of cross attention value matrixes, and each group of cross attention value matrixes comprises three characteristic types;
obtaining a fusion tensor according to the multi-tensor fusion network;
and inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain the comprehensive evaluation information of the anchor expressive force.
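For illustration only, the sketch below shows one common form of the pairwise cross attention values over the three modality sequences; the scaled dot-product variant and the (sequence length, dimension) tensor shapes are assumptions, and the grouping and compression into a fusion tensor is sketched after claim 6:

```python
import torch
import torch.nn.functional as F

def cross_attention(query_seq, key_value_seq):
    """Scaled dot-product cross attention between two modality sequences.

    Both inputs are assumed to be (seq_len, dim) tensors sharing a feature
    dimension; the claim does not fix the attention variant.
    """
    d = query_seq.size(-1)
    scores = query_seq @ key_value_seq.transpose(0, 1) / d ** 0.5
    return F.softmax(scores, dim=-1) @ key_value_seq

def pairwise_cross_attention(color_seq, timbre_seq, content_seq):
    """Cross attention value matrices for each pair of feature types."""
    return {
        ("color_value", "timbre"): cross_attention(color_seq, timbre_seq),
        ("color_value", "content"): cross_attention(color_seq, content_seq),
        ("timbre", "content"): cross_attention(timbre_seq, content_seq),
    }
```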
6. The anchor expressiveness evaluation method based on multi-modal fusion according to claim 5, wherein the obtaining a fusion tensor according to the multi-tensor fusion network comprises:
respectively compressing a plurality of groups of cross attention value matrix groups in the multi-tensor fusion network to obtain a plurality of compressed tensors;
and combining the plurality of compression tensors to obtain a fusion tensor.
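For illustration only, the sketch below compresses each group of cross attention tensors by mean-pooling and combines the results by concatenation; both operators are assumptions, since the claim only requires that each group be compressed and the compressed tensors be combined:

```python
import torch

def fuse(cross_attention_groups):
    """Compress each group of cross-attention tensors and combine the results."""
    compressed = []
    for group in cross_attention_groups:              # each group covers three feature types
        pooled = torch.cat([t.mean(dim=0) for t in group])
        compressed.append(pooled)                     # one compressed tensor per group
    return torch.cat(compressed)                      # the fusion tensor
```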
7. The anchor expressive force evaluation method based on multi-modal fusion according to claim 6, wherein the inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain the anchor expressiveness comprehensive evaluation information comprises:
inputting the fusion tensor and the preset feature weight into an expressive force evaluation model obtained by pre-training to obtain initial evaluation information;
and activating the initial evaluation information based on an activation function and preset parameters to obtain the comprehensive evaluation information of the anchor expressive force.
8. A multi-modal fusion-based anchor expressiveness evaluation device, comprising:
the color value feature extraction unit is used for extracting color value features of an input video to be evaluated to obtain anchor color value features and color value evaluation information corresponding to the anchor color value features;
the tone characteristic extraction unit is used for extracting tone characteristics of the video to be evaluated to obtain anchor tone characteristics and tone evaluation information corresponding to the anchor tone characteristics;
the content feature extraction unit is used for extracting content features of the video to be evaluated to obtain live broadcast content features and content evaluation information corresponding to the live broadcast content features;
the input unit is used for inputting the anchor color value characteristics, the anchor tone color characteristics and the live content characteristics into a pre-trained expressive force evaluation model to obtain anchor expressive force comprehensive evaluation information;
a determination unit configured to determine the color value evaluation information, the tone evaluation information, the content evaluation information, and the anchor expression comprehensive evaluation information collectively as video comprehensive evaluation information.
9. A computing device, the computing device comprising:
at least one processor, a memory, and an input-output unit;
wherein the memory is configured to store a computer program and the processor is configured to invoke the computer program stored in the memory to perform the method of any of claims 1-7.
10. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.
CN202211384376.8A 2022-11-07 2022-11-07 Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment Pending CN115713257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211384376.8A CN115713257A (en) 2022-11-07 2022-11-07 Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211384376.8A CN115713257A (en) 2022-11-07 2022-11-07 Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment

Publications (1)

Publication Number Publication Date
CN115713257A true CN115713257A (en) 2023-02-24

Family

ID=85232369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211384376.8A Pending CN115713257A (en) 2022-11-07 2022-11-07 Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment

Country Status (1)

Country Link
CN (1) CN115713257A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502944A (en) * 2023-04-10 2023-07-28 好易购家庭购物有限公司 Live broadcast cargo quality evaluation method based on big data analysis
CN117729381A (en) * 2024-02-07 2024-03-19 福建大娱号信息科技股份有限公司 Live broadcast capability evaluation system based on non-operational data analysis

Similar Documents

Publication Publication Date Title
Cambria et al. Benchmarking multimodal sentiment analysis
US11409791B2 (en) Joint heterogeneous language-vision embeddings for video tagging and search
CN112749608B (en) Video auditing method, device, computer equipment and storage medium
CN110267119B (en) Video precision and chroma evaluation method and related equipment
CN112418011A (en) Method, device and equipment for identifying integrity of video content and storage medium
CN115713257A (en) Anchor expressive force evaluation method and device based on multi-mode fusion and computing equipment
CN111428088A (en) Video classification method and device and server
CA3029411A1 (en) Video to data
JP2020174342A (en) Method, device, server, computer-readable storage medium, and computer program for generating video
CN109918539B (en) Audio and video mutual retrieval method based on user click behavior
US20240212706A1 (en) Audio data processing
CN111723784B (en) Risk video identification method and device and electronic equipment
Hoover et al. Putting a face to the voice: Fusing audio and visual signals across a video to determine speakers
Wöllmer et al. Analyzing the memory of BLSTM neural networks for enhanced emotion classification in dyadic spoken interactions
CN114443899A (en) Video classification method, device, equipment and medium
WO2023114688A1 (en) Automated evaluation of acting performance using cloud services
WO2023173539A1 (en) Video content processing method and system, and terminal and storage medium
CN111985243A (en) Emotion model training method, emotion analysis device and storage medium
Su et al. Physics-driven diffusion models for impact sound synthesis from videos
CN110738061A (en) Ancient poetry generation method, device and equipment and storage medium
CN113392341A (en) Cover selection method, model training method, device, equipment and storage medium
CN104700831B (en) The method and apparatus for analyzing the phonetic feature of audio file
Glavan et al. InstaIndoor and multi-modal deep learning for indoor scene recognition
CN115909390B (en) Method, device, computer equipment and storage medium for identifying low-custom content
CN117540007B (en) Multi-mode emotion analysis method, system and equipment based on similar mode completion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination