CN117251599B

CN117251599B - Video corpus intelligent test optimization method, device and storage medium

Info

Publication number: CN117251599B
Application number: CN202311504149.9A
Authority: CN
Inventors: 王晓军; 李成哲; 陈萧冰; 曲欣; 袁子洁
Original assignee: China Ordnance Equipment Group Ordnance Equipment Research Institute
Current assignee: China Ordnance Equipment Group Ordnance Equipment Research Institute
Priority date: 2023-11-13
Filing date: 2023-11-13
Publication date: 2024-03-15
Anticipated expiration: 2043-11-13
Also published as: CN117251599A

Abstract

The invention provides an intelligent testing and optimizing method and device for video corpus and a storage medium, wherein the method comprises the following steps: collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data; establishing a video feature matching algorithm model, extracting a first feature based on the video data, extracting a second feature based on the true value label, calculating an attention weight matrix between the first feature and the second feature, and evaluating the matching degree of the first feature and the second feature; defining an optimization target as a cross entropy Loss function Loss for minimizing weighting, wherein the cross entropy Loss function Loss comprises the attention weight matrix; training an algorithm model, fitting a mapping between the first feature and the second feature, and minimizing the cross entropy Loss function Loss; until reaching the result of meeting the preset optimization condition, the scheme obviously reduces the labor cost of test preparation and improves the test effectiveness.

Description

Video corpus intelligent test optimization method, device and storage medium

Technical Field

The invention belongs to the field of artificial intelligence, and particularly relates to an intelligent testing and optimizing method and device for video corpus and a storage medium.

Background

Current speech algorithm model testing faces a number of difficulties. The preparation of the test set is very cumbersome and inefficient, requiring a lot of time to select the proper audio sample, while also ensuring that the sample covers different language, accent and background noise conditions, which increases the difficulty of sample collection. The labeling work is time-consuming, and manual verification of the audio content, correction of errors in automatic transcription, addition of format labels and the like are required, and particularly, labeling of long audio is more labor-intensive. This results in a time cost for test set generation that is prohibitive. The individual differences of the labeling staff also cause uneven labeling quality, and the fairness of the test is affected to a certain extent. In addition, the existing test set is too dependent on the job data of a specific field, which cannot fully verify the generalization capability of the model in a complex real scene. Meanwhile, the test result also has the interference of artificial annotation errors. These factors restrict the evaluation effectiveness and iterative optimization effect of the algorithm model. The recognition errors of the model cannot be returned to specific reasons, and cannot be improved in a targeted manner. Overall, the prior art has significant drawbacks in supporting the test performance of the speech algorithm.

Disclosure of Invention

In view of the above, the invention provides an intelligent testing and optimizing method for video corpus, which can fully utilize the visualization and analysis capabilities of an attention mechanism, can locate weak point samples and perform targeted model optimization. And finally, an efficient solution for generating the test set and iteratively lifting the model is formed. Compared with the prior art, the scheme obviously reduces the labor cost of test preparation, greatly improves the test effectiveness, and comprises the following steps:

step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or translated content of the subtitles and translated evaluation content, labels of characters, scenes and actions in the images, language types of video data, accurate time axis alignment information of the subtitles and labels of attributes of the video;

step 2, a video feature matching algorithm model is established, and the model comprises the following steps: a feature extraction network, a multimodal fusion network, an attention network, and a mapping network;

step 3, in the feature extraction and multi-modal fusion network, extracting a first feature based on the video data The method comprises the steps of carrying out a first treatment on the surface of the Said first feature->Including visual and speech features in the video data; extracting a second feature P based on the truth labels _t The method comprises the steps of carrying out a first treatment on the surface of the Fusing the visual and voice features through linear mapping and nonlinear transformation to obtain multi-modal feature representation of the video; in said attention network, the first feature +.>And a second feature P _t Attention weighting matrix between->Evaluating the matching degree of the two;

step 4, defining a cross entropy Loss function Loss with an optimization target of minimizing weighting in the mapping network, wherein the cross entropy Loss function Loss comprises the attention weight matrixThe method comprises the steps of carrying out a first treatment on the surface of the By means of a attention weighting matrix->Adjusting sample weight and adjusting the cross entropy Loss function Loss;

step 5, training the video feature matching algorithm model, and fitting the first featureAnd a second feature P _t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix +.>Wherein by analyzing the attention weight matrix +.>Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm;

And 6, repeating the steps 3-5, and iteratively optimizing the video feature matching algorithm model until the preset optimization condition is met.

In particular, in the feature extraction and multimodal fusion network, the first extraction function may be passed throughExtracting said first feature->The method comprises the steps of carrying out a first treatment on the surface of the Extracting the second feature P by a second extraction function θ _t The method comprises the steps of carrying out a first treatment on the surface of the Wherein the first extraction function->And the second extraction function θ may use a pre-trained language model as an encoder for directly outputting the encoded features; and fusing the visual and voice features through linear mapping and nonlinear transformation, and obtaining the multi-modal features of the video through setting a deep learning model for automatic learning.

In particular, the first extraction functionAnd a second extraction function θ when using a pre-trained language model as an encoder, said first extraction function ++>Comprises an image feature extraction model ResNet (2); the second extraction function θ includes: a BERT model or a RoBERTa model;

setting the deep learning model includes: setting the output dimension of the projection layer to generate a desired feature representation; the projection layer comprises one or more linear full-connection layers for realizing linear mapping or nonlinear activation functions; learning a weight matrix of the projection layer through back propagation training to fit an optimal feature map; the plurality of projection layers obtained through training are connected in series to form a multi-layer perceptron structure, the mapping capacity is enhanced, and a regularization mode is added among the plurality of projection layers to help optimize training; the input end corresponds to different projection layers to fit feature maps of different domains and prevent the projection layers from being overfitted by arranging a discarding layer.

In particular, the attention weighting matrix is determined byAnd (3) performing calculation:

（1）

wherein,attention matrix representing time T, +.>The superscript T in (a) denotes the transpose, (-)>Representing the second characteristic P _t Second mapped word vector space feature obtained after mapping to word vector space,/for>Representing the first featureθ represents the second mapping word vector spatial feature +.>Is a second extraction function of->Representing the spatial feature of the second mapping word vector>Is a length of (2); />Representing spatial characteristics of the second mapping word vector by evolution->Carrying out normalization treatment; the softmax function represents the normalization of the result of the first extraction function and the result of the second extraction function after the inner product operation, the +.>Indicating pass->The function makes the first feature->The probability distribution can be met as an attention weight; wherein->The function is a nonlinear function.

In particular, in the step 4, the cross entropy loss function；/>Representing attention weighting matrix->The weight corresponding to the ith sample in (i), i.e. the weight is given to the loss term of each sample, the higher the attention weight is, the greater the contribution of the loss to the total loss is；/>True tag representing the i-th sample, +. >An algorithmic model prediction output representing the ith sample.

In particular, in the step 5, optimizing and adjusting the unqualified samples from the test set according to a preset algorithm includes determining a sample prediction error according to the following formula, optimizing and adjusting the unqualified samples:

（2）

wherein,representing a sample prediction error, wherein the value range is 0-1, 0 is completely accurate, and 1 is completely wrong; />For the sample feature expression quality, the value range is 0-1, the 0 representative feature cannot be expressed completely and accurately, and the 1 representative feature expression is complete and accurate; />Marking quality of the sample, wherein the value range is 0-1, 0 represents marking errors, and 1 represents marking correctness and no errors; />For deviations of sample features from real scenes, the feature real deviations, the value range is 0 to + -infinity, and 0 is the sample characteristic which is completely matched with the real scene; />For the deviation of the sample label from the true value, the value ranges are 0 to +infinity, and 0 is completely consistent; />For the characteristic learning property of the sample, the value range is 0-1, and 0 represents the characteristicThe symptoms cannot be learned, and 1 represents that the features can be completely learned; />Labeling the sample with a learning property; />The value range is 0 to + -infinity for the real deviation of the algorithm model and the real scene, and 0 is the algorithm model to completely match the real scene; sample-based prediction error- >And a threshold value optimizes the failed adjustment samples from the test set.

In particular, in the step 6, the preset optimizing condition at least includes at least one of the following conditions: the attention weight matrixThe weights in (2) are all above a preset threshold; the corpus intelligent test is optimized, and the prediction accuracy or other evaluation indexes reach the expected target; the number of samples that are out of qualification for optimal adjustment from the test set continues to decrease.

In particular, in the step 1, labeling the video, and obtaining the true value label corresponding to the video data includes: the video spider crawls the required video from the main stream video station according to the requirement, carries out clear smoothness detection on the video, and eliminates the video which does not meet the requirement; disassembling the video into continuous pictures according to the video attributes, sequencing the generated pictures, and identifying subtitles in the video by using an OCR technology; performing de-duplication processing on the generated subtitles, and calculating the start-stop time of the subtitles according to the video frame rate; decomposing audio from the original video, and intercepting the audio according to the start-stop time of the caption; noise reduction processing is carried out on the audio, so that definition is improved; the audio generated by interception is subjected to multi-person voice recognition, the audio generated by interception is screened, and the corpus generated after screening is stored in a database according to categories and languages.

The invention also provides an intelligent testing and optimizing device for the video corpus, which comprises the following steps:

the video collection and truth value acquisition module is used for collecting video data, wherein the video data comprises audio and images; labeling the video to obtain a true value label corresponding to the video data, wherein the true value label at least comprises: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;

the model building module is used for building a video feature matching algorithm model, and the model comprises the following components: a feature extraction network, a multimodal fusion network, an attention network, and a mapping network;

a feature extraction fusion and attention weight matrix generation module for extracting a first feature based on the video data in the feature extraction and multimodal fusion networkThe method comprises the steps of carrying out a first treatment on the surface of the Said first feature->Including visual and speech features in the video data; extracting a second feature P based on the truth labels _t The method comprises the steps of carrying out a first treatment on the surface of the Fusing the visual and voice features through linear mapping and nonlinear transformation to obtain multi-modal feature representation of the video; in said attention network, the first feature +. >And a second feature P _t Attention weighting matrix between->Evaluating the matching degree of the two;

a cross entropy Loss function setting module, configured to define a cross entropy Loss function Loss with an optimization objective being a minimization weight in the mapping network, where the cross entropy Loss function Loss includes the attention weight matrixThe method comprises the steps of carrying out a first treatment on the surface of the By means of a attention weighting matrix->Adjusting sample weight and adjusting the cross entropy Loss function Loss;

the model training module is used for training the video feature matching algorithm model and fitting the first featureAnd a second feature P _t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix +.>Wherein by analyzing the attention weight matrix +.>Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm;

and the optimization completion module is used for repeatedly executing the feature extraction fusion and attention weight matrix generation module to the model training module and iteratively optimizing the video feature matching algorithm model until a result meeting the preset optimization condition is reached.

The invention also provides a computer readable storage medium, wherein the computer readable storage medium is stored with a computer program, and the computer program realizes the intelligent testing and optimizing method of the video corpus when being executed by a processor.

The beneficial effects are that:

according to the scheme provided by the invention, the weight of each sample is analyzed by using an attention mechanism, so that the weaknesses and error sources of the model can be effectively positioned;

by the scheme, the method and the device for training the samples with low attention weight can be used for strengthening training, so that the recognition accuracy of the model on the error-prone samples can be obviously improved;

according to the scheme provided by the invention, the attention network is added by adjusting the model structure, so that the extraction and utilization capacity of the model to key features can be enhanced;

according to the scheme, the attention correspondence matrix between the test sample characteristics and the true value labels is constructed, and a visual analysis basis is provided for model errors;

according to the scheme provided by the invention, the sample weight and the model structure are regulated simultaneously, so that the effective combination of test set generation and model optimization is realized;

according to the scheme provided by the invention, the quality factors of the samples in multiple aspects are evaluated, the contribution of each factor to the error is positioned, and the targeted improvement is performed.

By the scheme, priori knowledge of a pre-training language model and fitting capacity of a new training projection layer are integrated, so that stronger feature extraction is formed.

By the scheme of the invention, a whole set of system schemes from test set construction to iterative optimization model is provided, so that the effect of video understanding is obviously improved.

Drawings

FIG. 1 is a flow chart of an intelligent testing and optimizing method for video corpus provided by the invention;

fig. 2 is a schematic diagram of an intelligent testing and optimizing device for video corpus.

Detailed Description

The invention will now be described in detail by way of example with reference to the accompanying drawings.

The invention provides an intelligent testing and optimizing method for video corpus, as shown in figure 1, comprising the following steps:

step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or translated content of the subtitles and translated evaluation content, labels of characters, scenes and actions in the images, language types of video data, accurate time axis alignment information of the subtitles and labels of the attributes of the video. Wherein, the video data is collected, and the video data comprises audio and images; labeling the video, and acquiring the true value label corresponding to the video data comprises the following steps: the video spider crawls the required video from the main stream video station according to the requirement, carries out clear smoothness detection on the video, and eliminates the video which does not meet the requirement; decomposing the video into continuous pictures according to the video attributes, sequencing the generated pictures, and identifying subtitles in the video by using an OCR technology; performing de-duplication processing on the generated subtitles, and calculating start-stop time according to the video frame rate; decomposing audio from the original video, and intercepting the audio according to the start-stop time of the caption; noise reduction processing is carried out on the audio, so that definition is improved; intercepting the generated audio to carry out multi-person voice recognition, screening the audio according to actual conditions, and storing the corpus generated after screening into a database according to categories and languages.

Step 2, a video feature matching algorithm model is established, and the model comprises the following steps: a feature extraction and multimodal fusion network, an attention network, and a mapping network; in the feature extraction and multi-mode fusion network, a convolutional neural network can be selected to perform feature extraction on an input video frame to obtain visual feature representation, and a cyclic neural network is selected to perform feature extraction on video and audio to obtain voice feature representation. Specifically, this step may acquire the following data as needed: visual features, namely image semantic features extracted from video frames; speech features extracted from audio, such as Mel spectral features, acoustic model features, etc.; text features, namely language features extracted from video captions and text descriptions, such as word vectors, BERT features and the like; metadata features, namely digital features of the metadata such as shooting time, place, event and the like of the video; identifying language related features such as language category, dialect and the like in the video; audio attribute features such as acoustic features of background music, noise, tones; video attribute characteristics such as video quality, editing method, scene category and the like.

and (3) an attention network, namely inputting the multi-mode features and the true value labeling features of the video, and calculating an attention weight matrix between the multi-mode features and the true value labeling features to represent the focus of attention of the model.

In the feature extraction and multimodal fusion network, the first extraction function can be used forExtracting said first feature->The method comprises the steps of carrying out a first treatment on the surface of the Extracting the second feature P by a second extraction function θ _t The method comprises the steps of carrying out a first treatment on the surface of the Wherein the first extraction function->And the second extraction function θ may use a pre-trained language model as an encoder for directly outputting the encoded features; and fusing the visual and voice features through linear mapping and nonlinear transformation, and obtaining the multi-modal features of the video through setting a deep learning model for automatic learning. Said first extraction function->And a second extraction function θ when using a pre-trained language model as an encoder, said first extraction function ++ >The method comprises an image feature extraction model ResNet; the second extraction function θ includes: BERT model or RoBERTa model.

Extracting a second feature P based on the truth labels _t Calculate a first characteristicAnd a second feature P _t Attention weighting matrix between->Evaluating the matching degree of the two; the attention matrix->The calculation of (1) comprises:

the method comprises the steps of carrying out a first treatment on the surface of the Wherein (1)>The attention matrix representing the moment T,the superscript T in (a) represents a transpose;

wherein,representing the second characteristic P _t Mapping to wordsThe second mapping word vector space characteristics are obtained after vector space;

Wherein the method comprises the steps ofRepresenting said first feature->θ represents the second mapping word vector space featureIs a second extraction function of->Representing solving the spatial feature of the second mapping word vector>A length; />Representing spatial characteristics of the second mapping word vector by evolution->Carrying out normalization treatment; the softmax function represents the normalization of the result of the first extraction function and the result of the second extraction function after the inner product operation, the +.>Indicating pass->The function makes the first feature->As an attention weight conforming to a probability distribution; wherein->The function is a nonlinear function.

Further, effective pre-training language models such as BERT/ResNet are adopted to extract primary semantic features of input samples, and the pre-training is performedAfter training the language model encoder, adding a full connection layer as a projection layer, and using a large number of samples to supervise and train the projection layer, and learning and projecting the projection layer into the needed feature representation space. By freezing the pre-training language model parameters, only the projection layer is trained. The progressive thawing mode may also be used to gradually open the high-level layer parameters of the pre-trained language model. The projection layer may include a plurality of fully connected layer composition multi-layer perceptrons, enhancing the nonlinear fitting capability of feature transformation. Residual connection, regularization and other means are added between projection layers, so that the feature extraction effect is further improved. Through the combination of the pre-training language model and the projection layer, priori knowledge is utilized, and the learning capability of the model for specific problems is increased. The relationship between the further pre-trained language model and the plurality of projection layers may be expressed by the following formula: The method comprises the steps of carrying out a first treatment on the surface of the Wherein->Representing a pre-trained language model pair->Is equivalent to using a priori knowledge; />Representing an input sample; />And->Representing projection layer pair +.>Is equivalent to the learned projection function,/->Representing a new feature representation obtained by the projection layer; />，Are bias terms which respectively regulate +>And->Output values of the two projection layers. Through learning, worry about>，/>Different values can be taken, which will introduce a certain distinction for the two projection layers, so that slightly different feature maps can be obtained. The final max operation realizes the fusion of two mapping results, and the offset value can be initialized to 0 or can be assigned a random value smaller than 1. After training, the bias term is updated gradually to help the model fit the target mapping relation; max represents the feature and new feature of the fusion pre-training language model; so the formula +.>And then combining to obtain the features fused with priori knowledge and new learning.

Step 4, defining an optimization target as a cross entropy Loss function Loss with minimum weighting in a mapping network, wherein the cross entropy Loss function Loss comprises the attention weight matrix; in the step 4, the cross entropy loss function ；/>Representing the weight corresponding to the ith sample in the attention weight matrix, namely weighting the loss term of each sample, wherein the higher the attention weight is, the greater the contribution of the loss to the total loss is; />True tag representing the i-th sample, +.>The model prediction output representing the i-th sample. The mapping network also maps the video multi-mode characteristics to classified output, namely video category prediction, through a full connection layer; by means of a attention weighting matrix->And adjusting sample weight, and adjusting the cross entropy Loss function Loss.

Step 5, training the video feature matching algorithm model, and fitting the first featureAnd a second feature P _t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix +.>Wherein by analyzing the attention weight matrix +.>Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm; by means of a attention weighting matrix->Adjusting the sample weight; adjusting the cross entropy Loss function Loss; in the step 5, optimizing and adjusting unqualified samples from the test set according to a preset algorithm, including determining a sample prediction error according to the following formula, optimizing and adjusting unqualified samples:

（2）

Wherein,representing a sample prediction error, wherein the value range is 0-1, 0 is completely accurate, and 1 is completely wrong; />For the sample feature expression quality, the value range is 0-1, the 0 representative feature cannot be expressed completely and accurately, and the 1 representative feature expression is complete and accurate; />Marking quality of the sample, wherein the value range is 0-1, 0 represents marking errors, and 1 represents marking correctness and no errors; />For deviations of sample features from real scenes, the feature real deviations, the value range is 0 to + -infinity, and 0 is the sample characteristic which is completely matched with the real scene; />For the deviation of the sample label from the true value, the value ranges are 0 to +infinity, and 0 is completely consistent; />For sample feature learning, the value range is 0-1, the 0 representative feature cannot be learned, and the 1 representative feature can be completely learned; />Labeling the sample with a learning property;the value range is 0 to + -infinity for the real deviation of the algorithm model and the real scene, and 0 is the algorithm model to completely match the real scene; sample-based prediction error->And a threshold value optimizes the failed adjustment samples from the test set.

Step 6, repeating the steps 3-5, and iteratively optimizing the video feature matching algorithm model until the preset requirement is metResults of optimizing the conditions. In the step 6, a result of a preset optimization condition is satisfied, where the preset optimization condition at least includes at least one of the following conditions: the attention weight matrix The weights in (2) are all above a preset threshold; the corpus intelligent test is optimized, and the prediction accuracy or other evaluation indexes reach the expected target; the number of samples that are out of qualification for optimal adjustment from the test set continues to decrease.

The invention also provides an intelligent testing and optimizing device for video corpus, as shown in fig. 2, comprising:

the video collection and truth value acquisition module is used for collecting video data, wherein the video data comprises audio and images; labeling the video to obtain a true value label corresponding to the video data, wherein the true value label at least comprises: the audio subtitles and/or translated content of the subtitles and translated evaluation content, labels of characters, scenes and actions in the images, language types of video data, accurate time axis alignment information of the subtitles and labels of attributes of the video; wherein, the video data is collected, and the video data comprises audio and images; labeling the video, and acquiring the true value label corresponding to the video data comprises the following steps: the video spider crawls the required video from the main stream video station according to the requirement, carries out clear smoothness detection on the video, and eliminates the video which does not meet the requirement; decomposing the video into continuous pictures according to the video attributes, sequencing the generated pictures, and identifying subtitles in the video by using an OCR technology; performing de-duplication processing on the generated subtitles, and calculating start-stop time according to the video frame rate; decomposing audio from the original video, and intercepting the audio according to the start-stop time of the caption; noise reduction processing is carried out on the audio, so that definition is improved; intercepting the generated audio to carry out multi-person voice recognition, screening the audio according to actual conditions, and storing the corpus generated after screening into a database according to categories and languages.

The model building module is used for building a video feature matching algorithm model, and the model comprises the following components: a feature extraction and multimodal fusion network, an attention network, and a mapping network; in the feature extraction and multi-mode fusion network, a convolutional neural network can be selected to perform feature extraction on an input video frame to obtain visual feature representation, and a cyclic neural network is selected to perform feature extraction on video and audio to obtain voice feature representation. Specifically, this step may acquire the following data as needed: visual features, namely image semantic features extracted from video frames; speech features extracted from audio, such as Mel spectral features, acoustic model features, etc.; text features, namely language features extracted from video captions and text descriptions, such as word vectors, BERT features and the like; metadata features, namely digital features of the metadata such as shooting time, place, event and the like of the video; identifying language related features such as language category, dialect and the like in the video; audio attribute features such as acoustic features of background music, noise, tones; video attribute characteristics such as video quality, editing method, scene category and the like.

A feature extraction fusion and attention weight matrix generation module for extracting a first feature based on the video data in the feature extraction and multimodal fusion network The method comprises the steps of carrying out a first treatment on the surface of the Said first feature->Including visual and speech features in the video data; extracting a second feature P based on the truth labels _t The method comprises the steps of carrying out a first treatment on the surface of the Fusing the visual and voice features through linear mapping and nonlinear transformation to obtain multi-modal feature representation of the video; in said attention network, the first feature +.>And a second feature P _t Attention weighting matrix between->Evaluating the matching degree of the two;

wherein,representing the second characteristic P _t A second mapping word vector space feature obtained after mapping to the word vector space;

Wherein the method comprises the steps ofRepresenting said first feature->θ represents the second mapping word vector spatial feature +.>Is a second extraction function of->Representing solving the spatial feature of the second mapping word vector>A length; />Representing spatial characteristics of the second mapping word vector by evolution->Carrying out normalization treatment; the softmax function represents the normalization of the result of the first extraction function and the result of the second extraction function after the inner product operation, the +.>Indicating pass->The function makes the first feature->As an attention weight conforming to a probability distribution; wherein->The function is a nonlinear function.

Further, effective pre-training language models such as BERT/ResNet are adopted, primary semantic features of input samples are extracted, after a pre-training language model encoder is adopted, a fully connected layer is added to serve as a projection layer, a large number of samples are used for supervised training of the projection layer, and the projection is learned and projected to a feature representation space needed by people. By freezing the pre-training language model parameters, only the projection layer is trained. The progressive thawing mode may also be used to gradually open the high-level layer parameters of the pre-trained language model. The projection layer may include a plurality of fully connected layer composition multi-layer perceptrons, enhancing the nonlinear fitting capability of feature transformation. Residual connection, regularization and other means are added between projection layers, so that the feature extraction effect is further improved. Through the combination of the pre-training language model and the projection layer, priori knowledge is utilized, and the learning capability of the model for specific problems is increased. The relationship between the further pre-trained language model and the projection layer can be expressed by the following formula: The method comprises the steps of carrying out a first treatment on the surface of the Wherein->Representing a pre-trained language model pair->Is equivalent to using a priori knowledge; />Representing an input sample; />And->Representing projection layer pair +.>Is equivalent to the learned projection function,/->Representing a new feature representation obtained by the projection layer;，/>are bias terms which respectively regulate +>And->Output values of the two projection layers. Through learning, worry about>，/>Different values can be taken, which will be twoThe projection layer introduces a certain distinction so that it can get slightly different feature maps. The final max operation realizes the fusion of two mapping results, and the offset value can be initialized to 0 or can be assigned a random value smaller than 1. After training, the bias term is updated gradually to help the model fit the target mapping relation; max represents the feature and new feature of the fusion pre-training language model; so the formula +.>And then combining to obtain the features fused with priori knowledge and new learning.

The cross entropy Loss function setting module is used for defining an optimization target as a cross entropy Loss function Loss with minimized weighting in a mapping network, wherein the cross entropy Loss function Loss comprises the attention weight matrix; wherein the cross entropy loss function ；/>Representing the weight corresponding to the ith sample in the attention weight matrix, namely weighting the loss term of each sample, wherein the higher the attention weight is, the greater the contribution of the loss to the total loss is; />True tag representing the i-th sample, +.>The model prediction output representing the i-th sample. The mapping network also maps the video multi-mode characteristics to classified output, namely video category prediction, through a full connection layer; by means of a attention weighting matrix->And adjusting sample weight, and adjusting the cross entropy Loss function Loss.

The model training module is used for training the video feature matching algorithm model and fitting the first featureAnd a second feature P _t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix +.>Wherein by analyzing the attention weight matrix +.>Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm; by means of a attention weighting matrix->Adjusting the sample weight; adjusting the cross entropy Loss function Loss; and optimizing and adjusting unqualified samples from the test set according to a preset algorithm, wherein the optimizing and adjusting unqualified samples comprises determining sample prediction errors through the following formula:

（2）

And the optimization completion module is used for repeatedly executing the feature extraction fusion and attention weight matrix generation module to the model training module and iteratively optimizing the video feature matching algorithm model until a result meeting the preset optimization condition is reached. In the optimization completion module, the result of meeting the preset optimization conditions comprises at least one of the following conditions: the attention weight matrix The weights in (2) are all above a preset threshold; the corpus intelligent test is optimized, and the prediction accuracy or other evaluation indexes reach the expected target; the number of samples that are out of qualification for optimal adjustment from the test set continues to decrease. />

The invention also discloses a computer readable storage medium, and the computer readable storage medium is stored with a computer program, and the computer program realizes the intelligent testing optimization method of the video corpus when being executed by a processor. The features of the embodiments and method are similar, and will not be described in detail.

In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

It will be evident to those skilled in the art that the embodiments of the invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units, modules or means recited in a system, means or terminal claim may also be implemented by means of software or hardware by means of one and the same unit, module or means. The terms first, second, etc. are used to denote a name, but not any particular order.

Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the embodiment of the present invention, and not for limiting, and although the embodiment of the present invention has been described in detail with reference to the above-mentioned preferred embodiments, it should be understood by those skilled in the art that modifications and equivalent substitutions can be made to the technical solution of the embodiment of the present invention without departing from the spirit and scope of the technical solution of the embodiment of the present invention.

Claims

1. The intelligent testing and optimizing method for the video corpus is characterized by comprising the following steps of:

step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;

step 2, a video feature matching algorithm model is established, and the model comprises the following steps: a feature extraction and multimodal fusion network, an attention network, and a mapping network;

step 3, in the feature extraction and multi-modal fusion network, extracting a first feature P based on the video data _s The method comprises the steps of carrying out a first treatment on the surface of the The first characteristic P _s Including visual and speech features in the video data; extracting a second feature P based on the truth labels _t The method comprises the steps of carrying out a first treatment on the surface of the Fusing the visual and voice features through linear mapping and nonlinear transformation to obtain multi-modal feature representation of the video; in the attention network, a first feature P is calculated _s And a second feature P _t Attention weighting matrix W between _T Evaluating the matching degree of the two;

step 4, defining a cross entropy Loss function Loss with an optimization target of minimizing weighting in the mapping network, wherein the cross entropy Loss function Loss comprises the attention weight matrix W _T The method comprises the steps of carrying out a first treatment on the surface of the By means of a attention weighting matrix W _T Adjusting sample weight and adjusting the cross entropy Loss function Loss;

step 5, training the video feature matching algorithm model, and fitting the first feature P _s And a second feature P _t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix W _T Wherein by analysing the attention weighting matrix W _T Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm;

Step 6, repeating the step 3-5, and iteratively optimizing the video feature matching algorithm model until a preset optimization condition is met;

in step 5, optimizing and adjusting unqualified samples from the test set according to a preset algorithm, including determining a sample prediction error according to the following formula, optimizing and adjusting unqualified samples:

wherein M is ² The sample prediction error is represented, the value range is 0-1, 0 is completely accurate, and 1 is completely erroneous;for the sample feature expression quality, the value range is 0-1, the 0 representative feature cannot be expressed completely and accurately, and the 1 representative feature expression is complete and accurate; />The quality of the sample is marked, the value range is 0-1, 0 represents marking errors, and 1 represents marking correctness and no errors; l (L) _J For deviations of sample features from real scenes, the feature real deviations, the value range is 0 to + -infinity, and 0 is the sample characteristic which is completely matched with the real scene; l (L) _C For the deviation of the sample label from the true value, the value range is 0 to +infinity, and 0 is completely consistent; />For the sample feature learning property, the value range is 0-1, the 0 represents feature can not be learned, and the 1 represents feature can be completely learned; />Labeling the sample with a learning property; ρ is the real deviation of the algorithm model and the real scene, the value range is 0 to++ infinity, and 0 is the algorithm model completely matched with the real scene; based on sample prediction error M ² And threshold values are optimized from the test set to be disqualifiedAnd (3) a sample.

2. The intelligent testing and optimizing method for video corpus of claim 1,

in the feature extraction and multimodal fusion network, the first feature P may be extracted by a first extraction function φ _s The method comprises the steps of carrying out a first treatment on the surface of the Extracting the second feature P by a second extraction function θ _t The method comprises the steps of carrying out a first treatment on the surface of the Wherein the first extraction function phi and the second extraction function theta may use a pre-trained language model as an encoder for directly outputting encoded features; and fusing the visual and voice features through linear mapping and nonlinear transformation, and obtaining the multi-modal features of the video through setting a deep learning model for automatic learning.

3. The intelligent testing and optimizing method for video corpus of claim 2,

when the first extraction function phi and the second extraction function theta use a pre-training language model as an encoder, the first extraction function phi comprises an image feature extraction model ResNet; the second extraction function θ includes: a BERT model or a RoBERTa model;

4. The intelligent testing and optimizing method for video corpus of claim 1,

the attention weight matrix W is determined by _T And (3) performing calculation:

wherein W is _T The attention matrix representing the moment T,representing the second characteristic P _t Second mapped word vector space feature obtained after mapping to word vector space,/for>The superscript T in (1) represents the transpose, phi represents the first feature P _s θ represents the second mapping word vector spatial feature +.>Is a second extraction function of->Representing the spatial feature of the second mapping word vector>Is a length of (2); />Representing spatial characteristics of the second mapping word vector by evolution->Carrying out normalization treatment; the softmax function represents the normalization of the result of the first extraction function and the result of the second extraction function after the inner product operation, σ (P _s ) Representing the first feature P by a sigma function _s The probability distribution can be met as an attention weight; wherein the sigma function is a nonlinear function.

5. The intelligent test optimization method for video corpus according to claim 1, wherein in the step 4, the cross entropy loss functionW _T [i]Representing an attention weighting matrix W _T The weight corresponding to the ith sample in (i), namely, the loss term of each sample is weighted, and the higher the attention weight is, the greater the contribution of the loss to the total loss is; y is _i True tag representing the i-th sample, +.>An algorithmic model prediction output representing the ith sample.

6. The intelligent testing optimization method for video corpus according to claim 1, wherein: in the step 6, the preset optimizing condition at least includes at least one of the following conditions: the attention weight matrix W _T The weights in (2) are all above a preset threshold; the corpus intelligent test is optimized, and the prediction accuracy or other evaluation indexes reach the expected target; the number of samples that are out of qualification for optimal adjustment from the test set continues to decrease.

7. The intelligent testing optimization method for video corpus according to claim 1, wherein: in the step 1, labeling the video, and obtaining the true value label corresponding to the video data includes: the video spider crawls the required video from the main stream video station according to the requirement, carries out clear smoothness detection on the video, and eliminates the video which does not meet the requirement; disassembling the video into continuous pictures according to the video attributes, sequencing the generated pictures, and identifying subtitles in the video by using an OCR technology; performing de-duplication processing on the generated subtitles, and calculating the start-stop time of the subtitles according to the video frame rate; decomposing audio from the original video, and intercepting the audio according to the start-stop time of the caption; noise reduction processing is carried out on the audio, so that definition is improved; the audio generated by interception is subjected to multi-person voice recognition, the audio generated by interception is screened, and the corpus generated after screening is stored in a database according to categories and languages.

8. An intelligent testing and optimizing device for video corpus is characterized by comprising:

a feature extraction fusion and attention weight matrix generation module for extracting a first feature P based on the video data in the feature extraction and multimodal fusion network _s The method comprises the steps of carrying out a first treatment on the surface of the The first characteristic P _s Including visual and speech features in the video data; extracting a second feature P based on the truth labels _t The method comprises the steps of carrying out a first treatment on the surface of the Fusing the visual and voice features through linear mapping and nonlinear transformation to obtain multi-modal feature representation of the video; in the attention network, a first feature P is calculated _s And a second feature P _t Attention weighting matrix W between _T Evaluating the matching degree of the two;

a cross entropy Loss function setting module, configured to define a cross entropy Loss function Loss with an optimization objective being a minimization weight in the mapping network, where the cross entropy Loss function Loss includes the attention weight matrix W _T The method comprises the steps of carrying out a first treatment on the surface of the By means of a attention weighting matrix W _T Adjusting sample weight and adjusting the cross entropy Loss function Loss;

the model training module is used for training the video feature matching algorithm model and fitting the first modelFeature P _s And a second feature P _t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix W _T Wherein by analysing the attention weighting matrix W _T Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm;

the optimization completion module is used for repeatedly executing the feature extraction fusion and attention weight matrix generation module to the model training module and iteratively optimizing the video feature matching algorithm model until a result meeting a preset optimization condition is reached;

The model training module optimizes and adjusts unqualified samples from the test set according to a preset algorithm, wherein the model training module determines sample prediction errors through the following formula, and optimizes and adjusts the unqualified samples:

wherein M is ² The sample prediction error is represented, the value range is 0-1, 0 is completely accurate, and 1 is completely erroneous;for the sample feature expression quality, the value range is 0-1, the 0 representative feature cannot be expressed completely and accurately, and the 1 representative feature expression is complete and accurate; />The quality of the sample is marked, the value range is 0-1, 0 represents marking errors, and 1 represents marking correctness and no errors; l (L) _J For deviations of sample features from real scenes, the feature real deviations, the value range is 0 to + -infinity, and 0 is the sample characteristic which is completely matched with the real scene; l (L) _C For the deviation of the sample label from the true value, the value range is 0 to +infinity, and 0 is completely consistent; />For the sample feature learning property, the value range is 0-1, the 0 represents feature can not be learned, and the 1 represents feature can be completely learned; />Labeling the sample with a learning property; ρ is the real deviation of the algorithm model and the real scene, the value range is 0 to++ infinity, and 0 is the algorithm model completely matched with the real scene; based on sample prediction error M ² And a threshold value optimizes the failed adjustment samples from the test set.

9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program, which when executed by a processor, implements the video corpus intelligent test optimization method according to any of claims 1 to 7.