CN117251599B - Video corpus intelligent test optimization method, device and storage medium - Google Patents

Video corpus intelligent test optimization method, device and storage medium

Info

Publication number
CN117251599B
CN117251599B CN202311504149.9A CN202311504149A
Authority
CN
China
Prior art keywords
feature
video
sample
model
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311504149.9A
Other languages
Chinese (zh)
Other versions
CN117251599A (en)
Inventor
王晓军
李成哲
陈萧冰
曲欣
袁子洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Ordnance Equipment Group Ordnance Equipment Research Institute
Original Assignee
China Ordnance Equipment Group Ordnance Equipment Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Ordnance Equipment Group Ordnance Equipment Research Institute filed Critical China Ordnance Equipment Group Ordnance Equipment Research Institute
Priority to CN202311504149.9A priority Critical patent/CN117251599B/en
Publication of CN117251599A publication Critical patent/CN117251599A/en
Application granted granted Critical
Publication of CN117251599B publication Critical patent/CN117251599B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/75Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/776Validation; Performance evaluation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video corpus intelligent test optimization method, device and storage medium. The method comprises the following steps: collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain ground-truth labels corresponding to the video data; establishing a video feature matching algorithm model, extracting a first feature based on the video data, extracting a second feature based on the ground-truth labels, computing an attention weight matrix between the first feature and the second feature, and evaluating the degree of matching between the two; defining the optimization objective as minimizing a weighted cross-entropy loss function Loss, wherein the cross-entropy loss function Loss incorporates the attention weight matrix; and training the algorithm model to fit the mapping between the first feature and the second feature while minimizing the cross-entropy loss function Loss, iterating until a preset optimization condition is satisfied. The scheme significantly reduces the labor cost of test preparation and improves test effectiveness.

Description

Video corpus intelligent test optimization method, device and storage medium
Technical Field
The invention belongs to the field of artificial intelligence, and in particular relates to a video corpus intelligent test optimization method, device and storage medium.
Background
Current speech algorithm model testing faces many difficulties. Test set preparation is cumbersome and inefficient: a great deal of time is needed to select suitable audio samples while also ensuring that the samples cover different languages, accents and background noise conditions, which increases the difficulty of sample collection. Labeling is time-consuming, requiring manual verification of the audio content, correction of errors in automatic transcription, addition of format labels, and so on; labeling long audio is especially labor-intensive. As a result, the time cost of test set generation becomes prohibitive. Individual differences among annotators also lead to uneven labeling quality, which affects the fairness of testing to a certain extent. In addition, existing test sets depend too heavily on operational data from specific domains and cannot fully verify the generalization capability of a model in complex real-world scenarios. Meanwhile, test results are also disturbed by human annotation errors. These factors restrict the evaluation effectiveness and iterative optimization of algorithm models: the model's recognition errors cannot be traced back to specific causes and cannot be improved in a targeted manner. Overall, the prior art has significant drawbacks in supporting the testing of speech algorithms.
Disclosure of Invention
In view of the above, the invention provides a video corpus intelligent test optimization method that fully exploits the visualization and analysis capabilities of an attention mechanism to locate weak-point samples and perform targeted model optimization, ultimately forming an efficient solution for test set generation and iterative model improvement. Compared with the prior art, the scheme significantly reduces the labor cost of test preparation and greatly improves test effectiveness. The method comprises the following steps:
step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain ground-truth labels corresponding to the video data, wherein the ground-truth labels at least comprise: the audio subtitles and/or the translated content of the subtitles and translation evaluation content, labels of characters, scenes and actions in the images, the language type of the video data, accurate timeline alignment information of the subtitles, and labels of the attributes of the video;
step 2, establishing a video feature matching algorithm model, the model comprising: a feature extraction and multi-modal fusion network, an attention network, and a mapping network;
step 3, in the feature extraction and multi-modal fusion network, extracting a first feature P_s based on the video data, the first feature P_s comprising visual and speech features in the video data; extracting a second feature P_t based on the ground-truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and, in the attention network, computing an attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two;
step 4, in the mapping network, defining the optimization objective as minimizing a weighted cross-entropy loss function Loss, wherein the cross-entropy loss function Loss incorporates the attention weight matrix W_T; adjusting sample weights by means of the attention weight matrix W_T, thereby adjusting the cross-entropy loss function Loss;
step 5, training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross-entropy loss function Loss; evaluating the algorithm model on a test set and updating the attention weight matrix W_T, wherein, by analyzing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to strengthen targeted training of the algorithm model or to optimize and adjust unqualified samples out of the test set according to a preset algorithm;
and step 6, repeating steps 3-5, iteratively optimizing the video feature matching algorithm model until a preset optimization condition is satisfied.
In particular, in the feature extraction and multi-modal fusion network, the first feature P_s may be extracted by a first extraction function, and the second feature P_t may be extracted by a second extraction function θ; both the first extraction function and the second extraction function θ may use a pre-trained language model as an encoder that directly outputs the encoded features; the visual and speech features are fused through linear mapping and nonlinear transformation, and the multi-modal features of the video are obtained automatically by setting up a deep learning model.
In particular, when the first extraction function and the second extraction function θ use a pre-trained language model as the encoder, the first extraction function comprises an image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model;
setting up the deep learning model includes: setting the output dimension of a projection layer to generate the desired feature representation, the projection layer comprising one or more fully connected layers implementing linear mappings or nonlinear activation functions; learning the weight matrix of the projection layer through back-propagation training to fit an optimal feature mapping; connecting multiple trained projection layers in series to form a multi-layer perceptron structure, which enhances the mapping capacity, and adding regularization between the projection layers to help optimize training; and having different projection layers correspond to different inputs so as to fit feature mappings of different domains, with a dropout layer arranged to prevent the projection layers from overfitting.
In particular, the attention weight matrix W_T is computed as:

W_T = softmax( P_s · θ(P_t)^T / √d_k )    (1)

wherein W_T denotes the attention matrix at time T, and the superscript T in θ(P_t)^T denotes the transpose; θ(P_t) denotes the second mapped word-vector-space feature obtained by mapping the second feature P_t into the word vector space; P_s denotes the first feature; θ denotes the second extraction function that produces the second mapped word-vector-space feature; d_k denotes the length of the second mapped word-vector-space feature, and dividing by the square root √d_k normalizes it; the softmax function normalizes the inner product of the result of the first extraction function and the result of the second extraction function, so that the first feature, after passing through the softmax function, conforms to a probability distribution and can serve as the attention weight; the softmax function is a nonlinear function.
In particular, in step 4, the cross-entropy loss function is Loss = -Σ_i w_i · y_i · log(ŷ_i), where w_i denotes the weight corresponding to the i-th sample in the attention weight matrix W_T, i.e. the loss term of each sample is weighted: the higher the attention weight, the greater the contribution of that sample's loss to the total loss; y_i denotes the true label of the i-th sample, and ŷ_i denotes the algorithm model's prediction output for the i-th sample.
In particular, in step 5, optimizing and adjusting unqualified samples out of the test set according to a preset algorithm includes determining a sample prediction error according to the following formula and optimizing out the unqualified samples:

(2)

wherein M_2 denotes the sample prediction error, with a value range of 0-1, where 0 is completely accurate and 1 is completely wrong; the sample feature expression quality has a value range of 0-1, where 0 means the features cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0-1, where 0 denotes labeling errors and 1 denotes correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-reality deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample labels from the true values, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0-1, where 0 means the features cannot be learned and 1 means the features can be fully learned; the sample label learnability is defined analogously for the labels; ρ is the deviation of the algorithm model from the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene; unqualified samples are optimized and adjusted out of the test set based on the sample prediction error M_2 and a threshold value.
In particular, in step 6, the preset optimization condition includes at least one of the following: the weights in the attention weight matrix W_T are all above a preset threshold; the corpus intelligent test is optimized such that the prediction accuracy or other evaluation indicators reach the expected target; the number of unqualified samples optimized and adjusted out of the test set continues to decrease.
In particular, in step 1, labeling the video and obtaining the ground-truth labels corresponding to the video data includes: a video spider crawls the required videos from mainstream video sites as needed, performs clarity and smoothness checks on the videos, and eliminates videos that do not meet the requirements; the video is decomposed into consecutive frames according to the video attributes, the generated frames are ordered, and subtitles in the video are recognized using OCR technology; the generated subtitles are de-duplicated, and the subtitle start and end times are computed from the video frame rate; audio is separated from the original video and clipped according to the subtitle start and end times; noise reduction is applied to the audio to improve clarity; multi-speaker voice recognition is performed on the clipped audio, the clipped audio is screened, and the corpus generated after screening is stored in a database by category and language.
The invention also provides a video corpus intelligent test optimization device, which comprises:
the video collection and ground-truth acquisition module, used for collecting video data, wherein the video data comprises audio and images, and for labeling the video to obtain ground-truth labels corresponding to the video data, wherein the ground-truth labels at least comprise: the audio subtitles and/or the translated content of the subtitles and translation evaluation content, labels of characters, scenes and actions in the images, the language type of the video data, accurate timeline alignment information of the subtitles, and labels of the attributes of the video;
the model building module, used for establishing a video feature matching algorithm model, the model comprising: a feature extraction and multi-modal fusion network, an attention network, and a mapping network;
the feature extraction, fusion and attention weight matrix generation module, used for extracting, in the feature extraction and multi-modal fusion network, a first feature P_s based on the video data, the first feature P_s comprising visual and speech features in the video data; extracting a second feature P_t based on the ground-truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and computing, in the attention network, an attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two;
the cross-entropy loss function setting module, configured to define, in the mapping network, the optimization objective as minimizing a weighted cross-entropy loss function Loss, wherein the cross-entropy loss function Loss incorporates the attention weight matrix W_T, and to adjust sample weights by means of the attention weight matrix W_T, thereby adjusting the cross-entropy loss function Loss;
the model training module, used for training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross-entropy loss function Loss; evaluating the algorithm model on a test set and updating the attention weight matrix W_T, wherein, by analyzing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to strengthen targeted training of the algorithm model or to optimize and adjust unqualified samples out of the test set according to a preset algorithm;
and the optimization completion module, used for repeatedly executing the feature extraction, fusion and attention weight matrix generation module through the model training module, iteratively optimizing the video feature matching algorithm model until a preset optimization condition is satisfied.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above video corpus intelligent test optimization method.
The beneficial effects are as follows:
with the scheme provided by the invention, the weight of each sample is analyzed using an attention mechanism, so that the weaknesses and error sources of the model can be effectively located;
with the scheme provided by the invention, samples with low attention weights receive strengthened training, which significantly improves the model's recognition accuracy on error-prone samples;
with the scheme provided by the invention, an attention network is added by adjusting the model structure, which enhances the model's ability to extract and utilize key features;
with the scheme provided by the invention, an attention correspondence matrix between the test sample features and the ground-truth labels is constructed, providing a visual analysis basis for model errors;
with the scheme provided by the invention, the sample weights and the model structure are adjusted simultaneously, realizing an effective combination of test set generation and model optimization;
with the scheme provided by the invention, multiple aspects of sample quality are evaluated and the contribution of each factor to the error is located, enabling targeted improvement;
with the scheme provided by the invention, the prior knowledge of a pre-trained language model and the fitting capacity of newly trained projection layers are integrated, forming stronger feature extraction;
with the scheme provided by the invention, a complete systematic scheme from test set construction to iterative model optimization is provided, significantly improving the effect of video understanding.
Drawings
FIG. 1 is a flow chart of the video corpus intelligent test optimization method provided by the invention;
FIG. 2 is a schematic diagram of the video corpus intelligent test optimization device provided by the invention.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides a video corpus intelligent test optimization method which, as shown in FIG. 1, comprises the following steps:
step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain ground-truth labels corresponding to the video data, wherein the ground-truth labels at least comprise: the audio subtitles and/or the translated content of the subtitles and translation evaluation content, labels of characters, scenes and actions in the images, the language type of the video data, accurate timeline alignment information of the subtitles, and labels of the attributes of the video. Collecting the video data and labeling the video to obtain the ground-truth labels includes: a video spider crawls the required videos from mainstream video sites as needed, performs clarity and smoothness checks on the videos, and eliminates videos that do not meet the requirements; the video is decomposed into consecutive frames according to the video attributes, the generated frames are ordered, and subtitles in the video are recognized using OCR technology; the generated subtitles are de-duplicated, and the subtitle start and end times are computed from the video frame rate; audio is separated from the original video and clipped according to the subtitle start and end times; noise reduction is applied to the audio to improve clarity; multi-speaker voice recognition is performed on the clipped audio, the audio is screened according to the actual situation, and the corpus generated after screening is stored in a database by category and language.
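The step-1 corpus construction pipeline can be sketched as follows. This is only a minimal illustration: apart from the OpenCV frame reading, every helper (crawl_videos, check_clarity_smoothness, ocr_frame, extract_audio, denoise, multi_speaker_asr, save_to_database) is a hypothetical stub standing in for whichever crawler, OCR engine, denoiser, ASR system and database are actually used; none of these names is defined by the patent.

```python
# Sketch of the step-1 corpus construction pipeline; all helper stubs are hypothetical.
import cv2

def crawl_videos(query):                 # video spider over mainstream video sites
    return []
def check_clarity_smoothness(path):      # reject unclear or stuttering videos
    return True
def ocr_frame(frame):                    # subtitle text recognised in one frame
    return ""
def extract_audio(path, start, end):     # clip the audio track between two timestamps
    return b""
def denoise(clip):                       # noise reduction to improve clarity
    return clip
def multi_speaker_asr(clip):             # screening by multi-speaker voice recognition
    return True
def save_to_database(clip, subtitle, category, language):
    pass

def build_corpus(query, language):
    for video_path in crawl_videos(query):
        if not check_clarity_smoothness(video_path):
            continue
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
        subtitles, idx = [], 0
        while True:                                      # decompose video into ordered frames
            ok, frame = cap.read()
            if not ok:
                break
            text = ocr_frame(frame)
            if text:
                if subtitles and text == subtitles[-1]["text"]:
                    subtitles[-1]["end"] = idx / fps     # de-duplicate, extend start/stop time
                else:
                    subtitles.append({"text": text, "start": idx / fps, "end": idx / fps})
            idx += 1
        cap.release()
        for sub in subtitles:                            # clip, denoise and screen the audio
            clip = denoise(extract_audio(video_path, sub["start"], sub["end"]))
            if multi_speaker_asr(clip):
                save_to_database(clip, sub, category=query, language=language)
```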
Step 2, a video feature matching algorithm model is established, the model comprising: a feature extraction and multi-modal fusion network, an attention network, and a mapping network. In the feature extraction and multi-modal fusion network, a convolutional neural network may be selected to extract features from the input video frames to obtain a visual feature representation, and a recurrent neural network may be selected to extract features from the video's audio to obtain a speech feature representation. Specifically, this step may acquire the following data as needed: visual features, i.e. image semantic features extracted from the video frames; speech features extracted from the audio, such as Mel-spectrum features and acoustic model features; text features, i.e. language features extracted from video subtitles and text descriptions, such as word vectors and BERT features; metadata features, i.e. numerical features of metadata such as the shooting time, place and event of the video; language-related features recognized in the video, such as language category and dialect; audio attribute features, such as acoustic features of background music, noise and tone; and video attribute features, such as video quality, editing style and scene category.
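One possible reading of the feature extraction and multi-modal fusion network is sketched below. PyTorch, the layer sizes and the input shapes are assumptions for illustration only: a small CNN stands in for the visual branch (a ResNet could be substituted), a GRU consumes Mel-spectrum frames as the speech branch, and the two are fused through a linear mapping followed by a nonlinear transformation, as the text describes.

```python
# Minimal sketch of the feature-extraction-and-fusion network of steps 2-3 (assumed shapes).
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    def __init__(self, vis_dim=512, aud_dim=128, fused_dim=256):
        super().__init__()
        # visual branch: a small CNN over sampled video frames
        self.visual = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, vis_dim))
        # speech branch: a recurrent network over Mel-spectrum frames
        self.audio = nn.GRU(input_size=80, hidden_size=aud_dim, batch_first=True)
        # fusion: linear mapping followed by a nonlinear transformation
        self.fuse = nn.Sequential(nn.Linear(vis_dim + aud_dim, fused_dim), nn.Tanh())

    def forward(self, frames, mel):
        # frames: (B, 3, H, W) sampled video frames; mel: (B, T, 80) audio features
        v = self.visual(frames)
        _, h = self.audio(mel)
        a = h[-1]                                       # last hidden state as speech feature
        return self.fuse(torch.cat([v, a], dim=-1))     # multi-modal feature P_s

# usage sketch
model = MultiModalFusion()
p_s = model(torch.randn(2, 3, 224, 224), torch.randn(2, 50, 80))
```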
Step 3, in the feature extraction and multi-modal fusion network, a first feature P_s is extracted based on the video data, the first feature P_s comprising visual and speech features in the video data; a second feature P_t is extracted based on the ground-truth labels; the visual and speech features are fused through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and, in the attention network, an attention weight matrix W_T between the first feature P_s and the second feature P_t is computed to evaluate the degree of matching between the two.
The attention network takes as input the multi-modal features of the video and the ground-truth label features, and computes an attention weight matrix between them to represent where the model's attention is focused.
In the feature extraction and multi-modal fusion network, the first feature P_s may be extracted by a first extraction function and the second feature P_t by a second extraction function θ, where both the first extraction function and the second extraction function θ may use a pre-trained language model as an encoder that directly outputs the encoded features; the visual and speech features are fused through linear mapping and nonlinear transformation, and the multi-modal features of the video are obtained automatically by setting up a deep learning model. When the first extraction function and the second extraction function θ use a pre-trained language model as the encoder, the first extraction function comprises an image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model.
Setting up the deep learning model includes: setting the output dimension of a projection layer to generate the desired feature representation, the projection layer comprising one or more fully connected layers implementing linear mappings or nonlinear activation functions; learning the weight matrix of the projection layer through back-propagation training to fit an optimal feature mapping; connecting multiple trained projection layers in series to form a multi-layer perceptron structure, which enhances the mapping capacity, and adding regularization between the projection layers to help optimize training; and having different projection layers correspond to different inputs so as to fit feature mappings of different domains, with a dropout layer arranged to prevent the projection layers from overfitting.
The second feature P_t is extracted based on the ground-truth labels, and the attention weight matrix W_T between the first feature P_s and the second feature P_t is computed to evaluate the degree of matching between the two. The attention matrix W_T is computed as:

W_T = softmax( P_s · θ(P_t)^T / √d_k )    (1)

wherein W_T denotes the attention matrix at time T, and the superscript T in θ(P_t)^T denotes the transpose; θ(P_t) denotes the second mapped word-vector-space feature obtained by mapping the second feature P_t into the word vector space; P_s denotes the first feature; θ denotes the second extraction function that produces the second mapped word-vector-space feature; d_k denotes the length of the second mapped word-vector-space feature, and dividing by the square root √d_k normalizes it; the softmax function normalizes the inner product of the result of the first extraction function and the result of the second extraction function, so that the first feature, after passing through the softmax function, conforms to a probability distribution and can serve as the attention weight; the softmax function is a nonlinear function.
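A compact sketch of the attention computation in formula (1) is given below. PyTorch and the tensor shapes are assumptions; the rows of P_s attend over the word-vector-space features θ(P_t), with the inner products scaled by the square root of the feature length and normalized by softmax.

```python
# Sketch of the scaled attention weight matrix W_T between the first feature P_s and the
# mapped ground-truth feature theta(P_t). Shapes are illustrative assumptions.
import math
import torch
import torch.nn.functional as F

def attention_weights(p_s: torch.Tensor, theta_p_t: torch.Tensor) -> torch.Tensor:
    """p_s: (N, d) first features; theta_p_t: (M, d) second features mapped into the
    word vector space. Returns W_T of shape (N, M), each row a probability distribution."""
    d_k = theta_p_t.size(-1)                           # length of the mapped feature
    scores = p_s @ theta_p_t.transpose(-2, -1)         # inner products with the transpose
    return F.softmax(scores / math.sqrt(d_k), dim=-1)  # normalise into a distribution

# usage sketch
w_t = attention_weights(torch.randn(4, 256), torch.randn(4, 256))
```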
Further, an effective pre-trained language model such as BERT/ResNet is adopted to extract the primary semantic features of the input samples; after the pre-trained language model encoder, a fully connected layer is added as a projection layer, the projection layer is trained with supervision on a large number of samples, and it learns to project into the required feature representation space. By freezing the pre-trained language model parameters, only the projection layer is trained; a progressive unfreezing scheme may also be used to gradually open up the higher layers of the pre-trained language model. The projection layer may comprise multiple fully connected layers forming a multi-layer perceptron, enhancing the nonlinear fitting capability of the feature transformation. Residual connections, regularization and other means are added between projection layers to further improve the feature extraction effect. Through the combination of the pre-trained language model and the projection layers, prior knowledge is utilized and the model's learning capability for the specific problem is increased. The relationship between the pre-trained language model and the multiple projection layers may be expressed by a formula of the form:

h = max( g_1(F(x)) + b_1, g_2(F(x)) + b_2 )

wherein F(x) denotes the pre-trained language model's encoding of the input sample x, which is equivalent to using prior knowledge; g_1(·) and g_2(·) denote the projection layers' mappings of F(x), equivalent to the learned projection functions; h denotes the new feature representation obtained through the projection layers; and b_1, b_2 are bias terms that respectively regulate the output values of the two projection layers. Through learning, b_1 and b_2 can take different values, which introduces a certain distinction between the two projection layers so that slightly different feature mappings are obtained. The final max operation fuses the two mapping results; the bias values can be initialized to 0 or assigned random values smaller than 1, and during training the bias terms are gradually updated to help the model fit the target mapping relation. The max operation fuses the pre-trained language model feature with the newly learned feature, so the formula combines prior knowledge with newly learned knowledge in the resulting features.
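One way to realize the frozen pre-trained encoder plus dual projection layers with max fusion described above is sketched below. The framework (PyTorch), the stand-in encoder, the dimensions and the dropout rate are assumptions; the patent does not fix an implementation.

```python
# Sketch of "frozen pre-trained encoder + two trainable projection layers fused by max".
# The encoder is an arbitrary stand-in; BERT/ResNet would be plugged in instead.
import torch
import torch.nn as nn

class ProjectedEncoder(nn.Module):
    def __init__(self, encoder: nn.Module, enc_dim=768, out_dim=256):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():          # freeze the prior-knowledge parameters
            p.requires_grad = False
        self.g1 = nn.Linear(enc_dim, out_dim)        # first projection layer g_1
        self.g2 = nn.Linear(enc_dim, out_dim)        # second projection layer g_2
        self.b1 = nn.Parameter(torch.zeros(out_dim)) # bias terms, initialised to 0
        self.b2 = nn.Parameter(torch.zeros(out_dim)) # (random values < 1 also possible)
        self.drop = nn.Dropout(0.1)                  # regularisation between layers

    def forward(self, x):
        f = self.encoder(x)                          # F(x): prior knowledge
        h1 = self.drop(self.g1(f)) + self.b1
        h2 = self.drop(self.g2(f)) + self.b2
        return torch.maximum(h1, h2)                 # max fuses the two mapping results

# usage sketch with a dummy encoder standing in for BERT/ResNet
enc = nn.Linear(128, 768)
h = ProjectedEncoder(enc)(torch.randn(2, 128))
```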
Step 4, in the mapping network, the optimization objective is defined as minimizing a weighted cross-entropy loss function Loss, wherein the cross-entropy loss function Loss incorporates the attention weight matrix W_T. In step 4, the cross-entropy loss function is Loss = -Σ_i w_i · y_i · log(ŷ_i), where w_i denotes the weight corresponding to the i-th sample in the attention weight matrix, i.e. the loss term of each sample is weighted: the higher the attention weight, the greater the contribution of that sample's loss to the total loss; y_i denotes the true label of the i-th sample, and ŷ_i denotes the model's prediction output for the i-th sample. The mapping network also maps the multi-modal video features to a classification output, i.e. video category prediction, through a fully connected layer; sample weights are adjusted by means of the attention weight matrix W_T, thereby adjusting the cross-entropy loss function Loss.
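The weighted cross-entropy loss of step 4 can be sketched as follows (PyTorch is an assumption). How the per-sample weights w_i are read out of the attention weight matrix W_T is an implementation choice not fixed by the text, so the sketch simply takes them as an input.

```python
# Sketch of the weighted cross-entropy loss: each sample's loss term is scaled by the
# weight w_i taken from the attention weight matrix W_T.
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits, labels, w):
    """logits: (N, C) model outputs; labels: (N,) true class indices y_i;
    w: (N,) per-sample weights w_i."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")  # -log p(y_i | x_i)
    return (w * per_sample).sum()        # higher attention weight -> larger contribution

# usage sketch
logits = torch.randn(4, 10, requires_grad=True)
loss = weighted_cross_entropy(logits, torch.randint(0, 10, (4,)), torch.rand(4))
loss.backward()
```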
Step 5, the video feature matching algorithm model is trained, fitting the mapping between the first feature P_s and the second feature P_t and minimizing the cross-entropy loss function Loss; the algorithm model is evaluated on a test set and the attention weight matrix W_T is updated, wherein, by analyzing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to strengthen targeted training of the algorithm model or to optimize and adjust unqualified samples out of the test set according to a preset algorithm; sample weights are adjusted by means of the attention weight matrix W_T, and the cross-entropy loss function Loss is adjusted. In step 5, optimizing and adjusting unqualified samples out of the test set according to the preset algorithm includes determining a sample prediction error according to the following formula and optimizing out the unqualified samples:

(2)

wherein M_2 denotes the sample prediction error, with a value range of 0-1, where 0 is completely accurate and 1 is completely wrong; the sample feature expression quality has a value range of 0-1, where 0 means the features cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0-1, where 0 denotes labeling errors and 1 denotes correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-reality deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample labels from the true values, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0-1, where 0 means the features cannot be learned and 1 means the features can be fully learned; the sample label learnability is defined analogously for the labels; ρ is the deviation of the algorithm model from the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene; unqualified samples are optimized and adjusted out of the test set based on the sample prediction error M_2 and a threshold value.
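A minimal sketch of the step-5 test-set adjustment follows: samples whose prediction error M_2, computed by formula (2), exceeds a threshold are removed from the test set for optimization. The sample_error callable standing in for formula (2) and the 0.5 threshold are illustrative assumptions.

```python
# Sketch of screening unqualified samples out of the test set by the prediction error M_2.
# sample_error is a hypothetical callable implementing formula (2); 0 means fully accurate,
# 1 means fully wrong, and the 0.5 threshold is an illustrative choice.
from typing import Callable, Iterable, List, Tuple

def adjust_test_set(test_set: Iterable, sample_error: Callable[[object], float],
                    threshold: float = 0.5) -> Tuple[List, List]:
    kept, rejected = [], []
    for sample in test_set:
        m2 = sample_error(sample)            # M_2 in [0, 1] from formula (2)
        (rejected if m2 > threshold else kept).append(sample)
    return kept, rejected                    # rejected samples are optimized or re-labeled

# usage sketch with a toy error function
kept, rejected = adjust_test_set(range(10), lambda s: s / 10.0)
```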
Step 6, steps 3-5 are repeated, iteratively optimizing the video feature matching algorithm model until a preset optimization condition is satisfied. In step 6, the preset optimization condition includes at least one of the following: the weights in the attention weight matrix W_T are all above a preset threshold; the corpus intelligent test is optimized such that the prediction accuracy or other evaluation indicators reach the expected target; the number of unqualified samples optimized and adjusted out of the test set continues to decrease.
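The outer iteration of steps 3-5 with the preset stopping conditions can be sketched as the loop below. Everything here is an assumption for illustration: train_and_eval stands in for one pass of steps 3-5, the weight floor, accuracy target and round limit are placeholders, and reading "continues to decrease" as a run of `patience` consecutive decreases is one possible interpretation of that condition.

```python
# Sketch of the step-6 iteration: repeat steps 3-5 until at least one of the preset
# optimization conditions listed above holds. All helpers and numeric targets are illustrative.
def iterate_optimization(train_and_eval, weight_floor=0.1, accuracy_target=0.95,
                         patience=3, max_rounds=20):
    prev_rejected, decreasing_streak, result = None, 0, None
    for _ in range(max_rounds):
        w_t, accuracy, num_rejected = train_and_eval()   # one pass of steps 3-5
        result = (w_t, accuracy, num_rejected)
        if prev_rejected is not None:
            decreasing_streak = decreasing_streak + 1 if num_rejected < prev_rejected else 0
        prev_rejected = num_rejected
        weights_ok = all(w >= weight_floor for row in w_t for w in row)  # W_T as nested lists
        accuracy_ok = accuracy >= accuracy_target
        rejected_keeps_decreasing = decreasing_streak >= patience
        if weights_ok or accuracy_ok or rejected_keeps_decreasing:
            return result            # at least one preset optimization condition is met
    return result
```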
The invention also provides a video corpus intelligent test optimization device, as shown in FIG. 2, comprising:
the video collection and ground-truth acquisition module, used for collecting video data, wherein the video data comprises audio and images, and for labeling the video to obtain ground-truth labels corresponding to the video data, wherein the ground-truth labels at least comprise: the audio subtitles and/or the translated content of the subtitles and translation evaluation content, labels of characters, scenes and actions in the images, the language type of the video data, accurate timeline alignment information of the subtitles, and labels of the attributes of the video. Collecting the video data and labeling the video to obtain the ground-truth labels includes: a video spider crawls the required videos from mainstream video sites as needed, performs clarity and smoothness checks on the videos, and eliminates videos that do not meet the requirements; the video is decomposed into consecutive frames according to the video attributes, the generated frames are ordered, and subtitles in the video are recognized using OCR technology; the generated subtitles are de-duplicated, and the subtitle start and end times are computed from the video frame rate; audio is separated from the original video and clipped according to the subtitle start and end times; noise reduction is applied to the audio to improve clarity; multi-speaker voice recognition is performed on the clipped audio, the audio is screened according to the actual situation, and the corpus generated after screening is stored in a database by category and language.
The model building module is used for establishing a video feature matching algorithm model, the model comprising: a feature extraction and multi-modal fusion network, an attention network, and a mapping network. In the feature extraction and multi-modal fusion network, a convolutional neural network may be selected to extract features from the input video frames to obtain a visual feature representation, and a recurrent neural network may be selected to extract features from the video's audio to obtain a speech feature representation. Specifically, this module may acquire the following data as needed: visual features, i.e. image semantic features extracted from the video frames; speech features extracted from the audio, such as Mel-spectrum features and acoustic model features; text features, i.e. language features extracted from video subtitles and text descriptions, such as word vectors and BERT features; metadata features, i.e. numerical features of metadata such as the shooting time, place and event of the video; language-related features recognized in the video, such as language category and dialect; audio attribute features, such as acoustic features of background music, noise and tone; and video attribute features, such as video quality, editing style and scene category.
The feature extraction, fusion and attention weight matrix generation module is used for extracting, in the feature extraction and multi-modal fusion network, a first feature P_s based on the video data, the first feature P_s comprising visual and speech features in the video data; extracting a second feature P_t based on the ground-truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and computing, in the attention network, an attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two.
The attention network takes as input the multi-modal features of the video and the ground-truth label features, and computes an attention weight matrix between them to represent where the model's attention is focused.
In the feature extraction and multi-modal fusion network, the first feature P_s may be extracted by a first extraction function and the second feature P_t by a second extraction function θ, where both the first extraction function and the second extraction function θ may use a pre-trained language model as an encoder that directly outputs the encoded features; the visual and speech features are fused through linear mapping and nonlinear transformation, and the multi-modal features of the video are obtained automatically by setting up a deep learning model. When the first extraction function and the second extraction function θ use a pre-trained language model as the encoder, the first extraction function comprises an image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model.
Setting up the deep learning model includes: setting the output dimension of a projection layer to generate the desired feature representation, the projection layer comprising one or more fully connected layers implementing linear mappings or nonlinear activation functions; learning the weight matrix of the projection layer through back-propagation training to fit an optimal feature mapping; connecting multiple trained projection layers in series to form a multi-layer perceptron structure, which enhances the mapping capacity, and adding regularization between the projection layers to help optimize training; and having different projection layers correspond to different inputs so as to fit feature mappings of different domains, with a dropout layer arranged to prevent the projection layers from overfitting.
The second feature P_t is extracted based on the ground-truth labels, and the attention weight matrix W_T between the first feature P_s and the second feature P_t is computed to evaluate the degree of matching between the two. The attention matrix W_T is computed as:

W_T = softmax( P_s · θ(P_t)^T / √d_k )    (1)

wherein W_T denotes the attention matrix at time T, and the superscript T in θ(P_t)^T denotes the transpose; θ(P_t) denotes the second mapped word-vector-space feature obtained by mapping the second feature P_t into the word vector space; P_s denotes the first feature; θ denotes the second extraction function that produces the second mapped word-vector-space feature; d_k denotes the length of the second mapped word-vector-space feature, and dividing by the square root √d_k normalizes it; the softmax function normalizes the inner product of the result of the first extraction function and the result of the second extraction function, so that the first feature, after passing through the softmax function, conforms to a probability distribution and can serve as the attention weight; the softmax function is a nonlinear function.
Further, an effective pre-trained language model such as BERT/ResNet is adopted to extract the primary semantic features of the input samples; after the pre-trained language model encoder, a fully connected layer is added as a projection layer, the projection layer is trained with supervision on a large number of samples, and it learns to project into the required feature representation space. By freezing the pre-trained language model parameters, only the projection layer is trained; a progressive unfreezing scheme may also be used to gradually open up the higher layers of the pre-trained language model. The projection layer may comprise multiple fully connected layers forming a multi-layer perceptron, enhancing the nonlinear fitting capability of the feature transformation. Residual connections, regularization and other means are added between projection layers to further improve the feature extraction effect. Through the combination of the pre-trained language model and the projection layers, prior knowledge is utilized and the model's learning capability for the specific problem is increased. The relationship between the pre-trained language model and the projection layers may be expressed by a formula of the form:

h = max( g_1(F(x)) + b_1, g_2(F(x)) + b_2 )

wherein F(x) denotes the pre-trained language model's encoding of the input sample x, which is equivalent to using prior knowledge; g_1(·) and g_2(·) denote the projection layers' mappings of F(x), equivalent to the learned projection functions; h denotes the new feature representation obtained through the projection layers; and b_1, b_2 are bias terms that respectively regulate the output values of the two projection layers. Through learning, b_1 and b_2 can take different values, which introduces a certain distinction between the two projection layers so that slightly different feature mappings are obtained. The final max operation fuses the two mapping results; the bias values can be initialized to 0 or assigned random values smaller than 1, and during training the bias terms are gradually updated to help the model fit the target mapping relation. The max operation fuses the pre-trained language model feature with the newly learned feature, so the formula combines prior knowledge with newly learned knowledge in the resulting features.
The cross-entropy loss function setting module is used for defining, in the mapping network, the optimization objective as minimizing a weighted cross-entropy loss function Loss, wherein the cross-entropy loss function Loss incorporates the attention weight matrix W_T; the cross-entropy loss function is Loss = -Σ_i w_i · y_i · log(ŷ_i), where w_i denotes the weight corresponding to the i-th sample in the attention weight matrix, i.e. the loss term of each sample is weighted: the higher the attention weight, the greater the contribution of that sample's loss to the total loss; y_i denotes the true label of the i-th sample, and ŷ_i denotes the model's prediction output for the i-th sample. The mapping network also maps the multi-modal video features to a classification output, i.e. video category prediction, through a fully connected layer; sample weights are adjusted by means of the attention weight matrix W_T, thereby adjusting the cross-entropy loss function Loss.
The model training module is used for training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross-entropy loss function Loss; evaluating the algorithm model on a test set and updating the attention weight matrix W_T, wherein, by analyzing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to strengthen targeted training of the algorithm model or to optimize and adjust unqualified samples out of the test set according to a preset algorithm; sample weights are adjusted by means of the attention weight matrix W_T, and the cross-entropy loss function Loss is adjusted. Optimizing and adjusting unqualified samples out of the test set according to the preset algorithm includes determining a sample prediction error according to the following formula:

(2)

wherein M_2 denotes the sample prediction error, with a value range of 0-1, where 0 is completely accurate and 1 is completely wrong; the sample feature expression quality has a value range of 0-1, where 0 means the features cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0-1, where 0 denotes labeling errors and 1 denotes correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-reality deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample labels from the true values, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0-1, where 0 means the features cannot be learned and 1 means the features can be fully learned; the sample label learnability is defined analogously for the labels; ρ is the deviation of the algorithm model from the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene; unqualified samples are optimized and adjusted out of the test set based on the sample prediction error M_2 and a threshold value.
And the optimization completion module is used for repeatedly executing the feature extraction, fusion and attention weight matrix generation module through the model training module, iteratively optimizing the video feature matching algorithm model until a preset optimization condition is satisfied. In the optimization completion module, the preset optimization condition includes at least one of the following: the weights in the attention weight matrix W_T are all above a preset threshold; the corpus intelligent test is optimized such that the prediction accuracy or other evaluation indicators reach the expected target; the number of unqualified samples optimized and adjusted out of the test set continues to decrease.
The invention also discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above video corpus intelligent test optimization method. The features of this embodiment are similar to those of the method and device embodiments and will not be described in detail here.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the embodiments of the invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units, modules or means recited in a system, means or terminal claim may also be implemented by means of software or hardware by means of one and the same unit, module or means. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the embodiment of the present invention, and not for limiting, and although the embodiment of the present invention has been described in detail with reference to the above-mentioned preferred embodiments, it should be understood by those skilled in the art that modifications and equivalent substitutions can be made to the technical solution of the embodiment of the present invention without departing from the spirit and scope of the technical solution of the embodiment of the present invention.

Claims (9)

1. The intelligent testing and optimizing method for the video corpus is characterized by comprising the following steps of:
step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;
step 2, a video feature matching algorithm model is established, and the model comprises the following steps: a feature extraction and multimodal fusion network, an attention network, and a mapping network;
step 3, in the feature extraction and multi-modal fusion network, extracting a first feature P_s based on the video data, the first feature P_s including visual and speech features in the video data; extracting a second feature P_t based on the truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; in the attention network, calculating an attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two;
step 4, in the mapping network, defining a weighted cross entropy Loss function Loss with minimization as the optimization target, wherein the cross entropy Loss function Loss comprises the attention weight matrix W_T; adjusting the sample weights by means of the attention weight matrix W_T, thereby adjusting the cross entropy Loss function Loss;
step 5, training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross entropy Loss function Loss; evaluating the algorithm model on a test set and updating the attention weight matrix W_T, wherein, by analysing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to strengthen targeted training of the algorithm model or to optimize and adjust unqualified samples from the test set according to a preset algorithm;
step 6, repeating steps 3 to 5, and iteratively optimizing the video feature matching algorithm model until a preset optimization condition is met;
wherein in step 5, optimizing and adjusting unqualified samples from the test set according to a preset algorithm comprises determining a sample prediction error according to the following formula and optimizing and adjusting the unqualified samples accordingly:
wherein M_2 represents the sample prediction error, with a value range of 0 to 1, where 0 means completely accurate and 1 means completely erroneous; the sample feature expression quality has a value range of 0 to 1, where 0 means the feature cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0 to 1, where 0 represents labeling errors and 1 represents correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-reality deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample label from the true value, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0 to 1, where 0 means the feature cannot be learned and 1 means the feature can be completely learned; the sample label learnability is defined similarly; ρ is the deviation of the algorithm model from the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene; based on the sample prediction error M_2 and a threshold, unqualified samples are optimized and adjusted from the test set.
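As an illustration of the last step of claim 1, the following minimal Python sketch shows how unqualified samples could be filtered from the test set once a per-sample prediction error M_2 is available. The formula combining the quality, deviation, and learnability terms into M_2 appears only as an image in the source, so compute_m2 is left as a hypothetical placeholder supplied by the caller, and the threshold value is an assumption.

```python
from typing import Callable, Sequence

def filter_unqualified(samples: Sequence[dict],
                       compute_m2: Callable[[dict], float],
                       threshold: float = 0.5) -> tuple[list[dict], list[dict]]:
    """Split a test set into retained samples and unqualified samples to be adjusted."""
    retained, unqualified = [], []
    for sample in samples:
        m2 = compute_m2(sample)  # hypothetical: per-sample prediction error in [0, 1]
        (unqualified if m2 > threshold else retained).append(sample)
    return retained, unqualified
```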
2. The intelligent testing and optimizing method for video corpus of claim 1,
wherein in the feature extraction and multimodal fusion network, the first feature P_s may be extracted by a first extraction function φ, and the second feature P_t may be extracted by a second extraction function θ, wherein the first extraction function φ and the second extraction function θ may use a pre-trained language model as an encoder to directly output encoded features; and the visual and speech features are fused through linear mapping and nonlinear transformation, and the multi-modal features of the video are obtained through a deep learning model set up for automatic learning.
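The following sketch, assuming PyTorch, illustrates one way the fusion described in claim 2 could be wired on top of features already produced by the encoders φ (visual and speech) and θ (label text); the module name, dimensions, and layer choices are assumptions for illustration, not the patent's implementation.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuses visual and speech features into P_s and projects label text features into P_t."""
    def __init__(self, visual_dim: int, speech_dim: int, text_dim: int, fused_dim: int = 512):
        super().__init__()
        # Linear mapping of each modality into a shared space ...
        self.visual_proj = nn.Linear(visual_dim, fused_dim)
        self.speech_proj = nn.Linear(speech_dim, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)
        # ... followed by a nonlinear transformation producing the fused multi-modal representation.
        self.fuse = nn.Sequential(nn.Linear(2 * fused_dim, fused_dim), nn.ReLU())

    def forward(self, visual_feat, speech_feat, text_feat):
        p_s = self.fuse(torch.cat([self.visual_proj(visual_feat),
                                   self.speech_proj(speech_feat)], dim=-1))  # first feature P_s
        p_t = self.text_proj(text_feat)                                      # second feature P_t
        return p_s, p_t
```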
3. The intelligent testing and optimizing method for video corpus of claim 2,
wherein when the first extraction function φ and the second extraction function θ use a pre-trained language model as an encoder, the first extraction function φ comprises an image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model;
setting up the deep learning model includes: setting the output dimension of the projection layer to generate the desired feature representation, wherein the projection layer comprises one or more linear fully connected layers for realizing linear mapping, or nonlinear activation functions; learning the weight matrix of the projection layer through back-propagation training to fit an optimal feature mapping; connecting the plurality of trained projection layers in series to form a multi-layer perceptron structure to enhance the mapping capacity, and adding regularization among the plurality of projection layers to help optimize training; and having the input end correspond to different projection layers to fit feature mappings of different domains, with a dropout layer arranged to prevent the projection layers from overfitting.
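As a sketch of the projection layers in claim 3 (again assuming PyTorch, with illustrative layer sizes and dropout rate), linear fully connected layers with nonlinear activation and dropout can be chained into a multi-layer perceptron:

```python
import torch.nn as nn

def build_projection_mlp(in_dim: int, hidden_dims=(512, 256), out_dim: int = 128,
                         dropout: float = 0.1) -> nn.Sequential:
    """Stack projection layers (linear mapping + nonlinear activation + dropout) into an MLP."""
    layers, prev = [], in_dim
    for hidden in hidden_dims:
        layers += [nn.Linear(prev, hidden), nn.ReLU(), nn.Dropout(dropout)]  # projection + regularization
        prev = hidden
    layers.append(nn.Linear(prev, out_dim))  # final projection to the desired output dimension
    return nn.Sequential(*layers)
```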
4. The intelligent testing and optimizing method for video corpus of claim 1,
wherein the attention weight matrix W_T is determined by the following calculation:
wherein W_T represents the attention matrix at time T; the second mapped word-vector-space feature is obtained by mapping the second feature P_t into the word vector space, and its superscript T denotes the transpose; φ denotes the first extraction function of the first feature P_s, and θ denotes the second extraction function of the second mapped word-vector-space feature; the length of the second mapped word-vector-space feature is taken, and its square root is used for normalization; the softmax function normalizes the result of the inner product of the output of the first extraction function and the output of the second extraction function; σ(P_s) denotes passing the first feature P_s through a σ function so that it satisfies a probability distribution and can serve as the attention weight, wherein the σ function is a nonlinear function.
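Since the formula of claim 4 is given only as an image in the source, the following hedged PyTorch sketch assumes a standard scaled-dot-product form consistent with the surrounding description: the inner product of the two encoded features is scaled by the square root of the second feature's length and normalized with softmax, with a sigmoid standing in for the nonlinear σ function applied to P_s.

```python
import torch

def attention_weight_matrix(p_s: torch.Tensor, p_t_mapped: torch.Tensor) -> torch.Tensor:
    """p_s: (n, d) encoded first features; p_t_mapped: (m, d) second features in word-vector space."""
    d = p_t_mapped.shape[-1]
    # Inner product of the encoded features, scaled by the square root of the feature length.
    scores = torch.sigmoid(p_s) @ p_t_mapped.T / (d ** 0.5)
    # softmax normalizes each row into attention weights that form a probability distribution.
    return torch.softmax(scores, dim=-1)
```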
5. The intelligent test optimization method for video corpus according to claim 1, wherein in step 4, in the cross entropy Loss function Loss, W_T[i] represents the weight corresponding to the i-th sample in the attention weight matrix W_T, that is, the loss term of each sample is weighted, and the higher the attention weight, the greater the contribution of that loss term to the total loss; y_i represents the true label of the i-th sample, and the prediction output of the algorithm model for the i-th sample enters the corresponding loss term.
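A hedged sketch of the attention-weighted cross entropy of claim 5 follows; the exact formula is shown only as an image in the source, so a standard per-sample weighted binary cross entropy is assumed here, matching the description that a higher attention weight W_T[i] makes sample i contribute more to the total loss.

```python
import torch

def weighted_cross_entropy(pred: torch.Tensor, target: torch.Tensor,
                           sample_weights: torch.Tensor) -> torch.Tensor:
    """pred: predicted probabilities, target: true labels y_i, sample_weights: W_T[i]; all shape (n,)."""
    eps = 1e-8
    per_sample = -(target * torch.log(pred + eps) + (1 - target) * torch.log(1 - pred + eps))
    return (sample_weights * per_sample).sum()  # each sample's loss scaled by its attention weight
```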
6. The intelligent testing optimization method for video corpus according to claim 1, wherein: in step 6, the preset optimization condition includes at least one of the following conditions: the weights in the attention weight matrix W_T are all above a preset threshold; the corpus intelligent test is optimized so that the prediction accuracy or another evaluation index reaches the expected target; the number of unqualified samples optimized and adjusted from the test set continues to decrease.
7. The intelligent testing optimization method for video corpus according to claim 1, wherein: in step 1, labeling the video and obtaining the true value labels corresponding to the video data comprises: crawling the required videos from mainstream video sites with a video spider according to the requirements, performing clarity and smoothness detection on the videos, and eliminating videos that do not meet the requirements; disassembling each video into continuous pictures according to the video attributes, sequencing the generated pictures, and identifying the subtitles in the video by using OCR technology; performing de-duplication processing on the generated subtitles, and calculating the start and end times of the subtitles according to the video frame rate; decomposing the audio from the original video, and intercepting the audio according to the start and end times of the subtitles; performing noise reduction processing on the audio to improve clarity; performing multi-speaker voice recognition on the intercepted audio, screening the intercepted audio, and storing the corpus generated after screening in a database according to category and language.
8. An intelligent testing and optimizing device for video corpus is characterized by comprising:
the video collection and truth value acquisition module is used for collecting video data, wherein the video data comprises audio and images; labeling the video to obtain a true value label corresponding to the video data, wherein the true value label at least comprises: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;
the model building module is used for building a video feature matching algorithm model, and the model comprises the following components: a feature extraction network, a multimodal fusion network, an attention network, and a mapping network;
a feature extraction fusion and attention weight matrix generation module, for extracting, in the feature extraction and multimodal fusion network, a first feature P_s based on the video data, the first feature P_s including visual and speech features in the video data; extracting a second feature P_t based on the truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and, in the attention network, calculating an attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two;
a cross entropy Loss function setting module, configured to define, in the mapping network, a weighted cross entropy Loss function Loss with minimization as the optimization objective, where the cross entropy Loss function Loss includes the attention weight matrix W_T; the sample weights are adjusted by means of the attention weight matrix W_T, thereby adjusting the cross entropy Loss function Loss;
the model training module is used for training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross entropy Loss function Loss; evaluating the algorithm model on a test set and updating the attention weight matrix W_T, wherein, by analysing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to strengthen targeted training of the algorithm model or to optimize and adjust unqualified samples from the test set according to a preset algorithm;
the optimization completion module is used for repeatedly executing the feature extraction fusion and attention weight matrix generation module to the model training module and iteratively optimizing the video feature matching algorithm model until a result meeting a preset optimization condition is reached;
The model training module optimizes and adjusts unqualified samples from the test set according to a preset algorithm, wherein the model training module determines sample prediction errors through the following formula, and optimizes and adjusts the unqualified samples:
wherein M_2 represents the sample prediction error, with a value range of 0 to 1, where 0 means completely accurate and 1 means completely erroneous; the sample feature expression quality has a value range of 0 to 1, where 0 means the feature cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0 to 1, where 0 represents labeling errors and 1 represents correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-reality deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample label from the true value, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0 to 1, where 0 means the feature cannot be learned and 1 means the feature can be completely learned; the sample label learnability is defined similarly; ρ is the deviation of the algorithm model from the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene; based on the sample prediction error M_2 and a threshold, unqualified samples are optimized and adjusted from the test set.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program, which when executed by a processor, implements the video corpus intelligent test optimization method according to any of claims 1 to 7.
CN202311504149.9A 2023-11-13 2023-11-13 Video corpus intelligent test optimization method, device and storage medium Active CN117251599B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311504149.9A CN117251599B (en) 2023-11-13 2023-11-13 Video corpus intelligent test optimization method, device and storage medium

Publications (2)

Publication Number Publication Date
CN117251599A CN117251599A (en) 2023-12-19
CN117251599B true CN117251599B (en) 2024-03-15

Family

ID=89137160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311504149.9A Active CN117251599B (en) 2023-11-13 2023-11-13 Video corpus intelligent test optimization method, device and storage medium

Country Status (1)

Country Link
CN (1) CN117251599B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114398961A (en) * 2021-12-28 2022-04-26 西南交通大学 Visual question-answering method based on multi-mode depth feature fusion and model thereof
CN114626441A (en) * 2022-02-23 2022-06-14 苏州大学 Implicit multi-mode matching method and system based on visual contrast attention
CN114724548A (en) * 2022-03-11 2022-07-08 中国科学技术大学 Training method of multi-mode speech recognition model, speech recognition method and equipment
CN114743143A (en) * 2022-04-11 2022-07-12 同济大学 Video description generation method based on multi-concept knowledge mining and storage medium
CN116955699A (en) * 2023-07-18 2023-10-27 北京邮电大学 Video cross-mode search model training method, searching method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086549B2 (en) * 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning

Also Published As

Publication number Publication date
CN117251599A (en) 2023-12-19

Similar Documents

Publication Publication Date Title
US11270079B2 (en) Translation model based training method and translation method, computer device, and storage medium
CN108399428B (en) Triple loss function design method based on trace ratio criterion
CN108765383B (en) Video description method based on deep migration learning
CN113255755A (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN110796199B (en) Image processing method and device and electronic medical equipment
CN114298158A (en) Multi-mode pre-training method based on image-text linear combination
CN111444367B (en) Image title generation method based on global and local attention mechanism
CN113314205A (en) Efficient medical image labeling and learning system
CN114998602B (en) Domain adaptive learning method and system based on low confidence sample contrast loss
CN113448843B (en) Image recognition software test data enhancement method and device based on defect analysis
CN116311483B (en) Micro-expression recognition method based on local facial area reconstruction and memory contrast learning
CN112084793B (en) Semantic recognition method, device and readable storage medium based on dependency syntax
CN116956929B (en) Multi-feature fusion named entity recognition method and device for bridge management text data
CN114329034A (en) Image text matching discrimination method and system based on fine-grained semantic feature difference
CN116912642A (en) Multimode emotion analysis method, device and medium based on dual-mode and multi-granularity interaction
CN116955699A (en) Video cross-mode search model training method, searching method and device
CN111126155B (en) Pedestrian re-identification method for generating countermeasure network based on semantic constraint
CN114880307A (en) Structured modeling method for knowledge in open education field
CN111144462A (en) Unknown individual identification method and device for radar signals
WO2020216286A1 (en) Method for training teaching style prediction model, and computer storage medium
CN116631566B (en) Medical image report intelligent generation method based on big data
CN117251599B (en) Video corpus intelligent test optimization method, device and storage medium
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN113688879B (en) Generalized zero sample learning classification method based on confidence distribution external detection
CN116484053B (en) Intelligent data analysis platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant