CN117251599B - Video corpus intelligent test optimization method, device and storage medium - Google Patents
- Publication number: CN117251599B (application CN202311504149.9A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/783 — Retrieval of video data using metadata automatically derived from the content
- G06F16/55 — Clustering; classification of still image data
- G06F16/583 — Retrieval of still image data using metadata automatically derived from the content
- G06F16/65 — Clustering; classification of audio data
- G06F16/683 — Retrieval of audio data using metadata automatically derived from the content
- G06F16/75 — Clustering; classification of video data
- G06F18/25 — Pattern recognition; analysing; fusion techniques
- G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. edges, contours, corners
- G06V10/764 — Image or video recognition or understanding using machine-learning classification
- G06V10/776 — Validation; performance evaluation
- G06V10/80 — Fusion, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
- Y02T10/40 — Engine management systems
Abstract
The invention provides an intelligent testing and optimizing method and device for video corpus and a storage medium. The method comprises the following steps: collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain truth labels corresponding to the video data; establishing a video feature matching algorithm model, extracting a first feature based on the video data, extracting a second feature based on the truth labels, calculating an attention weight matrix between the first feature and the second feature, and evaluating the degree of matching between them; defining the optimization objective as minimizing a weighted cross-entropy Loss function Loss, where Loss contains the attention weight matrix; and training the algorithm model, fitting the mapping between the first feature and the second feature and minimizing the cross-entropy Loss function Loss, until a result meeting the preset optimization condition is reached. The scheme significantly reduces the labor cost of test preparation and improves test effectiveness.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an intelligent testing and optimizing method and device for video corpus and a storage medium.
Background
Current speech algorithm model testing faces a number of difficulties. Preparation of the test set is cumbersome and inefficient: a great deal of time is needed to select suitable audio samples, while also ensuring that the samples cover different languages, accents and background-noise conditions, which increases the difficulty of sample collection. Labeling is time-consuming, requiring manual verification of the audio content, correction of errors in automatic transcription, addition of format labels and the like; labeling long audio is especially labor-intensive. As a result, the time cost of test set generation is prohibitive. Individual differences between annotators also lead to uneven labeling quality, which affects the fairness of the test to a certain extent. In addition, existing test sets rely too heavily on data from specific domains and therefore cannot fully verify the generalization capability of a model in complex real scenes. Meanwhile, the test results are also disturbed by human annotation errors. These factors restrict the evaluation effectiveness and iterative optimization of algorithm models: the model's recognition errors cannot be traced back to specific causes, so no targeted improvement can be made. Overall, the prior art has significant drawbacks in supporting the testing of speech algorithms.
Disclosure of Invention
In view of the above, the invention provides an intelligent testing and optimizing method for video corpus which makes full use of the visualization and analysis capabilities of an attention mechanism, locates weak-point samples, and performs targeted model optimization, finally forming an efficient solution for generating test sets and iteratively improving the model. Compared with the prior art, the scheme significantly reduces the labor cost of test preparation and greatly improves test effectiveness. The method comprises the following steps:
step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or translated content of the subtitles and translated evaluation content, labels of characters, scenes and actions in the images, language types of video data, accurate time axis alignment information of the subtitles and labels of attributes of the video;
step 2, a video feature matching algorithm model is established, and the model comprises the following steps: a feature extraction network, a multimodal fusion network, an attention network, and a mapping network;
step 3, in the feature extraction and multi-modal fusion network, extracting a first feature based on the video data The method comprises the steps of carrying out a first treatment on the surface of the Said first feature->Including visual and speech features in the video data; extracting a second feature P based on the truth labels t The method comprises the steps of carrying out a first treatment on the surface of the Fusing the visual and voice features through linear mapping and nonlinear transformation to obtain multi-modal feature representation of the video; in said attention network, the first feature +.>And a second feature P t Attention weighting matrix between->Evaluating the matching degree of the two;
step 4, defining a cross entropy Loss function Loss with an optimization target of minimizing weighting in the mapping network, wherein the cross entropy Loss function Loss comprises the attention weight matrixThe method comprises the steps of carrying out a first treatment on the surface of the By means of a attention weighting matrix->Adjusting sample weight and adjusting the cross entropy Loss function Loss;
step 5, training the video feature matching algorithm model, and fitting the first featureAnd a second feature P t The mapping between the two is minimized, and the cross entropy Loss function Loss is minimized; evaluating the algorithm model on a test set, updating the attention weight matrix +.>Wherein by analyzing the attention weight matrix +.>Finding out samples with insufficient attention of the algorithm model so as to enhance the targeted training of the algorithm model or optimize and adjust unqualified samples from the test set according to a preset algorithm;
Step 6, repeating steps 3 to 5 and iteratively optimizing the video feature matching algorithm model until the preset optimization condition is met.
In particular, in the feature extraction and multi-modal fusion network, the first feature F_v may be extracted by a first extraction function φ, and the second feature P_t may be extracted by a second extraction function θ; the first extraction function φ and the second extraction function θ may use a pre-trained language model as an encoder that directly outputs the encoded features; the visual and speech features are fused through linear mapping and nonlinear transformation, and the multi-modal features of the video are obtained through automatic learning by a configured deep learning model.
In particular, when the first extraction function φ and the second extraction function θ use a pre-trained model as the encoder, the first extraction function φ comprises an image feature extraction model such as ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model;
setting the deep learning model includes: setting the output dimension of the projection layer to generate a desired feature representation; the projection layer comprises one or more linear full-connection layers for realizing linear mapping or nonlinear activation functions; learning a weight matrix of the projection layer through back propagation training to fit an optimal feature map; the plurality of projection layers obtained through training are connected in series to form a multi-layer perceptron structure, the mapping capacity is enhanced, and a regularization mode is added among the plurality of projection layers to help optimize training; the input end corresponds to different projection layers to fit feature maps of different domains and prevent the projection layers from being overfitted by arranging a discarding layer.
In particular, the attention weight matrix A_t is calculated as:

A_t = softmax( σ(φ(F_v)) · θ(P_t)^T / √d_k )    (1)

where A_t denotes the attention matrix at time t, and the superscript T denotes the transpose; θ(P_t) denotes the second mapped word-vector-space feature obtained by mapping the second feature P_t into the word vector space, with θ the second extraction function; d_k denotes the length of the second mapped word-vector-space feature, and dividing by √d_k normalizes it; the softmax function normalizes the result of the inner product between the output of the first extraction function and the output of the second extraction function; σ is a nonlinear function applied to the extracted first feature F_v so that the resulting weights conform to a probability distribution and can serve as attention weights.
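The formula above follows the scaled dot-product attention pattern, so the weight computation can be sketched as follows; this NumPy sketch assumes the features have already been mapped into a shared word-vector space of dimension d_k, abstracting away the encoder calls:

```python
import numpy as np

def attention_weights(q, k):
    """A = softmax(q k^T / sqrt(d_k)); each row is a per-sample attention distribution."""
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)               # scaled inner products
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability for softmax
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)      # softmax -> probability distribution

rng = np.random.default_rng(1)
fv = rng.standard_normal((3, 16))   # mapped first (video) features
pt = rng.standard_normal((5, 16))   # mapped second (truth-label) features
A = attention_weights(fv, pt)
print(A.shape)   # (3, 5)
```

Each row of the resulting matrix sums to 1, which is what lets the weights be read directly as the model's attention distribution over the truth-label features.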
In particular, in the step 4, the cross-entropy loss function is Loss = −Σ_{i=1}^{N} a_i · y_i · log(ŷ_i), where a_i denotes the weight corresponding to the i-th sample in the attention weight matrix A_t, i.e., the loss term of each sample is weighted, and the higher the attention weight, the greater that sample's contribution to the total loss; y_i denotes the true label of the i-th sample, and ŷ_i denotes the algorithm model's prediction output for the i-th sample.
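The attention-weighted cross-entropy can be sketched in NumPy as follows; the sample weights, labels and predictions below are illustrative, and the two-class one-hot setup is an assumption:

```python
import numpy as np

def weighted_cross_entropy(a, y_true, y_pred, eps=1e-12):
    """Loss = -sum_i a_i * y_i * log(y_hat_i); a_i is the attention weight of sample i."""
    y_pred = np.clip(y_pred, eps, 1.0)                   # avoid log(0)
    per_sample = -(y_true * np.log(y_pred)).sum(axis=1)  # cross entropy per sample
    return float((a * per_sample).sum())                 # attention-weighted total

a = np.array([0.2, 0.5, 0.3])                   # attention weights per sample
y_true = np.array([[1, 0], [0, 1], [1, 0]])     # one-hot true labels
y_pred = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]])
loss = weighted_cross_entropy(a, y_true, y_pred)
print(round(loss, 4))
```

Raising a_i for a hard sample directly raises that sample's share of the total loss, which is how the attention matrix steers training toward under-attended samples.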
In particular, in the step 5, optimizing and adjusting unqualified samples out of the test set according to a preset algorithm includes determining a sample prediction error E according to the following formula (2) and optimizing and adjusting the unqualified samples:

(2)

where E denotes the sample prediction error, ranging from 0 to 1 (0: completely accurate; 1: completely wrong); Q_f is the sample feature expression quality, ranging from 0 to 1 (0: the feature cannot be expressed completely and accurately; 1: the feature expression is complete and accurate); Q_l is the labeling quality of the sample, ranging from 0 to 1 (0: labeling errors; 1: correct, error-free labeling); D_f is the deviation of the sample features from the real scene (the feature-reality deviation), ranging from 0 to +∞ (0: the sample features completely match the real scene); D_l is the deviation of the sample label from the true value, ranging from 0 to +∞ (0: complete agreement); L_f is the feature learnability of the sample, ranging from 0 to 1 (0: the feature cannot be learned; 1: the feature can be fully learned); L_l is the labeling learnability of the sample; D_m is the deviation of the algorithm model from the real scene, ranging from 0 to +∞ (0: the algorithm model completely matches the real scene); unqualified samples are optimized and adjusted out of the test set based on the sample prediction error E and a threshold value.
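Screening unqualified samples against a threshold on the prediction error can be sketched as follows; the `error` values here are hypothetical stand-ins for the E of formula (2), not the patent's actual computation, and the field names are assumptions:

```python
def screen_test_set(samples, threshold=0.5):
    """Keep samples whose prediction error E is below the threshold;
    flag the rest for optimization or adjustment out of the test set."""
    kept, rejected = [], []
    for s in samples:
        (kept if s["error"] < threshold else rejected).append(s)
    return kept, rejected

samples = [
    {"id": 1, "error": 0.05},   # accurate sample, stays in the test set
    {"id": 2, "error": 0.80},   # high prediction error -> adjust or remove
    {"id": 3, "error": 0.40},
]
kept, rejected = screen_test_set(samples)
print([s["id"] for s in kept], [s["id"] for s in rejected])
```

Iterating this screening each round is what makes the "number of unqualified samples continues to decrease" stopping condition of step 6 observable.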
In particular, in the step 6, the preset optimization condition includes at least one of the following: the weights in the attention weight matrix A_t are all above a preset threshold; the intelligent corpus test is optimized such that the prediction accuracy or other evaluation indexes reach the expected target; the number of unqualified samples optimized and adjusted out of the test set continues to decrease.
In particular, in the step 1, labeling the video and obtaining the truth labels corresponding to the video data includes: a video spider crawls the required videos from mainstream video sites as needed, performs clarity and smoothness detection on the videos, and eliminates videos that do not meet the requirements; the video is disassembled into continuous pictures according to the video attributes, the generated pictures are sorted, and the subtitles in the video are recognized using OCR technology; the generated subtitles are de-duplicated, and the start and end times of the subtitles are calculated according to the video frame rate; the audio is separated from the original video and cut according to the subtitle start and end times; noise reduction is performed on the audio to improve clarity; multi-speaker voice recognition is performed on the cut audio, the cut audio is screened, and the corpus generated after screening is stored in a database by category and language.
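The subtitle start-stop calculation from the video frame rate can be sketched as follows; the frame indices, the 25 fps rate, and the helper name are illustrative assumptions:

```python
def subtitle_span(first_frame, last_frame, fps):
    """Convert the frame range over which a subtitle is visible into start/end seconds."""
    start = first_frame / fps
    end = (last_frame + 1) / fps   # subtitle stays visible through its last frame
    return start, end

# A subtitle shown from frame 150 to frame 249 in a 25 fps video:
start, end = subtitle_span(150, 249, 25.0)
print(start, end)   # 6.0 10.0
```

These start/end times are then used both as the time-axis alignment labels and as the cut points when the audio is intercepted from the original video.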
The invention also provides an intelligent testing and optimizing device for the video corpus, which comprises the following steps:
the video collection and truth value acquisition module is used for collecting video data, wherein the video data comprises audio and images; labeling the video to obtain a true value label corresponding to the video data, wherein the true value label at least comprises: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;
the model building module is used for building a video feature matching algorithm model, and the model comprises the following components: a feature extraction network, a multimodal fusion network, an attention network, and a mapping network;
the feature extraction, fusion and attention weight matrix generation module is used for, in the feature extraction and multi-modal fusion network, extracting a first feature F_v based on the video data, the first feature F_v including visual and speech features in the video data; extracting a second feature P_t based on the truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and, in the attention network, calculating an attention weight matrix A_t between the first feature F_v and the second feature P_t to evaluate the degree of matching between the two;
the cross-entropy Loss function setting module is used for defining, in the mapping network, the optimization objective as minimizing a weighted cross-entropy Loss function Loss, where the cross-entropy Loss function Loss contains the attention weight matrix A_t; and for adjusting the sample weights by means of the attention weight matrix A_t, thereby adjusting the cross-entropy Loss function Loss;
the model training module is used for training the video feature matching algorithm model, fitting the mapping between the first feature F_v and the second feature P_t, and minimizing the cross-entropy Loss function Loss; and for evaluating the algorithm model on a test set and updating the attention weight matrix A_t, wherein, by analyzing the attention weight matrix A_t, samples to which the algorithm model pays insufficient attention are found, so as to strengthen the targeted training of the algorithm model, or unqualified samples are optimized and adjusted out of the test set according to a preset algorithm;
and the optimization completion module is used for repeatedly executing the feature extraction fusion and attention weight matrix generation module to the model training module and iteratively optimizing the video feature matching algorithm model until a result meeting the preset optimization condition is reached.
The invention also provides a computer readable storage medium, wherein the computer readable storage medium is stored with a computer program, and the computer program realizes the intelligent testing and optimizing method of the video corpus when being executed by a processor.
The beneficial effects are that:
according to the scheme provided by the invention, the weight of each sample is analyzed by using an attention mechanism, so that the weaknesses and error sources of the model can be effectively positioned;
by the scheme, the method and the device for training the samples with low attention weight can be used for strengthening training, so that the recognition accuracy of the model on the error-prone samples can be obviously improved;
according to the scheme provided by the invention, the attention network is added by adjusting the model structure, so that the extraction and utilization capacity of the model to key features can be enhanced;
according to the scheme, the attention correspondence matrix between the test sample characteristics and the true value labels is constructed, and a visual analysis basis is provided for model errors;
according to the scheme provided by the invention, the sample weight and the model structure are regulated simultaneously, so that the effective combination of test set generation and model optimization is realized;
according to the scheme provided by the invention, the quality factors of the samples in multiple aspects are evaluated, the contribution of each factor to the error is positioned, and the targeted improvement is performed.
By the scheme, priori knowledge of a pre-training language model and fitting capacity of a new training projection layer are integrated, so that stronger feature extraction is formed.
By the scheme of the invention, a whole set of system schemes from test set construction to iterative optimization model is provided, so that the effect of video understanding is obviously improved.
Drawings
FIG. 1 is a flow chart of an intelligent testing and optimizing method for video corpus provided by the invention;
fig. 2 is a schematic diagram of an intelligent testing and optimizing device for video corpus.
Detailed Description
The invention will now be described in detail by way of example with reference to the accompanying drawings.
The invention provides an intelligent testing and optimizing method for video corpus, as shown in figure 1, comprising the following steps:
step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or translated content of the subtitles and translated evaluation content, labels of characters, scenes and actions in the images, language types of video data, accurate time axis alignment information of the subtitles and labels of the attributes of the video. Wherein, the video data is collected, and the video data comprises audio and images; labeling the video, and acquiring the true value label corresponding to the video data comprises the following steps: the video spider crawls the required video from the main stream video station according to the requirement, carries out clear smoothness detection on the video, and eliminates the video which does not meet the requirement; decomposing the video into continuous pictures according to the video attributes, sequencing the generated pictures, and identifying subtitles in the video by using an OCR technology; performing de-duplication processing on the generated subtitles, and calculating start-stop time according to the video frame rate; decomposing audio from the original video, and intercepting the audio according to the start-stop time of the caption; noise reduction processing is carried out on the audio, so that definition is improved; intercepting the generated audio to carry out multi-person voice recognition, screening the audio according to actual conditions, and storing the corpus generated after screening into a database according to categories and languages.
Step 2, a video feature matching algorithm model is established, and the model comprises: a feature extraction and multi-modal fusion network, an attention network, and a mapping network. In the feature extraction and multi-modal fusion network, a convolutional neural network may be selected to perform feature extraction on the input video frames to obtain a visual feature representation, and a recurrent neural network may be selected to perform feature extraction on the video's audio to obtain a speech feature representation. Specifically, this step may acquire the following data as needed: visual features, namely image semantic features extracted from video frames; speech features extracted from audio, such as Mel spectrum features and acoustic model features; text features, namely language features extracted from video captions and text descriptions, such as word vectors and BERT features; metadata features, namely digital features of metadata such as the shooting time, place, and event of the video; language-related features recognized in the video, such as language category and dialect; audio attribute features, such as acoustic features of background music, noise, and tones; and video attribute features, such as video quality, editing method, and scene category.
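Fusing the per-modality features "through linear mapping and nonlinear transformation" can be sketched as follows; the dimensions, the concatenation step, and the tanh nonlinearity are assumptions for illustration, with the CNN/RNN encoders abstracted into random feature vectors:

```python
import numpy as np

def fuse_modalities(visual, speech, W, b):
    """Concatenate per-modality features, apply a learned linear map, then a nonlinearity."""
    z = np.concatenate([visual, speech], axis=-1)  # joint input to the linear mapping
    return np.tanh(z @ W + b)                      # nonlinear transformation

rng = np.random.default_rng(2)
visual = rng.standard_normal((4, 512))   # e.g. CNN features for a batch of video frames
speech = rng.standard_normal((4, 128))   # e.g. RNN features from the audio track
W = rng.standard_normal((640, 256)) * 0.05  # learned fusion weights (random placeholder)
b = np.zeros(256)
fused = fuse_modalities(visual, speech, W, b)
print(fused.shape)   # (4, 256)
```

The fused vectors are the multi-modal feature representation that the attention network later matches against the truth-label features.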
Step 3, in the feature extraction and multimodal fusion network, a first feature P_s is extracted based on the video data, the first feature P_s including visual and speech features in the video data; a second feature P_t is extracted based on the truth labels; the visual and speech features are fused through linear mapping and nonlinear transformation to obtain a multimodal feature representation of the video; in the attention network, the attention weight matrix W_T between the first feature P_s and the second feature P_t is calculated to evaluate the degree of matching between the two.
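The linear-mapping-plus-nonlinear-transformation fusion of visual and speech features can be sketched as follows; this is a minimal illustration with hypothetical names (in practice the weight matrices w_v, w_s and bias would be learned parameters, and tanh stands in for whatever nonlinearity is chosen):

```python
import math

def fuse(visual, speech, w_v, w_s, bias):
    """Fuse a visual and a speech feature vector into one multimodal
    vector: each modality is linearly mapped into a shared space, the
    maps are summed, and a tanh nonlinearity is applied element-wise."""
    dim = len(bias)
    fused = []
    for j in range(dim):
        z = bias[j]
        z += sum(w_v[j][k] * visual[k] for k in range(len(visual)))
        z += sum(w_s[j][k] * speech[k] for k in range(len(speech)))
        fused.append(math.tanh(z))
    return fused
```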
Attention network: the multimodal features of the video and the truth labeling features are input, and the attention weight matrix between the two is calculated to represent the focus of the model's attention.
In the feature extraction and multimodal fusion network, the first feature P_s may be extracted by a first extraction function φ, and the second feature P_t may be extracted by a second extraction function θ; the first extraction function φ and the second extraction function θ may use a pre-trained language model as an encoder that directly outputs the encoded features. The visual and speech features are fused through linear mapping and nonlinear transformation, and the multimodal features of the video are obtained automatically by setting up a deep learning model. When the first extraction function φ and the second extraction function θ use a pre-trained language model as the encoder, the first extraction function φ comprises the image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model.
Setting up the deep learning model includes: setting the output dimension of the projection layer to generate the desired feature representation; the projection layer comprises one or more fully connected linear layers implementing a linear mapping, optionally with nonlinear activation functions; the weight matrix of the projection layer is learned through back-propagation training to fit the optimal feature map; the trained projection layers are connected in series to form a multi-layer perceptron structure, enhancing the mapping capacity, and regularization is added between the projection layers to help optimize training; the input end corresponds to different projection layers to fit feature maps of different domains, and a dropout layer is arranged to prevent the projection layers from overfitting.
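A minimal sketch of such a stack of projection layers, with ReLU activations and inverted dropout between them (the names, the ReLU choice, and the dropout placement are illustrative assumptions, not mandated by the patent):

```python
import random

def project(x, layers, drop_p=0.0, training=False, rng=None):
    """Apply a stack of linear projection layers to vector `x`.

    `layers` is a list of (weight_matrix, bias) pairs forming a
    multi-layer perceptron; ReLU is applied between layers, and an
    optional inverted dropout (`drop_p`) is applied during training only.
    """
    rng = rng or random.Random(0)
    h = list(x)
    for li, (w, b) in enumerate(layers):
        h = [sum(w[j][k] * h[k] for k in range(len(h))) + b[j]
             for j in range(len(b))]
        if li < len(layers) - 1:
            h = [max(0.0, v) for v in h]          # ReLU between layers
            if training and drop_p > 0.0:
                h = [0.0 if rng.random() < drop_p else v / (1 - drop_p)
                     for v in h]                  # inverted dropout
    return h
```

At evaluation time (`training=False`) the dropout branch is skipped, so the projection is deterministic.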
The second feature P_t is extracted based on the truth labels, the attention weight matrix W_T between the first feature P_s and the second feature P_t is calculated, and the degree of matching between the two is evaluated; the attention matrix W_T is calculated as:

W_T = softmax( σ(φ(P_s)) · θ(P̃_t)^T / √|θ(P̃_t)| )    (1)

wherein W_T represents the attention matrix at moment T, and the superscript T in θ(P̃_t)^T represents the transpose; P̃_t represents the second mapped word-vector-space feature obtained after mapping the second feature P_t to the word vector space; φ represents the first extraction function of the first feature P_s, and θ represents the second extraction function of the second mapped word-vector-space feature P̃_t; |θ(P̃_t)| represents the length of the second mapped word-vector-space feature θ(P̃_t), and the square root √|θ(P̃_t)| normalizes it; the softmax function normalizes the inner product of the result of the first extraction function and the result of the second extraction function; σ(P_s) indicates that the σ function makes the first feature P_s conform to a probability distribution as an attention weight, where σ is a nonlinear function.
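The scaled, softmax-normalized inner product above can be illustrated for a single query row in pure Python (a hypothetical helper; scaling by the square root of the dimension stands in for √|θ(P̃_t)|):

```python
import math

def attention_weights(q, k):
    """One row of scaled dot-product attention: softmax over the
    inner products of query vector `q` with each key vector in `k`,
    scaled by the square root of the vector length."""
    d = len(q)
    scores = [sum(qj * kij for qj, kij in zip(q, ki)) / math.sqrt(d)
              for ki in k]
    m = max(scores)                       # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

Because of the softmax, each row of the resulting attention matrix is non-negative and sums to 1, i.e., it conforms to a probability distribution as the text requires.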
Further, effective pre-trained models such as BERT/ResNet are adopted to extract primary semantic features of the input samples; after the pre-trained language model encoder, a fully connected layer is added as a projection layer, the projection layer is trained in a supervised manner with a large number of samples, and it learns to project into the required feature representation space. By freezing the pre-trained language model parameters, only the projection layer is trained. A progressive unfreezing mode may also be used to gradually open the higher-layer parameters of the pre-trained language model. The projection layer may comprise multiple fully connected layers forming a multi-layer perceptron, enhancing the nonlinear fitting capacity of the feature transformation. Residual connections, regularization, and other means are added between projection layers to further improve the feature extraction effect. Through the combination of the pre-trained language model and the projection layers, prior knowledge is utilized and the model's capacity to learn the specific problem is increased. The relationship between the pre-trained language model and the multiple projection layers may be expressed by the following formula:

P_new = max( g_1(f(x)) + b_1, g_2(f(x)) + b_2 )

wherein f(x) represents the pre-trained language model's encoding of the input, equivalent to using prior knowledge; x represents an input sample; g_1 and g_2 represent the projection layers' transformations of f(x), equivalent to the learned projection functions; P_new represents the new feature representation obtained through the projection layers; and b_1, b_2 are bias terms that respectively adjust the output values of the two projection layers. Through learning, b_1 and b_2 can take different values, which introduces a certain distinction between the two projection layers, so that slightly different feature maps can be obtained. The final max operation realizes the fusion of the two mapping results; the bias values can be initialized to 0 or assigned random values smaller than 1. During training, the bias terms are gradually updated to help the model fit the target mapping relation. max represents fusing the features of the pre-trained language model with the newly learned features, so the formula combines them to obtain features that fuse prior knowledge with new learning.
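The max-fusion of the two biased projections can be sketched directly from the formula (names hypothetical; g_1 and g_2 are passed in as callables standing for the learned projection functions):

```python
def fused_feature(f_x, g1, g2, b1, b2):
    """Element-wise max over two biased projections of the encoder
    output f(x): max(g1(f(x)) + b1, g2(f(x)) + b2)."""
    h1 = [v + b1 for v in g1(f_x)]
    h2 = [v + b2 for v in g2(f_x)]
    return [max(a, b) for a, b in zip(h1, h2)]
```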
Step 4, in the mapping network, the optimization target is defined as minimizing a weighted cross-entropy Loss function Loss that incorporates the attention weight matrix W_T. In step 4, the cross-entropy loss function is:

Loss = -Σ_i w_i · y_i · log(ŷ_i)

wherein w_i represents the weight corresponding to the i-th sample in the attention weight matrix, i.e., each sample's loss term is weighted, and the higher the attention weight, the greater that loss's contribution to the total loss; y_i represents the true label of the i-th sample, and ŷ_i represents the model's predicted output for the i-th sample. The mapping network also maps the video multimodal features to a classification output, i.e., a video category prediction, through a fully connected layer; the sample weights are adjusted through the attention weight matrix W_T, adjusting the cross-entropy Loss function Loss.
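The attention-weighted cross-entropy above reduces to the following sketch for scalar per-sample labels (hypothetical helper name; the epsilon clamp guards log(0)):

```python
import math

def weighted_cross_entropy(weights, y_true, y_pred, eps=1e-12):
    """Attention-weighted cross-entropy:
    Loss = -sum_i w_i * y_i * log(yhat_i)."""
    return -sum(w * y * math.log(max(p, eps))
                for w, y, p in zip(weights, y_true, y_pred))
```

A larger w_i multiplies that sample's log-loss, so samples carrying more attention weight contribute more to the total loss, exactly as the text describes.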
Step 5, the video feature matching algorithm model is trained, the mapping between the first feature P_s and the second feature P_t is fitted, and the cross-entropy Loss function Loss is minimized; the algorithm model is evaluated on a test set and the attention weight matrix W_T is updated, wherein, by analyzing the attention weight matrix W_T, samples receiving insufficient attention from the algorithm model are found so as to strengthen targeted training of the algorithm model, or unqualified samples from the test set are optimized and adjusted according to a preset algorithm; the sample weights are adjusted through the attention weight matrix W_T, adjusting the cross-entropy Loss function Loss. In step 5, optimizing and adjusting unqualified samples from the test set according to the preset algorithm includes determining the sample prediction error by the following formula and optimizing and adjusting the unqualified samples:

(2)

wherein M_2 represents the sample prediction error, with a value range of 0-1, where 0 is completely accurate and 1 is completely erroneous; the sample feature expression quality has a value range of 0-1, where 0 means the feature cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0-1, where 0 represents labeling errors and 1 represents correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-truth deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample labels from the truth values, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0-1, where 0 means the feature cannot be learned and 1 means the feature can be fully learned; the sample labeling learnability is defined analogously; ρ is the true deviation between the algorithm model and the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene. Based on the sample prediction error M_2 and a threshold, unqualified samples from the test set are optimized and adjusted.
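The final thresholding step, selecting unqualified samples by their prediction error M_2, might look like the following (the particular threshold value is an illustrative assumption):

```python
def flag_failed_samples(errors, threshold=0.5):
    """Return indices of samples whose prediction error M_2 exceeds
    the threshold, i.e., the samples selected for optimization and
    adjustment from the test set."""
    return [i for i, m2 in enumerate(errors) if m2 > threshold]
```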
Step 6, steps 3-5 are repeated, iteratively optimizing the video feature matching algorithm model until a result satisfying the preset optimization conditions is reached. In step 6, the preset optimization conditions include at least one of the following: the weights in the attention weight matrix W_T are all above a preset threshold; the corpus intelligent test is optimized and the prediction accuracy or another evaluation index reaches the expected target; or the number of unqualified samples optimized and adjusted from the test set continues to decrease.
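A sketch of checking two of the preset optimization conditions (the particular threshold and target values are illustrative assumptions, not from the patent):

```python
def meets_stop_condition(weights, accuracy, w_min=0.8, acc_target=0.95):
    """True if either preset optimization condition holds: every
    attention weight is at or above the threshold, or the evaluation
    metric has reached its target."""
    return all(w >= w_min for w in weights) or accuracy >= acc_target
```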
The invention also provides an intelligent testing and optimizing device for video corpus, as shown in fig. 2, comprising:
the video collection and truth value acquisition module, used for collecting video data, wherein the video data comprises audio and images; labeling the video to obtain truth labels corresponding to the video data, wherein the truth labels at least comprise: the audio subtitles and/or translated content of the subtitles and translation evaluation content, labels of characters, scenes, and actions in the images, the language type of the video data, accurate time-axis alignment information of the subtitles, and labels of the attributes of the video. Collecting the video data and labeling the video to obtain the corresponding truth labels comprises the following steps: a video spider crawls the required videos from mainstream video sites on demand, performs clarity and fluency detection on the videos, and eliminates videos that do not meet the requirements; each video is decomposed into consecutive pictures according to the video attributes, the generated pictures are ordered, and the subtitles in the video are recognized using OCR technology; the generated subtitles are de-duplicated, and their start and stop times are calculated from the video frame rate; the audio is separated from the original video and cut according to the subtitle start and stop times; the audio is denoised to improve clarity; the cut audio is passed through multi-speaker voice recognition and screened according to actual conditions, and the corpus generated after screening is stored in a database by category and language.
The model building module, used for establishing a video feature matching algorithm model, the model comprising: a feature extraction and multimodal fusion network, an attention network, and a mapping network. In the feature extraction and multimodal fusion network, a convolutional neural network may be selected to extract features from the input video frames to obtain a visual feature representation, and a recurrent neural network may be selected to extract features from the video audio to obtain a speech feature representation. Specifically, this step may acquire the following data as needed: visual features, i.e., image semantic features extracted from video frames; speech features extracted from the audio, such as Mel spectral features and acoustic model features; text features, i.e., language features extracted from video subtitles and text descriptions, such as word vectors and BERT features; metadata features, i.e., digital features of metadata such as the shooting time, place, and event of the video; language-related features recognized in the video, such as language category and dialect; audio attribute features, such as acoustic features of background music, noise, and tones; and video attribute features, such as video quality, editing method, and scene category.
A feature extraction fusion and attention weight matrix generation module, used for, in the feature extraction and multimodal fusion network, extracting a first feature P_s based on the video data, the first feature P_s including visual and speech features in the video data; extracting a second feature P_t based on the truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multimodal feature representation of the video; and, in the attention network, calculating the attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two.
Attention network: the multimodal features of the video and the truth labeling features are input, and the attention weight matrix between the two is calculated to represent the focus of the model's attention.
In the feature extraction and multimodal fusion network, the first feature P_s may be extracted by a first extraction function φ, and the second feature P_t may be extracted by a second extraction function θ; the first extraction function φ and the second extraction function θ may use a pre-trained language model as an encoder that directly outputs the encoded features. The visual and speech features are fused through linear mapping and nonlinear transformation, and the multimodal features of the video are obtained automatically by setting up a deep learning model. When the first extraction function φ and the second extraction function θ use a pre-trained language model as the encoder, the first extraction function φ comprises the image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model.
Setting up the deep learning model includes: setting the output dimension of the projection layer to generate the desired feature representation; the projection layer comprises one or more fully connected linear layers implementing a linear mapping, optionally with nonlinear activation functions; the weight matrix of the projection layer is learned through back-propagation training to fit the optimal feature map; the trained projection layers are connected in series to form a multi-layer perceptron structure, enhancing the mapping capacity, and regularization is added between the projection layers to help optimize training; the input end corresponds to different projection layers to fit feature maps of different domains, and a dropout layer is arranged to prevent the projection layers from overfitting.
The second feature P_t is extracted based on the truth labels, the attention weight matrix W_T between the first feature P_s and the second feature P_t is calculated, and the degree of matching between the two is evaluated; the attention matrix W_T is calculated as:

W_T = softmax( σ(φ(P_s)) · θ(P̃_t)^T / √|θ(P̃_t)| )    (1)

wherein W_T represents the attention matrix at moment T, and the superscript T in θ(P̃_t)^T represents the transpose; P̃_t represents the second mapped word-vector-space feature obtained after mapping the second feature P_t to the word vector space; φ represents the first extraction function of the first feature P_s, and θ represents the second extraction function of the second mapped word-vector-space feature P̃_t; |θ(P̃_t)| represents the length of the second mapped word-vector-space feature θ(P̃_t), and the square root √|θ(P̃_t)| normalizes it; the softmax function normalizes the inner product of the result of the first extraction function and the result of the second extraction function; σ(P_s) indicates that the σ function makes the first feature P_s conform to a probability distribution as an attention weight, where σ is a nonlinear function.
Further, effective pre-trained models such as BERT/ResNet are adopted to extract primary semantic features of the input samples; after the pre-trained language model encoder, a fully connected layer is added as a projection layer, the projection layer is trained in a supervised manner with a large number of samples, and it learns to project into the required feature representation space. By freezing the pre-trained language model parameters, only the projection layer is trained. A progressive unfreezing mode may also be used to gradually open the higher-layer parameters of the pre-trained language model. The projection layer may comprise multiple fully connected layers forming a multi-layer perceptron, enhancing the nonlinear fitting capacity of the feature transformation. Residual connections, regularization, and other means are added between projection layers to further improve the feature extraction effect. Through the combination of the pre-trained language model and the projection layers, prior knowledge is utilized and the model's capacity to learn the specific problem is increased. The relationship between the pre-trained language model and the projection layers may be expressed by the following formula:

P_new = max( g_1(f(x)) + b_1, g_2(f(x)) + b_2 )

wherein f(x) represents the pre-trained language model's encoding of the input, equivalent to using prior knowledge; x represents an input sample; g_1 and g_2 represent the projection layers' transformations of f(x), equivalent to the learned projection functions; P_new represents the new feature representation obtained through the projection layers; and b_1, b_2 are bias terms that respectively adjust the output values of the two projection layers. Through learning, b_1 and b_2 can take different values, which introduces a certain distinction between the two projection layers, so that slightly different feature maps can be obtained. The final max operation realizes the fusion of the two mapping results; the bias values can be initialized to 0 or assigned random values smaller than 1. During training, the bias terms are gradually updated to help the model fit the target mapping relation. max represents fusing the features of the pre-trained language model with the newly learned features, so the formula combines them to obtain features that fuse prior knowledge with new learning.
The cross-entropy Loss function setting module, used for defining the optimization target as minimizing a weighted cross-entropy Loss function Loss in the mapping network, the cross-entropy Loss function Loss incorporating the attention weight matrix W_T; wherein the cross-entropy loss function is:

Loss = -Σ_i w_i · y_i · log(ŷ_i)

wherein w_i represents the weight corresponding to the i-th sample in the attention weight matrix, i.e., each sample's loss term is weighted, and the higher the attention weight, the greater that loss's contribution to the total loss; y_i represents the true label of the i-th sample, and ŷ_i represents the model's predicted output for the i-th sample. The mapping network also maps the video multimodal features to a classification output, i.e., a video category prediction, through a fully connected layer; the sample weights are adjusted through the attention weight matrix W_T, adjusting the cross-entropy Loss function Loss.
The model training module, used for training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross-entropy Loss function Loss; the algorithm model is evaluated on a test set and the attention weight matrix W_T is updated, wherein, by analyzing the attention weight matrix W_T, samples receiving insufficient attention from the algorithm model are found so as to strengthen targeted training of the algorithm model, or unqualified samples from the test set are optimized and adjusted according to a preset algorithm; the sample weights are adjusted through the attention weight matrix W_T, adjusting the cross-entropy Loss function Loss. Optimizing and adjusting unqualified samples from the test set according to the preset algorithm includes determining the sample prediction error by the following formula:

(2)

wherein M_2 represents the sample prediction error, with a value range of 0-1, where 0 is completely accurate and 1 is completely erroneous; the sample feature expression quality has a value range of 0-1, where 0 means the feature cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0-1, where 0 represents labeling errors and 1 represents correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-truth deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample labels from the truth values, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0-1, where 0 means the feature cannot be learned and 1 means the feature can be fully learned; the sample labeling learnability is defined analogously; ρ is the true deviation between the algorithm model and the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene. Based on the sample prediction error M_2 and a threshold, unqualified samples from the test set are optimized and adjusted.
And the optimization completion module, used for repeatedly executing the feature extraction fusion and attention weight matrix generation module through the model training module, iteratively optimizing the video feature matching algorithm model until a result satisfying the preset optimization conditions is reached. In the optimization completion module, the preset optimization conditions include at least one of the following: the weights in the attention weight matrix W_T are all above a preset threshold; the corpus intelligent test is optimized and the prediction accuracy or another evaluation index reaches the expected target; or the number of unqualified samples optimized and adjusted from the test set continues to decrease.
The invention also discloses a computer-readable storage medium storing a computer program which, when executed by a processor, implements the video corpus intelligent test optimization method. The features of this embodiment are similar to those of the method embodiments and are not described in detail again.
In summary, the above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
It will be evident to those skilled in the art that the embodiments of the invention are not limited to the details of the foregoing illustrative embodiments, and that the embodiments of the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of embodiments being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units, modules or means recited in a system, means or terminal claim may also be implemented by means of software or hardware by means of one and the same unit, module or means. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the embodiment of the present invention, and not for limiting, and although the embodiment of the present invention has been described in detail with reference to the above-mentioned preferred embodiments, it should be understood by those skilled in the art that modifications and equivalent substitutions can be made to the technical solution of the embodiment of the present invention without departing from the spirit and scope of the technical solution of the embodiment of the present invention.
Claims (9)
1. The intelligent testing and optimizing method for the video corpus is characterized by comprising the following steps of:
step 1, collecting video data, wherein the video data comprises audio and images; labeling the video data to obtain true value labels corresponding to the video data, wherein the true value labels at least comprise: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;
step 2, a video feature matching algorithm model is established, and the model comprises the following steps: a feature extraction and multimodal fusion network, an attention network, and a mapping network;
step 3, in the feature extraction and multimodal fusion network, extracting a first feature P_s based on the video data, the first feature P_s including visual and speech features in the video data; extracting a second feature P_t based on the truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multimodal feature representation of the video; in the attention network, calculating an attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two;
step 4, defining the optimization target as minimizing a weighted cross-entropy Loss function Loss in the mapping network, the cross-entropy Loss function Loss comprising the attention weight matrix W_T; adjusting sample weights through the attention weight matrix W_T and adjusting the cross-entropy Loss function Loss;
step 5, training the video feature matching algorithm model, fitting the mapping between the first feature P_s and the second feature P_t, and minimizing the cross-entropy Loss function Loss; evaluating the algorithm model on a test set and updating the attention weight matrix W_T, wherein, by analyzing the attention weight matrix W_T, samples receiving insufficient attention from the algorithm model are found so as to strengthen targeted training of the algorithm model, or unqualified samples from the test set are optimized and adjusted according to a preset algorithm;
step 6, repeating steps 3-5, iteratively optimizing the video feature matching algorithm model until a preset optimization condition is satisfied;
in step 5, optimizing and adjusting unqualified samples from the test set according to a preset algorithm, including determining a sample prediction error according to the following formula, optimizing and adjusting unqualified samples:
wherein M is 2 The sample prediction error is represented, the value range is 0-1, 0 is completely accurate, and 1 is completely erroneous;for the sample feature expression quality, the value range is 0-1, the 0 representative feature cannot be expressed completely and accurately, and the 1 representative feature expression is complete and accurate; />The quality of the sample is marked, the value range is 0-1, 0 represents marking errors, and 1 represents marking correctness and no errors; l (L) J For deviations of sample features from real scenes, the feature real deviations, the value range is 0 to + -infinity, and 0 is the sample characteristic which is completely matched with the real scene; l (L) C For the deviation of the sample label from the true value, the value range is 0 to +infinity, and 0 is completely consistent; />For the sample feature learning property, the value range is 0-1, the 0 represents feature can not be learned, and the 1 represents feature can be completely learned; />Labeling the sample with a learning property; ρ is the real deviation of the algorithm model and the real scene, the value range is 0 to++ infinity, and 0 is the algorithm model completely matched with the real scene; based on sample prediction error M 2 And threshold values are optimized from the test set to be disqualifiedAnd (3) a sample.
2. The intelligent testing and optimizing method for video corpus of claim 1,
in the feature extraction and multimodal fusion network, the first feature P_s may be extracted by a first extraction function φ, and the second feature P_t by a second extraction function θ; wherein the first extraction function φ and the second extraction function θ may use a pre-trained language model as an encoder for directly outputting the encoded features; and the visual and speech features are fused through linear mapping and nonlinear transformation, and the multimodal features of the video are obtained automatically by setting up a deep learning model.
3. The intelligent testing and optimizing method for video corpus of claim 2,
when the first extraction function φ and the second extraction function θ use a pre-trained model as an encoder, the first extraction function φ comprises an image feature extraction model ResNet, and the second extraction function θ comprises a BERT model or a RoBERTa model;
setting up the deep learning model includes: setting the output dimension of a projection layer to generate the desired feature representation, the projection layer comprising one or more linear fully-connected layers that implement a linear mapping, or nonlinear activation functions; learning the weight matrix of the projection layer through back-propagation training to fit an optimal feature mapping; connecting the trained projection layers in series to form a multi-layer perceptron structure, which enhances the mapping capacity, with regularization added between the projection layers to aid training; and having the input end correspond to different projection layers to fit the feature mappings of different domains, with a dropout layer arranged to prevent the projection layers from overfitting.
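The projection-layer design described in claim 3 can be sketched as follows. This is an illustrative NumPy sketch, not the patented implementation; the layer sizes, the ReLU activation, and the dropout rate are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, w, b):
    # One fully-connected (projection) layer: linear mapping x @ W + b.
    return x @ w + b

def relu(x):
    # Nonlinear activation between stacked projection layers.
    return np.maximum(x, 0.0)

def dropout(x, rate, training):
    # Dropout layer to keep the projection layers from overfitting.
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def project(x, layers, rate=0.1, training=False):
    # Projection layers connected in series form a small multi-layer perceptron.
    for i, (w, b) in enumerate(layers):
        x = linear(x, w, b)
        if i < len(layers) - 1:  # no activation/dropout after the last layer
            x = dropout(relu(x), rate, training)
    return x

# Example: map a 512-d fused multi-modal feature to a 128-d representation.
layers = [
    (rng.standard_normal((512, 256)) * 0.02, np.zeros(256)),
    (rng.standard_normal((256, 128)) * 0.02, np.zeros(128)),
]
features = rng.standard_normal((4, 512))  # batch of 4 fused video features
out = project(features, layers)
print(out.shape)  # (4, 128)
```

In a real system the weight matrices would be learned by back-propagation rather than randomly initialized; the sketch only shows the forward mapping structure.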
4. The intelligent testing and optimizing method for video corpus of claim 1,
the attention weight matrix W_T is calculated as:

W_T = softmax( φ(P_s) · θ(P̃_t)^T / √d_t ) · σ(P_s)

wherein W_T represents the attention matrix at moment T; P̃_t denotes the second mapped word-vector-space feature obtained by mapping the second feature P_t into the word vector space, and the superscript T denotes the transpose; φ denotes the first extraction function applied to the first feature P_s, and θ denotes the second extraction function applied to P̃_t; d_t denotes the length of P̃_t, and division by √d_t normalizes the second mapped word-vector-space feature; the softmax function normalizes the inner product of the result of the first extraction function and the result of the second extraction function; σ(P_s) passes the first feature P_s through the σ function, a nonlinear function, so that it satisfies a probability distribution and can serve as an attention weight.
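A minimal NumPy sketch of the scaled inner-product attention weighting described in claim 4. The feature matrices are illustrative, and the σ(P_s) gating term is omitted for brevity; only the normalized inner product of the two extraction results is shown.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(phi_ps, theta_pt):
    # phi_ps:   encoded first features phi(P_s),        shape (n, d)
    # theta_pt: encoded word-vector-space features,      shape (m, d)
    d_t = theta_pt.shape[1]
    scores = phi_ps @ theta_pt.T / np.sqrt(d_t)  # scaled inner product
    return softmax(scores, axis=-1)              # each row sums to 1

rng = np.random.default_rng(1)
w_t = attention_weights(rng.standard_normal((3, 8)), rng.standard_normal((5, 8)))
print(w_t.shape)                           # (3, 5)
print(np.allclose(w_t.sum(axis=1), 1.0))   # True
```

Rows of the resulting matrix are valid probability distributions, which is what lets them be reused as per-sample weights in the loss of claim 5.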
5. The intelligent test optimization method for video corpus according to claim 1, wherein in the step 4, the cross entropy loss function is

Loss = −Σ_i W_T[i] · y_i · log(ŷ_i)

wherein W_T[i] represents the weight corresponding to the i-th sample in the attention weight matrix W_T, i.e. the loss term of each sample is weighted, and the higher the attention weight, the greater the contribution of that sample's loss to the total loss; y_i denotes the true label of the i-th sample, and ŷ_i denotes the algorithm model's prediction output for the i-th sample.
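The attention-weighted cross entropy of claim 5 can be sketched in NumPy; the label, prediction, and weight values below are illustrative.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, weights, eps=1e-12):
    # Per-sample cross entropy scaled by the attention weight W_T[i]:
    # samples with higher attention contribute more to the total loss.
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(weights * y_true * np.log(y_pred))

y_true = np.array([1.0, 1.0, 0.0])  # true labels y_i
y_pred = np.array([0.9, 0.6, 0.2])  # model predictions ŷ_i
w = np.array([0.5, 1.0, 0.1])       # attention weights W_T[i]
loss = weighted_cross_entropy(y_true, y_pred, w)
print(round(loss, 4))  # 0.5635
```

Note how the second sample, carrying the largest weight and the largest error, dominates the total loss.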
6. The intelligent testing optimization method for video corpus according to claim 1, wherein: in the step 6, the preset optimization condition includes at least one of the following: the weights in the attention weight matrix W_T are all above a preset threshold; the prediction accuracy or another evaluation index of the optimized corpus intelligent test reaches the expected target; the number of unqualified samples optimized and adjusted out of the test set continues to decrease.
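A hypothetical helper illustrating how the three stopping conditions of claim 6 might be checked; the function name, the threshold, and the accuracy target are all assumptions, not part of the patent.

```python
def should_stop(weights, accuracy, failed_counts,
                weight_threshold=0.5, target_accuracy=0.95):
    # Condition 1: every attention weight is above the preset threshold.
    weights_ok = all(w >= weight_threshold for w in weights)
    # Condition 2: the evaluation index reaches the expected target.
    accuracy_ok = accuracy >= target_accuracy
    # Condition 3: the number of unqualified samples removed from the
    # test set keeps strictly decreasing across iterations.
    decreasing = all(a > b for a, b in zip(failed_counts, failed_counts[1:]))
    # Any one of the conditions suffices ("at least one of").
    return weights_ok or accuracy_ok or decreasing

# All weights above 0.5, so iteration may stop even though accuracy lags:
print(should_stop([0.6, 0.7], accuracy=0.90, failed_counts=[9, 9, 8]))  # True
```

The disjunction mirrors the claim's "at least one of" wording; a stricter implementation could require all three.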
7. The intelligent testing optimization method for video corpus according to claim 1, wherein: in the step 1, labeling the video and obtaining the true value label corresponding to the video data includes: a video spider crawls the required videos from mainstream video sites as needed, performs clarity and smoothness detection on the videos, and eliminates videos that do not meet the requirements; the video is disassembled into consecutive pictures according to the video attributes, the generated pictures are sorted, and the subtitles in the video are identified using OCR technology; the generated subtitles are de-duplicated, and the start and stop times of the subtitles are calculated according to the video frame rate; the audio is separated from the original video and intercepted according to the start and stop times of the subtitles; noise reduction is performed on the audio to improve clarity; multi-speaker voice recognition is performed on the intercepted audio, the intercepted audio is screened, and the corpus generated after screening is stored in a database by category and language.
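The step of deriving subtitle start and stop times from the video frame rate can be sketched as follows; representing a subtitle by the span of frames on which OCR detected it is an assumption for illustration.

```python
def subtitle_times(first_frame, last_frame, fps):
    # Convert the frame span of a recognized subtitle into start/stop
    # times (in seconds) using the video frame rate.
    start = first_frame / fps
    stop = (last_frame + 1) / fps  # subtitle remains visible through last_frame
    return start, stop

# A subtitle detected on frames 50..124 of a 25 fps video:
start, stop = subtitle_times(50, 124, 25.0)
print(start, stop)  # 2.0 5.0
```

These times are then used to cut the corresponding audio segment out of the original track for voice recognition.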
8. An intelligent testing and optimizing device for video corpus is characterized by comprising:
the video collection and truth value acquisition module is used for collecting video data, wherein the video data comprises audio and images; labeling the video to obtain a true value label corresponding to the video data, wherein the true value label at least comprises: the audio subtitles and/or the translated contents of the subtitles and the translated evaluation contents, the labels of characters, scenes and actions in the images, the language types of video data, the accurate time axis alignment information of the subtitles and the labels of the attributes of the video;
the model building module is used for building a video feature matching algorithm model, and the model comprises the following components: a feature extraction network, a multimodal fusion network, an attention network, and a mapping network;
a feature extraction, fusion and attention weight matrix generation module for extracting, in the feature extraction and multimodal fusion network, a first feature P_s based on the video data, the first feature P_s including visual and speech features in the video data; extracting a second feature P_t based on the truth labels; fusing the visual and speech features through linear mapping and nonlinear transformation to obtain a multi-modal feature representation of the video; and, in the attention network, calculating the attention weight matrix W_T between the first feature P_s and the second feature P_t to evaluate the degree of matching between the two;
a cross entropy loss function setting module, configured to define, in the mapping network, a cross entropy loss function Loss with minimization as the optimization objective, the cross entropy loss function Loss including the attention weight matrix W_T; sample weights are adjusted by means of the attention weight matrix W_T, thereby adjusting the cross entropy loss function Loss;
the model training module is used for training the video feature matching algorithm model, fitting the mapping relationship between the first feature P_s and the second feature P_t, and minimizing the cross entropy loss function Loss; the algorithm model is evaluated on a test set and the attention weight matrix W_T is updated, wherein by analysing the attention weight matrix W_T, samples to which the algorithm model pays insufficient attention are found, so as to enhance targeted training of the algorithm model or to optimize and adjust unqualified samples out of the test set according to a preset algorithm;
the optimization completion module is used for repeatedly executing the modules from the feature extraction, fusion and attention weight matrix generation module through the model training module, iteratively optimizing the video feature matching algorithm model until a result meeting a preset optimization condition is reached;
the model training module optimizes and adjusts unqualified samples out of the test set according to a preset algorithm, wherein the model training module determines the sample prediction error by the following formula and optimizes and adjusts the unqualified samples accordingly:
wherein M₂ represents the sample prediction error, with a value range of 0 to 1, where 0 means completely accurate and 1 means completely erroneous; the sample feature expression quality has a value range of 0 to 1, where 0 means the feature cannot be expressed completely and accurately and 1 means the feature expression is complete and accurate; the sample labeling quality has a value range of 0 to 1, where 0 represents labeling errors and 1 represents correct, error-free labeling; L_J is the deviation of the sample features from the real scene (the feature-reality deviation), with a value range of 0 to +∞, where 0 means the sample features completely match the real scene; L_C is the deviation of the sample label from the true value, with a value range of 0 to +∞, where 0 means complete agreement; the sample feature learnability has a value range of 0 to 1, where 0 means the feature cannot be learned and 1 means the feature can be fully learned; the sample label learnability is defined analogously; ρ is the deviation of the algorithm model from the real scene, with a value range of 0 to +∞, where 0 means the algorithm model completely matches the real scene; based on the sample prediction error M₂ and a threshold value, unqualified samples are optimized and adjusted out of the test set.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the video corpus intelligent test optimization method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311504149.9A CN117251599B (en) | 2023-11-13 | 2023-11-13 | Video corpus intelligent test optimization method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117251599A CN117251599A (en) | 2023-12-19 |
CN117251599B true CN117251599B (en) | 2024-03-15 |
Family
ID=89137160
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114398961A (en) * | 2021-12-28 | 2022-04-26 | 西南交通大学 | Visual question-answering method based on multi-mode depth feature fusion and model thereof |
CN114626441A (en) * | 2022-02-23 | 2022-06-14 | 苏州大学 | Implicit multi-mode matching method and system based on visual contrast attention |
CN114724548A (en) * | 2022-03-11 | 2022-07-08 | 中国科学技术大学 | Training method of multi-mode speech recognition model, speech recognition method and equipment |
CN114743143A (en) * | 2022-04-11 | 2022-07-12 | 同济大学 | Video description generation method based on multi-concept knowledge mining and storage medium |
CN116955699A (en) * | 2023-07-18 | 2023-10-27 | 北京邮电大学 | Video cross-mode search model training method, searching method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8086549B2 (en) * | 2007-11-09 | 2011-12-27 | Microsoft Corporation | Multi-label active learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||