CN113722514A - Internet education video image screening and extracting method based on deep learning - Google Patents

Internet education video image screening and extracting method based on deep learning

Info

Publication number
CN113722514A
Authority
CN
China
Prior art keywords
text
data
feature
vectors
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202111032198.8A
Other languages
Chinese (zh)
Inventor
王晓跃
耿晨熙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Xifeng Education Technology Co ltd
Original Assignee
Jiangsu Xifeng Education Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Xifeng Education Technology Co ltd
Priority to CN202111032198.8A
Publication of CN113722514A
Legal status: Withdrawn (current)


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/435Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/48Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an internet education video image screening and extracting method based on deep learning, belonging to the technical field of multimedia data retrieval. The method comprises the following steps: step one, inputting the multimedia data to be retrieved; step two, feature extraction; step three, feature fusion; step four, screening and extraction. The method uses deep learning to extract and fuse the features of multi-modal data, projects the data into the same common space, and realizes cross-modal retrieval of multi-modal data through similarity calculation.

Description

Internet education video image screening and extracting method based on deep learning
Technical Field
The invention relates to the technical field of multimedia data retrieval, in particular to an internet education video image screening and extracting method based on deep learning.
Background
A search of the prior art shows that Chinese patent No. CN111723111A discloses a method, device, and equipment for extracting data for video production. Although that method screens out data meeting the requirements through keyword search, its screening precision and efficiency are relatively low because the input data is of a single modality. With the development of science and technology, multimedia has become increasingly common in teaching and occupies an ever more important position in it; schools and academic institutions of all kinds have established multimedia classrooms or multifunctional halls. Multimedia teaching integrates sound, image, video, text, and other media; as an effective auxiliary teaching means, it can display the content to be presented intuitively and make it easy to understand, serving the purposes of imparting knowledge, developing intelligence, and cultivating ability, as well as teaching students according to their aptitude in a personalized way, and it is therefore deeply favored by teachers, students, and academic institutions. At present, however, multimedia teaching plans are mostly made by teachers with office software, and teachers must collect and import the multimedia teaching plan materials required for each part of the content in advance. In an era of explosive information growth, multi-modal internet data is often beyond the reach of some teachers: it is difficult to accurately locate the desired materials in massive internet data, so teachers often spend a great deal of time searching for them. Deep learning is a new research direction in the field of machine learning with great potential in text, image, and speech recognition, so how to combine deep learning with multi-modal retrieval has become a focus of research. The invention of an internet education video image screening and extracting method based on deep learning has therefore become all the more important.
Most existing multimedia data screening and extracting methods perform retrieval in a single modality. Because the input data is of a single modality, such methods screen multimedia materials with relatively low precision and efficiency, which tends to reduce teachers' working efficiency when preparing multimedia teaching plans. The present invention therefore provides an internet education video image screening and extracting method based on deep learning.
Disclosure of Invention
The invention aims to remedy the defects in the prior art and provides an internet education video image screening and extracting method based on deep learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for screening and extracting Internet education video images based on deep learning comprises the following specific steps:
step one, inputting the multimedia data to be retrieved: the user inputs the multimedia data to be retrieved, which is multi-modal data specifically comprising text data and image data;
step two, feature extraction: the text data and the image data obtained in step one are input into the corresponding text model and image model respectively for vector feature extraction, yielding a text feature vector and an image feature vector respectively;
step three, feature fusion: the text feature vector and the image feature vector from step two are acquired, the corresponding gate feature and residual feature are constructed through a fusion algorithm, and feature fusion is performed with a metric learning method to obtain a fusion feature vector;
step four, screening and extraction: the multi-modal data in the multimedia teaching plan material library are vector-transformed to obtain target data feature vectors, the target data feature vectors and the fusion feature vector are projected into the same common space, similarity measurement is calculated, the candidates are sorted by similarity, and the first N candidate data are screened out and extracted as the retrieval result.
Further, before feature extraction, the text data in step one must be word-segmented; the segmentation uses a statistics-based word segmentation algorithm to remove stop words and to delimit keywords, the stop words comprising two types: one refers to frequently occurring words, the other to certain function words, including adverbs, prepositions, conjunctions, and interjections; the stop words, together with symbols including "()", "-", "/", and "&", are removed from the segmentation result.
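For illustration only, the statistics-based segmentation and stop-word removal described above might look like the following minimal Python sketch; the jieba segmenter is a common open-source choice (not named by the invention), and the stop-word list here is a small assumed sample:

```python
# Minimal sketch of the step-one preprocessing: statistical word segmentation
# followed by stop-word and symbol removal. The stop-word list below is an
# assumed sample of function words, not the invention's actual list.
import jieba

STOP_WORDS = {"的", "了", "在", "和", "与", "也"}  # assumed sample
SYMBOLS = {"(", ")", "-", "/", "&"}                # symbols named in the text

def segment(text: str) -> list[str]:
    tokens = jieba.lcut(text)                      # statistics-based segmentation
    return [t for t in tokens
            if t.strip() and t not in STOP_WORDS and t not in SYMBOLS]

print(segment("基于深度学习的互联网教育视频图像筛选提取方法"))
```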
Further, the text model and the image model in the second step are a BERT text representation model and a VGGNet network model respectively.
Further, the specific process of feature fusion in step three is as follows:
s1: constructing gate characteristics and residual characteristics according to the text characteristic vectors and the image characteristic vectors by utilizing a corresponding multiplication mode of the same-position elements,
the door characteristic calculation formula is as follows:
f_gate(φ_x, φ_t) = σ(W_g * ReLU(φ′_t)) ⊙ φ_x (1)
the residual feature calculation formula is as follows:
f_res(φ_x, φ_t) = σ(W_r * ReLU(φ′_t)) (2)
in the formula: σ is the sigmoid function, W_g and W_r are each 3 × 3 convolution filters, ReLU is the rectified linear unit, and ⊙ denotes element-wise multiplication of same-position elements;
s2: and carrying out weight proportioning on the constructed gate characteristic and the residual error characteristic, and carrying out linear combination, wherein the formula is as follows:
φ_fuse = w_g · f_gate(φ_x, φ_t) + w_r · f_res(φ_x, φ_t) (3)
s3: and performing weight parameter optimization on the gate features and the residual error features by adopting a depth measurement learning mode to obtain a fusion feature vector.
Further, before the gate feature and the residual feature are constructed, the spatial structures of the text feature vector and the image feature vector must be unified; that is, the text feature vector is structurally transformed by a 3 × 3 convolution filter, according to the following formula:
φ′_t = W * (φ_x, φ_t) (4)
in the formula: phitFor the feature vectors of the text after the structural transformation, phixRepresenting the image feature vector, phitRepresenting the text feature vector and W representing a 3 x 3 convolution filter.
Further, the similarity measurement calculation in the fourth step is implemented by using a cosine distance algorithm, and the specific formula is as follows:
cos(X, Y) = (X · Y) / (‖X‖ · ‖Y‖) (5)
in the formula: x is a fusion feature vector; y is a target data feature vector; cos is a cosine value, the value range of the cos cosine value is [ -1,1], if the cos cosine value is larger, the two vectors are more similar, otherwise, the two vectors are opposite.
Compared with the prior art, the invention has the beneficial effects that:
the application provides an internet education video image screening extraction method based on deep learning, adopt the deep learning technique to carry out feature extraction and feature fusion to multimodal data to project it in same public space, the cross-modal retrieval of multimodal data has been realized through similarity calculation, it compares in current single mode retrieval method, it is favorable to improving multimedia material screening precision and extraction efficiency, and then is favorable to assisting the teacher to carry out the preparation of multimedia teaching plan, improves the work efficiency of teacher when making the multimedia teaching plan.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Fig. 1 is an overall flowchart of the internet education video image screening and extracting method based on deep learning according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
Example 1
Referring to fig. 1, the embodiment discloses a screening and extracting method for internet education video images based on deep learning, which specifically comprises the following steps:
firstly, the multimedia data to be retrieved is input: the user inputs the multimedia data that needs to be retrieved;
specifically, the multimedia data to be retrieved is multi-modal data comprising text data and image data; it can be input in single-modal form to realize cross-modal screening and extraction (i.e., cross-modal retrieval of multimedia data from text data alone or from image data alone), or in combined form (i.e., text data plus image data input in multi-modal form to realize cross-modal retrieval of multimedia data); the text data must be word-segmented before feature extraction, using a statistics-based word segmentation algorithm to remove stop words and delimit keywords; the stop words comprise two types: one refers to frequently occurring words, the other to certain function words, including adverbs, prepositions, conjunctions, and interjections; the stop words, together with symbols including "()", "-", "/", and "&", are removed from the segmentation result.
Then, feature extraction is carried out, text data and image data are obtained and are respectively input into a corresponding text model and an image model for vector feature extraction, and a text feature vector and an image feature vector are respectively obtained;
specifically, the text model and the image model are a BERT text representation model and a VGGNet network model, respectively; the BERT text representation model is a pre-trained language model released by Google in October 2018 that has shown very strong performance on tasks in the field of natural language processing, and the bert-as-service tool is used to process the raw text data into text feature vectors; the VGGNet network model is specifically a VGGNet-16 network model, which sees higher usage than the other VGGNet variants of different depths: its entire training process involves only 3 × 3 convolution operations and 2 × 2 pooling operations, which makes the model simple and easy to use while retaining excellent feature expression, so that, used as an image feature extractor, it obtains image feature vectors efficiently;
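For illustration only, the two extractors can be sketched as follows. The invention names the bert-as-service tool; the sketch substitutes the Hugging Face transformers API for the same purpose, so the checkpoint names and pooling choices are assumptions:

```python
# Minimal sketch of step two: BERT text features and VGGNet-16 image features.
import torch
from transformers import BertModel, BertTokenizer
from torchvision.models import vgg16, VGG16_Weights

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese").eval()
vgg = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).eval()         # VGGNet-16 backbone

def text_features(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0]        # [CLS] vector as the text feature

def image_features(image: torch.Tensor) -> torch.Tensor:
    # image: a (1, 3, 224, 224) tensor; only 3x3 convolutions and 2x2 pooling
    with torch.no_grad():
        return vgg.features(image)            # (1, 512, 7, 7) feature maps
```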
then, feature fusion is carried out: the text feature vector and the image feature vector are acquired, the corresponding gate feature and residual feature are constructed through a fusion algorithm, and feature fusion is performed with a metric learning method to obtain the fusion feature vector;
specifically, because the input text feature vector and image feature vector are output by different network models and their spatial structures are not consistent, feature fusion constructs and combines gate features and residual features; fusing the two yields a unified representation that is consistent in spatial structure while combining the respective features, i.e., the fusion feature vector is formed on the basis of the original features;
finally, screening and extraction: the multi-modal data in the multimedia teaching plan material library are vector-transformed to obtain target data feature vectors, the target data feature vectors and the fusion feature vector are projected into the same common space, similarity measurement is calculated, the candidates are sorted by similarity, and the first N candidate data are screened out and extracted as the retrieval result;
specifically, the similarity measurement calculation is implemented by a cosine distance algorithm, and the specific formula is as follows:
cos(X, Y) = (X · Y) / (‖X‖ · ‖Y‖) (5)
in the formula: x is a fusion feature vector; y is a target data feature vector; cos is cosine value with the range of [ -1,1 []If the cos cosine value is larger, the two vectors are more similar, otherwise, the two vectors are opposite; compared with the existing single-mode retrieval method, the method and the device are beneficial to improving the multimedia material screening precision and the extraction efficiency, and further beneficial to assisting teachers to make multimedia teaching plans, and improving the working efficiency of teachers when making the multimedia teaching plans.
Example 2
Referring to fig. 1, this embodiment discloses a deep-learning-based internet education video image screening and extracting method; apart from the structure it shares with the above embodiment, it describes the specific process of feature fusion in detail;
specifically, the specific process of feature fusion is as follows: firstly, the spatial structures of the text feature vector and the image feature vector are unified, i.e., the text feature vector is structurally transformed by a 3 × 3 convolution filter according to the formula φ′_t = W * (φ_x, φ_t), in which φ′_t is the text feature vector after the structural transformation, φ_x is the image feature vector, φ_t is the text feature vector, and W is a 3 × 3 convolution filter; then, the gate feature and the residual feature are constructed from the text feature vector and the image feature vector by element-wise multiplication of same-position elements, the gate feature calculation formula being f_gate(φ_x, φ_t) = σ(W_g * ReLU(φ′_t)) ⊙ φ_x and the residual feature calculation formula being f_res(φ_x, φ_t) = σ(W_r * ReLU(φ′_t)), in which σ is the sigmoid function, W_g and W_r are each 3 × 3 convolution filters, ReLU is the rectified linear unit, and ⊙ denotes element-wise multiplication of same-position elements; then, the constructed gate feature and residual feature are weighted and combined linearly, according to the following formula:
φ_fuse = w_g · f_gate(φ_x, φ_t) + w_r · f_res(φ_x, φ_t) (3)
finally, the weight parameters of the gate feature and the residual feature are optimized by deep metric learning to obtain the fusion feature vector;
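For illustration only, the deep-metric-learning optimisation above is commonly instantiated with a triplet loss; the sketch below is one such instantiation (not necessarily the invention's), reusing the hypothetical fusion module from the earlier sketch, with margin and optimiser settings chosen arbitrarily:

```python
# Minimal sketch of optimising the fusion weights by deep metric learning:
# fused query features are pulled toward matching library features and pushed
# away from non-matching ones with a triplet loss.
import torch
import torch.nn as nn

fusion = GateResidualFusion(channels=512)      # hypothetical module from above
triplet = nn.TripletMarginLoss(margin=0.2)     # margin is an assumption
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-4)

def train_step(anchor: torch.Tensor, positive: torch.Tensor,
               negative: torch.Tensor) -> float:
    # anchor: fused query vector; positive/negative: target data feature vectors
    loss = triplet(anchor, positive, negative)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```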
in this embodiment, deep learning is used to extract and fuse the features of multi-modal data, the data are projected into the same common space, and cross-modal retrieval of the multi-modal data is realized through similarity calculation.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change that a person skilled in the art could make within the technical scope disclosed by the present invention, according to its technical solutions and inventive concept, shall fall within the scope of the present invention.

Claims (6)

1. A method for screening and extracting Internet education video images based on deep learning is characterized by comprising the following specific steps:
step one, inputting the multimedia data to be retrieved: the user inputs the multimedia data to be retrieved, which is multi-modal data specifically comprising text data and image data;
step two, feature extraction: the text data and the image data obtained in step one are input into the corresponding text model and image model respectively for vector feature extraction, yielding a text feature vector and an image feature vector respectively;
step three, feature fusion: the text feature vector and the image feature vector from step two are acquired, the corresponding gate feature and residual feature are constructed through a fusion algorithm, and feature fusion is performed with a metric learning method to obtain a fusion feature vector;
step four, screening and extraction: the multi-modal data in the multimedia teaching plan material library are vector-transformed to obtain target data feature vectors, the target data feature vectors and the fusion feature vector are projected into the same common space, similarity measurement is calculated, the candidates are sorted by similarity, and the first N candidate data are screened out and extracted as the retrieval result.
2. The method as claimed in claim 1, wherein in step one the text data is word-segmented before feature extraction; the segmentation uses a statistics-based word segmentation algorithm to remove stop words and to delimit keywords, the stop words comprising two types: one refers to frequently occurring words, the other to certain function words, including adverbs, prepositions, conjunctions, and interjections; the stop words, together with symbols including "()", "-", "/", and "&", are removed from the segmentation result.
3. The method as claimed in claim 1, wherein the text model and the image model in step two are a BERT text representation model and a VGGNet network model, respectively.
4. The method for screening and extracting video images for internet education based on deep learning as claimed in claim 1, wherein the specific process of feature fusion in step three is as follows:
s1: constructing gate characteristics and residual characteristics according to the text characteristic vectors and the image characteristic vectors by utilizing a corresponding multiplication mode of the same-position elements,
the door characteristic calculation formula is as follows:
f_gate(φ_x, φ_t) = σ(W_g * ReLU(φ′_t)) ⊙ φ_x (1)
the residual feature calculation formula is as follows:
f_res(φ_x, φ_t) = σ(W_r * ReLU(φ′_t)) (2)
in the formula: σ is the sigmoid function, W_g and W_r are each 3 × 3 convolution filters, ReLU is the rectified linear unit, and ⊙ denotes element-wise multiplication of same-position elements;
s2: and carrying out weight proportioning on the constructed gate characteristic and the residual error characteristic, and carrying out linear combination, wherein the formula is as follows:
φ_fuse = w_g · f_gate(φ_x, φ_t) + w_r · f_res(φ_x, φ_t) (3)
s3: and performing weight parameter optimization on the gate features and the residual error features by adopting a depth measurement learning mode to obtain a fusion feature vector.
5. The method as claimed in claim 4, wherein before the gate feature and the residual feature are constructed, the spatial structures of the text feature vector and the image feature vector must be unified; that is, the text feature vector is structurally transformed by a 3 × 3 convolution filter, according to the following formula:
φ′_t = W * (φ_x, φ_t) (4)
in the formula: phi'tFor the feature vectors of the text after the structural transformation, phixRepresenting the image feature vector, phitRepresenting the text feature vector and W representing a 3 x 3 convolution filter.
6. The method for screening and extracting video images for internet education based on deep learning of claim 1, wherein the similarity measurement calculation in the fourth step is implemented by a cosine distance algorithm, and the specific formula is as follows:
cos(X, Y) = (X · Y) / (‖X‖ · ‖Y‖) (5)
in the formula: x is a fusion feature vector; y is a target data feature vector; cos is a cosine value, the value range of the cos cosine value is [ -1,1], if the cos cosine value is larger, the two vectors are more similar, otherwise, the two vectors are opposite.
CN202111032198.8A 2021-09-03 2021-09-03 Internet education video image screening and extracting method based on deep learning Withdrawn CN113722514A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111032198.8A CN113722514A (en) 2021-09-03 2021-09-03 Internet education video image screening and extracting method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111032198.8A CN113722514A (en) 2021-09-03 2021-09-03 Internet education video image screening and extracting method based on deep learning

Publications (1)

Publication Number Publication Date
CN113722514A 2021-11-30

Family

ID=78681429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111032198.8A Withdrawn CN113722514A (en) 2021-09-03 2021-09-03 Internet education video image screening and extracting method based on deep learning

Country Status (1)

Country Link
CN (1) CN113722514A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114625897A (en) * 2022-03-21 2022-06-14 腾讯科技(深圳)有限公司 Multimedia resource processing method and device, electronic equipment and storage medium


Similar Documents

Publication Publication Date Title
CN111858954B (en) Task-oriented text-generated image network model
CN109359215B (en) Video intelligent pushing method and system
Zhou et al. A real-time global inference network for one-stage referring expression comprehension
CN109086664B (en) Dynamic and static fusion polymorphic gesture recognition method
CN109145083B (en) Candidate answer selecting method based on deep learning
CN114419642A (en) Method, device and system for extracting key value pair information in document image
CN113868448A (en) Fine-grained scene level sketch-based image retrieval method and system
CN117033609A (en) Text visual question-answering method, device, computer equipment and storage medium
CN110543551B (en) Question and statement processing method and device
CN113722514A (en) Internet education video image screening and extracting method based on deep learning
Tan et al. Towards embodied scene description
CN115186072A (en) Knowledge graph visual question-answering method based on double-process cognitive theory
CN116385830A (en) Sketch work intelligent evaluation method based on deep learning
CN113010712B (en) Visual question answering method based on multi-graph fusion
CN111078724A (en) Method, device and equipment for searching test questions in learning system and storage medium
Chuang et al. Facilitating architect-client communication in the pre-design phase
CN114911930A (en) Global and local complementary bidirectional attention video question-answering method and system
CN115359486A (en) Method and system for determining custom information in document image
WO2020237519A1 (en) Identification method, apparatus and device, and storage medium
CN113805977A (en) Test evidence obtaining method, model training method, device, equipment and storage medium
Pai et al. Multimodal integration, fine tuning of large language model for autism support
Chiu et al. Using rough set theory to construct e-learning faq retrieval infrastructure
Srilatha et al. 3D Smartlearning Using Machine Learning Technique
US20240070436A1 (en) System and method for cross-modal interaction based on pre-trained model
CN112613495B (en) Real person video generation method and device, readable storage medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20211130