CN113656641A - Efficient video retrieval system supporting fuzzy comment mining - Google Patents

Info

Publication number: CN113656641A
Authority: CN (China)
Prior art keywords: video, words, word, comment, mining
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN202110971077.3A
Other languages: Chinese (zh)
Inventor: 严大莲, 王�华
Current Assignee: Individual (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Individual

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention extracts the high-level abstract concepts of videos and implements a video retrieval system based on fuzzy comment mining. A purpose-written web spider collects comment data for various videos, which solves the problem of crawling dynamically loaded web page information. The comment data are first cleaned. An association rule mining algorithm then extracts sets of frequently occurring nouns from the comments, isolation pruning and relevance pruning remove noise sets, and pointwise mutual information filters out frequent words that are unlikely to relate to the topic, yielding a set of feature words that closely describe the video. A clustering algorithm groups repeated expressions. The invention proposes an information community, in which the possibly relevant content of a video corresponds to a series of feature nouns together with the related topics surrounding each feature word, and topics are mined with an LDA topic model. The system is implemented on the open-source Lucene retrieval framework and achieves higher accuracy and efficiency.

Description

Efficient video retrieval system supporting fuzzy comment mining
Technical Field
The invention relates to an efficient video retrieval system, in particular to an efficient video retrieval system supporting fuzzy comment mining, and belongs to the technical field of fuzzy video retrieval.
Background
With the rapid spread of mobile smart devices, acquiring videos has become increasingly convenient, and the number of online videos has begun to grow explosively. More and more people watch short videos on mobile devices, and given the massive volume of video on the Internet, fast and accurate video retrieval has become a focus of research and development.
Current video retrieval follows two directions. The first is keyword-based topic retrieval, which is essentially text retrieval. The video data to be retrieved must be labeled manually: on a given web page, an editor writes a title for the video and adds a related description, and the search engine builds an index over the title, tags, and text content of the page containing the video. After a user enters search keywords, the search engine analyzes them, matches them against the index, and returns the matching items to the user in order of relevance. The second is content-based video retrieval. The retrieval system first builds a feature library from the low-level visual information of the video data, and the frames of each video are segmented into shots in some manner. Because the volume of video data is so large, searching at the shot level is still difficult, so key frames are extracted from the different shots, and finally all key frames are clustered into different scenes.
To search for relevant videos in the massive video data on the Internet, the prior art offers a content-based approach: visual information in the media data is analyzed first, relevant features are extracted and indexed, and the media library is then searched through some form of interface interaction. In such a system, the video library stores the videos to be searched. The image feature library describes the content and structure of the video library and mainly comprises visual features, texture, color, shape, motion information, object information, and the like. The feature extraction algorithm library collects the tools that extract features from the videos in the video library and is the core module of a video search system. The search system is the user interface; it contains an index library, which indexes the features to enable fast search. Many research institutions, including well-known companies and major universities around the world, are active in this field. QBIC was the earliest successfully deployed content-based image retrieval system and was used to query paintings held in a Russian museum. The VideoQ system developed by Columbia University goes further and supports video retrieval: besides keyword queries, users can retrieve videos by their low-level visual features and spatiotemporal relationships. The VisualSeek system, also developed by Columbia University in the United States, builds on VideoQ to retrieve multimedia information stored on the Internet.
The PhotoBook system developed by the MIT Media Laboratory in the United States mainly targets face images and performs content-based retrieval by extracting visual features of faces.
Video retrieval systems in the prior art are still keyword-based. On this basis, video websites have made improvements such as adding a video classification system. Youku, for example, divides its videos into five major categories: TV dramas, movies, variety shows, music, and animation. Before a video is presented to users, a website editor manually writes a title and adds a related description, and the video is assigned to a page according to its category. When users select the category to search, retrieval efficiency improves. In such a classification system, however, a video can have only one category, while classification standards differ from person to person, so a single partition cannot satisfy everyone. The core of the problem is that existing retrieval systems lack an understanding of the high-level abstract concepts of the retrieved object, the video, such as the topics it expresses, the roles it contains, the concepts it sets out, and the scenes involved. Although content-based video retrieval systems understand a video at the image level, this is like analyzing the individual characters of an article: the high-level abstract conceptual information remains uninterpreted.
Whether keyword-based matching or content-based video retrieval, an approach considered purely from the algorithmic side cannot truly understand the meaning of a video the way a human does; only humans can fully understand video content. We have entered the big data era of information explosion, in which any behavior of any person on the network can, with authorization, be recorded, and these records can be processed and analyzed to extract valuable conclusions. The Internet makes human collective intelligence possible, and the experience of people who have already watched a video can be used to improve the accuracy and efficiency of video retrieval.
Video comment sections have become essential features of all major websites and apps. Most comments are text that users write about their impressions after watching a video, and comment data have three characteristics. First, the scale is large. Second, the variety is wide: video comments express many kinds of viewpoints, such as evaluations of the actors' performances, objective descriptions of the video content, or personal reflections on key plot points. Third, the value density is low: because commenters have different backgrounds and post casually, the volume of comments is huge but much of the data is worthless. The invention aims to mine large amounts of video comment data, use the experience of earlier viewers to uncover the inherent attributes of videos, and improve the accuracy of video retrieval in combination with optimized retrieval technology.
In summary, prior-art video retrieval systems have drawbacks, and the difficulties and problems the present invention sets out to solve are concentrated in the following aspects:
first, the number of online videos is growing explosively, and the prior art lacks a method for retrieving videos quickly and accurately from the massive volume of video on the Internet. Prior-art video retrieval systems are still keyword-based, and in their classification systems a video has only one category, while classification standards differ from person to person, so a single partition cannot satisfy everyone. The core of the problem is that current retrieval systems lack an understanding of the high-level abstract concepts of the retrieved object, the video, such as the topics it expresses, the roles it contains, the concepts it sets out, and the scenes involved. Content-based video retrieval systems understand a video at the image level, but this is like analyzing the individual characters of an article and cannot interpret high-level abstract concepts. Considered purely from the algorithmic side, neither keyword-based matching nor content-based retrieval can truly understand the meaning of a video the way a human does, so the accuracy and efficiency of video retrieval cannot be improved;
second, keyword-based video retrieval in the prior art relies on the rapid development of search engines. This retrieval mode is convenient, but its defects are obvious: video labeling requires a great deal of manual work, different people understand and describe videos differently, and only structured text information can be retrieved effectively, so retrieval accuracy is low. In addition, because the quality of online information is uneven, retrieval results often contain a great deal of noise, including many wrong results that do not answer the query, and web page titles are often seriously inconsistent with the video content;
third, content-based video retrieval has been a research hot spot in recent years. It identifies the content of a video algorithmically. Compared with traditional retrieval it needs little manual participation, so costs fall sharply and video annotation becomes far more efficient. It is also more direct, since the visual information of the image becomes the key to retrieval. The approach nevertheless faces objective difficulties. A huge gap exists between low-level visual information and high-level semantics, and this semantic gap is currently hard to bridge: computers cannot fully and correctly understand the meaning of people's varied language. Moreover, the number of Internet videos is growing extremely fast, and analyzing the content of every video by computer would make the algorithmic complexity too high. Content-based video retrieval therefore remains a long way from practical application;
fourth, more and more people post comments after watching videos online, and video comment sections have become essential features of all major websites and apps. Most comments are text that users write about their impressions after watching a video. The comment data are large in scale, wide in variety, and low in value density: because commenters have different backgrounds and post casually, the volume of comments is huge but much of the data is worthless. The prior art cannot extract the high-level abstract concepts of a video by mining and analyzing comment data, and it lacks a video retrieval system based on fuzzy comment mining, so users must enter precise query information to obtain results. Because the prior art can neither mine large amounts of video comment data nor uncover the inherent attributes of videos, the accuracy and efficiency of video retrieval are low.
Disclosure of Invention
To address the shortcomings of the prior art, the invention solves the prominent problem of accurately retrieving a required video from massive video data and provides a new video retrieval method that is efficient, accurate, moderate in algorithmic complexity, and highly practical: fuzzy retrieval based on video concept expansion. High-level abstract concept information that cannot be obtained intuitively, such as the topics a video expresses, the concepts it sets out, the roles it contains, and the scenes involved, is mined and refined, and the searchable fields of the video object are expanded, so that a user retrieving videos can query accurately and efficiently even when submitting a fuzzy, abstract description. An efficient video retrieval system supporting fuzzy comment mining is thereby realized.
To achieve the above technical features, the invention adopts the following technical scheme:
an efficient video retrieval system supporting fuzzy comment mining, which is based on fuzzy retrieval with video concept expansion, mines and refines the high-level abstract concept information that a video expresses but that cannot be obtained intuitively, and expands the searchable fields of the video object, so that a user retrieving videos can query accurately and efficiently even when submitting a fuzzy, abstract description;
first, the high-level abstract concepts of a video are extracted by mining and analyzing video comment data, realizing a video retrieval system based on fuzzy comment mining. The first task is to acquire the comments by crawling online video comment data: a purpose-written web spider collects comment data for various videos and solves the problem of crawling the dynamically loaded web page information of those videos;
second, after the comments are obtained, the information hidden in them is mined. The comment data are cleaned through word segmentation, part-of-speech tagging, and stop-word removal. An association rule mining algorithm then extracts sets of frequently occurring nouns from the comments, isolation pruning and correlation pruning remove noise sets, and pointwise mutual information filters out frequent words that are unlikely to relate to the topic, finally yielding a set of feature words that closely describe the video;
third, after the feature words are obtained, a clustering algorithm groups repeated expressions. The invention proposes an information community, in which the possibly related content of a video corresponds to a series of feature nouns and the related topics surrounding each feature word; the topics are mined with an LDA topic model, and once comment mining is finished each video corresponds to one information community. The retrieval system is implemented on the open-source Lucene retrieval framework;
video comment acquisition and feature extraction: a candidate feature word set is first extracted from the comments by association rule analysis, the extracted feature words are pruned by isolation and correlation, feature words irrelevant to the topic are filtered out using pointwise mutual information in combination with the web page titles, and the topic-related feature words are finally obtained;
feature word clustering and latent topic mining: once the feature words have been extracted, they are clustered and latent topics are mined per cluster. The character similarity and semantic similarity of the words are calculated first and used to define word similarity; the feature words are then cluster-analyzed by selecting vector features and applying the k-means++ clustering algorithm; finally, according to the feature word clustering results, latent topics are mined again in the original corpus with the LDA topic mining algorithm;
the efficient video retrieval system consists of three parts. The system is developed in Java, and the intermediate files generated while it runs, including the original comment data, word segmentation results, object files, pruning results, frequent feature word results, clustering results, and LDA mining result files, are stored as text.
In the efficient video retrieval system supporting fuzzy comment mining, further, video comments are acquired and features are extracted: nouns and noun phrases are regarded as the candidate set of possible feature words, and video comment features are extracted as follows:
first, video comment data are obtained with a web spider;
second, word segmentation and part-of-speech tagging are performed on the obtained video comments with the NLPIR system;
third, frequent word sets are found with association rules;
fourth, the frequent words are pruned with an isolation rule and a correlation rule, and invalid word sets are removed;
fifth, further pruning is performed with pointwise mutual information, and the candidate feature words are extracted;
the invention uses a web spider to obtain the raw video comment data; in addition to accessing web pages, the spider parses dynamically loaded data. The specific flow is as follows:
process 1: input a seed web page, download it, and extract all URLs from the page;
process 2: filter out web pages that do not contain videos and add the pages that contain videos to a queue;
process 3: download the URLs in the queue in order, using breadth-first traversal;
process 4: on each page containing a video, perform dynamic parsing, send a comment data request with the HttpClient toolkit, and store the data locally; videos with few comments are ignored.
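The crawl flow above can be sketched as a breadth-first traversal. This is a minimal illustration, not the patented implementation: the patent describes a Java spider using HttpClient, whereas the sketch below is Python and takes three hypothetical callables (`fetch_links`, `has_video`, `fetch_comments`) in place of real network access. The `min_comments` and `max_pages` parameters and the in-memory `SITE` and `COMMENTS` data are likewise assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl_video_comments(
    seed_url: str,
    fetch_links: Callable[[str], Iterable[str]],   # hypothetical: URLs found on a page
    has_video: Callable[[str], bool],              # hypothetical: does the page hold a video?
    fetch_comments: Callable[[str], List[str]],    # hypothetical: dynamic comment request
    min_comments: int = 2,                         # assumed cut-off for "few comments"
    max_pages: int = 100,                          # assumed safety limit
) -> Dict[str, List[str]]:
    """Breadth-first crawl that keeps comments only for pages containing a video."""
    queue = deque()
    seen = {seed_url}
    # Processes 1 and 2: extract the seed page's URLs and queue only video pages.
    for url in fetch_links(seed_url):
        if url not in seen and has_video(url):
            seen.add(url)
            queue.append(url)

    comments_by_url: Dict[str, List[str]] = {}
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()  # Process 3: FIFO order gives breadth-first traversal.
        visited += 1
        # Process 4: request the dynamically loaded comments for this video page.
        comments = fetch_comments(url)
        if len(comments) >= min_comments:  # Videos with few comments are ignored.
            comments_by_url[url] = comments
        for link in fetch_links(url):
            if link not in seen and has_video(link):
                seen.add(link)
                queue.append(link)
    return comments_by_url


# Toy in-memory "site" standing in for real pages (illustrative data only).
SITE = {"seed": ["v/1", "about", "v/2"], "v/1": ["v/3", "v/2"], "v/2": [], "v/3": ["seed"]}
COMMENTS = {"v/1": ["great plot", "funny"], "v/2": ["ok"], "v/3": ["a", "b", "c"]}

result = crawl_video_comments(
    "seed",
    fetch_links=lambda u: SITE.get(u, []),
    has_video=lambda u: u.startswith("v/"),
    fetch_comments=lambda u: COMMENTS.get(u, []),
)
```

In the toy run, `v/2` is visited but dropped because it has only one comment, and the `seen` set keeps the spider from revisiting pages that link back to each other.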
Extracting candidate feature words with association rules: item sets that meet the minimum support are extracted from the generated object files as candidate feature words, and frequent sets are found with a generate-and-test strategy. All candidate sets are found first, and those that do not qualify are filtered out by comparing the support of each set with the minimum support. Rules are then derived from the frequent sets; a rule must meet both the minimum support and the minimum confidence. All qualifying frequent sets are found recursively;
the invention considers only frequent sets of three items or fewer. Candidate feature words are extracted with association rules as follows, with p initially set to 1:
step 1, traverse the object files generated by video comment preprocessing and construct the set of all candidate item sets of size p;
step 2, scan the data set to determine whether each item set of size p (initially, a single element) meets the minimum support; the item sets that do form the frequent sets of size p;
step 3, generate the candidate sets of size p + 1 through the join step;
step 4, after the pruning step, scan the data set to determine whether each item set of p + 1 elements meets the minimum support; the item sets that do form the frequent sets of size p + 1;
step 5, set p = p + 1;
if no new frequent set is generated or p = 3, the procedure terminates; otherwise, jump to step 3.
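The steps above follow the familiar Apriori pattern of joining, pruning, and support counting. The sketch below is one possible reading of them, written in Python for brevity rather than in the system's Java. It assumes each comment has already been reduced to a set of nouns (one "transaction" per comment); the function name `frequent_noun_sets`, the relative support threshold, and the toy data are illustrative assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_noun_sets(
    transactions: List[Set[str]], min_support: float, max_size: int = 3
) -> Dict[FrozenSet[str], float]:
    """Generate-and-test search for frequent item sets of at most max_size items."""
    n = len(transactions)
    # Step 1: all candidate item sets of size p = 1.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    p = 1
    while candidates and p <= max_size:
        # Steps 2 and 4: scan the data and keep candidates meeting the minimum support.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        if not level:
            break  # No new frequent set was generated.
        frequent.update(level)
        # Step 3: join frequent sets of size p into candidates of size p + 1,
        # pruning any candidate that has an infrequent subset of size p.
        candidates = set()
        for a, b in combinations(list(level), 2):
            union = a | b
            if len(union) == p + 1 and all(
                frozenset(s) in level for s in combinations(union, p)
            ):
                candidates.add(union)
        p += 1  # Step 5.
    return frequent


# Illustrative noun sets, one per comment.
transactions = [
    {"plot", "actor", "music"},
    {"plot", "actor"},
    {"plot", "music"},
    {"actor", "ending"},
]
frequent = frequent_noun_sets(transactions, min_support=0.5)
```

With a minimum support of 0.5, the three single nouns `plot`, `actor`, and `music` and the pairs `{plot, actor}` and `{plot, music}` are frequent; the triple is pruned because `{actor, music}` is not.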
In the efficient video retrieval system supporting fuzzy comment mining, further, feature words are filtered with pointwise mutual information: some nouns that remain in the feature word set occur frequently but are irrelevant to the topic. The meaning of such words is indeterminate in general and becomes definite only in a specific semantic context. To filter them, the degree of correlation between a feature word and the topic is measured with pointwise mutual information. The nouns in the web page title serve as seed feature words: the title is segmented, its nouns are extracted, and the pointwise mutual information between each feature word and the seed feature words is calculated by the following formula:
Figure BDA0003225795100000061
where u is a seed feature word, that is, a noun extracted from the web page title; u1 is a feature word whose pointwise mutual information is to be calculated; Oxcs(u1, u) is the number of co-occurrences of the feature word and the seed feature word; Oxcs(u1) is the number of occurrences of the feature word alone; and Oxcs(u) is the number of occurrences of the seed feature word alone. A high pointwise mutual information value indicates strong correlation. A threshold α is set: if the value is greater than or equal to the threshold, the word is retained as a feature; otherwise it is filtered out;
a search engine is used to search for the terms, and the number of search results returned is taken as the number of occurrences of a term. Pointwise mutual information is thus combined with information retrieval to calculate the correlation between words, with the Baidu API as the tool. The larger the calculated value, the more likely the feature word is related to the topic, and results below the threshold are filtered out.
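The filtering step can be illustrated with a small Python sketch. The formula image is not reproduced in this text, so the exact form is an assumption: the sketch uses the common hit-count estimate log2(N · Oxcs(u1, u) / (Oxcs(u1) · Oxcs(u))), where the corpus size N (`total`) is a parameter introduced here for normalization. The hit counts are hard-coded illustrative numbers rather than results from the Baidu API, and `filter_feature_words` is a hypothetical helper name.

```python
import math
from typing import Dict, List, Tuple


def pmi(cooccur: int, count_word: int, count_seed: int, total: int) -> float:
    """Pointwise mutual information estimated from occurrence counts (assumed form)."""
    if cooccur == 0 or count_word == 0 or count_seed == 0:
        return float("-inf")  # Never co-occurring words get the lowest possible score.
    return math.log2(cooccur * total / (count_word * count_seed))


def filter_feature_words(
    candidates: List[str],
    seed: str,
    hits: Dict[str, int],                    # occurrences of each term alone
    pair_hits: Dict[Tuple[str, str], int],   # co-occurrences of (feature word, seed)
    total: int,
    alpha: float,
) -> List[str]:
    """Keep the candidates whose PMI with the seed word reaches the threshold alpha."""
    kept = []
    for word in candidates:
        score = pmi(pair_hits.get((word, seed), 0), hits.get(word, 0), hits.get(seed, 0), total)
        if score >= alpha:
            kept.append(word)
    return kept


# Illustrative counts only; a real run would take them from search result totals.
hits = {"plot": 100, "price": 100, "movie": 200}
pair_hits = {("plot", "movie"): 50, ("price", "movie"): 1}
kept = filter_feature_words(["plot", "price"], "movie", hits, pair_hits, total=10000, alpha=0.0)
```

Here `plot` co-occurs with the seed word far more often than chance would predict and is retained, while `price` scores below the threshold and is filtered out.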
The efficient video retrieval system supporting fuzzy comment mining further vectorizes text: the text is converted into a form a computer can process more easily, that is, it is vectorized. The vector space model represents each text as one vector in a vector space. During vectorization the text is first split into sentences, and the sentences are then split into their most basic components, namely characters, words, and phrases. These basic language units are collectively called feature items, and a text is represented as:
$$a_i = (u_{1,i}, u_{2,i}, \ldots, u_{m,i})$$
where $u_{j,i}$ is the weight of the j-th feature item in text $a_i$; how $u_{j,i}$ is calculated depends on how the feature item is defined. The cosine of the angle between two vectors is used as the similarity measure between the corresponding texts.
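As a minimal illustration of the vector space model described above, the Python sketch below represents tokenized texts over a fixed vocabulary and compares them by cosine similarity. Using raw term counts as the weights is an assumption made here for simplicity, since the text leaves the weighting to the definition of the feature item; `vectorize` and the sample tokens are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence


def vectorize(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Represent a tokenized text as term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocabulary = ["plot", "actor", "music"]
doc1 = vectorize(["plot", "actor", "plot"], vocabulary)
doc2 = vectorize(["plot", "music"], vocabulary)
similarity = cosine_similarity(doc1, doc2)
```

The two sample texts share only the term `plot`, so their similarity is 2 / (√5 · √2), roughly 0.63.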
The efficient video retrieval system supporting fuzzy comment mining further performs word approximation calculation and vector selection, beginning with the word approximation: feature word clustering gathers semantically similar words into the same cluster, which requires calculating the similarity of words. Two aspects of similarity are considered. The first is character approximation: the more characters two words share, the closer their meanings tend to be. All characters are treated as a set T of binary attributes; T contains the characters of string 1 and string 2 and is a superset of both. Let h be the number of characters contained in both string 1 and string 2, c the number contained only in string 1, t the number contained only in string 2, and r the number contained in neither. The quantities h, t, c, and r are the four state components used to compare the two strings. Characters that occur in neither string have no effect on the approximation of the two strings, so r is dropped, and the approximation of the two strings is defined as:
$$\mathrm{Sim}(s_1, s_2) = \frac{h}{h + c + t}$$
where $s_1$ and $s_2$ denote string 1 and string 2, h is the number of characters common to both strings, c is the number of characters contained only in string 1, and t is the number contained only in string 2.
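A short Python sketch of the character approximation follows. It assumes the approximation takes the Jaccard form h / (h + c + t) implied by the definitions of h, c, and t, and that the units being compared are the distinct characters of each string; the function name and sample words are illustrative.

```python
def character_similarity(s1: str, s2: str) -> float:
    """Jaccard-style approximation over the distinct characters of two strings."""
    chars1, chars2 = set(s1), set(s2)
    h = len(chars1 & chars2)  # characters contained in both strings
    c = len(chars1 - chars2)  # characters contained only in string 1
    t = len(chars2 - chars1)  # characters contained only in string 2
    if h + c + t == 0:
        return 0.0  # both strings are empty
    return h / (h + c + t)


# "剧情" (plot) and "剧本" (script) share one of three distinct characters.
score = character_similarity("剧情", "剧本")
```

Characters that appear in neither string never enter the calculation, which matches dropping the component r.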
In the efficient video retrieval system supporting fuzzy comment mining, further, the semantic similarity of words is calculated: word semantic similarity is calculated on the basis of a classification system, with HowNet selected as the dictionary. In HowNet a concept is a description of a word, and one word can be expressed by several concepts. A sememe is the smallest unit of meaning used to describe a concept; each concept is expressed by a group of sememes, and the sememes form a complex tree structure;
suppose word $U_1$ has m concepts $S_{11}, S_{12}, \ldots, S_{1m}$ and word $U_2$ has n concepts $S_{21}, S_{22}, \ldots, S_{2n}$. The invention specifies that the approximation between words $U_1$ and $U_2$ is the maximum of the approximations between their concepts, namely:
$$\mathrm{Sim}(U_1, U_2) = \max_{i = 1, \ldots, m;\; j = 1, \ldots, n} \mathrm{Sim}(S_{1i}, S_{2j})$$
all concepts are ultimately expressed in sememes, so finding the approximation of concepts comes down to finding the approximation of sememes. All sememes form a tree-shaped hierarchy according to their organizational relations, and the approximation of two sememes is found from the distance between their nodes. Assuming the distance between two sememes in the hierarchy is a, the semantic distance between the two sememes is obtained as:
Figure BDA0003225795100000072
where k1 and k2 are two sememes, β is an adjustable parameter, and a is the distance between the two sememes in the hierarchy. The word approximation can then be defined as:
Figure BDA0003225795100000073
where β and α are positive parameters less than 1; the word approximation is determined mainly by the semantic approximation, with 0 ≤ β ≤ 0.3, 0.7 ≤ α ≤ 1, and β + α = 1;
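The pieces of the word approximation can be combined as in the Python sketch below. Because the formula images are not reproduced in this text, the specific forms are assumptions: sememe approximation is taken as smoothing / (a + smoothing), a form commonly used with HowNet, and the word approximation as a weighted sum in which β weights the character approximation and α the semantic approximation. Each concept is simplified to a single sememe, and the child-to-parent map `parent` is a toy hierarchy, not HowNet data.

```python
from typing import Dict, Iterable, Optional


def tree_distance(parent: Dict[str, str], k1: str, k2: str) -> Optional[int]:
    """Number of edges between two nodes of a tree given as a child -> parent map."""
    depth_from_k1 = {}
    node, depth = k1, 0
    while node is not None:          # Walk from k1 up to the root.
        depth_from_k1[node] = depth
        node = parent.get(node)
        depth += 1
    node, depth = k2, 0
    while node is not None:          # Walk from k2 up until the paths meet.
        if node in depth_from_k1:
            return depth_from_k1[node] + depth
        node = parent.get(node)
        depth += 1
    return None                      # The two nodes share no ancestor.


def semantic_similarity(
    concepts1: Iterable[str], concepts2: Iterable[str],
    parent: Dict[str, str], smoothing: float = 1.0,   # assumed adjustable parameter
) -> float:
    """Maximum sememe approximation over all concept pairs of two words."""
    best = 0.0
    for k1 in concepts1:
        for k2 in concepts2:
            a = tree_distance(parent, k1, k2)
            if a is not None:
                best = max(best, smoothing / (a + smoothing))
    return best


def word_similarity(char_sim: float, sem_sim: float,
                    beta: float = 0.3, alpha: float = 0.7) -> float:
    """Weighted combination (assumed form) with beta + alpha = 1."""
    return beta * char_sim + alpha * sem_sim


# Toy sememe hierarchy (illustrative only).
parent = {"actor": "human", "human": "animate", "animal": "animate", "animate": "entity"}
sem = semantic_similarity(["actor"], ["animal", "human"], parent)
combined = word_similarity(0.2, sem)
```

In the toy hierarchy `actor` is one edge from `human` and three edges from `animal`, so the closer pair determines the semantic approximation.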
selecting vector features: M feature words are obtained in the feature word extraction process and are used as a set of features. The pairwise word approximations among the M feature words are calculated as described above, so that each feature word, that is, each clustering object, is represented by an M-dimensional feature vector Q(1, 2, …, M).
The efficient video retrieval system supporting fuzzy comment mining further clusters the feature words: the feature words are clustered with the K-means++ algorithm, which refines how K-means selects its initial center points. Instead of assigning the initial centers at random, it selects them according to the farthest-distance principle, as follows:
first, randomly select one point from the data as a cluster center;
second, for each remaining point x in the data set, compute the distance A(x) to its nearest cluster center;
third, select a new point as a new cluster center, such that points with larger A(x) are selected with larger probability;
fourth, repeat the second and third steps until P cluster centers have been selected;
fifth, run standard K-means with the P selected cluster centers as the initial cluster centers;
the number of clusters is not fixed, and a different number of clusters is set for different videos.
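The seeding steps above can be sketched as follows. This is a minimal Python illustration, not the implementation of the invention; it assumes points are numeric tuples and, as in the commonly published K-means++ procedure, weights each point by the squared distance to its nearest chosen center (the source only says that larger A(x) means larger probability).

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centers; far-away points are more likely to be chosen."""
    rng = rng or random.Random()
    centers = [rng.choice(points)]            # step 1: one random center
    while len(centers) < k:
        # step 2: A(x), here the squared distance to the nearest chosen center
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0:                 # every point coincides with a center
            centers.append(rng.choice(points))
            continue
        # step 3: sample the next center with probability proportional to A(x)
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
seeds = kmeanspp_seeds(points, k=2, rng=random.Random(0))
```

Points already chosen as centers have weight zero, so with distinct points the seeds are distinct.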
The efficient video retrieval system supporting fuzzy comment mining further mines LDA latent topics: on the basis of the clustering, the original comment data is mined further; an LDA topic model is used to mine the comment data a second time, and each topic is expressed as a series of words related to it;
latent topic mining after feature word clustering comprises the following steps:
step 1: classify the original comment data according to the clustering result;
step 2: perform LDA latent topic mining on the original comment data of each category;
the engineering implementation is based on Gregor Heinrich's LdaGibbsSampler.java class and rewrites a Corpus class and a Vocabulary class, wherein the Corpus class reads the comment corpus and the Vocabulary class processes the corpus that has been read to form the word list.
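For orientation, the idea behind collapsed Gibbs sampling for LDA can be sketched in a few lines of Python. This is a simplified illustration written for this text, not Gregor Heinrich's Java code and not the Corpus or Vocabulary classes of the invention; the hyperparameters `alpha` and `beta`, the toy vocabulary, and the toy documents are assumptions.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iterations=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # counts per (doc, topic)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # counts per (topic, word)
    topic_total = [0] * num_topics                              # tokens per topic
    assignments = []

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        assignments.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, assignments[d]):
            update(d, w, k, +1)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assignments[d][i], -1)   # remove the token's current topic
                weights = [(doc_topic[d][k] + alpha)
                           * (topic_word[k][w] + beta)
                           / (topic_total[k] + vocab_size * beta)
                           for k in range(num_topics)]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                update(d, w, k, +1)                   # record the resampled topic
    return topic_word, doc_topic

vocab = ["actor", "performance", "role", "music", "soundtrack", "song"]
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [4, 5, 3, 5]]
topic_word, doc_topic = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
# Each topic is presented as its most frequent words.
top_words = [[vocab[w] for w in sorted(range(len(vocab)), key=lambda w: -row[w])[:3]]
             for row in topic_word]
```

In the setting described above, one such model would be fitted separately to the comments of each category produced by the clustering.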
The efficient video retrieval system supporting fuzzy comment mining is further implemented as follows:
step one: build the index. The IndexWriter class is used to read the text data into memory, namely the video title, the clustered class clusters, the related feature words, and the related latent topics extracted from the comments. The video title and the web page URL are added, an IndexWriter is instantiated, and the extracted information is added to a Document object. The interface function is: private static void addDoc(IndexWriter w, String url, String title, String[] clusterics, String[] topics);
step two: build the search request. The user's input is read from standard input (stdin), the request is parsed by a Parse class, and a Query object is generated from the parsed result;
step three: create a Searcher object and search with the Query object generated above. The matching results are encapsulated in a TopScoreDocCollector object and returned; the number of results to return can be specified and is set through the create method of TopScoreDocCollector;
the org.apache.lucene.search package encapsulates several complex ranking algorithms, and the default search mode in the search process is fuzzy search. The search field must be specified when implementing the retrieval system. Because the mined latent topics are a semantic supplement to the clusters formed by the feature words, only the feature words and the video titles in the information community are searched, and the extracted topics are presented as supplementary content in the search results.
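The field restriction described above can be illustrated with a toy inverted index in Python. This sketch is a stand-in written for illustration and does not use or reproduce the Lucene API; the class `TinyIndex`, its methods, the exact-term matching (rather than fuzzy matching), and the example URLs are all assumptions. It mirrors the fields of `addDoc`: the title and cluster feature words are searchable, while topics are only stored and returned with each hit.

```python
from collections import defaultdict

class TinyIndex:
    """Toy stand-in for a search index (assumption: not the Lucene API)."""

    def __init__(self):
        self.docs = []
        self.postings = defaultdict(set)   # term -> ids of documents containing it

    def add_doc(self, url, title, clusters, topics):
        doc_id = len(self.docs)
        self.docs.append({"url": url, "title": title,
                          "clusters": clusters, "topics": topics})
        # Only title words and feature words are indexed; topics are stored only.
        for term in title.lower().split() + [c.lower() for c in clusters]:
            self.postings[term].add(doc_id)

    def search(self, query, limit=10):
        scores = defaultdict(int)
        for term in query.lower().split():
            for doc_id in self.postings.get(term, ()):
                scores[doc_id] += 1        # score = number of matching query terms
        ranked = sorted(scores, key=lambda d: (-scores[d], d))[:limit]
        return [self.docs[d] for d in ranked]

index = TinyIndex()
index.add_doc("http://example.com/v1", "Space Documentary",
              ["rocket", "astronaut"], ["the launch was breathtaking"])
index.add_doc("http://example.com/v2", "Cooking Show",
              ["chef", "recipe"], ["astronaut food joke"])
hits = index.search("astronaut rocket")
```

The second document mentions "astronaut" only in a topic, so it is not matched; the first document is returned together with its topics as supplementary content.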
The efficient video retrieval system supporting fuzzy comment mining further comprises a comment acquisition subsystem: the open-source web spider WebCollector 1.3 is adopted, and the crawled information mainly comprises three parts: video comment data, video title data, and video description data, which fall into two extraction types. The video title data and the video description data are static web page data and are extracted with the HTML parsing tool jsoup: the HTML is parsed into a DOM (document object model) tree, and the content of specified nodes is selected according to the DOM tree structure to obtain structured information. The comment data is not loaded directly as a static web page but is loaded dynamically through multiple network interactions, so an unmodified web spider cannot acquire it and WebCollector must be modified. Before the comment data is extracted, the network interaction parameters for fetching comments are analyzed: the network interaction is inspected with the Firebug tool to obtain the JSON parameters, the HttpClient toolkit is then used to simulate the network interaction that loads the comments, and the final comment data is acquired and written into a MySQL database;
the video retrieval subsystem: the retrieval system is based on the Lucene open-source retrieval framework. It first reads the information community file and builds an index over the information community data. It then segments the query statement received through Lucene's user retrieval interface, matches it against the index data, and returns the results to the user ranked by relevance. After the retrieval function is implemented, the core subsystem is exported as a jar package, and finally a C# visual interface is designed for the retrieval system.
Compared with the prior art, the invention has the following contributions and innovation points:
first, more and more people now post comments after watching videos online, and these comments contain their understanding of the videos. By mining and analyzing the comment data, the invention extracts high-level abstract concepts of the videos and implements a video retrieval system based on fuzzy comment mining. Comment data for videos on the Internet is crawled by a purpose-written web spider, which solves the problem of crawling dynamic web page information for all kinds of videos. After the comments are acquired, the information hidden in them is mined. The comment data is first cleaned; an association rule mining algorithm then extracts the sets of frequently occurring nouns from the comments, noise sets are removed by isolation pruning and relevance pruning, and pointwise mutual information filters out frequent words that are likely unrelated to the topic, yielding a feature word set closely tied to the video. A clustering algorithm then groups repeated expressions so that they can be filtered. Based on the information community, the system provides a series of feature nouns corresponding to the likely content of the video, together with the related topics surrounding each feature word; the topics are mined with the LDA topic model. The objects retrieved by the video retrieval system are therefore not only video titles. The retrieval system is implemented on the Lucene open-source retrieval framework and achieves higher accuracy and efficiency than a keyword-based retrieval system;
second, the invention addresses the much-discussed problem of accurately retrieving a required video from massive video data, and provides a new video retrieval method that is efficient, accurate, moderate in algorithmic complexity, and highly practical: fuzzy retrieval based on video concept extension. High-level abstract concept information that cannot be obtained intuitively, such as the theme a video expresses, the ideas it explains, the roles it contains, and the scenes it involves, is mined and refined, and the searchable fields of a video object are extended. As a result, a user can query accurately and efficiently even when submitting only a fuzzy, abstract description, and an efficient video retrieval system supporting fuzzy comment mining is realized;
third, whether a system is based on keyword matching or on video content, an approach that considers only algorithms cannot understand the meaning of a video the way a human does. Any activity of any person on the network can, with authorization, be recorded, and those records can be processed and analyzed to extract valuable conclusions; the Internet thus makes collective human intelligence possible. The invention builds on this idea and uses the experience of people who have watched a video to improve the accuracy and efficiency of video retrieval. The video comment section has become a standard feature of websites and apps, and most comments record what users felt after watching the video. Because commenters have different backgrounds and post casually, the volume of comments is huge, but it also contains a large amount of valueless data. The invention mines a large amount of video comment data, uses the experience of earlier viewers to uncover the inherent attributes of the video, and, combined with optimized retrieval technology, greatly improves the accuracy of video retrieval;
fourth, in actual video comments, people usually do not give a comprehensive overall review of the video; instead they comment on objects in it, such as a particular character, an impressive scene, or a memorable line. These evaluated objects are features of the video object and distinguish it from other videos, so the features people care about are extracted from the video comments. The invention treats nouns and noun phrases as the candidate set of possible feature words and uses a web spider to obtain the raw video comment data, which requires a dynamic data analysis process in addition to ordinary web page access. Because the corpus is incomplete and limited, co-occurrence counts taken from it are not objective; to compensate for this deficiency of the corpus and improve calculation accuracy, the invention queries terms with a search engine and uses the number of returned results as the number of occurrences of each term. The problem of crawling dynamic web page information for all kinds of videos is thereby solved, the relevance between feature words and video topics is improved, and a feature word set closely tied to the video is obtained;
fifth, because people's habits differ, descriptions of the same feature can take completely different forms, so the extracted feature words still contain several different descriptions of the same thing. Since this semantic repetition in differing forms would cause overlap in the subsequent topic mining, the feature words are clustered. In addition, online comments are written less formally than traditional text: users write casually, semantic information is scattered, and topics and topic domains are not obvious. The feature word extraction process considers only nouns, which also affects semantic completeness. Therefore, after clustering is finished, cluster-based latent topic mining is performed, which supplements the feature word information. Fuzzy retrieval is provided on the basis of video concept extension by mining high-level abstract concept information that cannot be obtained intuitively from the video, so that even a fuzzy, abstract description can be queried accurately and efficiently. Retrieval efficiency is high, the retrieval mode is intuitive, the algorithm is robust, and the practicability is high.
Drawings
Fig. 1 is a schematic flow chart of a video comment feature extraction method of the present invention.
FIG. 2 is a schematic diagram of a method for extracting candidate feature words by using association rules according to the present invention.
FIG. 3 is a schematic diagram of the topic mining process based on class clustering according to the present invention.
FIG. 4 is a Venn diagram of the word attribute set of the present invention.
FIG. 5 is a schematic diagram of information communities formed after clustering and LDA potential topic mining.
Fig. 6 is an organizational diagram of the video efficient retrieval system of the present invention.
FIG. 7 is a schematic diagram of the classification of comment data obtained by the web spider according to the present invention.
Detailed description of the invention
The technical solution of the efficient video retrieval system supporting fuzzy comment mining provided by the present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand the present invention and can implement the present invention.
With the rapid development of the mobile Internet, video sharing has gradually become a new mode of communication that is replacing traditional text and picture information. With the rapid spread of smart devices and the emergence of all kinds of video applications, Internet video data has grown explosively, and accurately retrieving a required video from massive video data has become a much-discussed problem. Prior-art video retrieval systems fall mainly into two modes, keyword-based and content-based, each with advantages and disadvantages: the former retrieves efficiently but returns noisy results, while the latter offers a more intuitive retrieval mode but is complex to implement and of limited practicability.
In order to overcome the defects of the two methods, the invention provides a novel video retrieval method that is efficient and accurate in retrieval, moderate in algorithmic complexity, and highly practical: fuzzy retrieval based on video concept extension. High-level abstract concept information that cannot be obtained intuitively, such as the theme a video expresses, the ideas it explains, the roles it contains, and the scenes it involves, is mined and refined, and the searchable fields of a video object are extended, so that a user can query accurately and efficiently even when submitting only a fuzzy, abstract description.
More and more people now post comments after watching videos online, and these comments contain their understanding of the videos. By mining and analyzing the comment data, high-level abstract concepts of the videos are extracted and a video retrieval system based on fuzzy comment mining is implemented. The first step is to acquire the comments: comment data for videos on the Internet is crawled by a purpose-written web spider, which solves the problem of crawling dynamic web page information for all kinds of videos.
After the comments are acquired, the information hidden in them is mined. The comment data is first cleaned through word segmentation, part-of-speech tagging, and stop-word removal. An association rule mining algorithm then extracts the sets of frequently occurring nouns from the comments, and noise sets are removed by isolation pruning and relevance pruning. In addition, to improve the relevance between the feature words and the video topic, pointwise mutual information is used to filter out frequent words that are likely unrelated to the topic, finally yielding a feature word set closely tied to the video.
After the feature words are obtained, several different descriptors of the same object still exist because people express themselves differently. A clustering algorithm is used to group these repeated expressions so that they can be filtered. Because only nouns are selected as feature words, some information relevant to the video is lost in the extraction process.
The retrieval system is implemented on the Lucene open-source retrieval framework. Finally, experiments show that the fuzzy comment retrieval system achieves higher accuracy and efficiency than a keyword-based retrieval system.
Firstly, obtaining video comments and extracting features
In actual video comments, people usually do not give a comprehensive overall review of a video; instead they comment on objects in it, such as a particular character, an impressive scene, or a memorable line. These evaluated objects are features of the video object and distinguish it from other videos, so the features people care about are extracted from the video comments. The invention treats nouns and noun phrases as the candidate set of possible feature words, and the video comment features are extracted by the following method, as shown in FIG. 1:
first, video comment data is obtained by a web spider;
second, word segmentation and part-of-speech tagging are performed on the obtained video comments with the NLPIR system;
third, frequent word sets are found with association rules;
fourth, the frequent words are pruned with the isolation rule and the relevance rule to remove invalid word sets;
fifth, pointwise mutual information is used for further pruning, and the candidate feature words are extracted.
(I) Video comment crawling
Video comment data is scattered across the network: users' comments on videos appear on social media, video websites, film review websites, and so on. Most websites on the Internet use dynamic web page technology to display content that involves a large amount of data interaction, and the total number of comments is extremely large, so manual collection is unrealistic. The comments are therefore crawled by a web spider through the following processes:
process one: input a seed web page, download it, and extract all URLs in the page;
process two: filter out web pages that do not contain videos, and add the web pages that contain videos to a queue;
process three: download the URLs in the queue in sequence, in breadth-first order;
process four: on each page containing a video, perform dynamic analysis, send the comment data request with the HttpClient toolkit, and store the data locally; videos with only a small number of comments are ignored.
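The four crawl processes can be sketched as a breadth-first traversal. This minimal Python illustration runs over an in-memory link table instead of the real network; the callables `fetch_links` and `has_video`, the `max_pages` limit, and the toy pages are hypothetical and stand in for page download, URL extraction, and video detection.

```python
from collections import deque

def crawl(seed, fetch_links, has_video, max_pages=100):
    """Breadth-first crawl from a seed page; returns video pages in visit order."""
    queue = deque([seed])
    seen = {seed}
    video_pages = []
    while queue and len(video_pages) < max_pages:
        url = queue.popleft()                  # process three: breadth-first order
        if has_video(url):
            video_pages.append(url)            # process four would fetch comments here
        for link in fetch_links(url):          # process one: extract all URLs
            if link in seen:
                continue
            seen.add(link)
            if has_video(link):                # process two: keep only video pages
                queue.append(link)
    return video_pages

# Toy link graph, for illustration only.
links = {"seed": ["v1", "about", "v2"], "v1": ["v3", "seed"],
         "v2": ["v3"], "v3": [], "about": ["v4"]}
videos = {"v1", "v2", "v3", "v4"}
order = crawl("seed", lambda u: links.get(u, []), lambda u: u in videos)
```

In this toy graph the page "about" is filtered out in process two, so "v4", which is linked only from it, is never reached.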
(II) video comment preprocessing
Comment preprocessing is the prerequisite and basis of comment mining. The NLPIR Chinese word segmentation system is first used to perform word segmentation and part-of-speech tagging on the video comment data acquired by the web spider. Because the corpus granularity for feature extraction is the sentence, the comments are first split into sentences according to the punctuation marks of the sentences.
After word segmentation, part-of-speech tagging, and sentence splitting, the comment corpus still contains some meaningless function words, which are removed as stop words. Some common nouns that occur frequently but are not video features are also added to the stop word list. After stop-word processing is finished, the remaining nouns and noun phrases are extracted to generate an object file.
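A minimal Python sketch of the sentence splitting and noun extraction follows. It does not call NLPIR: tokens are assumed to be already segmented and tagged as (word, tag) pairs, and the convention that noun tags start with "n", the punctuation set, and the stop word list are assumptions made for illustration.

```python
import re

# Sentence-ending punctuation, Chinese and Western (assumed set).
SENTENCE_END = re.compile(r"[。！？；!?.;]+")

def split_sentences(comment):
    """Split a comment into sentences at sentence-ending punctuation."""
    return [s.strip() for s in SENTENCE_END.split(comment) if s.strip()]

def extract_nouns(tagged_sentence, stop_words):
    """Keep words tagged as nouns that are not in the stop word list."""
    return [word for word, tag in tagged_sentence
            if tag.startswith("n") and word not in stop_words]

sentences = split_sentences("The plot is great! The actor is amazing. Loved it")
tagged = [("plot", "n"), ("is", "v"), ("great", "a"), ("thing", "n")]
nouns = extract_nouns(tagged, stop_words={"thing"})
```

Here "thing" plays the role of a frequent common noun that was added to the stop word list because it is not a video feature.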
(III) extracting candidate characteristic words by adopting association rules
Association rules are applied to the generated object files to extract the sets that meet the minimum support as candidate feature words. Frequent sets are found with a generate-and-test strategy: all candidate sets are generated first, and those that do not meet the condition are filtered out by comparing the support of each set with the minimum support. Rules are then derived from the frequent sets; a rule must satisfy both the minimum support and the minimum confidence. All frequent sets that meet the conditions are found recursively.
The invention considers only frequent sets of three items or fewer. Candidate feature words are extracted with association rules as follows, with p initially 1:
step 1: traverse the object files generated by video comment preprocessing and construct the set of all candidate items of size p;
step 2: scan the data set to determine whether each item set with only one element meets the minimum support for size p, and form the frequent set from the item sets that do;
step 3: generate the candidate set of size (p + 1) through the join step;
step 4: through the pruning step, scan the data set to determine whether each item set containing p + 1 elements meets the minimum support for size p + 1, and form the frequent set from the item sets that do;
step 5: set p = p + 1;
if no new frequent set is generated or p = 3, the procedure terminates; otherwise it jumps back to step 3. The algorithm is illustrated in FIG. 2.
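The join and prune loop above can be sketched in Python as follows. This is a minimal Apriori-style illustration under stated assumptions: each sentence is treated as a set of nouns, a single absolute support count `min_support` is used for every size (the source may use a different threshold per size), and the toy sentences are invented.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: support} for frequent itemsets of size 1..max_size."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    result, size = {}, 1
    while current and size <= max_size:
        for itemset in current:
            result[itemset] = support(itemset)
        if size == max_size:
            break
        # Join step: merge frequent sets of size p into candidates of size p + 1.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == size + 1}
        # Prune step: every size-p subset must be frequent, then test support.
        current = {c for c in candidates
                   if all(frozenset(sub) in current for sub in combinations(c, size))
                   and support(c) >= min_support}
        size += 1
    return result

sentences = [{"actor", "plot"}, {"actor", "plot", "music"},
             {"actor", "music"}, {"plot"}]
frequent = frequent_itemsets(sentences, min_support=2)
```

With these sentences the pair ("plot", "music") occurs only once, so it is not frequent and the three-item candidate is pruned without a scan.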
(IV) relevance pruning and isolation pruning
The frequent feature words obtained with association rules still contain a lot of noise. For example, single words occurring in isolation are meaningless in most cases, so the invention removes the single words in the one-item sets. In addition, because frequent multi-item sets are generated from smaller frequent sets, the feature words suffer from repetition and mismatch in two respects: first, subsets and supersets repeat each other; second, the words within a set may be unrelated. The existing results therefore need to be pruned by proximity (relevance) pruning and isolation pruning: relevance pruning mainly targets sets whose words are unrelated, and isolation pruning mainly targets the semantic duplication between subsets and supersets.
(1) Proximity pruning. Suppose a frequent feature phrase R consists of m words $(U_1, U_2, \ldots, U_m)$. If these words occur in order in a comment sentence W and the distance between any two adjacent words does not exceed the window value (3 words), R is said to be adjacent in W. If R occurs n times in the comment corpus and is adjacent in p sentences, R is kept as an adjacent feature phrase when p/n > β and p > 2; otherwise it is filtered out. β is determined by experiment.
(2) Isolation pruning. The isolation support of a frequent feature word is defined as the number of sentences that contain the word but do not contain any of its supersets. If the word occurs in n sentences of the comment corpus and occurs in isolation in p of them, it satisfies the isolation rule when p/n > β and p > 3; otherwise it is filtered out. β is determined by experiment.
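The two pruning rules can be sketched in Python as follows. This is an illustrative sketch, assuming that sentences are lists of words, that distance is the difference between word positions, and that n counts the sentences containing the phrase; the value of `beta` in any call is hypothetical because the source determines β by experiment.

```python
def is_adjacent(phrase, sentence, window=3):
    """True if the phrase words occur in order with gaps of at most `window`."""
    def search(word_idx, prev_pos):
        if word_idx == len(phrase):
            return True
        start = 0 if prev_pos is None else prev_pos + 1
        stop = len(sentence) if prev_pos is None else min(len(sentence),
                                                          prev_pos + window + 1)
        return any(sentence[pos] == phrase[word_idx] and search(word_idx + 1, pos)
                   for pos in range(start, stop))
    return search(0, None)

def passes_proximity(phrase, sentences, beta, window=3, min_adjacent=2):
    """Proximity rule: p / n > beta and p > 2."""
    containing = [s for s in sentences if all(w in s for w in phrase)]
    n = len(containing)
    p = sum(1 for s in containing if is_adjacent(phrase, s, window))
    return n > 0 and p / n > beta and p > min_adjacent

def passes_isolation(itemset, supersets, sentences, beta, min_isolated=3):
    """Isolation rule: p / n > beta and p > 3, where p counts sentences that
    contain the itemset but none of its supersets."""
    item = set(itemset)
    n = p = 0
    for s in sentences:
        words = set(s)
        if item <= words:
            n += 1
            if not any(set(sup) <= words for sup in supersets):
                p += 1
    return n > 0 and p / n > beta and p > min_isolated
```

For example, `is_adjacent(["lead", "actor"], ["the", "lead", "male", "actor"])` holds because the two words are two positions apart, which is within the window of 3.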
(V) filtering feature words by adopting point mutual information
The feature word set still contains some frequently occurring nouns that are irrelevant to the topic, and these words also need to be filtered out. Most of them are common auxiliary words whose meaning is indeterminate and becomes clear only in a specific semantic context. These words are filtered according to the relevance between the feature word and the topic, which the invention measures with pointwise mutual information. The nouns in the web page title serve as seed feature words: the web page title is segmented, its nouns are extracted, and the pointwise mutual information between each feature word and the seed feature words is calculated with the following formula:
Figure BDA0003225795100000141
Here zzs is the seed feature word, that is, a noun extracted from the web page title, and w1 is the feature word whose pointwise mutual information is to be calculated. Oxcs(u1, u) is the number of co-occurrences of the feature word and the seed feature word, Oxcs(u1) is the number of occurrences of the feature word alone, and Oxcs(u) is the number of occurrences of the seed feature word alone. A high pointwise mutual information value indicates strong relevance. A threshold α is set: if the value is greater than or equal to the threshold, the feature word is kept as a feature; otherwise it is filtered out.
Because the corpus is incomplete and limited, co-occurrence counts of feature words and seed feature words taken from it are not objective. To compensate for this deficiency of the corpus and improve calculation accuracy, terms are queried with a search engine, and the number of returned results is used as the number of occurrences of the term. In this way pointwise mutual information is combined with information retrieval to calculate the relevance between words, with the Baidu API as the tool. The larger the calculated pointwise mutual information value, the more likely the feature word is related to the topic; results below the threshold are filtered out.
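The filtering step can be sketched in Python as follows. The formula figure is not reproduced in this text, so the sketch assumes the common count-based form log2(co-occurrences / (occurrences of the word x occurrences of the seed)); the hit counts, the threshold, and the dictionary `hits` are invented for illustration, and no search engine API is called.

```python
import math

def pmi(cooccur, count_word, count_seed):
    """Count-based pointwise mutual information (assumed form, see lead-in)."""
    if cooccur == 0 or count_word == 0 or count_seed == 0:
        return float("-inf")
    return math.log2(cooccur / (count_word * count_seed))

def filter_by_pmi(candidates, seed, hits, threshold):
    """Keep candidates whose PMI with the seed word reaches the threshold."""
    kept = []
    for word in candidates:
        score = pmi(hits.get((word, seed), 0), hits.get(word, 0), hits.get(seed, 0))
        if score >= threshold:
            kept.append(word)
    return kept

# Hypothetical search result counts standing in for search engine hit counts.
hits = {"spaceship": 1000, "thing": 100000, "Interstellar": 500,
        ("spaceship", "Interstellar"): 200, ("thing", "Interstellar"): 50}
kept = filter_by_pmi(["spaceship", "thing"], "Interstellar", hits, threshold=-15)
```

With these invented counts "spaceship" scores about -11.3 and "thing" about -19.9, so only "spaceship" is kept at a threshold of -15.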
Second, feature word clustering and potential topic mining
Because people's habits differ, descriptions of the same feature can take completely different forms, so the extracted feature words still contain several different descriptions of the same thing. Since this semantic repetition in differing forms would cause overlap in the subsequent topic mining, the feature words are clustered. In addition, online comments are written less formally than traditional text: users write casually, semantic information is scattered, and topics and topic domains are not obvious. The feature word extraction process considers only nouns, which also affects semantic completeness. Therefore, after clustering is finished, cluster-based latent topic mining is performed to supplement the feature word information, as shown in FIG. 3.
(I) constructing a vector space model
Text analysis cannot operate directly on the original text, so the text must be converted into a form that a computer can process more easily, that is, vectorized. The vector space model represents a text as a vector in a vector space, one vector per text. During vectorization, the text is first split into sentences, and the sentences are then split into their most basic components, such as words and phrases. These basic language units are collectively called feature items, and a text is represented as:
$a_i = (u_{1,i}, u_{2,i}, \ldots, u_{m,i})$
where $u_{j,i}$ is the weight of the j-th feature item in text $a_i$. How $u_{j,i}$ is calculated depends on how the feature item is defined. The cosine of the angle between two vectors is used as the similarity measure between the corresponding texts.
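The cosine measure can be written as a short Python function. This is a minimal sketch that assumes two weight vectors of equal length; returning 0.0 for an all-zero vector is a convention chosen here, not something stated in the source.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0   # convention for empty texts (assumption)
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])
```

Vectors pointing in the same direction give a value of 1 regardless of their length, and vectors with no feature items in common give 0.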
(II) Word approximation calculation and vector selection
1. Calculating word approximation
Feature word clustering gathers words with similar semantics into the same cluster, which requires computing the similarity of words. Two aspects of similarity are considered. The first is character similarity: words whose characters are more similar tend to be semantically more similar. All characters are treated as a binary variable attribute set T, where T contains character string 1 and character string 2 and is a superset of both. Let h be the number of characters contained in both string 1 and string 2, c the number contained only in string 1, t the number contained only in string 2, and r the number contained in neither. h, t, c, and r are defined as the four state components for comparing string similarity; their organization is shown in the Venn diagram of FIG. 4. Characters that appear in neither string have no effect on the similarity of the two strings, so r is dropped, and the similarity of the two strings is defined as:
Figure BDA0003225795100000151
where h is the number of characters contained in both string 1 and string 2, c is the number contained only in string 1, and t is the number contained only in string 2.
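The character-level comparison can be sketched in Python. The formula figure is not reproduced in this text, so the sketch assumes the Jaccard form h / (h + c + t), which is the usual coefficient for binary attributes once r is dropped; this form is an assumption, not a quotation of the source formula.

```python
def string_similarity(s1, s2):
    """Assumed Jaccard form h / (h + c + t) over the characters of two strings."""
    a, b = set(s1), set(s2)
    h = len(a & b)   # characters contained in both strings
    c = len(a - b)   # characters contained only in string 1
    t = len(b - a)   # characters contained only in string 2
    total = h + c + t
    return h / total if total else 1.0

example = string_similarity("abc", "bcd")
```

For "abc" and "bcd", h = 2, c = 1, and t = 1, so the assumed similarity is 2 / 4 = 0.5.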
2. Computing word semantic approximation
The invention computes word semantic similarity on the basis of a classification system, with HowNet selected as the dictionary. A concept is a description of a word, and one word can be expressed as several concepts. A sememe is the smallest unit of meaning used to describe a concept, and each concept is represented by a group of sememes. Complex relationships exist among the sememes, which together form a complex tree structure.
Suppose word $U_1$ has m concepts $S_{11}, S_{12}, \ldots, S_{1m}$ and word $U_2$ has n concepts $S_{21}, S_{22}, \ldots, S_{2n}$. The invention specifies that the similarity between words $U_1$ and $U_2$ is the maximum of the similarities between their concepts, namely:
$$\mathrm{Sim}(U_1, U_2) = \max_{1 \le i \le m,\; 1 \le j \le n} \mathrm{Sim}(S_{1i}, S_{2j})$$
All concepts are ultimately expressed in sememes, so computing the similarity of concepts reduces to computing the similarity of sememes. The sememes form a tree-shaped hierarchy according to their organizational relations, and sememe similarity is obtained from the distance between sememe nodes. Assuming that the distance between two sememes in the hierarchy is a, the semantic distance between the two sememes is:
Figure BDA0003225795100000153
where k1 and k2 are the two sememes, β is an adjustable parameter, and a is the distance between the two sememes in the hierarchy. The word similarity can then be defined as:
Figure BDA0003225795100000154
where β and α are positive parameters less than 1; because word similarity is determined mainly by semantic similarity, 0 ≤ β ≤ 0.3, 0.7 ≤ α ≤ 1, and β + α = 1.
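The sememe-based computation can be sketched in Python under several stated assumptions. The formula figures are not reproduced in this text, so the sketch assumes the commonly used form β / (a + β) for the similarity of two sememes at tree distance a; it also simplifies each concept to a single sememe, uses a hypothetical value β = 1.0, and replaces HowNet with a tiny invented parent table.

```python
def path_to_root(node, parent):
    """List of nodes from `node` up to the root of the sememe tree."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def tree_distance(k1, k2, parent):
    """Number of edges between two sememes, or None if they share no ancestor."""
    depth_in_path2 = {n: d for d, n in enumerate(path_to_root(k2, parent))}
    for d1, node in enumerate(path_to_root(k1, parent)):
        if node in depth_in_path2:
            return d1 + depth_in_path2[node]
    return None

def sememe_similarity(k1, k2, parent, beta=1.0):
    """Assumed form beta / (a + beta), where a is the tree distance."""
    a = tree_distance(k1, k2, parent)
    return 0.0 if a is None else beta / (a + beta)

def word_semantic_similarity(concepts1, concepts2, parent, beta=1.0):
    """Maximum similarity over all concept pairs of the two words."""
    return max(sememe_similarity(s1, s2, parent, beta)
               for s1 in concepts1 for s2 in concepts2)

# Invented miniature hierarchy: child -> parent.
parent = {"human": "animate", "animal": "animate",
          "animate": "entity", "tool": "entity"}
similarity = word_semantic_similarity(["human", "tool"], ["animal"], parent)
```

Here "human" and "animal" are two edges apart and "tool" and "animal" three edges apart, so the maximum over concept pairs is 1 / (2 + 1).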
3. Selecting vector features
M feature words are obtained in the feature word extraction process. The pairwise word similarity among the M feature words is computed as described above, so that each feature word is represented by an M-dimensional feature vector Q(1, 2, …, M), which serves as the clustering object.
(III) feature word clustering
The feature words are clustered with the K-means++ clustering algorithm. K-means++ improves the selection of the initial center points of K-means: instead of specifying the initial center points at random, it selects the initial cluster centers according to the farthest-distance principle. The selection process is as follows:
Step 1: randomly select one point from the data as a cluster center;
Step 2: for each remaining point x in the data set, compute the distance A(x) to its nearest already-selected cluster center;
Step 3: select a new point as a new cluster center, such that points with a larger A(x) are selected with a larger probability;
Step 4: repeat steps 2 and 3 until P cluster centers have been selected;
Step 5: run standard K-means with the P selected cluster centers as the initial cluster centers.
The K-means++ algorithm cannot adaptively determine how many clusters to form, so the number of clusters is not fixed; a different number of clusters is set for different videos.
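The center-selection procedure can be sketched as below. The text only says that points with a larger A(x) are chosen with larger probability; this sketch assumes the usual K-means++ choice of a probability proportional to A(x) squared, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of K-means++ initial-center selection. */
final class KMeansPlusPlusSeeding {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of p initial centers chosen from the given points. */
    static List<Integer> chooseCenters(double[][] points, int p, Random rng) {
        if (p < 1 || p > points.length) {
            throw new IllegalArgumentException("p must be between 1 and the number of points");
        }
        List<Integer> centers = new ArrayList<>();
        centers.add(rng.nextInt(points.length)); // First center: chosen uniformly at random.

        // weight[i] holds A(x)^2 (assumed weighting): squared distance to the nearest center.
        double[] weight = new double[points.length];
        Arrays.fill(weight, Double.POSITIVE_INFINITY);

        while (centers.size() < p) {
            // Update each point's nearest-center distance using the newest center.
            double[] newest = points[centers.get(centers.size() - 1)];
            double total = 0.0;
            for (int i = 0; i < points.length; i++) {
                weight[i] = Math.min(weight[i], squaredDistance(points[i], newest));
                total += weight[i];
            }
            centers.add(pickNext(weight, total, centers, rng));
        }
        return centers;
    }

    private static int pickNext(double[] weight, double total, List<Integer> centers, Random rng) {
        if (total > 0.0) {
            double r = rng.nextDouble() * total;
            for (int i = 0; i < weight.length; i++) {
                r -= weight[i];
                if (r < 0.0) {
                    return i; // A larger weight means a larger chance of being hit.
                }
            }
            // Round-off fallback: take the point farthest from every center.
            int best = 0;
            for (int i = 1; i < weight.length; i++) {
                if (weight[i] > weight[best]) {
                    best = i;
                }
            }
            return best;
        }
        // Every remaining point coincides with a center: take the first unused index.
        for (int i = 0; i < weight.length; i++) {
            if (!centers.contains(i)) {
                return i;
            }
        }
        throw new IllegalStateException("no unused point left");
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}, {20, 0}};
        System.out.println(chooseCenters(points, 3, new Random(42)));
    }
}
```

The returned indices would then be handed to a standard K-means run as its initial centers; in the document's setting each point is the M-dimensional approximation vector of one feature word.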
(IV) Weka tool parsing clustering results
After the word approximation has been computed and the feature vectors obtained, the Weka software package is used as the clustering tool, for which the data files need to be reconstructed. Because a Chinese word can only be understood accurately in a definite context, manual intervention is needed for part of the results.
(V) mining LDA potential topics
The content of a video cannot be understood comprehensively from a few keywords alone; moreover, because the extracted feature words are all nouns, they can be hard to interpret. Therefore, on the basis of the clustering, the original comment data is mined further.
Although comments on a video center on the extracted feature nouns, they also touch on related topics around these keywords. Extracting only a series of feature nouns is not enough to understand the video: the descriptive words around the keywords (feature words) must also be found, and these descriptive words are no longer limited to nouns. Extracting nouns alone loses the emotional coloring and makes the content hard to understand. The comment data is therefore mined a second time with an LDA topic model, which expresses each topic as a series of words related to that topic.
The process of potential topic mining after clustering the feature words comprises the following steps:
Process 1: classify the original comment data according to the clustering result;
Process 2: perform LDA potential topic mining on the original comment data of each category.
The engineering implementation is based on Gregor Heinrich's LdaGibbsSampler.java class and rewrites a Corpus class and a Vocabulary class: the Corpus class reads the comment corpus, and the Vocabulary class processes the corpus that has been read to form the word list.
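Process 1 can be illustrated with a short sketch. The text does not say how a comment is assigned to a category, so this sketch assumes that a comment belongs to every cluster whose feature words it contains; the class name, method name and sample data are hypothetical, and the comments are assumed to be already segmented into words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: split segmented comments by feature-word cluster. */
final class CommentPartitioner {

    /**
     * clusters maps a cluster name to its feature words; each comment is a list of words.
     * Assumption: a comment joins a cluster when it shares at least one word with it.
     */
    static Map<String, List<List<String>>> partition(
            Map<String, List<String>> clusters, List<List<String>> comments) {
        Map<String, List<List<String>>> byCluster = new LinkedHashMap<>();
        for (String name : clusters.keySet()) {
            byCluster.put(name, new ArrayList<>());
        }
        for (List<String> comment : comments) {
            for (Map.Entry<String, List<String>> cluster : clusters.entrySet()) {
                if (!Collections.disjoint(comment, cluster.getValue())) {
                    byCluster.get(cluster.getKey()).add(comment);
                }
            }
        }
        return byCluster;
    }

    public static void main(String[] args) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        clusters.put("story", Arrays.asList("plot", "ending"));
        clusters.put("cast", Arrays.asList("actor", "director"));
        List<List<String>> comments = Arrays.asList(
                Arrays.asList("the", "plot", "is", "moving"),
                Arrays.asList("great", "actor", "weak", "ending"),
                Arrays.asList("nice", "music"));
        System.out.println(partition(clusters, comments));
    }
}
```

Each resulting group would then be passed to the LDA step as its own small corpus, which is what ties the mined topics to a specific class cluster.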
Third, the efficient video retrieval system
After clustering and LDA potential topic mining, an information community is formed. Its center is a set of class clusters made up of nouns that occur frequently in user comments, and each class cluster is surrounded by associated words; these words are not only nouns but a set of words of any kind related to the class cluster. The overall structure is shown in FIG. 5.
After the comments have been mined, each video has its own information community, and the information extracted from the video comments is stored in text form.
(I) Retrieval system
The video efficient retrieval system is realized by the following steps:
Step one: establish the index, mainly using the IndexWriter class. The text data is read into memory, including the video title, the clustered class clusters, the related feature words and the related potential topics extracted from the comments; the title of the video and the URL of the web page are added; IndexUTit is instantiated; and the extracted information is added to a Document object. The interface function for this is: private static void addDoc(IndexWriter w, String url, String title, String[] clusterics, String[] topics);
Step two: establish a search request. The user's input is read from standard input (stdin), the request is parsed by the Parse class, and a Query object is generated from the parsed result;
Step three: create a Searcher object and search with the Query object generated above. The matched results are encapsulated in a TopScoreDocCollector object and returned; the number of results to return can be specified and is set through the create method of TopScoreDocCollector.
Some complex ranking algorithms are packaged in the org.apache.lucene.search package, and the default retrieval method in the retrieval process is fuzzy retrieval. The retrieval domain (the fields to be searched) must be specified when the retrieval system is implemented. Because the potential topic mining content is a semantic supplement to the class clusters formed by the feature words, the invention searches only the feature words and the video titles in the information community, and the extracted topics are presented as supplementary content in the search results.
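The field restriction described above (search titles and feature words, show topics only as supplementary content) can be illustrated with a plain-Java stand-in. This sketch deliberately does not use the Lucene API; the Community class, the term-count scoring and the sample data are all hypothetical simplifications chosen to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Hypothetical in-memory stand-in for the indexed information communities. */
final class CommunitySearchSketch {

    /** One video's information community. */
    static final class Community {
        final String url;
        final String title;
        final List<String> featureWords;
        final List<String> topics; // Kept for display only, never searched.

        Community(String url, String title, List<String> featureWords, List<String> topics) {
            this.url = url;
            this.title = title;
            this.featureWords = featureWords;
            this.topics = topics;
        }
    }

    /** Counts the query terms that occur in the title or equal a feature word. */
    static int score(Community c, String query) {
        String title = c.title.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String term : query.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            boolean inFeatures = false;
            for (String word : c.featureWords) {
                if (word.toLowerCase(Locale.ROOT).equals(term)) {
                    inFeatures = true;
                    break;
                }
            }
            if (inFeatures || title.contains(term)) {
                hits++;
            }
        }
        return hits;
    }

    /** Returns at most maxResults matching communities, best score first. */
    static List<Community> search(List<Community> index, String query, int maxResults) {
        List<Community> matched = new ArrayList<>();
        for (Community c : index) {
            if (score(c, query) > 0) {
                matched.add(c);
            }
        }
        matched.sort(Comparator.comparingInt((Community c) -> score(c, query)).reversed());
        int limit = Math.max(0, Math.min(maxResults, matched.size()));
        return new ArrayList<>(matched.subList(0, limit));
    }

    public static void main(String[] args) {
        List<Community> index = Arrays.asList(
                new Community("http://example.com/v1", "Mountain documentary",
                        Arrays.asList("scenery", "music"), Arrays.asList("breathtaking")),
                new Community("http://example.com/v2", "City music festival",
                        Arrays.asList("band", "music"), Arrays.asList("lively")));
        for (Community c : search(index, "music festival", 10)) {
            // Title and URL first, then the community as supplementary content.
            System.out.println(c.title + " " + c.url + " " + c.featureWords + " " + c.topics);
        }
    }
}
```

A query that matches only a topic word returns nothing here, which mirrors the design choice in the text: topics enrich the displayed result but do not widen the searched domain.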
(II) displaying search results
In the video retrieval system realized by the invention, the user inputs a query sentence and the retrieval system returns the results with the highest keyword matching degree. The title and the URL of each retrieved video are returned first, followed by the information community related to the video, i.e., the class clusters of feature words extracted from the comments and the possibly related topic vectors surrounding those clusters. By looking at this information the user can get an overview of the content of the video.
Fourth, organizational framework of the efficient video retrieval system
(I) Overall structure of the retrieval system
The efficient video retrieval system mainly comprises three parts: a comment acquisition subsystem, a comment mining subsystem and a video retrieval subsystem. The organization of the system is shown in FIG. 6. Comment acquisition is mainly performed by a web spider. The comment mining subsystem mainly comprises feature word extraction and topic mining: feature word extraction mainly comprises association analysis, word pruning and pointwise mutual information calculation, and topic mining is mainly realized by cluster analysis and the LDA topic mining algorithm. The retrieval subsystem is mainly implemented on the Lucene retrieval framework.
The system is developed by adopting Java language, and integrates a comment crawling subsystem, a feature extraction subsystem, a theme mining subsystem and an information retrieval subsystem. During the operation of the system, various intermediate files can be generated, including original comment data, word segmentation results, object files, pruning results, frequent feature word results, clustering results and LDA mining result files which are stored in a text form.
(II) comment acquisition subsystem
The invention adopts the open-source web spider WebCollector 1.3. The crawled information mainly comprises three parts: video comment data, video title data and video description data, which fall into two extraction types. The video title data and the video description data are static webpage data and are extracted with the HTML parsing tool jsoup: the HTML data is parsed into a DOM (document object model) tree, and the content of specified nodes is selected according to the DOM tree structure to obtain structured information.
The comment data is not loaded directly as part of a static web page but is loaded dynamically through multiple network interactions, so an unmodified web spider cannot obtain it and WebCollector needs to be modified. Before the comment data is extracted, the network interaction parameters for obtaining the comments must be analyzed: the network interaction is analyzed with the Firebug tool to obtain the JSON parameters, the network interaction process of the comments is then simulated with the HttpClient toolkit, and the final comment data is obtained and written into a MySQL database. The process of obtaining the comment data through the web spider is shown in FIG. 7.
(III) comment mining subsystem
Comment mining is divided into two processes, feature extraction and topic mining. Feature extraction comprises two parts, comment preprocessing and feature word extraction. Comment preprocessing uses the NLPIR Chinese word segmentation system to perform Chinese word segmentation and part-of-speech tagging on the comments, then filters invalid information out of the comments with a stop word list and screens out the non-nouns, finally forming an object file. Feature word extraction uses Apriori association analysis to extract the associated words that may exist in the object file, then filters out unreasonable feature words with statistical rules and pointwise mutual information, finally obtaining the feature words.
Topic mining is divided into two parts, feature word clustering and LDA topic mining. Before clustering, the character approximation of the words is calculated first, then the semantic approximation of the words is calculated with HowNet, and the word approximation is defined by combining the character approximation and the semantic approximation. The feature words are then clustered according to this approximation with the K-means++ clustering algorithm. After clustering is completed, the original comments are re-divided according to the different clusters, topic mining is performed on the divided comment data, and a series of topic vectors related to the clusters is finally obtained.
(IV) video retrieval subsystem
The retrieval system is based on the Lucene open-source retrieval framework. It first reads the information community files and builds an index over the information community data; it then segments the query sentence through Lucene's user retrieval interface, matches it against the index data, and returns the results to the user ranked by relevance. After the retrieval function is realized, the core subsystem is exported as a JAR package, and finally a C# graphical interface is designed for the retrieval system.

Claims (10)

1. The video efficient retrieval system supporting the fuzzy comment mining is characterized in that high-level abstract concept information which cannot be intuitively obtained and needs to be expressed by a video is mined and refined based on the fuzzy retrieval of video concept expansion, and a retrieved domain of a video object is expanded, so that a user can accurately and efficiently inquire even if submitting a fuzzy abstract description when retrieving the video;
first, the high-level abstract concepts of a video are extracted by mining and analyzing video comment data, realizing a video retrieval system based on fuzzy comment mining; comments are acquired first: the network video comment data is crawled, comment data of various videos is collected by coding a web spider, and the problem of crawling the dynamic webpage information of the various videos is solved;
secondly, after the comments are obtained, the information hidden in them is mined: the comment data is cleaned through word segmentation, part-of-speech tagging and stop-word removal; a set of frequently occurring nouns is then extracted from the comments with an association rule mining algorithm, noise sets are removed by isolation pruning and correlation pruning, and frequent words that may be unrelated to the topic are filtered out with pointwise mutual information, finally yielding a feature word set that is close to the video;
thirdly, after the feature words are obtained, repeated expressions are clustered with a clustering algorithm; based on the information community, the method maps the possibly related content of a video to a series of feature nouns and the related topics surrounding each feature word; the topics are mined with an LDA topic model, and after comment mining is finished each video corresponds to one information community; the retrieval system is implemented on the Lucene open-source retrieval framework;
acquiring video comments and extracting features: first, a candidate feature word set is extracted from the comments by association rule analysis; the extracted feature words are pruned by isolation pruning and correlation pruning; feature words irrelevant to the topic are filtered out with pointwise mutual information in combination with the webpage titles; and the feature words relevant to the topic are finally extracted;
feature word clustering and potential topic mining: once the feature words have been extracted, they are clustered and potential topics are mined per cluster; first, the character approximation and the semantic approximation of the words are calculated and the word approximation is defined from the two; then the feature words are cluster-analyzed by selecting vector features and applying the K-means++ clustering algorithm; finally, according to the feature word clustering results, potential topics are mined again from the original corpus with the LDA topic mining algorithm;
the efficient video retrieval system consists of three parts: a comment acquisition subsystem, a comment mining subsystem and a video retrieval subsystem; the system is developed in the Java language, and the intermediate files generated while the system runs include the original comment data, word segmentation results, object files, pruning results, frequent feature word results, clustering results and LDA mining result files, all stored in text form.
2. The video efficient retrieval system supporting fuzzy comment mining as claimed in claim 1, wherein video comments are obtained and features are extracted: regarding nouns or noun phrases as a candidate set of possible feature words, extracting video comment features according to the following method:
firstly, video comment data are obtained through a web spider;
secondly, performing word segmentation and part-of-speech tagging on the obtained video comment by adopting an NLPIR system;
thirdly, searching a frequent word set by adopting an association rule;
fourthly, pruning the frequent words by adopting an isolation rule and a correlation rule and removing an invalid word set;
fifthly, pruning with pointwise mutual information and extracting the candidate feature words;
the invention uses a web spider to obtain the original video comment data, which involves a dynamic data analysis process in addition to the webpage access process; the specific flow is as follows:
Process 1: input a seed webpage, download it, and extract all URLs in the webpage information;
Process 2: filter out web pages that do not contain videos, and add the web pages that contain videos to a queue;
Process 3: download the URLs in the queue in sequence in breadth-first order;
Process 4: perform dynamic analysis in each page containing a video, send a comment data request with the HttpClient toolkit, store the data locally, and ignore videos with only a small number of comments;
extracting candidate feature words with association rules: item sets that meet the minimum support are extracted from the generated object files as candidate feature words, and the frequent sets are found with a generate-and-test strategy: all candidate sets are generated first, those that do not meet the condition are filtered out by comparing the support of each set with the minimum support, and the rules are then found from the frequent sets; the rules must meet the minimum support and the minimum confidence, and all frequent sets that meet the condition are found recursively;
the invention only considers frequent sets of three items or fewer; candidate feature words are extracted with association rules as follows, with p initially 1:
step 1, traverse the object files generated by video comment preprocessing and construct the set of all candidate items of size p;
step 2, scan the data set to determine whether the item sets containing only one element meet the minimum support for size p; the item sets that do form a frequent set;
step 3, generate the candidate set of size (p+1) through the join step;
step 4, through the pruning step, scan the data set to determine whether the item sets containing p+1 elements meet the minimum support for size p+1; the item sets that do form a frequent set;
step 5, set p = p + 1;
if no new frequent set is generated or p is 3, the procedure terminates; otherwise, jump to step 3.
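The steps of claim 2 follow the Apriori pattern, which can be sketched as below. This is an illustration under stated assumptions rather than the claimed implementation: each comment's object file entry is modeled as a set of nouns, a single absolute minimum support count is used for every size, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical Apriori sketch limited to item sets of at most maxSize items. */
final class AprioriSketch {

    /** Number of transactions that contain every item of itemSet. */
    static int support(Set<String> itemSet, List<Set<String>> transactions) {
        int count = 0;
        for (Set<String> t : transactions) {
            if (t.containsAll(itemSet)) {
                count++;
            }
        }
        return count;
    }

    static List<Set<String>> frequentSets(List<Set<String>> transactions, int minSupport, int maxSize) {
        List<Set<String>> result = new ArrayList<>();
        // Step 1: all candidate item sets of size 1.
        Set<Set<String>> candidates = new HashSet<>();
        for (Set<String> t : transactions) {
            for (String item : t) {
                candidates.add(new TreeSet<>(Arrays.asList(item)));
            }
        }
        for (int p = 1; p <= maxSize && !candidates.isEmpty(); p++) {
            // Steps 2 and 4: keep the candidates that meet the minimum support.
            Set<Set<String>> frequent = new HashSet<>();
            for (Set<String> c : candidates) {
                if (support(c, transactions) >= minSupport) {
                    frequent.add(c);
                }
            }
            result.addAll(frequent);
            // Step 3: join frequent sets of size p into candidates of size p + 1,
            // pruning candidates that have an infrequent subset of size p.
            candidates = new HashSet<>();
            List<Set<String>> list = new ArrayList<>(frequent);
            for (int i = 0; i < list.size(); i++) {
                for (int j = i + 1; j < list.size(); j++) {
                    Set<String> union = new TreeSet<>(list.get(i));
                    union.addAll(list.get(j));
                    if (union.size() == p + 1 && allSubsetsFrequent(union, frequent)) {
                        candidates.add(union);
                    }
                }
            }
        }
        return result;
    }

    private static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> frequent) {
        for (String item : candidate) {
            Set<String> subset = new TreeSet<>(candidate);
            subset.remove(item);
            if (!frequent.contains(subset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Set<String>> comments = new ArrayList<>();
        comments.add(new HashSet<>(Arrays.asList("plot", "actor", "music")));
        comments.add(new HashSet<>(Arrays.asList("plot", "actor")));
        comments.add(new HashSet<>(Arrays.asList("plot", "music")));
        comments.add(new HashSet<>(Arrays.asList("actor", "scenery")));
        System.out.println(frequentSets(comments, 2, 3));
    }
}
```

Calling `frequentSets(comments, 2, 3)` corresponds to the limit of three items stated in the claim; the loop also stops early as soon as a pass produces no candidates.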
3. The video-efficient retrieval system supporting fuzzy comment mining as claimed in claim 1, wherein feature words are filtered using pointwise mutual information: a portion of the nouns that remain in the feature word set are irrelevant to the topic but occur frequently; the meaning of such words is uncertain on its own and becomes definite only in a specific semantic context. To filter out these words, the degree of correlation between a feature word and the topic is measured by pointwise mutual information. The nouns in the webpage title are used as seed feature words: the webpage title is segmented, the nouns are extracted from it, and the pointwise mutual information between a feature word and a seed feature word is calculated by the following formula:
Figure FDA0003225795090000031
where zzs is a seed feature word, i.e., a noun extracted from the webpage title (denoted u in the counts), and w1 is the feature word whose pointwise mutual information is to be calculated (denoted u1 in the counts); Oxcs(u1, u) is the number of times the feature word and the seed feature word co-occur, Oxcs(u1) is the number of times the feature word occurs on its own, and Oxcs(u) is the number of times the seed feature word occurs on its own. A high pointwise mutual information value indicates strong correlation. A threshold α is set: if the value is greater than or equal to the threshold, the word is retained as a feature; otherwise it is filtered out;
a search engine is used to search for the terms, and the number of returned search results is taken as the number of occurrences of a term; pointwise mutual information is thus combined with information retrieval to calculate the correlation between words, with the Baidu API as the tool. The larger the calculated pointwise mutual information value, the higher the probability that the feature word is related to the topic; results below the threshold are filtered out.
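The filtering rule can be sketched as follows. The formula image of this claim is not recoverable from the text, so the sketch assumes the common hit-count form log2(Oxcs(u1, u) / (Oxcs(u1) · Oxcs(u))); the class name, the method names and the sample counts are hypothetical, and in the described system the counts would come from search engine result numbers.

```java
/** Hypothetical sketch of the pointwise-mutual-information filter. */
final class PmiFilter {

    /**
     * Assumed form: log2(co / (featureCount * seedCount)).
     * Returns negative infinity when any count is zero, so the word is always filtered.
     */
    static double pmi(long coOccurrences, long featureCount, long seedCount) {
        if (coOccurrences <= 0 || featureCount <= 0 || seedCount <= 0) {
            return Double.NEGATIVE_INFINITY;
        }
        double ratio = (double) coOccurrences / ((double) featureCount * (double) seedCount);
        return Math.log(ratio) / Math.log(2.0);
    }

    /** Keeps the feature word when its value reaches the threshold. */
    static boolean keep(long coOccurrences, long featureCount, long seedCount, double threshold) {
        return pmi(coOccurrences, featureCount, seedCount) >= threshold;
    }

    public static void main(String[] args) {
        // Made-up hit counts for one feature word and one seed feature word.
        System.out.println(pmi(8, 4, 4));
        System.out.println(keep(8, 4, 4, -2.0));
    }
}
```

Because raw hit counts are not normalized by a corpus size in this assumed form, the values are typically negative and only their relative order and the chosen threshold matter.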
4. The video-efficient retrieval system supporting fuzzy comment mining as claimed in claim 1, wherein a vector space model is constructed: the text is converted into information that is easier for a computer to process, that is, the text is vectorized. The vector space model represents a text as a vector in a vector space, one vector per text. In the vectorization process, the text is first split into sentences, and the sentences are then split into their most basic components, namely characters, words and phrases; these basic language units are collectively called feature items. A text is thus expressed as:
$$a_i=(u_{1,i},u_{2,i},\dots,u_{m,i})$$
where $u_{j,i}$ is the weight of the j-th feature item in the text $a_i$, and how $u_{j,i}$ is calculated is determined by the definition of the feature item. The similarity between texts is measured by the cosine of the angle between their vectors.
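The cosine measure named in this claim can be sketched as below. The class name, the method name and the choice to return 0 for an all-zero vector are assumptions of this sketch; the weights themselves are taken as given.

```java
/** Hypothetical sketch of cosine similarity between two text vectors. */
final class CosineSimilarity {

    /** Cosine of the angle between a and b; 0 if either vector is all zeros. */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int j = 0; j < a.length; j++) {
            dot += a[j] * b[j];
            normA += a[j] * a[j];
            normB += b[j] * b[j];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] text1 = {1.0, 2.0, 0.0};
        double[] text2 = {2.0, 1.0, 1.0};
        System.out.println(cosine(text1, text2));
    }
}
```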
5. The video-efficient retrieval system supporting fuzzy comment mining as claimed in claim 1, wherein, in the word approximation calculation and vector selection, the word approximation is calculated as follows: feature word clustering gathers words with similar semantics into the same cluster, so the approximation degree of words must be calculated, and two aspects are considered. The first is the character approximation: the higher the character approximation of two words, the higher their semantic approximation. All characters are treated as a set T of binary variable attributes, where T is a superset of character string 1 and character string 2. Let h be the total number of characters contained in both character string 1 and character string 2, c the total number of characters contained in character string 1 but not in character string 2, t the total number of characters contained in character string 2 but not in character string 1, and r the total number of characters contained in neither character string. h, c, t and r are defined as the four state components for comparing the approximation of the two character strings. Characters that occur in neither string have no effect on the approximation of the two strings, so r is dropped, and the approximation degree of the two character strings is defined as:
Figure FDA0003225795090000032
where h is the total number of characters contained in both character string 1 and character string 2, c is the total number of characters contained only in character string 1, and t is the total number of characters contained only in character string 2.
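The counts h, c and t can be computed as in the sketch below. The formula image of this claim is not recoverable from the text, so the sketch assumes the Jaccard form h / (h + c + t), which is the usual measure for binary attributes once r is dropped; the class and method names are hypothetical, and an empty pair of strings is assumed to score 0.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the character approximation of two strings. */
final class CharacterApproximation {

    /** Distinct characters (code points) of a string. */
    static Set<Integer> characters(String s) {
        Set<Integer> set = new HashSet<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            set.add(codePoint);
            i += Character.charCount(codePoint);
        }
        return set;
    }

    /** Assumed Jaccard form: h / (h + c + t). */
    static double similarity(String s1, String s2) {
        Set<Integer> first = characters(s1);
        Set<Integer> second = characters(s2);
        Set<Integer> common = new HashSet<>(first);
        common.retainAll(second);
        int h = common.size();          // in both strings
        int c = first.size() - h;       // only in string 1
        int t = second.size() - h;      // only in string 2
        if (h + c + t == 0) {
            return 0.0;
        }
        return (double) h / (h + c + t);
    }

    public static void main(String[] args) {
        // Two two-character Chinese words sharing their first character,
        // written as Unicode escapes to stay encoding-independent.
        System.out.println(similarity("\u7535\u5f71", "\u7535\u89c6"));
    }
}
```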
6. The video-efficient retrieval system supporting fuzzy comment mining of claim 5, wherein the word semantic approximation is calculated as follows: the semantic approximation of words is calculated on the basis of a classification system, with HowNet selected as the dictionary; a concept is a description of a word, and one word can be expressed as several concepts; a sememe is the smallest unit of meaning used to describe a concept, each concept is expressed by a group of sememes, and the sememes form a complex tree structure;
suppose a word U1 has m concepts S11, S12, …, S1m, and a word U2 has n concepts S21, S22, …, S2n; the invention specifies that the approximation degree between the words U1 and U2 is the maximum of the approximation degrees between their concepts, namely:
$$Xs(U_1,U_2)=\max_{i=1,\dots,m;\ j=1,\dots,n} Xs(S_{1i},S_{2j})$$
all concepts are ultimately expressed in terms of sememes, so computing the approximation degree of concepts reduces to computing the approximation degree of sememes; all sememes form a tree-shaped sememe hierarchy according to their organizational relations, and the approximation degree of two sememes is computed from the distance between their nodes; assuming that the distance between two sememes in the hierarchy is a, the semantic distance between the two sememes is given by:
Figure FDA0003225795090000042
where k1 and k2 are the two sememes, β is an adjustable parameter, and a is the distance between the two sememes in the hierarchy; the word approximation can then be defined as:
$$Xs(u_1,u_2)=\beta\cdot Xs_{words}(u_1,u_2)+\alpha\cdot Xs_{meaning}(u_1,u_2)$$
where β and α are weighting parameters between 0 and 1; because the word approximation is determined mainly by the semantic approximation, 0 ≤ β ≤ 0.3, 0.7 ≤ α ≤ 1, and β + α = 1;
selecting vector features: M feature words are obtained in the feature word extraction process, and these M feature words are used as a group of features; for each feature word, its word approximation to each of the M feature words is computed as described above, so that each feature word, i.e., each clustering object, is represented by an M-dimensional feature vector Q(1, 2, …, M).
7. The video-efficient retrieval system supporting fuzzy comment mining as claimed in claim 1, wherein the feature words are clustered as follows: the feature words are clustered with the K-means++ clustering algorithm; K-means++ improves the selection of the initial center points of K-means: instead of specifying the initial center points at random, it selects the initial cluster centers according to the farthest-distance principle, and the selection process is as follows:
step 1: randomly select one point from the data as a cluster center;
step 2: for each remaining point x in the data set, compute the distance A(x) to its nearest already-selected cluster center;
step 3: select a new point as a new cluster center, such that points with a larger A(x) are selected with a larger probability;
step 4: repeat steps 2 and 3 until P cluster centers have been selected;
step 5: run standard K-means with the P selected cluster centers as the initial cluster centers;
the number of clusters is not fixed, and a different number of clusters is set for different videos.
8. The video-efficient retrieval system with support for fuzzy comment mining of claim 1, wherein mining LDA potential topics: on the basis of clustering, further mining the original comment data; adopting an LDA topic model to carry out secondary mining on the comment data, and expressing the topic as a series of words related to the topic;
the process of potential topic mining after clustering the feature words comprises the following steps:
Process 1: classify the original comment data according to the clustering result;
Process 2: perform LDA potential topic mining on the original comment data of each category;
the engineering implementation is based on Gregor Heinrich's LdaGibbsSampler.java class and rewrites a Corpus class and a Vocabulary class: the Corpus class reads the comment corpus, and the Vocabulary class processes the corpus that has been read to form the word list.
9. The video-efficient retrieval system supporting fuzzy comment mining as claimed in claim 1, wherein the video-efficient retrieval system is implemented by the following processes:
step one: establish the index using the IndexWriter class; the text data is read into memory, including the video title, the clustered class clusters, the related feature words and the related potential topics extracted from the comments; the title of the video and the URL of the web page are added; IndexUTit is instantiated; and the extracted information is added to a Document object, the interface function being: private static void addDoc(IndexWriter w, String url, String title, String[] clusterics, String[] topics);
step two: establish a search request; the user's input is read from standard input (stdin), the request is parsed by the Parse class, and a Query object is generated from the parsed result;
step three: create a Searcher object and search with the Query object generated above; the matched results are encapsulated in a TopScoreDocCollector object and returned, wherein the number of results to return can be specified and is set through the create method of TopScoreDocCollector;
some complex ranking algorithms are packaged in the org.apache.lucene.search package, and the default retrieval method in the retrieval process is fuzzy retrieval; the retrieval domain (the fields to be searched) must be specified when the retrieval system is implemented; the potential topic mining content is a semantic supplement to the class clusters formed by the feature words, so only the feature words and the video titles in the information community are searched, and the extracted topics are presented as supplementary content in the search results.
10. The video-efficient retrieval system with support for fuzzy comment mining of claim 1, wherein the comment acquisition subsystem adopts the open-source web spider WebCollector 1.3, and the crawled information mainly comprises three parts: video comment data, video title data and video description data, which fall into two extraction types; the video title data and the video description data are static webpage data and are extracted with the HTML parsing tool jsoup: the HTML data is parsed into a DOM (document object model) tree, and the content of specified nodes is selected according to the DOM tree structure to obtain structured information; the comment data is not loaded directly as part of a static web page but is loaded dynamically through multiple network interactions, so an unmodified web spider cannot obtain it and WebCollector needs to be modified; before the comment data is extracted, the network interaction parameters for obtaining the comments must be analyzed: the network interaction is analyzed with the Firebug tool to obtain the JSON parameters, the network interaction process of the comments is then simulated with the HttpClient toolkit, and the final comment data is obtained and written into a MySQL database;
the video retrieval subsystem: the retrieval system is based on the Lucene open-source retrieval framework; it first reads the information community files and builds an index over the information community data, then segments the query sentence through Lucene's user retrieval interface, matches it against the index data, and returns the results to the user ranked by relevance; after the retrieval function is realized, the core subsystem is exported as a JAR package, and finally a C# graphical interface is designed for the retrieval system.
CN202110971077.3A 2021-08-23 2021-08-23 Efficient video retrieval system supporting fuzzy comment mining Pending CN113656641A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110971077.3A CN113656641A (en) 2021-08-23 2021-08-23 Efficient video retrieval system supporting fuzzy comment mining

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110971077.3A CN113656641A (en) 2021-08-23 2021-08-23 Efficient video retrieval system supporting fuzzy comment mining

Publications (1)

Publication Number Publication Date
CN113656641A true CN113656641A (en) 2021-11-16

Family

ID=78481694

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110971077.3A Pending CN113656641A (en) 2021-08-23 2021-08-23 Efficient video retrieval system supporting fuzzy comment mining

Country Status (1)

Country Link
CN (1) CN113656641A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609512A (en) * 2012-02-07 2012-07-25 北京中机科海科技发展有限公司 System and method for heterogeneous information mining and visual analysis
CN103544242A (en) * 2013-09-29 2014-01-29 广东工业大学 Microblog-oriented emotion entity searching system
CN103778207A (en) * 2014-01-15 2014-05-07 杭州电子科技大学 LDA-based news comment topic digging method

Similar Documents

Publication Publication Date Title
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN112861990B (en) Topic clustering method and device based on keywords and entities and computer readable storage medium
CN110162768B (en) Method and device for acquiring entity relationship, computer readable medium and electronic equipment
CN111475625A (en) News manuscript generation method and system based on knowledge graph
CN110188349A (en) A kind of automation writing method based on extraction-type multiple file summarization method
CN107506472B (en) Method for classifying browsed webpages of students
Kallipolitis et al. Semantic search in the World News domain using automatically extracted metadata files
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Wang et al. Detecting hot topics from academic big data
JP7395377B2 (en) Content search methods, devices, equipment, and storage media
Kisilevich et al. “Beautiful picture of an ugly place”. Exploring photo collections using opinion and sentiment analysis of user comments
Fernández et al. Vits: video tagging system from massive web multimedia collections
CN111125297A (en) Massive offline text real-time recommendation method based on search engine
CN112597370A (en) Webpage information autonomous collecting and screening system with specified demand range
KR101476225B1 (en) Method for Indexing Natural Language And Mathematical Formula, Apparatus And Computer-Readable Recording Medium with Program Therefor
CN116595043A (en) Big data retrieval method and device
Mezentseva et al. Optimization of analysis and minimization of information losses in text mining
Hybridised OntoKnowNHS: Ontology Driven Knowledge Centric Novel Hybridised Semantic Scheme for Image Recommendation Using Knowledge Graph
CN113641788B (en) Unsupervised long and short film evaluation fine granularity viewpoint mining method
Zheng et al. Architecture Descriptions Analysis Based on Text Mining and Crawling Technology
Lalitha et al. Potential Web Content Identification and Classification System using NLP and Machine Learning Techniques
CN113656641A (en) Efficient video retrieval system supporting fuzzy comment mining
CN113516202A (en) Webpage accurate classification method for CBL feature extraction and denoising
CN112507105A (en) Multi-mode intelligent question-answering system and method based on WeChat public number
Luo et al. Multimedia news exploration and retrieval by integrating keywords, relations and visual features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination