CN105447026A - Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph

Info

Publication number
CN105447026A
Authority
CN
China
Prior art keywords
vertex
dominating set
graph
text
multi-view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410426746.9A
Other languages
Chinese (zh)
Inventor
李涛
李千目
王鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Original Assignee
Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Science and Technology Changshu Research Institute Co Ltd
Priority to CN201410426746.9A
Publication of CN105447026A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph. The method integrates text, image, and time information; by transforming the task into a graph-based optimization problem and solving it, the method generates a storyline-based summary that reflects how events on a given topic evolve. The method has the following advantages: (1) it combines image processing with text processing, which improves semantic analysis and gives the reader a vivid pictorial summary; (2) it casts the task as a graph-based optimization problem and solves it with an efficient heuristic; and (3) the generated storyline achieves both temporal continuity and content coherence, improves retrieval speed, and provides the reader with richer information and better results.

Description

Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph
Technical field
The present invention relates to a new method for extracting Web information on a given topic, and specifically to a Web information extraction method that generates a pictorial storyline by means of a minimum-weight connected dominating set in a multi-view graph.
Background
With the rapid development of information technology, the Internet has become the most popular medium for publishing information. Publishing and reading information have both become very convenient. However, as information on the Internet grows explosively, people searching for information online often face the problem of browsing a very large collection of Web documents to extract meaningful information. In recent years, various types of Web document understanding systems have been proposed to address this problem. Examples include query-based multi-document summarization systems, which extract summary sentences so that the summary conveys the main content of the documents or the content relevant to the query; topic detection and tracking systems, which monitor developments in events related to a topic; and timeline generation systems, which use the temporal information in documents to generate a summary describing how events on a topic evolve.
Multi-document summarization condenses a document collection into a short summary by extracting the main information in the documents or the information relevant to a query. Various multi-document summarization methods have been proposed. The most common are centroid-based and graph-based; others include latent semantic analysis (LSA), non-negative matrix factorization (NMF), and sentence-based topic models, which generate a summary by selecting semantically important sentences from the documents. Most existing methods extract sentences from the input to form a brief summary, but ignore the temporal and structural information that may be present in the input documents.
Topic detection and tracking (TDT) aims to group news articles by the topics they discuss, detect new, previously unreported events, and track subsequent developments of a topic. Information retrieval techniques (such as information extraction, filtering, and text clustering) are usually applied to these problems.
There is also research on generating timelines and storylines for a topic. These timeline generation methods take temporal information into account and present results in a linear structure. Google News Timeline groups news articles by topic and then sorts them in chronological order.
Although these document understanding systems can alleviate information overload, they still face two major limitations: (1) Most systems focus on highlighting and summarizing the events of a topic and fail to capture the thematic structure of how events evolve. Although timeline systems present a sequence of events in chronological order, a linear event axis usually loses comprehensive information about the evolution of events. (2) These systems usually produce summaries as text, which can sometimes seem dull to readers.
Summary of the invention
1. Object of the invention.
The Web information extraction method of the present invention, based on a minimum-weight connected dominating set in a multi-view graph, differs from the existing methods above. The proposed method integrates text, image, and temporal information, and generates a storyline-based summary that reflects the evolution of a given topic. By generating a pictorial, temporally ordered storyline, the invention addresses the two limitations described above: it organizes the summary by time, giving the reader a summary structure that can be followed, and it uses pictures to make the summary easier to read and understand.
2. Technical solution adopted by the invention.
(1) Preprocessing: input a topic and a set of objects about that topic, where each object comprises an image with an accompanying timestamp and text information;
(2) Through text and image analysis, combined with temporal information, build a multi-view object graph. Each vertex in the graph represents an image and is associated with the text describing that image. The multi-view object graph has two groups of edges: undirected edges represent the degree of similarity between objects, and directed edges represent pairwise temporal relations. Each vertex is assigned a weight that reflects the similarity between the object and the query;
(3) Solve for a minimum-weight dominating set, thereby obtaining a group of vertices, i.e., the dominating objects;
(4) Generate the storyline using a directed Steiner tree, and output a pictorial, temporally ordered storyline composed of the objects linked by their timestamps.
Further, the multi-view object graph is built according to the following steps:
(1) Definition: a multi-view graph is a triple $G = (V, E, A)$, where $V$ is the set of vertices, $E$ is the set of undirected edges, and $A$ is the set of directed edges. From the given set of images with timestamped text descriptions, a multi-view object graph is constructed: images are treated as the vertices $V$, the undirected edges $E$ are computed from the similarity of text and images, and the directed edges $A$ are computed from the differences between timestamps. Four non-negative parameters $\alpha$, $\beta$, $\tau_1$, $\tau_2$ are used to define these edges;
(2) Text is represented with the standard bag-of-words model. In information retrieval, the bag-of-words model ignores the word order, grammar, and syntax of a text and treats it only as a collection of words; each word is assumed to occur independently of whether other words occur, that is, the author's choice of a word at any position is not influenced by the preceding text;
(3) Images are described with a color and edge directivity descriptor, which computes their features in terms of color and texture; similarity is computed with the cosine measure.
Further, the feature vector similarity computed with the cosine measure is used as follows:
Suppose $v_i$ and $v_j$ are two objects in the vertex set $V$. The two objects are joined by an undirected edge if and only if their text similarity and image similarity are greater than the thresholds $\alpha$ and $\beta$, respectively. A directed edge is drawn from $v_i$ to $v_j$ if and only if $\tau_1 \le t_j - t_i \le \tau_2$, where $t_i$ and $t_j$ are their respective timestamps and $[\tau_1, \tau_2]$ is the time window. For each vertex $v_i$, its vertex weight $w(v_i)$ equals 1 minus the cosine similarity between the topic $q$ and the object $o_i$.
Further, the query-relevant dominating objects are identified through a minimum-weight dominating set as follows:
If vertices $u$ and $v$ of a graph are joined by an edge, $u$ is said to dominate $v$. A subset $D$ of the vertex set of an undirected graph is a dominating set if every vertex $v$ is either in $D$ or dominated by some vertex in $D$. The problem of finding the set of query-relevant objects can be viewed as the problem of finding a minimum-weight dominating set in the undirected graph $(V, E)$: given a vertex-weighted undirected graph $G$, find, among all dominating sets of $G$, one whose total vertex weight is minimum:
Step 1: initialize the dominating set $D$ to the empty set $\emptyset$; define an intermediate set $T$, also initialized to $\emptyset$;
Step 2: for each vertex $v$ in the vertex set $V$ that is not in $T$, find the vertices that are adjacent to $v$ and do not belong to the intermediate set $T$, and compute their number $s(v)$;
Step 3: compute the ratio of each weight $w(v)$ to $s(v)$, and find the vertex $v^*$ with the minimum ratio;
Step 4: add $v^*$ to the dominating set $D$, and add the neighbors of $v^*$ to the intermediate set $T$;
Step 5: repeat steps 2 to 5 until the number of vertices in the dominating set $D$ exceeds the maximum size of the dominating set;
Step 6: finally obtain the minimum-weight dominating set $D$ of the undirected graph $(V, E)$.
Further, the objects in the dominating set are connected by a directed Steiner tree to generate the storyline. That is, after the approximate solution of the dominating set has yielded the objects most representative of the topic, a natural storyline is generated that captures the temporal and structural information of the query-relevant events:
Given a weighted directed graph $G$ and a vertex subset $X$, find a minimum-weight tree in $G$ that connects all vertices in $X$, namely a Steiner tree; the vertices in $X$ are called terminals;
When $X = V$, the Steiner problem is the classical minimum spanning tree problem; when $|X| = 2$, the Steiner problem reduces to the shortest path problem between two vertices;
The Steiner tree output by this problem is the storyline, which connects the root object to all the other objects in the dominating set;
The input of this problem is $(G, X, k, r)$, where $G$ is the vertex-weighted directed graph, $X$ is the minimum dominating set found by the method above, $k$ is the size of the dominating set, and $r$ is the root of the Steiner tree. To find a Steiner tree $T$ rooted at $r$ that covers $k$ vertices of $X$, the following procedure $A(k, r, X)$ is used:
Step 1: initialize $T$ to the empty set $\emptyset$;
Step 2: initialize $T_{best}$ to the empty set, and initialize the total vertex weight of $T_{best}$ to $\infty$;
Step 3: for each vertex $v$ with $(r, v) \in A$ and each value $k'$ from 1 to $k$, compute $T' \leftarrow A(k', v, X) \cup \{(r, v)\}$; if the total vertex weight of $T_{best}$ is greater than the total vertex weight of $T'$, then $T_{best} \leftarrow T'$;
Step 4: $T \leftarrow T \cup T_{best}$;
Step 5: $k \leftarrow k - |X \cap V(T_{best})|$, $X \leftarrow X \setminus V(T_{best})$; repeat steps 2 to 5 until $k = 0$;
Step 6: return $T$.
3. Beneficial effects of the invention.
(1) The proposed method combines image, temporal, and text processing rather than relying on text alone, which improves on purely text-based semantic analysis and gives the reader a vivid pictorial summary.
(2) The task is converted into a graph-based optimization problem and solved with an efficient heuristic.
(3) The generated storyline achieves both temporal continuity and content coherence; retrieval and extraction become considerably faster, and the reader receives richer information and better results.
Brief description of the drawings
Fig. 1 is a flowchart of the generation process of the present invention.
Detailed description
To enable examiners at the Patent Office, and especially the public, to understand the technical essence and beneficial effects of the present invention more clearly, the applicant describes it in detail below by way of an embodiment. The description of the embodiment does not limit the scheme of the invention; any equivalent transformation made according to the inventive concept that is merely formal rather than substantive shall be regarded as falling within the scope of the technical solution of the invention.
Embodiment
The problem of generating a pictorial, temporally ordered storyline can be defined as follows:
Input: a query topic $q$ and a set of $n$ objects $O = \{o_1, o_2, \ldots, o_n\}$, where each object $o_i$ is an image with a text description (for example, a short paragraph or a sentence) and a timestamp $t_i$.
Output: a pictorial, temporally ordered storyline composed of the most representative objects that summarize the topic of the query.
Below, this problem is transformed into the minimum-weight connected dominating set problem on a multi-view graph, which can be decomposed into two optimization problems: 1) finding a minimum-weight dominating set; 2) connecting the elements of the dominating set with a directed Steiner tree.
1. Building the multi-view object graph
Definition: a multi-view graph is a triple $G = (V, E, A)$, where $V$ is the set of vertices, $E$ is the set of undirected edges, and $A$ is the set of directed edges.
Given a set of images with timestamped text descriptions, we construct a multi-view object graph: images are treated as the vertices $V$, the undirected edges $E$ are computed from the similarity of text and images, and the directed edges $A$ are computed from the differences between timestamps. We use four non-negative parameters $\alpha$, $\beta$, $\tau_1$, $\tau_2$ to define these edges.
For text, we adopt the standard bag-of-words representation. In information retrieval, the bag-of-words model ignores the word order, grammar, and syntax of a text and treats it only as a collection of words; each word is assumed to occur independently of whether other words occur, that is, the author's choice of a word at any position is not influenced by the preceding text. For images, we adopt the Color and Edge Directivity Descriptor (CEDD), which computes their features in terms of color and texture. For each of these two feature vectors, we compute similarity with the cosine measure.
Suppose $v_i$ and $v_j$ are two objects in $V$. To define $E$, we join the two objects with an edge if and only if their text similarity and image similarity are greater than $\alpha$ and $\beta$, respectively. To define $A$, we draw a directed edge from $v_i$ to $v_j$ if and only if $\tau_1 \le t_j - t_i \le \tau_2$, where $t_i$ and $t_j$ are their respective timestamps. We call $[\tau_1, \tau_2]$ the time window. For each vertex $v_i$, its vertex weight $w(v_i)$ equals 1 minus the cosine similarity between $q$ and $o_i$.
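A minimal sketch of the bag-of-words representation and the cosine measure described above, in Python. Whitespace tokenization and the function names `bag_of_words` and `cosine` are assumptions made for illustration; the image descriptor (CEDD) is not implemented here, although the same `cosine` function would apply to any pair of sparse feature vectors.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts of a text; word order and syntax are ignored.

    Assumption: tokens are lower-cased and split on whitespace.
    """
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

Because word order is discarded, two texts containing the same words in a different order receive a similarity of 1 (up to floating-point rounding), while texts with no words in common receive 0.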
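The construction above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patent's implementation: the parameter names `alpha`, `beta`, `tau1`, `tau2` stand for the four non-negative parameters, the directed-edge rule is assumed to be "the timestamp difference falls inside the time window", and the similarity functions are passed in as callables so that any text or image measure can be plugged in.

```python
from itertools import combinations


def build_multiview_graph(objects, text_sim, image_sim, query_sim,
                          alpha, beta, tau1, tau2):
    """Build vertex weights, undirected edges and directed edges.

    objects:   list of dicts, each with a numeric 'time' entry
    text_sim:  callable(obj_a, obj_b) -> text similarity
    image_sim: callable(obj_a, obj_b) -> image similarity
    query_sim: callable(obj) -> similarity between the query and the object
    Vertices are identified by their position in `objects`.
    """
    n = len(objects)
    # Vertex weight: 1 minus the similarity to the query.
    weights = [1.0 - query_sim(o) for o in objects]

    # Undirected edge: both similarities exceed their thresholds.
    undirected = set()
    for i, j in combinations(range(n), 2):
        if (text_sim(objects[i], objects[j]) > alpha
                and image_sim(objects[i], objects[j]) > beta):
            undirected.add((i, j))

    # Directed edge i -> j: time difference inside the window (assumed rule).
    directed = set()
    for i in range(n):
        for j in range(n):
            if i != j and tau1 <= objects[j]["time"] - objects[i]["time"] <= tau2:
                directed.add((i, j))
    return weights, undirected, directed
```

With a positive lower bound `tau1`, directed edges always point forward in time, which is what allows the later tree to be read as a chronological storyline.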
2. Identifying query-relevant dominating objects through a minimum-weight dominating set
If vertices $u$ and $v$ of a graph are joined by an edge, we say that $u$ dominates $v$. A subset $D$ of the vertex set of an undirected graph is a dominating set if every vertex $v$ is either in $D$ or dominated by some vertex in $D$. The problem of finding the set of query-relevant objects can be viewed as the problem of finding a minimum-weight dominating set in the undirected graph $(V, E)$.
Problem 1 (minimum-weight dominating set problem, MWDS): given a vertex-weighted undirected graph $G$, find, among all dominating sets of $G$, one whose total vertex weight is minimum.
MWDS is known to be NP-hard. We use the following method to obtain an approximate solution:
Step 1: initialize the dominating set $D$ to the empty set $\emptyset$; define an intermediate set $T$, also initialized to $\emptyset$.
Step 2: for each vertex $v$ in $V$ that is not in $T$, find the vertices that are adjacent to $v$ and do not belong to $T$, and compute their number $s(v)$.
Step 3: compute the ratio of each weight $w(v)$ to $s(v)$, and find the vertex $v^*$ with the minimum ratio.
Step 4: add $v^*$ to the dominating set $D$, and add the neighbors of $v^*$ to the set $T$.
Step 5: repeat steps 2 to 5 until the number of vertices in the dominating set $D$ exceeds the specified maximum size of the dominating set.
Step 6: finally obtain the minimum-weight dominating set $D$ of the undirected graph $(V, E)$.
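Problem 1 can also be written compactly as an optimization problem. The notation below is assumed for illustration: $V$ and $E$ are the vertices and undirected edges of the graph, $w(v)$ is the weight of vertex $v$, and $D$ is the candidate dominating set.

```latex
\min_{D \subseteq V} \; \sum_{v \in D} w(v)
\quad \text{subject to} \quad
\forall v \in V:\; v \in D \;\text{ or }\; \exists\, u \in D \text{ with } \{u, v\} \in E .
```

Since vertex weights decrease as similarity to the query increases, minimizing the total weight favors objects that are close to the query while the constraint keeps every other object adjacent to a selected one.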
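The greedy procedure in the steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the patent's exact procedure: the names `greedy_mwds`, `adj`, `weights`, and `max_size` are hypothetical, and, unlike the steps as written, the chosen vertex counts itself among the vertices it covers and is added to the intermediate set, which avoids division by zero for isolated vertices and guarantees termination.

```python
def greedy_mwds(adj, weights, max_size):
    """Greedy approximation of a minimum-weight dominating set.

    adj:      dict mapping each vertex to the set of its neighbours
              (undirected graph, so adjacency is symmetric)
    weights:  dict mapping each vertex to a non-negative weight
    max_size: cap on the number of vertices selected
    """
    dominating = []  # the dominating set being built, in selection order
    covered = set()  # intermediate set: vertices already selected or dominated
    while len(dominating) < max_size and len(covered) < len(adj):
        best, best_ratio = None, float("inf")
        for v in adj:
            if v in covered:
                continue
            # Vertices newly covered if v is chosen; v counts itself, so gain >= 1.
            gain = len((adj[v] | {v}) - covered)
            ratio = weights[v] / gain
            if ratio < best_ratio:
                best, best_ratio = v, ratio
        dominating.append(best)
        covered |= adj[best] | {best}
    return dominating
```

For example, on the path a - b - c with weights 1.0, 0.5, 1.0, the middle vertex has the lowest weight-to-coverage ratio and dominates the whole path, so it is the only vertex selected.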
3. Connecting the objects in the dominating set with a directed Steiner tree to generate the storyline
After the method above has produced an approximate solution of the dominating set, and thus the objects most representative of the topic, we need to generate a natural storyline that captures the temporal and structural information of the query-relevant events. To study this problem, we use the concept of a Steiner tree.
Problem 2: given a weighted directed graph $G$ and a vertex subset $X$, find a minimum-weight tree in $G$ that connects all vertices in $X$, i.e., a Steiner tree. The vertices in $X$ are called terminals. When $X = V$, the Steiner problem is the classical minimum spanning tree problem; when $|X| = 2$, the Steiner problem reduces to the shortest path problem between two vertices.
The Steiner tree output by this problem is the storyline generated by the present invention; it connects the root object to all the other objects in the dominating set.
The input of this problem is $(G, X, k, r)$, where $G$ is the vertex-weighted directed graph, $X$ is the minimum dominating set found by the method above, $k$ is the size of the dominating set, and $r$ is the root of the Steiner tree. To find a Steiner tree $T$ rooted at $r$ that covers $k$ vertices of $X$, the present invention uses the following procedure $A(k, r, X)$:
Step 1: initialize $T$ to the empty set $\emptyset$.
Step 2: initialize $T_{best}$ to the empty set, and initialize the total vertex weight of $T_{best}$ to $\infty$.
Step 3: for each vertex $v$ with $(r, v) \in A$ and each value $k'$ from 1 to $k$, compute $T' \leftarrow A(k', v, X) \cup \{(r, v)\}$; if the total vertex weight of $T_{best}$ is greater than the total vertex weight of $T'$, then $T_{best} \leftarrow T'$.
Step 4: $T \leftarrow T \cup T_{best}$.
Step 5: $k \leftarrow k - |X \cap V(T_{best})|$, $X \leftarrow X \setminus V(T_{best})$; repeat steps 2 to 5 until $k = 0$.
Step 6: return $T$ as the result of Problem 2.
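To make the idea of connecting a root object to the selected objects concrete, the following Python sketch uses a deliberately simpler technique than the procedure described above: it runs Dijkstra's algorithm from the root and takes the union of the cheapest root-to-terminal paths. This shortest-path heuristic is a stand-in chosen for brevity and does not reproduce the patent's method or its quality; the function and parameter names are hypothetical, and the cost of a path is assumed to be the sum of the weights of the vertices on it.

```python
import heapq


def shortest_path_tree_to_terminals(succ, weights, root, terminals):
    """Union of cheapest root-to-terminal paths in a vertex-weighted digraph.

    succ:      dict mapping a vertex to an iterable of its successors
    weights:   dict mapping each vertex to a non-negative weight
    root:      starting vertex of the storyline
    terminals: vertices that should be connected to the root
    Returns a set of directed edges (u, v); unreachable terminals are skipped.
    """
    dist = {root: weights[root]}
    parent = {}
    heap = [(dist[root], root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in succ.get(u, ()):
            nd = d + weights[v]  # entering v costs its vertex weight
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))

    # Walk back from every reachable terminal to the root, collecting edges.
    edges = set()
    for t in terminals:
        v = t
        while v in parent:
            edges.add((parent[v], v))
            v = parent[v]
    return edges
```

Because every vertex keeps a single parent, the collected edges always form a tree rooted at `root`; intermediate vertices that are not terminals may be included when they lie on a cheapest path.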

Claims (5)

1. A Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph, characterized in that it is carried out according to the following steps:
(1) Preprocessing: input a topic and a set of objects about that topic, where each object comprises an image with an accompanying timestamp and text information;
(2) Through text and image analysis, combined with temporal information, build a multi-view object graph, in which each vertex represents an image and is associated with the text describing that image; the multi-view object graph has two groups of edges: undirected edges represent the degree of similarity between objects, and directed edges represent pairwise temporal relations; each vertex is assigned a weight that reflects the similarity between the object and the query;
(3) Solve for a minimum-weight dominating set, thereby obtaining a group of vertices, i.e., the dominating objects;
(4) Generate the storyline using a directed Steiner tree, and output a pictorial, temporally ordered storyline composed of the objects linked by their timestamps.
2. The Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph according to claim 1, characterized in that the multi-view object graph is built according to the following steps:
(1) Definition: a multi-view graph is a triple $G = (V, E, A)$, where $V$ is the set of vertices, $E$ is the set of undirected edges, and $A$ is the set of directed edges; from the given set of images with timestamped text descriptions, a multi-view object graph is constructed: images are treated as the vertices $V$, the undirected edges $E$ are computed from the similarity of text and images, and the directed edges $A$ are computed from the differences between timestamps; four non-negative parameters $\alpha$, $\beta$, $\tau_1$, $\tau_2$ are used to define these edges;
(2) Text is represented with the standard bag-of-words model; in information retrieval, the bag-of-words model ignores the word order, grammar, and syntax of a text and treats it only as a collection of words; each word is assumed to occur independently of whether other words occur, that is, the author's choice of a word at any position is not influenced by the preceding text;
(3) Images are described with a color and edge directivity descriptor, which computes their features in terms of color and texture; similarity is computed with the cosine measure.
3. The Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph according to claim 2, characterized in that the feature vector similarity computed with the cosine measure is used as follows:
Suppose $v_i$ and $v_j$ are two objects in the vertex set $V$; the two objects are joined by an undirected edge if and only if their text similarity and image similarity are greater than the thresholds $\alpha$ and $\beta$, respectively; a directed edge is drawn from $v_i$ to $v_j$ if and only if $\tau_1 \le t_j - t_i \le \tau_2$, where $t_i$ and $t_j$ are their respective timestamps and $[\tau_1, \tau_2]$ is the time window; for each vertex $v_i$, its vertex weight $w(v_i)$ equals 1 minus the cosine similarity between the topic $q$ and the object $o_i$.
4. The Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph according to claim 1, characterized in that in step (3) the query-relevant dominating objects are identified through the minimum-weight dominating set as follows:
If vertices $u$ and $v$ of a graph are joined by an edge, $u$ is said to dominate $v$; a subset $D$ of the vertex set of an undirected graph is a dominating set if every vertex $v$ is either in $D$ or dominated by some vertex in $D$; the problem of finding the set of query-relevant objects can be viewed as the problem of finding a minimum-weight dominating set in the undirected graph $(V, E)$: given a vertex-weighted undirected graph $G$, find, among all dominating sets of $G$, one whose total vertex weight is minimum:
Step 1: initialize the dominating set $D$ to the empty set $\emptyset$; define an intermediate set $T$, also initialized to $\emptyset$;
Step 2: for each vertex $v$ in the vertex set $V$ that is not in $T$, find the vertices that are adjacent to $v$ and do not belong to the intermediate set $T$, and compute their number $s(v)$;
Step 3: compute the ratio of each weight $w(v)$ to $s(v)$, and find the vertex $v^*$ with the minimum ratio;
Step 4: add $v^*$ to the dominating set $D$, and add the neighbors of $v^*$ to the intermediate set $T$;
Step 5: repeat steps 2 to 5 until the number of vertices in the dominating set $D$ exceeds the maximum size of the dominating set;
Step 6: finally obtain the minimum-weight dominating set $D$ of the undirected graph $(V, E)$.
5. The Web information extraction method based on a minimum-weight connected dominating set in a multi-view graph according to claim 4, characterized in that the objects in the dominating set are connected by a directed Steiner tree to generate the storyline as follows, that is, after the approximate solution of the dominating set has yielded the objects most representative of the topic, a natural storyline is generated that captures the temporal and structural information of the query-relevant events:
Given a weighted directed graph $G$ and a vertex subset $X$, find a minimum-weight tree in $G$ that connects all vertices in $X$, namely a Steiner tree; the vertices in $X$ are called terminals;
When $X = V$, the Steiner problem is the classical minimum spanning tree problem; when $|X| = 2$, the Steiner problem reduces to the shortest path problem between two vertices;
The Steiner tree output by this problem is the storyline, which connects the root object to all the other objects in the dominating set;
The input of this problem is $(G, X, k, r)$, where $G$ is the vertex-weighted directed graph, $X$ is the minimum dominating set found by the method above, $k$ is the size of the dominating set, and $r$ is the root of the Steiner tree; to find a Steiner tree $T$ rooted at $r$ that covers $k$ vertices of $X$, the following procedure $A(k, r, X)$ is used:
Step 1: initialize $T$ to the empty set $\emptyset$;
Step 2: initialize $T_{best}$ to the empty set, and initialize the total vertex weight of $T_{best}$ to $\infty$;
Step 3: for each vertex $v$ with $(r, v) \in A$ and each value $k'$ from 1 to $k$, compute $T' \leftarrow A(k', v, X) \cup \{(r, v)\}$; if the total vertex weight of $T_{best}$ is greater than the total vertex weight of $T'$, then $T_{best} \leftarrow T'$;
Step 4: $T \leftarrow T \cup T_{best}$;
Step 5: $k \leftarrow k - |X \cap V(T_{best})|$, $X \leftarrow X \setminus V(T_{best})$; repeat steps 2 to 5 until $k = 0$;
Step 6: return $T$.
CN201410426746.9A 2014-08-27 2014-08-27 Web information extraction method based on minimum weight communication determining set in multi-view image Pending CN105447026A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410426746.9A CN105447026A (en) 2014-08-27 2014-08-27 Web information extraction method based on minimum weight communication determining set in multi-view image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410426746.9A CN105447026A (en) 2014-08-27 2014-08-27 Web information extraction method based on minimum weight communication determining set in multi-view image

Publications (1)

Publication Number Publication Date
CN105447026A true CN105447026A (en) 2016-03-30

Family

ID=55557219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410426746.9A Pending CN105447026A (en) 2014-08-27 2014-08-27 Web information extraction method based on minimum weight communication determining set in multi-view image

Country Status (1)

Country Link
CN (1) CN105447026A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101930462A (en) * 2010-08-20 2010-12-29 华中科技大学 Comprehensive body similarity detection method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《PROCEEDINGS OF THE TWENTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886783A (en) * 2017-01-20 2017-06-23 清华大学 A kind of image search method and system based on provincial characteristics
CN108280772A (en) * 2018-01-24 2018-07-13 北京航空航天大学 Story train of thought generation method based on event correlation in social networks
CN108280772B (en) * 2018-01-24 2022-02-18 北京航空航天大学 Story context generation method based on event association in social network
CN109145936A (en) * 2018-06-20 2019-01-04 北京达佳互联信息技术有限公司 A kind of model optimization method and device
CN109145936B (en) * 2018-06-20 2019-07-09 北京达佳互联信息技术有限公司 A kind of model optimization method and device
CN112766262A (en) * 2021-01-21 2021-05-07 西安理工大学 Identification method for single-layer one-to-many and many-to-one share graphs
CN112766262B (en) * 2021-01-21 2024-02-02 西安理工大学 Identification method for single-layer one-to-many and many-to-one share graphs
CN115329051A (en) * 2022-10-17 2022-11-11 成都大学 Multi-view news information rapid retrieval method, system, storage medium and terminal
CN115329051B (en) * 2022-10-17 2022-12-20 成都大学 Multi-view news information rapid retrieval method, system, storage medium and terminal

Similar Documents

Publication Publication Date Title
CN103745000B (en) Hot topic detection method of Chinese micro-blogs
CN106599181B (en) A kind of hot news detection method based on topic model
CN103324665B (en) Hot spot information extraction method and device based on micro-blog
CN102289522B (en) Method of intelligently classifying texts
CN103902988B (en) A kind of sketch shape matching method based on Modular products figure with Clique
CN103455487B (en) The extracting method and device of a kind of search term
CN105447026A (en) Web information extraction method based on minimum weight communication determining set in multi-view image
CN105243129A (en) Commodity property characteristic word clustering method
CN103455562A (en) Text orientation analysis method and product review orientation discriminator on basis of same
CN105306475A (en) Network intrusion detection method based on association rule classification
CN106897914A (en) A kind of Method of Commodity Recommendation and system based on topic model
Lee Unsupervised and supervised learning to evaluate event relatedness based on content mining from social-media streams
CN104133848A (en) Tibetan language entity knowledge information extraction method
CN104298749A (en) Commodity retrieval method based on image visual and textual semantic integration
CN107203520A (en) The method for building up of hotel's sentiment dictionary, the sentiment analysis method and system of comment
CN104199838B (en) A kind of user model constructing method based on label disambiguation
Sadr et al. Unified topic-based semantic models: a study in computing the semantic relatedness of geographic terms
CN110363206A (en) Cluster, data processing and the data identification method of data object
CN110019820A (en) Main suit and present illness history symptom Timing Coincidence Detection method in a kind of case history
CN107169051B (en) Based on relevant method for searching three-dimension model semantic between ontology and system
CN106777395A (en) A kind of topic based on community's text data finds system
CN103678432B (en) A kind of web page body extracting method based on web page body feature and intermediary's true value
Zhang et al. Ideagraph plus: A topic-based algorithm for perceiving unnoticed events
CN104102654B (en) A kind of method and device of words clustering
CN107463615B (en) Real-time going and dealing recommendation method based on context and user interest in open network

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160330