CN103729422A

CN103729422A - Information fragment associative output method and system

Info

Publication number: CN103729422A
Application number: CN201310712337.0A
Authority: CN
Inventors: 江潮
Original assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Current assignee: WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date: 2013-12-23
Filing date: 2013-12-23
Publication date: 2014-04-16

Abstract

The invention discloses an information fragment associative output method and system. The method includes: recognizing the text content of multiple information fragments selected by a user, and aggregating and storing the text content of all the information fragments obtained; performing a similarity calculation on the text content of every two information fragments to obtain the similarity between them; and, after the user selects an information fragment to view, creating a document that displays the text content of that fragment and displaying the text content of the other information fragments in the document in order of similarity. Because the text content of each information fragment is stored automatically as it is recognized, the collection workflow is greatly simplified; because the information fragments are associated with one another, the mental effort spent on reading and identification is reduced.

Description

Method and system for associative output of information fragments

Technical field

The present invention relates to the field of computers, and in particular to a method and system for associative output of information fragments.

Background art

With the arrival of the Internet era, completing a report or writing a document often requires collecting a large amount of information. Most of this information exists as fragments scattered across different places. Once a fragment is found, its text content has to be gathered by copying and pasting the whole passage, among other operations. Collecting the fragments systematically raises a further problem: the resulting large set of information fragments is disordered. This large amount of disordered information must therefore be organized according to some rule, so as to reduce the mental effort spent on reading and identification and to further improve the efficiency of organizing fragments.

Summary of the invention

The present invention aims to provide a method and system for associative output of information fragments, so as to solve the problem in the prior art that selected information fragments are difficult to organize.

The invention discloses a method for associative output of information fragments, comprising:

Recognizing the text content of multiple information fragments selected by a user, and aggregating and storing the text content of all the information fragments obtained;

Performing a similarity calculation on the text content of any two of the information fragments to obtain the similarity between any two information fragments;

After the user selects an information fragment to view, creating a document that displays the text content of that information fragment, and displaying the text content of the other information fragments in the document in order of similarity.

Preferably, the method further comprises:

After the similarity between information fragments is obtained, for each information fragment, selecting the other information fragments whose similarity to it falls within a predefined first threshold range, and associating the selected information fragments with it;

Displaying in the document the text content of the information fragment selected by the user, and displaying the text content of the other information fragments associated with it in the document in order of similarity.

Preferably, the similarity calculation comprises:

Selecting a first information fragment D₁ and a second information fragment D₂ from the information fragments;

From the text content of the first information fragment and the text content of the second information fragment respectively, determining the key characters/words whose word frequency is higher than a predefined second threshold as characteristic items;

Establishing a first feature set of the first information fragment, as follows:

D₁ = {T₁₁, W₁₁; T₁₂, W₁₂; …; T₁ₙ, W₁ₙ};

Wherein T₁ₙ is a characteristic item of D₁, W₁ₙ is the weight determined from its word frequency, and n is the sequence number of the characteristic item in the first feature set;

Establishing a second feature set of the second information fragment, as follows:

D₂ = {T₂₁, W₂₁; T₂₂, W₂₂; …; T₂ₘ, W₂ₘ};

Wherein T₂ₘ is a characteristic item of D₂, W₂ₘ is the weight determined from its word frequency, and m is the sequence number of the characteristic item in the second feature set;

Calculating the similarity of the two information fragments with the cosine formula, which is as follows:

Sim(D_{1}, D_{2}) = \cos \theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{(\sum_{k=1}^{n} w_{1k}^{2}) (\sum_{k=1}^{n} w_{2k}^{2})}};

Wherein Sim(D_{1}, D_{2}) is the similarity of the two information fragments, and k is the sequence number of the characteristic item.
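The calculation above can be illustrated with a minimal sketch. It is not the patented implementation: the function names feature_set and cosine_similarity are hypothetical, characteristic items are taken to be whitespace-separated words (Chinese text would first need word segmentation), and the weight is assumed to be the raw word count.

```python
import math
from collections import Counter


def feature_set(text, second_threshold=0):
    """Build {characteristic item: weight} for one fragment.

    Assumption: items are lower-cased whitespace-separated words and the
    weight is the raw count. Only items whose count is higher than
    second_threshold are kept.
    """
    counts = Counter(text.lower().split())
    return {term: float(c) for term, c in counts.items() if c > second_threshold}


def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse weight vectors."""
    # An item missing from one feature set contributes a weight of zero.
    dot = sum(w * d2.get(term, 0.0) for term, w in d1.items())
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty feature set is treated as unrelated
    return dot / (norm1 * norm2)
```

For example, the fragments "a b b" and "b b c" share only the item "b" with weight 2 in each, so the numerator is 4 and each norm is the square root of 5, giving a similarity of 0.8 up to floating-point rounding.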

Preferably, the method further comprises:

Establishing an index list for all the aggregated and stored information fragments;

The user selects the information fragment to view from the index list.

Preferably, after the user selects the information fragments, the information source of each information fragment is identified;

The text content and the information source of each information fragment have a mapping relationship;

When the text content of an information fragment is displayed, the information source of that information fragment is displayed as well.

Preferably, the information fragments comprise fragments in text format and fragments in picture format.

Preferably, the method further comprises:

When the user triggers one of multiple global hotkeys, invoking the corresponding selection function to select an information fragment in text format or in picture format.

Preferably, the method further comprises:

After the text content of the multiple information fragments selected by the user is recognized, comparing the text content of the information fragments with one another and, when repeated text content is detected, prompting the user whether to continue aggregating the repeated part;

And, according to the user's choice, either continuing the aggregation or retaining a single copy of the repeated text content for aggregation.

The invention also discloses a system for associative output of information fragments, comprising:

An information recognition module, configured to recognize the text content and information source of the information fragments selected by the user, and to place the recognized text content and information source into the corresponding databases for aggregation and storage;

The databases comprise a first database for storing the text content of the information fragments and a second database for storing the information sources of the information fragments; the text content and information source of the same information fragment have a mapping relationship across the two databases;

A directory index module, configured to establish an index list of all the information fragments in the databases for the user to select from;

A document association module, configured to calculate the similarity of every two information fragments;

A document output module, configured to display the text content and information source of the information fragment selected by the user in the document format selected by the user, and to display the text content of the other information fragments in the document in order of similarity.

Preferably, the system further comprises:

A parsing module, configured to recognize the global hotkey triggered by the user and to send the control instruction mapped to the recognized global hotkey to the corresponding selection module, which provides the user with the corresponding selection function;

An information duplicate-checking module, configured to compare the text content recognized by the information recognition module and, when repeated text content is detected, to prompt the user whether to continue aggregating the repeated part; and, according to the user's choice, either to continue the aggregation or to retain a single copy of the repeated text content for aggregation.

The method and system for associative output of information fragments of the present invention have the following advantages:

1. The selected information fragments are stored in the database automatically, and the user can directly view the text content of the information fragments needed;

2. When the user views an information fragment, the associated information fragments are output at the same time, which helps the user review them;

3. An index list is established, so that the user can further screen the initially selected information fragments for the ones needed.

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The schematic embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 shows a first flowchart of an embodiment;

Fig. 2 shows a second flowchart of an embodiment;

Fig. 3 shows a structural diagram of an embodiment.

Detailed description of the embodiments

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

As shown in Fig. 3, the invention discloses a system for associative output of information fragments, comprising:

A parsing module 1, a text selection module 2, a picture selection module 3, an information recognition module 4, a directory index module 5, an information fragment association module 10, a document output module 7 and an information duplicate-checking module 9;

The parsing module is configured to recognize the global hotkey triggered by the user and to send the control instruction mapped to the recognized global hotkey to the corresponding selection module, which provides the user with the corresponding selection function;

A global hotkey may be a single key or a combination of several keys.

When selecting the information fragments needed, the user is not limited to selectable text; information fragments also include pictures whose text cannot be selected but which contain fragment information;

After the parsing module recognizes a first global hotkey triggered by the user, it sends the control instruction mapped to the first global hotkey to the text selection module;

After the text selection module receives the control instruction mapped to the first global hotkey from the parsing module, it provides the user with the function of directly selecting an information fragment in text format;

After the parsing module recognizes a second global hotkey triggered by the user, it sends the control instruction mapped to the second global hotkey to the picture selection module;

After the picture selection module receives the control instruction mapped to the second global hotkey from the parsing module, it provides the user with the function of selecting an information fragment in picture format by taking a screenshot.
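The hotkey-to-module mapping described above can be sketched as a simple dispatch table. This is only an illustration: the class ParsingModule, the two selection functions and the key combinations are hypothetical, and capturing real system-wide key presses (which depends on the operating system) is left out.

```python
class ParsingModule:
    """Maps a recognized global hotkey to the selection function it triggers."""

    def __init__(self):
        self._handlers = {}

    def register(self, keys, handler):
        # A hotkey may be a single key or a combination of keys, so it is
        # stored as an order-independent set.
        self._handlers[frozenset(keys)] = handler

    def dispatch(self, pressed_keys):
        # Forward the "control instruction" to the mapped selection function;
        # unknown key combinations are ignored.
        handler = self._handlers.get(frozenset(pressed_keys))
        return handler() if handler is not None else None


def select_text_fragment():
    # Stand-in for the text selection module.
    return "text selection mode"


def select_picture_fragment():
    # Stand-in for the picture (screenshot) selection module.
    return "screenshot selection mode"


parser = ParsingModule()
parser.register({"ctrl", "alt", "t"}, select_text_fragment)     # assumed first hotkey
parser.register({"ctrl", "alt", "p"}, select_picture_fragment)  # assumed second hotkey
```

Storing each hotkey as a frozenset means the order in which the keys of a combination are reported does not matter.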

After the user selects an information fragment, the selected information fragment is sent to the information recognition module;

The information recognition module is configured to receive the information fragment selected by the user and to recognize its text content and information source. For a local resource, the information source is the storage address of the information fragment, for example the location "c: 123" of the document containing the information fragment; that document may be in any of various formats, for example office documents, text files or compiled documents. For a network resource, the information source is the network address of the information fragment, for example: http://wenku.baidu.com/link url=yKLV9Z1UyA3SCZqcZkDM0miWl5LWLgEJvOh_cY-iPQRIOP23sWg2 sNgP_2-is2h_32e2Cr_u3HjVmraorpLE pt8v9J5VGTKEC9dVPi8-Fle. The information source makes it possible to find the document containing the information fragment quickly, so that the user can conveniently view, call up and select other parts of that document related to the information fragment.

The text content of an information fragment is recognized as follows. For an information fragment in text format, the fragment itself serves as its text content;

For an information fragment in picture format, the text content is obtained through the following steps:

Step 1: scan the selected picture and analyze its layout;

Step 2: segment the picture into lines and then into characters;

Step 3: detect the shapes of the characters, letters and symbols in the picture under two modes, gradual darkening and gradual brightening. Characters, letters and symbols whose shapes remain unchanged are marked as determined characters and matched against a text library, and the matched text is output; those whose shapes change are marked as undetermined characters;

Step 4: resolve each undetermined character from its shape and from the semantic relationship of the characters within a certain range before and after it, match it against the text library, and output the matched text.

Step 5: combine the results and output the complete text content.

Alternatively, OCR technology, for example the Hanwang OCR tool, may be used to recognize the text in the picture.

The information recognition module separates the recognized text content and information source of the information fragment and deposits them in the corresponding databases for aggregation and storage;

The databases comprise a first database 6 and a second database 8;

The first database stores the text content of the information fragments;

The second database stores the information sources of the information fragments;

The text content and information source of the same information fragment have a mapping relationship across the two databases, so that either one can be used to find the other.

The first database and the second database can be searched by text content and by information source to find the information fragments that match the user's search term, which are then output and displayed by the document output module.
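The two-database arrangement can be sketched with two in-memory mappings that share a fragment identifier. This is an assumption-laden illustration, not the disclosed storage design: the class FragmentStore, its method names and the use of plain dictionaries instead of real databases are all hypothetical, and the search is a simple case-insensitive substring match.

```python
import itertools


class FragmentStore:
    """Keeps text content and information sources separately, linked by a shared id."""

    def __init__(self):
        self.texts = {}    # stands in for the first database (text content)
        self.sources = {}  # stands in for the second database (information source)
        self._ids = itertools.count(1)

    def add(self, text, source):
        # The shared id is the mapping relationship between the two stores.
        fragment_id = next(self._ids)
        self.texts[fragment_id] = text
        self.sources[fragment_id] = source
        return fragment_id

    def source_of(self, fragment_id):
        return self.sources[fragment_id]

    def search(self, term):
        # Match the search term against both the text content and the source.
        term = term.lower()
        return sorted(
            fid for fid in self.texts
            if term in self.texts[fid].lower() or term in self.sources[fid].lower()
        )
```

Because both mappings use the same key, finding a fragment through either its text or its source immediately gives access to the other half.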

The information fragment association module retrieves the text content of every two information fragments from the database and performs the similarity calculation; for each information fragment, it selects, according to the set threshold, the other information fragments whose similarity to it falls within the predefined threshold range and associates them with it;

The document output module is configured to display the text content and information source of the information fragment in the document format selected by the user, and to display the text content and information source of the information fragments associated with it.

The directory index module is configured to establish an index list for the text content of the information fragments in the first database;

The titles in the index list may be numbers arranged in a certain order, for example sequence numbers assigned after sorting the information fragments by length, by size or by acquisition time;

A title may also be written by the user, or be text that the user marks in the information fragment. For an information fragment in picture format, the user marks text by taking a screenshot of it within the picture; after the information recognition module recognizes the captured text, it is used as the title in the index list;

Further, the user determines keywords for the information fragment; there may be one or more keywords. A keyword is either written by the user or is text that the user marks in the information fragment;

After the keywords of an information fragment are determined, they are displayed together with the title of the corresponding entry in the index list as a summary of that information fragment, so that the user can identify the information fragment more clearly and precisely.

The information fragment that the user selects in the index list, together with the information fragments associated with it, is output and displayed by the document output module.
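A minimal sketch of such an index list follows. The function build_index and the dictionary fields text, acquired, title and keywords are hypothetical names chosen for illustration; the sketch assumes that a fragment without a user-supplied title receives its sequence number as the title.

```python
def build_index(fragments, order_by="acquired"):
    """Return index entries with a title and a keyword summary for each fragment.

    fragments is a list of dicts with "text" and "acquired" (any sortable
    timestamp) and optionally "title" and "keywords".
    """
    sort_keys = {
        "length": lambda f: len(f["text"]),
        "acquired": lambda f: f["acquired"],
    }
    ordered = sorted(fragments, key=sort_keys[order_by])
    index = []
    for number, fragment in enumerate(ordered, start=1):
        # Prefer a user-written or user-marked title; otherwise use the number.
        title = fragment.get("title") or str(number)
        # Keywords are shown next to the title as a short summary.
        summary = ", ".join(fragment.get("keywords", []))
        index.append({"title": title, "summary": summary, "text": fragment["text"]})
    return index
```

Sorting by "length" or by "acquired" corresponds to two of the ordering options mentioned above; other orderings would only need another entry in sort_keys.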

The information duplicate-checking module is configured to compare the text content recognized by the information recognition module and, when repeated text content is detected, to prompt the user whether to continue aggregating the repeated part; and, according to the user's choice, either to continue the aggregation or to retain the one copy of the repeated text content that the user selects for aggregation.
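The duplicate check can be sketched as follows, under simplifying assumptions: two text contents count as repeated only if they are identical after whitespace normalization (partial overlaps are not detected), the first copy is the one retained, and the user prompt is replaced by a hypothetical callback keep_duplicate. The function names are illustrative.

```python
def find_duplicates(texts):
    """Return (first_index, repeat_index) pairs for texts that repeat an earlier one."""
    first_seen = {}
    pairs = []
    for index, text in enumerate(texts):
        key = " ".join(text.split())  # ignore differences in whitespace only
        if key in first_seen:
            pairs.append((first_seen[key], index))
        else:
            first_seen[key] = index
    return pairs


def collect(texts, keep_duplicate):
    """Aggregate texts, asking keep_duplicate(first, repeat) about each repeat.

    keep_duplicate stands in for the user prompt: True keeps the repeated
    copy, False retains only the first copy.
    """
    dropped = {
        repeat for first, repeat in find_duplicates(texts)
        if not keep_duplicate(first, repeat)
    }
    return [text for index, text in enumerate(texts) if index not in dropped]
```

Passing a callback rather than calling an input function directly keeps the comparison logic independent of how the prompt is shown to the user.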

As shown in Fig. 1, the invention also discloses a method for associative output of information fragments, comprising:

S11: recognizing the text content of multiple information fragments selected by a user, and aggregating and storing the text content of all the information fragments obtained;

S12: performing a similarity calculation on the text content of any two of the information fragments to obtain the similarity between any two information fragments;

S13: after the user selects an information fragment to view, creating a document that displays the text content of that information fragment, and displaying the text content of the other information fragments in the document in order of similarity.

The similarity calculation specifically comprises:

D₁ = {T₁₁, W₁₁; T₁₂, W₁₂; …; T₁ₙ, W₁ₙ};

Wherein T₁ₙ is a characteristic item of D₁, W₁ₙ is the weight determined from the word frequency of T₁ₙ, and n is the sequence number of the characteristic item in the first feature set;

D₂ = {T₂₁, W₂₁; T₂₂, W₂₂; …; T₂ₘ, W₂ₘ};

Wherein T₂ₘ is a characteristic item of D₂, W₂ₘ is the weight determined from the word frequency of T₂ₘ, and m is the sequence number of the characteristic item in the second feature set;

Cosine: Sim(D_{1}, D_{2}) = \cos \theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{(\sum_{k=1}^{n} w_{1k}^{2}) (\sum_{k=1}^{n} w_{2k}^{2})}};

Representing the fragment texts D1 and D2 with the vector space model, the similarity is calculated as follows:

\cos(D_{1}, D_{2}) = \frac{d_{1} \cdot d_{2}}{\|d_{1}\| \cdot \|d_{2}\|} = \frac{\sum_{i=0}^{k} w(d_{1}, t_{i}) \cdot w(d_{2}, t_{i})}{\sqrt{\sum_{i=0}^{n} w(d_{1}, t_{i})^{2}} \cdot \sqrt{\sum_{j=0}^{m} w(d_{2}, t_{j})^{2}}};

The similarity between each information fragment and every other information fragment is calculated in this way;

All information fragments whose similarity to a given information fragment falls within the threshold range (low, high) are selected and associated with that information fragment, and an association table is established:

The association table contains the information of the other information fragments associated with the information fragment, sorted in descending order of similarity;

After the user selects an information fragment to view, a document is created that displays the text content of that information fragment; below it, the text content of the other information fragments is displayed in the order in which they appear in the association table.
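Building the association table can be sketched as below. The function build_association_table is a hypothetical name, the similarity argument is any symmetric scoring function (for instance a cosine similarity over feature sets), and treating the threshold range as inclusive at both ends is an assumption, since the text does not say whether the bounds are included.

```python
from itertools import combinations


def build_association_table(fragment_ids, similarity, low, high):
    """Return {fragment_id: [(other_id, score), ...]} in descending score order.

    similarity(a, b) must return the same value as similarity(b, a);
    low and high bound the threshold range (assumed inclusive).
    """
    table = {fragment_id: [] for fragment_id in fragment_ids}
    # Every unordered pair is scored once and recorded for both fragments.
    for a, b in combinations(fragment_ids, 2):
        score = similarity(a, b)
        if low <= score <= high:
            table[a].append((b, score))
            table[b].append((a, score))
    for related in table.values():
        related.sort(key=lambda pair: pair[1], reverse=True)
    return table
```

Scoring each pair once and storing it under both fragments halves the number of similarity calculations compared with scoring every ordered pair.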

A preferred embodiment of the method disclosed by the invention is given below, as shown in Fig. 2:

S21: fragment collection;

Wait for the user to trigger a specific global hotkey, invoke the corresponding selection function and provide it to the user, and select an information fragment of the corresponding format;

S22: fragment recognition;

After the user has selected an information fragment, the selected information fragment is recognized to obtain its text content and information source;

S23: information duplicate checking;

S24: aggregation and storage;

The text content and information source of all the information fragments are separated and deposited in the corresponding databases.

S25: association processing:

Calculate the similarity of the text content of every two information fragments and, for each information fragment, associate with it the other information fragments whose similarity falls within the threshold range;

S26: establishing a directory;

Establish an index list according to the text content of the information fragments in the database.

This step also comprises determining the keywords of the information fragments;

The keywords are displayed as a summary in the index list.

S27: selecting a fragment;

The user selects the information fragment needed from the index list according to the keywords; or

The database is searched using the text content or information source of an information fragment as the search term, and the retrieved information fragments are obtained;

S28: outputting the fragment;

The text content of the information fragment that the user selected in the index list, or that was retrieved from the database, is displayed in a single document in the document format selected by the user, together with the text content and information source of the other information fragments associated with it, in order of similarity.
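The output step can be sketched as a function that renders the selected fragment followed by its associated fragments. The function render_document, the tuple layout and the plain-text output format are hypothetical; the embodiment leaves the document format to the user's choice, and descending order of similarity is assumed here.

```python
def render_document(selected, related):
    """Render one plain-text document.

    selected is a (text, source) pair; related is a list of
    (text, source, similarity) triples for the associated fragments.
    """
    lines = [selected[0], f"Source: {selected[1]}", ""]
    # Associated fragments follow the selected one, most similar first.
    for text, source, score in sorted(related, key=lambda item: item[2], reverse=True):
        lines.extend([text, f"Source: {source} (similarity {score:.2f})", ""])
    return "\n".join(lines).rstrip()
```

Each fragment is printed together with its information source, so the reader can go back to the document from which the fragment was taken.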

The above description of the embodiments is only intended to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and to the scope of application. In summary, this description should not be construed as limiting the present invention.

Claims

1. A method for associative output of information fragments, characterized by comprising:

2. The method according to claim 1, characterized by further comprising:

3. The method according to claim 1, characterized in that the similarity calculation comprises:

D₁ = {T₁₁, W₁₁; T₁₂, W₁₂; …; T₁ₙ, W₁ₙ};

D₂ = {T₂₁, W₂₁; T₂₂, W₂₂; …; T₂ₘ, W₂ₘ};

Wherein T₂ₘ is a characteristic item of D₂, W₂ₘ is the weight determined from its word frequency, and m is the sequence number of the characteristic item in the second feature set;

Sim(D_{1}, D_{2}) = \cos \theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{(\sum_{k=1}^{n} w_{1k}^{2}) (\sum_{k=1}^{n} w_{2k}^{2})}};

4. The method according to claim 1, characterized by further comprising:

5. The method according to claim 1, characterized in that, after the user selects the information fragments, the information source of each information fragment is identified;

6. The method according to claim 1, characterized in that the information fragments comprise fragments in text format and fragments in picture format.

7. The method according to claim 6, characterized by further comprising:

8. The method according to claim 1, characterized by further comprising:

9. A system for associative output of information fragments, characterized by comprising:

10. The system according to claim 9, characterized by further comprising: