Summary of the invention
The present invention aims to provide a method and system for the associated output of information fragments, so as to solve the problem in the prior art described above that selected information fragments are difficult to organize.
The invention discloses a method for the associated output of information fragments, comprising:
identifying the text content of multiple information fragments selected by a user, and collecting and storing the text content of all the information fragments obtained;
performing a similarity calculation on the text content of any two of the information fragments to obtain the similarity between them;
after the user selects an information fragment to view, creating a document that displays the text content of that information fragment, and displaying the text content of the other information fragments in the document in order of the magnitude of the similarity.
Preferably, the method further comprises:
after the similarity between information fragments is obtained, for each information fragment, selecting the other information fragments whose similarity to it falls within a predefined first threshold range, and associating the selected fragments with it;
displaying in the document the text content of the information fragment selected by the user, and displaying the text content of the other information fragments associated with it in the document in order of the magnitude of the similarity.
Preferably, the similarity calculation comprises:
selecting a first information fragment $D_1$ and a second information fragment $D_2$ from the information fragments;
according to the text content of the first information fragment and the text content of the second information fragment, respectively determining the key characters/words whose word frequency is higher than a predefined second threshold as feature items;
establishing a first feature set of the first information fragment, as follows:
$D_1 = \{T_{11}, W_{11}; T_{12}, W_{12}; \ldots; T_{1n}, W_{1n}\}$;
wherein $T_{1n}$ is a feature item of $D_1$, $W_{1n}$ is the weight determined according to word frequency, and n is the index of the feature item in the first feature set;
establishing a second feature set of the second information fragment, as follows:
$D_2 = \{T_{21}, W_{21}; T_{22}, W_{22}; \ldots; T_{2m}, W_{2m}\}$;
wherein $T_{2m}$ is a feature item of $D_2$, $W_{2m}$ is the weight determined according to word frequency, and m is the index of the feature item in the second feature set;
calculating the similarity of the two information fragments using the cosine formula, which is as follows:
$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_{k} W_{1k} W_{2k}}{\sqrt{\sum_{k} W_{1k}^{2}} \, \sqrt{\sum_{k} W_{2k}^{2}}}$$
wherein $\mathrm{Sim}(D_1, D_2)$ is the similarity of the two information fragments, and k is the index of the feature item.
Preferably, the method further comprises:
building an index list for all the information fragments that have been collected and stored;
the user selecting the information fragment to view from the index list.
Preferably, after the user selects information fragments, the information source of each information fragment is identified;
the text content and the information source of each information fragment have a mapping relationship;
when the text content of an information fragment is displayed, the information source of that information fragment is displayed as well.
Preferably, the information fragments include fragments in text format and fragments in picture format.
Preferably, the method further comprises:
the user triggering one of multiple global hotkeys to invoke the corresponding selection function, and selecting an information fragment in text format or picture format.
Preferably, the method further comprises:
after the text content of the multiple information fragments selected by the user is identified, comparing the text content of the information fragments with one another and, when duplicated text content is detected, prompting the user as to whether the duplicated text content should still be collected;
and, according to the user's choice, either continuing the collection or retaining only one copy of the duplicated text content for collection.
The invention also discloses a system for the associated output of information fragments, comprising:
an information identification module, configured to identify the text content and the information source of the information fragment selected by the user, and to place the identified text content and information source in the corresponding databases for collection and storage;
the databases comprising a first database for storing the text content of information fragments and a second database for storing the information source of information fragments, wherein the text content and the information source of the same information fragment have a mapping relationship across the two databases;
a directory index module, configured to build an index list of all information fragments in the databases for the user to select from;
a document association module, configured to calculate the similarity between every two information fragments;
a document output module, configured to display the text content and the information source of the information fragment selected by the user in the document format chosen by the user, and to display the text content of the other information fragments in the document in order of the magnitude of the similarity.
Preferably, the system further comprises:
a parsing module, configured to recognize the global hotkey triggered by the user and to send the control instruction mapped to the recognized global hotkey to the corresponding selection module, which provides the user with the corresponding selection function;
an information duplicate-check module, configured to compare the text contents identified by the information identification module with one another and, when duplicated text content is detected, to prompt the user as to whether the duplicated text content should still be collected; and, according to the user's choice, either to continue the collection or to retain only one copy of the duplicated text content for collection.
The method and system for the associated output of information fragments according to the present invention have the following advantages:
1. the selected information fragments are automatically stored in the database, and the user can directly view the text content of the information fragments needed;
2. when the user views an information fragment, the associated information fragments are output at the same time, which helps the user review them;
3. an index list is built, so that the user can further screen the needed information fragments from those initially selected.
Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in Figure 3, the invention discloses a system for the associated output of information fragments, comprising:
a parsing module 1, a text selection module 2, a picture selection module 3, an information identification module 4, a directory index module 5, an information fragment association module 10, a document output module 7 and an information duplicate-check module 9;
The parsing module is configured to recognize the global hotkey triggered by the user and to send the control instruction mapped to the recognized global hotkey to the corresponding selection module, which provides the user with the corresponding selection function;
A global hotkey may be a single key or a combination of several keys.
When selecting the needed information fragments, the user is not limited to selectable text; information fragments also include pictures whose text cannot be selected but which contain fragment information;
After the parsing module recognizes a first global hotkey triggered by the user, it sends the control instruction mapped to the first global hotkey to the text selection module;
After receiving the control instruction mapped to the first global hotkey from the parsing module, the text selection module provides the user with the function of directly selecting an information fragment in text format;
After the parsing module recognizes a second global hotkey triggered by the user, it sends the control instruction mapped to the second global hotkey to the picture selection module;
After receiving the control instruction mapped to the second global hotkey from the parsing module, the picture selection module provides the user with the function of selecting an information fragment in picture format by taking a screenshot.
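The hotkey dispatch performed by the parsing module can be sketched as a lookup table from key combinations to selection functions. This is a minimal illustration only: the key combinations, the function names and the returned strings are assumptions made for the example and are not specified in this description.

```python
from typing import Callable, Dict, Optional


def select_text_fragment() -> str:
    # Stand-in for the text selection module (hypothetical behaviour).
    return "text-selection"


def select_picture_fragment() -> str:
    # Stand-in for the picture selection module (hypothetical behaviour).
    return "picture-selection"


# Hypothetical mapping of global hotkeys to selection functions.
HOTKEY_MAP: Dict[str, Callable[[], str]] = {
    "ctrl+alt+t": select_text_fragment,     # first global hotkey
    "ctrl+alt+p": select_picture_fragment,  # second global hotkey
}


def dispatch_hotkey(keys: str) -> Optional[str]:
    """Forward a recognized hotkey to its selection function.

    Returns None when the key combination is not a registered hotkey.
    """
    handler = HOTKEY_MAP.get(keys.lower())
    return handler() if handler is not None else None
```

A table-driven dispatch keeps the parsing module independent of the selection modules, so a further hotkey could be registered without changing the dispatch logic.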
After the user selects an information fragment, the selected information fragment is sent to the information identification module;
The information identification module is configured to receive the information fragment selected by the user and to identify the text content and the information source of that information fragment. For a local resource, the information source is the storage address of the information fragment, for example c:\123\[document containing the information fragment]; the containing document may be in any of various document formats, for example office documents, text files or compiled documents. For a network resource, the information source is the network address of the information fragment, for example: http://wenku.baidu.com/link url=yKLV9Z1UyA3SCZqcZkDM0miWl5LWLgEJvOh_cY-iPQRIOP23sWg2 sNgP_2-is2h_32e2Cr_u3HjVmraorpLE pt8v9J5VGTKEC9dVPi8-Fle. Through the information source of an information fragment, the document containing it can be found quickly, so that the user can conveniently view, call up and select other parts of that document related to the fragment.
The text content of an information fragment is identified as follows: for an information fragment in text format, the fragment itself serves as its text content;
for an information fragment in picture format, the text content is obtained through the following steps:
Step 1: scan the selected picture and analyze the layout of the picture;
Step 2: perform line segmentation and character segmentation on the picture;
Step 3: detect the shapes of the characters, letters and symbols in the picture under two modes, gradually darkening and gradually brightening; the characters, letters and symbols whose shape remains unchanged are marked as determined characters and matched in a text library, and the matched text is output; those whose shape changes are marked as undetermined characters;
Step 4: determine each undetermined character according to its shape and its semantic relationship with the determined characters within a certain range before and after it, match it in the text library, and output the matched text;
Step 5: combine the results and output the complete text content.
Alternatively, OCR technology, for example the Hanwang OCR tool, may be used to recognize the text information in the picture.
The information identification module separates the identified text content and information source of the information fragment and stores them in the corresponding databases for collection and storage;
The databases comprise a first database 6 and a second database 8;
The first database is used to store the text content of information fragments;
The second database is used to store the information source of information fragments;
The text content and the information source of the same information fragment have a mapping relationship across the two databases, so that either one can be used to find the other.
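The separated storage with a mapping relationship can be sketched with two tables that share a fragment identifier. This is only an illustration under stated assumptions: a single in-memory SQLite connection with two tables stands in for the first and second databases, and the table, column and function names are hypothetical.

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory store; two tables stand in for the two databases."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE text_content ("
        "fragment_id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE info_source ("
        "fragment_id INTEGER PRIMARY KEY, source TEXT NOT NULL)"
    )
    return conn


def save_fragment(conn: sqlite3.Connection, content: str, source: str) -> int:
    """Store text content and information source separately under one shared id."""
    cur = conn.execute("INSERT INTO text_content (content) VALUES (?)", (content,))
    fragment_id = cur.lastrowid
    conn.execute(
        "INSERT INTO info_source (fragment_id, source) VALUES (?, ?)",
        (fragment_id, source),
    )
    conn.commit()
    return fragment_id


def source_of(conn: sqlite3.Connection, fragment_id: int) -> Optional[str]:
    """Follow the mapping from a fragment's text content to its information source."""
    row = conn.execute(
        "SELECT source FROM info_source WHERE fragment_id = ?", (fragment_id,)
    ).fetchone()
    return row[0] if row else None


def content_of(conn: sqlite3.Connection, fragment_id: int) -> Optional[str]:
    """Follow the mapping from an information source back to the text content."""
    row = conn.execute(
        "SELECT content FROM text_content WHERE fragment_id = ?", (fragment_id,)
    ).fetchone()
    return row[0] if row else None
```

The shared identifier plays the role of the mapping relationship: given either record, the other can be found with one lookup.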
Retrieval may be performed in the first database and the second database by text content and information source, so as to find the information fragments matching the user's search terms, which are output and displayed by the document output module.
The information fragment association module retrieves the text content of every two information fragments from the database and performs the similarity calculation; for each information fragment, according to the set threshold, it selects the other information fragments whose similarity to that fragment falls within the predefined threshold range and associates them with it;
The document output module is configured to display the text content and the information source of the information fragment in the document format chosen by the user, and to display the text content and the information source of the information fragments associated with that information fragment.
The directory index module is configured to build an index list for the text content of the information fragments in the first database;
The titles in the index list may be numbers arranged in a certain order, for example logical numbers assigned according to the length or size of the information fragments or according to the order of their acquisition time;
A title may also be one compiled by the user, or text marked by the user in the information fragment. For an information fragment in picture format, marking is done by selecting text in the picture through a screenshot; after recognition by the information identification module, that text is used as the title in the index list;
Further, the user determines keywords in the information fragment; there may be one or more keywords, and a keyword is either compiled by the user or marked by the user in the information fragment;
After the keywords of the information fragment are determined, they are displayed together with the title of the index list entry corresponding to the information fragment, serving as a summary of the fragment so that the user can identify it more clearly.
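One way to picture the index list is as numbered titles ordered by acquisition time, each followed by the fragment's keywords as a summary. The sketch below assumes that ordering and a three-digit numbering scheme; the `Fragment` class, its fields and the output layout are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fragment:
    text: str
    acquired_at: float  # acquisition time, e.g. a Unix timestamp (assumed)
    keywords: List[str] = field(default_factory=list)


def build_index(fragments: List[Fragment]) -> List[str]:
    """Return index entries: a logical number plus keywords as a summary."""
    ordered = sorted(fragments, key=lambda f: f.acquired_at)
    entries = []
    for number, frag in enumerate(ordered, start=1):
        title = f"{number:03d}"
        summary = ", ".join(frag.keywords)
        entries.append(f"{title}  {summary}" if summary else title)
    return entries
```

Ordering by fragment length or size, which the description also allows, would only require a different sort key.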
The information fragment that the user selects in the index list, together with the information fragments associated with it, is output and displayed by the document output module.
The information duplicate-check module is configured to compare the text contents identified by the information identification module with one another and, when duplicated text content is detected, to prompt the user as to whether the duplicated text content should still be collected; and, according to the user's choice, either to continue the collection or to retain the one copy of the duplicated text content chosen by the user for collection.
As shown in Figure 1, the invention also discloses a method for the associated output of information fragments, comprising:
S11: identifying the text content of multiple information fragments selected by a user, and collecting and storing the text content of all the information fragments obtained;
S12: performing a similarity calculation on the text content of any two of the information fragments to obtain the similarity between them;
S13: after the user selects an information fragment to view, creating a document that displays the text content of that information fragment, and displaying the text content of the other information fragments in the document in order of the magnitude of the similarity.
The similarity calculation specifically comprises:
selecting a first information fragment $D_1$ and a second information fragment $D_2$ from the information fragments;
according to the text content of the first information fragment and the text content of the second information fragment, respectively determining the key characters/words whose word frequency is higher than a predefined second threshold as feature items;
establishing a first feature set of the first information fragment, as follows:
$D_1 = \{T_{11}, W_{11}; T_{12}, W_{12}; \ldots; T_{1n}, W_{1n}\}$;
wherein $T_{1n}$ is a feature item of $D_1$, $W_{1n}$ is the weight determined according to the word frequency of $T_{1n}$, and n is the index of the feature item in the first feature set;
establishing a second feature set of the second information fragment, as follows:
$D_2 = \{T_{21}, W_{21}; T_{22}, W_{22}; \ldots; T_{2m}, W_{2m}\}$;
wherein $T_{2m}$ is a feature item of $D_2$, $W_{2m}$ is the weight determined according to the word frequency of $T_{2m}$, and m is the index of the feature item in the second feature set;
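Building such a feature set can be sketched as counting word frequencies and keeping the terms above a threshold. The sketch makes several assumptions that this description does not fix: tokenization with a simple `\w+` regular expression (which does not segment Chinese text into words), a raw-count threshold, and relative frequency as the weight. The function name `feature_set` is hypothetical.

```python
import re
from collections import Counter
from typing import Dict


def feature_set(text: str, threshold: int = 1) -> Dict[str, float]:
    """Map each feature item T to a weight W derived from its word frequency.

    A term becomes a feature item when its count is higher than `threshold`.
    The weight is the term's relative frequency in the text (an assumption).
    """
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    total = len(words)
    return {
        term: count / total
        for term, count in counts.items()
        if count > threshold
    }
```

For the text "a b a c a b" with the default threshold, the terms "a" and "b" are kept as feature items and "c" is dropped because it occurs only once.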
calculating the similarity of the two information fragments using the cosine formula, which is as follows:
$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_{k} W_{1k} W_{2k}}{\sqrt{\sum_{k} W_{1k}^{2}} \, \sqrt{\sum_{k} W_{2k}^{2}}}$$
wherein $\mathrm{Sim}(D_1, D_2)$ is the similarity of the two information fragments, and k is the index of the feature item.
The fragment texts $D_1$ and $D_2$ are represented as weight vectors in a vector space model, and the similarity is calculated with the above formula.
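The cosine calculation over two feature sets can be sketched as follows. The feature sets are represented as dictionaries from feature item to weight, which is an assumption made for this example; a feature item missing from one fragment is treated as having weight zero there, and an empty feature set yields a similarity of 0.0 by convention.

```python
import math
from typing import Dict


def cosine_similarity(d1: Dict[str, float], d2: Dict[str, float]) -> float:
    """Cosine of the angle between two weight vectors keyed by feature item."""
    # Only feature items present in both sets contribute to the dot product.
    dot = sum(weight * d2[term] for term, weight in d1.items() if term in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for empty feature sets
    return dot / (norm1 * norm2)
```

With non-negative weights the result lies between 0 and 1, where 1 means the two weight vectors point in the same direction and 0 means the fragments share no feature items.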
The similarity between each information fragment and every other information fragment is calculated in the manner described above;
All information fragments whose similarity to a given information fragment falls within the threshold range (low, high) are selected and associated with that information fragment, and an association table is established;
The association table contains the information of the other information fragments associated with the information fragment, and that information is sorted in the association table in descending order of similarity;
After the user selects an information fragment to view, a document is created that displays the text content of that information fragment, and the text content of the other information fragments is displayed below it in the order in which the information fragments are arranged in the association table.
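The association table can be sketched as a mapping from each fragment to its related fragments, filtered by the threshold range and sorted by similarity. The fragment identifiers, the input layout (one similarity per unordered pair) and the inclusive treatment of the bounds `low` and `high` are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def build_association_table(
    similarity: Dict[Tuple[int, int], float], low: float, high: float
) -> Dict[int, List[Tuple[int, float]]]:
    """Associate fragments whose pairwise similarity lies within [low, high].

    Each entry lists (other_fragment_id, similarity), largest similarity first.
    """
    table: Dict[int, List[Tuple[int, float]]] = {}
    for (a, b), sim in similarity.items():
        if low <= sim <= high:
            # The association is symmetric, so record it for both fragments.
            table.setdefault(a, []).append((b, sim))
            table.setdefault(b, []).append((a, sim))
    for related in table.values():
        related.sort(key=lambda item: item[1], reverse=True)
    return table
```

Displaying the related fragments then amounts to walking the list stored for the selected fragment from top to bottom.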
A preferred embodiment of the method disclosed by the invention is described below, as shown in Figure 2:
S21: fragment collection;
Wait for the user to trigger a specific global hotkey, invoke the corresponding selection function and provide it to the user, and let the user select information fragments in the corresponding format;
S22: fragment identification;
After the user has selected an information fragment, the selected information fragment is identified, and its text content and information source are obtained;
S23: duplicate check;
After the text content of the multiple information fragments selected by the user is identified, the text content of the information fragments is compared with one another and, when duplicated text content is detected, the user is prompted as to whether the duplicated text content should still be collected;
and, according to the user's choice, either the collection is continued or only one copy of the duplicated text content is retained for collection.
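The duplicate check can be sketched as a pass over the identified text contents that asks a callback what to do with each repeat. The sketch simplifies in ways this description does not specify: two texts count as duplicates only when they are identical after whitespace normalization, and the user prompt is modelled as a function `keep_duplicate`, a hypothetical name.

```python
from typing import Callable, List


def collect_with_duplicate_check(
    texts: List[str], keep_duplicate: Callable[[str], bool]
) -> List[str]:
    """Collect text contents, consulting the user whenever a duplicate appears.

    `keep_duplicate(text)` stands in for the user prompt: True continues the
    collection of the duplicate, False retains only the copy already collected.
    """
    collected: List[str] = []
    seen = set()
    for text in texts:
        key = " ".join(text.split())  # normalize whitespace before comparing
        if key in seen and not keep_duplicate(text):
            continue
        seen.add(key)
        collected.append(text)
    return collected
```

Detecting partially overlapping content, rather than whole-text repeats, would need a different comparison, for example the similarity measure used in the association step.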
S24: collection and storage;
The text content and the information source of all information fragments are separated and stored in their respective databases.
S25: association;
The similarity between the text content of every two information fragments is calculated and, for each information fragment, the other information fragments whose similarity falls within the threshold range are associated with it;
S26: index building;
An index list is built according to the text content of the information fragments in the database.
This step further comprises determining the keywords of each information fragment;
the keywords are displayed in the index list as a summary.
S27: fragment selection;
The user selects the needed information fragment in the index list according to the keywords; or
retrieval is performed in the database using the text content or the information source of an information fragment as the search term, so as to obtain the retrieved information fragments;
S28: fragment output;
The text content of the information fragment selected by the user in the index list, or of the information fragment obtained by retrieval in the database, is displayed in a single document in the document format chosen by the user, and the text content and the information source of the other information fragments associated with that information fragment are displayed in order of the magnitude of the similarity.
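The output step can be sketched as assembling one plain-text document from the selected fragment and its associated fragments. The tuple layouts, the "Source:" label and the bracketed similarity prefix are assumptions made for illustration; the description leaves the document format to the user's choice.

```python
from typing import List, Tuple


def render_document(
    selected: Tuple[str, str], related: List[Tuple[str, str, float]]
) -> str:
    """Render the selected fragment followed by its associated fragments.

    `selected` is (text, source); each related item is (text, source, similarity).
    Related fragments are listed with the largest similarity first.
    """
    text, source = selected
    lines = [text, f"Source: {source}", ""]
    for rel_text, rel_source, sim in sorted(
        related, key=lambda item: item[2], reverse=True
    ):
        lines.append(f"[{sim:.2f}] {rel_text}")
        lines.append(f"Source: {rel_source}")
    return "\n".join(lines)
```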
The above description of the embodiments is only intended to help in understanding the method of the present invention and its core idea. For a person of ordinary skill in the art, changes may be made to the specific embodiments and to the scope of application in accordance with the idea of the present invention. In summary, the content of this description should not be construed as limiting the present invention.