CN103729422A - Information fragment associative output method and system - Google Patents


Info

Publication number
CN103729422A
Authority
CN
China
Prior art keywords
information
text
content
fragmentation
information fragmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310712337.0A
Other languages
Chinese (zh)
Inventor
江潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd filed Critical WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310712337.0A priority Critical patent/CN103729422A/en
Publication of CN103729422A publication Critical patent/CN103729422A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and system for the associated output of information fragments. The method includes: identifying the text content of a plurality of information fragments selected by a user, and collecting and storing the text content of all the fragments obtained; performing a similarity calculation on the text content of every two information fragments to obtain the similarity between them; and, after the user selects an information fragment to view, creating a document that displays the text content of that fragment and displaying the text content of the other fragments in the document in order of similarity. Because the identified text content is stored automatically as each fragment is identified, the otherwise cumbersome collection steps are greatly simplified; because the fragments are associated with one another, the mental effort spent on reading and recognizing them is reduced.

Description

Method and system for the associated output of information fragments
Technical field
The present invention relates to the computer field, and in particular to a method and system for the associated output of information fragments.
Background
In the Internet era, completing a report or writing a document often requires collecting a large amount of information. Most of this information exists as fragments scattered across different places. Once a fragment is found, its text has to be gathered by copying and pasting from the full source manuscript. After the fragments have been collected, a second problem arises: the large set of fragments is disorganized. The fragments need to be arranged according to some rule, so as to reduce the mental effort spent on reading and recognizing them and to make organizing them more efficient.
Summary of the invention
The present invention aims to provide a method and system for the associated output of information fragments, to solve the prior-art problem that selected information fragments are difficult to organize.
The invention discloses a method for the associated output of information fragments, comprising:
Identifying the text content of a plurality of information fragments selected by a user, and collecting and storing the text content of all the fragments obtained;
Performing a similarity calculation on the text content of every two information fragments to obtain the similarity between them;
After the user selects an information fragment to view, creating a document that displays the text content of that fragment, and displaying the text content of the other information fragments in the document in order of similarity.
Preferably, the method further comprises:
After the similarity between information fragments has been obtained, for each information fragment, selecting the other fragments whose similarity with this fragment falls within a predefined first threshold range, and associating the selected fragments with this fragment;
Displaying in the document the text content of the fragment selected by the user, and displaying the text content of the other fragments associated with it in the document in order of similarity.
Preferably, the similarity calculation comprises:
Selecting a first information fragment $D_1$ and a second information fragment $D_2$ from the information fragments;
From the text content of the first fragment and of the second fragment respectively, taking the key characters/words whose word frequency is higher than a predefined second threshold as feature items;
Building the first feature set of the first information fragment, as follows:
$D_1 = \{T_{11}, W_{11};\ T_{12}, W_{12};\ \ldots;\ T_{1n}, W_{1n}\}$;
Where $T_{1n}$ is a feature item of $D_1$, $W_{1n}$ is the weight determined from its word frequency, and $n$ is the index of the feature item in the first feature set;
Building the second feature set of the second information fragment, as follows:
$D_2 = \{T_{21}, W_{21};\ T_{22}, W_{22};\ \ldots;\ T_{2m}, W_{2m}\}$;
Where $T_{2m}$ is a feature item of $D_2$, $W_{2m}$ is the weight determined from its word frequency, and $m$ is the index of the feature item in the second feature set;
Calculating the similarity of the two information fragments with the cosine formula, as follows:
$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{\left(\sum_{k=1}^{n} w_{1k}^{2}\right)\left(\sum_{k=1}^{n} w_{2k}^{2}\right)}};$$
Where $\mathrm{Sim}(D_1, D_2)$ is the similarity of the two information fragments and $k$ is the index of the feature item.
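As an illustration only, the feature-set construction and cosine formula above might be sketched in Python as follows. The function names, the whitespace tokenization and the use of the raw term count as the weight are assumptions made for this sketch; the text above states only that feature items are the terms whose frequency exceeds the second threshold and that weights are determined from word frequency.

```python
import math
from collections import Counter


def feature_set(text, freq_threshold=1):
    """Return {feature item: weight} for one fragment.

    Assumptions of this sketch: terms are lowercase whitespace-separated
    tokens, and the weight is simply the raw term count. Only terms whose
    count is strictly higher than freq_threshold become feature items.
    """
    counts = Counter(text.lower().split())
    return {term: float(n) for term, n in counts.items() if n > freq_threshold}


def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse term-weight vectors."""
    # Only terms present in both feature sets contribute to the numerator.
    dot = sum(w * d2[term] for term, w in d1.items() if term in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # an empty feature set is treated as unrelated
    return dot / (norm1 * norm2)
```

Treating a feature item that is missing from one fragment as having weight zero lets both sums run over a common set of terms, which is one way to reconcile the two feature sets having different sizes n and m.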
Preferably, the method further comprises:
Building an index list for all the information fragments that have been collected and stored;
The user selecting the information fragment to view from the index list.
Preferably, after the user selects information fragments, the information source of each fragment is identified;
The text content and the information source of each information fragment have a mapping relationship;
When the text content of an information fragment is displayed, the information source of that fragment is displayed as well.
Preferably, the information fragments comprise fragments in text format and fragments in picture format.
Preferably, the method further comprises:
When the user triggers one of a plurality of global hotkeys, invoking the corresponding selection function to select an information fragment in text format or picture format.
Preferably, the method further comprises:
After the text content of the plurality of information fragments selected by the user has been identified, comparing the text contents of the fragments with one another and, when duplicated text content is detected, asking the user whether the duplicated portion should still be collected;
And, according to the user's choice, either continuing the collection unchanged or retaining only one copy of the duplicated text content for collection.
The invention also discloses a system for the associated output of information fragments, comprising:
An information identification module, for identifying the text content and the information source of the information fragments selected by the user, and placing the identified text content and information source in the corresponding databases for collection and storage;
The databases comprise a first database for storing the text content of the information fragments and a second database for storing their information sources; the text content and the information source of the same fragment have a mapping relationship across the two databases;
A directory index module, for building an index list of all the information fragments in the databases, from which the user makes a selection;
A document association module, for calculating the similarity of every two information fragments;
A document output module, for displaying the text content and the information source of the fragment selected by the user in the document format chosen by the user, and displaying the text content of the other information fragments in the document in order of similarity.
Preferably, the system further comprises:
A parsing module, for recognizing the global hotkey triggered by the user and sending the control instruction mapped to the recognized hotkey to the corresponding selection module, which provides the user with the corresponding selection function;
A duplicate-check module, for comparing the text contents identified by the information identification module with one another and, when duplicated text content is detected, asking the user whether the duplicated portion should still be collected; and, according to the user's choice, either continuing the collection unchanged or retaining only one copy of the duplicated text content for collection.
The method and system for the associated output of information fragments of the present invention have the following advantages:
1. The selected information fragments are stored in the database automatically, so the user can view the text content of the fragments needed directly;
2. When the user views an information fragment, the fragments associated with it are output at the same time, which helps the user review them;
3. An index list is built, so the user can further screen the initially selected fragments for the ones actually needed.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the accompanying drawings:
Fig. 1 shows the first flowchart of the embodiment;
Fig. 2 shows the second flowchart of the embodiment;
Fig. 3 shows the structural diagram of the embodiment.
Embodiments
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
As shown in Fig. 3, the invention discloses a system for the associated output of information fragments, comprising:
A parsing module 1, a text selection module 2, a picture selection module 3, an information identification module 4, a directory index module 5, an information fragment association module 10, a document output module 7 and a duplicate-check module 9;
The parsing module recognizes the global hotkey triggered by the user and sends the control instruction mapped to the recognized hotkey to the corresponding selection module, which provides the user with the corresponding selection function;
A global hotkey can be a single key or a combination of several keys.
When selecting the required information fragments, the user is not limited to selectable text; fragments also include text that cannot be selected and pictures that contain fragment information;
When the parsing module recognizes that the user has triggered the first global hotkey, it sends the control instruction mapped to the first global hotkey to the text selection module;
On receiving the control instruction mapped to the first global hotkey from the parsing module, the text selection module provides the user with the function of directly selecting an information fragment in text format;
When the parsing module recognizes that the user has triggered the second global hotkey, it sends the control instruction mapped to the second global hotkey to the picture selection module;
On receiving the control instruction mapped to the second global hotkey from the parsing module, the picture selection module provides the user with the function of selecting an information fragment in picture format by taking a screenshot.
After the user selects an information fragment, the selected fragment is sent to the information identification module;
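The hotkey handling described above amounts to a lookup from a recognized hotkey to a selection function. A minimal sketch, assuming hypothetical key combinations and placeholder selection functions (none of these names or key bindings come from the patent):

```python
from typing import Callable, Dict, Optional


def select_text() -> str:
    """Placeholder for the text selection module."""
    return "text"


def select_picture() -> str:
    """Placeholder for the picture (screenshot) selection module."""
    return "picture"


# Hypothetical mapping from global hotkeys to selection modules.
HOTKEY_TABLE: Dict[str, Callable[[], str]] = {
    "ctrl+alt+t": select_text,      # first global hotkey
    "ctrl+alt+p": select_picture,   # second global hotkey
}


def dispatch(hotkey: str) -> Optional[str]:
    """Forward a recognized hotkey to its selection module, if one is mapped."""
    handler = HOTKEY_TABLE.get(hotkey.lower())
    if handler is None:
        return None  # not a registered global hotkey
    return handler()
```

A real system would register the hotkeys with the operating system; that part is platform-specific and is deliberately left out of this sketch.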
The information identification module receives the information fragment selected by the user and identifies its text content and information source. For a local resource, the information source is the storage address of the fragment, for example the path c:\123\ of the document containing the fragment; the containing document can be in any of various formats, such as office documents, text files or compiled documents. For a network resource, the information source is the network address of the fragment, for example: http://wenku.baidu.com/link url=yKLV9Z1UyA3SCZqcZkDM0miWl5LWLgEJvOh_cY-iPQRIOP23sWg2 sNgP_2-is2h_32e2Cr_u3HjVmraorpLE pt8v9J5VGTKEC9dVPi8-Fle. Through the information source of a fragment, the document containing it can be located quickly, so the user can conveniently view, retrieve and select further parts of that document related to the fragment.
The text content of an information fragment is identified as follows. For a fragment in text format, the fragment itself is its text content;
For a fragment in picture format, the text content is obtained as follows:
Step 1: scan the selected picture and analyze its layout;
Step 2: segment the picture into lines and then into characters;
Step 3: detect the shapes of the characters, letters and symbols in the picture under two modes, gradually darkening and gradually brightening; characters, letters and symbols whose shape remains unchanged are marked as determined characters, matched against the text library, and the matched text is output; those whose shape changes are marked as undetermined characters;
Step 4: determine each undetermined character from its shape and its semantic relationship to the determined characters within a certain range before and after it, match it against the text library, and output the matched text;
Step 5: combine the results and output the complete text content.
Alternatively, OCR technology can be used to recognize the text in the picture, for example the Hanwang OCR tool.
The information identification module separates the identified text content of the fragment from its information source and stores each in the corresponding database;
The databases comprise a first database 6 and a second database 8;
The first database stores the text content of the information fragments;
The second database stores the information sources of the information fragments;
The text content and the information source of the same fragment have a mapping relationship across the two databases, so that either one can be used to find the other.
The first and second databases can be searched by text content and by information source to find the information fragments that match the user's search term; the results are output and displayed by the document output module.
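One way to picture the two mapped databases is as two tables that share a fragment identifier. The sketch below uses in-memory dictionaries and a hypothetical `FragmentStore` class purely for illustration; the patent does not specify the storage technology or how the mapping is keyed.

```python
import itertools


class FragmentStore:
    """Two stores sharing a fragment id, standing in for the two databases."""

    def __init__(self):
        self.text_db = {}    # first database: fragment id -> text content
        self.source_db = {}  # second database: fragment id -> information source
        self._ids = itertools.count(1)

    def add(self, text, source):
        """Separate text and source, store each, and return the shared id."""
        fid = next(self._ids)
        self.text_db[fid] = text
        self.source_db[fid] = source
        return fid

    def search(self, term):
        """Return ids of fragments whose text or source contains the term."""
        term = term.lower()
        return sorted(
            fid for fid in self.text_db
            if term in self.text_db[fid].lower()
            or term in self.source_db[fid].lower()
        )
```

Because both dictionaries are keyed by the same id, the text of a fragment leads directly to its source and vice versa, which is the mapping relationship the description relies on.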
The information fragment association module retrieves the text content of every two information fragments from the database and calculates their similarity; for each fragment, it selects the other fragments whose similarity with it lies within the predefined threshold range and associates them with it;
The document output module displays the text content and the information source of the selected fragment in the document format chosen by the user, together with the text content and the information source of the fragments associated with it.
The directory index module builds an index list for the text content of the information fragments in the first database;
The titles in the index list can be numbers assigned in a certain order, for example sequence numbers ordered by the length or size of the fragments or by the time at which they were acquired;
A title can also be one written by the user, or text that the user marks within the fragment. For a fragment in picture format, the user marks text by selecting it in the picture with a screenshot; after the information identification module has recognized it, the text is used as the title in the index list;
Further, the user determines one or more keywords for an information fragment; a keyword is either written by the user or marked by the user within the fragment;
Once the keywords of a fragment have been determined, they are displayed together with the title of the fragment in the index list as a summary of the fragment, so that the user can identify the fragment more clearly and precisely.
The fragment that the user selects in the index list, together with the fragments associated with it, is output and displayed by the document output module.
The duplicate-check module compares the text contents identified by the information identification module with one another and, when duplicated text content is detected, asks the user whether the duplicated portion should still be collected; according to the user's choice, either the collection continues unchanged or only the copy of the duplicated text content chosen by the user is retained for collection.
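The index entries described above combine a title (a default sequence number, or one supplied by the user) with optional keywords shown as a summary. A small sketch under assumed conventions: the dictionary fields `title` and `keywords`, the default title format and the bracketed keyword display are all hypothetical choices for this example.

```python
def build_index(fragments):
    """Return one display label per fragment for the index list.

    fragments: list of dicts with a 'text' field and optional 'title'
    and 'keywords' fields (field names are assumptions of this sketch).
    """
    index = []
    for number, fragment in enumerate(fragments, start=1):
        # Fall back to a sequence number in acquisition order when the
        # user has not supplied or marked a title.
        title = fragment.get("title") or f"Fragment {number}"
        keywords = fragment.get("keywords", [])
        if keywords:
            title = f"{title} [{', '.join(keywords)}]"
        index.append(title)
    return index
```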
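The duplicate check can be sketched as grouping fragments by a normalized form of their text and then keeping either every copy or only one copy per group, depending on the user's answer. Exact matching after whitespace and case normalization is an assumption of this sketch; the patent does not say how repetition is detected or how partial overlaps are handled.

```python
def _normalize(text):
    """Collapse whitespace and ignore case before comparing texts."""
    return " ".join(text.split()).lower()


def find_duplicates(texts):
    """Return groups of indices whose normalized text is identical."""
    groups = {}
    for i, text in enumerate(texts):
        groups.setdefault(_normalize(text), []).append(i)
    return [indices for indices in groups.values() if len(indices) > 1]


def resolve(texts, keep_all):
    """Apply the user's choice: keep every copy, or only the first of each."""
    if keep_all:
        return list(texts)
    seen = set()
    kept = []
    for text in texts:
        key = _normalize(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept
```

Here `keep_all` stands in for the user's reply to the prompt; in this sketch the first copy is the one retained, whereas the description lets the user choose which copy to keep.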
As shown in Fig. 1, the invention also discloses a method for the associated output of information fragments, comprising:
S11: identifying the text content of a plurality of information fragments selected by the user, and collecting and storing the text content of all the fragments obtained;
S12: performing a similarity calculation on the text content of every two information fragments to obtain the similarity between them;
S13: after the user selects an information fragment to view, creating a document that displays the text content of that fragment, and displaying the text content of the other information fragments in the document in order of similarity.
The similarity calculation specifically comprises:
Selecting a first information fragment $D_1$ and a second information fragment $D_2$ from the information fragments;
From the text content of the first fragment and of the second fragment respectively, taking the key characters/words whose word frequency is higher than a predefined second threshold as feature items;
Building the first feature set of the first information fragment, as follows:
$D_1 = \{T_{11}, W_{11};\ T_{12}, W_{12};\ \ldots;\ T_{1n}, W_{1n}\}$;
Where $T_{1n}$ is a feature item of $D_1$, $W_{1n}$ is the weight determined from the word frequency of $T_{1n}$, and $n$ is the index of the feature item in the first feature set;
Building the second feature set of the second information fragment, as follows:
$D_2 = \{T_{21}, W_{21};\ T_{22}, W_{22};\ \ldots;\ T_{2m}, W_{2m}\}$;
Where $T_{2m}$ is a feature item of $D_2$, $W_{2m}$ is the weight determined from the word frequency of $T_{2m}$, and $m$ is the index of the feature item in the second feature set;
Calculating the similarity of the two information fragments with the cosine formula, as follows:
$$\text{Cosine:}\quad \mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{\left(\sum_{k=1}^{n} w_{1k}^{2}\right)\left(\sum_{k=1}^{n} w_{2k}^{2}\right)}};$$
Where $\mathrm{Sim}(D_1, D_2)$ is the similarity of the two information fragments and $k$ is the index of the feature item.
Representing the fragment texts $D_1$ and $D_2$ in the vector space model, the calculation is as follows:
$$\cos(D_1, D_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert \cdot \lVert d_2 \rVert} = \frac{\sum_{i=0}^{k} w(d_1, t_i) \cdot w(d_2, t_i)}{\sqrt{\sum_{i=0}^{n} w(d_1, t_i)^{2}} \cdot \sqrt{\sum_{j=0}^{m} w(d_2, t_j)^{2}}};$$
The similarity between each information fragment and every other information fragment is calculated in this way;
All fragments whose similarity with a given fragment lies within the threshold range (low, high) are selected and associated with that fragment, and an association table is built:
The association table contains the information of the other fragments associated with the fragment, sorted by similarity from largest to smallest;
After the user selects an information fragment to view, a document is created that displays the text content of that fragment; below it, the text content of the other fragments is displayed in the order in which they appear in the association table.
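A small worked example may make the vector-space calculation concrete. The two weight vectors below are invented for illustration over three shared feature items and do not come from the patent:

```latex
% Illustrative weights over three feature items t_1, t_2, t_3
d_1 = (2, 1, 0), \qquad d_2 = (1, 1, 1)

d_1 \cdot d_2 = 2 \cdot 1 + 1 \cdot 1 + 0 \cdot 1 = 3

\lVert d_1 \rVert = \sqrt{2^2 + 1^2 + 0^2} = \sqrt{5}, \qquad
\lVert d_2 \rVert = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3}

\cos(D_1, D_2) = \frac{3}{\sqrt{5} \cdot \sqrt{3}} = \frac{3}{\sqrt{15}} \approx 0.775
```

A value near 1 indicates that the two fragments weight the same feature items in similar proportions, while a value near 0 indicates that they share almost no weighted feature items.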
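The association table can be sketched as a mapping from each fragment to its related fragments, filtered by the threshold range and sorted by descending similarity. The function name, the input format (a dictionary of precomputed pairwise similarities) and the inclusive treatment of the range bounds are assumptions of this sketch.

```python
def build_association_table(similarities, low, high):
    """Build {fragment id: [(other id, similarity), ...]} sorted descending.

    similarities: {(id_a, id_b): score} with one entry per unordered pair.
    A pair is associated when low <= score <= high (inclusive bounds are
    an assumption; the description only gives a range (low, high)).
    """
    table = {}
    for (a, b), score in similarities.items():
        table.setdefault(a, [])
        table.setdefault(b, [])
        if low <= score <= high:
            # Association is symmetric, so record it for both fragments.
            table[a].append((b, score))
            table[b].append((a, score))
    for related in table.values():
        related.sort(key=lambda pair: pair[1], reverse=True)
    return table
```

When the user picks a fragment, its row in the table already gives the display order for the associated fragments.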
A preferred embodiment of the method disclosed by the invention is given below, as shown in Fig. 2:
S21: fragment collection;
Wait for the user to trigger a specific global hotkey, invoke the corresponding selection function and provide it to the user, who selects an information fragment in the corresponding format;
S22: fragment identification;
After the user has selected an information fragment, the selected fragment is identified to obtain its text content and information source;
S23: duplicate checking;
After the text content of the plurality of information fragments selected by the user has been identified, the text contents of the fragments are compared with one another; when duplicated text content is detected, the user is asked whether the duplicated portion should still be collected;
According to the user's choice, either the collection continues unchanged or only one copy of the duplicated text content is retained for collection.
S24: collection and storage;
The text content and the information source of every information fragment are separated and stored in the corresponding databases.
S25: association:
The similarity of the text content of every two information fragments is calculated; for each fragment, the other fragments whose similarity lies within the threshold range are associated with it;
S26: building the directory;
An index list is built from the text content of the information fragments in the database.
This step also includes determining the keywords of the information fragments;
The keywords are displayed in the index list as a summary.
S27: selecting a fragment;
The user selects the required information fragment in the index list according to the keywords; or
The user searches the database using the text content or the information source of a fragment as the search term, and obtains the fragments retrieved;
S28: outputting the fragment;
The text content of the fragment selected in the index list, or of the fragment retrieved from the database, is displayed in a single document in the document format chosen by the user, and the text content and information source of the other fragments associated with it are displayed in order of similarity.
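The output step can be illustrated with a plain-text rendering: the selected fragment first, followed by its associated fragments in descending order of similarity, each with its information source. The function name and the line layout are assumptions of this sketch; the description leaves the document format to the user's choice.

```python
def render_document(selected, related):
    """Render one plain-text document for a selected fragment.

    selected: (text, source) of the fragment the user chose.
    related:  list of (text, source, similarity) for associated fragments.
    """
    text, source = selected
    lines = [text, f"Source: {source}", ""]
    # Associated fragments follow in order of decreasing similarity.
    for r_text, r_source, score in sorted(related, key=lambda r: r[2], reverse=True):
        lines.append(f"[{score:.2f}] {r_text}")
        lines.append(f"Source: {r_source}")
    return "\n".join(lines)
```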
The above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and to the scope of application. In summary, this description should not be construed as limiting the invention.

Claims (10)

1. A method for the associated output of information fragments, characterized by comprising:
Identifying the text content of a plurality of information fragments selected by a user, and collecting and storing the text content of all the fragments obtained;
Performing a similarity calculation on the text content of every two information fragments to obtain the similarity between them;
After the user selects an information fragment to view, creating a document that displays the text content of that fragment, and displaying the text content of the other information fragments in the document in order of similarity.
2. The method according to claim 1, characterized by further comprising:
After the similarity between information fragments has been obtained, for each information fragment, selecting the other fragments whose similarity with this fragment falls within a predefined first threshold range, and associating the selected fragments with this fragment;
Displaying in the document the text content of the fragment selected by the user, and displaying the text content of the other fragments associated with it in the document in order of similarity.
3. The method according to claim 1, characterized in that the similarity calculation comprises:
Selecting a first information fragment $D_1$ and a second information fragment $D_2$ from the information fragments;
From the text content of the first fragment and of the second fragment respectively, taking the key characters/words whose word frequency is higher than a predefined second threshold as feature items;
Building the first feature set of the first information fragment, as follows:
$D_1 = \{T_{11}, W_{11};\ T_{12}, W_{12};\ \ldots;\ T_{1n}, W_{1n}\}$;
Where $T_{1n}$ is a feature item of $D_1$, $W_{1n}$ is the weight determined from its word frequency, and $n$ is the index of the feature item in the first feature set;
Building the second feature set of the second information fragment, as follows:
$D_2 = \{T_{21}, W_{21};\ T_{22}, W_{22};\ \ldots;\ T_{2m}, W_{2m}\}$;
Where $T_{2m}$ is a feature item of $D_2$, $W_{2m}$ is the weight determined from its word frequency, and $m$ is the index of the feature item in the second feature set;
Calculating the similarity of the two information fragments with the cosine formula, as follows:
$$\mathrm{Sim}(D_1, D_2) = \cos\theta = \frac{\sum_{k=1}^{n} w_{1k} \times w_{2k}}{\sqrt{\left(\sum_{k=1}^{n} w_{1k}^{2}\right)\left(\sum_{k=1}^{n} w_{2k}^{2}\right)}};$$
Where $\mathrm{Sim}(D_1, D_2)$ is the similarity of the two information fragments and $k$ is the index of the feature item.
4. The method according to claim 1, characterized by further comprising:
Building an index list for all the information fragments that have been collected and stored;
The user selecting the information fragment to view from the index list.
5. The method according to claim 1, characterized in that, after the user selects information fragments, the information source of each fragment is identified;
The text content and the information source of each information fragment have a mapping relationship;
When the text content of an information fragment is displayed, the information source of that fragment is displayed as well.
6. The method according to claim 1, characterized in that the information fragments comprise fragments in text format and fragments in picture format.
7. The method according to claim 6, characterized by further comprising:
When the user triggers one of a plurality of global hotkeys, invoking the corresponding selection function to select the information fragment in text format or picture format.
8. The method according to claim 1, characterized by further comprising:
After the text content of the plurality of information fragments selected by the user has been identified, comparing the text contents of the fragments with one another and, when duplicated text content is detected, asking the user whether the duplicated portion should still be collected;
And, according to the user's choice, either continuing the collection unchanged or retaining only one copy of the duplicated text content for collection.
9. A system for the associative output of information fragments, characterized in that it comprises:
an information identification module, configured to identify the text content and the information source of the information fragments selected by the user, and to put the identified text content and information source into the corresponding databases for collection and storage;
the databases comprise: a first database for storing the text content of the information fragments and a second database for storing the information source of the information fragments; the text content and the information source of the same information fragment have a mapping relationship across the two databases;
a directory index module, configured to establish an index list of all information fragments in the databases for the user to select from;
a document association module, configured to calculate the similarity of every two information fragments;
a document output module, configured to display the text content and the information source of the information fragment selected by the user in the document format selected by the user, and to display the text content of the other information fragments in the document in order of the magnitude of the similarity.
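The interplay of the two databases, the association module and the output module in claim 9 can be sketched as below. The claim does not name a similarity measure or a sort direction, so the Jaccard word-overlap score, the descending order, the shared integer key that maps text to source, and all function names are assumptions. Whitespace tokenization is also a simplification that would not suit unsegmented Chinese text.

```python
from typing import Dict, List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-set overlap in [0, 1]; an assumed stand-in similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def related_fragments(selected_id: int, texts: Dict[int, str]) -> List[Tuple[int, float]]:
    """Score every other fragment against the selected one, most similar first."""
    base = texts[selected_id]
    scored = [(fid, jaccard(base, t)) for fid, t in texts.items() if fid != selected_id]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def render(selected_id: int, texts: Dict[int, str], sources: Dict[int, str]) -> str:
    """Plain-text output: selected fragment, its source, then related fragments."""
    lines = [texts[selected_id], "Source: " + sources[selected_id], "Related:"]
    for fid, score in related_fragments(selected_id, texts):
        lines.append(f"  [{score:.2f}] {texts[fid]}")
    return "\n".join(lines)


# Two dictionaries play the role of the first and second database;
# the shared key is the assumed mapping between text and source.
texts = {1: "red green blue", 2: "red green yellow", 3: "black white"}
sources = {1: "page-a", 2: "page-b", 3: "page-c"}
print(render(1, texts, sources))
```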
10. The system according to claim 9, characterized in that it further comprises:
a parsing module, configured to identify the global hotkey triggered by the user and to send the control instruction mapped to the identified global hotkey to the corresponding selection module, which provides the corresponding selection function to the user;
an information duplicate-checking module, configured to compare the text content identified by the information identification module and, when repeated text content is detected, to prompt the user whether to continue collecting the repeated portion of the text content; and, according to the user's selection, either to continue the collection or to retain only one copy of the repeated text content for collection.
CN201310712337.0A 2013-12-23 2013-12-23 Information fragment associative output method and system Pending CN103729422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310712337.0A CN103729422A (en) 2013-12-23 2013-12-23 Information fragment associative output method and system

Publications (1)

Publication Number Publication Date
CN103729422A true CN103729422A (en) 2014-04-16

Family

ID=50453496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310712337.0A Pending CN103729422A (en) 2013-12-23 2013-12-23 Information fragment associative output method and system

Country Status (1)

Country Link
CN (1) CN103729422A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070198912A1 (en) * 2006-02-23 2007-08-23 Xerox Corporation Rapid similarity links computation for table of contents determination
CN102184256A (en) * 2011-06-02 2011-09-14 北京邮电大学 Clustering method and system aiming at massive similar short texts
CN102831119A (en) * 2011-06-15 2012-12-19 日电(中国)有限公司 Short text clustering equipment and short text clustering method
CN103092931A (en) * 2012-12-31 2013-05-08 武汉传神信息技术有限公司 Multi-strategy combined document automatic classification method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
吴勇: "面向短消息的文本聚类研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
吴夙慧 等: "文本聚类中文本表示和相似度计算研究综述", 《情报科学》 *
盛魁: "改进的K_近邻算法在中文网页分类的应用", 《佳木斯大学学报(自然科学版)》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528581A (en) * 2015-09-15 2017-03-22 阿里巴巴集团控股有限公司 Text detection method and apparatus
CN106528581B (en) * 2015-09-15 2019-05-07 阿里巴巴集团控股有限公司 Method for text detection and device
CN108846031A (en) * 2018-05-28 2018-11-20 同方知网数字出版技术股份有限公司 Project similarity comparison method for power industry

Similar Documents

Publication Publication Date Title
US10867256B2 (en) Method and system to provide related data
US20190034835A1 (en) Method and system to provide related data
US6606625B1 (en) Wrapper induction by hierarchical data analysis
CN103365924A (en) Method, device and terminal for searching information
CN103425687A (en) Retrieval method and system based on queries
US20130013616A1 (en) Systems and Methods for Natural Language Searching of Structured Data
Nguyen et al. A lattice-based approach for mathematical search using formal concept analysis
CN105468605A (en) Entity information map generation method and device
JP2010501096A (en) Cooperative optimization of wrapper generation and template detection
CN102023989A (en) Information retrieval method and system thereof
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN103744841A (en) Information fragment translating method and system
CN103294820B (en) WEB page classifying method and system based on semantic extension
CN103116635A (en) Field-oriented method and system for collecting invisible web resources
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
Tuan et al. Cate: context-aware timeline for entity illustration
CN114241501A (en) Image document processing method and device and electronic equipment
CN104778232B (en) Searching result optimizing method and device based on long query
CN103744884A (en) Method and system for collating information fragments
CN103729422A (en) Information fragment associative output method and system
CN117056477A (en) Case data retrieval method, device, equipment and readable storage medium
Amin et al. An efficient web-based wrapper and annotator for tabular data
Moumtzidou et al. Discovery of environmental nodes in the web
CN107256260A (en) A kind of intelligent semantic recognition methods, searching method, apparatus and system
CN115270777A (en) Contract document information extraction method, device and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant before: Wuhan Transn Information Technology Co., Ltd.

COR Change of bibliographic data
RJ01 Rejection of invention patent application after publication

Application publication date: 20140416