CN101246484A - Electric text similarity processing method and system convenient for query - Google Patents

Electric text similarity processing method and system convenient for query

Info

Publication number
CN101246484A
Authority
CN
China
Prior art keywords
content
subclass
texts
text
similar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007101641489A
Other languages
Chinese (zh)
Inventor
刘二中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CNA2007101641489A priority Critical patent/CN101246484A/en
Publication of CN101246484A publication Critical patent/CN101246484A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are an electronic text processing method that facilitates query and search, and a retrieval or search system that includes keyword search and a device for comparing and processing similar content. Within a delimited range around the keyword hit in each text returned by a keyword search, the invention compares the content of different texts, determines whether that content is similar, and classifies the texts accordingly, so that operations such as dividing subsets, arranging various sequences, forming catalogues, ordering, and interface display can be performed. The invention can considerably improve the convenience and thoroughness of information retrieval and network information search.

Description

A similarity processing method and system for electronic text that facilitates querying
(1) Technical field
The present invention relates to electronic text processing and to retrieval or search technology for computers and search engines.
(2) Background art
Over the past 20 years, computer database retrieval technology has developed greatly. In particular, progress in network technologies such as the Internet has brought the scale of the databases that people can share to astronomical figures. To help users find the information or files they need, classification or catalogue retrieval systems emerged. This technology is well suited to mature classification fields that people know well, but across broad fields of massive information it is difficult both to build and to master and use.
Retrieval and search engine technologies built around keyword search have brought convenience to users. Such a system obtains the user's keyword query request through an interactive interface on the client computer and a communication network or line, queries a text index library or text library, analyzes the relevance between the keyword request and the texts, obtains and orders the relevant results, and returns them to the interactive interface via the communication network or line. Such search systems are convenient and fast to use, but the total number of bibliographic records or index entries in the returned results is still very large and is difficult to go through one by one.
To place the results potentially most valuable to the user as near the front as possible, U.S. Patent No. 6,285,999 proposed ordering search results by analyzing the hyperlink structure of web pages (the "Page link" technique). This technique outperformed other ordering techniques, was adopted by Google, and achieved unprecedented success.
However, this and other ordering techniques only improve the efficiency of keyword search in a statistical sense. They cannot guarantee that the results each individual user hopes for will appear at the front of a huge index list. We still cannot be sure of finding the expected content near the top without omission, so the search is neither thorough nor convenient. At the same time, before reaching the expected information, we have no choice but to read irrelevant entries whose main content repeats again and again.
To solve this problem, people have been trying for the past ten years to develop new search engine techniques. One important direction is to measure and exploit the similarity among the massive numbers of different files or web pages that respond to the same keyword query term, and to divide them into different classes for easier retrieval and browsing. This class of techniques, however, has serious defects.
The first defect is excessive computation: more computing time is needed when many texts must be compared and, in particular, when each text is long. Several targeted improvements have been proposed: Yahoo's U.S. Patent No. 6,990,628 on "measuring the similarity of electronic texts"; IBM's Chinese patent CN1112647C, "System and method for classifying documents in a document collection in response to a query"; Fudan University's Chinese patent CN1220159C, "A fast similarity search method for high-dimensional vector data"; Hewlett-Packard's Chinese patent CN1269064C on "document and information retrieval method and device"; and Baidu's Chinese patent CN1209726C on "A method for identifying mirror and near-mirror websites on the Internet", which compares the similarity of home pages only. These have improved on the first defect only to a very limited extent.
The second defect is that the results of similarity processing are often of very limited help to the user. Although files that are similar to one another have much in common, they also differ, and the information the user is interested in may well lie in the differences; a difference in a key part often clearly affects how a text should be classified. The prior art, including U.S. Patent No. 6,990,628, cannot recognize whether a given difference or commonality between two texts is critical, so the search results such techniques provide are neither thorough enough nor convenient enough.
There is therefore an urgent need for a keyword search engine technology that is both thorough and efficient, one that greatly speeds up the user's access to the expected results while guaranteeing the thoroughness of the search. This has remained an unsolved problem worldwide for many years.
(3) Summary of the invention
One object of the present invention is to provide a method or system for electronic text processing and retrieval or search, for a computer or search engine, that can classify or process a large number of different texts or items of information containing the same keyword query term according to the degree of similarity of the core content of each text that the user has most reason to care about. Another object is to provide such a method or system that can refine a large number of different texts or items of information containing the same keyword, listing the relevant information for convenient querying in such a way that texts with similar core content overlap less and texts with dissimilar core content are less often omitted. A further object is to provide a more effective, simpler, and more economical method or system of this kind that serves the user by rapidly narrowing the search scope during keyword retrieval, significantly reducing or eliminating irrelevant or duplicate information, and accurately obtaining the desired results.
One aspect of the present invention provides a method of processing a plurality of electronic texts using a computer, comprising:
[i] obtaining a plurality of electronic texts containing the same keyword query term;
[ii] determining a uniform extraction range for the keyword-adjacent content in each text, the keyword-adjacent content being the content that lies next to the keyword query term, outside the term itself, within the extraction range;
[iii] specifying a criterion for whether the keyword-adjacent content of different texts is similar, the criterion including, directly or indirectly, a requirement on the amount or proportion of mutually identical parts in the keyword-adjacent content of the different texts, where an identical part may be an identical word, root, character, or phrase;
[iv] determining, according to the criterion of [iii], whether the keyword-adjacent content of these texts is similar to one another, classifying the texts accordingly, and processing the texts identically or differently according to their classification;
The electronic text, or text, may be a file, text, web page, abstract, bibliographic record, title, index, chapter, or paragraph in a computer database, information repository, information storage device, the Internet, a server, a search engine, a data processor, or a similar device, or any information containing written or character content.
Here, the keyword query term generally refers to content, proposed by the user, that the retrieved texts are required to contain. The keyword-adjacent content generally is content not specified by the user that appears, in a text among the keyword search results, within a prescribed extraction range next to the keyword query term. Compared with content in the text that lies far from the keyword query term, this content is more indicative of the specific purpose the keyword query term serves in the text, and is more helpful for classifying and processing the texts appropriately.
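Steps [i] to [iv] can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names, the regular-expression tokenization, the single-word keyword, the 50% overlap threshold, and the greedy grouping against each group's first member are all choices made for the example.

```python
import re

def adjacent_content(text, keyword, before=2, after=3):
    """Step [ii]: words within a fixed range around the first keyword hit."""
    words = re.findall(r"\w+", text.lower())
    kw = keyword.lower()
    if kw not in words:
        return []
    i = words.index(kw)
    return words[max(0, i - before):i] + words[i + 1:i + 1 + after]

def is_similar(a, b, min_share=0.5):
    """Step [iii]: similar when shared words reach min_share of the smaller window."""
    if not a or not b:
        return False
    shared = len(set(a) & set(b))
    return shared / min(len(set(a)), len(set(b))) >= min_share

def classify(texts, keyword, **window):
    """Step [iv]: each text joins the first group whose reference window is similar."""
    groups = []  # list of (reference window, member texts)
    for t in texts:
        ctx = adjacent_content(t, keyword, **window)
        for ref, members in groups:
            if is_similar(ctx, ref):
                members.append(t)
                break
        else:
            groups.append((ctx, [t]))
    return [members for _, members in groups]

# Step [i] is assumed done: every text already contains the keyword.
texts = [
    "The jaguar car dealer opened today",
    "A jaguar car dealer closed",
    "Wild jaguar cat hunts at night",
]
groups = classify(texts, "jaguar")
```

With these inputs the two car-dealer texts fall into one group and the third text forms its own group. For Chinese text, a character-based range or a word segmenter would be needed in place of the `\w+` tokenization assumed here.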
The criterion in [iii] of the method for whether keyword-adjacent content is similar may also be based on, or refer to, one or more of the following assessment factors or principles:
Whether the keyword-adjacent content of the different texts is identical;
How much the mutually identical parts of the keyword-adjacent content of different texts differ in their position (before or after) or distance relative to the keyword query term in their respective original texts;
How much the order of the mutually identical parts differs between the respective original texts;
How far the mutually identical parts lie from the keyword query term in their respective original texts;
The value that a vector space model computation gives for the degree of similarity between the keyword-adjacent content of different texts;
Alternatively, one or more objective functions that weight one or more of the above assessment factors, or other factors, may be given in order to obtain the corresponding degree of similarity of the keyword-adjacent content of different texts, or to reach a judgment as to whether it is similar.
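One way to combine several of these factors into a single objective function is sketched below. The three component measures, their names, and the weights are assumptions chosen for illustration; the text does not prescribe any particular formula. Each window is a list of words in their original order, extracted with the same range for both texts, so an index within the window stands in for position relative to the keyword.

```python
import math
from collections import Counter

def overlap_share(a, b):
    """Share of distinct words common to both windows, relative to the smaller one."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb)) if sa and sb else 0.0

def cosine(a, b):
    """Vector space model: cosine of the two term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def position_agreement(a, b):
    """1.0 when shared words sit at the same offsets in both windows, lower otherwise."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    span = max(len(a), len(b))
    drift = sum(abs(a.index(w) - b.index(w)) for w in shared) / len(shared)
    return 1.0 - drift / span

def objective(a, b, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three factors; the weights are illustrative only."""
    w1, w2, w3 = weights
    return (w1 * overlap_share(a, b)
            + w2 * cosine(a, b)
            + w3 * position_agreement(a, b))
```

Two windows would then be judged similar when `objective(a, b)` reaches a chosen threshold. Reordering the same words lowers the score only through the position term, which is how the order and distance factors can be given less weight than the amount of shared content.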
The processing of the present invention may further include:
Giving the corresponding texts, or parts of their content, identical or different distribution locations or storage modes; or dividing them into identical or different subsets; or assigning them identical or different subset labels; or giving their indexes in the database identical or different marks or index entries; or giving them identical or different arrangements; or giving them identical or different display modes or positions in the interactive interface; or letting at least some of the subsets each contribute one or more bibliographic records, abstracts, or texts, or the keyword-adjacent content that is similar among the texts of the subset, or the identical parts of that content, to be combined, ordered, or displayed across subsets in the interactive interface.
The processing method of the present invention may include dividing similar subsets: a plurality of texts or partial text contents may be divided into a plurality of similar subsets, such that the keyword-adjacent content of every text or partial text content within the same similar subset is similar.
Texts in the same similar subset are more likely to correspond to a user interest that points in one particular direction, which makes retrieval easier.
The processing method may also include dividing identical-core subsets: a plurality of texts or partial text contents may be divided into a plurality of identical-core subsets, where the keyword-adjacent content of every text or partial text content in the same identical-core subset is required to be identical.
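Because identical-core subsets require exact equality of the content next to the keyword, they can be formed with a simple dictionary grouping rather than pairwise comparison. The sketch below assumes the adjacent content has already been extracted into word lists; the function name and input shape are hypothetical.

```python
from collections import defaultdict

def identical_core_subsets(windows):
    """windows: mapping text_id -> sequence of adjacent words.
    Returns lists of text ids whose adjacent content is exactly the same."""
    groups = defaultdict(list)
    for text_id, window in windows.items():
        groups[tuple(window)].append(text_id)
    return list(groups.values())

windows = {
    "doc1": ["car", "dealer"],
    "doc2": ["car", "dealer"],
    "doc3": ["cat", "hunts"],
}
subsets = identical_core_subsets(windows)
```

Here `doc1` and `doc2` share one subset and `doc3` stands alone. Grouping by exact key takes one pass over the texts, which is one reason this division can be cheaper than a similarity-based one.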
When needed, the processing method may also include subdividing similar subsets.
Where necessary, the processing method may include re-dividing similar subsets: after similar subsets or identical-core subsets have been divided, the new content lying within a certain nearby range outside the original extraction range of the keyword-adjacent content is compared for similarity across the texts or partial text contents of an existing similar subset or identical-core subset, and, depending on whether that new content is similar, these texts or partial text contents are divided into a plurality of next-level similar subsets.
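The re-division step can be illustrated by comparing only the "ring" of new words that lies just beyond the range already used. In this sketch the ring covers words after the keyword only, and the names `ring`, `share`, and `subdivide`, the ranges, and the threshold are assumptions for the example. Every text is assumed to contain the keyword as a separate word.

```python
import re

def ring(text, keyword, inner, outer):
    """Words more than `inner` but at most `outer` positions after the keyword."""
    words = re.findall(r"\w+", text.lower())
    i = words.index(keyword.lower())
    return words[i + 1 + inner:i + 1 + outer]

def share(a, b):
    """Share of distinct words common to both lists, relative to the smaller one."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb)) if sa and sb else 0.0

def subdivide(subset, keyword, inner=2, outer=5, threshold=0.5):
    """Split one existing subset into next-level subsets using the new ring only."""
    groups = []  # list of (reference ring, member texts)
    for t in subset:
        r = ring(t, keyword, inner, outer)
        for ref, members in groups:
            if share(r, ref) >= threshold:
                members.append(t)
                break
        else:
            groups.append((r, [t]))
    return [members for _, members in groups]

subset = [
    "jaguar car dealer in london opens",
    "jaguar car dealer in london closes shop",
    "jaguar car dealer sells vintage models",
]
next_level = subdivide(subset, "jaguar")
```

All three texts share the first two words after the keyword, so they would sit in one subset at the first level; the ring of the next three words separates the two London texts from the third.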
The processing method may also include arranging a dissimilar sequence: a dissimilar sequence may be arranged from a plurality of texts such that the keyword-adjacent content of the different texts or partial text contents within the same dissimilar sequence is entirely, or substantially, not similar; or such that, among all or most of the texts or partial text contents in the sequence, none has keyword-adjacent content that is similar or identical to that of one, or of more than a prescribed number of, other texts or partial text contents.
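A dissimilar sequence in the strict sense (no two members similar) can be built greedily, as in the following sketch. The input shape, the overlap measure, and the threshold are assumptions for illustration; the result depends on the order in which texts are visited.

```python
def share(a, b):
    """Share of distinct words common to both lists, relative to the smaller one."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / min(len(sa), len(sb)) if sa and sb else 0.0

def dissimilar_sequence(windows, threshold=0.5):
    """windows: list of (text_id, adjacent words).
    Keep a text only if its window is not similar to any window already kept."""
    kept = []
    for text_id, w in windows:
        if all(share(w, kept_w) < threshold for _, kept_w in kept):
            kept.append((text_id, w))
    return [text_id for text_id, _ in kept]

windows = [
    ("doc1", ["car", "dealer"]),
    ("doc2", ["car", "dealer", "london"]),
    ("doc3", ["cat", "hunts"]),
]
sequence = dissimilar_sequence(windows)
```

Here `doc2` is dropped because its window overlaps fully with that of `doc1`, leaving `doc1` and `doc3`. The looser variants described above, which tolerate a prescribed number of similar neighbours, would replace the `all(...)` test with a count.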
Where necessary, the method may include arranging a distinct-core sequence: a sequence of texts with non-identical core content may be arranged from a plurality of texts such that the keyword-adjacent content of the different texts or partial text contents in the same sequence is entirely, or substantially, not completely identical; or such that, among all or most of the texts or partial text contents in the sequence, none has keyword-adjacent content identical to that of one, or of more than a prescribed number of, other texts or partial text contents.
The processing method may also include catalogue grouping, or arranging a sequence of the similar content of different subsets: for each similar subset, the similar or identical content (or part of it) shared by the keyword-adjacent content of the subset's texts may be taken as an entry, and the entries assembled into a catalogue or sequence; or these entries may be assembled into a tree-shaped catalogue together with entries formed from the similar or identical content (or part of it) shared by the texts of each next-level subset of each similar subset.
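Catalogue grouping can be illustrated by labelling each subset with the words its members' adjacent content has in common. This is a sketch with hypothetical function names; it builds a flat, one-level catalogue and takes word order from the first window of each subset.

```python
def catalogue_entry(windows):
    """Words common to every window of a subset, in the order of the first window."""
    common = set(windows[0]).intersection(*map(set, windows[1:]))
    return [w for w in windows[0] if w in common]

def build_catalogue(subsets):
    """subsets: list of subsets, each a list of word windows.
    Returns (entry label, number of texts) for each subset."""
    return [(" ".join(catalogue_entry(ws)), len(ws)) for ws in subsets]

subsets = [
    [["car", "dealer", "opened"], ["car", "dealer", "closed"]],
    [["cat", "hunts", "at"]],
]
catalogue = build_catalogue(subsets)
```

The first subset is labelled "car dealer" and the second "cat hunts at". A tree-shaped catalogue could be obtained by applying the same function to the next-level subsets of each entry.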
When needed, the processing method may include arranging a representative sequence: one or more texts may be taken from each similar subset or identical-core subset, and these texts or partial text contents composed into a sequence.
The processing method may also include sequence recompression: within an existing dissimilar sequence, representative sequence, catalogue grouping, or sequence of the similar content of different subsets, the keyword-adjacent content of the texts or partial text contents may be compared again under a looser similarity criterion, producing within the existing sequence new similar subsets of texts or partial text contents, or a more refined dissimilar sequence, representative sequence, or catalogue grouping.
Where necessary, the processing method may also include identical-core division followed by re-aggregation: a distinct-core sequence is arranged first, and the keyword-adjacent content of the texts or partial text contents in the resulting sequence is then compared using the similarity criterion, producing within the existing sequence new similar subsets of texts or partial text contents, or a more refined dissimilar sequence, representative sequence, or catalogue grouping.
When needed, the processing method may also include interface display and operation.
The processing method may also include labelling counts.
When needed, the processing method may include determining the ordering: the arrangement, display order, or position of some of the elements contained in the above catalogues, sequences, or subsets may be random, or may depend partly or wholly on one or more of the following factors:
The mean Page link value of the relevant subset, text, word segment, content, information, or containing text; the click-through rate; the keyword occurrence rate; the number of subordinate subsets or subordinate texts; the subset click-through rate; the text's Page link value or its maximum; the ranking in the search results of an existing website or system; the bid; the spelling; the stroke count; the source rating; the time of receipt; and other such factors;
Or may be decided by the value of a corresponding objective function.
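Ordering by an objective function over such factors might look like the sketch below. The factor names, their normalization to the range 0 to 1, and the weights are invented for illustration and are not taken from the text.

```python
def rank(items, weights):
    """Order items by a weighted sum of their factor values, highest first."""
    def score(item):
        return sum(weights.get(name, 0.0) * value
                   for name, value in item["factors"].items())
    return sorted(items, key=score, reverse=True)

subsets = [
    {"label": "car dealer", "factors": {"click_rate": 0.2, "subset_size": 0.9}},
    {"label": "big cat", "factors": {"click_rate": 0.8, "subset_size": 0.3}},
]
ordered = rank(subsets, {"click_rate": 0.7, "subset_size": 0.3})
```

With these weights the "big cat" subset is listed first; weighting only `subset_size` would reverse the order. Factors that a weight table omits simply contribute nothing to the score.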
Another aspect of the present invention is a data retrieval system, comprising:
A data processing unit 23 and, connected to it, an input unit 21, an output unit 22, and a text database 26; the data processing unit can receive a keyword query through the input unit 21, collect and process the relevant data from the text database or, where necessary, from the Internet 27, and send the retrieval results to the output unit;
Characterized in that the data processing unit 23 comprises a memory 24 and a keyword-adjacent content processing device 25;
The keyword-adjacent content processing device can:
[i] obtain a plurality of electronic texts containing the same keyword query term;
[ii] determine a uniform extraction range for the keyword-adjacent content in each text, the keyword-adjacent content being the content that lies next to the keyword query term, outside the term itself, within the extraction range;
[iii] specify a criterion for whether the keyword-adjacent content of different texts is similar, the criterion including, directly or indirectly, a requirement on the amount or proportion of mutually identical parts in the keyword-adjacent content of the different texts, where an identical part may be an identical word, root, character, or phrase;
[iv] determine, according to the criterion of [iii], whether the keyword-adjacent content of these texts is similar to one another, classify the texts accordingly, and process the texts identically or differently according to their classification;
The processing may include one or more of the following:
Dividing similar subsets, dividing identical-core subsets, subdividing similar subsets, re-dividing similar subsets, arranging a dissimilar sequence, arranging a distinct-core sequence, catalogue grouping or arranging a sequence of the similar content of different subsets, arranging a representative sequence, sequence recompression, identical-core division followed by re-aggregation, content expansion, labelling counts, determining the ordering, and interface display and operation.
The data retrieval system may consist of a computer, a server, or a search engine system.
Another aspect of the present invention is a search engine system that, in response to a user's request made via an interactive interface, provides the desired search results, comprising:
A server, coupled via a communication network or line to the client computer on which the interactive interface resides;
A search engine located on the server, the search engine comprising a database containing a keyword index, and a query program that can query the database according to the keyword request made by the user and provide the resulting list of relevant data to the interactive interface;
Characterized in that:
The query program or search engine further comprises a keyword-adjacent content comparison and processing device, which can:
[i] obtain a plurality of electronic texts containing the same keyword query term;
[ii] determine a uniform extraction range for the keyword-adjacent content in each text, the keyword-adjacent content being the content that lies next to the keyword query term, outside the term itself, within the extraction range;
[iii] specify a criterion for whether the keyword-adjacent content of different texts is similar, the criterion including, directly or indirectly, a requirement on the amount or proportion of mutually identical parts in the keyword-adjacent content of the different texts, where an identical part may be an identical word, root, character, or phrase;
[iv] determine, according to the criterion of [iii], whether the keyword-adjacent content of these texts is similar to one another, classify the texts accordingly, and process the texts identically or differently according to their classification;
The processing may include one or more of the following:
Dividing similar subsets, dividing identical-core subsets, subdividing similar subsets, re-dividing similar subsets, arranging a dissimilar sequence, arranging a distinct-core sequence, catalogue grouping or arranging a sequence of the similar content of different subsets, arranging a representative sequence, sequence recompression, identical-core division followed by re-aggregation, content expansion, labelling counts, determining the ordering, and interface display and operation.
The search engine system described above may be a search system located on the Internet that serves Internet users, or an independent computerized information retrieval system. The server 5 is a computer storage and processing device, and may be a single machine or several machines, grouped or distributed. The client computer 3 may be a PC, a workstation, or another computer device, and may be equipped with a suitable browser when needed.
Another aspect of the present invention may be a computer-readable medium storing instructions executable by one or more processing devices, the instructions implementing a method of classifying and processing a plurality of electronic texts containing the same keyword query term, and possibly comprising:
Instructions for obtaining a plurality of electronic texts containing the same keyword query term;
Instructions for determining a uniform extraction range for the keyword-adjacent content in each text, the keyword-adjacent content being the content that lies next to the keyword query term, outside the term itself, within the extraction range;
Instructions for specifying a criterion for whether the keyword-adjacent content of different texts is similar, the criterion including, directly or indirectly, a requirement on the amount or proportion of mutually identical parts in the keyword-adjacent content of the different texts, where an identical part may be an identical word, root, character, or phrase;
Instructions for determining, according to that criterion, whether the keyword-adjacent content of these texts is similar to one another, classifying the texts accordingly, and processing the texts identically or differently according to their classification.
The search technique of the present invention, centered on comparing and processing the similarity of the keyword-adjacent content of different texts, focuses text classification on the core content next to the keyword query term. It is more scientific and accurate; in classification, catalogue prompting, and the progressive narrowing of the result set for a given keyword, it offers a degree of thoroughness together with convenience and efficiency clearly surpassing the prior art. It will go a long way toward meeting the long-standing and urgent needs of Internet users and information searchers, and can even help people analyze and retrieve the content of documents more fully and accurately.
(4) Description of the drawings
Figure 1 is a structural block diagram of one embodiment of a search engine system according to the present invention.
Figure 2 is a schematic diagram of a data retrieval system of the present invention.
Figure 3 is a schematic diagram of ways of determining the extraction range of the keyword-adjacent content in a text according to the present invention.
Figure 4 is a block diagram of the processing flow of one embodiment of the present invention.
Figure 5 is a schematic flow diagram of the "identical-core division followed by re-aggregation" processing mode shown in one embodiment of the present invention.
Figure 6 is a block diagram of the processing flow of a data retrieval system embodiment of the present invention.
Figure 7 is a schematic diagram of a tree-shaped catalogue of two levels of similar subsets of a plurality of texts containing the same keyword query term.
(5) Embodiments
Below, the method provided by the present invention for processing a plurality of electronic texts using a computer is described in detail by way of example.
To use the method of the present invention, it is first necessary to
[i] obtain a plurality of electronic texts containing the same keyword query term.
The electronic text, or text, may be a file, text, web page, abstract, bibliographic record, title, index, chapter, or paragraph in a computer database, information repository, information storage device, the Internet, a server, a search engine, a data processor, or a similar device, or any information containing written or character content.
Next, [ii] determine a uniform extraction range for the keyword-adjacent content in each text, the keyword-adjacent content being the content that lies next to the keyword query term, outside the term itself, within the extraction range. Specifically, the extraction range of the content adjacent to the keyword query term (abbreviated "keyword-adjacent content") in each text may be set by default, predetermined, selected, prescribed, changed, or adjusted, either by the computer or manually. The extraction range is generally a part clearly shorter than one page of most of the original texts. If the range is too large, the part far from the keyword contributes almost nothing to classifying the content closely related to the keyword query term, and greatly increases the amount of computation. The extraction range may be determined, for example, by prescribing a uniform number or length of adjacent words, characters, symbols, content words, roots, or phrases before the keyword query term (abbreviated "keyword"), after it, or both before and after it. In general, within one classification operation it is suggested to adopt one uniform specific length of no more than 100 letters, 30 Chinese characters, or 20 words, preferably a specific length of 1 to 10 words or 1 to 60 letters (for example, 5 words); this helps improve data processing speed and control the number of similar subsets.
The extraction range of the keyword-adjacent content may include the content after the keyword query term and, when needed, the content before it as well. The present invention considers that, in different language environments, the words before the keyword (that is, the keyword query term) may also have an important influence on classifying the core content of a text.
For example, the extraction range may be uniformly prescribed as "1 word on each side of the keyword", "the 4 words before the keyword", "the 10 characters after the keyword", "the 2 words before plus the 3 words after the keyword", "the 4 phrases after the keyword", or "the complete words within the 20 letters before plus the 30 letters after the keyword".
Figure 3 of this specification gives examples of five ways of prescribing the extraction range of the keyword-adjacent content; in each, the keyword query term is "Bu Lin". Here, the extraction range of 31 is "the 3 characters before the keyword"; that of 32 is "the 4 characters after the keyword"; that of 33 is "the 2 characters before plus the 5 characters after the keyword"; that of 34 is "the 4 characters before plus the 6 characters after the keyword"; and that of 35 is "the 1 word before plus the 1 word after the keyword", ignoring function words and auxiliary words.
The scope of getting of drawing of the contiguous content of described keyword query item also can be by judging and choose the phrase or the sentence mode at this keyword query item place in the text, or other modes as cursor click place with as described in the distance of keyword query item determine, perhaps change to determine according near the punctuate the keyword or symbol or space or font or its.Under special circumstances, the size of the content of very short and small text also may be less than the scope of getting of drawing at the predetermined contiguous content of keyword of general text, and can compare the contiguous content of the keyword of whole short and small text and other text this moment.In same processing procedure, the mode of different text being drawn the contiguous context of keyword described in the different texts of getting should be identical.
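The extraction step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `adjacent_content`, its parameters `before` and `after`, and the sample sentence are assumptions made here, and the text is assumed to be already split into tokens (words or characters), which the specification does not prescribe.

```python
def adjacent_content(words, keyword, before=0, after=0):
    """Return the tokens inside the extraction range around the first
    occurrence of `keyword`, excluding the keyword itself.
    Returns None when the keyword does not occur (hypothetical helper)."""
    if keyword not in words:
        return None
    i = words.index(keyword)
    start = max(0, i - before)  # clip at the start of a very short text
    # Slicing clips automatically at the end of a very short text.
    return words[start:i] + words[i + 1:i + 1 + after]


# Hypothetical, already tokenized text; tokenization is outside this sketch.
text = "the zone develops science and technology industry quickly".split()
window = adjacent_content(text, "zone", before=1, after=3)
print(window)
```

The same `before` and `after` values would be used for every text in one processing run, in line with the requirement that all texts be extracted in the same way.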
Next, step [iii] specifies the criterion for judging whether the content adjacent to the keyword query term in different texts is similar. The criterion includes, directly or indirectly, a requirement on the number or proportion of mutually identical parts in the keyword-adjacent content of the different texts, where identical parts may be identical characters, words, roots or phrases.
For example, the criterion may require that the words shared by the keyword-adjacent content of different texts make up no less than 80% of the total number of words in that content, or it may set the requirement at 100%.
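A proportion-based criterion of this kind might look as follows. The sketch assumes the adjacent content is a list of word tokens and counts shared words as a multiset intersection divided by the longer of the two lists; both choices, like the names `shared_ratio` and `is_similar`, are assumptions made for illustration.

```python
from collections import Counter


def shared_ratio(a, b):
    """Proportion of shared words between two pieces of adjacent content.
    Repeated words are counted at most as often as they occur in both."""
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return shared / max(len(a), len(b))


def is_similar(a, b, threshold=0.8):
    """Assumed criterion: similar when the shared proportion reaches `threshold`."""
    return shared_ratio(a, b) >= threshold


a = "develops science and technology industry".split()
b = "science develops and technology park".split()
print(shared_ratio(a, b), is_similar(a, b))
```

Setting `threshold=1.0` gives the stricter variant in which every word must be shared.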
Here "indirectly" means that, where needed, the criterion may not state a direct requirement on the number or proportion of identical parts in the keyword-adjacent content of different texts, yet its practical effect is equivalent to including one. In other words, if the number or proportion of identical parts does not reach a certain level, the other requirements or indices of the criterion (for example, a correlation value between the keyword-adjacent content of different texts computed with the vector space model) cannot be satisfied either.
Where necessary, the identical parts referred to by the criterion may disregard differences in the prefixes or suffixes of some words, or the presence, absence or difference of certain function words, measure words, numerals, non-notional words, punctuation marks or spaces.
Next, step [iv] applies the criterion specified in [iii] to determine whether the keyword-adjacent content of the texts is mutually similar, classifies the texts accordingly, and processes them identically or differently depending on their classification.
For example, suppose the uniform extraction range has been set to the 5 words after the keyword query term. The program, a default, or the inquirer may then specify: if, among the 5 consecutive words following the keyword query term in different texts, at least 4 words are shared (a proportion of no less than 80%), the keyword-adjacent content of these texts is mutually similar, and the texts belong to the same similarity class for that keyword query term; otherwise the texts do not belong to the same similarity class. For instance, if the keyword query term is "development area", then a text containing "... the development area industry of developing science and technology ...", a text containing "... development area scientific development and technical industry ..." and a text containing "... development area industry development and science and technology are ..." belong to one similarity class, while a text containing "... the development area is just in the developing high-tech industry ..." and a text containing "... the development area new and high technology promotes industry development ..." belong to another similarity class.
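The "at least 4 of 5 words" rule can be turned into a small grouping routine. The sketch below is one possible reading, not the patented procedure: it compares each text with the first member of every existing class (a greedy strategy chosen here because similarity is not transitive), and the English token lists merely stand in for the translated example fragments.

```python
from collections import Counter


def similar(a, b, min_shared=4):
    """Assumed criterion: at least `min_shared` words in common."""
    return sum((Counter(a) & Counter(b)).values()) >= min_shared


def classify(contexts, min_shared=4):
    """Greedy grouping: each class is a list of indices into `contexts`,
    and its first index acts as the class representative."""
    classes = []
    for i, ctx in enumerate(contexts):
        for cls in classes:
            if similar(contexts[cls[0]], ctx, min_shared):
                cls.append(i)
                break
        else:  # no existing class matched
            classes.append([i])
    return classes


# Hypothetical 5-word contexts following the keyword query term.
contexts = [
    "industry develop science and technology".split(),
    "science develop and technology industry".split(),
    "industry develop and science technology".split(),
    "is developing high tech industry".split(),
    "high tech promotes industry developing".split(),
]
print(classify(contexts))
```

With these inputs the first three contexts fall into one class and the last two into another, mirroring the two classes of the example.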
In general, by the above principle, a large number of different texts containing the same keyword query term can be divided into many classes; within each class, the keyword-adjacent content of the texts shares a specific identical part that satisfies the criterion. This greatly benefits further processing and retrieval.
Where needed, the criterion for judging whether the content adjacent to the keyword query term in different texts is similar may additionally rely on, or refer to, one or more other evaluation factors or principles.
For example, the criterion may require that the keyword-adjacent content of different texts be completely identical. This is the highest degree of similarity and counts as similar, or identical. In effect, the adjacent words, or the whole adjacent word segment, of the keyword query term are compared for equality, which makes the classification of texts stricter. The adjacent content of the several texts in the example above, which belong to the same similarity class for the keyword query term "development area", would not count as identical under this requirement.
Alternatively, the criterion may also consider how much the identical parts differ, between the original texts, in their position or distance before or after the keyword query term; the smaller the difference, the more similar the adjacent content.
For example, it may be specified that the position of the same word may differ between texts by no more than 3 characters on average. On this basis, a text containing "... the evolution of development area new high-tech industry ..." and a text containing "... the development area promotes Hi-tech Industry Development ..." are judged to belong to one class, whereas a text containing "... new and high technology in the development area industry development process ..." is judged not to belong to it, because the positions of the same words differ too much between the texts: by more than 3 characters on average.
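A position-difference factor of this kind can be sketched as an average over the shared tokens. In the sketch, positions are token indices within the extracted adjacent content (tokens may be words or characters), only the first occurrence of each token is used, and the names `mean_position_gap` and `within_position_limit` are assumptions.

```python
def mean_position_gap(a, b):
    """Average absolute difference between the positions of shared tokens
    in two pieces of adjacent content; None when nothing is shared."""
    shared = [w for w in dict.fromkeys(a) if w in b]  # unique, in order of a
    if not shared:
        return None
    return sum(abs(a.index(w) - b.index(w)) for w in shared) / len(shared)


def within_position_limit(a, b, limit=3):
    """Assumed rule: shared tokens may be displaced by at most `limit`
    positions on average."""
    gap = mean_position_gap(a, b)
    return gap is not None and gap <= limit


# Each letter stands for one token of the adjacent content.
print(mean_position_gap(list("ABCD"), list("DABC")))
print(within_position_limit(list("ABCDEFGH"), list("EFGHABCD")))
```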
We may also consider how much the order of the identical parts in the keyword-adjacent content differs between the original texts. The smaller the difference, the more similar the adjacent content.
For example, it may be specified that more than one half of the identical words must appear in the same order. On this basis, a text containing "... the development area industry of developing science and technology ..." and a text containing "... the state of development of development area science and technology industry ..." are judged to belong to the same similarity class, because most of their identical words appear in the same order; a text containing "... development of development area technical industry and scientific management ..." differs more in word order from the former, with more than one half of the identical words in a different order, and therefore does not belong to this class.
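The specification does not say how "the same order" is to be counted. One plausible measure, assumed here purely for illustration, is the fraction of pairs of shared words that keep their relative order in both contexts; the function names are likewise hypothetical.

```python
from itertools import combinations


def order_agreement(a, b):
    """Fraction of pairs of shared words whose relative order is the same
    in both contexts (first occurrences only). Returns 1.0 when fewer than
    two words are shared."""
    shared = [w for w in dict.fromkeys(a) if w in b]  # in the order of a
    pairs = list(combinations(shared, 2))             # x precedes y in a
    if not pairs:
        return 1.0
    same = sum(b.index(x) < b.index(y) for x, y in pairs)
    return same / len(pairs)


def same_order_majority(a, b):
    """Assumed rule: more than one half of the pairs must agree."""
    return order_agreement(a, b) > 0.5


# Each letter stands for one word of the adjacent content.
print(order_agreement(list("ABCD"), list("ABDC")))
print(same_order_majority(list("ABCD"), list("DCBA")))
```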
The distance between the identical parts and the keyword query term in the original texts (measurable as the number of intervening characters) may be considered as well. The smaller the distance, the higher the degree of similarity. For example, the comparison may require that this distance not exceed, on average, one half (or some other proportion) of the specified length, in characters, of the keyword-adjacent content for the content to count as similar.
This is another notable feature available in the present invention. By this method, identical elements or parts lying far from the keyword query term in the original texts are judged to contribute very little to the similarity of the core content of different texts, and may even be ignored. This is consistent with the invention's restriction of the extraction range of the keyword-adjacent content.
Where needed, the criterion may also be based on a numerical correlation (degree of similarity) between the keyword-adjacent content of different texts, computed with the widely used vector space model. In this method, each piece of keyword-adjacent content is treated as a resultant vector composed of component vectors corresponding to its words or characters; the correlation between the resultant vectors of different texts is computed, and content whose correlation reaches the specified value is judged similar. Clearly, two pieces of keyword-adjacent content must share a certain number of identical words for their resultant vectors to have any correlation. A requirement on the correlation between resultant vectors therefore implicitly contains a requirement on the number or proportion of identical parts in the keyword-adjacent content of different texts. The details of similarity computation in the vector space model are described in U.S. Patent No. 6990628, Chinese patent application 200610072588.7 and many other documents, and belong to the known art.
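As a minimal illustration of the vector space model, the sketch below represents each piece of adjacent content as a vector of raw word counts and uses the cosine of the angle between two vectors as the correlation value. Raw counts and the cosine measure are assumptions made here; the cited documents describe weighted variants that this sketch does not reproduce.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity of two token lists treated as word-count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[w] * vb[w] for w in va)  # Counter returns 0 for absent words
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


print(cosine(["science", "industry"], ["science", "park"]))
print(cosine(["science"], ["park"]))
```

Two contexts with no word in common always score 0, which is the sense in which a correlation threshold indirectly requires a minimum number of shared words.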
One or more objective functions may also be defined that weight one or more of the above evaluation factors or other factors.
For example, the value of one objective function may be written as $F(x_1, x_2, \ldots, x_n)$.
In a comparatively simple case, one may set
$F(x_1, x_2, \ldots, x_n) = F_1(x_1) + F_2(x_2) + \cdots + F_n(x_n)$,
where $x_1, x_2, \ldots, x_n$ are the various factors relied on or referred to when specifying the criterion for judging whether the keyword-adjacent content of different texts is similar.
A range can be specified within which the function value must fall for the keyword-adjacent content of different texts to be judged similar.
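The additive form of the objective function can be sketched with each $F_i$ taken as a simple weight times the factor value. The factor names, the weights and the acceptance threshold below are all hypothetical; the specification leaves the individual functions and the acceptable range open.

```python
def objective(factors, weights):
    """Additive objective F = sum of w_i * x_i, a simple special case of
    F_1(x_1) + ... + F_n(x_n) with linear F_i (an assumption of this sketch)."""
    return sum(weights[name] * value for name, value in factors.items())


def judged_similar(factors, weights, lower_bound=0.75):
    """Assumed rule: similar when the objective value reaches `lower_bound`."""
    return objective(factors, weights) >= lower_bound


# Hypothetical factor values for one pair of texts, each scaled to [0, 1].
factors = {"shared_ratio": 0.8, "order_agreement": 1.0, "closeness": 0.5}
weights = {"shared_ratio": 0.5, "order_agreement": 0.3, "closeness": 0.2}
print(objective(factors, weights), judged_similar(factors, weights))
```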
For a text that contains several occurrences of the same keyword query term, the adjacent content of one occurrence may be chosen for processing, either the one with the higher degree of similarity or one chosen at random; alternatively, the text may be divided into several parts that are processed separately.
When judging whether the content adjacent to the keyword query term in different texts is similar, clearly, for the same batch of texts, the stricter the similarity requirement, the fewer texts are likely to be mutually similar; conversely, the looser the requirement, the more texts are likely to be similar.
If the keyword query term consists of two or more parts that need not be contiguous, the similarity comparison, evaluation or judgment may be applied to the adjacent content of only one of the parts in the text; alternatively, the adjacent content of several parts may be compared or evaluated separately, and the results then combined into an overall evaluation or judgment.
After the texts have been classified according to whether their keyword-adjacent content is similar, further processing can be applied.
The corresponding texts or partial text contents may be given identical or different locations or storage modes in a computer, computer-readable medium, memory or database; be divided into identical or different subsets; be given identical or different subset marks; be given identical or different marks or index entries in the database index; be arranged in identical or different ways; or be given identical or different display modes or positions on the interactive interface. Alternatively, for at least some subsets, one or more bibliographic entries, abstracts or texts, or the similar keyword-adjacent content of the texts in each subset or its identical part, may be combined, sorted or displayed across subsets on the interactive interface.
For example, similar subsets may be divided: specifically, a number of texts or partial text contents are divided into several similar subsets, such that the content adjacent to the keyword query term in every text or partial text content within one similar subset is similar. The similar part or identical components of the keyword-adjacent content of the texts in a similar subset can serve as the mark or title of the subset, or as their mark or index entry in the index of the database or interface. For example, the texts above containing "... the development area industry of developing science and technology ...", "... development area scientific development and technical industry ..." and "... development area industry development and science and technology are ..." belong to the same similar subset, whose mark may be "science, technology, industry, development".
The partial text content referred to here may be an incomplete text, or information such as a text abstract, bibliographic entry or statement, that contains the keyword-adjacent content.
Texts in the same similar subset are more likely to correspond to one particular direction of the inquirer's interest, which benefits retrieval.
Identical-core subsets may also be divided: a number of texts or partial text contents are divided into several identical-core subsets, such that the keyword-adjacent content of every text or partial text content within one identical-core subset is completely identical (apart, of course, from anything outside the extraction range).
For example, if the uniform extraction range of the keyword-adjacent content is specified as the 2 words after "development area", then a text containing "... development area industry development and science and technology are ...", a text containing "... the process of development area industry development with ...", a text containing "... the planning of development area industry development ..." and a text containing "... development area industry development speed is satisfactory ..." belong to the same identical-core subset, whose mark may be "industry development" or "development area industry development".
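Because membership in an identical-core subset requires exact equality, the division can be done by using the extracted words as a dictionary key, with no pairwise comparison. The sketch below assumes a single-token keyword (`"zone"` stands in for "development area") and pre-tokenized texts; the function name is hypothetical.

```python
from collections import defaultdict


def identical_core_subsets(texts, keyword, after=2):
    """Group text indices by the exact tuple of `after` words that follow
    the first occurrence of `keyword`; the tuple doubles as the subset mark."""
    subsets = defaultdict(list)
    for idx, words in enumerate(texts):
        if keyword in words:
            i = words.index(keyword)
            subsets[tuple(words[i + 1:i + 1 + after])].append(idx)
    return dict(subsets)


texts = [
    "zone industry development and science".split(),
    "process of zone industry development".split(),
    "zone science park opens".split(),
]
print(identical_core_subsets(texts, "zone"))
```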
The similar subsets and identical-core subsets obtained may be further subdivided (similar subset subdivision): on the basis of the similar subsets or identical-core subsets already divided, a stricter criterion or additional decision factors for judging the similarity of the keyword-adjacent content are applied (for example, adding a requirement on the order of identical words, adding a requirement on the average distance between the identical words and the keyword query term, or other requirements, or no longer ignoring differences in function words that were previously ignored), so that the texts or partial text contents within any existing similar subset or identical-core subset are divided into several next-level subsets of higher similarity.
Where necessary, similar subsets may also be re-divided: on the basis of the existing similar subsets or identical-core subsets, new content from a further adjacent range lying outside the original extraction range is compared for similarity across the texts or partial text contents in an existing subset, and these are divided into several next-level similar subsets according to whether that content is similar. For example, suppose that the original division compared only the 4 words adjacent to the keyword query term and produced a similar subset of 300 texts. The 5th to 7th adjacent words of these texts are not necessarily all similar or identical, so comparing the 5th to 7th words adjacent to the keyword query term in each text can divide the subset into several different next-level subsets.
Where needed, the re-division of identical-core subsets or the subdivision of similar subsets can be repeated several times.
Clearly, other factors being equal, the larger the extraction range of the keyword-adjacent content, the higher the similarity among the texts within one similar subset.
Dissimilar sequences may also be arranged when processing the texts: from a number of texts, a dissimilar sequence can be arranged in which the keyword-adjacent content of the texts or partial text contents is all, or essentially all, mutually dissimilar; or in which, for all or most of the members, no member's keyword-adjacent content is similar or identical to that of one, or more than a specified number, of the other members.
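One straightforward way to arrange a dissimilar sequence is to keep a text only when it is not similar to any text already kept. The greedy strategy, the shared-word criterion and the name `dissimilar_sequence` are assumptions of this sketch.

```python
from collections import Counter


def dissimilar_sequence(contexts, min_shared):
    """Keep each context only if it shares fewer than `min_shared` words
    with every context kept so far (greedy, order dependent)."""
    kept = []
    for ctx in contexts:
        if all(sum((Counter(ctx) & Counter(k)).values()) < min_shared
               for k in kept):
            kept.append(ctx)
    return kept


# Each letter stands for one word of the adjacent content.
contexts = [list("ABCD"), list("ABCE"), list("WXYZ"), list("ABFG")]
print(len(dissimilar_sequence(contexts, min_shared=3)))
print(len(dissimilar_sequence(contexts, min_shared=2)))
```

Lowering `min_shared` loosens the similarity criterion, so more texts count as similar to one already kept and the resulting sequence becomes shorter.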
Where necessary, the method may also include arranging a sequence of non-identical core content: from a number of texts, a sequence can be arranged in which the keyword-adjacent content of the texts or partial text contents is all, or essentially all, not completely identical; or in which, for all or most of the members, no member's keyword-adjacent content is identical to that of one, or more than a specified number, of the other members.
The processing method of the present invention may also include directory grouping, or arranging a sequence of the similar content of different subsets: the similar or identical content (or part of it) shared by the keyword-adjacent content of the texts in each similar subset is taken as an entry, and the entries are assembled into a directory or sequence; or, together with the similar or identical content shared by the texts of each next-level subset, taken as next-level entries, they are assembled into a tree directory.
For example, in Fig. 7, K denotes the keyword query term and capital letters denote words of the adjacent content in the text. The figure is a schematic example of a tree directory of two levels of similar subset marks or entries (similar subset subdivision) for a number of texts containing K.
Here the level-1 adjacent range of the keyword query term is 3 words long (words 1 to 3 after the keyword query term), and the level-2 adjacent range is the 3 words following the level-1 range (words 4 to 6 after the keyword query term). Inside the brackets are the 3 words shared by the keyword-adjacent content of the texts in the corresponding similar subset, used as the subset mark or directory entry. In Fig. 7, level-1 subset marks are on the left and level-2 subset marks on the right; the small numbers give the number of texts contained in each subset.
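A two-level directory of this shape can be sketched as a nested dictionary of text counts. The sketch assumes one particular similarity rule for both levels, namely that the 3 words of a range are the same regardless of order, so the sorted word tuple serves as the subset mark; Fig. 7 itself is not reproduced and the sample texts are invented.

```python
from collections import defaultdict


def two_level_directory(texts, keyword, width=3):
    """Build {level-1 mark: {level-2 mark: number of texts}} where level 1
    covers words 1..width after the keyword and level 2 the next `width`
    words. Marks are order-insensitive (sorted) word tuples."""
    tree = defaultdict(lambda: defaultdict(int))
    for words in texts:
        if keyword not in words:
            continue
        i = words.index(keyword)
        level1 = tuple(sorted(words[i + 1:i + 1 + width]))
        level2 = tuple(sorted(words[i + 1 + width:i + 1 + 2 * width]))
        tree[level1][level2] += 1
    return {mark: dict(children) for mark, children in tree.items()}


# K is the keyword query term; every other letter is one word.
texts = [t.split() for t in
         ["K A B C D E F", "K B A C D E F", "K A B C G H I", "K X Y Z D E F"]]
print(two_level_directory(texts, "K"))
```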
Clearly, such a directory helps the inquirer find subsets and texts of interest more quickly.
Where needed, the processing method of the present invention may include arranging a representative sequence: one or more texts are taken from each similar subset or identical-core subset, and these texts or partial text contents are composed into a sequence.
When the sequences obtained by the above kinds of processing are presented on the interactive interface, they let the inquirer see, within a small space, an overview of the various keyword core contents with no or little repetition, and expand the related content when interested.
The processing method of the present invention also allows sequence recompression: the keyword-adjacent content of the texts or partial text contents in an existing dissimilar sequence, representative sequence, directory grouping or sequence of the similar content of different subsets is compared again under a looser similarity criterion, producing within the existing sequence new similar subsets, or a more condensed dissimilar sequence, representative sequence or directory grouping.
For example, suppose the criterion used to produce an existing dissimilar sequence required at least 7 of the 8 words adjacent to the keyword query term to be identical to the corresponding adjacent content of another text, and the sequence contains 560 mutually dissimilar text abstracts, too many to survey. If these 560 abstracts are processed once more by "arranging a dissimilar sequence" under the looser criterion "at least 6 of the 8 words adjacent to the keyword query term are identical to words in the corresponding adjacent content of another abstract", a much shorter new sequence will probably result, with only some 200 abstracts.
Although the method of the present invention is far more efficient than existing web page similarity analysis and classification techniques, when millions of web pages contain the same keyword query term the computation involved in similarity comparison is still too large. For this case the present invention proposes a further breakthrough processing method that may be selected:
This is identical-core division followed by re-aggregation: first a sequence of non-identical core content is arranged; then the keyword-adjacent content of the texts or partial text contents in that sequence is compared under a similarity criterion (looser than the criterion of identical core content), producing within the existing sequence new similar subsets, or a more condensed dissimilar sequence, representative sequence or directory grouping.
For instance, first a sequence of non-identical core content is obtained for a number of text abstracts, part of which reads:
…KXYZ…, …KPQR…, …KMNL…, …KMLN…, …KXZY…, …KYXZ…, …KZYX…, …KLMN…, …KRPQ…, …KLNM…, …KRQP…,
where K denotes the keyword query term common to all texts and each other letter denotes one word.
If the keyword-adjacent content of the abstracts in this sequence is compared for similarity (criterion: "the words are respectively identical; their order may differ"), the following new similar subsets are obtained:
a similar subset comprising …KXYZ…, …KXZY…, …KYXZ…, …KZYX…;
a similar subset comprising …KLMN…, …KLNM…, …KMNL…, …KMLN…;
and a similar subset comprising …KPQR…, …KRPQ…, …KRQP….
A new dissimilar sequence can also be obtained, in which only the abstracts containing …KXYZ…, …KLMN… and …KPQR… remain of the original members listed above;
or a directory comprising subset marks (or titles) such as "(X, Y, Z)", "(L, M, N)" and "(P, Q, R)".
The sequence or directory grouping obtained in this way is essentially the same as the result of arranging a dissimilar sequence under the same looser similarity criterion from the very start, yet the amount of computation may be reduced by several orders of magnitude.
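The two stages of this example can be sketched as follows. Stage 1 removes exact duplicates of the adjacent content, which needs only hashing; stage 2 then applies the looser rule "same words, any order" to the much smaller set of distinct contents. The function names are hypothetical, the input list adds two invented duplicates to the eleven contents of the example, and each adjacent content is written as a string of letters with K omitted.

```python
def non_identical_core_sequence(contexts):
    """Stage 1: keep the first occurrence of every distinct adjacent content."""
    return list(dict.fromkeys(contexts))


def aggregate(sequence):
    """Stage 2: group contents made of the same words in any order.
    The sorted word tuple serves as the subset mark."""
    subsets = {}
    for ctx in sequence:
        subsets.setdefault(tuple(sorted(ctx)), []).append(ctx)
    return subsets


raw = ["XYZ", "PQR", "MNL", "MLN", "XZY", "YXZ", "ZYX",
       "LMN", "RPQ", "LNM", "RQP", "XYZ", "PQR"]  # last two are duplicates
sequence = non_identical_core_sequence(raw)
subsets = aggregate(sequence)
directory = list(subsets)                                  # subset marks
representatives = [members[0] for members in subsets.values()]
print(directory)
print(representatives)
```

In this sketch the saving comes from stage 1: however many texts repeat the same adjacent content, the comparison in stage 2 sees each distinct content once.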
Where needed, interface display and operation may be provided: specified information on the processing, including processing modes and results, is shown on the interactive interface, and the inquirer is allowed to make selections or give instructions concerning the processing on that interface, by cursor click or keyboard. As required, a subset in the directory or sequence, or an entry, item, text, partial text, abstract, bibliographic entry or word in a subset, can be made to show its more detailed content, or the directory, sequence or more detailed content of its next-level subsets, on the interactive interface.
For example, having found content of interest in the directory of similar subset names or in the dissimilar sequence presented on the interactive interface, the inquirer can click the corresponding title or entry so that a more detailed directory, the content, or the texts of the corresponding similar subset (or of the subset containing the entry) are presented or linked.
To make selection easier for the inquirer, the method may also mark counts: in or near the sequences, directories, entries, texts, bibliographic entries or abstract examples, or the keyword-adjacent content they contain, a prompt may show the number of corresponding parallel subsets, subordinate subsets or texts, or the number of parallel subsets, subordinate subsets or texts of the subset containing the relevant word or word segment.
Where needed, a method of determining order should also be provided. The arrangement, display order or position of elements among those contained in the above directories, sequences or subsets may be random, or may depend partly or wholly on one or more of the following factors:
the page link value, click rate or keyword occurrence rate of the text that the element contains or belongs to;
or the number of subordinate subsets or subordinate texts of the subset, the click rate of the subset, or the mean page link value of the texts in the subset;
or the number of subordinate subsets or subordinate texts of the subset containing the element, the click rate of that subset, or the mean page link value of its texts;
or the page link value of the text in the subset with the highest page link value, or of another example text;
or the click rate or keyword occurrence rate of the text in the subset with the highest click rate or keyword occurrence rate, or of another example text;
or the ranking of the element, or of the related texts in the associated subset, in the search results of other search websites or retrieval systems;
or the amount of the related payment or bid by an investor in the element;
or the alphabetical order of the spelling or pinyin, or the stroke order, of the words or characters of the element;
or the rating of the text's source website, linking websites, linking web pages, organization or persons;
or the time at which the related text was indexed, or its recency;
or whether the elements belong to the same subset at a certain level;
or the value of an objective function that depends on weighted values of one or more variables, some or all of which represent one or more of the factors listed above. For example, the value of an objective function may be written as
$F(y_1, y_2, \ldots, y_n)$,
and one may set, for example, $F(y_1, y_2, \ldots, y_n) = F_1(y_1) + F_2(y_2) + \cdots + F_n(y_n)$,
where $y_1, y_2, \ldots, y_n$ are one or more of the factors (variables) that determine the specific ranking position, as mentioned above in the summary of the invention, or other factors. Since the prior art (for example, patent US6285999) offers many specific ranking methods for reference, they are not described further here.
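Ordering by an additive objective function can be sketched as a weighted sum used as a sort key. The factor names `link_value` and `click_rate`, the weights and the sample entries are hypothetical, and each $F_i$ is again assumed to be a plain weight times the factor value.

```python
def rank(items, weights):
    """Sort directory entries by a weighted sum of their factor values,
    highest score first. Factors missing from an item count as 0."""
    def score(item):
        return sum(w * item.get(name, 0.0) for name, w in weights.items())
    return sorted(items, key=score, reverse=True)


# Hypothetical entries with factor values scaled to [0, 1].
items = [
    {"label": "a", "link_value": 0.2, "click_rate": 0.9},
    {"label": "b", "link_value": 0.8, "click_rate": 0.1},
    {"label": "c", "link_value": 0.5, "click_rate": 0.5},
]
weights = {"link_value": 0.7, "click_rate": 0.3}
print([item["label"] for item in rank(items, weights)])
```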
It may be noted that in case of necessity with in the upper type one or more and can make up utilization or utilization repeatedly.
This disposal route is imbody below in the embodiment of search system or searching system partly also.
Embodiment A, shown in Fig. 1, is an Internet search engine system, an example of a computer data system that carries out the electronic text processing method of the invention. It comprises a search engine 8 hosted on a server 5 that has a memory 6 and a processor 7; the search engine 8 is connected through an Internet communication network 4 to a client computer 3 that has an interactive interface 2. The search engine 8 has a database 9, a query processor 11, and a keyword-adjacent content comparison processing device (or module) 10, and is connected to a data collector 12 and an index builder 13. The data collector 12 gathers texts from the Internet or other information sources and adds them to the text library of the database 9; the index builder 13 analyzes the texts in the text library to produce a text index, which it supplies to the keyword index library of the database 9.
In Embodiment A, the client application on the client computer 3 is a browser (Microsoft Internet Explorer), which lets the user 1 retrieve HTML documents (including Web forms) from the server 5 over the communication network 4. The interactive interface (UI) 2 on the client computer 3 lets the user 1 interact with the retrieved Web forms using a monitor, keyboard, or mouse, in order to submit search requests, make selections, and receive search results.
The search procedure of Embodiment A is shown in the flow diagram of Fig. 4:
Work begins (41). The query processor receives the keyword query term request of the user 1 (42). For the texts containing this keyword query term that are obtained from the database 9, the keyword-adjacent content comparison processing device 10 applies the same delimited range of keyword-adjacent content, either predetermined or taken from the user's default (for example, 2 words before the keyword plus 3 words after it), and judges according to a selected or predetermined standard whether the keyword-adjacent contents of these texts are similar. For example, the predetermined criterion here is: if 4 or 5 of the 5 words in this range are the same as those of the text being compared, the two are regarded as similar; the texts are compared and classified on this basis (43). If the query requires it, the criterion can also include a limit on differences in the order of the shared words, a limit on the distance between the shared words and the keyword query term, or other requirements or reference factors mentioned in the processing method described above.
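The window-and-threshold rule of this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: whitespace tokenization, the use of the first keyword occurrence, and the function names are all assumptions (Chinese text would need a word segmenter).

```python
from typing import List

def adjacent_content(tokens: List[str], keyword: str,
                     before: int = 2, after: int = 3) -> List[str]:
    """Words inside the window around the first keyword hit, keyword excluded."""
    i = tokens.index(keyword)  # raises ValueError if the keyword is absent
    return tokens[max(0, i - before):i] + tokens[i + 1:i + 1 + after]

def shared_count(a: List[str], b: List[str]) -> int:
    """Number of words the two windows have in common (each word used once)."""
    remaining = list(b)
    count = 0
    for word in a:
        if word in remaining:
            remaining.remove(word)
            count += 1
    return count

def is_similar(a: List[str], b: List[str], min_shared: int = 4) -> bool:
    """Example rule from Embodiment A: at least 4 of the 5 window words match."""
    return shared_count(a, b) >= min_shared

t1 = "the new solar panel cuts energy costs sharply".split()
t2 = "a cheap solar panel cuts energy costs today".split()
t3 = "install the solar panel on your roof now".split()
w1, w2, w3 = (adjacent_content(t, "panel") for t in (t1, t2, t3))
```

Here `w1` and `w2` share four of their five window words and so count as similar, while `w1` and `w3` share only one.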
Once classification has produced subsets (by dividing similar subsets or dividing identical-core subsets), the keyword-adjacent content comparison processing device 10 provides and displays a directory or a representative sequence of the similar subsets or identical-core subsets (44). Where needed, the device 10 can also arrange and display a sequence of dissimilar items or a sequence of non-identical core contents. Here, the label of each subset in the directory can be, for example, the 4 words shared by the keyword-adjacent content of every text in that subset.
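One way to picture the subset directory is sketched below. The greedy strategy (each window is compared only with the first member of each existing subset) is an assumption chosen for brevity; the patent does not prescribe a particular grouping algorithm, and the function names are hypothetical.

```python
from collections import Counter
from typing import List

def shared_count(a: List[str], b: List[str]) -> int:
    """Size of the multiset intersection of two windows."""
    return sum((Counter(a) & Counter(b)).values())

def group_similar(windows: List[List[str]], min_shared: int = 4) -> List[List[int]]:
    """Greedy grouping: join the first subset whose first member is similar."""
    subsets: List[List[int]] = []
    for idx, window in enumerate(windows):
        for subset in subsets:
            if shared_count(windows[subset[0]], window) >= min_shared:
                subset.append(idx)
                break
        else:
            subsets.append([idx])
    return subsets

def subset_label(windows: List[List[str]], members: List[int]) -> List[str]:
    """Words shared by every member, in the word order of the first member."""
    common = Counter(windows[members[0]])
    for m in members[1:]:
        common &= Counter(windows[m])
    label = []
    for word in windows[members[0]]:
        if common[word] > 0:
            label.append(word)
            common[word] -= 1
    return label

windows = [
    ["new", "solar", "cuts", "energy", "costs"],
    ["cheap", "solar", "cuts", "energy", "costs"],
    ["the", "solar", "on", "your", "roof"],
]
subsets = group_similar(windows)
directory = [subset_label(windows, s) for s in subsets]
```

The first directory entry is the four words common to both members of the first subset, matching the "same 4 words" labelling described above.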
By reading the directory of subset labels or the representative sequence, the user can readily locate what is of interest. The user can click to expand the related content and display the related text (45); or, if the directory is too long, perform a re-aggregation display operation (sequence recompression, or identical-core division followed by re-aggregation); or, if the subset of interest contains too many texts, perform a subdivision display operation (similar subset subdivision), a re-division display operation (similar subset re-division), or a lower-level subset display operation (46), that is, display the directory of titles of the lower-level subsets obtained by subdivision or division, or the representative sequence of each lower-level subset. Similar operations are carried out in this way, or the process returns to the preceding step (48) or returns (47) to the start (41). During the above processing, the counts of related entries or texts can also be marked and the ordering of the sequence determined.
Another search engine, Embodiment B, adopts a distinctive and efficient similarity comparison method, namely the "identical-core division followed by re-aggregation" method described above. See Fig. 5:
After the keyword-adjacent content comparison processing device 10 of search engine Embodiment B obtains a large number of texts containing the same keyword query term (51), the delimited range of keyword-adjacent content is determined, for example, as "2 words before the keyword plus 5 words after it" (52). When similarity between the texts is assessed and judged (53), the requirement "this content must be identical" is applied, which yields a relatively large number of identical-core subsets (54); consequently the resulting "representative sequence of distinct core contents" (55), or the subset directory, is relatively long.
In effect, this is a representative sequence of the keyword-adjacent content (7 words long) that neither omits nor repeats anything, and the 8-word core content that includes the keyword is generally enough for the inquirer to judge whether an entry is relevant. The number of entries in this representative sequence can be several orders of magnitude smaller than the original count, which is often in the millions, so that reading through all the keyword search results becomes feasible.
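Dividing identical-core subsets amounts to grouping texts by an exact match on the window, which a dictionary keyed on the window does directly. The sketch below assumes the 7-word windows have already been extracted; the sample data and names are illustrative.

```python
from typing import Dict, List, Tuple

def identical_core_subsets(windows: List[List[str]]) -> Dict[Tuple[str, ...], List[int]]:
    """Map each distinct window to the ids of the texts that contain it."""
    subsets: Dict[Tuple[str, ...], List[int]] = {}
    for text_id, window in enumerate(windows):
        subsets.setdefault(tuple(window), []).append(text_id)
    return subsets

windows = [
    ["rooftop", "solar", "cuts", "energy", "costs", "for", "homes"],
    ["rooftop", "solar", "cuts", "energy", "costs", "for", "homes"],
    ["cheap", "solar", "cuts", "energy", "costs", "for", "homes"],
    ["rooftop", "solar", "cuts", "energy", "costs", "for", "homes"],
]
subsets = identical_core_subsets(windows)
representative_sequence = list(subsets)          # one entry per distinct core
counts = [len(ids) for ids in subsets.values()]  # texts behind each entry
```

Four texts collapse to two entries here; the same hashing step runs in time linear in the number of texts, which is what makes it practical as a first pass over very large result sets.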
If hundreds of results are still too many to handle, the inquirer can select a "looser similarity criterion" (56), so that the members of these sequences, or the entries of the directory, are combined again by similarity. Identical-core division followed by re-aggregation (57) is carried out, yielding similar subsets whose number is several times or tens of times smaller, together with a correspondingly shorter "refined sequence or directory grouping", which is stored and displayed (58) for the inquirer to choose from. Where needed, the inquirer can expand the content of a relevant subset, or its texts, by clicking with the cursor.
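The re-aggregation step can be sketched as a second, looser pass that compares only the distinct cores rather than every text. The greedy merge against the first core of each group and the threshold value are assumptions for illustration; short 4-word cores are used to keep the example small.

```python
from collections import Counter
from typing import Dict, List, Tuple

Core = Tuple[str, ...]

def shared_count(a: Core, b: Core) -> int:
    """Size of the multiset intersection of two cores."""
    return sum((Counter(a) & Counter(b)).values())

def reaggregate(core_subsets: Dict[Core, List[int]], min_shared: int) -> List[dict]:
    """Merge identical-core subsets whose cores pass the looser criterion."""
    merged: List[dict] = []
    for core, text_ids in core_subsets.items():
        for group in merged:
            if shared_count(group["cores"][0], core) >= min_shared:
                group["cores"].append(core)
                group["text_ids"].extend(text_ids)
                break
        else:
            merged.append({"cores": [core], "text_ids": list(text_ids)})
    return merged

core_subsets = {
    ("a", "b", "c", "d"): [0, 3],
    ("a", "b", "c", "x"): [1],
    ("p", "q", "r", "s"): [2],
}
groups = reaggregate(core_subsets, min_shared=3)
```

Because the similarity test runs over distinct cores only, its cost depends on the length of the representative sequence, not on the original number of texts.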
Fig. 2 shows another embodiment, Embodiment C, which is a data retrieval system. It consists of a data processing unit 23 and, connected to it, an input unit 21 (keyboard, mouse, etc.), an output unit 22 (display screen, printer, etc.), a text database 26, and so on. The input unit 21 and the output unit 22 together form the interactive interface through which the inquirer communicates with the system, and the data processing unit 23 comprises a memory 24 and a keyword-adjacent content processing device 25. The data processing unit 23 can receive the keyword query submitted by the inquirer through the input unit 21, collect relevant data from the text database 26 or the Internet 27, use its keyword-adjacent content processing device 25 to classify and process, as described above, the large number of texts obtained that contain the same keyword query term, and send the retrieval results to the output unit 22.
Fig. 6 is a block diagram of the processing flow of the data retrieval system of Embodiment C. The specific working process is as follows:
The retrieval system starts work (61) and the user enters a keyword query term request (62). For the texts containing this keyword query term that are obtained from the memory 24 or the text database 26, the keyword-adjacent content comparison processing device 25 applies the same delimited range of keyword-adjacent content, either predetermined or taken from the user's default (63) (for example, the 5 words after the keyword query term), and judges according to a selected or predetermined standard whether the keyword-adjacent contents of these texts are similar (core content comparison). The predetermined criterion of this embodiment is: if 4 or 5 of the 5 words in this range are the same as those of the text being compared, the two are regarded as similar; the core content comparison is carried out on this basis (64). If the query requires it, the criterion can also include a limit on differences in the order of the shared words (for example, more than half of the words must be in the same order), or other requirements or reference factors mentioned in the processing method described above.
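The optional word-order requirement admits more than one reading. The sketch below takes one interpretation as an assumption: restrict both windows to their shared words and require that the longest common subsequence cover more than half of them. It also assumes the words within a window are distinct.

```python
from typing import List

def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def order_mostly_preserved(a: List[str], b: List[str]) -> bool:
    """True if more than half of the shared words occur in the same relative order."""
    shared = set(a) & set(b)
    if not shared:
        return False
    fa = [w for w in a if w in shared]
    fb = [w for w in b if w in shared]
    return 2 * lcs_length(fa, fb) > len(shared)

a = ["cuts", "energy", "costs", "very", "sharply"]
b = ["energy", "cuts", "costs", "sharply", "now"]    # one adjacent swap
c = ["sharply", "costs", "energy", "cuts", "soon"]   # shared words reversed
```

With these inputs, `a` and `b` keep three of their four shared words in order and pass, while `a` and `c` keep only one and fail, even though both pairs would satisfy a "4 of 5 words identical" count.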
On the basis of this comparison and classification, the keyword-adjacent content comparison processing device 25 produces the similar subsets or identical-core subsets (65), or further arranges a sequence of dissimilar items or a sequence of non-identical core contents, or provides and displays their directory or representative sequence (66). Where needed, for example when the subset of interest contains too many texts, the device 25 can also perform similar subset subdivision or similar subset re-division (67), and arrange and display the corresponding sequence of dissimilar items or sequence of non-identical core contents (66). In this embodiment, the label of each subset in the directory can be, for example, the 4 or 5 words shared by the keyword-adjacent content of every text in the corresponding subset. This embodiment can also mark the above content with the corresponding counts, or determine the ordering according to a predetermined or selected standard (69).
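Similar subset re-division looks at content just outside the original window. The sketch below assumes the original window is the 5 words after the keyword, takes the next 3 words as the new content, and splits a subset by exact match on that new content; exact matching is a simplifying assumption standing in for whatever similarity criterion is actually chosen.

```python
from typing import Dict, List, Tuple

def extended_content(tokens: List[str], keyword: str,
                     skip: int = 5, length: int = 3) -> Tuple[str, ...]:
    """Words that follow the original window of `skip` words after the keyword."""
    i = tokens.index(keyword)
    start = i + 1 + skip
    return tuple(tokens[start:start + length])

def redivide(texts: List[List[str]], member_ids: List[int],
             keyword: str) -> List[List[int]]:
    """Split one subset into next-level subsets by the newly examined content."""
    groups: Dict[Tuple[str, ...], List[int]] = {}
    for text_id in member_ids:
        groups.setdefault(extended_content(texts[text_id], keyword), []).append(text_id)
    return list(groups.values())

texts = [
    "panel a b c d e x y z".split(),
    "panel a b c d f x y z".split(),
    "panel a b c d e u v w".split(),
]
next_level = redivide(texts, [0, 1, 2], "panel")
```

Texts 0 and 1 differ inside the original window but agree on the following three words, so they land in the same next-level subset, while text 2 is split off.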
By reading the directory of subset labels or the representative sequence, the user can readily locate what is of interest and can carry out interface display and operation (68): expand the related content and display the related text (71), or, if the directory is too long, perform a re-aggregation operation (sequence recompression, or identical-core division followed by re-aggregation, 70) and display the resulting directory of subset titles or the representative sequence of each subset.
Similar operations are carried out in this way, repeating the preceding steps until text display (71), or returning (72) to the start of operation (61), in order to finish or to carry out another retrieval query. At any point in the above processing, the counts of related entries or texts can be marked and the ordering of the sequence determined (69).
The technical features given in the above embodiments are all illustrative and are not to be used to limit the scope of the invention.

Claims (18)

1. A method of processing a plurality of electronic texts by computer, comprising:
[i] obtaining a plurality of electronic texts that contain the same keyword query term;
[ii] determining the same delimited range of keyword-adjacent content for the keyword query term in the content of each text, the keyword-adjacent content of the keyword query term being the content, other than the keyword query term itself, that lies adjacent to it within the delimited range in the text content;
[iii] specifying a criterion for whether the keyword-adjacent contents of the keyword query term in different texts are similar, the criterion including, directly or indirectly, at least a requirement on the number or proportion of mutually identical parts in the keyword-adjacent contents from the different texts, where an identical part can be a mutually identical word, root, character, or phrase;
[iv] according to the criterion of [iii], determining whether the keyword-adjacent contents of the keyword query term in these texts are similar to one another, classifying the texts according to whether those contents are similar to one another, and processing the texts identically or differently according to their classification;
wherein the electronic text, or text, can be a file, text, web page, abstract, bibliographic record, title, index, chapter, or paragraph, or other information containing words or character content, held in a device such as a computer database, data bank, information storage device, the Internet, a server, a search engine, or a data processor.
2. The method according to claim 1, wherein the criterion of [iii] for whether the keyword-adjacent contents are similar can additionally be based on, or refer to, one or more of the following assessment factors or principles:
whether the keyword-adjacent contents from the different texts are identical;
the size of the difference, between the original texts, in the position (before or after) or distance of the mutually identical parts relative to the keyword query term;
the size of the difference, between the original texts, in the order of the mutually identical parts;
the size of the distance, in the original texts, between the mutually identical parts and the keyword query term;
the magnitude of the similarity value given by a vector space model computation for the keyword-adjacent contents from the different texts;
or one or more objective functions that weight one or more of the above assessment factors, or other factors, to obtain the degree of similarity of the keyword-adjacent contents from the different texts or to decide whether they are similar.
3. The method according to claim 1, wherein:
the processing of [iv] can comprise:
giving the corresponding texts, or parts of their content, identical or different distribution positions or storage modes; or dividing them into identical or different subsets; or giving them identical or different subset marks; or giving their index in the database identical or different marks or index entries; or giving them identical or different arrangements; or giving them identical or different display modes or positions on the interactive interface; or, for at least some of the subsets, taking from each one or more bibliographic records, abstracts, or texts, or the similar keyword-adjacent content of the texts in the subset or its identical part, and combining, ordering, or displaying them across subsets on the interactive interface.
4. The method according to claim 1, wherein:
the processing of [iv] can comprise dividing similar subsets: the plurality of texts, or parts of their content, can be divided into a plurality of similar subsets, such that the keyword-adjacent contents of the texts or text parts within the same similar subset are similar.
5. The method according to claim 1, wherein:
the processing of [iv] can comprise dividing identical-core subsets: the plurality of texts, or parts of their content, can be divided into a plurality of identical-core subsets, with the requirement that the keyword-adjacent contents of the texts or text parts within the same identical-core subset are all identical.
6. The method according to claim 1, 4, or 5, wherein:
the processing of [iv] can comprise similar subset subdivision: on the basis of dividing similar subsets or dividing identical-core subsets, a stricter criterion, or additional decision factors, for whether the keyword-adjacent contents are similar is applied, so that the texts or text parts in any existing similar subset or identical-core subset are divided into a plurality of next-level subsets with a higher degree of similarity.
7. The method according to claim 1, 4, or 5, wherein:
the processing of [iv] can comprise similar subset re-division: on the basis of dividing similar subsets or dividing identical-core subsets, for the texts or text parts in an existing similar subset or identical-core subset, new content within a certain adjacent range outside the original delimited range of keyword-adjacent content is compared again for similarity, and the texts or text parts are divided into a plurality of next-level similar subsets according to whether that content is similar.
8. The method according to claim 1, wherein:
the processing of [iv] can comprise arranging a sequence of dissimilar items: a sequence of dissimilar items can be arranged from the plurality of texts, such that the keyword-adjacent contents of the different texts or text parts in the same sequence are all, or substantially all, dissimilar to one another; or such that, among all or most of the texts or text parts in the same sequence, none has keyword-adjacent content that is similar or identical to that of one other text or text part, or of more than a specified number of them.
9. The method according to claim 1, wherein:
the processing of [iv] can comprise arranging a sequence of non-identical core contents: a sequence of non-identical core contents can be arranged from the plurality of texts, such that the keyword-adjacent contents of the different texts or text parts in the same sequence are all, or substantially all, not completely identical to one another; or such that, among all or most of the texts or text parts in the same sequence, none has keyword-adjacent content that is identical to that of one other text or text part, or of more than a specified number of them.
10. The method according to claim 1, wherein:
the processing of [iv] can comprise directory grouping, or arranging a sequence of the similar content of different subsets: the similar or identical content, or part of it, shared by the keyword-adjacent contents of the texts of each divided similar subset can be taken as an entry and assembled into a directory or sequence; or it can be assembled into a tree-structured directory together with entries formed from the similar or identical content, or part of it, shared by the keyword-adjacent contents of the texts of the next-level subsets of each similar subset.
11. The method according to claim 1, wherein:
the processing of [iv] applied to these texts can comprise arranging a representative sequence: one or more texts can be taken from each similar subset or identical-core subset, and these texts or text parts are assembled into a sequence.
12. The method according to claim 1, 8, 9, 10, or 11, wherein:
the processing of [iv] can comprise sequence recompression: for the keyword-adjacent contents of the texts or text parts in an existing sequence of dissimilar items, representative sequence, directory grouping, or sequence of the similar content of different subsets, a looser similarity comparison is carried out using a looser criterion for similarity, producing within the existing sequence new similar subsets of texts or text parts, or a more refined sequence of dissimilar items, representative sequence, or directory grouping.
13. The method according to claim 1 or 9, wherein:
the processing of [iv] can comprise identical-core division followed by re-aggregation: a sequence of non-identical core contents is arranged first, and then the keyword-adjacent contents of the texts or text parts in the resulting sequence are compared for similarity using the criterion for similarity, producing within the existing sequence new similar subsets of texts or text parts, or a more refined sequence of dissimilar items, representative sequence, or directory grouping.
14. The method according to claim 1, 2, or 3, wherein:
the processing of [iv] can comprise interface display and operation: designated information about the processing, including the processing mode and the results, can be displayed on the interactive interface; the inquirer is allowed to make selections or give instructions about the processing on the interactive interface, by cursor click or by keyboard; and, as required, the more detailed content corresponding to a subset, entry, item, text, text part, abstract, bibliographic record, or word in the relevant directory or sequence, or the directory or sequence of the next-level subsets, can be displayed on the interactive interface.
15. The method according to claim 1, 2, or 3, wherein:
the processing of [iv] applied to these texts can comprise marking counts: within or near the sequence, directory, entry, text, bibliographic record, or abstract example, or the keyword-adjacent content they contain, an indication can be given of the corresponding number of parallel subsets, number of lower-level subsets, or number of texts, or of the number of parallel subsets, contained lower-level subsets, or texts of the subset in which a related word or word segment lies.
16. The method according to claim 1, 2, 3, 8, 9, 10, or 11, wherein:
the processing of [iv] can comprise determining the ordering: the arrangement, display order, or position of an element among the plurality of elements contained in the above directory, sequence, or subset can be random, or can depend partly or wholly on one or more of the following factors:
the page link value, the click rate, or the keyword occurrence rate of the text that the element contains or belongs to,
or the number of lower-level subsets or lower-level texts of the subset, the click rate of the subset, or the mean page link value of the texts of the subset,
or the number of lower-level subsets or lower-level texts of the subset to which the element belongs, the click rate of that subset, or the mean page link value of its texts,
or the page link value of the text with the highest page link value in the subset, or of another example text,
or the click rate or keyword occurrence rate of the text with the highest click rate or keyword occurrence rate in the subset, or of another example text,
or the ranking of the element, or of the related texts in the associated subset, in the search results of other search websites or retrieval systems,
or the level of the relevant payment or bid of an investor associated with the element,
or the alphabetical order of the spelling or pinyin, or the stroke order, of the words or characters of the element,
or the rating of the text's source website, linking websites, linking web pages, organizations, or people,
or the order in which the related texts were indexed, or how recent they are,
or whether they belong to the same subset at a given level,
or the value of an objective function that depends on the weighted values of one or more variables, where some or all of the variables represent one or more of the factors listed above.
17. A data retrieval system, comprising:
a data processing unit and, connected to it, an input unit, an output unit, and a text database, the data processing unit being able to receive a keyword query through the input unit, collect and process relevant data from the text database or, where necessary, from the Internet, and send the retrieval results to the output unit;
characterized in that: the data processing unit comprises a memory and a keyword-adjacent content processing device;
the keyword-adjacent content processing device can
[i] obtain a plurality of electronic texts that contain the same keyword query term;
[ii] determine the same delimited range of keyword-adjacent content for the keyword query term in the content of each text, the keyword-adjacent content of the keyword query term being the content, other than the keyword query term itself, that lies adjacent to it within the delimited range in the text content;
[iii] specify a criterion for whether the keyword-adjacent contents of the keyword query term in different texts are similar, the criterion including, directly or indirectly, at least a requirement on the number or proportion of mutually identical parts in the keyword-adjacent contents from the different texts, where an identical part can be a mutually identical word, root, character, or phrase;
[iv] according to the criterion of [iii], determine whether the keyword-adjacent contents of the keyword query term in these texts are similar to one another, classify the texts according to whether those contents are similar to one another, and process the texts identically or differently according to their classification;
the processing can comprise one or more of the following:
dividing similar subsets, dividing identical-core subsets, similar subset subdivision, similar subset re-division, arranging a sequence of dissimilar items, arranging a sequence of non-identical core contents, directory grouping or arranging a sequence of the similar content of different subsets, arranging a representative sequence, sequence recompression, identical-core division followed by re-aggregation, interface display and operation, marking counts, and determining the ordering.
18. A search engine system that, in response to a user's request made through an interactive interface, provides the requested search results, comprising:
a server, coupled through a communication network or line to the client computer on which the interactive interface resides;
a search engine located on the server, the search engine comprising a database that includes a keyword index, and a query processor that can query the database according to the keyword request submitted by the inquirer and supply the resulting list of related data to the interactive interface;
characterized in that:
the query processor or the search engine further comprises a keyword-adjacent content comparison processing device, which can
[i] obtain a plurality of electronic texts that contain the same keyword query term;
[ii] determine the same delimited range of keyword-adjacent content for the keyword query term in the content of each text, the keyword-adjacent content of the keyword query term being the content, other than the keyword query term itself, that lies adjacent to it within the delimited range in the text content;
[iii] specify a criterion for whether the keyword-adjacent contents of the keyword query term in different texts are similar, the criterion including, directly or indirectly, at least a requirement on the number or proportion of mutually identical parts in the keyword-adjacent contents from the different texts, where an identical part can be a mutually identical word, root, character, or phrase;
[iv] according to the criterion of [iii], determine whether the keyword-adjacent contents of the keyword query term in these texts are similar to one another, classify the texts according to whether those contents are similar to one another, and process the texts identically or differently according to their classification;
the processing can comprise one or more of the following:
dividing similar subsets, dividing identical-core subsets, similar subset subdivision, similar subset re-division, arranging a sequence of dissimilar items, arranging a sequence of non-identical core contents, directory grouping or arranging a sequence of the similar content of different subsets, arranging a representative sequence, sequence recompression, identical-core division followed by re-aggregation, interface display and operation, marking counts, and determining the ordering.
CNA2007101641489A 2007-02-15 2007-10-08 Electric text similarity processing method and system convenient for query Pending CN101246484A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2007101641489A CN101246484A (en) 2007-02-15 2007-10-08 Electric text similarity processing method and system convenient for query

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
CN200710079309 2007-02-15
CN200710079309.4 2007-02-15
CNA2007101641489A CN101246484A (en) 2007-02-15 2007-10-08 Electric text similarity processing method and system convenient for query

Publications (1)

Publication Number Publication Date
CN101246484A true CN101246484A (en) 2008-08-20

Family

ID=39946940

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007101641489A Pending CN101246484A (en) 2007-02-15 2007-10-08 Electric text similarity processing method and system convenient for query

Country Status (1)

Country Link
CN (1) CN101246484A (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999508A (en) * 2011-09-13 2013-03-27 腾讯科技(深圳)有限公司 Method and system for sequencing search results
CN102999508B (en) * 2011-09-13 2016-05-11 腾讯科技(深圳)有限公司 Search result ordering method and system
CN103136281A (en) * 2011-12-05 2013-06-05 英顺源(上海)科技有限公司 Web search result display system and method thereof
CN103218371B (en) * 2012-01-20 2017-04-26 华为终端有限公司 information aggregation method and device
CN103218371A (en) * 2012-01-20 2013-07-24 华为终端有限公司 Information aggregation method and device
WO2013107297A1 (en) * 2012-01-20 2013-07-25 华为终端有限公司 Information aggregation method and device
CN103902552B (en) * 2012-12-25 2019-03-26 深圳市世纪光速信息技术有限公司 The method for digging and device of stop words, searching method and device, evaluating method and device
CN103235827B (en) * 2013-05-13 2016-04-20 政和科技股份有限公司 A kind of method of scientific and technical information automatic classification screening
CN103235827A (en) * 2013-05-13 2013-08-07 济南政和科技有限公司 Method for automatically classifying and screening scientific and technological information
CN109219811A (en) * 2016-05-23 2019-01-15 微软技术许可有限责任公司 Relevant paragraph searching system
CN109219811B (en) * 2016-05-23 2022-03-29 微软技术许可有限责任公司 Related paragraph retrieval system
CN110019660A (en) * 2017-08-06 2019-07-16 北京国双科技有限公司 Similar text detection method and device
CN108021640B (en) * 2017-11-29 2019-08-16 有米科技股份有限公司 Keyword expanding method and device based on associated application
CN108021640A (en) * 2017-11-29 2018-05-11 有米科技股份有限公司 Keyword expanding method and device based on associated application
CN112131348B (en) * 2020-09-29 2022-08-09 四川财经职业学院 Method for preventing repeated declaration of project based on similarity of text and image
CN112131348A (en) * 2020-09-29 2020-12-25 四川财经职业学院 Method for preventing repeated declaration of project based on similarity of text and image
CN113591853A (en) * 2021-08-10 2021-11-02 北京达佳互联信息技术有限公司 Keyword extraction method and device and electronic equipment
CN113591853B (en) * 2021-08-10 2024-04-19 北京达佳互联信息技术有限公司 Keyword extraction method and device and electronic equipment
CN116433197A (en) * 2023-06-13 2023-07-14 建信金融科技有限责任公司 Information reporting method, device, reporting end and storage medium
CN116433197B (en) * 2023-06-13 2023-09-12 建信金融科技有限责任公司 Information reporting method, device, reporting end and storage medium
CN117573727A (en) * 2024-01-17 2024-02-20 湖南天承信息技术有限公司 Practitioner health physical examination information retrieval system
CN117573727B (en) * 2024-01-17 2024-03-26 湖南天承信息技术有限公司 Practitioner health physical examination information retrieval system

Similar Documents

Publication Publication Date Title
CN101246484A (en) Electric text similarity processing method and system convenient for query
US9323827B2 (en) Identifying key terms related to similar passages
US7895595B2 (en) Automatic method and system for formulating and transforming representations of context used by information services
CN100501745C (en) Convenient method and system for electronic text-processing and searching
US7403932B2 (en) Text differentiation methods, systems, and computer program products for content analysis
US9384245B2 (en) Method and system for assessing relevant properties of work contexts for use by information services
KR101375940B1 (en) Systems and methods for providing advanced search result page content
CN100462972C (en) Document-based information and uniform resource locator (URL) management method and device
US20180004850A1 (en) Method for inputting and processing feature word of file content
US10909202B2 (en) Information providing text reader
US7024405B2 (en) Method and apparatus for improved internet searching
CN1954321A (en) Query rewriting with entity detection
US20090119283A1 (en) System and Method of Improving and Enhancing Electronic File Searching
Weber et al. Investigating textual case-based XAI
CN107870915A (en) Instruction to search result
Sivakumar Effectual web content mining using noise removal from web pages
US20140358969A1 (en) Method for searching in a database
CN103136356A (en) Processing method for search engine end-user to input prompt messages of reference documents
Ahamed et al. Deduce user search progression with feedback session
Cunningham et al. Knowledge management and human language: crossing the chasm
KR101120040B1 (en) Apparatus for recommending related query and method thereof
KR101124213B1 (en) System of customized newsletter service using ontology
More et al. Graph-Based Multi-document Text Summarization Using NLP
KR102594717B1 (en) Priority-centered selection document adoption system based on multiple search keywords and driving method of the same
Zhao et al. Improving academic homepage identification from the web using neural networks

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 2008-08-20