WO2008098467A1 - Convenient method and system of electric text processing and retrieve - Google Patents

Publication number
WO2008098467A1
WO2008098467A1 PCT/CN2008/000190 CN2008000190W WO2008098467A1 WO 2008098467 A1 WO2008098467 A1 WO 2008098467A1 CN 2008000190 W CN2008000190 W CN 2008000190W WO 2008098467 A1 WO2008098467 A1 WO 2008098467A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
keyword
different
segment
same
Prior art date
Application number
PCT/CN2008/000190
Other languages
French (fr)
Chinese (zh)
Inventor
Erzhong Liu
Original Assignee
Erzhong Liu
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN 200710087104 external-priority patent/CN101063975A/en
Application filed by Erzhong Liu filed Critical Erzhong Liu
Publication of WO2008098467A1 publication Critical patent/WO2008098467A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Definitions

  • the present invention relates to techniques for computer and search engine processing and retrieval of electronic text.
  • A search-engine-centric search system is generally located on one or more servers or other computer devices. It is composed of a text (page) library, a text index library, an index constructor that builds the text index by analyzing the texts in the text library, and a querier that generates search results. It often also comes with a data collection server that collects texts from the Internet or other information sources and adds them to the text library.
  • The system receives the queryer's keyword query request through the interactive interface on the client and the communication network or communication line, queries the text index library or the text library, and analyzes the correlation between the keyword request and the texts. The relevant results are obtained and sorted, and then provided to the interactive interface via the communication network or line.
  • This kind of search system is very convenient and quick to use, but the total number of indexes included in the return result is still very large and difficult to check one by one.
  • This technique only improves the efficiency of keyword search in a statistical sense, and does not guarantee that every user's desired results will be ranked near the front of a large index list. For example, searching the "Google" Chinese website for the word “Bulin” returns nearly 300,000 indexes. There is still no guarantee that the desired content appears near the front without omission, so the search is neither rigorous nor convenient. Moreover, before reaching the expected information, users have to read through irrelevant information or information that repeats the same main content.
  • U.S. Patent No. 7089236 can perform semantic analysis on keywords proposed by a queryer, and present different possible semantics on the interactive interface to help the inquirer narrow down the search range.
  • the technology of Chinese Patent Application No. 200510081867.5 which is similar to the above, uses the webpage category information to disperse the keyword search results of the search engine.
  • The problem with these two technologies is that, first of all, they require a very complex and large classification database that can never be fully accurate. It is very difficult for a machine to judge which semantics or categories a given keyword or text belongs to, so the reliability is not high. The different semantics or categories of a keyword are likely to leave gaps between them or to overlap. If the number of classification levels is increased, the overlap causes a surge in storage space. At the same time, queryers performing keyword searches often face unfamiliar fields and find it difficult to grasp the many semantics or classifications accurately. These problems seriously limit any improvement in query efficiency.
  • The object of the present invention is to provide a technique for electronic text processing and for retrieval or search by a computer or a search engine, which, when the user performs a keyword search and faces a large number of search results, can quickly and rigorously narrow the search range or eliminate all kinds of irrelevant or duplicate information, and accurately obtain the desired result with little omission.
  • One aspect of the present invention provides a computer-implemented method of processing a plurality of electronic texts containing the same keyword, including:
  • The corresponding identical or different processing may include: the corresponding texts have the same or different distribution positions or storage manners, or obtain the same or different subset marks, or their indexes have the same or different marks or index items, or they have the same or different arrangement, or they have the same or different display modes or positions in the interactive interface, or at least some of the subsets are allowed to have one or more adjacent segments or texts combined or sorted across subsets or displayed in the interactive interface.
  • The text may be an electronic file or a web page, or its abstract, index, title or topic.
  • The above-mentioned adjacent segment or indirectly adjacent segment may precede the keyword or follow it. It is generally a segment composed of one or more words, or even word roots, in the text content, including certain characters such as abbreviations or punctuation when needed. In some cases, when judging whether segments are the same or different, differences in the prefixes or suffixes of some words, in function words or non-content words, or in punctuation or spaces may be ignored.
  • Numerals, quantifiers, adjectives or adverbs may or may not be taken into account or distinguished, and the same applies even to the presence or absence of articles or conjunctions.
  • The adjacent segment may consist of a single word (such as the preceding word) or of several words.
  • The number of words or characters contained in the adjacent segment, or the manner of cutting it off or its specific content, may be predetermined, or may be agreed to, left at a default, or selected by the queryer.
  • Judging the length of a segment is similar to judging whether two segments are the same or different: the prefixes or suffixes of certain words, function words, auxiliary words, numerals, quantifiers, non-content words, punctuation or spaces, or even the presence of or differences in adjectives or adverbs, may be omitted or ignored.
  • the benefits of the method of the invention for retrieval are significant.
  • If the inquirer is interested in a certain adjacent segment of the keyword, it is easy to obtain all the texts of the category containing that adjacent segment; if not, it is just as easy to skip all of those texts.
  • The key point of the present invention is that the content adjacent to the keyword is what most likely determines the specific meaning or direction of the keyword in the text, and this is what should interest the searcher most.
  • The method can completely avoid the phenomenon of "content overlap and gaps between different categories or subsets", which is difficult to avoid with classification-based retrieval and which ultimately makes a multi-level system of classified subsets difficult to use. This is why the search effect of the method or system of the present invention will be significantly improved.
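  • The grouping idea described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `following_segment` and `group_by_adjacent_segment` are hypothetical, words are assumed to be separated by whitespace, and only the segment that directly follows the first occurrence of the keyword is used.

```python
from collections import defaultdict

def following_segment(text, keyword, n_words=1):
    """Return the n_words words directly after the first occurrence of keyword, or None."""
    words = text.lower().split()
    key = keyword.lower().split()
    for i in range(len(words) - len(key) + 1):
        if words[i:i + len(key)] == key:
            return " ".join(words[i + len(key): i + len(key) + n_words])
    return None  # keyword not present in this text

def group_by_adjacent_segment(texts, keyword, n_words=1):
    """Put texts with the same adjacent segment into the same subset."""
    subsets = defaultdict(list)
    for text in texts:
        segment = following_segment(text, keyword, n_words)
        if segment is not None:
            subsets[segment].append(text)
    return dict(subsets)

# Illustrative sample texts (not taken from the patent).
texts = [
    "a search engine index is built offline",
    "the search engine index grows daily",
    "every search engine company competes",
]
subsets = group_by_adjacent_segment(texts, "search engine")
for segment, members in subsets.items():
    print(segment, len(members))
```

  • Because every text falls into exactly one subset per level (the one named by its own adjacent segment), the subsets in this sketch neither overlap nor leave gaps, which is the property the description contrasts with category-based classification.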
  • The processing method may further include: for different texts, or their contents, that belong to the same first-level or higher-level subset and contain the same keyword and the same adjacent segment, dividing some or all of the texts into the same or different lower-level or multi-level subsets of that subset, or processing them correspondingly in the same or different ways, according to whether the other adjacent segments next to that adjacent segment are the same or different.
  • the described processing method allows for the merging or separation of successive adjacent segments to reduce or increase the subset hierarchy.
  • the processing method may further include:
  • The processing method may further include: for a keyword's adjacent segment or indirectly adjacent segment in the directory, tree directory or sequence, if there is only one adjacent segment at the next level, the segment may be distributed, stored or displayed at its original location together with that next-level adjacent segment.
  • the processing method may further include:
  • The plurality of texts contain the same keyword and the same adjacent segment or group of adjacent segments, and the classification is made according to whether the segment adjacent to that adjacent segment is the same in the respective texts, or further, step by step, according to whether the adjacent segments at the following levels are the same.
  • The processing method may further include: the above-mentioned text, directory, sentence, example sentence or abstract instance, or the vicinity of the keyword, adjacent segment or indirectly adjacent segment they contain, may carry their corresponding parallel subsets.
  • the processing method of the present invention may further include:
  • A sequence of a plurality of texts, or partial text contents, containing the same keyword, in which the adjacent segments composed of several words are different, or substantially different, from one another. In other words, each group of texts or text portions containing the same adjacent segment is represented by only one or a few of them.
  • The representative information sequence of the same keyword can thus be reduced by about half in number and replace the original mass of information, which makes reading more convenient.
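  • As an illustration of the representative sequence described above, the following sketch keeps only the first text for each distinct adjacent segment. The names `following_segment` and `representative_sequence` are hypothetical, whitespace tokenization is assumed, and keeping exactly one representative (rather than "one or a few") is a simplification.

```python
def following_segment(text, keyword, n_words):
    """Return the n_words words directly after the first occurrence of keyword, or None."""
    words = text.lower().split()
    key = keyword.lower().split()
    for i in range(len(words) - len(key) + 1):
        if words[i:i + len(key)] == key:
            return " ".join(words[i + len(key): i + len(key) + n_words])
    return None

def representative_sequence(texts, keyword, n_words=2):
    """Keep one text per distinct adjacent segment, preserving the original order."""
    seen = set()
    representatives = []
    for text in texts:
        segment = following_segment(text, keyword, n_words)
        if segment is None or segment in seen:
            continue  # no keyword, or this adjacent segment is already represented
        seen.add(segment)
        representatives.append(text)
    return representatives

# Illustrative sample titles (not taken from the patent).
titles = [
    "search engine ranking factors explained",
    "search engine ranking factors in 2008",
    "search engine index construction",
]
reps = representative_sequence(titles, "search engine")
print(reps)
```

  • With a two-word adjacent segment, the first two sample titles share "ranking factors", so only the first of them survives alongside the "index construction" title.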
  • the processing method of the present invention may further include:
  • The method may further include: treating the respective directories or sequences, or their charts, as independent texts, so that each independent text has a title containing the corresponding keyword, or a title containing the corresponding keyword together with the same adjacent segment, adjacent segment group or similar subset name;
  • The title of the independent text of the directory, sequence or chart carries, in or near the title, a prompt character or combination of characters indicating its associated feature.
  • The processing method may further include: enabling the keyword search result sorting device of the related search engine or search system to recognize the prompt character or combination of characters in the title of the independent text of the directory or sequence and, if desired, to place the independent texts whose titles contain (or do not contain) the corresponding character or combination at the front of the search results.
  • The processing method may further include: distributing part or all of the directory text, or the keyword index data of its title (for example, a keyword inverted index list), in the search engine or the retrieval system.
  • The present technique allows a queryer to indicate text, graphics or symbols in a directory, sequence or other content on the interactive interface, for example by clicking with a cursor, in order to select, expand or link to related content.
  • The processing method of the present invention may further include: in the processing method or the directory, the specific sorting position of parallel subsets, parallel adjacent segments, parallel texts or text portions, or representative sequence information is determined, partially or completely, by one or more of the following for the relevant subset, text, paragraph, content or information: page link value, click rate, keyword occurrence rate, number of subordinate subsets or subordinate texts, subset click rate, the average or maximum page link value of the texts, the ranking of the search results on the website or system, bid, spelling, stroke count, source rating, time of inclusion, and others, or by the value of a corresponding objective function.
  • The processing method of the present invention may further comprise: allowing additional keywords that must or must not be present to be added or removed, or restrictions on time, geography, language or other type, scope or requirement to be added or removed, in the method or result of the processing, to obtain further refined or broader results.
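  • One possible way to order parallel subsets is sketched below. The sort key is purely an assumption chosen for illustration (click count first, then number of texts in the subset, then spelling); the description lists many other admissible criteria, and the function name `order_subsets` and the sample figures are hypothetical.

```python
def order_subsets(subsets, clicks=None):
    """Order subset names by clicks (descending), subset size (descending), then spelling."""
    clicks = clicks or {}
    return sorted(
        subsets,
        key=lambda segment: (-clicks.get(segment, 0), -len(subsets[segment]), segment),
    )

# Hypothetical subsets keyed by adjacent segment, and hypothetical click counts.
subsets = {
    "index": ["doc1", "doc2"],
    "company": ["doc3"],
    "results": ["doc4", "doc5", "doc6"],
}
clicks = {"company": 10}

order = order_subsets(subsets, clicks)
print(order)  # ['company', 'results', 'index']
```

  • Any of the other listed signals (page link value, time of inclusion, bid, and so on) could be added as further components of the tuple, or combined into a single objective function value.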
  • Another aspect of the present invention is a computer data system including a storage device, wherein the storage device or its data portion contains part or all of a keyword index, text summaries or text data distributed in the following manner:
  • Still another aspect of the present invention is a computer data system including a storage device, wherein the data structure of some or all of the keyword index contained in the storage device or its data portion includes at least: a keyword segment;
  • one or more adjacent segments, consisting of a predetermined number of segments that adjoin the keyword, one after another, in the corresponding text content or text summary, in order: adjacent segment 1, adjacent segment 2, ... adjacent segment N;
  • the ID segment refers to an address segment
  • The system may allow the computer data system to look up the corresponding index or content, directly or after transformation, according to the combination of the keyword segment and one or more of the adjacent segments specified in the search request, or according to the number of combined segments.
  • The present invention may comprise a computer readable medium having the data structure of the above keyword index, or a computer readable medium storing instructions for implementing a data structure that utilizes the above-described keyword index.
  • The present invention can be a computer-readable medium storing instructions executable by one or more processing devices for implementing a method of processing a plurality of electronic texts containing the same keyword, which may include:
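  • The index record described above (keyword segment, adjacent segments 1 to N, ID segment) could be modelled roughly as follows. This is a sketch under stated assumptions: the class `IndexEntry` and the functions `build_entries` and `lookup` are hypothetical names, each adjacent segment is taken to be a single whitespace-separated word following the keyword, and the ID segment is a plain string standing in for a database address or URL.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

@dataclass(frozen=True)
class IndexEntry:
    keyword: str                 # keyword segment
    adjacent: Tuple[str, ...]    # adjacent segment 1 ... adjacent segment N
    text_id: str                 # ID (address) segment

def build_entries(text_id: str, text: str, keyword: str, n_segments: int = 2) -> List[IndexEntry]:
    """Create one entry per occurrence of keyword, with up to n_segments following words."""
    words = text.lower().split()
    key = keyword.lower().split()
    entries = []
    for i in range(len(words) - len(key) + 1):
        if words[i:i + len(key)] == key:
            following = tuple(words[i + len(key): i + len(key) + n_segments])
            entries.append(IndexEntry(keyword.lower(), following, text_id))
    return entries

def lookup(entries: Sequence[IndexEntry], keyword: str, prefix: Sequence[str] = ()) -> List[str]:
    """Return IDs whose entry matches the keyword plus the given leading adjacent segments."""
    wanted = tuple(word.lower() for word in prefix)
    return [
        entry.text_id
        for entry in entries
        if entry.keyword == keyword.lower() and entry.adjacent[:len(wanted)] == wanted
    ]

entries = (
    build_entries("doc1", "a search engine index is large", "search engine")
    + build_entries("doc2", "the search engine company grew", "search engine")
)
print(lookup(entries, "search engine", ["index"]))  # ['doc1']
print(lookup(entries, "search engine"))             # ['doc1', 'doc2']
```

  • Passing a longer `prefix` corresponds to searching on the keyword combined with more adjacent segments, which narrows the result list step by step.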
  • the text is divided into the same or different subsets or categories or correspondingly the same or different processing.
  • The corresponding identical or different processing may include: the corresponding texts have the same or different distribution positions or storage manners, or obtain the same or different subset marks, or their indexes have the same or different marks or index items, or they have the same or different arrangement, or they have the same or different display modes or positions in the interactive interface, or at least some of the subsets are allowed to have one or more adjacent segments or texts combined or sorted across subsets or displayed in the interactive interface.
  • Yet another aspect of the present invention is to provide a search method by which a search engine provides a queryer with the desired results. In response to a keyword query request submitted by the queryer via an interactive interface, the search engine system searches the database of a system-related information source and provides texts, text summaries, indexes or related information that meet the keyword requirements. This search method is characterized in that it includes:
  • the system receives the queryer's keyword query request via the interactive interface
  • the number of words or characters included in the adjacent segment or the manner in which the adjacent segment is cut off is predetermined by the above system or agreed by the queryer or default or selected.
  • the different adjacent segments or different key statements are sorted out.
  • generating search results according to the obtained key statements, that is, retrieving, processing, arranging or organizing the different indexes, text summaries, texts or titles that contain the same or different key statements, for the queryer to select via the interactive interface.
  • the above operations can be performed by the system in advance or at the time of inquiry.
  • the search method may further include:
  • the number of words or characters included in the adjacent segment or the manner or specific content of the adjacent segment may be predetermined by the above system or agreed by the inquirer or default or selected.
  • Generating search results based on the key statements obtained by expansion (which can be expanded multiple times if needed), that is, retrieving, processing, arranging, organizing or separately storing the different indexes, text summaries, texts or titles that contain the same or different extended key statements, for the queryer to select via the interactive interface.
  • the above operations can be performed by the system in advance or at the time of inquiry, as needed.
  • The cut-off position of the adjacent segment mentioned above may be set, for example, by specifying the number of words in each adjacent segment, such as one word.
  • The processing method, data system or search method may be used in an Internet search engine system, or in a local or independent computer information base search system, such as a digital library system or a document database system. In this way, the preliminary results of the keyword search are subdivided into a system of subsets, which is convenient for users to choose from.
  • the search method may also include a grouping operation as needed:
  • This grouping has a certain unity or representativeness, which can help users to select a small amount of interactive interface screen information.
  • When the length of the key statement we choose reaches a certain level, the core content of the index, title or summary sequence will no longer be repeated.
  • the search method may further include:
  • the search method may also include:
  • the index of the ID segment (address segment), including a text summary or title if necessary, and stored.
  • the corresponding index or abstract or title or text can be retrieved and provided in the store according to the keyword segments and the adjacent segments it contains.
  • the text address which can be a database address or an internet address or URL or a form that represents a hash of the URL field, can provide a link to access or open the text.
  • the search method may further include:
  • a tree-like directory (Fig. 8) that reflects the sequential or juxtaposed relationship between the adjacent segments of the keyword in texts or text summaries having the same keyword, or a tree directory that reflects the sequential or juxtaposed relationship between key statements at different levels of expansion of the keyword; used for querying.
  • the search method may also include a selected operation:
  • The system is allowed to determine the corresponding key statement from the text, text summary, title, key statement or adjacent segment directory on the queryer's page, or from the queryer's selection bar or selection box, and then to perform a grouping operation or directory display of the various extended key statements, extended adjacent segments, indexes, text summaries, texts or titles corresponding to that key statement, or to sort and present the corresponding indexes, text summaries, texts or titles, or to perform a removal operation based on the determined key statement, culling or moving back the entries, indexes, text summaries, texts or titles containing that key statement on the current page or on other pages.
  • the search method may also include an ignore operation:
  • When the queryer browses a sequence of indexes, text summaries, titles or texts containing the original keyword or the original key statement, the system determines, from the queryer's page operations on the interactive interface, the current position reached in the sequence. If it can be determined that, within a certain range before that position, the indexes, text summaries, titles or texts containing a certain key statement, or the key statement itself, have gone a certain number of times without being opened, linked, clicked, followed or otherwise marked for retention, a removal operation is performed according to that key statement: the indexes, text summaries, titles or texts containing it are removed from, or moved back in, the current page or other pages.
  • In this way, information similar to files that the queryer has long ignored during reading can be moved backward in, or eliminated from, the remaining sequence, reducing the burden of excessive useless information.
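  • The ignore operation can be sketched as follows. The threshold `min_skips`, the function name `apply_ignore` and the data layout are all assumptions made for illustration; the description leaves the "certain range" and "certain number of times" open. In this sketch an ignored key statement causes later entries to be moved to the back rather than deleted.

```python
def apply_ignore(sequence, position, clicked, statement_of, min_skips=2):
    """Move not-yet-read entries whose key statement the reader keeps skipping to the back.

    sequence:      entry IDs in display order
    position:      number of entries the reader has already passed
    clicked:       set of entry IDs the reader opened
    statement_of:  mapping from entry ID to its key statement
    """
    passed, remaining = sequence[:position], sequence[position:]
    skipped = {}
    for entry in passed:
        if entry not in clicked:
            statement = statement_of[entry]
            skipped[statement] = skipped.get(statement, 0) + 1
    # A key statement is never ignored if the reader opened any entry carrying it.
    opened = {statement_of[entry] for entry in passed if entry in clicked}
    ignored = {s for s, count in skipped.items() if count >= min_skips and s not in opened}
    kept = [entry for entry in remaining if statement_of[entry] not in ignored]
    moved = [entry for entry in remaining if statement_of[entry] in ignored]
    return passed + kept + moved

# Hypothetical result sequence and key statements.
sequence = ["d1", "d2", "d3", "d4", "d5", "d6"]
statement_of = {
    "d1": "engine index", "d2": "engine index", "d3": "engine company",
    "d4": "engine index", "d5": "engine company", "d6": "engine results",
}
reordered = apply_ignore(sequence, position=3, clicked={"d3"}, statement_of=statement_of)
print(reordered)  # ['d1', 'd2', 'd3', 'd5', 'd6', 'd4']
```

  • Here the reader passed two "engine index" entries without opening either, so the remaining "engine index" entry `d4` is pushed behind the others.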
  • Another aspect of the present invention is to provide a search engine system that provides a desired search result in response to a request by the querier 1 via the interactive interface 2, including:
  • the server is connected to the client where the interaction interface 2 is located via the communication network 4 or the line
  • search engine 8 of the server 5 comprising: a database 9 including a keyword index, and a querier 11 capable of querying the database 9 in accordance with a keyword request by the queryer and providing the resulting list of related data to the interactive interface 2;
  • the database 9 also includes text summaries or texts containing keywords, which may be included within the keyword index;
  • the querier 11 or the search engine 8 includes a keyword expansion component 10 that can perform an expansion operation on the keywords of the query: the keywords, as they appear in the text content or text summaries containing them, are combined with their adjacent segments to form different extended key statements, and the different adjacent segments of the keywords are listed for the queryer to select via the interactive interface, or the different indexes, text summaries or texts containing the same or different extended key statements are retrieved, processed, arranged or organized for the queryer to select via the interactive interface;
  • the keyword expansion unit 10 can also perform one or more expansion operations on key statements and perform corresponding queries or processing.
  • the "keyword of the query” described herein may refer to "a keyword being queried” or "a keyword to be queried”.
  • the search engine system described above may be a search system for the online customer service located on the Internet, or may be a stand-alone computer information base search system.
  • the server 5 is a computer storage and processing device, which may be a single device or a plurality of groups or distributed configurations.
  • the client 3 can be a personal computer or workstation or other computer device, and an appropriate browser can be configured as needed.
  • The search engine system may further allow: the search engine includes an index construction component 13 for analyzing the texts in the database, or texts obtained from the Internet 4 or other information sources by the data collection server 12 attached to the search engine, generating an index that contains at least the keyword segment, the adjacent segment and the text ID segment, and storing the corresponding text.
  • the number of words in each adjacent segment can be simply specified here, for example, the number of words is one.
  • The search engine system may further comprise a tree directory of adjacent segments reflecting the relationship between the adjacent segments of the keyword in texts or text summaries having the same keyword (Fig. 8), or a tree directory reflecting the relationship between key statements at different levels of expansion of the keyword.
  • The present invention may further comprise a computer-readable medium holding a tree directory of adjacent segments that reflects the relationship between the different levels of adjacent segments of the keyword in the texts, text summaries or titles, or a tree directory that reflects the relationship between key statements at different levels of expansion of the keyword.
  • the computer readable medium stores instructions which can be executed by one or more processing means to reflect the tree directory.
  • the above tree directory can actually reflect the parallel relationship of peer segments or key statements at the same time.
  • Each key segment in the adjacent segment tree directory may also display the number of its subsets, the number of its subsequent segments, or the number of files it contains.
  • The search engine system may further include a graphical user interaction interface (FIG. 10) that allows the queryer to add additional query information. The interface may include a dialog box or selection box 51 to receive the queryer's choice of operation mode, and may contain clickable key statements, adjacent segments, sentences, paragraphs, operation commands, or selectable text, symbols or graphics.
  • The search technology of the invention, with keywords and their adjacent segments at its core, combines the rigor of a dictionary with a convenience that clearly surpasses the prior art in dividing and continuously narrowing the range of search results for the same keyword, often by hundreds of times.
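  • A tree directory of adjacent segments with per-node file counts might be built as in the sketch below. The nested-dictionary layout and the names `build_tree` and `render` are assumptions for illustration, one whitespace-separated word is used per level, and only the first occurrence of the keyword in each text is counted.

```python
def build_tree(texts, keyword, depth=2):
    """Build a nested directory: each level holds the next word after the keyword."""
    key = keyword.lower().split()
    root = {"count": 0, "children": {}}
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - len(key) + 1):
            if words[i:i + len(key)] == key:
                root["count"] += 1
                node = root
                for word in words[i + len(key): i + len(key) + depth]:
                    node = node["children"].setdefault(word, {"count": 0, "children": {}})
                    node["count"] += 1
                break  # only the first occurrence in each text is indexed
    return root

def render(node, label, indent=0):
    """Return indented directory lines, each showing the number of texts it covers."""
    lines = [f"{'  ' * indent}{label} ({node['count']})"]
    for word in sorted(node["children"]):
        lines.extend(render(node["children"][word], word, indent + 1))
    return lines

# Illustrative sample texts (not taken from the patent).
texts = [
    "a search engine index is built offline",
    "the search engine index grows daily",
    "every search engine company competes",
]
tree = build_tree(texts, "search engine")
lines = render(tree, "search engine")
print("\n".join(lines))
```

  • Sibling nodes at the same indentation express the parallel relationship between segments, and parent-child nesting expresses their sequential relationship.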
  • Figure 1 is a schematic diagram showing an example of the number of words contained in an adjacent segment or the way in which adjacent segments are intercepted.
  • Figure 2 is a schematic diagram showing an example of a tree directory of different adjacent segments or corresponding subsets of the same keyword.
  • FIG. 3 is a schematic diagram showing examples of different similar subsets of the same keyword and subsets of different adjacent segments of the lower level.
  • FIG. 4 is a block diagram showing the structure of an embodiment of a search system in accordance with the present invention.
  • FIG. 5 is a schematic diagram showing the generation of key statements in accordance with one embodiment of the present invention.
  • FIG. 6 is a schematic diagram showing another key sentence generation manner according to an embodiment of the present invention.
  • FIG. 7 is a flow chart showing an exemplary operation of a user at an interactive interface according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram showing a tree of adjacent segments reflecting the sequential relationship between different levels of adjacent segments of a keyword according to an embodiment of the present invention.
  • FIG. 9 is a flow chart showing the operation of a search engine according to an embodiment of the present invention.
  • FIG. 10 is a partial screen shot showing a cursor click (selected operation) and the displayed results during a search process according to an embodiment of the present invention.
  • Figure 11 shows an exemplary flow chart of one embodiment of the method of the present invention.
  • a computer-implemented method for processing a plurality of electronic texts containing the same keyword provided by the present invention includes, by way of example only:
  • The texts may be electronic files, documents or web pages, or their abstracts, indexes, titles or topics, or may be the various information contents of databases, works, dictionaries, manuals and patent documents.
  • The adjacent segment generally refers to a directly adjacent segment, and may also be an indirectly adjacent segment when necessary. A directly adjacent segment has no text between it and the keyword in the original text, whereas an indirectly adjacent segment is separated from the keyword in the original text by a small amount of text (one or two words); a large interval would obviously impair the effectiveness of the method.
  • The adjacent segment may precede or follow the keyword. It is generally a segment consisting of one or more words, or even word roots, in the text content, and includes certain characters when needed, such as abbreviations or punctuation.
  • The number of words or characters included in the adjacent segment, or the manner of cutting it off or its specific content, may be predetermined by the computer system, or agreed to, left at a default or selected by the queryer, or may be determined by the position and manner in which the cursor points at a selection bar on the interactive interface or at the text summary, text or related content of a specific index.
  • Figure 1 and Figure 5 show the number of words in the adjacent segment or the way in which the adjacent segments are intercepted.
  • The keyword is "search engine". 101 denotes intercepting the 2 content words before the keyword; 102 denotes intercepting the first 2 words after the keyword; 103 denotes intercepting the 2 words after the keyword; 104 denotes intercepting the content words after the keyword and before the first comma or period; 105 denotes intercepting the words after the keyword and before the first comma or period, but not fewer than 2 words.
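  • A few of the interception manners mentioned above can be sketched as follows. The function names are hypothetical, the keyword is located by a simple case-insensitive substring search (which, as a simplification, does not check word boundaries), and no distinction between content words and function words is attempted.

```python
import re

def split_around(text, keyword):
    """Return (text before, text after) the first occurrence of keyword, or None."""
    index = text.lower().find(keyword.lower())
    if index < 0:
        return None
    return text[:index], text[index + len(keyword):]

def words_before(text, keyword, n=2):
    """Intercept the n words directly before the keyword (n must be at least 1)."""
    parts = split_around(text, keyword)
    return None if parts is None else parts[0].split()[-n:]

def words_after(text, keyword, n=2):
    """Intercept the n words directly after the keyword (punctuation is not stripped)."""
    parts = split_around(text, keyword)
    return None if parts is None else parts[1].split()[:n]

def words_until_punctuation(text, keyword, min_words=0):
    """Intercept the words after the keyword up to the first comma or period.

    If that yields fewer than min_words words, fall back to the first min_words
    words after the keyword, treating commas and periods as spaces.
    """
    parts = split_around(text, keyword)
    if parts is None:
        return None
    after = parts[1]
    clause = re.split(r"[,.]", after, maxsplit=1)[0].split()
    if len(clause) < min_words:
        clause = re.sub(r"[,.]", " ", after).split()[:min_words]
    return clause

sample = "A modern search engine ranks pages quickly, then returns results."
print(words_before(sample, "search engine"))             # ['A', 'modern']
print(words_after(sample, "search engine"))              # ['ranks', 'pages']
print(words_until_punctuation(sample, "search engine"))  # ['ranks', 'pages', 'quickly']
```

  • The `min_words` fallback corresponds loosely to a cut-off at punctuation that must still yield at least a given number of words.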
  • When judging the length of a segment, the prefixes or suffixes of certain words, and the presence of or differences in certain function words, auxiliary words, numerals, quantifiers, non-content words, punctuation or spaces, may be omitted or ignored (see Embodiment A below); even the presence of or differences in adjectives or adverbs may be ignored.
  • The above-mentioned adjacent segment may consist of one word (such as the preceding word) or of several words. In the latter case, it may be necessary to compare adjacent segments of different parts of the keyword to determine whether the adjacent segments of different texts are identical.
  • "Identical" here means that the two segments are exactly the same. In some cases, however, when judging whether segments are the same or different, the prefixes or suffixes of some words, and the presence of or differences in certain function words, auxiliary words, numerals, quantifiers, non-content words, punctuation or spaces, may be omitted or ignored; even the presence of or differences in adjectives or adverbs may be disregarded.
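  • The relaxed notion of "identical" segments can be illustrated by normalizing segments before comparing them. The word list `FUNCTION_WORDS` is a tiny hypothetical sample, the names `normalize` and `same_segment` are assumptions, and prefix or suffix stripping (stemming) is deliberately left out of this sketch.

```python
import string

# Illustrative function-word list only; a real system would need a fuller one.
FUNCTION_WORDS = {"a", "an", "the", "of", "and", "or"}

def normalize(segment, ignore_function_words=True):
    """Lower-case a segment, strip surrounding punctuation, optionally drop function words."""
    words = []
    for raw in segment.lower().split():
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the token was punctuation only
        if ignore_function_words and word in FUNCTION_WORDS:
            continue
        words.append(word)
    return tuple(words)

def same_segment(a, b, ignore_function_words=True):
    """Judge two adjacent segments identical after normalization."""
    return normalize(a, ignore_function_words) == normalize(b, ignore_function_words)

print(same_segment("the index,", "Index"))                               # True
print(same_segment("the index", "index", ignore_function_words=False))   # False
print(same_segment("index", "company"))                                  # False
```

  • Whether function words are ignored is exposed as a flag because the description treats this as an option rather than a fixed rule.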
  • the adjacent segment of the keyword in each text content is the same as or different from other texts.
  • the queryer can directly refer to a certain adjacent word of the keyword.
  • the interest of the segment obtained or skips all texts of the category containing the adjacent segment by type.
  • the corresponding identical or different processing described herein may include: giving the respective texts the same or different distribution locations or storage methods; giving them the same or different subset marks; giving their indexes the same or different marks or index items; giving them the same or different arrangement; giving them the same or different display modes or positions in the interactive interface; or allowing one or more adjacent segments or texts of at least some of the subsets to be combined, sorted across subsets, or displayed in the interactive interface. (See Embodiment A and Figure 10 below.)
  • The processing method of the present invention may also merge or split successive adjacent segments to reduce or increase the number of subset levels. For example, for the above texts with the keyword "search engine", if the first adjacent segment is instead intercepted as "1 word before the keyword + 4 words after it", the number of first-level subsets obtained should equal the total number of second-level subsets produced by the previous method; the results are similar, but the number of subset levels is reduced.
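The counting claim in this passage can be checked with a small sketch. It is an illustration under simplifying assumptions: only the words after the keyword are modelled (the preceding word is left out), and `leaf_subsets` and its arguments are hypothetical names.

```python
def leaf_subsets(following_words, level_lengths):
    """Count the distinct deepest-level subsets obtained when the words after the
    keyword are cut into successive adjacent segments of the given lengths."""
    paths = set()
    for words in following_words:
        path, start = [], 0
        for length in level_lengths:
            path.append(tuple(words[start:start + length]))
            start += length
        paths.add(tuple(path))
    return len(paths)
```

Cutting the same words as one 4-word segment, `(4,)`, or as a 1-word segment followed by a 3-word segment, `(1, 3)`, gives the same number of deepest-level subsets; the merged form simply has one level fewer.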
  • Figures 2 and 8 are two examples of such directories.
  • Figure 2 shows a single-level or multi-level directory, or tree directory, of the adjacent segments of the keyword "search engine" in a plurality of texts, reflecting the parallel or sequential relationships among those segments. (The example shown in Figure 8 is explained later.)
  • the keyword is "search engine", represented by the symbol "K"; the first adjacent segment is intercepted as "1 word before the keyword + 1 word after it", the second adjacent segment is the 3 words that follow, and the third adjacent segment is the 2 words after that.
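A tree directory of this kind can be sketched as a nested dictionary with a text count at each node. This is a hedged illustration only: the token `"K"` stands for the keyword as in the figure, every text is assumed to contain it exactly once, and `segment_path` and `build_tree` are hypothetical helper names.

```python
def segment_path(tokens, later_lengths=(3, 2)):
    """Return the adjacent segments of one text, level by level.
    Level 1 is (word before K, word after K); later levels take the following words."""
    i = tokens.index("K")  # raises ValueError if the keyword is missing
    path = [(tuple(tokens[max(i - 1, 0):i]), tuple(tokens[i + 1:i + 2]))]
    start = i + 2
    for length in later_lengths:
        path.append(tuple(tokens[start:start + length]))
        start += length
    return path

def build_tree(token_lists):
    """Build a multi-level directory; each node records how many texts fall under it."""
    root = {"count": 0, "children": {}}
    for tokens in token_lists:
        node = root
        node["count"] += 1
        for segment in segment_path(tokens):
            node = node["children"].setdefault(segment, {"count": 0, "children": {}})
            node["count"] += 1
    return root
```

Each level of `children` corresponds to one level of subsets, and `count` plays the role of the number of texts shown next to a directory entry.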
  • the content may be a statement, example sentence, summary instance, bibliographic entry, or representative text containing the adjacent segment.
  • the keywords and adjacent segments in the statement, example sentence, summary text, or representative text may have fonts, colors, or features different from the other content; the segments, statements, example sentences, summary instances, or representative texts can be juxtaposed across subsets.
  • each subset or subordinate subset can be represented by its adjacent segment, or by a statement, example sentence, or summary instance containing that segment, so that on a limited interactive interface the representative content of more subsets can form a directory or sequence, and the queryer can select a subset of interest and the texts it contains by clicking on its representative content. For example, clicking the "Google Company" segment after "Different K Technology 672" in the directory shown in Figure 2 yields the directory or related content of the subset of all texts containing the statement "Different K Technology by Yahoo! Google Company". The same result is obtained by clicking an adjacent segment in the corresponding content of the sequence or directory.
  • the present technique allows a queryer to indicate text, graphics, or symbols in a directory, sequence, or other content on an interactive interface, for example by clicking with the cursor, in order to select, expand, or link to related content.
  • the method can be arranged so that, for an adjacent or indirectly adjacent segment of the keyword in the directory, tree directory, or sequence, if there is only one adjacent segment at the next level (or at each of the next several levels), the segment may be distributed, stored, or displayed in its original location together with its next-level segment or segments.
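The arrangement above amounts to collapsing single-child chains in the tree directory. A minimal sketch follows, assuming the tree is a plain nested dictionary that maps each segment string to its next-level segments; the representation and the name `collapse` are assumptions for illustration.

```python
def collapse(tree):
    """Merge a segment with its next-level segment whenever that segment is the only one."""
    merged = {}
    for segment, children in tree.items():
        children = collapse(children)  # collapse deeper levels first
        if len(children) == 1:
            child_segment, grandchildren = next(iter(children.items()))
            segment = segment + " " + child_segment
            children = grandchildren
        merged[segment] = children
    return merged
```

Because deeper levels are collapsed first, a chain such as "index" > "is" > "built" ends up as the single entry "index is built", while a segment with two or more next-level segments keeps its own level.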
  • the text portion content may refer to a summary, index, bibliographic entry, example sentence, phrase, or the like containing the same keyword. In other words, the multi-word (two or more words) keyword adjacent segments contained in the texts or text portions of the representative sequence are different, or substantially different, from one another. Multiple words generally reflect the core content surrounding the keyword better.
  • the method can condense millions of online items containing a keyword into a representative information sequence, containing the same keyword, that is smaller by 2 to 3 orders of magnitude, while the core content of each item (the few words adjacent to the keyword) is neither repeated nor omitted.
  • This is also a very effective way to refine the core content of web pages, and a significant improvement over existing technology, which can only eliminate mirrored web pages.
  • the similarity requirements include at least a requirement on the number or proportion of identical words, phrases, or characters contained in different adjacent segments.
  • Figure 3 shows another example, in which similar subsets are formed by similarity comparison over the sequence of different adjacent segments (each taken as the three words following the keyword).
  • the similarity requirement for different adjacent segments of the keyword is preset by the system as "must have the same 3 words, in any order".
  • the keyword of each text in this example is "search engine", represented by the symbol "K"; the name of each similar subset, composed of different adjacent segments of the keyword, is given by the words those segments share, each represented by an uppercase letter.
  • the different adjacent segments of the first similar subset shown in Figure 3 all contain the three words X, Y, and Z.
  • adjacent segments within the same similar subset may differ in the order of their shared words and thus constitute different first adjacent segments, which form the next-level subsets of that similar subset.
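The "same 3 words, in any order" requirement can be sketched by naming each similar subset after the sorted words of its segments. This is an illustrative assumption about how the comparison might be coded; `similar_subsets` is a hypothetical name.

```python
from collections import defaultdict

def similar_subsets(segments):
    """Group adjacent segments that contain the same words regardless of order.
    The key (the sorted shared words) serves as the name of the similar subset;
    the distinct orderings inside a group are its next-level subsets."""
    groups = defaultdict(list)
    for segment in segments:
        name = tuple(sorted(segment))
        if segment not in groups[name]:
            groups[name].append(segment)
    return dict(groups)
```

For the segments (X, Y, Z), (Z, X, Y), and (Y, Z, X), the result is one similar subset named (X, Y, Z) with three next-level entries.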
  • a subset containing a certain adjacent segment or indirectly adjacent segment may be classified as a lower-level subset of the similar subset in which that adjacent segment is located, or a text, or part of its content, containing a certain adjacent segment may be classified into the similar subset in which that adjacent segment is located.
  • the similar subsets can be regarded as compiled on the basis of the original directory or sequence of adjacent segments, so the number of similar subsets, or the length of their directory, is significantly smaller than that of the original directory or sequence. The names of the similar subsets in the directory make it easy to see the main components of the related adjacent segments (several parallel, independent words). If interested, the queryer can open a similar subset to obtain its subordinate subsets, the different keyword adjacent segments, and the related information.
  • the technique can adopt a more efficient method: specifying that the number of words in the keyword's adjacent segment is a value from 2 to 10, for example six, so that processing yields a sequence or directory of different adjacent segments; if needed (for example, when there are too many entries, more than a few hundred), this directory can be further reduced by similarity comparison to a directory or sequence of different similar subsets (for example, reduced to a few dozen). This is very convenient for the queryer.
  • the method may also treat each directory or sequence, or its chart, as an independent text, each having a title or heading that contains the corresponding keyword, or that contains the corresponding keyword together with the corresponding shared adjacent segment, adjacent segment group, or similar subset name.
  • during data collection, index generation, or keyword query, a search engine system can then relatively easily place each independent text among the search results of a query, according to the keywords or key statements (i.e., keywords together with their adjacent segments) in its title or heading.
  • there may be unique prompt characters, or combinations thereof, specified by the provider of the directory chart or text; these may be accompanied by prompts indicating the time or order in which the chart or text was formed or published, or by prompts indicating the number of characters in the directory shown in the chart, or by a combination thereof.
  • the keyword search result sorting means of the relevant search engine or retrieval system is able to recognize these characteristic prompt characters, or combinations thereof, in the title or heading of the independent text of a directory or sequence.
  • the processing method may further distribute the keyword index data (such as a keyword inverted table) of some or all of the provided directory texts, or of their titles or headings, to a dedicated index database or dedicated index area of the search engine or retrieval system.
  • the keyword search can then be performed directly in the dedicated database or dedicated area, which significantly improves search speed and keeps information unrelated to the directory or sequence out of the search results.
  • the contents of the various directories or sequences obtained by the method of the present invention may sometimes be ordered randomly, or ordered using a well-known sorting technique; or, as needed, the specific sort position of each parallel subset, parallel adjacent segment, indirectly adjacent segment, parallel text, parallel statement, example sentence, summary instance, or item of representative sequence information may depend, in part or in whole, on one or more of the following factors:
  • the specific sort position can be determined by an objective function value, which depends on one or more variables, and some or all of the variables of the objective function can respectively represent one or more of the factors listed above.
  • an objective function value can be expressed as $F(x_1, x_2, \ldots, x_n)$, which may take the form $F(x_1, x_2, \ldots, x_n) = F(x_1) + F(x_2) + \cdots + F(x_n)$, where $x_1, x_2, \ldots, x_n$ represent the factors (variables) mentioned earlier in the description, or other factors, that determine the specific sort position. Since the prior art offers many specific methods for this (for example, US Pat. No. 6,285,999), they are not described in detail here.
  • if desired, the processing method of the present invention may also add or remove additional keywords that must or must not be present, or add or remove restrictions on time, region, language, or other types, ranges, or requirements, to obtain more refined or broader results.
  • the present invention allows the contents of subsets obtained by comparing adjacent segments under loose requirements (e.g., ignoring differences in function words) to be compared again under stricter requirements (e.g., not ignoring differences in function words), so as to divide next-level subsets or obtain a more detailed directory of adjacent segments or the corresponding information; the operation may also be reversed.
  • Add or subtract a keyword such as "China"
  • change the time such as within six months or within two years
  • region Hebei or Baoding or North China
  • language such as English or Spanish
  • other types such as items
  • Still another aspect of the present invention is a computer data system including a storage device, wherein the data structure of some or all of the data indexes contained in the storage device, or in its data portion, includes at least:
  • one or more adjacent segments, consisting of a predetermined number of successive segments adjacent to the keyword in the corresponding text content or text summary, in order: adjacent segment 1, adjacent segment 2, ..., adjacent segment N; and the ID segment of the corresponding text, or of its associated information (where the ID segment refers to the address segment);
  • a summary segment or title segment of the corresponding text containing the keyword may also be included.
  • keyword indexes are established to help search or retrieval systems perform keyword retrieval.
  • the same text often has multiple indexes of different keywords.
  • the index data structure of a text for the keyword "Yangtze River” is as follows:
  • the search engine searches for "Yangtze River” or searches for the long search term "Yangtze River Basin” or the longer “Yangtze River Basin Hydraulics”
  • this facilitates the specific implementation of the invention. That is, the computer data system may search for or retrieve the corresponding indexes or content according to the combination of the keyword segment with one or more adjacent segments, or the number of combined segments, specified in the search request.
  • when each adjacent segment in the index is one word long, the computer can easily obtain the indexes whose keyword segment and adjacent segment content meet the query requirements.
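The index structure and the lookups for "Yangtze River", "Yangtze River Basin", and "Yangtze River Basin Hydraulics" can be sketched as below. The field names, the `search` function, and the sample entries are hypothetical; the source only states that an index holds a keyword segment, ordered adjacent segments, and an ID (address) segment, optionally with title or summary segments.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexEntry:
    keyword: str      # keyword segment
    adjacent: tuple   # adjacent segment 1..N, one word each (assumed)
    text_id: str      # ID segment, i.e. the address of the text
    title: str = ""   # optional title segment

def search(entries, keyword, adjacent_prefix=()):
    """Return entries whose keyword matches and whose leading adjacent
    segments equal the adjacent words given in the query."""
    prefix = tuple(adjacent_prefix)
    return [
        entry for entry in entries
        if entry.keyword == keyword and entry.adjacent[:len(prefix)] == prefix
    ]
```

A longer search term simply supplies a longer `adjacent_prefix`, so each extra word narrows the result to a lower-level subset without any change to the index layout.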
  • the above address can be a database text address, or a web page address or other address.
  • the computer data system can also be a search engine system. (See Example B below)
  • the invention may comprise a computer readable medium comprising a data structure having the above keyword index.
  • the present invention may also be a computer data system including a storage device, in which some or all of the keyword indexes, text summaries, or text data contained in the storage device or its data portion are distributed as follows: indexes, text summaries, or texts that contain the same keyword and have the same or different adjacent segments are located in the same or different subsets of the set for that keyword.
  • an extended key statement is the keyword together with one or more levels of adjacent segments
  • index data whose key statements have the same or different adjacent segments are located in the same or different next-level or multi-level subset distribution areas within the same subset.
  • the individual indexes of the texts, or text portions, containing the same keyword, together with, if necessary, the indexes of the directory table of the keyword's various adjacent segments (or subset directory table), the multi-level adjacent segment tree table (or multi-level subset table), or the corresponding example sentence sequence table or summary instance sequence table, may be wholly or partly concentrated or continuously arranged in the centralized storage area corresponding to the keyword (see, for example, Embodiment A below).
  • each index described herein includes at least an address segment of the indexed storage object (such as a text, directory table, or sequence table).
  • in this way the relevant indexes can be accessed conveniently or consecutively, the address or number in the address segment (ID segment) of each index obtained, and the relevant directory, text, or other content accessed or extracted.
  • each index of each text or text portion having the same keyword, together with, as needed, the indexes of the directory tables or tree tables of the various next-level or multi-level adjacent segments, or of the corresponding example sentence sequence tables or summary instance sequence tables, may further be distributed, in whole or in part, concentrated or continuously arranged, in the centralized storage areas corresponding to the different adjacent segments of the keyword.
  • the computer data system may be a search engine system, which can more conveniently query, process, or provide to the user the data of the subset sharing the queried keyword and adjacent segments, and of its next-level or lower-level subsets.
  • the present invention may include a computer readable medium containing keyword index, text summary, or text data distributed in the manner described above.
  • the related computer processing device starts work (61), receives the keyword query submitted by the queryer (62), obtains a large number of texts containing the keyword, and determines the adjacent segments of the keyword according to a preset or the queryer's designation.
  • the obtained subsets can be subdivided (66), for example by dividing next-level subsets according to whether the next-level adjacent segments are the same, or by comparing adjacent segments under stricter requirements and dividing accordingly.
  • For the first-level subsets, representative sequences, sequences of non-identical adjacent segments, or corresponding directories can also be arranged (67), including labeling the number of texts in each subset and sorting appropriately (71), and displayed in the interface (70) for the queryer to select an operation, expand related subsets or content, or display related texts (72).
  • Embodiment A shown in Fig. 4 is an example of a computer data system capable of executing the computer electronic text processing method of the present invention.
  • It is an Internet search engine system capable of providing extended key statement search. It comprises: a search engine 8 arranged on a server 5 with a memory 6 and a processor, the search engine 8 being connected to a client 3 with an interactive interface 2 via the Internet communication network 4; the search engine 8 has a database 9, a querier 11, and a keyword expansion unit (or module) 10, and is connected to a data collector 12 and an index constructor 13;
  • the data collector 12 collects text from the Internet or other information sources and adds it to the text library of the database 9, and the index constructor 13 analyzes the text of the text library to obtain text indexes and supplies them to the keyword index library of the database 9;
  • Each index obtained by the index constructor 13 from its analysis of the text includes a keyword segment, six single adjacent segments, an ID segment of the corresponding text, a text title segment, and a text summary segment, so that the search engine can find the desired text index by the required keyword segment alone, or together with one or more required single adjacent segments, and obtain the text title segment, the text summary segment, or the ID segment of the corresponding text, from which the original text can easily be linked when needed.
  • the index of the keyword index library is distributed according to the similarities and differences of adjacent word segments at various levels, so as to facilitate retrieval or extraction.
  • the corresponding adjacent segment directory, the adjacent segment tree directory ( Figure 8), and the key statement directory are also pre-stored.
  • the client application browser (Microsoft Internet Explorer) on client 3 of embodiment A allows user 1 to retrieve HTML documents (including web forms) from server 5 over communication network 4.
  • the interactive interface (UI) 2 on client 3 allows user 1 to interact with the retrieved web form using a monitor, keyboard or mouse, submit search requests, make selections and receive search results.
  • An important issue in the search method of the present invention is the way in which the adjacent segments are selected (or the way in which the keywords are joined to the adjacent segments), i.e., the manner in which the key statements are generated.
  • The exemplary key statement generation method of Embodiment A, shown in Fig. 5, adds adjacent segments (in this case, single words) one at a time, following the keyword 21 in the text summary.
  • 22 is a level 1 key statement
  • 23 is a level 2 key statement
  • 24 is a level 3 key statement
  • 25 is a level 4 key statement.
  • Figure 6 shows the key statement generation method of another embodiment B.
  • the first level adjacent word segment is located in front of the keyword 21, and the second level adjacent word segment and other adjacent word segments are located behind the keyword.
  • 22 is a level 1 key statement
  • 23 is a level 2 key statement
  • 24 is a level 3 key statement
  • 25 is a level 4 key statement.
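The two generation methods can be contrasted in a short sketch. It assumes, for illustration only, that the text is already split into word tokens, that the keyword is a single token at a known position, and that every adjacent segment is one word long; the function names are hypothetical.

```python
def key_statements_a(tokens, kw_index, levels=4):
    """Fig. 5 style: the level n key statement is the keyword plus the next n words."""
    return [" ".join(tokens[kw_index:kw_index + 1 + n]) for n in range(1, levels + 1)]

def key_statements_b(tokens, kw_index, levels=4):
    """Fig. 6 style: level 1 adds the word before the keyword;
    each later level adds one more word after the keyword."""
    start = max(kw_index - 1, 0)
    return [" ".join(tokens[start:kw_index + n]) for n in range(1, levels + 1)]
```

For the tokens `the Bulin company makes steel pipes` with the keyword at index 1, the first style yields "Bulin company" at level 1, while the second yields "the Bulin" at level 1 and "the Bulin company" at level 2.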
  • the length of the adjacent segments of the key statements at all levels can also be pre-specified or arranged by the queryer at the time of the search or by default.
  • when calculating the number of words in an adjacent segment and comparing the similarity of adjacent segments, the system of Embodiment A can choose not to count function words, quantifiers, punctuation, spaces, and the like, merging them into the adjacent content words.
  • Specific rules may be set for Western languages in this embodiment.
  • if desired, even adjectives or adverbs may be left uncounted when calculating the number of words in an adjacent segment and comparing the similarity of adjacent segments.
  • the query request of user 1 is recognized by the querier 11, which queries the database 9 according to the submitted keyword and provides the resulting list of related data to the interactive interface.
  • the keyword expansion component 10 is supplemented by the querier 11.
  • the key statements corresponding to the keyword, the corresponding example sentences, and the adjacent segment tree directory are temporarily stored or processed to meet subsequent search or display needs; if this content has not been arranged in the database 9 or the keyword expansion unit 10, the keyword expansion unit 10 builds it from the keyword query data of the querier 11.
  • an index or file sequence containing the keyword may be used. Take one (for example, the first) index or file, read the keyword and its adjacent segment (the adjacent words or phrases, of the predetermined length), and store them as the first key statement; then take the second index or file and check whether the words or phrases adjacent to its keyword are the same as the first: if different, store them in order; if the same, discard them; then compare the third index or file with the first two, and so on.
  • In this way each subset is formed. Alternatively, the querier 11 retrieves the index or file sequence using each key statement as the criterion, obtaining the corresponding indexes of each subset.
  • If, within the index or file sequence of each subset, the various level 2 adjacent segments are searched for as described above, the various level 2 key statements and the corresponding lower-level subsets are obtained, and so on. If one (for example, the first) or several indexes or summaries in each obtained subset are selected as example sentences, the desired directory and example sentence sequence are obtained, completing the grouping operation.
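The grouping operation can be sketched as a single pass that keeps key statements in first-seen order and uses the first member of each subset as its example sentence. The record layout `(keyword, adjacent_words, summary)` and the function names are assumptions made for this illustration.

```python
def key_statement(record, level):
    """Keyword plus its first `level` adjacent words, as a tuple."""
    keyword, adjacent, _summary = record
    return (keyword,) + tuple(adjacent[:level])

def group_records(records, level=1):
    """Return (subsets, directory). Subsets keep the order in which key statements
    were first seen; the directory lists (key statement, text count, example sentence)."""
    subsets = {}
    for record in records:
        subsets.setdefault(key_statement(record, level), []).append(record)
    directory = [(ks, len(members), members[0][2]) for ks, members in subsets.items()]
    return subsets, directory
```

Calling `group_records` again on one subset with `level=2` gives its level 2 key statements and lower-level subsets, which mirrors the "and so on" step of the description.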
  • the directory and the example sentence sequence can be arranged according to the size of an objective function value.
  • the objective function value of an entry is that of the text with the largest objective function value in the entry's subset, and equals the sum of that text's link value and its recent click rate.
  • the example sentence can be extracted from the text with the largest objective function value in the subset.
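The ranking rule above can be sketched as follows. The dictionary keys `link_value`, `recent_clicks`, and `summary`, and the function names, are hypothetical; how link value and click rate are actually measured is not specified in the source.

```python
def text_score(text):
    """Objective function of one text: link value plus recent click rate."""
    return text["link_value"] + text["recent_clicks"]

def rank_subsets(subsets):
    """subsets maps a key statement to its texts. Each entry is scored by its
    best text, which also supplies the example sentence; entries are sorted
    from the highest score to the lowest."""
    ranked = []
    for key_statement, texts in subsets.items():
        best = max(texts, key=text_score)
        ranked.append((key_statement, text_score(best), best["summary"]))
    ranked.sort(key=lambda entry: entry[1], reverse=True)
    return ranked
```

Scoring a subset by its best text rather than by its size means a small subset containing one highly linked, frequently clicked text can still appear near the top of the directory.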
  • the ordering of a sequence of information such as a directory, example sentences, bibliographic entries, or summaries may be determined based on the size of an objective function value $F(x_1, x_2, \ldots, x_n)$.
  • the objective function value can be equal to the corresponding bid.
  • the keyword index of the embodiment A can adopt a system distributed according to each subset, and does not occupy more storage space than other existing keyword indexing libraries, which is one of its outstanding advantages.
  • the keyword index library need not adopt a subset distribution: since the index data structure includes a keyword item and several adjacent segment items, the querier 11 can directly search for and display the indexes belonging to the corresponding subset using a key statement that combines the keyword segment with one or more adjacent segments.
  • in Embodiment B, only the tree directory of the adjacent segments or key statements needs to be arranged, and the original conventional keyword index database need not be changed.
  • the indexing system of the present invention can also employ a conventional keyword inverted table indexing system or method.
  • an adjacent segment tree directory similar to that shown in FIG. 8 can be used to reflect the relationships among the keyword's adjacent segments at different levels; displaying it on the screen helps the user understand the overall state of the subsets at each level.
  • This figure omits the corresponding number of texts for each subset.
  • the keyword is "Bulin"; adjacent segment 1, adjacent segment 2, adjacent segment 3, and adjacent segment 4 each consist of a single content word, and each also represents the common adjacent segment contained in the subsets at its level.
  • Embodiment A may perform a selection operation, that is, the system determines the corresponding key statement from the queryer's cursor indication on a text, summary, directory, or selection bar on a page of the interactive interface, and culls or moves the entries, indexes, text summaries, or texts corresponding to that key statement.
  • Fig. 10 is a partial screen diagram showing a cursor click and a display result (i.e., a selection operation for grouping) in the search process according to an embodiment of the present invention.
  • the search box 51 is for inputting keywords (in this case, "Bulin")
  • 52 is the two options for clicking operations: 'click to expand' or 'click to remove', in this case, 'click to expand' is selected.
  • the object for clicking here is the summary 55 shown in the summary bar 53 on the screen.
  • the search method of Embodiment A further includes an ignore operation: while the queryer browses the sequence of indexes, text summaries, or texts containing the original keyword, the system records or analyzes the queryer's browsing operations on the pages of the interactive interface 2 (such as page changes) and the "item-by-item clicks" or "ignore clicks" made on items or content on the page, and removes from the subsequent content the key statements that were ignored or went unnoticed within a certain reading time or space, together with their related indexes and summaries.
  • the querier 11 can query the database as required and provide the resulting list of related data to the interactive interface 2; if keyword expansion is requested, the keyword expansion unit 10 generates the appropriate key statements and provides the desired data, either by extracting it or by having the querier 11 search for it.
  • The system starts working at module 41 and checks for a keyword search request (42); if there is none, it returns (48). It then checks, at module 43, whether there is a keyword expansion operation request. If not, it executes 44 and displays an ordinary search result sequence; if so, it executes 45 and asks for user 1's requirements via a prompt box on the screen of the interactive interface 2, then performs the corresponding operation, provides the corresponding information at module 46, and continues to ask for user 1's selections and needs. After a few iterations, the corresponding search information is provided at module 47, and module 48 returns or the process ends (49), as user 1 intends.
  • the operation flow of user 1 with respect to the search engine 8 in the interactive interface can be represented by a flowchart: after the interactive interface 2 starts working (31), a keyword is selected (32); normal browsing can then be performed (34), or an extended search can be chosen (33). If (33) is chosen, that is, the extended key statement search technique is used, the appropriate operation mode must be selected by cursor click, for example the length of the keyword's first adjacent segment (the number of words it contains). If it is short, there are few corresponding key statements (subsets), but the content of each subset is mixed; if it is long, there are more key statement types (subsets), and the core content of each subset is more uniform or concentrated.
  • the single index sequence or summary group sequence mentioned above will be a "refined sequence" whose core content is essentially non-repeating, with few omissions; the total number of documents may be reduced by several orders of magnitude.
  • the number of single indexes in the first level will be more.
  • the system allows click operations to reduce the number of adjacent words or adjacent segments in a key statement, which can greatly reduce the number of single indexes, key statements, summaries, or example sentences at the first level or at each level.
  • the system automatically performs the grouping operation according to the original settings, for example with each adjacent segment one or two words long, and presents the result (35).
  • the user 1 can select 37 to directly open the linked text in the result, or, per 36, select an appropriate extended key statement in the result presented on the screen (see FIG. 10) and obtain the further search results displayed by module 38 (the next-level subset directory and other content).
  • the above-mentioned key sentence search technology will be combined with the existing keyword search technology, and when respecting the index sorting inside the subset, or when selecting each example sentence in the grouping operation, pay attention to respect or maintain the relevant file.
  • the technique of the present invention includes the principle of sorting technology search based on the above basic method and basic structure.
  • the embodiment B and the embodiment C are substantially the same as the embodiment A except for the points indicated.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and system for computer processing of a plurality of electronic texts containing the same keyword, including: obtaining a plurality of electronic texts containing the same keyword; specifying the number of words in an adjacent segment of the keyword, or the way the adjacent segment is truncated; and, according to whether the adjacent segment of the keyword in each of some or all of the texts is the same as or different from that in the other texts, classifying the text and the other texts into the same or different subsets or catalogues, or applying the same or different processing to them. From a very large set of keyword retrieval results, the method can form a multi-level system of subsets, catalogues, or example sequences whose core contents do not repeat, helping the user narrow the search range quickly and accurately and obtain the desired query result completely and accurately.

Description

Convenient method and system for electronic text processing and retrieval
(1) Technical field
The present invention relates to techniques by which computers and search engines process and retrieve electronic text.
(2) Background art
Over the past decades, computer database retrieval technology has developed greatly. In particular, advances in network technologies such as the World Wide Web have brought the scale of shareable databases to astronomical figures. To help users find the information or documents they need, classification or directory retrieval systems appeared. This technique works reasonably well in mature classification fields that people know well, but in the broader field of massive information such systems are hard to build and hard to master and use.
Search engine technology centered on keyword search has brought convenience to users. A search system built around a search engine generally resides on one or more servers or other computer devices and consists of a text (page) library, a text index library, an index constructor that derives the text index by analyzing the texts in the text library, and a querier that accepts queries and generates search results; it often also includes a data collection server that gathers texts from the Internet or other information sources and adds them to the text library. Through an interactive interface on a client and a communication network or line, the system receives the inquirer's keyword query request, searches the text index library or text library, analyzes the relevance between the keyword request and the texts, obtains and sorts the relevant results, and delivers them to the interactive interface over the communication network or line. Such a search system is convenient and fast to use, but the total number of indexes in the returned results is still enormous and hard to review one by one.
Techniques have also been developed that determine relevance by comparing the keyword with the anchor text describing links to the relevant text, but they still do not fully satisfy searchers. To place the results potentially most valuable to the inquirer as near the top as possible, U.S. Patent No. 6,285,999 proposed ranking search results by analyzing the hyperlink structure of web pages (Page link, i.e. PageRank). It outperformed other ranking techniques, was adopted by Google, and achieved unprecedented success.
However, this technique and the various other ranking techniques improve the efficiency of keyword search only in a statistical sense; they cannot guarantee that the results each person wants appear near the top of a huge index list. For example, searching the Google Chinese site for the term "布林" (Bulin) returns nearly 300,000 index entries. There is still no guarantee that the desired content can be found near the top without omission, in a way that is both rigorous and convenient. Meanwhile, before reaching the desired information, the user has to read through all kinds of irrelevant entries whose main content repeats again and again.
To solve this problem, for the past decade people have tried to develop various new search engine techniques, for example the "list prioritized by importance" technique of U.S. Patent No. 6421675, the technique of "forming a dynamic object table from the history of the user's query data" of U.S. Patent No. 6256633, the "sharing query information with other inquirers" technique of Chinese Patent CN1151457, and the "measuring the similarity of electronic texts" technique of U.S. Patent No. 6990628. These techniques have some merits, but their effect is very limited.
The technique of U.S. Patent No. 7089236 performs semantic analysis on the keyword submitted by the inquirer and presents the different possible meanings on the interactive interface to help the inquirer narrow the search range. The similar technique of Chinese Patent Application No. 200510081867.5 spreads out a search engine's keyword search results by using web page category information. The problem with both is that a very complex, very large, and yet inevitably inaccurate classification database must first be built. It is very difficult for a machine to judge which meaning or category of a keyword a given page or text belongs to, so reliability is low. The different meanings or categories of a keyword are likely to overlap and even more likely to leave gaps. If more classification levels are added, the overlap causes storage use to balloon. Moreover, an inquirer using keyword search in an unfamiliar field finds the many meanings or categories hard to grasp accurately. All of this seriously limits any gain in query efficiency.
There is therefore an urgent need for a keyword search engine technique that is both rigorous and efficient and can effectively help the inquirer narrow the review range, even repeatedly. The boundaries between ranges must be clear and easy to judge, with neither overlap nor gaps, so that the inquirer reaches the desired result much faster while the rigor of the search is preserved. This has been a worldwide problem unsolved for many years.
(3) Summary of the invention
The object of the present invention is to provide a technique for electronic text processing and retrieval or search by a computer or search engine that, when a user's keyword search returns a massive number of results, can narrow the search range quickly and rigorously, repeatedly if needed, or eliminate irrelevant or duplicate information, so that the desired result is obtained accurately with little omission.
One aspect of the present invention provides a computer-implemented method of processing a plurality of electronic texts containing the same keyword, comprising:
obtaining a plurality of electronic texts containing the same keyword; specifying the number of words contained in an adjacent segment or the way the adjacent segment is truncated; and, according to whether the adjacent segment of said keyword in the content of each of some or all of the texts is the same as or different from that in the other texts, dividing the text and the other texts into the same or different subsets or categories, or applying correspondingly the same or different processing.
The corresponding same or different processing may include: the texts having the same or different distribution positions or storage modes; or receiving the same or different subset marks; or their indexes being given the same or different marks or index entries; or having the same or different arrangements; or having the same or different display modes or positions in the interactive interface; or allowing one or more adjacent segments or texts from each of at least some subsets to be combined or sorted across subsets or shown in the interactive interface.
The text may be an electronic file or web page, or its abstract, index, bibliographic record, or title;
it may also be any information content of a database, book, dictionary, manual, or patent document.
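The core grouping step of the method above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: text is whitespace-tokenized (Chinese would need a word segmenter), only the segment following the first occurrence of the keyword is used, and the function names and sample texts are hypothetical.

```python
from collections import defaultdict

def following_segment(text, keyword, n_words=1):
    """Return the n_words tokens after the first occurrence of keyword, or None."""
    tokens = text.lower().split()
    if keyword not in tokens:
        return None
    i = tokens.index(keyword)
    return tuple(tokens[i + 1 : i + 1 + n_words])

def group_by_adjacent_segment(texts, keyword, n_words=1):
    """Map each distinct adjacent segment to the ids of the texts that contain it."""
    subsets = defaultdict(list)
    for doc_id, text in texts.items():
        segment = following_segment(text, keyword.lower(), n_words)
        if segment is not None:
            subsets[segment].append(doc_id)
    return dict(subsets)

# Invented sample texts, for illustration only.
docs = {
    "d1": "Brin founded a search company",
    "d2": "Brin founded Google with Page",
    "d3": "Brin studied at Stanford",
}
subsets = group_by_adjacent_segment(docs, "Brin", n_words=1)
```

Because every text has exactly one segment of the chosen length at a given occurrence, each text falls into exactly one subset, which is the "no overlap, no gap" property the description relies on.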
The adjacent segment, or indirectly adjacent segment, may precede or follow the keyword. It is generally a segment of one or more words, characters, or even word roots in the text content, and where needed also includes certain symbols such as abbreviation letters or punctuation. Where necessary, when judging whether two segments are the same or different, differences in the prefixes or suffixes of certain words, in certain function words or non-content words, or in punctuation or spaces may be ignored. Where needed, the presence, absence, or difference of certain particles, numerals, measure words, adjectives, or adverbs may also be ignored, and even that of articles or conjunctions.
When the search keyword consists of several separable words, the adjacent segment may refer to the adjacent segment of one of those words (for example the first) or of several of them.
The number of words or characters contained in the adjacent segment, or the way the segment is cut off, or its specific content, may be predetermined, or agreed to, accepted by default, or selected by the inquirer.
Where necessary, when judging the length of a segment, just as when judging whether two segments are the same or different, the presence or difference of the prefixes or suffixes of certain words, of certain function words, particles, numerals, measure words, or non-content words, of punctuation or spaces, or even of adjectives or adverbs may be ignored.
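The tolerant comparison described above (ignoring punctuation, case, and certain function words when deciding whether two segments are the same) can be sketched as below. The stop list `IGNORED_WORDS` is an assumption made for illustration; the patent does not fix which words are ignored, and prefix or suffix stripping is left out of this sketch.

```python
import string

# Hypothetical list of words to ignore; the patent leaves this choice open.
IGNORED_WORDS = {"a", "an", "the", "of", "and", "or"}

def normalize_segment(words):
    """Lowercase, strip surrounding punctuation, and drop ignored words."""
    cleaned = []
    for word in words:
        word = word.lower().strip(string.punctuation)
        if word and word not in IGNORED_WORDS:
            cleaned.append(word)
    return tuple(cleaned)

def same_segment(a, b):
    """Two segments count as the same if their normalized forms match."""
    return normalize_segment(a) == normalize_segment(b)

# The effective length of a segment can be judged on the normalized form too.
effective_length = len(normalize_segment(["the", "Search", "engine,"]))
```

For example, `same_segment(["the", "Search", "engine,"], ["search", "engine"])` treats the two segments as identical, while segments that differ in a content word remain distinct.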
The benefit of the method for retrieval is clear. When the inquirer is interested in a particular adjacent segment of the keyword, all texts in the category containing that segment are easy to obtain; conversely, those texts are easy to skip.
The key point of the invention is that the content adjacent to a keyword is what most likely determines the keyword's specific meaning, reference, or limiting scope or direction in the text, which should be what interests the searcher most. At the same time, if applied appropriately, the method can entirely avoid the "overlap and gaps between the contents of different categories or subsets" that other classification-based retrieval methods can hardly avoid, and which in a multi-level system of classified subsets ultimately makes the system hard to use. This is why the search effectiveness of the method or system of the invention is expected to improve markedly.
The processing method may further include: for different texts that belong to one or more of the same first-level or higher-level subsets, or whose content contains the same keyword and adjacent segment, according to whether the further adjacent segments of that same keyword and adjacent segment are the same or different, dividing some or all of said texts into the same or different subsets one or more levels below that subset, or applying correspondingly the same or different processing.
In effect this subdivides the original subset of a given adjacent segment into several next-level subsets. The processing method allows successive adjacent segments to be merged or split in order to reduce or increase the number of subset levels.
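The multi-level subdivision can be pictured as a tree keyed by successive words after the keyword. The sketch below is one possible representation, assuming one word per level, whitespace tokenization, and the first occurrence of the keyword only; the node layout (`docs`, `children`) and sample texts are hypothetical.

```python
def new_node():
    # Each node records the texts in this subset and its next-level subsets.
    return {"docs": [], "children": {}}

def build_subset_tree(texts, keyword, depth=2):
    """Build a tree whose level k is keyed by the k-th word after the keyword."""
    root = new_node()
    kw = keyword.lower()
    for doc_id, text in texts.items():
        tokens = text.lower().split()
        if kw not in tokens:
            continue
        i = tokens.index(kw)
        node = root
        node["docs"].append(doc_id)
        for word in tokens[i + 1 : i + 1 + depth]:
            node = node["children"].setdefault(word, new_node())
            node["docs"].append(doc_id)
    return root

# Invented sample texts, for illustration only.
tree_docs = {
    "d1": "Brin founded a search company",
    "d2": "Brin founded Google with Page",
    "d3": "Brin studied at Stanford",
}
tree = build_subset_tree(tree_docs, "Brin", depth=2)
```

Raising or lowering `depth` corresponds to splitting or merging successive adjacent segments, which adds or removes subset levels.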
The processing method may further include:
compiling a one-level or multi-level directory, tree directory, or sequence that reflects the parallel or sequential relationships among the different adjacent segments or indirectly adjacent segments of the same keyword in said texts, or among the statements, example sentences, or abstract instances containing those segments.
The processing method may further include: if a keyword's adjacent segment or indirectly adjacent segment in said directory, tree directory, or sequence has only one kind of adjacent segment at the next level or next several levels, that segment may be distributed, stored, or displayed at its original position together with its next-level or next-several-levels adjacent segments.
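The rule that a segment with only one kind of continuation is shown together with that continuation amounts to collapsing single-child chains in the tree directory. A minimal sketch, assuming the directory is a nested dict mapping each segment to its sub-directory (an empty dict for a leaf); `collapse_chains` and the sample directory are hypothetical.

```python
def collapse_chains(tree):
    """Merge every segment that has exactly one continuation with that continuation."""
    collapsed = {}
    for segment, subtree in tree.items():
        # Absorb descendants for as long as there is exactly one next-level segment.
        while len(subtree) == 1:
            (next_segment, next_subtree), = subtree.items()
            segment = f"{segment} {next_segment}"
            subtree = next_subtree
        collapsed[segment] = collapse_chains(subtree)
    return collapsed

# Invented tree directory, for illustration only.
directory = {
    "founded": {"google": {"with": {}}, "a": {"search": {}}},
    "studied": {"at": {"stanford": {}}},
}
compact = collapse_chains(directory)
```

Here "studied" has a single chain below it and is displayed as one entry, while "founded" keeps its two distinct continuations.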
The processing method may further include:
compiling or obtaining a plurality of charts of one-level or multi-level classified tree directories or example-sentence sequences of texts that contain a given keyword together with a given adjacent segment or group of adjacent segments, wherein the texts covered by each chart contain the same keyword and the same adjacent segment or group of adjacent segments, and the classification is made according to whether the next-level adjacent segment of that keyword's adjacent segment or group of adjacent segments is the same in the respective texts, and then level by level according to whether the adjacent segments at each subsequent level are the same.
The processing method may further include: in the above texts, directories, statements, example sentences, or abstract instances, or near the keywords, adjacent segments, or indirectly adjacent segments they contain, a hint may be given of the number of corresponding parallel subsets or lower-level subsets, or of the number of parallel subsets of the subset in which the relevant word or segment lies, or of the number of lower-level subsets or texts it contains.
The processing method of the present invention may further include:
compiling a sequence of a plurality of texts, or portions of texts, containing the same keyword, in which the multi-word adjacent segments of said keyword are mutually different, or substantially so. In other words, only one or a few texts or text portions containing the same adjacent segment appear as representatives.
In this way the original mass of information can be replaced by a sequence of representative entries for the same keyword whose number of items is reduced by roughly half its order of magnitude, which makes reading much easier.
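A representative sequence of this kind can be sketched as a de-duplication keyed on the adjacent segment: the first text seen for each distinct segment is kept and later ones are skipped. The sketch assumes whitespace tokenization, the first occurrence of the keyword, and one representative per segment; names and sample texts are hypothetical.

```python
def representative_sequence(texts, keyword, n_words=2):
    """Keep one (doc_id, key statement) pair per distinct adjacent segment."""
    kw = keyword.lower()
    seen = set()
    representatives = []
    for doc_id, text in texts.items():
        tokens = text.lower().split()
        if kw not in tokens:
            continue
        i = tokens.index(kw)
        segment = tuple(tokens[i + 1 : i + 1 + n_words])
        if segment not in seen:
            seen.add(segment)
            representatives.append((doc_id, " ".join((kw,) + segment)))
    return representatives

# Invented sample texts, for illustration only.
rep_docs = {
    "d1": "Brin founded Google in 1998",
    "d2": "Brin founded Google with Page",
    "d3": "Brin studied at Stanford",
}
reps = representative_sequence(rep_docs, "Brin", n_words=2)
```

Which text represents a segment is decided here by input order; in practice the ranking factors discussed elsewhere in the description could pick the representative instead.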
The processing method of the present invention may further include:
comparing the different adjacent segments of the same keyword in said plurality of texts for similarity, dividing them into similarity subsets, or compiling a sequence or directory of mutually dissimilar adjacent segments.
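The patent does not specify how similarity is measured. As one possible stand-in, the sketch below uses the character-level ratio from Python's standard `difflib.SequenceMatcher` with a greedy, first-fit grouping; the threshold of 0.8 and the sample segments are assumptions chosen only for illustration.

```python
from difflib import SequenceMatcher

def cluster_similar_segments(segments, threshold=0.8):
    """Greedily place each segment in the first cluster whose first member is similar enough."""
    clusters = []
    for segment in segments:
        for cluster in clusters:
            if SequenceMatcher(None, segment, cluster[0]).ratio() >= threshold:
                cluster.append(segment)
                break
        else:
            # No sufficiently similar cluster exists, so start a new one.
            clusters.append([segment])
    return clusters

# Invented adjacent segments, one containing a spelling variant.
clusters = cluster_similar_segments(
    ["founded google", "founded googel", "studied at stanford"]
)

# The first member of each cluster gives a sequence of mutually dissimilar segments.
dissimilar = [cluster[0] for cluster in clusters]
```

Any other similarity measure (token overlap, edit distance, and so on) could be substituted without changing the grouping logic.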
The method may further include: treating each of said directories or sequences, or their charts, as an independent text, and giving each independent text a title or bibliographic record containing the corresponding keyword, or one containing the corresponding keyword together with the corresponding adjacent segment, group of adjacent segments, or similarity-subset name;
where needed, placing in or near the title or bibliographic record of the independent text of said directory, sequence, or chart a hint character, or combination of characters, indicating its relevant features.
The processing method may further include: enabling the keyword search result ranking device of the relevant search engine or retrieval system to recognize the hint characters, or combinations of them, in or near the titles or bibliographic records of the independent texts of said directories or sequences, so that, where needed, independent texts or portions of content whose titles or bibliographic records carry such characters can be ranked at the front of the search results.
The processing method may further include: distributing the keyword index data (for example an inverted keyword index) of some or all of the provided directory texts, or of their titles or bibliographic records, in a dedicated index database of the search engine or retrieval system, or in a dedicated area of its index database.
The technique allows the inquirer to point at text, graphics, or symbols in a directory, sequence, or other content on the interactive interface, for example by clicking with the cursor, in order to confirm, expand, or follow a link to the related content.
The processing method of the invention may also include: in said processing method or directory, the specific rank position of any one of the parallel subsets, parallel adjacent segments, parallel texts, parallel text portions, or representative sequence entries may depend partly or wholly on one or more of the following factors of the relevant subset, text, segment, content, entry, or containing text: Page-link (PageRank) value, click rate, keyword occurrence rate, number of lower-level subsets or subordinate texts, subset click rate, mean or maximum Page-link value of the texts, rank in the search results of existing websites or systems, bid, spelling, stroke count, source rating, time of inclusion, and so on; or it may be determined by the value of a corresponding objective function.
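Ranking parallel subsets by an objective function over such factors can be sketched as a weighted sum. The patent does not give a formula, so the linear form, the factor names, and all numbers below are assumptions made purely for illustration.

```python
def rank_subsets(subsets, weights):
    """Order subset names by a weighted sum of their factor values, highest first."""
    def score(item):
        _, factors = item
        # Factors without a weight contribute nothing to the score.
        return sum(weights.get(name, 0.0) * value for name, value in factors.items())
    return [name for name, _ in sorted(subsets.items(), key=score, reverse=True)]

# Invented factor values, already scaled to the range 0..1.
subset_factors = {
    "founded google": {"link_value": 0.9, "click_rate": 0.2, "doc_count": 0.5},
    "studied at": {"link_value": 0.3, "click_rate": 0.8, "doc_count": 0.1},
}

by_mixed = rank_subsets(
    subset_factors, {"link_value": 0.5, "click_rate": 0.3, "doc_count": 0.2}
)
by_clicks = rank_subsets(subset_factors, {"click_rate": 1.0})
```

Changing the weights changes which parallel subset is listed first, without affecting which texts belong to which subset.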
The processing method may further include: on top of an existing processing method or result, allowing additional keywords that must or must not be present to be added or removed, or restrictions on time, region, language, or other types, ranges, or requirements to be added or removed, to obtain more refined or broader results.
Another aspect of the invention is a computer data system including a storage device, characterized in that the data of some or all of the keyword indexes, text abstracts, or texts contained in the storage device, or in its data portion, are distributed as follows:
the data of indexes, text abstracts, or texts whose abstract or text contains the same keyword, and in which the keyword's adjacent segment is the same or different, are located in the distribution area of the same or a different subset of the same keyword set.
Yet another aspect of the invention is another computer data system including a storage device, characterized in that the data structure of some or all of the keyword indexes contained in the storage device, or in its data portion, comprises at least:
a keyword segment;
one or more adjacent segments, formed by mapping, in their original order, a predetermined number of successive levels of adjacent segments of the keyword in the corresponding text content or text abstract, namely: adjacent segment 1, adjacent segment 2, ..., adjacent segment N;
an ID segment of the corresponding text, or of its related information,
where the ID segment means an address segment;
where necessary, an abstract segment or title segment of the corresponding text containing said keyword.
The system may allow the computer data system to search, or search in a varied manner, for the corresponding index or content according to a combination of said keyword segment and one or more of the adjacent segments as specified by the search, or according to an increase or decrease in the number of segments combined.
Clearly, N may be specified to be as small as 1, in which case the index has only one adjacent segment.
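The index record described above (keyword segment, adjacent segments 1 to N, ID segment, optional abstract) can be sketched as a small data class with a prefix search. Field names, the example addresses, and the linear scan are assumptions for illustration; a real system would use an indexed store rather than a list.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class IndexEntry:
    keyword: str                          # keyword segment
    adjacent_segments: Tuple[str, ...]    # adjacent segment 1 .. N, in text order
    text_id: str                          # ID (address) segment of the text
    summary: Optional[str] = None         # optional abstract or title segment

def search(entries, keyword, segments=()):
    """Return entries matching the keyword and the given leading adjacent segments."""
    k = len(segments)
    return [
        entry for entry in entries
        if entry.keyword == keyword and entry.adjacent_segments[:k] == tuple(segments)
    ]

# Invented entries with made-up addresses, for illustration only.
entries = [
    IndexEntry("brin", ("founded", "google"), "http://example.com/1"),
    IndexEntry("brin", ("founded", "a"), "http://example.com/2"),
    IndexEntry("brin", ("studied", "at"), "http://example.com/3"),
]
founded = search(entries, "brin", ("founded",))
```

Passing more or fewer segments to `search` narrows or widens the result, which mirrors the "increase or decrease in the number of segments combined" in the description.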
The invention may include a computer-readable medium holding the data structure of the keyword index described above, or a computer-readable medium storing instructions that implement and use that data structure.
The invention may be a computer-readable medium storing instructions executable by one or more processing devices, the instructions implementing a method of processing a plurality of electronic texts containing the same keyword, and may include:
instructions for obtaining or searching for a plurality of electronic texts containing the same keyword;
instructions for specifying the number of words contained in an adjacent segment or the way the adjacent segment is truncated;
instructions for dividing a text and the other texts into the same or different subsets or categories, or applying correspondingly the same or different processing, according to whether the adjacent segment of said keyword in the content of each of some or all of the texts is the same as or different from that in the other texts.
The corresponding same or different processing may include: the texts having the same or different distribution positions or storage modes; or receiving the same or different subset marks; or their indexes being given the same or different marks or index entries; or having the same or different arrangements; or having the same or different display modes or positions in the interactive interface; or allowing one or more adjacent segments or texts from each of at least some subsets to be combined or sorted across subsets or shown in the interactive interface.
Yet another aspect of the invention provides a search method by which a search engine delivers the results the inquirer wants. The search engine system responds to a keyword query submitted by the inquirer through an interactive interface by searching the system's information sources or databases and providing the texts, text abstracts, indexes, or related information that match the keyword; the method is characterized in that it comprises:
the system receiving the inquirer's keyword query through the interactive interface;
after confirmation, querying a database containing keyword indexes according to the keyword query;
taking the keyword as it appears in the content or abstract of a text containing it, together with its adjacent segment, as a key statement;
the number of words or characters in the adjacent segment, or the way the segment is cut off, being predetermined by the system, or agreed to, accepted by default, or selected by the inquirer,
or, where needed, determined from symbols, characters, words, fonts, colors, or spaces at or near the ends of the adjacent segment, or from the position and manner of a cursor indication made by the inquirer in a selection bar shown on the interactive interface or on a page containing the abstract, text, or related content of a specific index;
collating, from the adjacent segments or key statements, the mutually different adjacent segments or mutually different key statements;
generating search results from the key statements obtained, that is: retrieving, processing, arranging, or collating the different indexes, text abstracts, or texts containing the same or different key statements, for the inquirer to choose from through the interactive interface.
Where needed, the above operations may be performed by the system in advance or at query time.
The search method may further include:
taking the key statement as it appears in the content or abstract of a text containing it, together with its adjacent segment, or the original key statement together with its adjacent segment, as an extended key statement;
the number of words or characters in the adjacent segment, or the way the segment is cut off, or its specific content, being predetermined by the system, or agreed to, accepted by default, or selected by the inquirer,
or determined from symbols, characters, words, fonts, colors, or spaces at or near the ends of the adjacent segment, or from the position and manner of a cursor indication made by the inquirer in a selection bar shown on the interactive interface or on a page containing the abstract, text, or related content of a specific index;
collating, from the adjacent segments or key statements, the mutually different adjacent segments or mutually different extended key statements;
generating search results from the extended key statements obtained (extended several times where needed), that is: retrieving, processing, arranging, collating, or separately storing the different indexes, text abstracts, texts, or bibliographic records containing the same or different extended key statements, for the inquirer to choose from through the interactive interface.
Where needed, the above operations may be performed by the system in advance or at query time.
Where needed, the cut-off position of the adjacent segments may be set, for example, by specifying the number of words in each adjacent segment, for example one word.
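The step from a key statement to its extended key statements can be sketched as a phrase search that appends the next word or words at each match. This is a minimal illustration assuming whitespace tokenization and a fixed number of words per adjacent segment (one by default, as in the example above); the function name and sample texts are hypothetical.

```python
from collections import defaultdict

def extend_key_statement(texts, statement, n_words=1):
    """Map each extended key statement to the ids of the texts that contain it."""
    phrase = statement.lower().split()
    results = defaultdict(list)
    for doc_id, text in texts.items():
        tokens = text.lower().split()
        for i in range(len(tokens) - len(phrase) + 1):
            if tokens[i : i + len(phrase)] == phrase:
                # Key statement plus its following adjacent segment.
                extended = " ".join(tokens[i : i + len(phrase) + n_words])
                if doc_id not in results[extended]:
                    results[extended].append(doc_id)
    return dict(results)

# Invented sample texts, for illustration only.
query_docs = {
    "d1": "Brin founded a search company",
    "d2": "Brin founded Google with Page",
    "d3": "Brin studied at Stanford",
}
level1 = extend_key_statement(query_docs, "Brin")          # keyword -> key statements
level2 = extend_key_statement(query_docs, "brin founded")  # key statement -> extended
```

Each call corresponds to one click by the inquirer: selecting "brin founded" from the first level produces the second-level choices, and the process can repeat until the subset is small enough to read.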
The processing method, data system, or search method described may be used in an Internet search engine system, or in a local or stand-alone computer information-base search system such as a digital library system or a digital search system for a document database. In this way, the huge preliminary result set of an ordinary keyword search is turned into a hierarchy of successively subdivided subsets from which the user can easily choose.
Where needed, the search method may further comprise a grouping operation:
that is, the various different key statements, adjacent segments, indexes, bibliographic entries, text abstracts, or texts that contain the same keyword, or the various different extended key statements, adjacent segments, indexes, bibliographic entries, text abstracts, or texts that contain the same original key statement, may each be grouped and arranged or displayed as a directory or sequence, in which only one or a few key statements, indexes, bibliographic entries, text abstracts, or texts are included for each kind of adjacent segment.
Grouping of this kind gives each entry a degree of uniqueness or representativeness, so that the user can make a choice after reading only a small amount of information on the interactive interface. When the key statement chosen reaches a certain length, the core content of the resulting grouped sequence of indexes, bibliographic entries, or abstracts is essentially free of both repetition and omission.
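The grouping operation can be pictured with a minimal Python sketch. It assumes that each hit has already been reduced to a pair of (adjacent segment, record); the function name, the pair layout, and the `per_group` parameter are illustrative and do not come from the specification.

```python
def group_representatives(entries, per_group=1):
    """Keep at most `per_group` records for each distinct adjacent segment.

    `entries` is an iterable of (adjacent_segment, record) pairs; this
    layout is an assumption made for the sketch.
    """
    groups = {}  # dicts preserve insertion order (Python 3.7+)
    for segment, record in entries:
        kept = groups.setdefault(segment, [])
        if len(kept) < per_group:
            kept.append(record)
    return groups


hits = [
    ("engine company", "doc-1"),
    ("engine technology", "doc-2"),
    ("engine company", "doc-3"),  # same adjacent segment as doc-1
]
catalog = group_representatives(hits)
```

Each distinct adjacent segment appears once in `catalog`, so the list shown to the querier grows with the number of distinct segments rather than with the number of hits.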
The search method may further comprise:
causing the data of some or all of the keyword indexes, text abstracts, or texts to be stored in different or identical subset regions, or in different or identical lower-level subset regions, according to whether the keywords, key statements, or extended key statements they contain are different or the same. The corresponding key statements, keyword indexes, bibliographic entries, text abstracts, or text data can then be extracted or supplied directly when a keyword is queried.
The search method may also comprise:
analyzing the texts or abstracts in the database, or the texts obtained from the Internet or other information sources by the data collection server attached to the search engine, and generating and storing for each text a corresponding index that contains at least a keyword field, an adjacent-segment field, and an ID field (address field) for the text or related content, and where necessary also a text abstract or title. At search time, the corresponding index, abstract, title, or text can be retrieved from storage and supplied according to the keyword field and adjacent-segment field it contains.
The text address may be a database address, an Internet address, a URL, a form representing a hash of the URL field, or some other form, and may provide a link for accessing or opening the text.
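An index record of the kind described (keyword field, adjacent-segment field, ID field) might look like the following Python sketch. The class and function names are hypothetical. The sketch also assumes that texts arrive already split into tokens, that the keyword is a single token, and that the adjacent segment is the fixed number of tokens following it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    keyword: str      # keyword field
    adjacent: tuple   # adjacent-segment field (tokens after the keyword)
    text_id: str      # ID / address field of the text


def build_index(docs, keyword, after=1):
    """Create one entry per occurrence of `keyword` in each tokenized text."""
    entries = []
    for text_id, tokens in docs.items():
        for i, token in enumerate(tokens):
            if token == keyword:
                segment = tuple(tokens[i + 1:i + 1 + after])
                entries.append(IndexEntry(keyword, segment, text_id))
    return entries


def lookup(entries, keyword, adjacent):
    """Return the IDs of texts whose entry matches both fields."""
    wanted = tuple(adjacent)
    return [e.text_id for e in entries
            if e.keyword == keyword and e.adjacent == wanted]


docs = {
    "u1": ["search", "engine", "company"],
    "u2": ["search", "engine", "technology"],
    "u3": ["old", "engine", "company", "news"],
}
index = build_index(docs, "engine", after=1)
matches = lookup(index, "engine", ["company"])
```

A production index would use a hash or sorted structure instead of the linear scan in `lookup`; the scan is kept here only for brevity.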
The search method may further comprise:
compiling a tree directory (Fig. 8) that reflects the sequential or parallel relationships among the adjacent segments, at different levels, of a keyword in the texts or text abstracts sharing that keyword, or a tree directory that reflects the sequential or parallel relationships among the key statements extended from that keyword at different levels, for use at query time.
The search method may also comprise a selection operation:
that is, the system determines the corresponding key statement from the querier's cursor indication on a page of the interactive interface, whether on the above text, text abstract, bibliographic entry, key statement, or adjacent-segment directory, or in a selection bar or box. It then performs the grouping operation or a directory display on the various different extended key statements, extended adjacent segments, indexes, text abstracts, texts, or bibliographic entries corresponding to that key statement; or displays the corresponding indexes, text abstracts, texts, or bibliographic entries in sorted order; or performs a removal operation based on the determined key statement, deleting or repositioning the entries, indexes, text abstracts, texts, or bibliographic entries on that page or on other pages that contain the key statement.
The search method may further comprise an ignore operation:
that is, while the querier browses a sequence of indexes, text abstracts, bibliographic entries, or texts containing the original keyword or the original key statement, the querier's current position in that sequence is inferred from the operations performed on or within the pages of the interactive interface. If it can be established that the indexes, text abstracts, bibliographic entries, or texts that contain a certain key statement and lie within a certain range before that position, or that key statement itself, have never been opened or followed as links, or have gone unopened a certain number of times in succession, and have not been clicked, attended to, or flagged for retention in any other way, then a removal operation is performed based on that key statement, deleting or repositioning the entries, indexes, text abstracts, bibliographic entries, or texts on that page or on other pages that contain the key statement.
In this way, information about files similar to those the querier has long ignored while reading can be moved back in, or removed from, the remainder of the sequence, reducing the burden of excessive useless information.
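One possible reading of the ignore operation is sketched below in Python. The specification does not fix a threshold or a data layout, so the `threshold` parameter, the use of list indices as the browsing position, and the choice to move ignored entries to the end instead of deleting them are all assumptions.

```python
from collections import Counter


def apply_ignore(results, position, opened, threshold=2):
    """Demote not-yet-seen entries whose key statement was repeatedly skipped.

    results:   list of (key_statement, record) pairs in display order
    position:  index of the entry the querier is currently viewing
    opened:    set of indices the querier has opened or clicked
    threshold: assumed number of skipped entries that triggers demotion
    """
    seen, rest = results[:position], results[position:]
    skipped = Counter()
    touched = set()
    for idx, (statement, _) in enumerate(seen):
        if idx in opened:
            touched.add(statement)
        else:
            skipped[statement] += 1
    ignored = {s for s, n in skipped.items()
               if n >= threshold and s not in touched}
    kept = [entry for entry in rest if entry[0] not in ignored]
    moved = [entry for entry in rest if entry[0] in ignored]
    return seen + kept + moved


results = [("A", "1"), ("B", "2"), ("A", "3"),
           ("B", "4"), ("A", "5"), ("C", "6")]
reordered = apply_ignore(results, position=3, opened={1})
```

Here the querier passed two "A" entries without opening either, so the remaining "A" entry moves behind the entries for "B" and "C".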
Another aspect of the present invention provides a search engine system that supplies the desired search results in response to a request made by a querier 1 via an interactive interface 2, comprising:
a server 5, coupled via a communication network 4 or a line to the client 3 on which the interactive interface 2 resides;
a search engine 8 located on the server 5, the search engine 8 comprising a database 9 that includes a keyword index, and a query unit 11 capable of querying the database 9 according to the keyword request made by the querier and supplying a list of the relevant data found to the interactive interface 2;
characterized in that:
the database 9 further includes text abstracts or texts containing the keywords, and the text abstracts may be contained in the keyword index;
the query unit 11 or the search engine 8 includes a keyword expansion component 10, which can perform an expansion operation on the queried keyword: each occurrence of the keyword in the text content or text abstracts containing it is taken together with its adjacent segment as a distinct extended key statement, and these statements, or the different adjacent segments of the keyword, are listed for the querier to select via the interactive interface; alternatively, the different indexes, text abstracts, or texts containing the same or different extended key statements are retrieved, processed, arranged, or organized for the querier to select via the interactive interface;
the keyword expansion component 10 can likewise perform one or more expansion operations on a key statement and carry out the corresponding query or processing.
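The expansion operation, including repeated expansion of a key statement, can be illustrated with a short Python sketch. The function name `expand`, the `extra` parameter, and the restriction to words that follow the statement are assumptions made for this illustration; texts are assumed to be pre-tokenized.

```python
def expand(texts, statement, extra=1):
    """Map each extended key statement to the IDs of the texts containing it.

    An extended statement is `statement` plus the next `extra` tokens.
    """
    statement = tuple(statement)
    n = len(statement)
    extended = {}
    for text_id, tokens in texts.items():
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == statement:
                key = tuple(tokens[i:i + n + extra])
                ids = extended.setdefault(key, [])
                if text_id not in ids:
                    ids.append(text_id)
    return extended


texts = {
    "a": "search engine company".split(),
    "b": "search engine technology".split(),
    "c": "search engine company news".split(),
}
first = expand(texts, ("search", "engine"))
second = expand(texts, ("search", "engine", "company"))  # expand again
```

The keys of `first` are what would be listed for the querier; choosing one and calling `expand` again yields the next, finer level.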
The "queried keyword" here may mean either "the keyword being queried" or "the keyword to be queried".
The search engine system described above may be a search system located on the Internet that serves online clients, or a stand-alone computer information-base search system. The server 5 is a computer storage and processing device; there may be a single server or several, grouped or distributed. The client 3 may be a personal computer, workstation, or other computer device, fitted with a suitable browser where needed.
The search engine system may further provide that the search engine includes an index construction component 13, which analyzes the texts in the database, or the texts obtained by the data collection server 12 attached to the search engine from the Internet 4 or other information sources, and generates and stores for each text a corresponding index containing at least a keyword field, an adjacent-segment field, and a text ID field.
Where needed, the number of words in each adjacent segment may simply be fixed here, for example at one word.
The search engine system may further include an adjacent-segment tree directory (Fig. 8) reflecting the sequential relationships among the adjacent segments, at different levels, of a keyword in the texts or text abstracts sharing that keyword, or a tree directory reflecting the sequential relationships among the key statements extended from that keyword at different levels.
The present invention may also include a computer-readable medium carrying an adjacent-segment tree directory that reflects the sequential relationships among the adjacent segments, at different levels, of a keyword in the texts, text abstracts, or bibliographic entries sharing that keyword, or a tree directory that reflects the sequential relationships among the key statements extended from that keyword at different levels.
In effect, this means that the computer-readable medium stores instructions, executable by one or more processing devices, for representing the tree directory. The tree directory may in fact also reflect the parallel relationships among adjacent segments or key statements at the same level.
At each adjacent segment in the adjacent-segment tree directory, or at each key statement in the tree directory of extended key statements, the number of subsets or of files beneath it may also be displayed.
The search engine system may further include a graphical user interface (Fig. 10) that allows the querier to add supplementary query information. The interface may contain a dialog box or selection box 51 to receive the querier's choices of operating method, mode, and the like; it may also contain clickable key statements, adjacent segments, sentences, paragraphs, operation commands, or selectable text, symbols, or graphics, through which the querier can add supplementary query information.
The search technique of the present invention, centered on keywords and their adjacent words, divides and progressively narrows the results of a single keyword search with dictionary-like rigor and with a convenience clearly beyond the prior art. It can also condense online information sharing a keyword, which often runs to millions of items, into a representative sequence or directory two to three orders of magnitude smaller, in which the core content of each item (the content formed by the few words adjacent to the keyword) is neither repeated nor omitted, better meeting the long-standing and pressing needs of information-search users.
(4) Brief Description of the Drawings
Fig. 1 is a schematic diagram of examples of specifying the number of words in an adjacent segment or the way an adjacent segment is cut.
Fig. 2 is a schematic diagram of an example tree directory of the different adjacent segments, or corresponding subsets, of the same keyword.
Fig. 3 is a schematic diagram of an example of different similar subsets of the same keyword and their lower-level subsets of different adjacent segments.
Fig. 4 is a structural block diagram of one embodiment of the search system according to the present invention.
Fig. 5 is a schematic diagram of key-statement generation in one embodiment of the present invention.
Fig. 6 is a schematic diagram of another way of generating key statements in an embodiment of the present invention.
Fig. 7 is a flow chart of exemplary user operations on the interactive interface in one embodiment of the present invention.
Fig. 8 is a schematic diagram of an adjacent-segment tree directory, shown in one embodiment of the present invention, reflecting the sequential relationships among the adjacent segments of a keyword at different levels.
Fig. 9 is a work flow chart of the search engine in one embodiment of the present invention.
Fig. 10 is a schematic partial screen view of a cursor click (selection operation) during a search, and of the resulting display, in one embodiment of the present invention.
Fig. 11 is an exemplary flow chart of one embodiment of the method of the present invention.
(5) Detailed Description of the Embodiments
The invention is described in further detail below with reference to the drawings, building on the preceding "Summary of the Invention".
The present invention provides a computer-implemented method for processing a plurality of electronic texts containing the same keyword, which by way of example comprises:
first obtaining, from a computer, database, or the Internet, a plurality of electronic texts containing the same keyword; the texts may be electronic files, documents, or web pages, or their abstracts, indexes, bibliographic entries, or titles, and may also be information content of any kind from databases, books, dictionaries, manuals, or patent documents;
specifying the number of words in the adjacent segment of the keyword in the text, or the way the adjacent segment is cut:
Specifically, an adjacent segment generally means a directly adjacent segment, but where necessary may be an indirectly adjacent segment. A directly adjacent segment is one with no intervening text between it and the keyword in the original text; an indirectly adjacent segment is one separated from the keyword in the original text by a small amount of text (one or two words). A large gap markedly reduces the effectiveness of the method.
The adjacent segment may precede or follow the keyword. It is generally a segment made up of one or more words, characters, or even word roots in the text content, and where needed also includes certain other characters, such as abbreviation letters and punctuation.
The number of words or characters in the adjacent segment, or its cut-off method or specific content, may be preset by the computer system; agreed to, accepted by default, or selected by the querier; or determined by the position and manner of the querier's cursor indication in a selection bar presented on the interactive interface or on a page containing the text abstract, text, or related content of a specific index.
Figs. 1, 5, and 6 give several examples of specifying the number of words in an adjacent segment or the way the segment is cut. In the example of Fig. 1 the keyword is "search engine". Here 101 denotes the method "take the 2 content words before the keyword"; 102 denotes "take the 2 content words before and the 2 content words after the keyword"; 103 denotes "take the 2 content words after the keyword"; 104 denotes "take the content words after the keyword up to the first comma or period"; and 105 denotes "take the words after the keyword up to the first comma or period that lies at least 2 words away".
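The five cut-off methods of Fig. 1 can be approximated in Python as follows. This is a sketch under simplifying assumptions: the text is already tokenized, the keyword is a single token, and every non-punctuation token is treated as a content word (the specification's distinction between content words and function words would need a part-of-speech step that is not shown). All function names are illustrative.

```python
PUNCTUATION = {",", "."}


def words_before(tokens, k, n):
    """The n tokens immediately before the keyword at index k."""
    return tokens[max(0, k - n):k]


def words_after(tokens, k, n):
    """The n tokens immediately after the keyword at index k."""
    return tokens[k + 1:k + 1 + n]


def words_until_stop(tokens, k, min_words=0):
    """Tokens after the keyword up to the first comma or period that is
    preceded by at least `min_words` words."""
    segment, words = [], 0
    for token in tokens[k + 1:]:
        if token in PUNCTUATION:
            if words >= min_words:
                break
        else:
            words += 1
        segment.append(token)
    return segment


sentence = "our new engine ranks pages , then caches results .".split()
k = sentence.index("engine")
mode_101 = words_before(sentence, k, 2)
mode_102 = words_before(sentence, k, 2) + words_after(sentence, k, 2)
mode_103 = words_after(sentence, k, 2)
mode_104 = words_until_stop(sentence, k)
mode_105 = words_until_stop(sentence, k, min_words=2)
```

Methods 104 and 105 differ only when a comma or period falls within the first two words after the keyword, in which case 105 reads past it to the next one.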
In certain cases, where necessary, the length of a segment may be judged while omitting or disregarding the presence, absence, or variation of certain prefixes or suffixes, of certain function words, particles, numerals, measure words, or other non-content words, or of punctuation or spaces (see Embodiment A below); even the presence, absence, or variation of adjectives or adverbs may be omitted or disregarded.
When the search keyword consists of several separable words, the adjacent segment may refer, for example, to the adjacent segment of one of those words (such as the first) or to the respective adjacent segments of several of them. In the latter case, the adjacent segments of the different parts of the keyword may need to be compared separately to decide whether the keyword adjacent segments of different texts are fully identical.
When the same keyword occurs several times in one text, the adjacent content of just one of the occurrences may be considered; alternatively, the text may be split appropriately and treated as several texts. The latter suits the retrieval of longer texts.
According to whether the adjacent segment of the keyword in each of some or all of the texts is the same as or different from that in the other texts, the text and the other texts are assigned to the same or different subsets or categories, or are given correspondingly identical or different treatment (see Embodiment A below).
Generally, "the same" means that the two segments are exactly identical. In certain cases, however, where necessary, two segments may be judged the same or different while omitting or disregarding the presence, absence, or variation of certain prefixes or suffixes, of certain function words, particles, numerals, measure words, or other non-content words, or of punctuation or spaces; even the presence, absence, or variation of adjectives or adverbs may be omitted or disregarded.
For example, under a loose standard one may, where needed, treat "科学的力量是十分强大的" ("the power of science is very great") and "科学力量十分强大" (the same sentence with the particles 的 and 是 dropped) as two identical adjacent segments.
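The loose comparison in this example can be sketched in Python by stripping a small stop-list before comparing. The stop-list below (the characters 的 and 是) is an assumption chosen only to reproduce the example; filtering single characters is crude, since either character can also occur inside an ordinary word, and a real system would filter after word segmentation.

```python
# Assumed stop-list for this sketch only: two common function characters.
FUNCTION_CHARS = {"的", "是"}


def normalize(segment):
    """Drop assumed function characters and whitespace from a segment."""
    return "".join(ch for ch in segment
                   if ch not in FUNCTION_CHARS and not ch.isspace())


def loosely_equal(a, b):
    """True when two segments match after normalization."""
    return normalize(a) == normalize(b)


same = loosely_equal("科学的力量是十分强大的", "科学力量十分强大")
```

Strict comparison (`a == b`) would place these two segments in different subsets, while `loosely_equal` places them in the same one.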
Once each text has been assigned to the same or a different category according to whether the adjacent segment of the keyword in its content matches that of the other texts, the querier can, depending on interest in a particular adjacent segment of the keyword, obtain or skip by category all the texts of the category containing that segment.
The correspondingly identical or different treatment may include: the texts having the same or different distribution locations or storage methods; receiving the same or different subset marks; having indexes with the same or different marks or index entries; having the same or different arrangement; having the same or different display methods or positions on the interactive interface; or allowing at least some subsets each to contribute one or more adjacent segments or texts to be combined, sorted, or displayed on the interactive interface across subsets (see Embodiment A below).
For different texts belonging to one or more of the same first-level or higher-level subsets, that is, different texts whose content contains the same keyword and adjacent segment, some or all of the texts may be assigned to the same or different next-level or multi-level subsets of that subset, or given correspondingly identical or different treatment, according to whether the further adjacent segments following that keyword and adjacent segment are the same or different.
The correspondingly identical or different treatment here may likewise include: the texts having the same or different distribution locations or storage methods; receiving the same or different subset marks; having indexes with the same or different marks or index entries; having the same or different arrangement; having the same or different display methods or positions on the interactive interface; or allowing at least some subsets each to contribute one or more adjacent segments or texts to be combined, sorted, or displayed on the interactive interface across subsets (see Embodiment A below and Fig. 10).
In effect, the original adjacent segment of the keyword is enlarged, and the enlarged portions are compared for identity, so that what was one subset sharing the same adjacent segment is further divided into several next-level subsets; if needed, this can continue until the querier obtains a satisfactory result. This is a further advantage of the method.
For example, suppose several texts with the keyword "search engine" are divided into subsets by adjacent segment, with the first adjacent segment cut as "1 word before + 1 word after the keyword". This yields several subsets, among them the subset whose adjacent segment is "professional K company" (where K stands for the keyword "search engine"), containing 185 texts. If these 185 texts are divided according to whether the second adjacent segment, formed by the 3 words following "K company", is the same, 13 second-level subsets are obtained, with second adjacent segments such as "through professional technology" and "striving to open up the market". If the texts in the second-level subset containing "through professional technology" are divided further by a third adjacent segment (the 2 content words that follow), several third-level subsets are obtained (see Figs. 2 and 8).
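The level-by-level subdivision in this example can be sketched as a nested dictionary in Python. To stay short, the sketch cuts every adjacent segment from the words that follow the keyword (it omits the "1 word before" part of the first segment), treats the keyword as the single token `K`, and uses made-up texts; the names `build_tree` and `widths` are illustrative.

```python
def build_tree(texts, keyword, widths):
    """Divide texts into nested subsets by successive adjacent segments.

    widths: number of tokens in the 1st, 2nd, ... adjacent segment.
    Returns {segment: {"ids": [...], "children": {...}}}.
    """
    root = {}
    for text_id, tokens in texts.items():
        if keyword not in tokens:
            continue
        pos = tokens.index(keyword) + 1  # first occurrence only
        level = root
        for width in widths:
            segment = " ".join(tokens[pos:pos + width])
            entry = level.setdefault(segment, {"ids": [], "children": {}})
            entry["ids"].append(text_id)
            level = entry["children"]
            pos += width
    return root


texts = {
    "t1": "K company through technology".split(),
    "t2": "K company seeks markets".split(),
    "t3": "K technology is new".split(),
}
tree = build_tree(texts, "K", widths=[1, 2])
```

Texts t1 and t2 share the first-level segment "company" and separate only at the second level, which mirrors how the 185-text subset above divides into second-level subsets.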
The processing method of the invention may also, for example, allow successive adjacent segments to be merged or split so as to reduce or increase the number of subset levels. For the texts with the keyword "search engine" above, if the first adjacent segment is cut from the outset as "1 word before + 4 words after the keyword", the number of first-level subsets obtained should equal the sum of the numbers of second-level subsets under the earlier division; the result is similar, but there are fewer subset levels.
In fact, for the same large body of texts, a longer adjacent segment yields more subsets with fewer texts in each; conversely, a shorter adjacent segment yields fewer subsets with more texts in each.
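The trade-off between segment length and subset count is easy to check on toy data. The Python sketch below uses invented texts and a hypothetical `partition` helper; it assumes pre-tokenized texts, a single-token keyword `K`, and segments cut only from the words after the keyword.

```python
from collections import defaultdict


def partition(texts, keyword, width):
    """Group text IDs by the `width` tokens that follow the keyword."""
    subsets = defaultdict(list)
    for text_id, tokens in texts.items():
        if keyword in tokens:
            pos = tokens.index(keyword) + 1
            subsets[tuple(tokens[pos:pos + width])].append(text_id)
    return dict(subsets)


texts = {
    "a": "K company through technology".split(),
    "b": "K company seeks markets".split(),
    "c": "K company through patents".split(),
    "d": "K lab builds tools".split(),
}
# For each segment length, record (number of subsets, size of largest subset).
summary = {
    width: (len(p), max(len(ids) for ids in p.values()))
    for width in (1, 2, 3)
    for p in [partition(texts, "K", width)]
}
```

As the segment grows from one word to three, the number of subsets rises while the largest subset shrinks, which is the relationship described above.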
To make the many subsets of the above method, or their lower-level subsets, easier to consult, one may compile a one-level or multi-level directory, tree directory, or sequence that reflects the parallel or sequential relationships among the different adjacent segments or indirectly adjacent segments of the same keyword in the texts, or among the sentences, example sentences, or abstract instances containing those segments. The directory may include, for each of one or more different subsets of the texts, the shared adjacent segment or indirectly adjacent segment, or a sentence, example sentence, or abstract instance containing it; or, for each of several subsets at the next level or next few levels below such subsets, the shared adjacent segment or indirectly adjacent segment, or a sentence, example sentence, or abstract instance containing it. These are arranged, distributed, stored, or displayed according to their parallel or hierarchical relationships. The segments, sentences, example sentences, or abstract instances may be placed side by side across subsets.
The directories shown in Figs. 2 and 8 are two examples. Fig. 2 shows a one-level or multi-level directory or tree directory of the parallel or sequential relationships among the different adjacent segments, and the adjacent segments of those segments, in the texts with the keyword "search engine" above. (The example of Fig. 8 is described later.)
In the tree directory example of Fig. 2, the keyword is "search engine", represented by the symbol "K"; the first adjacent segment is cut as "1 word before + 1 word after the keyword", the second adjacent segment is the 3 words that follow, and the third adjacent segment is the 2 content words after that.
If, when reading a directory of the different adjacent segments of the texts, it is hard to grasp their core content, one will want to see more of the content surrounding each adjacent segment. It may therefore be necessary to compile a derived sequence of the one-level or multi-level directory or tree directory that reflects the parallel or sequential relationships among the different adjacent segments of the same keyword, or the adjacent segments of those segments, in which any adjacent segment of the original directory or tree directory may be supplemented with, or replaced by, more content containing that adjacent segment.
Such content may be, for example, a sentence, example sentence, abstract instance, bibliographic entry, or representative text containing the adjacent segment. The keyword and adjacent segment within it may be set in a font, color, or style distinct from the rest of the content, and the segments, sentences, example sentences, abstract instances, or representative texts may be placed side by side across subsets.
Of course, one may also compile or obtain a number of charts, each a one-level or multi-level classification tree directory or example-sentence sequence for a plurality of texts that contain a given keyword together with one of its adjacent segments or groups of adjacent segments. The texts covered by each chart contain the same keyword and the same adjacent segment or group of adjacent segments, and are classified according to whether the next-level adjacent segment following that segment or group is the same in each text, and then level by level according to whether the adjacent segments at each subsequent level are the same.
In other words, where needed, a classification tree directory or example-sentence sequence can be obtained for any subset of the texts related to a given keyword, together with its lower-level subsets.
实际上, 我们可以允许每个子集或下级子集由相应的邻接词段或者包含该词段的语句 或例句或摘要实例来代表该子集, 这样在有限的交互界面上, 就能排列更多子集的代表性 内容,形成目录或序列,査询者可以通过点击代表性内容来选择有兴趣的子集及所含文本。 例如,我们点击图 2所示目录中"不同 K技术 672 "后面的"分别由雅虎公司"后面的 "谷 歌公司"词段, 就可以得到含有 "不同 K 技术分别由雅虎公司谷歌公司"词段的子集的 所有文本目录或相关内容。 如果我们点击相关序列或目录的相关内容中的邻接词段, 也可 以得到相同的结果。  In fact, we can allow each subset or subordinate subset to represent the subset by the corresponding contiguous segment or the statement or example sentence or summary instance containing the segment, so that on a limited interaction interface, more The representative content of the subset forms a directory or sequence, and the queryer can select the subset of interest and the included text by clicking on the representative content. For example, if we click on the "Google Company" section after "Different K Technology 672" in the directory shown in Figure 2, we can get the phrase "Different K Technology by Yahoo! Google Company". A subset of all text directories or related content. The same result can be obtained if we click on an adjacent segment in the relevant content of the relevant sequence or directory.
实际上, 本技术允许査询者在交互界面上对目录或序列或其他内容中的文字或图形或 符号进行指示, 例如点击光标, 确定或展开或链接相关内容。  In fact, the present technique allows a queryer to indicate text or graphics or symbols in a directory or sequence or other content on an interactive interface, such as clicking a cursor, determining or expanding or linking related content.
The method can be made more convenient still. For example, for a keyword adjacent segment or indirectly adjacent segment in the directory, tree directory, or sequence, if there is only one kind of adjacent segment at its next level or next several levels, that segment may be distributed, stored, or displayed at its original position together with those next-level adjacent segments.
To make querying easier, the technique can also be arranged so that the above texts, directories, sentences, example sentences, or summary instances, or the area near the keywords, adjacent segments, or indirectly adjacent segments they contain, carry a hint of the number of corresponding parallel subsets or lower-level subsets, or of the number of parallel subsets, lower-level subsets, or texts of the subset in which the relevant word or segment lies (see Fig. 2).
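The tree directory with count hints and merged single-branch chains described above can be sketched as follows. This is only an illustrative sketch, not the patent's implementation: the function names `build_tree` and `render` are hypothetical, each text is assumed to have been reduced already to a list of adjacent segments of the same length, and the example segments are invented.

```python
def build_tree(segment_lists):
    """Build a nested dict: each node maps a segment to [text_count, children]."""
    root = {}
    for segments in segment_lists:
        node = root
        for seg in segments:
            entry = node.setdefault(seg, [0, {}])
            entry[0] += 1          # number of texts passing through this node
            node = entry[1]
    return root


def render(node, depth=0):
    """Return directory lines with count hints.

    A segment whose next level has only one kind of segment is shown
    together with that segment on the same line.
    """
    lines = []
    for seg, (count, children) in sorted(node.items()):
        label = seg
        while len(children) == 1:
            child_seg, (_, grandchildren) = next(iter(children.items()))
            label += " " + child_seg
            children = grandchildren
        lines.append("  " * depth + f"{label} ({count})")
        lines.extend(render(children, depth + 1))
    return lines


# Hypothetical adjacent segments following a keyword K in three texts.
segment_lists = [
    ["different", "technologies", "by"],
    ["different", "technologies", "from"],
    ["same", "index", "library"],
]
for line in render(build_tree(segment_lists)):
    print(line)
```

The number in parentheses plays the role of the subset-size hint shown in Fig. 2; clicking an entry in a real interface would open the corresponding subtree.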
The processing method of the invention can be used further, for example to compile a sequence of multiple texts or text portions that contain the same keyword and whose multi-word adjacent segments of that keyword are mutually different or substantially mutually different. In effect, of the texts or text portions that share the same adjacent segment, only one or a few appear in the sequence as representatives.
The text portion may be a summary, index, bibliographic entry, example sentence, phrase, or the like containing the same keyword. Put another way, across the texts or text portions in the representative sequence, the keyword adjacent segments of multiple words (two or more) are mutually different or substantially so. Multiple words generally reflect the meaning of the core content near the keyword better.
In this way the method can condense the millions of online items that often share a keyword into a representative sequence two to three orders of magnitude shorter, in which the core content of each item (the content formed by the few words adjacent to the keyword) is neither repeated nor omitted. This is also a very effective way to distill the core content of web pages, and a marked improvement over existing techniques, which can only eliminate mirrored pages.
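The representative sequence described above amounts to keeping one item per distinct adjacent segment. A minimal sketch follows, assuming the adjacent segment of each text has already been extracted; the function name `representative_sequence` and the sample items are hypothetical.

```python
def representative_sequence(items):
    """Keep the first text seen for each distinct adjacent segment.

    items: iterable of (adjacent_segment, text) pairs, where
    adjacent_segment is a tuple of the words next to the keyword.
    """
    representatives = {}
    for segment, text in items:
        # setdefault leaves an existing representative untouched
        representatives.setdefault(segment, text)
    return list(representatives.items())


items = [
    (("index", "library"), "text 1"),
    (("ranking", "device"), "text 2"),
    (("index", "library"), "text 3"),   # same core content as text 1
]
print(representative_sequence(items))
```

Which text stands for a subset (first seen, highest ranked, most clicked) is a design choice; the sketch simply keeps the first.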
If the directory of different adjacent segments of the same keyword, and the representative sequence already obtained, still contain too much, we can compare the different adjacent segments of the same keyword across the texts for similarity. Different adjacent segments that meet a given similarity requirement with one another are placed in the same similarity subset; or those that do not meet it are placed in different similarity subsets; or those that do not meet it are compiled into a sequence or directory of mutually dissimilar adjacent segments. The content common to the elements of a similarity subset can serve as the name or label of that subset, or can be listed in a sequence or directory of similarity-subset names.
The similarity comparison methods and similarity requirements can take many forms. When needed, it may be stipulated that:
the given similarity requirement includes at least a requirement on the number or proportion of identical characters, words, phrases, or symbols contained in the different adjacent segments.
For example, for a sequence or directory of different adjacent segments four words long (with or without function words omitted), the similarity requirement may be that different adjacent segments share at least 4 or 3 words, not necessarily in the same order. The requirement may be preset by the system or chosen by the searcher. In this example, the 4 or 3 co-occurring words, separated by punctuation, can serve as the title of the corresponding similarity subset or as one directory entry.
Fig. 3 shows another example of similarity subsets, formed by similarity comparison on the basis of a sequence of different keyword adjacent segments (taken as the three words following the keyword). The similarity requirement is preset by the system as "must contain the same three words, in any order". The keyword of each text in this example is "search engine", denoted by the symbol "K". The name of each similarity subset formed from different adjacent segments is given by the words those segments share, represented here by capital letters. The different adjacent segments of the first similarity subset in Fig. 3 all contain the three words X, Y, and Z. Because the shared words occur in different orders, adjacent segments in the same similarity subset can have different first adjacent segments, and these form the next-level subsets of that similarity subset.
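The Fig. 3 rule ("same three words, in any order") can be sketched by using the sorted shared words as the subset name. The sketch covers only this all-words-shared case, not the looser "at least 3 of 4 words" requirement mentioned earlier; the function names are hypothetical and the letters X, Y, Z, W stand in for real words as in Fig. 3.

```python
from collections import defaultdict


def similarity_subsets(segments):
    """Group adjacent segments that contain the same words in any order."""
    subsets = defaultdict(list)
    for seg in segments:
        name = ", ".join(sorted(seg))   # shared words, separated by punctuation
        subsets[name].append(seg)
    return dict(subsets)


def first_segment_subsets(subset):
    """Split one similarity subset by the first word after the keyword."""
    lower = defaultdict(list)
    for seg in subset:
        lower[seg[0]].append(seg)
    return dict(lower)


segments = [("X", "Y", "Z"), ("Z", "X", "Y"), ("Y", "X", "Z"), ("X", "Y", "W")]
groups = similarity_subsets(segments)
print(groups)
print(first_segment_subsets(groups["X, Y, Z"]))
```

Here the three orderings of X, Y, Z fall into one similarity subset named "X, Y, Z", and that subset then splits into lower-level subsets by first adjacent segment.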
When needed, a subset containing a certain adjacent or indirectly adjacent segment may be made a lower-level subset of the similarity subset to which that segment belongs. Likewise, a text or text portion containing a certain adjacent segment may be placed in the similarity subset to which that segment belongs.
Clearly, similarity subsets are compiled on the basis of the original directory or sequence of adjacent segments, so the number of similarity subsets, or the length of their directory, is markedly smaller than that of the original directory or sequence. From the names of the similarity subsets in the directory, the searcher can more easily see the main components of the relevant adjacent segments (several parallel, independent words) and, if interested, open a similarity subset to obtain the different keyword segments and related information of each of its lower-level subsets.
The technique can adopt a fairly efficient approach: fix the number of words in the keyword's adjacent segment at a value from 2 to 10, for example 6. Processing then yields a sequence or directory of the different adjacent segments. If necessary (for example, when there are too many entries, more than several hundred), similarity comparison can be applied to obtain a directory or sequence of similarity subsets (for example, reduced to a few dozen). This is very convenient for querying.
The method can also treat each of the above directories or sequences, or their charts, as an independent text, and give each independent text a title or bibliographic entry containing the corresponding keyword, or containing the corresponding keyword together with the corresponding shared adjacent segment, adjacent segment group, or similarity-subset name. Then, even if these independent texts are stored as ordinary web pages or texts, the search engine system can readily use their titles or bibliographic entries, during data collection, index generation, or keyword query, to place them among the search results for the corresponding keyword or key sentence (that is, the keyword together with its adjacent segments).
When needed, the title or bibliographic entry of the independent text of a directory, sequence, or chart may also contain, in or near it, a hint character or combination of hint characters indicating its relevant characteristics.
The hint characters or their combination may include unique hint characters specified by, or reserved for, the provider of the directory chart text. They may also include characters indicating the time or order in which the chart or text was formed or published, or the number of levels in the directory shown in the chart.
For example, after the corresponding keyword in the title (e.g. "Bollinger"), or the keyword together with its adjacent segment (e.g. "Bollinger Bands indicator") or adjacent segment group, one may append "directory", "、、example sentence", "、、M", "++L", "==directory", or "==directory 09" (the title then becomes "Bollinger==directory 09" or "Bollinger Bands indicator==directory 09"), or other agreed hint characters that search engines recognize easily. This makes searching more convenient and accurate for both the searcher and the search engine.
We can also give the keyword search result ranking device of the relevant search engine or retrieval system the ability to recognize the hint characters, or combinations of them, in or near the titles or bibliographic entries of the independent texts of the directories or sequences. For example, when needed, the device is arranged so that when it encounters, among the keyword search results, a text whose title or bibliographic entry contains "==directory", it places that text ahead of the other results for that keyword. Then, when needed, if the searcher enters the hint characters "==directory" along with the keyword (for example "search engine", so that the full input is "search engine==directory"), or instead enters the keyword and then clicks the corresponding selection key on the interactive interface (such as a "Directory" key) or performs a similar operation, the information about charts of classification directories of texts containing "search engine" is ranked ahead of the other search results containing "search engine".
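The marker-aware ranking described above can be sketched as a stable re-sort of ordinary results. The marker string "==directory" comes from the text; everything else (the function name `rank_results`, the result format as dicts with a "title" key, matching by substring) is an assumption made for illustration.

```python
MARKER = "==directory"


def rank_results(results, query):
    """Return results whose title contains the keyword.

    If the query ends with the marker, titles carrying the marker
    are moved to the front; the original order is otherwise kept.
    """
    wants_directory = query.endswith(MARKER)
    keyword = query[:-len(MARKER)] if wants_directory else query
    hits = [r for r in results if keyword in r["title"]]
    if wants_directory:
        # False sorts before True, and list.sort is stable
        hits.sort(key=lambda r: MARKER not in r["title"])
    return hits


results = [
    {"title": "search engine basics"},
    {"title": "search engine==directory 09"},
    {"title": "cooking at home"},
]
print([r["title"] for r in rank_results(results, "search engine==directory")])
```

Because the change is confined to the final sort key, an existing engine would need little modification, which matches the point made about existing search engines.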
Notably, with this arrangement, whether the relevant directory or sequence charts come from the search engine's own database, from another computer data processing system, or from the Internet, a search engine based on existing technology needs no major changes in order to find the relevant directory or sequence charts during a keyword query, or to offer them to the searcher preferentially.
The processing method may also place the keyword index data (such as keyword inverted lists) of some or all of the provided directory texts, or of their titles or bibliographic entries, in a dedicated index database of the search engine or retrieval system, or in a dedicated region of the index database. Then, when the search engine system needs to provide these directories and sequences to the searcher, it can run the keyword search directly in that dedicated database or region, which markedly speeds up the search and keeps information unrelated to the directories or sequences out of the results.
The contents of the above directories or sequences obtained by the method of the invention may sometimes be ordered randomly, or by known existing ranking techniques. Alternatively, when needed, the specific position of any one of the parallel subsets, parallel adjacent or indirectly adjacent segments, parallel texts, parallel sentences, example sentences, summary instances, or items of representative-sequence information may depend partly or wholly on one or more of the following factors:
the PageRank (Page link) value, click rate, or keyword occurrence rate of the text, or of the text containing the segment, sentence, example sentence, summary instance, or item of information,
or the number of lower-level subsets or subordinate texts of the subset, the click rate of the subset, or the average PageRank value of the texts in the subset,
or the number of lower-level subsets or subordinate texts of the subset containing the segment, text, sentence, example sentence, summary instance, or item of information, the click rate of that subset, or the average PageRank value of its texts,
or the PageRank value of the text with the highest PageRank value in the subset, or of another text instance,
or the click rate or keyword occurrence rate of the text with the highest click rate or keyword occurrence rate in the subset, or of another text instance,
or the ranking of the relevant text, or of the relevant texts in the relevant subset, in the search results of other search sites or retrieval systems,
or the level of payment or bidding by the sponsor of the relevant text or segment,
or the alphabetical order of the spelling or pinyin, or the stroke order, of the words in the relevant adjacent segment,
or the rating of the text's source website, organization, or person,
or the order or recency of the time at which the relevant text was indexed,
or whether it belongs to the same subset at a given level.
When needed, the specific ranking position may be determined by the value of an objective function that depends on one or more variables, some or all of which represent one or more of the factors listed above.
For example, the value of an objective function may be written $F(x_1, x_2, \ldots, x_n)$,
and one may set, for example, $F(x_1, x_2, \ldots, x_n) = F(x_1) + F(x_2) + \cdots + F(x_n)$, where $x_1, x_2, \ldots, x_n$ are one or more of the factors (variables) that determine the specific ranking position, as mentioned earlier in the Summary of the Invention, or other factors. Since the prior art (for example US Pat. No. 6,285,999) contains many specific treatments, they are not detailed here.
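The additive objective function above can be sketched as a sum of one term per factor. The factor names and the per-factor term functions in `TERMS` are purely hypothetical choices for illustration; the text does not specify them.

```python
import math

# Hypothetical per-factor terms; each maps one factor value to its contribution.
TERMS = {
    "page_link_value": lambda x: 10.0 * x,
    "click_rate": lambda x: 5.0 * x,
    "subset_size": lambda x: math.log1p(x),
}


def objective(factors):
    """Sum the contribution of each factor that is present."""
    return sum(TERMS[name](value) for name, value in factors.items())


entries = {
    "subset A": {"page_link_value": 0.5, "click_rate": 0.2},
    "subset B": {"page_link_value": 0.1, "click_rate": 0.9},
}
# Higher objective value first.
ranked = sorted(entries, key=lambda name: objective(entries[name]), reverse=True)
print(ranked)
```

An additive form keeps each factor's influence easy to tune independently; other combinations (products, thresholds) are equally compatible with the text.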
When needed, the processing method of the invention also allows additional keywords that must or must not be present to be added to or removed from an existing processing method or result, or restrictions of time, region, language, or other type, scope, or requirement to be added or removed, so as to obtain more refined or broader results.
For example, the invention allows the content of a subset obtained by comparing adjacent segments under a loose requirement (for example, ignoring differences in function words within the adjacent segments) to be compared again under a stricter requirement (for example, not ignoring function-word differences), so as to divide it into next-level subsets or obtain a more detailed adjacent-segment directory or corresponding information. The reverse operation is also possible.
Adding or removing a keyword (such as "China"), or changing a restriction of time (for example, from within one year to within half a year or two years), region (Hebei, Baoding, or North China), language (such as English or Spanish), other type (such as articles or toys), or scope (such as boys, children, or people) conveniently narrows or widens the search.
A further aspect of the invention is another computer data system including a storage device, in which the data structure of some or all of the keyword indexes of the relevant texts held in the storage device, or in its data portion, comprises at least:
a keyword field;
one or more adjacent-segment fields, formed by mapping, in their original order, a predetermined number of successively adjacent segments of the keyword at each level in the corresponding text content or text summary, namely adjacent segment 1, adjacent segment 2, … adjacent segment N;
the ID field of the corresponding text, or the ID field of its related information (where the ID field means the address field);
and, where necessary, a summary field or title field of the corresponding text containing the keyword.
Generally speaking, keyword indexes are built so that a search or retrieval system can perform keyword retrieval conveniently, and the same text often needs indexes for several different keywords to serve different keyword searches. As an example of the invention, the index data structure of one text for the keyword "Yangtze River" is as follows:
Keyword | Adjacent segment 1 | Adjacent segment 2 | … Adjacent segment N | Text address | Title content
Yangtze River | basin | hydraulic | … benefit | XSTSS96 | -
With such a data structure, whether the search engine searches for "Yangtze River", for the longer search term "Yangtze River basin", or for the still longer "Yangtze River basin hydraulic", it can readily reach this index and then locate the text by its address, which helps in implementing the invention. That is, the computer data system may search for the corresponding index or content, or search in a varied manner, according to a combination of the keyword field and one or more of the adjacent-segment fields specified in the search, or according to an increase or decrease in the number of fields combined.
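The index record and the progressively longer searches described above can be sketched as follows. The field names, the `IndexEntry` class, and the second entry with address "A2" are hypothetical; only the "Yangtze River" example and the address XSTSS96 come from the text, and a linear scan stands in for whatever lookup structure a real system would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    keyword: str
    adjacent: tuple      # adjacent segment 1..N, in original order
    address: str         # the ID (address) field of the text
    title: str = ""


def search(index, keyword, wanted=()):
    """Return addresses of entries matching the keyword and leading adjacent segments."""
    wanted = tuple(wanted)
    return [
        e.address
        for e in index
        if e.keyword == keyword and e.adjacent[:len(wanted)] == wanted
    ]


index = [
    IndexEntry("Yangtze River", ("basin", "hydraulic", "benefit"), "XSTSS96"),
    IndexEntry("Yangtze River", ("bridge", "traffic", "flow"), "A2"),
]
print(search(index, "Yangtze River"))
print(search(index, "Yangtze River", ["basin"]))
print(search(index, "Yangtze River", ["basin", "hydraulic"]))
```

Lengthening or shortening the `wanted` list corresponds to increasing or decreasing the number of combined fields, which narrows or widens the set of matching indexes.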
For example, if each adjacent segment in the index is one word long, then once the keyword and the adjacent words of the query are determined, the computer can easily obtain the indexes whose keyword field and adjacent-segment fields all match the query.
The above address may be a database text address, a web page address, or another address.
The computer data system may also be a search engine system (see Embodiment B below).
The invention may include a computer-readable medium holding keyword indexes with the above data structure.
The invention may also be a computer data system including a storage device, in which some or all of the keyword index, text summary, or text data held in the storage device or its data portion is distributed as follows: index, text summary, or text data whose text summary or text contains the same keyword, with the same or different adjacent segments of that keyword, is located in the distribution region of the same or a different subset, respectively, of the same keyword set.
When needed, index data that lies in the same subset and whose text summary or text contains the same extended key sentence (that is, a keyword together with one or more levels of adjacent segments), with the same or different adjacent segments of that sentence, may be located in the distribution region of the same or a different subset, respectively, one or more levels lower within that subset.
For example, the indexes of the texts or text portions (such as summaries, bibliographic entries, sentences, or paragraphs) that share a keyword may be wholly or partly concentrated, or arranged contiguously, in a centralized storage region corresponding to that keyword. Where needed, these may include the indexes of the directory table (or subset directory table) of the keyword's various adjacent segments, of the multi-level adjacent-segment tree directory table (or multi-level subset directory table), or of the corresponding example-sentence sequence table or summary-instance sequence table (see, for example, Embodiment A below).
The data structure of each index described here comprises at least the address field of the indexed storage object (such as a text, directory table, or sequence table).
Then, when the keyword is queried, the relevant indexes can be accessed conveniently or contiguously, the address or number in each index's address field (ID field) obtained, and the relevant directory, text, or other content accessed, extracted, or displayed.
Similarly, the indexes of the texts or text portions that share a keyword may further be wholly or partly concentrated, or arranged contiguously, in centralized storage regions corresponding respectively to the different adjacent segments of that keyword. Where needed, these may include the indexes of the directory tables or tree directory tables of adjacent segments at one or more further levels, or of the corresponding example-sentence sequence tables or summary-instance sequence tables.
The computer data system may be a search engine system, which can then more conveniently query, process, or provide to users the data of the subset related to the queried keyword and adjacent segments, and of the subsets one or more levels below it.
The invention may include a computer-readable medium on which keyword index, text summary, or text data is distributed in the above manner.
The specific flow by which a computer system carries out the method of the invention can be illustrated by the examples of Fig. 11 and Figs. 7, 9, and 10 (including Embodiments A, B, C, and others). In the example of Fig. 11, the computer processing device starts work 61 and receives the keyword query submitted by the searcher 62, obtaining a large number of texts that contain the keyword. According to a preset or the searcher's choice, it determines the number or range of words in the keyword's adjacent segment (for example 5 content words) 63, compares and classifies the adjacent segments within that range from the different texts 64, and divides out the subsets whose adjacent segments are respectively identical 65. On this basis the resulting subsets can be subdivided 66, for example into next-level subsets according to whether the next-level adjacent segments are the same, or by comparing adjacent segments under a stricter requirement. A representative sequence, or a sequence of non-identical adjacent segments, can also be arranged, or a corresponding directory compiled 67 (including labelling each subset with its number of texts and sorting appropriately 71), and shown on the interface 70 for the searcher to select and operate on, expanding the relevant subset or content or displaying the relevant text 72.
If these sequences or directories have too many entries, the adjacent segments of the queried keyword in those entries can also be compared for similarity, dividing them into similarity subsets 68 or arranging a sequence or directory of mutually dissimilar content 69, which makes browsing easier. When the searcher finds content of interest, a click 70 expands the relevant subset or more detailed content 72. (The examples of Figs. 7, 9, and 10 are described later.)
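Steps 63 to 67 and 71 of the Fig. 11 flow can be sketched as two small functions. This is a sketch under simplifying assumptions, not the patent's implementation: texts are tokenized on whitespace (Chinese text would need a word segmenter), only the first occurrence of the keyword in each text is used, and the names `classify` and `directory` and the sample texts are hypothetical.

```python
from collections import defaultdict


def classify(texts, keyword, n_words):
    """Steps 63-65: group text ids by the n_words tokens following the keyword."""
    kw = keyword.split()
    subsets = defaultdict(list)
    for text_id, text in texts.items():
        tokens = text.split()
        for i in range(len(tokens) - len(kw) + 1):
            if tokens[i:i + len(kw)] == kw:
                seg = tuple(tokens[i + len(kw): i + len(kw) + n_words])
                if len(seg) == n_words:
                    subsets[seg].append(text_id)
                break   # only the first occurrence is considered
    return dict(subsets)


def directory(subsets, keyword):
    """Steps 67 and 71: one entry per subset with its text count, largest first."""
    ordered = sorted(subsets.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return [f"{keyword} {' '.join(seg)} ({len(ids)})" for seg, ids in ordered]


texts = {
    1: "the search engine index library is large",
    2: "a search engine index library grows",
    3: "search engine ranking device works",
}
subsets = classify(texts, "search engine", 2)
for line in directory(subsets, "search engine"):
    print(line)
```

Subdividing a subset (step 66) would amount to calling `classify` again on the texts of that subset with a larger `n_words`.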
Embodiment A, shown in Fig. 4, is an example of a computer data system able to carry out the computer electronic text processing method of the invention: an Internet search engine system that provides extended key-sentence search. It includes a search engine 8 on a server 5 having a memory 6 and a processor 7; the search engine 8 is connected through the communication network 4 of the Internet to a client 3 having an interactive interface 2; the search engine 8 has a database 9, a querier 11, and a keyword expansion component 10 or module, and is connected to a data collector 12 and an index constructor 13;
the data collector 12 collects texts from the Internet or other information sources and adds them to the text library of database 9, and the index constructor 13 analyzes the texts of the text library to obtain text indexes and supplies them to the keyword index library of database 9;
Each index that the index constructor 13 derives from its analysis of a text includes a keyword field, six single adjacent-segment fields, the ID field of the corresponding text, a text title field, and a text summary field. The search engine can thus find the required text index from the requested keyword field, alone or together with one or more requested single adjacent-segment fields, obtain the title field, summary field, or ID field of that text, and link to the original text when needed. The indexes in the keyword index library are distributed in multi-level subsets according to the similarities and differences of the adjacent segments at each level, for ease of retrieval or extraction. The corresponding adjacent-segment directory, adjacent-segment tree directory (Fig. 8), and key-sentence directory are also stored in advance.
The client application browser on client 3 of Embodiment A (Microsoft's Internet Explorer) allows user 1 to retrieve HTML documents (including Web forms) from server 5 through communication network 4. The interactive interface (UI) 2 on client 3 allows user 1 to interact with the retrieved Web form using a monitor, keyboard, or mouse, to submit search requests, make selections, and receive search results.
An important issue in the search method of the invention is how adjacent segments are selected (or how the keyword is combined with adjacent segments), that is, how key sentences are generated. The exemplary key sentences of Embodiment A, shown in Fig. 5, are formed in the text summary by starting from keyword 21 and adding adjacent segments (single words in this example) one at a time in the direction of the text that follows the keyword. Here 22 is the level-1 key sentence, 23 the level-2 key sentence, 24 the level-3 key sentence, and 25 the level-4 key sentence.
Fig. 6 shows the key-sentence generation of another embodiment, Embodiment B. Its level-1 adjacent segment precedes keyword 21, while the level-2 and further adjacent segments follow the keyword. Here 22 is the level-1 key sentence, 23 the level-2 key sentence, 24 the level-3 key sentence, and 25 the level-4 key sentence. This approach, which takes in both sides of the keyword, seems better suited to searching Western-language documents. The length (number of words) of the adjacent segments of the key sentences at each level may also be predefined, chosen by the searcher at search time, or left to the system default.
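The two key-sentence generation schemes (Embodiment A, extending only after the keyword; Embodiment B, taking the level-1 segment from before the keyword) can be sketched as below. The sketch assumes a single-token keyword at a known position and one-word adjacent segments; the function name `key_sentences`, the `first_before` flag, and the sample sentence are hypothetical.

```python
def key_sentences(tokens, kw_index, levels, first_before=False):
    """Return the level-1..levels key sentences around tokens[kw_index].

    first_before=False: every level adds the next word after the keyword.
    first_before=True:  level 1 adds the word before the keyword, and
                        later levels add words after it.
    """
    start, end = kw_index, kw_index + 1
    sentences = []
    for level in range(1, levels + 1):
        if first_before and level == 1:
            if start == 0:
                break            # no word before the keyword
            start -= 1
        else:
            if end >= len(tokens):
                break            # no more words after the keyword
            end += 1
        sentences.append(" ".join(tokens[start:end]))
    return sentences


tokens = ["new", "engine", "ranks", "web", "pages", "quickly"]
print(key_sentences(tokens, 1, 4))                     # Embodiment A style
print(key_sentences(tokens, 1, 4, first_before=True))  # Embodiment B style
```

Each returned string contains the previous one, so the level-k key sentence always narrows the level-(k-1) subset.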
In other, more extreme embodiments, the key statement may instead be extended repeatedly from the keyword into the adjacent segments that precede it, forming the key statements of each level.
For searches on a multi-word keyword whose words are allowed to be separated, one of the words should be chosen as the core keyword, and the key statements of each level are formed by combining it with its adjacent segments; each of these key statements carries the remaining, separable keywords with it. Alternatively, adjacent segments may be added one after another, in whatever order is required, near each word or segment of the multi-word keyword to form the key statements of each level.
When the system of Embodiment A counts the words of an adjacent segment, or compares adjacent segments to decide whether they are the same, it may choose not to count function words, measure words, punctuation, spaces, and the like, and instead merge them into the neighboring content word. Corresponding specific rules may also be laid down for Western languages. In other embodiments, adjectives, adverbs, and the like may likewise be left out of the count and the comparison where required.
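As a rough illustration of counting only content words, the sketch below collects an adjacent segment until a given number of content words has been seen, while function words and punctuation ride along with the content word that follows them. The tiny `FUNCTION_WORDS` set, the helper names, and the choice to merge forward are assumptions for the example; a real system would use a proper lexicon or part-of-speech tagger.

```python
# Illustrative stop list only; not a rule taken from the patent.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or"}


def is_content(tok):
    """A token counts if it is not a function word and has a letter or digit."""
    return tok.lower() not in FUNCTION_WORDS and any(ch.isalnum() for ch in tok)


def adjacent_segment(tokens, start, n_content):
    """Collect tokens from `start` until n_content content words are included.

    Returns (surface_tokens, key): surface_tokens keeps the skipped function
    words and punctuation; key holds only the lowercased content words and is
    what gets compared when deciding whether two segments are "the same".
    """
    surface, key = [], []
    for tok in tokens[start:]:
        if len(key) == n_content:
            break
        surface.append(tok)
        if is_content(tok):
            key.append(tok.lower())
    return surface, tuple(key)


a = adjacent_segment(["Bollinger", "of", "the", "band", ",", "indicator"], 1, 1)
b = adjacent_segment(["Bollinger", "band", "width"], 1, 1)
print(a, b, a[1] == b[1])
```

Here the two segments differ on the surface but share the same comparison key, so their texts would fall into the same subset.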
In Embodiment A, querier 11 authenticates the query request of user 1, queries database 9 according to the submitted keyword, and lists the relevant results for delivery to the interactive interface. Keyword expansion component 10 supplements querier 11: when needed, it temporarily stores or processes the key statements of each level for that keyword, the corresponding example sentences, the tree-structured directory of adjacent segments (see Fig. 8), and so on, to serve later searching or display. If these have not yet been prepared in database 9 or in keyword expansion component 10, keyword expansion component 10 builds them from the keyword query data of querier 11.
In practice this is easy to achieve, and various methods can be used. For example, whether before or after a query, and for either a possible keyword or one actually submitted, keyword expansion component 10 of Embodiment A (or a computer or other search system) can take any one entry (for example, the first) from the sequence of indexes or documents containing the keyword, examine the keyword together with the adjacent word or phrase, that is, the adjacent segment of the predetermined length, and store them as the first key statement. It then takes the second index or document and checks whether the word or phrase adjacent to the keyword is the same as in the first: if it differs, it is stored in turn; if it is the same, it is discarded. It then examines the third index or document and compares it with the first two, and so on, yielding a set of mutually distinct key statements. If, during this comparison, the indexes or documents containing the same key statement are also gathered into groups, the subsets are formed at the same time; otherwise, querier 11 can retrieve the sequence of indexes or documents using each key statement as the criterion, which yields the corresponding subsets. If the level-2 adjacent segments are then searched in the same way within the sequence of each index or document subset, the level-2 key statements and the corresponding lower-level subsets are obtained, and so on. If one entry (for example, the first) or several indexes or summaries are selected from each resulting subset as example sentences, the required directory and sequence of example sentences are obtained, which completes the grouping operation.
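The grouping procedure described above can be sketched as a single pass that buckets texts by the segment adjacent to the keyword and then repeats the step inside each bucket. This is a minimal sketch under stated assumptions: texts are pre-tokenized lists of words, every text contains the keyword, segments follow the keyword, and the names `adjacent` and `group` are hypothetical.

```python
def adjacent(tokens, keyword, offset, seg_len):
    """Segment of seg_len words starting `offset` words after the keyword.

    Assumes the keyword occurs in tokens; only its first occurrence is used.
    """
    start = tokens.index(keyword) + 1 + offset
    return tuple(tokens[start:start + seg_len])


def group(texts, keyword, seg_len=1, depth=2, offset=0):
    """Build a nested directory {segment: {example, count, children}}."""
    subsets = {}
    for tokens in texts:                       # scan the sequence in its given order
        seg = adjacent(tokens, keyword, offset, seg_len)
        # A segment not seen before opens a new subset; a repeat joins its subset.
        subsets.setdefault(seg, []).append(tokens)
    tree = {}
    for seg, members in subsets.items():
        children = {}
        if depth > 1 and seg:                  # subdivide by the next-level segment
            children = group(members, keyword, seg_len, depth - 1, offset + seg_len)
        tree[seg] = {
            "example": " ".join(members[0]),   # first member serves as example sentence
            "count": len(members),
            "children": children,
        }
    return tree


texts = [s.split() for s in (
    "Bollinger band indicator explained",
    "Bollinger band width",
    "Bollinger bounce strategy",
    "Bollinger band indicator settings",
)]
tree = group(texts, "Bollinger")
for seg, node in tree.items():
    print(seg, node["count"], list(node["children"]))
```

Taking the first member as the example sentence mirrors the "for example, the first" choice in the text; any other selection rule could be substituted.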
This shows directly that the method can process the relevant texts to facilitate querying, whether before or after a query and whether based on possible keywords or on keywords actually submitted.
In effect, by enlarging the adjacent segment of the original keyword and comparing whether the enlarged parts are the same, the original subset sharing one adjacent segment is further subdivided into several next-level subsets.
The entries of the directory and of the example-sentence sequence can be ordered, for example, by the value of an objective function. The value assigned to an entry is that of the text with the largest objective function value in the entry's subset, and equals the sum of that text's Page link value (PageRank) and its recent click rate. The example sentence can be quoted from the text with the largest objective function value in the subset.
In other embodiments, the ordering of a sequence of directory entries, example sentences, bibliographic entries, summaries, or similar information can be determined by the value of an objective function $F(X_1, X_2, \ldots, X_n)$.
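The ordering rule just described (score each subset by its best text, where a text's score is link value plus recent click rate) might look like the following. It is a sketch only: the `Text` record, its field names, and the sample numbers are invented for illustration, and no particular scaling of the two terms is implied by the source.

```python
from dataclasses import dataclass


@dataclass
class Text:
    summary: str
    link_value: float      # assumed to be a precomputed link-analysis score
    recent_clicks: float   # assumed to be a normalized recent click rate


def objective(text):
    """Objective value of one text: link value plus recent click rate."""
    return text.link_value + text.recent_clicks


def rank_subsets(subsets):
    """subsets maps a key statement to its list of Text records.

    Returns (key_statement, score, example) rows, highest score first; the
    score and the example both come from the subset's best-scoring text.
    """
    rows = []
    for key_statement, texts in subsets.items():
        best = max(texts, key=objective)
        rows.append((key_statement, objective(best), best.summary))
    return sorted(rows, key=lambda row: row[1], reverse=True)


subsets = {
    "Bollinger bounce": [Text("c", 0.25, 0.25)],
    "Bollinger band": [Text("a", 0.5, 0.25), Text("b", 1.0, 0.5)],
}
for row in rank_subsets(subsets):
    print(row)
```

A more general $F(X_1, \ldots, X_n)$ would simply replace the body of `objective`.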
For texts that carry advertising content, the objective function value may equal the corresponding bid.
Since the prior art offers many specific methods for ranking texts, they are not detailed here.
The keyword index of Embodiment A can be organized by subset and occupies no more storage space than other existing keyword index libraries, which is one of its notable advantages.
In another embodiment, Embodiment B, the keyword index library is not organized by subset. Because its index data structure contains a keyword field and several adjacent-segment fields, querier 11 can directly retrieve and display the indexes belonging to the corresponding subset, using a key statement composed of the keyword field and one or more adjacent segments. Embodiment B only needs to maintain a tree directory of adjacent segments or key statements, and may even leave an existing conventional keyword index database unchanged.
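One way to picture an index entry that carries a keyword field plus ordered adjacent-segment fields is the record below, with a lookup that matches a key statement as a prefix of those fields. The `IndexEntry` layout and the linear scan in `search` are assumptions chosen for brevity, not the storage format of the patent; a production index would use an inverted or sorted structure.

```python
from typing import NamedTuple, Tuple


class IndexEntry(NamedTuple):
    keyword: str
    adjacent: Tuple[str, ...]   # adjacent segment 1, 2, ..., N in original order
    text_id: int


def search(index, keyword, segments=()):
    """Return the text IDs whose entry matches the keyword and whose leading
    adjacent segments equal `segments` (an empty tuple matches every entry)."""
    segments = tuple(segments)
    return [
        entry.text_id
        for entry in index
        if entry.keyword == keyword and entry.adjacent[:len(segments)] == segments
    ]


index = [
    IndexEntry("Bollinger", ("band", "indicator"), 1),
    IndexEntry("Bollinger", ("band", "width"), 2),
    IndexEntry("Bollinger", ("bounce", "strategy"), 3),
]
print(search(index, "Bollinger"))                     # plain keyword search
print(search(index, "Bollinger", ("band",)))          # level-1 key statement
print(search(index, "Bollinger", ("band", "width")))  # level-2 key statement
```

Because each subset is just a prefix match, no per-subset storage layout is needed, which is the point made for Embodiment B.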
The index system of the present invention may also use a conventional keyword inverted-index system or method.
Of course, the lower-level subsets of an existing subset can also be obtained more generally. For example, a tree directory of adjacent segments like the one in Fig. 8 can be used where needed to reflect the order of a keyword's adjacent segments at different levels; displaying it on screen helps the user grasp the overall state of the subsets at each level and adopt a better search strategy. The figure omits the text counts of the individual subsets. The keyword there is "布林" ("Bollinger"), and adjacent segments 1, 2, 3, and 4 each consist of a single content word; they also represent the common adjacent segment contained in the subsets of each level.
Embodiment A can perform a selection operation: based on where the searcher places the cursor on a text or summary on a page of the interactive interface, on the directory, or in a selection bar, the system determines the corresponding key statement. It then either performs the grouping operation on the extended key statements, extended adjacent segments, indexes, or text summaries corresponding to that key statement, or displays the corresponding indexes, text summaries, or texts in ranked order, or performs a removal operation that removes or relocates the entries, indexes, text summaries, or texts containing that key statement on that page or on multiple other pages.
Fig. 10 is a schematic partial screen view of one cursor click during a search, and the resulting display (that is, a selection operation that performs grouping), according to one embodiment of the present invention.
Search box 51 is for entering the keyword (here "布林", "Bollinger"), and 52 offers two options for the click operation, "click to expand" or "click to remove"; in this example "click to expand" is selected. The object to be clicked is summary 55, shown in summary column 53 on the screen. While reading, the searcher becomes interested in content about "布林线指标" ("Bollinger Bands indicator") and clicks with cursor 54 on the character "标". The span from "布林" to "标", namely "布林线指标", then becomes the new key statement, and the grouping operation lists several further-extended adjacent segments or their respective example sentences 56.
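The click-to-expand step can be illustrated by deriving the new key statement from the clicked character position. This is a hypothetical sketch: the function name, the use of a character offset for the click, and the choice of the nearest keyword occurrence at or before the click are assumptions, and the sample summary string is invented.

```python
def key_statement_from_click(summary, keyword, click_pos):
    """Return the span from the keyword through the clicked character.

    Uses the last occurrence of `keyword` that starts at or before click_pos;
    returns None if the click lies before every occurrence.
    """
    start = summary.rfind(keyword, 0, click_pos + len(keyword))
    if start == -1:
        return None
    # A click inside the keyword itself yields just the keyword.
    end = max(click_pos + 1, start + len(keyword))
    return summary[start:end]


summary = "常用的布林线指标说明"
click = summary.index("标")                 # simulate clicking the character "标"
print(key_statement_from_click(summary, "布林", click))
```

The returned string would then be handed to the grouping operation as the new key statement.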
The search method of Embodiment A also includes an ignore operation. While the searcher browses the sequence of indexes, text summaries, or texts containing the original key statement, the system can record or analyze the searcher's actions on the pages of interactive interface 2 (such as turning pages) and any "attention clicks" or "ignore clicks" made on entries or content. Key statements that have been ignored or have attracted no attention over a certain reading time or span, together with their related indexes and summaries further down the sequence, are then removed.
In the system of Embodiment A, after user 1 submits a keyword request through interactive interface 2, querier 11 queries the database as requested and provides the list of relevant results to interactive interface 2. If user 1 wishes to expand the keyword, keyword expansion component 10 generates the corresponding key statements and either extracts the required data itself or has querier 11 search for it.
The workflow of search engine 8 (including querier 11 and keyword expansion component 10) is illustrated in Fig. 9:
The system starts at module 41 and checks whether there is a keyword search request (42); if not, it returns (48). If there is, module 43 checks whether a keyword expansion operation has been requested. If not, module 44 displays an ordinary sequence of search results. If so, module 45 asks user 1 for their requirements through a prompt box on the screen of interactive interface 2, the corresponding operation is carried out, module 46 provides the corresponding information, and the system continues to ask user 1 for selections and requirements. After several such rounds, module 47 provides the corresponding search information and, as user 1 wishes, the system either returns (module 48) or ends (49).
The corresponding operating flow of user 1 at the interactive interface is shown in Fig. 7:
After opening interactive interface 2 and starting work (31), the user selects a keyword (32) and can either browse conventionally (34) or choose an extended search (33). Choosing (33), that is, searching with extended key statements, requires selecting a suitable mode of operation by cursor click, for example the length of the keyword's first adjacent segment (the number of words it contains). With a short segment there are fewer kinds of key statement (fewer subsets), but the content of each subset is more mixed; with a long segment there are more kinds of key statement (more subsets), and the core content of each subset is more uniform or focused.
Clearly, when the chosen key statement reaches five to six words in length, the resulting grouped sequence of unique indexes or summaries, as described above, is a "refined sequence" whose core content is essentially free of repetition with few omissions, while the total number of documents may fall by several orders of magnitude.
When a longer key statement is chosen, the number of unique indexes at the first level is larger. The system allows the user to reduce, by click operations, the number of adjacent words or adjacent segments in the key statement, which can greatly reduce the number of unique indexes, key statements, summaries, or example sentences at the first level or at the current level.
If the user declines to choose the length of the first adjacent segment or any other option, the system automatically performs the grouping operation with its preset values, for example one or two words per adjacent-segment level, and presents the result (35). User 1 can then either open a linked text directly from the results (37), or select a suitable extended key statement from the results on screen (36; see Fig. 10) and obtain the further search results shown by module 38 (the next-level subset directory and so on).
From here, user 1 can again either open a linked text directly (40) or continue by selecting another extended key statement (39), and so on, until returning (301).
Extending the key statement level by level in this way, and thereby narrowing the search range level by level, locks onto the search target quickly and effectively.
In Embodiment A, and likewise in other embodiments of the method of the present invention, the system can record or accumulate the number of clicks that one, some, or all searchers make within a given period on content related to the various key statements (each keyword with its various adjacent segments), or a corresponding statistics module can be provided where needed.
In Embodiment C, the key statement search technique described above is combined with existing keyword search technology: when ranking the indexes within a subset, or when selecting the example sentences in a grouping operation, care is taken to respect or preserve the ranking or position that the relevant documents have in the search results of a prior-art search system. In other words, the technique of the present invention includes applying prior-art search ranking principles or methods on top of the basic method and basic structure described above.
Except where otherwise indicated, Embodiments B and C are essentially the same as Embodiment A.
The technical features given in the above embodiments are illustrative. The technical features of an embodiment can be used independently of one another and must not be used to limit the scope of the present invention.

Claims

1. A computer-implemented method for processing a plurality of electronic texts containing the same keyword, comprising:
obtaining a plurality of electronic texts containing the same keyword;
specifying the number of words contained in an adjacent segment, or the manner in which adjacent segments are delimited;
for each of some or all of the texts, according to whether the adjacent segment of the keyword in that text's content is the same as or different from that in the other texts, assigning the text to the same or a different subset or category as the other texts, or processing it in a correspondingly same or different way;
wherein the correspondingly same or different processing may include: giving the texts the same or different distribution locations or storage modes; giving them the same or different subset marks; giving their indexes the same or different marks or index entries; giving them the same or different arrangements; giving them the same or different display modes or positions in the interactive interface; or allowing one or more adjacent segments or texts from each of at least some of the subsets to be combined or ranked across subsets, or displayed in the interactive interface;
wherein the texts may be electronic documents or web pages, or their summaries, indexes, bibliographic entries, or titles.
2. The processing method according to claim 1, comprising: for different texts that belong to one or more of the same first-level or higher-level subsets, or whose content contains the same keyword and adjacent segment, assigning some or all of those texts to the same or different next-level or multi-level subsets of that subset, or processing them in a correspondingly same or different way, according to whether the further adjacent segments that follow the same keyword and adjacent segment are the same or different;
wherein the processing method allows successive adjacent segments to be merged or split, so as to reduce or increase the number of subset levels.
3. The processing method according to claim 1, comprising:
compiling a single-level or multi-level directory, tree directory, or sequence that reflects the parallel or sequential relationships among the different adjacent segments or indirectly adjacent segments of the same keyword in the texts, or among the statements, example sentences, or summary instances containing those segments; which may include, for each of one or more different subsets of the texts, the same adjacent segment or same indirectly adjacent segment, or a statement, example sentence, or summary instance containing that segment; or may include, for each of a plurality of subsets at the next level or next several levels below this subset or these subsets, the same adjacent segment or indirectly adjacent segment, or a statement, example sentence, or summary instance containing that segment; these being arranged, distributed, stored, or displayed according to their parallel or hierarchical relationships; wherein the segments, statements, example sentences, or summary instances may be placed side by side across subsets.
4. The processing method according to claim 1, comprising:
compiling or obtaining a plurality of charts, each being a single-level or multi-level classification tree directory or example-sentence sequence for a plurality of texts that all contain a given keyword together with a given adjacent segment or group of adjacent segments; wherein the texts covered by each chart contain the same keyword and the same adjacent segment or group of adjacent segments, and the classification is made according to whether, in the respective texts, the next-level adjacent segment following that adjacent segment or group of adjacent segments is the same, and then, level by level, according to whether the adjacent segments at each subsequent level are the same.
5. The processing method according to claim 1, comprising:
compiling a sequence of a plurality of texts, or portions of texts, containing the same keyword, in which the multi-word adjacent segments of the keyword are mutually different or substantially mutually different.
6. The processing method according to claim 1, comprising:
comparing the different adjacent segments of the same keyword in the texts for similarity, and assigning different adjacent segments that meet a given similarity requirement with respect to one another to the same similarity subset, or assigning different adjacent segments that do not meet the requirement to different similarity subsets, or compiling different adjacent segments that do not meet the requirement into a sequence or directory of mutually dissimilar adjacent segments; wherein the content common to the elements of a similarity subset may be used as the name or mark of that subset, or entered in a sequence or directory of similarity-subset names;
wherein the given similarity requirement includes at least a requirement on the number or proportion of identical characters, words, or phrases contained in the different adjacent segments.
7. The processing method according to claim 1, 3, 4, 5, or 6, comprising:
treating each of the respective directories or sequences, or their charts, as an independent text, and giving each independent text a title or bibliographic entry containing the corresponding keyword, or containing the corresponding keyword together with the corresponding shared adjacent segment, group of adjacent segments, or similarity-subset name;
where needed, placing in or near the title or bibliographic entry of the independent text of the directory or sequence a prompt character, or combination of characters, indicating its relevant characteristics.
8. The processing method according to claim 7, comprising:
enabling the keyword search result ranking device of the relevant search engine or retrieval system to recognize the prompt characters, or combinations thereof, that appear in or near the title or bibliographic entry of the independent text of the directory or sequence and indicate its characteristics; where needed, independent texts, or portions of content, whose title or bibliographic entry has such characters or combinations in or near it may be ranked at the front of the search results.
9. The processing method according to claim 1, 3, 4, 5, or 6, comprising:
distributing the keyword index data of some or all of the provided directory texts, or of their titles or bibliographic entries, in a dedicated index database of the search engine or retrieval system, or in a dedicated area of its index database.
10. The processing method according to claim 1, wherein:
in the processing method or directory, the specific ranking position of any one of the parallel subsets, parallel adjacent segments or indirectly adjacent segments, parallel texts, or parallel statements, example sentences, summary instances, or representative sequence information depends partly or wholly on one or more of the following factors:
the Page link value, click rate, or keyword occurrence rate of the text, or of the text containing the segment, statement, example sentence, summary instance, or information;
or the number of lower-level subsets or subordinate texts of the subset, the click rate of the subset, or the average Page link value of the texts in the subset;
or, for the subset containing the segment, text, statement, example sentence, or summary instance, the number of its lower-level subsets or subordinate texts, its click rate, or the average Page link value of its texts; or the Page link value of the text with the highest Page link value in that subset, or of another text instance;
or the click rate or keyword occurrence rate of the text with the highest click rate or highest keyword occurrence rate in the subset, or of another text instance;
or the ranking of the relevant text, or of the relevant texts within the relevant subset, in the search results of other search websites or retrieval systems;
or the amount paid or bid by the sponsor of the relevant text or segment;
or the alphabetical order of the spelling or pinyin, or the stroke order, of the words or characters of the relevant adjacent segment;
or the rating of the source website, organization, or person of the text;
or the time at which the relevant text was collected, or how recent it is;
or whether it belongs to the same subset at a given level;
or it may be determined by the value of an objective function that depends on one or more variables, some or all of which respectively represent one or more of the factors listed above.
11. A computer data system comprising a storage device, characterized in that the data structure of some or all of the keyword indexes contained in the storage device, or in a data portion thereof, comprises at least:
a keyword field;
one or more adjacent-segment fields, formed by mapping, in their original order, a predetermined number of successively adjacent segments of the keyword in the corresponding text content or text summary, namely: adjacent segment 1, adjacent segment 2, ..., adjacent segment N;
a corresponding text ID field, or an ID field for its related information;
and, where necessary, a summary field or title field of the corresponding text containing the keyword.
12. A computer data system comprising a storage device, characterized in that some or all of the keyword index data, text summary data, or text data contained in the storage device, or in a data portion thereof, are distributed as follows:
the data of indexes, text summaries, or texts whose text summary or text contains the same keyword, and in which the adjacent segment of that keyword is the same or different, are located in the distribution area of the same or a different subset, respectively, of the same keyword set;
where needed, index data that are located in the same subset, whose text summary or text contains the same extended key statement, and in which the adjacent segment of that statement is the same or different, may be located in the distribution area of the same or a different lower-level or multi-level subset, respectively, of that subset.
13. A search method by which a search engine provides the results an inquirer expects, wherein the search engine system responds to a keyword query request submitted by the inquirer via an interactive interface, searches the information sources or databases associated with the system, and provides texts, text summaries, indexes, or related information that satisfy the keyword request; the search method is characterized in that it comprises:
the system receiving the inquirer's keyword query request via the interactive interface;
after confirmation, querying a database containing a keyword index according to the keyword request;
taking each occurrence of the keyword in the text content or text summary that contains it, together with its adjacent segment, as a key sentence;
the number of words or characters contained in the adjacent segment, or the way in which the adjacent segment is terminated, is predetermined by the system, or agreed to, accepted by default, or selected by the inquirer; it may also be determined from a symbol, character, or word at or near the end of the adjacent segment, or from its font, color, or a space, or from the position and manner of a cursor indication made by the inquirer in a selection bar presented on the interactive interface or on a page containing the text summary, text, or related content of a specific index;
according to the adjacent segments or key sentences, sorting out the mutually distinct adjacent segments or the mutually distinct key sentences;
generating search results from the key sentences obtained, that is: retrieving, processing, arranging, or organizing the different indexes, text summaries, texts, or bibliographic entries that contain the same or different key sentences, for the inquirer to select via the interactive interface.
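The steps of claim 13 can be illustrated with a minimal sketch. It assumes that a key sentence is the keyword followed by its adjacent segment, that the adjacent segment is cut off after a fixed number of words or at the first punctuation mark, and that matching is plain substring matching; the function names, parameters, and both cut-off rules are hypothetical choices for illustration, not details taken from the claim.

```python
import re
from collections import defaultdict

def key_sentences(text, keyword, max_words=2, stop_chars=".,;!?"):
    """Return each occurrence of `keyword` joined with its adjacent segment.

    Assumption: the adjacent segment is the run of words after the keyword,
    cut off after `max_words` words or at the first character in
    `stop_chars`, whichever comes first.
    """
    results = []
    for match in re.finditer(re.escape(keyword), text):
        tail = text[match.end():]
        # Terminate the segment at the first punctuation mark, if any.
        cut = re.search("[" + re.escape(stop_chars) + "]", tail)
        if cut:
            tail = tail[:cut.start()]
        words = tail.split()[:max_words]
        results.append(" ".join([keyword] + words))
    return results

def group_by_key_sentence(docs, keyword, **kwargs):
    """Map each distinct key sentence to the sorted IDs of documents containing it."""
    groups = defaultdict(set)
    for doc_id, text in docs.items():
        for sentence in key_sentences(text, keyword, **kwargs):
            groups[sentence].add(doc_id)
    return {sentence: sorted(ids) for sentence, ids in groups.items()}
```

Calling `group_by_key_sentence(docs, "apple")` on a small dictionary of document texts yields one entry per distinct key sentence, which corresponds to the list of distinct key sentences the claim offers to the inquirer for selection.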
14. The search method according to claim 13, wherein the operations of the method may be performed by the system in advance or at query time.
15. The search method according to claim 13, further comprising:
taking the key sentence as it appears in the text content or text summary that contains it, together with its adjacent segment, or the original key sentence together with its adjacent segment, as an extended key sentence;
the number of words or characters contained in the adjacent segment, or the way in which the adjacent segment is terminated, or its specific content, is predetermined by the system, or agreed to, accepted by default, or selected by the inquirer;
according to the adjacent segments or key sentences, sorting out the mutually distinct adjacent segments or the mutually distinct extended key sentences;
generating search results from the extended key sentences obtained, that is: retrieving, processing, arranging, organizing, or separately storing the different indexes, text summaries, texts, or bibliographic entries that contain the same or different extended key sentences, for the inquirer to select via the interactive interface.
Where necessary, the above operations may be performed by the system in advance or at query time.
16. The search method according to claim 13, comprising a grouping operation:
namely, allowing the various different key sentences, adjacent segments, indexes, text summaries, or texts that contain the same keyword, or the various different extended key sentences, adjacent segments, indexes, text summaries, or texts that contain the same original key sentence, to be grouped and arranged or displayed as a directory or sequence, in which only one or a few key sentences, indexes, text summaries, or texts are included for each kind of adjacent segment.
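The grouping operation can be sketched as follows, assuming the search results arrive as ranked pairs of key sentence and document ID and that the directory keeps only the first few documents for each distinct key sentence. The name `build_directory` and the parameter `per_group` are hypothetical.

```python
def build_directory(entries, per_group=1):
    """Build a directory keyed by key sentence.

    `entries` is an iterable of (key_sentence, doc_id) pairs in ranked order.
    Only the first `per_group` distinct documents are kept for each key
    sentence, so every kind of adjacent segment is represented once or a
    few times rather than by every matching document.
    """
    directory = {}
    for sentence, doc_id in entries:
        bucket = directory.setdefault(sentence, [])
        if len(bucket) < per_group and doc_id not in bucket:
            bucket.append(doc_id)
    return directory
```

Because Python dictionaries preserve insertion order, the directory lists key sentences in the order in which they first appear in the ranked results.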
17. The search method according to claim 13, comprising:
causing the data of some or all of the keyword indexes, text summaries, or texts to be stored in different or identical subset regions, or in different or identical lower-level subset regions, according to whether the keywords, key sentences, or extended key sentences they contain are different or the same;
so that, at keyword query time, the corresponding key sentence, keyword index, text summary, or text data can be extracted or provided directly.
18. The search method according to claim 13, comprising:
analyzing the texts or summaries in the database, or the texts obtained from the Internet or other information sources by a data collection server attached to the search engine, to produce for each text a corresponding index containing at least a keyword field, an adjacent-segment field, and a text ID field or an ID field for related content, including a text summary or title where necessary, and storing it;
at search time, retrieving from storage and providing the corresponding index, summary, or text according to the keyword field and adjacent-segment field it contains.
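One possible shape for such an index is sketched below, assuming an in-memory store nested first by keyword and then by adjacent segment. The class names `IndexRecord` and `SegmentIndex` and the field names are hypothetical; the claim only requires that a record carry a keyword field, an adjacent-segment field, and a text ID field.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class IndexRecord:
    keyword: str           # keyword field
    adjacent_segment: str  # adjacent-segment field
    text_id: int           # text ID field
    title: str = ""        # optional title or summary

class SegmentIndex:
    """Store records in nested subsets: keyword -> adjacent segment -> records."""

    def __init__(self):
        self._store = defaultdict(lambda: defaultdict(list))

    def add(self, record):
        self._store[record.keyword][record.adjacent_segment].append(record)

    def lookup(self, keyword, adjacent_segment=None):
        """Return records for a keyword, optionally narrowed to one adjacent segment."""
        by_segment = self._store.get(keyword, {})
        if adjacent_segment is None:
            return [r for records in by_segment.values() for r in records]
        return list(by_segment.get(adjacent_segment, []))
```

Nesting the store by keyword and then by adjacent segment also mirrors the subset layout of claim 17, since a lookup with both fields reads a single bucket directly.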
19. The search method according to claim 13, comprising:
compiling a tree directory that reflects the sequential or parallel relationships among the adjacent segments, at different levels, of a keyword in the texts or text summaries sharing that keyword, or a tree directory that reflects the sequential or parallel relationships among the key sentences extended from that keyword at different levels, for use at query time.
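A tree directory of this kind can be sketched as a nested dictionary, assuming each level of adjacent segment is a single whitespace-delimited word following the keyword. The function name `build_segment_tree` and the `depth` parameter are hypothetical.

```python
def build_segment_tree(texts, keyword, depth=2):
    """Build a nested dict of the words that follow `keyword`.

    Each nesting level corresponds to one level of adjacent segment:
    children of the root are first-level segments, their children are
    second-level segments, and so on down to `depth` levels.
    """
    tree = {}
    for text in texts:
        words = text.split()
        for i, word in enumerate(words):
            if word != keyword:
                continue
            node = tree
            for following in words[i + 1 : i + 1 + depth]:
                node = node.setdefault(following, {})
    return tree
```

Siblings in the tree represent parallel adjacent segments, while a parent and its child represent successive levels of extension of the same key sentence.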
20. The search method according to claim 13, 14, or 15, comprising a selection operation:
namely, allowing the system to determine the corresponding key sentence according to the inquirer's cursor indication on the text or text summary on a page of the interactive interface, on a directory of key sentences or adjacent segments, or in a selection bar or box; and, for the various different extended key sentences, extended adjacent segments, indexes, bibliographic entries, text summaries, or texts corresponding to that key sentence, performing a grouping operation or directory display, or a sorted display of the corresponding indexes, bibliographic entries, text summaries, or texts; or performing a removal operation according to the determined key sentence, in which the entries, indexes, bibliographic entries, text summaries, or texts containing that key sentence on that page or on multiple other pages are deleted or moved.
21. The search method according to claim 11, comprising an ignore operation:
namely, determining the inquirer's current position in browsing an index or text sequence from the inquirer's operations on or within pages of the interactive interface while browsing a sequence of indexes, text summaries, bibliographic entries, or texts that contain the original keyword or the original key sentence; if it can be determined that the indexes, text summaries, or texts containing a certain key sentence, or that key sentence itself, listed within a certain range before that position have consistently, or a certain number of consecutive times, not been opened or followed as links, nor otherwise clicked, attended to, or marked for retention, then performing a removal operation according to that key sentence, in which the entries, indexes, text summaries, or texts containing that key sentence on that page or on multiple other pages are deleted or moved.
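The ignore operation can be sketched under simplifying assumptions: the result list is a sequence of document ID and key sentence pairs, the inquirer's position is an index into that list, and a key sentence counts as ignored when it appears at least `threshold` times before the position without any of those entries being opened. Removing only the entries at or after the position is also an assumption, as are all names in the code.

```python
def apply_ignore_rule(results, position, clicked, threshold=3):
    """Drop not-yet-viewed results whose key sentence the inquirer keeps skipping.

    results:   list of (doc_id, key_sentence) pairs in display order
    position:  index of the entry the inquirer is currently viewing
    clicked:   set of doc_ids the inquirer has opened
    threshold: number of unopened occurrences before a key sentence is ignored
    """
    seen = {}
    touched = set()
    for doc_id, sentence in results[:position]:
        seen[sentence] = seen.get(sentence, 0) + 1
        if doc_id in clicked:
            touched.add(sentence)
    # A key sentence is ignored if it was shown often enough and never opened.
    ignored = {s for s, n in seen.items() if n >= threshold and s not in touched}
    kept = results[:position] + [e for e in results[position:] if e[1] not in ignored]
    return kept, ignored
```

A fuller implementation would also need to honor the claim's other signals, such as entries the inquirer marked for retention, which this sketch does not model.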
22. A search engine system that provides desired search results in response to a request submitted by an inquirer via an interactive interface, comprising:
a server coupled, via a communication network or line, to the client on which the interactive interface resides; a search engine located on the server, the search engine comprising: a database including a keyword index, and a querier capable of querying the database according to the keyword request submitted by the inquirer and providing a list of the relevant data found to the interactive interface;
characterized in that:
the database further includes text summaries or texts containing the keywords;
the querier or search engine includes a keyword extension component that can perform an extension operation on the queried keyword: taking each occurrence of the keyword in the text content or text summary that contains it, together with its adjacent segment, as a distinct extended key sentence, and either listing these sentences, or listing the different adjacent segments of the keyword, for the inquirer to select via the interactive interface, or retrieving, processing, arranging, or organizing the different indexes, text summaries, or texts that contain the same or different extended key sentences, for the inquirer to select via the interactive interface;
the keyword extension component can likewise perform one or more extension operations on a key sentence;
or perform the corresponding query or processing.
23. The search engine system according to claim 22, wherein:
the system includes a tree directory of adjacent segments that reflects the sequential relationships among the adjacent segments, at different levels, of a keyword in the texts, text summaries, or bibliographic entries sharing that keyword, or a tree directory that reflects the sequential relationships among the key sentences extended from that keyword at different levels.
24. The search engine system according to claim 22, wherein:
the search engine system may further include a graphical user interface that allows the inquirer to add additional query information; the interface may contain a dialog box or selection box to receive the inquirer's choices regarding operation method, mode, and the like; alternatively, the interface may contain clickable key sentences, adjacent segments, sentences, paragraphs, operation commands, or selectable text, symbols, or graphics that allow the inquirer to add additional query information.
25. A computer-readable medium having a tree directory of adjacent segments that reflects the sequential relationships among the adjacent segments, at different levels, of a keyword in the texts, text summaries, or bibliographic entries sharing that keyword, or a tree directory that reflects the sequential relationships among the key sentences extended from that keyword at different levels.
26. A method performed by a computer system or search system, which can, whether after the fact or in advance, obtain mutually distinct key sentences from an index or file sequence containing any keyword actually submitted or potentially submitted, and arrange the indexes or files containing the same key sentence into respective groups, or obtain the required directory and sequence of example sentences.
PCT/CN2008/000190 2007-02-15 2008-01-25 Convenient method and system of electric text processing and retrieve WO2008098467A1 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN200710079309.4 2007-02-15
CN200710079309 2007-02-15
CN200710087104.0 2007-03-21
CN 200710087104 CN101063975A (en) 2007-02-15 2007-03-21 Method and system for electronic text-processing and searching
CN200710147578 2007-08-28
CN200710147578.X 2007-08-28

Publications (1)

Publication Number Publication Date
WO2008098467A1 true WO2008098467A1 (en) 2008-08-21

Family

ID=39689642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/000190 WO2008098467A1 (en) 2007-02-15 2008-01-25 Convenient method and system of electric text processing and retrieve

Country Status (1)

Country Link
WO (1) WO2008098467A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08314975A (en) * 1995-05-22 1996-11-29 Matsushita Electric Ind Co Ltd Information retrieving device
CN1306258A (en) * 2001-03-09 2001-08-01 北京大学 Method for judging position correlation of a group of query keys or words on network page
CN101063975A (en) * 2007-02-15 2007-10-31 刘二中 Method and system for electronic text-processing and searching

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111475536A (en) * 2019-01-23 2020-07-31 百度在线网络技术(北京)有限公司 Data analysis method and device based on search engine
CN111475536B (en) * 2019-01-23 2023-10-17 百度在线网络技术(北京)有限公司 Data analysis method and device based on search engine
CN111680130A (en) * 2020-06-16 2020-09-18 深圳前海微众银行股份有限公司 Text retrieval method, device, equipment and storage medium
CN111767719A (en) * 2020-06-29 2020-10-13 北京百度网讯科技有限公司 Bibliographic generation method and device, computer system and readable storage medium
CN111767719B (en) * 2020-06-29 2023-08-22 北京百度网讯科技有限公司 Method and device for generating title, computer system and readable storage medium
CN113240486A (en) * 2021-05-10 2021-08-10 北京沃东天骏信息技术有限公司 Traffic distribution method and device in search scene
CN116167352A (en) * 2023-04-03 2023-05-26 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium
CN116167352B (en) * 2023-04-03 2023-07-21 联仁健康医疗大数据科技股份有限公司 Data processing method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US9323827B2 (en) Identifying key terms related to similar passages
JP5531033B2 (en) Methods and systems
Chirita et al. P-tag: large scale automatic generation of personalized annotation tags for the web
US7958128B2 (en) Query-independent entity importance in books
CN100501745C (en) Convenient method and system for electronic text-processing and searching
US20180004850A1 (en) Method for inputting and processing feature word of file content
US20070250501A1 (en) Search result delivery engine
Manjari et al. Extractive Text Summarization from Web pages using Selenium and TF-IDF algorithm
CA2637239A1 (en) System for searching
CN101246484A (en) Electric text similarity processing method and system convenient for query
WO2008098467A1 (en) Convenient method and system of electric text processing and retrieve
Varadarajan et al. Beyond single-page web search results
Boughareb et al. A graph-based tag recommendation for just abstracted scientific articles tagging
Katta et al. An improved approach to English-Hindi based cross language information retrieval system
JP4009937B2 (en) Document search device, document search program, and medium storing document search program
Bhaskar et al. Cross lingual query dependent snippet generation
Tanaka-Ishii et al. Multilingual phrase-based concordance generation in real-time
Li et al. Web Search Based on Micro Information Units
Rao Recall oriented approaches for improved indian language information access
Hossain A Bangla Text Search Engine Using Pointwise Approach of Learn to Rank (LtR) Algorithm
Hema Extraction and Search of Relevant Chemical Documents from the Web
Onwuchekwa Indexing and abstracting services
KR101057997B1 (en) Search engines and search methods using initial text
Prakash et al. Design and Implementation of Novel Techniques for Content-Based Ranking of Web Documents
Shimizu et al. Development of an XML information retrieval system for queries on contents and structures

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08706405

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08706405

Country of ref document: EP

Kind code of ref document: A1