WO2008098467A1

WO2008098467A1 - Convenient method and system of electric text processing and retrieve

Info

Publication number: WO2008098467A1
Application number: PCT/CN2008/000190
Authority: WO
Inventors: Erzhong Liu
Original assignee: Erzhong Liu
Priority date: 2007-02-15
Filing date: 2008-01-25
Publication date: 2008-08-21

Abstract

A method and system for computer processing of a plurality of electronic texts containing the same keyword, including: obtaining a plurality of electronic texts containing the same keyword; defining the number of words in the segment adjacent to the keyword, or the way that adjacent segment is truncated; and, according to whether the adjacent segment of the keyword in the content of some or all of the texts is the same as that of other texts, classifying each text with the other texts into the same or different subsets or catalogues, or processing them in the same or different ways. The method can form a multi-level subset system, catalogue, or example sequence with non-repeating core content from a large number of keyword search results, helping the user narrow the search region rapidly and accurately and obtain the desired query result completely and accurately.

Description

Convenient method and system for electronic text processing and retrieval

(1) Technical field

The present invention relates to techniques for computer and search engine processing and retrieval of electronic text.

(2) Background technology

Over the decades, computer database retrieval technology has developed greatly. In particular, the advance of network technologies such as the World Wide Web has let the scale of the databases that people can share reach astronomical figures. To help users find the information or documents they need, classification or directory retrieval systems appeared. This technique works in mature classification fields that people know well, but in the broader field of massive information such systems are difficult to establish and difficult to master and use.

Search engine technology with keyword search at its core brings convenience to users. A search system centered on a search engine is generally located on one or more servers or other computer devices. It is composed of a text (page) library, a text index library, an index constructor that builds the text index by analyzing the texts in the text library, and a querier that searches and generates search results; it often also comes with a data collection server that collects texts from the Internet or other information sources and adds them to the text library. The system receives the inquirer's keyword query request through the interactive interface on the client and a communication network or communication line, queries the text index library or the text library, and performs correlation analysis between the keyword request and the texts. The relevant results are obtained and sorted, and then provided to the interactive interface via the communication network or line. This kind of search system is convenient and quick to use, but the total number of index entries in the returned result is still very large and difficult to check one by one.

Techniques have also been developed that compare keywords with the anchor-text descriptions of related texts to determine relevance, but they still do not satisfy searchers. To rank the query results most valuable to the inquirer as near the front as possible, U.S. Patent No. 6,285,999 proposes ranking search results based on analysis of the hyperlink structure of web pages (page links). This technology surpassed other sorting techniques, was adopted by Google, and achieved unprecedented success.

However, this technique, like various other sorting techniques, improves the efficiency of keyword search only in a statistical sense; it does not guarantee that each person's desired query results will be ranked at the front of a large index table. For example, searching the "Google" Chinese website for the word "Bulin" returns nearly 300,000 index entries. We still cannot be sure of seeing the desired content near the front without any omission, in a way that is both rigorous and convenient. Meanwhile, before reaching the expected information, we do not want to read irrelevant information or information whose main content is repeated.

To solve this problem, people have tried over the past ten years to develop various new search engine technologies. Examples include the "Priority List of Importance" technology of U.S. Patent No. 6,421,675 (incorporated herein by reference); the technique of forming a dynamic object table based on the history of the user's query data; the technique of "communicating information with other inquirers" of Chinese patent CN1151457; and the technique of measuring the similarity of electronic texts of U.S. Patent No. 6,990,628. These techniques have certain advantages, but their results are very limited.

The technique of U.S. Patent No. 7,089,236 performs semantic analysis on the keywords proposed by an inquirer and presents the different possible semantics on the interactive interface to help the inquirer narrow the search range. The similar technology of Chinese Patent Application No. 200510081867.5 uses web page category information to spread out the keyword search results of the search engine. Both technologies share several problems. First, they require building a very complex and large classification database that cannot be fully accurate. It is very difficult for a machine to judge which semantics or categories a given keyword or text belongs to, so reliability is not high. The different semantics or categories of a keyword are likely to leave gaps between them or to overlap. If the number of classification levels is increased, the overlap causes a surge in storage space. At the same time, inquirers using keyword search often face unfamiliar fields and find it difficult to grasp the many semantics or classifications accurately. All of this seriously hampers the improvement of query efficiency.

Therefore, there is an urgent need for a keyword search engine technology that is both rigorous and efficient, and that can effectively help the inquirer narrow the scope of the search, even several times in succession. The boundaries between the different ranges must be clear and easy to judge, with no overlap and no gaps, so as to greatly speed up the inquirer's access to the desired result while ensuring the rigor of the search. This has been a worldwide problem left unsolved for many years.

(3) Summary of the invention

The object of the present invention is to provide a technique for electronic text processing and retrieval or search by a computer or a search engine. When the user performs a keyword search and faces a large number of search results, the technique can quickly and rigorously narrow the search range, or eliminate all kinds of irrelevant or duplicate information, and accurately obtain the desired result with little omission.

One aspect of the present invention provides a computer-implemented method of processing a plurality of electronic texts containing the same keyword, including:

Obtaining multiple electronic texts containing the same keyword; specifying the number of words in the adjacent segment or the way the adjacent segment is cut off; and, depending on whether the adjacent segment of the keyword in the content of each of some or all of the texts is the same as or different from that of the other texts, dividing the texts into the same or different subsets or categories, or correspondingly processing them in the same or different ways.

The corresponding same or different processing may include: giving the corresponding texts the same or different distribution positions or storage manners; giving them the same or different subset marks; giving their indexes the same or different marks or index items; giving them the same or different arrangement; giving them the same or different display modes or positions in the interactive interface; or combining, sorting, or displaying on the interface, across subsets, one or more adjacent segments or texts from at least some of the subsets.

The text may be an electronic file or a web page, or its abstract, index, or title.

It may also be any of the information contents of databases, books, dictionaries, manuals, and patent documents.
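The grouping step of this aspect can be illustrated with a minimal sketch. It assumes whitespace-separated words, a one-word keyword, and an adjacent segment made of the words that follow the keyword's first occurrence; the function names `adjacent_segment` and `group_by_adjacent_segment` and the sample texts are hypothetical and do not come from the source.

```python
from collections import defaultdict

def adjacent_segment(text, keyword, n_words=1):
    """Return the n_words words following the first occurrence of keyword, or None."""
    words = text.lower().split()
    if keyword not in words:
        return None
    i = words.index(keyword)
    return tuple(words[i + 1 : i + 1 + n_words])

def group_by_adjacent_segment(texts, keyword, n_words=1):
    """Put texts whose keyword has the same adjacent segment into the same subset."""
    subsets = defaultdict(list)
    for doc_id, text in texts.items():
        segment = adjacent_segment(text, keyword, n_words)
        if segment is not None:
            subsets[segment].append(doc_id)
    return dict(subsets)

# Hypothetical sample texts, keyed by an illustrative document ID.
texts = {
    "d1": "the apple pie recipe is easy",
    "d2": "an apple pie for dessert",
    "d3": "the apple tree in bloom",
}
subsets = group_by_adjacent_segment(texts, "apple")
```

Because every text falls into exactly one subset per segment length, the subsets neither overlap nor leave gaps, which is the property the description emphasizes.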

The above-mentioned adjacent segment, or indirectly adjacent segment, may precede or follow the keyword. It is generally a segment of the text content composed of one or more characters, words, or even word roots, including certain characters such as abbreviations and punctuation when needed. In some cases, when judging whether segments are the same or different, differences in the prefixes or suffixes of some words, in function words or non-content words, or in punctuation or spaces may be ignored. When necessary, numerals, quantifiers, adjectives, or adverbs may or may not be taken into account, and even the presence or absence of articles or conjunctions may or may not be considered.

When the keyword used for retrieval consists of several separable words, the adjacent segment may refer to the adjacent segment of one of those words (such as the first word) or of the several words together.

The number of words or characters contained in the adjacent segment, or the manner of cutting off the adjacent segment or its specific content, may be predetermined, or agreed to, defaulted, or selected by the inquirer.

In some cases, when judging the length of a segment, just as when judging whether two segments are the same or different, the presence of or differences in the prefixes or suffixes of certain words, function words, auxiliary words, numerals, quantifiers, non-content words, punctuation, spaces, or even adjectives or adverbs may likewise be omitted or ignored.
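One way to picture this tolerant comparison is to normalize each segment before comparing. The sketch below is only an assumption about how such a rule could be applied: the ignored-word list `IGNORED_WORDS` and the function names are hypothetical, since the source does not enumerate which words or characters are to be disregarded.

```python
import string

# Hypothetical list of function words to disregard; not specified by the source.
IGNORED_WORDS = {"a", "an", "the", "and", "or", "of"}

def normalize_segment(segment):
    """Lower-case the segment, strip punctuation, and drop ignored function words."""
    words = []
    for raw in segment.lower().split():
        word = raw.strip(string.punctuation)
        if word and word not in IGNORED_WORDS:
            words.append(word)
    return tuple(words)

def same_segment(a, b):
    """Two segments count as the same if their normalized forms are equal."""
    return normalize_segment(a) == normalize_segment(b)
```

For instance, `same_segment("the Pie,", "pie")` is true under these assumptions, while `same_segment("pie", "tree")` is not.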

The benefits of the method of the invention for retrieval are significant. When the inquirer is interested in a certain adjacent segment of the keyword, it is easy to obtain all the texts of the category containing that adjacent segment; conversely, when the inquirer is not interested, it is easy to skip those texts.

The key point of the present invention is that the content adjacent to the keyword is what most likely determines the specific meaning or direction of the keyword in the text, which should be of most interest to the searcher. At the same time, if applied appropriately, the method can completely avoid the "overlap and gaps between the contents of different categories or subsets" that is difficult to avoid with classification retrieval methods, a phenomenon that ultimately makes a multi-level classification subset system difficult to use. This is why the search effect of the method or system of the present invention will be significantly improved.

The processing method may further include: for the different texts, or their contents, that belong to one or more of the same first-level or higher-level subsets and contain the same keyword and adjacent segment, dividing some or all of the texts into the same or different lower-level or multi-level subsets of the above subsets, or correspondingly processing them in the same or different ways, according to whether the further segments adjacent to that adjacent segment are the same or different.

This in effect subdivides the original subset sharing the same adjacent segment into a number of lower-level subsets. The described processing method allows successive adjacent segments to be merged or separated so as to reduce or increase the number of subset levels.
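The multi-level subdivision can be sketched as a nested dictionary in which each level is keyed by the next one-word adjacent segment. This is an illustrative sketch under stated assumptions (whitespace tokenization, segments following the keyword, one word per level); `build_segment_tree` and the reserved key `"_docs"` are hypothetical names.

```python
def build_segment_tree(texts, keyword, depth=2):
    """Nest document IDs by the successive words that follow the keyword."""
    tree = {}
    for doc_id, text in texts.items():
        words = text.lower().split()
        if keyword not in words:
            continue
        i = words.index(keyword)
        node = tree
        # Descend one level per adjacent word, creating subsets as needed.
        for word in words[i + 1 : i + 1 + depth]:
            node = node.setdefault(word, {})
        # "_docs" is a reserved key (an assumption of this sketch) holding the texts.
        node.setdefault("_docs", []).append(doc_id)
    return tree

texts = {
    "d1": "apple pie recipe",
    "d2": "apple pie crust",
    "d3": "apple tree",
}
tree = build_segment_tree(texts, "apple")
```

Raising or lowering `depth` corresponds to increasing or reducing the number of subset levels.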

The processing method may further include:

Arranging a directory, tree directory, or sequence of one or more levels that reflects the parallel or sequential relationship of the different adjacent segments or indirectly adjacent segments of the same keyword in the texts, or of the statements, example sentences, or summary instances containing those segments.

The processing method may further include: for a keyword's adjacent segment or indirectly adjacent segment in the directory, tree directory, or sequence, if it has only one adjacent segment at the next level or levels, that segment may be distributed, stored, or displayed at its original location together with its next-level adjacent segment or segments.
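The merging of a segment with its only next-level segment can be sketched as follows. The tree is assumed, for illustration only, to be a nested dictionary mapping each segment to its child segments; `collapse_single_chains` is a hypothetical name.

```python
def collapse_single_chains(tree):
    """Merge each segment with its sole child segment, repeatedly, until it branches or ends."""
    collapsed = {}
    for segment, children in tree.items():
        while len(children) == 1:
            [(child, grandchildren)] = children.items()
            segment = f"{segment} {child}"
            children = grandchildren
        collapsed[segment] = collapse_single_chains(children)
    return collapsed

tree = {
    "pie": {"recipe": {"book": {}}},
    "tree": {"house": {}, "top": {}},
}
collapsed = collapse_single_chains(tree)
```

Here the chain "pie", "recipe", "book" is shown as the single entry "pie recipe book", while "tree" keeps its two parallel children.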

The processing method may further include:

Arranging or acquiring a plurality of charts of single-level or multi-level classification trees or example-sentence sequences for texts containing any keyword and any adjacent segment or group of adjacent segments, wherein the plurality of texts to which each chart relates contain the same keyword and the same adjacent segment or group of adjacent segments. The classification is made according to whether the segments adjacent to the keyword, or adjacent to that adjacent segment, are the same in the respective texts, or is further made step by step according to whether the adjacent segments at the following levels are the same.

The processing method may further include: in or near the above texts, directories, sentences, example sentences, or summary instances, or the keywords, adjacent segments, or indirectly adjacent segments they contain, providing a hint of the number of their corresponding parallel subsets or subordinate subsets, or of the number of parallel subsets, subordinate subsets, or texts contained in the subset of the related words or segments. The processing method of the present invention may further include:

Arranging a sequence of a plurality of texts, or partial text contents, that contain the same keyword and whose adjacent segments, each composed of several words, are different or substantially different from one another. In other words, among the texts or partial text contents containing the same adjacent segment, only one or a few are presented as representatives.

In this way, a representative information sequence for the keyword, reduced by about half in number, can replace the original mass of information, which makes reading more convenient.
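A representative sequence of this kind can be sketched by keeping only the first text seen for each distinct adjacent segment. The sketch assumes whitespace tokenization and a two-word segment following the keyword; `representative_sequence` and the sample texts are hypothetical.

```python
def representative_sequence(texts, keyword, n_words=2):
    """Keep one text per distinct adjacent segment, in the original order."""
    seen = set()
    representatives = []
    for doc_id, text in texts.items():
        words = text.lower().split()
        if keyword not in words:
            continue
        i = words.index(keyword)
        segment = tuple(words[i + 1 : i + 1 + n_words])
        if segment not in seen:
            seen.add(segment)
            representatives.append(doc_id)
    return representatives

texts = {
    "d1": "apple pie recipe is easy",
    "d2": "apple pie recipe for beginners",
    "d3": "apple tree care guide",
}
reps = representative_sequence(texts, "apple")
```

The texts "d1" and "d2" share the segment "pie recipe", so only the first of them appears in the sequence.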

The processing method of the present invention may further include:

Comparing the different adjacent segments of the same keyword in the plurality of texts for similarity, and dividing them into subsets of mutually similar segments, or arranging them into sequences or directories whose entries are not similar to one another.
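The source does not say how similarity is to be measured. As one possible illustration, the sketch below substitutes the character-based ratio of Python's standard `difflib.SequenceMatcher` and a greedy grouping with an assumed threshold of 0.8; the function name `group_similar` and the threshold are hypothetical choices, not the method of the invention.

```python
from difflib import SequenceMatcher

def group_similar(segments, threshold=0.8):
    """Greedily place each segment in the first group whose first member is similar enough."""
    groups = []
    for segment in segments:
        for group in groups:
            if SequenceMatcher(None, group[0], segment).ratio() >= threshold:
                group.append(segment)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([segment])
    return groups

groups = group_similar(["pie recipe", "pie recipes", "tree care"])
# The first member of each group gives a sequence of mutually dissimilar entries.
dissimilar = [group[0] for group in groups]
```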

The method may further include: treating each of the above directories, sequences, or their charts as an independent text, so that each independent text has a title containing the corresponding keyword, or a title containing the corresponding keyword together with the corresponding shared adjacent segment, adjacent segment group, or similar-subset name;

If desired, providing, in or near the title of the independent text of the directory, sequence, or chart, a prompt character or combination of characters that indicates this feature.

The processing method may further include: enabling the keyword search result sorting device of the related search engine or search system to recognize the prompt character or combination of characters in the title of the independent text of the directory or sequence and, if desired, to place independent texts whose titles do or do not carry the corresponding character or combination at the front of the search results.

The processing method may further include: distributing part or all of the keyword index data (for example, a keyword inverted index list) of the directory texts or their titles in a corresponding dedicated area of the index database of the search engine or retrieval system.

The present technique allows an inquirer to indicate text, graphics, or symbols in a directory, sequence, or other content on an interactive interface, for example by clicking with a cursor, in order to confirm, expand, or link to related content.

The processing method of the present invention may further include: in the processing method or the directory, determining the specific sorting position of the parallel subsets, parallel adjacent segments, parallel texts, parallel text portions, or representative sequence information partly or wholly from factors of the relevant subset, text, paragraph, content, or information, such as the text's page link value, click rate, keyword occurrence rate, number of subordinate subsets or subordinate texts, subset click rate, average or maximum text page link value, position in the search results of the website or system, bid, spelling, stroke count, source rating, time of inclusion, and others, or from the value of a corresponding objective function.
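As an illustration of ordering parallel subsets by an objective function, the sketch below scores each subset with a weighted sum of two of the listed factors. The weighted-sum form, the factor names `click_rate` and `text_count`, the weights, and the figures are all assumptions made for the example; the source leaves the objective function open.

```python
def order_subsets(subsets, weights):
    """Sort subset names by a weighted sum of their statistics, highest score first."""
    def score(stats):
        return sum(weight * stats.get(name, 0.0) for name, weight in weights.items())
    return sorted(subsets, key=lambda name: score(subsets[name]), reverse=True)

# Hypothetical statistics for two parallel subsets of one keyword.
subsets = {
    "pie": {"click_rate": 0.2, "text_count": 40},
    "tree": {"click_rate": 0.6, "text_count": 10},
}
weights = {"click_rate": 100.0, "text_count": 1.0}
ordered = order_subsets(subsets, weights)
```

With these assumed weights "tree" scores about 70 and "pie" about 60, so "tree" is listed first.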

The processing method of the present invention may further include: allowing, in the method or result of the processing, additional keywords that must or must not be present to be added or removed, or limits on time, geography, language, or other types, scopes, or requirements to be added or removed, so as to obtain further refined or broader results.

Another aspect of the present invention is a computer data system including a storage device, wherein part or all of the keyword indexes, text summaries, or text data contained in the storage device or its data portion are distributed in the following manner:

The indexes, text summaries, or texts of texts that contain the same keyword and the same or different segments adjacent to that keyword are located in the same or different subsets of the set for that keyword.

Still another aspect of the present invention is a computer data system including a storage device, wherein the data structure of part or all of the keyword indexes contained in the storage device or its data portion includes at least:

A keyword segment;

One or more adjacent segments, each composed of a predetermined number of words that adjoin the keyword, one after another, in the corresponding text content or text summary, in order: adjacent segment 1, adjacent segment 2, ... adjacent segment N;

An ID segment of the corresponding text, or its associated information;

The ID segment refers to an address segment;

If necessary, a summary segment or title segment of the corresponding text that contains the keyword;

The system may allow the computer data system to search for the corresponding index or related content, directly or after transformation, according to the combination of the keyword segment and one or more of the adjacent segments, or the number of combined segments, given in the search specification.

Obviously, it can be specified that the minimum value of N is 1, in which case the index has only one adjacent segment.
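The index data structure listed above can be pictured as a simple record. The sketch below is one possible in-memory layout, not the storage format of the invention; the class name `IndexEntry`, its field names, and the helper `lookup` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class IndexEntry:
    keyword: str                          # keyword segment
    adjacent_segments: Tuple[str, ...]    # adjacent segment 1 .. N, in order
    text_id: str                          # ID (address) segment of the text
    summary: Optional[str] = None         # optional summary or title segment

def lookup(entries, keyword, segments=()):
    """Return entries matching the keyword and the given leading adjacent segments."""
    wanted = tuple(segments)
    return [
        entry for entry in entries
        if entry.keyword == keyword
        and entry.adjacent_segments[: len(wanted)] == wanted
    ]

entries = [
    IndexEntry("apple", ("pie", "recipe"), "d1"),
    IndexEntry("apple", ("pie", "crust"), "d2"),
    IndexEntry("apple", ("tree",), "d3"),
]
pie_ids = [entry.text_id for entry in lookup(entries, "apple", ("pie",))]
```

Searching with the keyword alone returns every entry, and each additional adjacent segment in the search specification narrows the result.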

The present invention may comprise a computer-readable medium having the data structure of the above keyword index, or a computer-readable medium storing instructions for implementing or utilizing the data structure of the above keyword index.

The present invention may be a computer-readable medium storing instructions executable by one or more processing devices for implementing a method of processing a plurality of electronic texts containing the same keyword, which may include:

Instructions for obtaining or searching for multiple electronic texts containing the same keyword;

Instructions for specifying the number of words contained in an adjacent segment or the manner in which adjacent segments are cut off;

Instructions for dividing the texts into the same or different subsets or categories, or correspondingly processing them in the same or different ways, according to whether the adjacent segment of the keyword in the content of each of some or all of the texts is the same as or different from that of other texts.

The corresponding same or different processing may include: giving the corresponding texts the same or different distribution positions or storage manners; giving them the same or different subset marks; giving their indexes the same or different marks or index items; giving them the same or different arrangement; giving them the same or different display modes or positions in the interactive interface; or combining, sorting, or displaying on the interface, across subsets, one or more adjacent segments or texts from at least some of the subsets.

Yet another aspect of the present invention provides a search method by which a search engine provides an inquirer with a desired result. The search engine system responds to a keyword query request submitted by the inquirer via an interactive interface, and searches system-related information sources or databases for, and provides, texts, text summaries, indexes, or related information that meet the keyword requirement. The search method is characterized in that it includes:

The system receives the inquirer's keyword query request via the interactive interface;

After confirmation, querying the database containing the keyword index according to the keyword requirement;

Taking the above keyword, as it appears in the text content or text summary containing it, together with its adjacent segment, as a key statement;

The number of words or characters included in the adjacent segment, or the manner in which the adjacent segment is cut off, is predetermined by the system, or agreed to, defaulted, or selected by the inquirer.

If necessary, it can also be determined from the symbols, characters, or words near the end of the adjacent segment, or from their font, color, or spacing, or from the position and manner of the inquirer's cursor indication in the selection bar presented by the interactive interface or on a page containing a specific index, text summary, text, or related content;

According to the above adjacent segments or key statements, collecting the different adjacent segments or different key statements;

Generating search results according to the key statements obtained, that is: retrieving, processing, arranging, or organizing the different indexes, text summaries, or texts that contain the same or different key statements, for the inquirer to select via the interactive interface.

If necessary, the above operations can be performed by the system in advance or at the time of the query.

The search method may further include:

Taking the above key statement, as it appears in the text content or text summary containing it, together with its adjacent segment, as an extended key statement;

The number of words or characters included in the adjacent segment, or the manner of cutting off the adjacent segment or its specific content, may be predetermined by the system, or agreed to, defaulted, or selected by the inquirer.

It can also be determined from the symbols, characters, or words near the end of the adjacent segment, or from their font, color, or spacing, or from the position and manner of the inquirer's cursor indication in the selection bar presented by the interactive interface or on a page containing a specific index, text summary, text, or related content;

According to the above adjacent segments or key statements, collecting the different adjacent segments or different extended key statements;

Generating search results based on the extended key statements obtained (the extension can be repeated several times if needed), that is: retrieving, processing, arranging, organizing, or separately storing the different indexes, text summaries, texts, or titles that contain the same or different extended key statements, for the inquirer to select via the interactive interface.

The above operations can be performed by the system in advance or at the time of the query, as needed.
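The repeated extension of a key statement can be sketched as a function that, given the current key statement, lists each one-word extension found in the texts together with the number of texts containing it. The sketch assumes whitespace tokenization, extension by the following word only, and counting the first occurrence per text; `extensions` is a hypothetical name.

```python
from collections import Counter

def extensions(texts, key_statement):
    """Count, per text, the one-word extensions that directly follow key_statement."""
    key = key_statement.lower().split()
    counts = Counter()
    for text in texts:
        words = text.lower().split()
        # Stop early enough that a following word always exists.
        for i in range(len(words) - len(key)):
            if words[i : i + len(key)] == key:
                counts[" ".join(key + [words[i + len(key)]])] += 1
                break  # only the first occurrence in each text is counted
    return counts

texts = [
    "apple pie recipe",
    "apple pie crust",
    "apple tree",
    "fresh apple pie recipe",
]
first_level = extensions(texts, "apple")
second_level = extensions(texts, "apple pie")
```

The inquirer would pick one extended key statement from `first_level` (for example "apple pie") and the same function is then applied again to narrow the range further.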

When necessary, the cut-off position of the adjacent segment mentioned above may be fixed simply by specifying the number of words in each adjacent segment, for example one word.

The processing method, data system, or search method may be used in an Internet search engine system, or in a local or stand-alone computer information base search system, such as a digital library system or a digital document database system. In this way, the preliminary results of the keyword search are subdivided into a subset system from which users can conveniently choose. The search method may also include a grouping operation as needed:

That is, the various different key statements, adjacent segments, indexes, titles, text summaries, or texts that contain the same keyword, or that contain the same original key statement, are each grouped or displayed in a catalog or sequence, in which only one or a few key statements, indexes, titles, text summaries, or texts are included for each adjacent segment.

This grouping gives the entries a certain generality or representativeness, which helps users make selections from a small amount of on-screen information. When the key statements chosen reach a certain length, the core contents of the entries in the index, title, or summary sequence will not repeat one another.

The search method may further include:

Storing the data of some or all of the keyword indexes, text summaries, or texts in different or identical subset areas, or different or identical lower-level subset areas, according to whether their keywords, key statements, or extended key statements are different or identical. This allows the data of the indexes, titles, text summaries, or texts for a key statement or keyword to be extracted or provided directly when the keyword is searched.

The search method may also include:

Parsing the texts or abstracts in the database, or the texts obtained from the Internet or other information sources by the data collection server attached to the search engine, and generating and storing an index that contains at least the keyword segment, the adjacent segments, and the ID segment (address segment) of the text or related content, including a text summary or title if necessary. At search time, the corresponding index, abstract, title, or text can then be retrieved from storage and provided according to the keyword segment and the adjacent segments it contains.

The text address, which may be a database address, an Internet address or URL, or a hash of the URL field, can provide a link for accessing or opening the text.
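The parsing and index-generation step can be sketched as follows. The sketch assumes whitespace tokenization, one-word adjacent segments following each word, and an ID segment formed as the SHA-1 hex digest of the URL; the source only mentions "a hash" of the URL, so the choice of SHA-1, the function names, and the example URLs are assumptions.

```python
import hashlib

def text_id(url):
    """Assumed ID scheme: SHA-1 hex digest of the URL."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

def build_index(pages, n_segments=2):
    """Map each word to (following one-word segments, text ID) records."""
    index = {}
    for url, text in pages.items():
        words = text.lower().split()
        for i, word in enumerate(words):
            segments = tuple(words[i + 1 : i + 1 + n_segments])
            index.setdefault(word, []).append((segments, text_id(url)))
    return index

# Hypothetical pages keyed by illustrative URLs.
pages = {
    "http://example.com/a": "apple pie recipe",
    "http://example.com/b": "apple tree care",
}
index = build_index(pages)
```

A hashed ID cannot be turned back into a link by itself, so a real system would also need a table from ID to address; that table is omitted here.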

The search method may further include:

Providing, for use in querying, a tree-like directory (Fig. 8) that reflects the sequential or parallel relationships between the adjacent segments of the keyword in the texts or text summaries having the same keyword, or a tree-like directory that reflects the sequential or parallel relationships between key statements at different levels of expansion of the keyword.

The search method may also include a selection operation:

That is, the system is allowed to determine the corresponding key statement from the inquirer's cursor indication or selection on a page of the above texts, text summaries, titles, key statements, or adjacent segment directory, or in a selection bar or box, and then to perform a grouping operation or catalog display of the various different extended key statements, extended adjacent segments, indexes, text summaries, texts, or titles corresponding to that key statement, or to sort and present the corresponding indexes, text summaries, texts, or titles, or to perform a removal operation based on the determined key statement, removing from the current page or other pages, or moving back, the entries, indexes, text summaries, texts, or titles that contain the key statement.

The search method may also include an ignore operation:

While the inquirer browses a sequence of indexes, text summaries, titles, or texts containing the original keyword or the original key statement, the system determines, from the inquirer's page operations on the interactive interface, the inquirer's current position in the sequence. If it can be determined that, within a certain range before that position, the indexes, text summaries, titles, or texts containing a certain key statement, or the key statement itself, have gone a certain number of times without being opened, linked, clicked, followed, or otherwise marked for retention, a removal operation is performed according to that key statement: the indexes, text summaries, titles, or texts containing the key statement are removed from the current page or other pages, or moved back.

In this way, information on files similar to those the inquirer has long ignored while reading can be moved back in, or eliminated from, the remainder of the sequence, reducing the burden of excessive useless information.
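A minimal sketch of the ignore operation follows. It assumes the result sequence is a list of (document ID, key statement) pairs, that the inquirer's position is an index into that list, and that a key statement is ignored once it has appeared at least `min_seen` times before the position without any click; the function name and the threshold are hypothetical.

```python
def apply_ignore(results, position, clicked, min_seen=2):
    """Drop later results whose key statement was repeatedly passed over without a click."""
    seen = {}
    for doc_id, statement in results[:position]:
        seen.setdefault(statement, []).append(doc_id)
    ignored = {
        statement
        for statement, doc_ids in seen.items()
        if len(doc_ids) >= min_seen and not any(d in clicked for d in doc_ids)
    }
    # Results already viewed are kept; only the remainder of the sequence is filtered.
    remainder = [r for r in results[position:] if r[1] not in ignored]
    return results[:position] + remainder

results = [
    ("d1", "apple pie"),
    ("d2", "apple pie"),
    ("d3", "apple tree"),
    ("d4", "apple pie"),
    ("d5", "apple tree"),
]
filtered = apply_ignore(results, position=3, clicked={"d3"})
```

Moving the ignored entries to the end of the sequence instead of dropping them would be an equally valid reading of the description.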

Another aspect of the present invention provides a search engine system that provides a desired search result in response to a request made by the inquirer 1 via the interactive interface 2, including:

A server 5, connected via the communication network 4 or a line to the client 3 where the interactive interface 2 is located;

A search engine 8 on the server 5, the search engine 8 comprising: a database 9 including a keyword index, and a querier 11 capable of querying the database 9 in accordance with a keyword request by the inquirer and providing a list of the related data found to the interactive interface 2;

Characterized in that:

The database 9 also includes text summaries or texts containing the keywords, which may be included within the keyword index;

The querier 11 or the search engine 8 includes a keyword expansion component 10 that can perform an expansion operation on the keyword of the query: taking the keyword, as it appears in the text content or text summary containing it, together with its adjacent segments, as different extended key statements, and listing the different adjacent segments of the keyword for the inquirer to select via the interactive interface, or retrieving, processing, arranging, or organizing the different indexes, text summaries, or texts that contain the same or different extended key statements for the inquirer to select via the interactive interface;

The keyword expansion component 10 can also perform one or more further expansion operations on key statements and perform the corresponding queries or processing.

The "keyword of the query" described herein may refer to "a keyword being queried" or "a keyword to be queried". The search engine system described above may be a search system located on the Internet that serves online clients, or a stand-alone computer information base search system. The server 5 is a computer storage and processing device, which may be a single device, a group of devices, or a distributed configuration. The client 3 can be a personal computer, workstation, or other computer device, configured with an appropriate browser as needed.

The search engine system may further provide that: the search engine includes an index construction component 13 that parses the texts in the database, or the texts obtained by the data collection server 12 attached to the search engine from the Internet 4 or other information sources, generates an index containing at least the keyword segment, the adjacent segments, and the text ID segment, and stores it.

When needed, the number of words in each adjacent segment can simply be specified here, for example one word.

The search engine system may further comprise a tree directory of adjacent segments reflecting the relationship between adjacent segments of the keyword in texts or text summaries having the same keyword (Fig. 8), or a tree directory reflecting the relationship between key statements at different levels of keyword expansion.

The present invention may further comprise a computer-readable medium holding a tree directory of adjacent segments that reflects the relationship between different levels of adjacent segments of the keyword in the text, text summary, or title, or a tree directory that reflects the relationship between key statements at different levels of keyword expansion.

Put another way, the computer-readable medium stores instructions that can be executed by one or more processing means to present the tree directory. The tree directory can at the same time reflect the parallel relationship between segments or key statements at the same level.

Each adjacent segment in the adjacent-segment tree directory, or each key statement in the tree directory reflecting the relationship between key statements at different levels of keyword expansion, may also display the number of its subsets, the number of its subsequent segments, or the number of files it contains.

The search engine system may further include a graphical user interaction interface (FIG. 10) that allows the queryer to add additional query information. The interface may include a dialog box or selection box 51 to receive the queryer's choice of operation manner or mode, and may contain clickable key statements, adjacent words, sentences, paragraphs, operation commands, or selectable text, symbols, or graphics through which the queryer can add additional query information.

The search technology of the invention, with keywords and adjacent words at its core, has the rigor of a dictionary and is clearly more convenient than the prior art in dividing and progressively narrowing the range of search results for the same keyword. It can often condense millions of pieces of online information containing the same keyword into a representative information sequence or directory for that keyword that is smaller by 2 to 3 orders of magnitude, in which the core content of each piece of information (the several adjacent words near the keyword) is neither repeated nor omitted, and so better meets a long-standing and urgent need of information search users.

(4) BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a schematic diagram showing an example of the number of words contained in an adjacent segment or the way in which adjacent segments are intercepted.

Figure 2 is a schematic diagram showing an example of a tree directory of different adjacent segments or corresponding subsets of the same keyword.

FIG. 3 is a schematic diagram showing examples of different similar subsets of the same keyword and subsets of different adjacent segments of the lower level.

Figure 4 is a block diagram showing the structure of an embodiment of a search system in accordance with the present invention.

Figure 5 is a schematic diagram showing the generation of key statements in accordance with one embodiment of the present invention.

FIG. 6 is a schematic diagram showing another key sentence generation manner according to an embodiment of the present invention.

FIG. 7 is a flow chart showing an exemplary operation of a user at an interactive interface according to an embodiment of the present invention.

FIG. 8 is a schematic diagram of a tree directory of adjacent segments showing the sequential relationship between different levels of adjacent segments of a keyword according to an embodiment of the present invention.

FIG. 9 is a flow chart showing the operation of a search engine according to an embodiment of the present invention.

FIG. 10 is a partial screenshot showing cursor clicks (selection operations) and the displayed results during a search according to an embodiment of the present invention.

Figure 11 shows an exemplary flow chart of one embodiment of the method of the present invention.

(5) Specific implementation

The invention is described in further detail below, building on the summary of the invention above and with reference to the accompanying drawings.

A computer-implemented method for processing a plurality of electronic texts containing the same keyword provided by the present invention includes, by way of example only:

First, obtain a plurality of electronic texts containing the same keyword from a computer, a database, or the Internet; the texts may be electronic files, documents, or web pages, or their abstracts, indexes, titles, or topics, or may be the various information content of databases, works, dictionaries, manuals, and patent documents.

Specify the number of words in the adjacent segment of the keyword in the text or the way in which the adjacent segment is intercepted:

Specifically, the adjacent segment generally refers to a directly adjacent segment, and may also be an indirectly adjacent segment when necessary. A directly adjacent segment has no intervening text between it and the keyword in the original text content; an indirectly adjacent segment is separated from the keyword in the original text content by a small amount of text (one or two words). A larger gap would clearly impair the effectiveness of the method.

The adjacent segment may precede or follow the keyword. It is generally a segment consisting of one or more words or characters, or even word roots, in the text content, and may include certain symbols when needed, such as abbreviations and punctuation marks.

The number of words or characters in the adjacent segment, or the manner of intercepting it or its specific content, may be predetermined by the computer system or agreed upon or selected by the queryer, or may be determined by the position and manner of the cursor indication on a selection bar presented by the interactive interface, or on a text summary, text, or related content of a specific index.

Figures 1, 5, and 6 show a few examples of the number of words in an adjacent segment or the way in which adjacent segments are intercepted. In the example of Figure 1, the keyword is "search engine". 101 means "take the 2 content words before the keyword"; 102 means "take the first 2 words after the keyword"; 103 means "take the 2 words after the keyword"; 104 means "take the content words after the keyword up to the first comma or period"; 105 means "take the words after the keyword up to the first comma or period, but not fewer than 2 words".
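The interception modes of Figure 1 can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function names, the tokenizer, and the sample sentence are hypothetical, and the sketch does not distinguish content words from function words as modes 101 and 104 do.

```python
import re

PUNCTUATION = {",", "."}

def tokenize(text):
    """Split text into word tokens plus the marks ',' and '.' (assumed tokenizer)."""
    return re.findall(r"[A-Za-z0-9']+|[,.]", text)

def find_keyword(tokens, keyword_tokens):
    """Return the index of the first occurrence of the keyword, or -1 if absent."""
    n = len(keyword_tokens)
    lowered = [t.lower() for t in tokens]
    target = [t.lower() for t in keyword_tokens]
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            return i
    return -1

def words_before(tokens, start, count):
    """Take up to `count` words immediately before the keyword (compare mode 101)."""
    words = [t for t in tokens[:start] if t not in PUNCTUATION]
    return words[-count:] if count > 0 else []

def words_after(tokens, end, count):
    """Take up to `count` words immediately after the keyword (compare modes 102, 103)."""
    words = [t for t in tokens[end:] if t not in PUNCTUATION]
    return words[:count]

def words_until_punctuation(tokens, end, minimum=0):
    """Take words after the keyword up to the first comma or period, continuing
    past punctuation until at least `minimum` words are collected (modes 104, 105)."""
    words = []
    for t in tokens[end:]:
        if t in PUNCTUATION:
            if len(words) >= minimum:
                break
            continue
        words.append(t)
    return words

tokens = tokenize("A professional search engine company, founded in 2001, builds tools.")
start = find_keyword(tokens, ["search", "engine"])
end = start + 2
print(words_before(tokens, start, 2))                    # ['A', 'professional']
print(words_after(tokens, end, 2))                       # ['company', 'founded']
print(words_until_punctuation(tokens, end))              # ['company']
print(words_until_punctuation(tokens, end, minimum=2))   # ['company', 'founded', 'in', '2001']
```

In this sketch the interception mode is just a choice of function and parameters, which corresponds to the mode being preset by the system or selected by the queryer.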

In some cases, when judging the length of a segment or whether segments are the same or different, the prefixes or suffixes of certain words, or the presence, absence, or difference of certain function words, auxiliary words, numerals, quantifiers, punctuation marks, or spaces, may be omitted or disregarded (see Example A below); even the presence, absence, or difference of adjectives or adverbs may be omitted or disregarded.

When the keyword used for retrieval consists of multiple separable words, the adjacent segment may refer to the adjacent segment of one of those words (such as the first word) or to the adjacent segments of several of them. In the latter case, it may be necessary to compare the adjacent segments of the different parts of the keyword to determine whether the adjacent segments of different texts are identical.

When the same keyword appears multiple times in a text, one may consider only the adjacent content of any one occurrence of the keyword, or the text may be appropriately split and treated as multiple texts. This is useful when searching longer texts.

According to whether the adjacent segment of the keyword in each text, for some or all of the texts, is the same as or different from that in other texts, the texts are divided into the same or different subsets or categories, or are given correspondingly the same or different processing. (See Example A below)

In general, "identical" means that the two segments are exactly the same; in some cases, however, when judging whether segments are the same or different, the prefixes or suffixes of certain words, or the presence, absence, or difference of certain function words, auxiliary words, numerals, quantifiers, punctuation marks, or spaces, may be omitted or disregarded, and even the presence, absence, or difference of adjectives or adverbs may be omitted or disregarded.

For example, under loose criteria, "The power of science is very powerful" and "the science is very powerful" may, if needed, be regarded as two identical adjacent segments.
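The difference between strict and loose identity can be sketched in Python as follows. This is a minimal illustration only: the function-word list and the function names are assumptions made for the example, and a practical system would choose for itself which words, affixes, or punctuation to disregard.

```python
import re

# Hypothetical function-word list; the source does not enumerate one.
FUNCTION_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "to", "is", "are"})

def normalize(segment):
    """Lowercase, drop punctuation and spacing differences, and drop function words."""
    words = re.findall(r"[a-z0-9]+", segment.lower())
    return tuple(w for w in words if w not in FUNCTION_WORDS)

def same_segment(a, b, loose=False):
    """Strict mode requires exactly equal strings; loose mode compares
    normalized word sequences (word order still matters)."""
    if not loose:
        return a == b
    return normalize(a) == normalize(b)

print(same_segment("technology of the company", "Technology, company"))              # False
print(same_segment("technology of the company", "Technology, company", loose=True))  # True
print(same_segment("technology company", "company technology", loose=True))          # False
```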

Once the texts have been divided into the same or different categories according to whether the adjacent segment of the keyword in each text is the same as in other texts, the queryer can, depending on his or her interest in a particular adjacent segment of the keyword, directly obtain or skip, as a class, all texts in the category containing that adjacent segment.

The corresponding identical or different processing may include: giving the corresponding texts the same or different distribution positions or storage manners, or the same or different subset marks, or giving their indexes the same or different marks or index items, or giving them the same or different arrangement, or the same or different display modes or positions in the interactive interface, or allowing one or more adjacent segments or texts from at least some of the subsets to be combined, sorted, or displayed in the interactive interface across subsets. (See Example A below)

For different texts belonging to the same first-level or higher-level subset, or different texts whose contents contain the same keyword and adjacent segment, some or all of the texts may be divided into the same or different lower-level or multi-level subsets of that subset, or processed in correspondingly the same or different ways, according to whether their further adjacent segments, beyond the same keyword and adjacent segment, are the same or different.

The corresponding identical or different processing described here may likewise include: giving the respective texts the same or different distribution positions or storage manners, or the same or different subset marks, or giving their indexes the same or different marks or index items, or giving them the same or different arrangement, or the same or different display modes or positions in the interactive interface, or allowing one or more adjacent segments or texts from at least some of the subsets to be combined, sorted, or displayed in the interactive interface across subsets. (See Example A and Figure 10 below)

In effect, this further subdivides an original subset sharing the same adjacent segment into several lower-level subsets, by extending the adjacent segment of the original keyword and comparing whether the extensions are the same. If necessary, the process can continue until the queryer is satisfied with the results. This is another advantage of the method.

For example, we divide a plurality of texts with the keyword "search engine" into subsets by adjacent segment. The first adjacent segment is taken as "1 word before the keyword + 1 word after the keyword", which yields multiple subsets, among them the subset whose adjacent segment is "professional K company" (where K stands for the keyword "search engine"), containing 185 texts. If we then divide these 185 texts according to whether their second adjacent segments, formed by the three words after "K company", are the same, we obtain 13 second-level subsets with second adjacent segments such as "through professional technology" and "trying to open up the market". The texts in the second-level subset for the "through professional technology" segment can be divided further by the third adjacent segment (the following 2 words), yielding several third-level subsets. (Refer to Figure 2 and Figure 8.)
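The level-by-level division described in this example can be sketched in Python. The sketch is illustrative only: the four sample texts, the function names, and the whitespace tokenization are assumptions, and "K" stands in for the keyword as in the example above. Segment lengths follow the example: 1 word before plus 1 word after the keyword, then the next 3 words, then the next 2 words.

```python
def adjacent_segments(text, keyword="K", before=1, after_lengths=(1, 3, 2)):
    """Return one adjacent segment per level for the first keyword occurrence."""
    words = text.split()
    pos = words.index(keyword)  # raises ValueError if the keyword is absent
    first_end = pos + 1 + after_lengths[0]
    levels = [" ".join(words[max(0, pos - before):first_end])]
    cursor = first_end
    for length in after_lengths[1:]:
        levels.append(" ".join(words[cursor:cursor + length]))
        cursor += length
    return levels

def new_node():
    return {"texts": [], "children": {}}

def build_tree(texts):
    """Place each text ID in the subset of every level whose segment it shares."""
    root = new_node()
    for text_id, text in texts.items():
        node = root
        node["texts"].append(text_id)
        for segment in adjacent_segments(text):
            node = node["children"].setdefault(segment, new_node())
            node["texts"].append(text_id)
    return root

def show(node, depth=0):
    """Print each segment with the number of texts in its subset."""
    for segment, child in node["children"].items():
        print("  " * depth + f"{segment} ({len(child['texts'])})")
        show(child, depth + 1)

# Hypothetical sample texts, not taken from the source.
texts = {
    1: "a professional K company through professional technology serves many users",
    2: "the professional K company through professional technology builds new tools",
    3: "one professional K company trying to open up the market",
    4: "a different K technology by Yahoo and Google",
}
tree = build_tree(texts)
show(tree)
```

Here texts 1 to 3 fall into the first-level subset "professional K company", which then splits into two second-level subsets, and the subset for "through professional technology" splits again at the third level.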

The processing method of the present invention may also allow successive adjacent segments to be merged or split, reducing or increasing the number of subset levels. For example, for the same texts with the keyword "search engine", if we instead take the first adjacent segment as "1 word before the keyword + 4 words after the keyword", the number of first-level subsets obtained should equal the total number of second-level subsets obtained by the previous method; the results are similar, but there is one fewer subset level.

In fact, for the same large body of texts, the longer the adjacent segment of the keyword is set, the more subsets are obtained and the fewer texts each subset contains; conversely, the shorter the adjacent segment, the fewer subsets are obtained and the more texts each subset contains.
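The trade-off between segment length and subset count can be checked on a toy corpus. The following Python sketch is only an illustration; the sample texts and function names are hypothetical, and "K" again stands in for the keyword.

```python
from collections import Counter

def segment_after(text, keyword, length):
    """Return the `length` words that directly follow the keyword."""
    words = text.split()
    pos = words.index(keyword)
    return " ".join(words[pos + 1:pos + 1 + length])

def subset_sizes(texts, keyword, length):
    """Map each distinct adjacent segment to the number of texts sharing it."""
    return Counter(segment_after(t, keyword, length) for t in texts)

# Hypothetical sample texts.
texts = [
    "K company builds tools",
    "K company builds indexes",
    "K company sells ads",
    "K technology builds tools",
]
for length in (1, 2, 3):
    sizes = subset_sizes(texts, "K", length)
    print(length, len(sizes), max(sizes.values()))
# prints:
# 1 2 3
# 2 3 2
# 3 4 1
```

With a 1-word segment there are 2 subsets, the largest holding 3 texts; with a 3-word segment there are 4 subsets of 1 text each.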

Faced with the large number of subsets, or subordinate subsets, produced by the above method, and for ease of reference, we can arrange a one-level or multi-level directory, tree directory, or sequence that reflects the parallel or sequential relationships among the different adjacent segments or indirectly adjacent segments of the same keyword in the texts, or among the statements or example sentences containing those segments. In it, the adjacent segment or indirectly adjacent segment shared by each of the different subsets, or a statement, example sentence, or summary instance containing that segment, together with the corresponding segments, statements, example sentences, or summary instances of each lower-level subset of those subsets, are arranged, distributed, stored, or displayed in a parallel or subordinate relationship; the segments, statements, example sentences, or summary instances may be juxtaposed across subsets.

The directories shown in Figures 2 and 8 are two examples of such directories. Figure 2 shows a one-level or multi-level directory or tree directory reflecting the parallel or sequential relationships among the different adjacent segments of the keyword "search engine" in a plurality of texts. (The example shown in Figure 8 will be explained later.)

In the tree directory example of Figure 2, the keyword is "search engine", represented by the symbol "K". The first adjacent segment is taken as "1 word before the keyword + 1 word after the keyword", the second adjacent segment is the 3 words that follow, and the third adjacent segment is the 2 words after that.

If a directory listing only the different adjacent segments makes the core content hard to grasp, we may want to see more of the content containing each adjacent segment. We can therefore arrange a one-level or multi-level directory, tree directory, or sequence reflecting the different adjacent segments of the same keyword in the texts, or the parallel or sequential relationships among those segments, in which any adjacent segment of the original directory or tree directory is supplemented or replaced by more content containing that adjacent segment.

For example, the content may be a statement, example sentence, summary instance, bibliographic entry, or representative text containing the adjacent segment. The keywords and adjacent segments in the statement, example sentence, summary text, or representative text may be given fonts, colors, or other features that distinguish them from the other content; the segments, statements, example sentences, summary instances, or representative texts may be juxtaposed across subsets.

Of course, we can also arrange or obtain multiple charts of one-level or multi-level classification trees or example sentence sequences for multiple texts that contain any given keyword and any of its adjacent segments or adjacent segment groups. The texts covered by each chart contain the same keyword and the same adjacent segment or adjacent segment group, and are classified according to whether the next adjacent segment following that adjacent segment or adjacent segment group is the same in the respective texts, and then further, level by level, according to whether the adjacent segments at the following levels are the same.

It can also be said that, when needed, we can get a classification tree or example sentence sequence for any subset of the given keyword-related text and its subordinate subsets.

In fact, each subset or subordinate subset can be represented by its corresponding adjacent segment, or by a statement, example sentence, or summary instance containing that segment, so that the representative content of many subsets can form a directory or sequence on a limited interactive interface, and the queryer can select a subset of interest and the texts it contains by clicking on its representative content. For example, if we click the "Google Company" segment following "Different K Technology 672" in the directory shown in Figure 2, we obtain the directory or related content of all texts in the subset containing the phrase "Different K Technology by Yahoo! Google Company". The same result is obtained if we click an adjacent segment within the related content of the relevant sequence or directory.

In fact, the present technique allows the queryer to point at text, graphics, or symbols in a directory, sequence, or other content on the interactive interface, for example by clicking with the cursor, in order to select, expand, or link to related content.

We can also make the method more convenient. For example, for an adjacent segment or indirectly adjacent segment of the keyword in the directory, tree directory, or sequence, if it has only one adjacent segment at the next level (or at each of the next several levels), the segment may be distributed, stored, or displayed at its original position together with its next-level adjacent segment or segments.
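Merging a segment with its only next-level segment can be sketched as a recursive pass over a tree directory. In this Python illustration the tree is represented as nested dictionaries mapping each segment to its child segments; that representation, the function name, and the sample tree are assumptions made for the example.

```python
def collapse(tree):
    """Merge every segment that has exactly one child with that child,
    repeating down the chain, and return a new tree."""
    collapsed = {}
    for segment, children in tree.items():
        while len(children) == 1:
            (child_segment, grandchildren), = children.items()
            segment = f"{segment} {child_segment}"
            children = grandchildren
        collapsed[segment] = collapse(children)
    return collapsed

# Hypothetical tree directory; "K" stands in for the keyword.
tree = {
    "professional K company": {
        "through professional technology": {"serves many": {}, "builds new": {}},
        "trying to open": {"up the": {}},
    },
    "different K technology": {"by Yahoo and": {"Google": {}}},
}
print(collapse(tree))
```

After collapsing, "different K technology" is shown as the single entry "different K technology by Yahoo and Google", while segments with several next-level segments keep their hierarchy.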

For the convenience of the queryer, the present technology may also display, in or near the texts, directories, statements, example sentences, or summary instances described above, or near the keywords, adjacent segments, or indirectly adjacent segments they contain, a hint of the number of corresponding parallel subsets or subordinate subsets, or of the number of texts contained. (Figure 2)

The processing method of the present invention can also be used, for example, to arrange a representative sequence of multiple texts or partial text contents containing the same keyword, in which the multi-word adjacent segments of the keyword are different, or substantially different, from one another. In other words, among the texts or partial text contents containing the same adjacent segment, only one or a few are represented in the sequence.

The partial text content may be a digest, index, bibliographic entry, example sentence, phrase, or the like containing the same keyword. Put another way, the multi-word (two or more words) adjacent segments of the keyword contained in the items of the representative sequence are different, or substantially different, from one another. Multi-word segments generally reflect the core content surrounding the keyword better.
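A representative sequence of this kind can be sketched in Python by keeping only the first text seen for each distinct multi-word adjacent segment. The function name, the 3-word segment length, and the sample texts are assumptions made for illustration; "K" stands in for the keyword.

```python
def representative_sequence(texts, keyword, length=3):
    """Keep the first text for each distinct adjacent segment after the keyword."""
    first_seen = {}
    for text in texts:
        words = text.split()
        if keyword not in words:
            continue  # texts without the keyword are skipped
        pos = words.index(keyword)
        segment = tuple(w.lower() for w in words[pos + 1:pos + 1 + length])
        first_seen.setdefault(segment, text)  # dicts preserve insertion order
    return list(first_seen.values())

# Hypothetical sample texts.
texts = [
    "K company builds tools today",
    "Our K company builds tools daily",
    "K technology changes search habits",
    "no keyword here",
]
print(representative_sequence(texts, "K"))
# ['K company builds tools today', 'K technology changes search habits']
```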

In this way, the present method can condense millions of pieces of online information containing a keyword into a representative information sequence for that keyword that is smaller by 2 to 3 orders of magnitude, in which the core content of each item (the few adjacent words near the keyword) is neither repeated nor omitted. This is also a very effective way to distill the core content of web pages, and a significant improvement over the prior art, which can only eliminate mirrored web pages.

If the directory of different adjacent segments of the same keyword, or the representative information sequence, obtained by the above method still contains too much content, we can, for example, perform a similarity comparison among the different adjacent segments of the same keyword in the plurality of texts. Different adjacent segments that meet a certain similarity requirement are placed in the same similar subset, and those that do not are placed in different similar subsets; alternatively, adjacent segments that do not meet the similarity requirement with respect to one another may be arranged into a sequence or directory of mutually dissimilar segments. The content common to the elements of a similar subset may be used as the name or tag of that subset, and a sequence or directory of similar subset names may also be formed.

There are many similar comparison methods or similar requirements, which can be specified when needed:

The certain similarity requirements include at least the requirement for the number or proportion of the same word or word or phrase or character contained in different adjacent segments.

For example, for a sequence or directory of different adjacent segments each 4 words long (with or without function words), the similarity requirement may be that at least 4 or 3 words be the same between different adjacent segments (though not necessarily in the same order). The similarity requirement can be preset by the system or set by the queryer. In this example, the 4 or 3 shared words, separated by punctuation, can serve as the title or directory entry of the corresponding similar subset.

Figure 3 shows another example of similar subsets, formed by similarity comparison over the sequence of different adjacent segments (each taken as the 3 words following the keyword). The similarity requirement for different adjacent segments is preset by the system as "must have the same 3 words, in any order". The keyword of each text in this example is "search engine", represented by the symbol "K"; the name of each similar subset composed of different adjacent segments of the keyword is given by the words its members share, each represented by an uppercase letter. The different adjacent segments of the first similar subset shown in Figure 3 all contain the three words X, Y, and Z. Adjacent segments in the same similar subset may differ in the order of the shared words, forming different first adjacent segments, and these constitute the next level of subsets under the similar subset.
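The similarity requirement of the Figure 3 example, "the same 3 words, in any order", can be sketched in Python by grouping 3-word adjacent segments on their sorted words. The function name and sample segments are assumptions for illustration; looser requirements, such as 3 shared words out of 4, would need a different grouping rule and are not covered by this sketch.

```python
from collections import defaultdict

def similar_subsets(segments):
    """Group segments that contain the same words regardless of order.
    Each subset is named by its shared words, separated by commas."""
    groups = defaultdict(list)
    for segment in segments:
        key = tuple(sorted(segment.lower().split()))
        groups[key].append(segment)
    return {", ".join(key): members for key, members in groups.items()}

# Hypothetical adjacent segments (the 3 words following the keyword).
segments = [
    "fast reliable accurate",
    "accurate fast reliable",
    "reliable fast accurate",
    "cheap fast reliable",
]
for name, members in similar_subsets(segments).items():
    print(name, "->", members)
```

The first three segments fall into one similar subset named "accurate, fast, reliable", and each of them remains a distinct adjacent segment at the next level under that subset.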

If necessary, a subset containing a certain adjacent segment or an indirectly adjacent segment may be classified as a lower subset of the above-mentioned similar subset in which the adjacent segment is located, or a text containing a certain adjacent segment or a portion thereof The content is classified into the above-mentioned similar subset in which the adjacent segment is located.

Clearly, the similar subsets can be regarded as compiled on top of the original directory or sequence of adjacent segments, so the number of similar subsets, or the length of their directory, is significantly smaller than that of the original directory or sequence. The names of the similar subsets in the directory make it easier to see the main components of the related adjacent segments (several independent words in parallel). If interested, the queryer can open a similar subset to obtain its subordinate subsets, the different adjacent segments of the keyword, and related information.

The technique can adopt a more efficient approach: specify the number of words in the adjacent segment of the keyword as a value from 2 to 10, for example 6, so that processing yields a sequence or directory of different adjacent segments; if needed (for example, when there is too much content, more than a few hundred entries), a further similarity comparison can produce a directory or sequence of different similar subsets (reduced, for example, to a few dozen entries). This is very convenient for the queryer.

The method may also treat each of the directories or sequences, or their charts, as an independent text, giving each independent text a title or heading that contains the corresponding keyword, or the corresponding keyword together with the corresponding shared adjacent segment, adjacent segment group, or similar subset name. In this way, even if these independent texts use an ordinary web page or text storage method, the search engine system can, during data collection, index generation, or keyword querying, readily place each independent text among the search results for the keyword or key statement (that is, the keyword together with its adjacent segments) found in its title or heading.

If desired, prompt characters or combinations thereof may also be placed in or near the title or heading of the independent text of a directory, sequence, or chart.

The prompt characters or combinations thereof may include unique characters specified by the provider of the directory chart text, and may be accompanied by characters indicating the time or order in which the chart or text was created or published, or the number of entries in the directory shown in the chart.

For example, after the corresponding keyword (such as "Bulin") in the title, or after the keyword together with its adjacent segment (such as "Bulin line indicator") or adjacent segment group, one can add "directory" or ", example sentence" or ",, M" or "+ +L" or "==directory" or "==directory 09" (so that the title becomes "Bulin ==directory 09" or "Bulin line indicator ==directory 09"), or other agreed characters that search engines can easily identify. This makes searching more convenient and accurate for the queryer or the search engine.

We can also give the keyword search result sorting means of the relevant search engine or retrieval system the ability to recognize the characteristic prompt characters or combinations thereof in the title or heading of the independent text of a directory or sequence. For example, when necessary, the sorting means places texts whose title or heading contains "==directory" ahead of the other search results for the keyword. Thus, when the queryer appends the prompt characters "==directory" to the keyword (for example, entering "search engine ==directory" for the keyword "search engine"), or enters the keyword and then clicks the corresponding selection key of the interactive interface (such as a "directory" key), the charts of the classification directories of the texts containing "search engine" are ranked ahead of the other search results for "search engine".
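The prompt-character convention can be sketched in Python as a query parser plus a stable re-ranking step. This is an illustrative sketch only: the function names, the result record layout, and the sample titles are hypothetical, and "==directory" is just one of the prompt strings mentioned above.

```python
PROMPT = "==directory"

def parse_query(query, prompt=PROMPT):
    """Split a query into (keyword, wants_directory) based on a trailing prompt."""
    query = query.strip()
    if query.endswith(prompt):
        return query[:-len(prompt)].strip(), True
    return query, False

def rank_results(results, prompt=PROMPT):
    """Place results whose title contains the prompt first. sorted() is stable,
    so the existing order is kept within each of the two groups."""
    return sorted(results, key=lambda r: prompt not in r["title"])

# Hypothetical search results for the keyword "search engine".
results = [
    {"title": "search engine news"},
    {"title": "search engine ==directory 09"},
    {"title": "search engine history"},
]
keyword, wants_directory = parse_query("search engine ==directory")
ranked = rank_results(results) if wants_directory else results
print(keyword, [r["title"] for r in ranked])
```

Because only the sort key changes, an existing ranking function could remain in place and supply the order within each group, which matches the remark that prior-art search engines would not need major changes.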

It is worth noting that, in this way, regardless of whether the relevant directory or sequence charts come from the search engine's own database, another computer data processing system, or the Internet, a search engine based on the prior art can, without major changes, conveniently find the relevant directory or sequence charts and present them to the queryer first when searching for a keyword.

The processing method may further place the keyword index data (such as an inverted keyword table) of some or all of the provided directory texts, or of their titles or headings, in a dedicated index database of the search engine or retrieval system, or in a dedicated area of its index database. In this way, when the search engine system needs to provide the various directories and sequences to the queryer, it can perform the keyword search directly in the dedicated database or area, which can significantly improve search speed and keep information unrelated to the directories or sequences out of the search results.

The contents of the various directories or sequences obtained by the method of the present invention may be ordered randomly, or by well-known sorting techniques, or, as needed, the specific position of a parallel subset, parallel adjacent segment, indirectly adjacent segment, parallel text, statement, example sentence, summary instance, or item of representative sequence information may depend in part or in whole on one or more of the following factors:

The page link value, click rate, or keyword occurrence rate of the text, segment, statement, example sentence, or summary instance,

Or the number of subordinate subsets or subordinate texts of the subset, or the click rate of the subset, or the average page link value of the texts in the subset,

Or the number of subordinate subsets or subordinate texts of the segment, text, statement, example sentence, summary instance, or information item, or the number of subordinate subsets of its subset, or the average page link value of the texts in its subset,

Or the page link value of the text with the highest page link value in the subset or of another text instance, or the click rate or keyword occurrence rate of the text in the subset with the highest click rate or keyword occurrence rate or of another text instance,

Or the ranking of the related texts, or of the texts in the related subsets, in the search results of other search sites or search systems,

Or the payment or bidding level associated with the relevant text or the relevant adjacent segment,

Or the spelling of the words of the related adjacent segments, or their alphabetical order, pinyin order, or stroke count,

Or the source of the text or the rating of the issuing organization or person,

Or how new or old the relevant texts are,

Or whether they belong to the same subset at a certain level.

When needed, the specific sort position can be determined by an objective function value, which depends on one or more variables, and some or all of the variables of the objective function can respectively represent one or more of the factors listed above.

For example, an objective function value can be expressed as $F(x_1, x_2, \ldots, x_n)$.

For example, $F(x_1, x_2, \ldots, x_n) = F(x_1) + F(x_2) + \cdots + F(x_n)$, where $x_1, x_2, \ldots, x_n$ are one or more of the factors (variables) listed above, or other factors, that determine the specific ranking position. Since there are many specific treatment methods in the prior art (for example, US Pat. No. 6,285,999), it will not be described in detail here.
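An additive objective function of this form can be sketched in Python as a sum of one term per factor. The factor names, the per-factor term functions, and the sample values are all hypothetical; the source does not specify how each factor is weighted.

```python
def objective(factors, terms):
    """Additive objective value: the sum of one term function per factor."""
    return sum(terms[name](value) for name, value in factors.items())

# Hypothetical term functions and factor values, chosen only for illustration.
terms = {
    "link_value": lambda x: 2 * x,
    "click_rate": lambda x: x,
}
candidates = {
    "subset A": {"link_value": 3, "click_rate": 1},
    "subset B": {"link_value": 1, "click_rate": 4},
}
ranking = sorted(candidates, key=lambda name: objective(candidates[name], terms), reverse=True)
print(ranking)  # ['subset A', 'subset B']
```

Here subset A scores 2*3 + 1 = 7 and subset B scores 2*1 + 4 = 6, so subset A is listed first.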

The processing method of the present invention may also allow for the addition or subtraction of additional keywords that are or may not be available, or increase or decrease time or geographic or linguistic or other types or ranges or requirements, if desired. Limits, get further refined results or more broad results.

For example, the present invention allows for comparison of contiguous segments of more stringent requirements (eg, not ignoring differences in function words) for the content of subsets obtained by comparing adjacent segments under loose requirements (eg, ignoring differences in lexical words in adjacent segments) , and divide the next level of the subset or get a more detailed list of adjacent segments or corresponding information; or reverse the operation.

By adding or removing a keyword (such as "China"), or changing the limits on time (such as within six months or within two years), region (Hebei, Baoding, or North China), language (such as English or Spanish), or other types (such as items) or ranges (such as boys, children, or people), the scope of the results can easily be narrowed or expanded.

Still another aspect of the present invention is a computer data system including a storage device, wherein the data structure of some or all of the keyword indexes contained in the storage device or in its data portion includes at least:

A keyword segment;

One or more adjacent segments, each composed of a predetermined number of words adjacent to the keyword in the corresponding text content or text summary, in order: adjacent segment 1, adjacent segment 2, ..., adjacent segment N; and the ID segment of the corresponding text, or the ID segment of its associated information (where the ID segment refers to the address segment);

If necessary, a summary segment or a title segment of the corresponding text containing the keyword may also be included.

In general, keyword indexes are established so that search or retrieval systems can perform keyword retrieval. To support searches on multiple keywords, the same text often has multiple indexes, one for each of several different keywords. As an example of the present invention, the index data structure of a text for the keyword "Yangtze River" is as follows:

Keyword       | Adjacent segment 1  | Adjacent segment 2  | ... | Adjacent segment N | Text address | Title / content
Yangtze River | [garbled in source] | [garbled in source] | ... | Benefits           | XSTSS96      | -

With such a data structure, whether the search engine searches for "Yangtze River", for the longer search term "Yangtze River Basin", or for the still longer "Yangtze River Basin Hydraulics", it can conveniently access the index and then locate the text by its address, which facilitates the specific implementation of the invention. That is, the system may allow the computer data system to search for or retrieve the corresponding indexes or content according to the combination of the keyword segment with one or more adjacent segments, or the number of combined segments, specified in the search request.

For example, if each adjacent segment in the index is one word long, then once the keyword and each adjacent word of the query have been determined, the computer can easily obtain the indexes whose keyword segment and adjacent segment contents meet the query requirements.
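The index structure and lookup described above can be sketched as follows. This is only a minimal illustration, not the patented implementation: the class name `IndexEntry`, the function `lookup`, the text IDs, and the sample adjacent words are all hypothetical, and each adjacent segment is assumed to be one word long.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class IndexEntry:
    keyword: str               # keyword segment
    adjacent: Tuple[str, ...]  # adjacent segment 1 .. N, in order
    text_id: str               # ID (address) segment of the indexed text
    title: str = ""            # optional title segment


def lookup(index: Sequence[IndexEntry], keyword: str,
           adjacent_query: Sequence[str] = ()) -> List[IndexEntry]:
    """Return entries matching the keyword and the leading adjacent segments."""
    query = tuple(adjacent_query)
    return [
        entry for entry in index
        if entry.keyword == keyword and entry.adjacent[:len(query)] == query
    ]


# Hypothetical sample index (illustrative values only).
INDEX = [
    IndexEntry("Yangtze River", ("Basin", "Hydraulics", "Benefits"), "T001"),
    IndexEntry("Yangtze River", ("Basin", "Flood", "Control"), "T002"),
    IndexEntry("Yangtze River", ("Bridge", "Construction", "Plan"), "T003"),
]

# Longer search terms simply constrain more adjacent segments.
print([e.text_id for e in lookup(INDEX, "Yangtze River")])
print([e.text_id for e in lookup(INDEX, "Yangtze River", ["Basin"])])
print([e.text_id for e in lookup(INDEX, "Yangtze River", ["Basin", "Hydraulics"])])
```

A linear scan is used here for brevity; a real index would presumably be keyed on the keyword segment so that only entries for that keyword are examined.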

The above address can be a database text address, or a web page address or other address.

The computer data system can also be a search engine system. (See Example B below)

The invention may comprise a computer readable medium comprising a data structure having the above keyword index.

The present invention may also be a computer data system including a storage device, in which some or all of the keyword indexes, text summaries, or text data contained in the storage device or its data portion are distributed as follows: indexes, text summaries, or texts that contain the same keyword and the same adjacent segments are located in the same subset of that keyword's set, while those with different adjacent segments are located in different subsets.

When needed, within the same subset, index data whose text summary or text contains the same extended key statement (an extended key statement is the keyword together with one or more levels of adjacent segments) may be located in the same or different lower-level or multi-level subset distribution areas of that subset, according to whether the adjacent segments of that statement are the same or different.

For example, the individual indexes of each text or text portion (e.g., abstract, bibliographic record, statement, or paragraph) containing the same keyword may be included and, if necessary, also the indexes of the directory table of the keyword's various adjacent segments (or subset directory table), of the multi-level adjacent segment tree table (or multi-level subset table), or of the corresponding example sentence sequence table or summary instance sequence table, all or part of which are concentrated or continuously arranged in the centralized storage area corresponding to the keyword (see, for example, Embodiment A below).

The data structure of each index described herein includes at least an address segment of the indexed storage object (such as a text, directory table, or sequence table).

When the keyword is queried in this way, the relevant index can be conveniently or continuously accessed, the address or number of the indexed address segment (ID segment) is obtained, and the relevant directory or text or other content is accessed or extracted.

Similarly, the individual indexes of each text or text portion containing the same keyword may be included and, as needed, also the indexes of the directory tables or tree tables of the various next-level or multi-level adjacent segments, or of the corresponding example sentence sequence tables or summary instance sequence tables, all or part of which are distributed or continuously arranged in the centralized storage areas corresponding to the different adjacent segments of the keyword.

The computer data system may be a search engine system, which can more conveniently query, process, or provide to the user the data of the subset sharing the queried keyword and adjacent segments, and of its lower-level or multi-level subsets.

The present invention may include a computer-readable medium containing keyword index, text summary, or text data distributed in the manner described above.

The specific flow of the method of the present invention as implemented by a computer system can be illustrated by several examples (including Embodiments A, B, C, etc.) in Fig. 11 and Figs. 7, 9, and 10. In the example of Fig. 11, the computer processing device starts work (61), receives the keyword query submitted by the querier (62), obtains a large number of texts containing the keyword, and determines the number or range of words in the keyword's adjacent segments (for example, 5 real words) according to a preset value or the querier's designation (63). The adjacent segments within that range from the different texts are compared and classified (64), and the texts are divided into subsets whose adjacent segments are respectively identical (65). On this basis, the obtained subsets can be subdivided (66), for example by dividing next-level subsets according to whether the next-level adjacent segments are the same, or by comparing adjacent segments under stricter requirements. Representative sequences, sequences of mutually different adjacent segments, or corresponding directories can also be arranged (67), labeled with the number of texts in the corresponding subset and appropriately sorted (71), and displayed in the interface for the querier to select and operate on (70), so as to expand the related subsets or content or display the related text (72). If these sequences or directories contain too many entries, the adjacent segments of the keyword in these entries can be compared for similarity, and similar subsets (68) or sequences or directories of mutually dissimilar content (69) can be arranged, which is more convenient. When the querier finds content of interest, a click operation (70) expands the related subset or more detailed content (72). (The examples in Fig. 7, Fig. 9, and Fig. 10 are described later.)
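The comparison and classification steps of Fig. 11 (63 to 67) might be sketched roughly as below. This is an illustrative sketch only: the function names and sample texts are hypothetical, words are assumed to be separated by whitespace (Chinese text would need a word segmenter), and the adjacent segment is assumed to be the words immediately following the keyword.

```python
from collections import defaultdict


def adjacent_segment(text, keyword, n_words):
    """Return the n_words words right after the first keyword occurrence, or None."""
    pos = text.find(keyword)
    if pos < 0:
        return None
    following = text[pos + len(keyword):].split()
    return tuple(following[:n_words])


def group_by_adjacent_segment(texts, keyword, n_words):
    """Divide texts into subsets whose adjacent segments are identical."""
    subsets = defaultdict(list)
    for text_id, text in texts.items():
        segment = adjacent_segment(text, keyword, n_words)
        if segment is not None:
            subsets[segment].append(text_id)
    return dict(subsets)


def directory(subsets):
    """List (adjacent segment, number of texts), largest subsets first."""
    rows = ((" ".join(seg), len(ids)) for seg, ids in subsets.items())
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Hypothetical sample texts.
TEXTS = {
    "t1": "The Yangtze River basin flood control plan was revised",
    "t2": "Reports on Yangtze River basin flood damage",
    "t3": "A Yangtze River bridge opened to traffic",
}

print(directory(group_by_adjacent_segment(TEXTS, "Yangtze River", 2)))
print(directory(group_by_adjacent_segment(TEXTS, "Yangtze River", 3)))
```

With a two-word adjacent segment the first two texts fall into one subset; with three words every text forms its own subset, which illustrates how segment length trades the number of subsets against their homogeneity.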

Embodiment A, shown in Fig. 4, is an example of a computer data system capable of executing the electronic text processing method of the present invention: an Internet search engine system capable of providing extended key statement search. It comprises a search engine 8 arranged on a server 5 having a memory 6 and a processor, the search engine 8 being connected via a communication network 4 of the Internet to a client 3 having an interactive interface 2. The search engine 8 has a database 9, a querier 11, and a keyword expansion component 10 (or module), and is connected to a data collector 12 and an index constructor 13;

The data collector 12 collects text from the Internet or other information sources and adds it to the text library of the database 9, and the index constructor 13 analyzes the text of the text library to obtain a text index and supplies it to the keyword index library of the database 9;

Each index obtained by the index constructor 13 from its analysis of a text includes a keyword segment, six single adjacent segments, an ID segment of the corresponding text, a text title segment, and a text summary segment, so that the search engine can find the desired text index according to the required keyword segment, alone or together with one or more required adjacent segments, and obtain the text title segment, the text summary segment, or the ID segment of the corresponding text, which can easily be linked to the original text when needed. The indexes of the keyword index library are distributed according to the similarities and differences of the adjacent segments at each level, so as to facilitate retrieval or extraction. The corresponding adjacent segment directory, adjacent segment tree directory (Fig. 8), and key statement directory are also pre-stored.

The client application browser (Microsoft Internet Explorer) on client 3 of embodiment A allows user 1 to retrieve HTML documents (including web forms) from server 5 over communication network 4. The interactive interface (UI) 2 on client 3 allows user 1 to interact with the retrieved web form using a monitor, keyboard or mouse, submit search requests, make selections and receive search results.

An important issue in the search method of the present invention is the way in which the adjacent segments are selected (or the way in which the keyword is joined to the adjacent segments), i.e., the manner in which the key statements are generated. The exemplary key statement generation of Embodiment A, shown in Fig. 5, incrementally adds adjacent segments (in this case, single words) one by one along the keyword 21 in the text summary. Here, 22 is a level 1 key statement, 23 is a level 2 key statement, 24 is a level 3 key statement, and 25 is a level 4 key statement.

Figure 6 shows the key statement generation method of another embodiment B. The first level adjacent word segment is located in front of the keyword 21, and the second level adjacent word segment and other adjacent word segments are located behind the keyword. Among them, 22 is a level 1 key statement, 23 is a level 2 key statement, 24 is a level 3 key statement, and 25 is a level 4 key statement. This type of integration before and after seems to be more suitable for searching Western documents. The length of the adjacent segments of the key statements at all levels (number of words) can also be pre-specified or arranged by the queryer at the time of the search or by default.
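The two ways of generating key statements in Figs. 5 and 6 can be contrasted with a small sketch. It rests on assumptions that the source does not state outright: that Embodiment A takes every adjacent word from after the keyword, that each adjacent segment is a single word, and that the keyword occupies one token. The function name and the sample sentence are hypothetical.

```python
def key_statements(words, kw_index, levels, mode="A"):
    """Return the key statements of levels 1..levels around words[kw_index].

    mode "A": every adjacent word is taken from after the keyword (assumed).
    mode "B": level 1 takes the word before the keyword, later levels the
              words after it.
    """
    statements = []
    start, end = kw_index, kw_index + 1  # current statement is words[start:end]
    for level in range(1, levels + 1):
        if mode == "B" and level == 1:
            if start == 0:
                break                    # no word in front of the keyword
            start -= 1
        else:
            if end >= len(words):
                break                    # no more words after the keyword
            end += 1
        statements.append(" ".join(words[start:end]))
    return statements


WORDS = "the Bulin line index shows volatility".split()

print(key_statements(WORDS, 1, 4, mode="A"))
print(key_statements(WORDS, 1, 3, mode="B"))
```

Each successive level yields a longer key statement, so a text belongs to a level k subset exactly when it shares the level k key statement.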

In other extreme embodiments, it is also possible to allow the extension of the adjacent words from the keyword to the front to form key statements at all levels.

To extend a keyword search in which multiple keywords may be separated from one another, one keyword should be selected as the core keyword, and the key statements of each level are formed by combining it with its adjacent segments; each of these key statements is accompanied by the remaining, separable keywords. It is also possible to add adjacent segments sequentially, in the desired order, near each word or segment of a multi-word keyword to form the key statements at all levels.

The system of Embodiment A, when calculating the number of words in an adjacent segment and comparing the similarity of adjacent segments, can choose not to count function words, quantifiers, punctuation, spaces, etc., and instead merge them into the adjacent real words. Specific rules can be laid down for Western languages. In other embodiments, adjectives or adverbs may also be left uncounted, if desired, when calculating the number of words in an adjacent segment and comparing the similarity of adjacent segments.
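One possible reading of this counting rule is sketched below. It is a simplification under stated assumptions: the function-word list `FUNCTION_WORDS` is a tiny illustrative set, not one given in the source, and function words are simply skipped for counting and comparison rather than literally merged into the neighboring real word.

```python
import string

# Illustrative function-word list (an assumption, not from the source).
FUNCTION_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to"}


def real_word_segment(words_after_keyword, n_real):
    """Return the first n_real real words, ignoring function words and punctuation."""
    segment = []
    for raw in words_after_keyword:
        word = raw.strip(string.punctuation).lower()
        if not word or word in FUNCTION_WORDS:
            continue                      # not counted toward the segment length
        segment.append(word)
        if len(segment) == n_real:
            break
    return tuple(segment)


a = real_word_segment("basin of the lower reaches, flood control".split(), 3)
b = real_word_segment("Basin lower reaches".split(), 3)
print(a, b, a == b)
```

Under this loose comparison the two phrasings yield the same adjacent segment and would therefore fall into the same subset.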

In Embodiment A, the query request of user 1 is authenticated by the querier 11, which queries the database 9 according to the submitted keyword request and provides the resulting list of related data to the interactive interface 2. The keyword expansion component 10 supplements the querier 11: when necessary, the key statements corresponding to the keyword, the corresponding example sentences, and the adjacent segment tree directory (see Fig. 8) are temporarily stored or processed to meet subsequent search or display needs; if this content has not been arranged in the database 9 or the keyword expansion component 10, the keyword expansion component 10 builds it from the keyword query data of the querier 11.

In fact, this is very easy to achieve, and various methods can be used. For example, whether in advance or after the fact, and whether for a possible keyword or an actually submitted keyword, the keyword expansion component 10 of Embodiment A, or a computer or other search system, can work through a sequence of indexes or files containing the keyword as follows. Take one index or file (for example, the first), read the keyword together with the adjacent words or phrases that form its adjacent segment (of the predetermined length), and store them as the first key statement. Then check whether the words or phrases adjacent to the keyword in the second index or file are the same as in the first: if they differ, store them in order; if they are the same, discard them. Then compare the third index or file with the first two, and so on, until a set of mutually different key statements is obtained. If, during this comparison, the indexes or files containing the same key statement are also arranged into groups, the subsets are formed at the same time; otherwise, the querier 11 can search the index or file sequence using each key statement in turn as the criterion to obtain the corresponding subsets. If the various level 2 adjacent segments are then searched for in the same way within each subset of indexes or files, the various level 2 key statements and the corresponding lower-level subsets are obtained, and so on. If one index or digest (for example, the first) or several are selected from each obtained subset as example sentences, the desired directory and example sentence sequences are obtained, completing the grouping operation.
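The sequential comparison and level-by-level subdivision just described could look roughly like the following. This is a hedged sketch: the node layout, the function name `build_tree`, and the sample entries are hypothetical, each index entry is assumed to carry its adjacent segments as a tuple, and a dictionary lookup stands in for the pairwise comparison with earlier entries.

```python
def build_tree(entries, level=0, max_level=2):
    """Group (text_id, adjacent_segments) entries into a multi-level subset tree.

    Each node records one distinct adjacent segment, the first text found
    with it (used as the example), the texts of its subset, and the
    lower-level nodes.
    """
    nodes = []        # distinct segments in first-seen order
    by_segment = {}
    for text_id, segs in entries:
        if level >= len(segs):
            continue
        seg = segs[level]
        node = by_segment.get(seg)
        if node is None:                 # a new, different segment: store it
            node = {"segment": seg, "example": text_id, "texts": [], "children": []}
            by_segment[seg] = node
            nodes.append(node)
        node["texts"].append((text_id, segs))  # same segment: join its subset
    for node in nodes:
        members = node["texts"]
        node["texts"] = [tid for tid, _ in members]
        if level + 1 < max_level:
            node["children"] = build_tree(members, level + 1, max_level)
    return nodes


# Hypothetical index entries for one keyword.
ENTRIES = [
    ("t1", ("basin", "flood")),
    ("t2", ("basin", "hydraulics")),
    ("t3", ("bridge", "construction")),
    ("t4", ("basin", "flood")),
]

for node in build_tree(ENTRIES):
    print(node["segment"], node["texts"], [c["segment"] for c in node["children"]])
```

Taking the first text of each subset as its example mirrors the "for example, the first" choice in the description; any other selection rule could be substituted.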

This directly indicates that the method can handle the relevant texts in the same way, whether it is based on the possible keywords or the actual key words, in order to facilitate the query.

This is actually subdividing the original subset of the same adjacent segment into a number of lower-level subsets by comparing the adjacent words of the original keywords and comparing the same or not.

The directory and the example sentence sequence can be ordered, for example, by the size of an objective function value. The objective function value of an entry is that of the text with the largest objective function value in the entry's subset, and for a text it equals the sum of the text's link value and its recent click rate. The example sentence can be extracted from the text with the largest objective function value in the subset. In other embodiments, the ordering of a sequence of information such as a directory, example sentences, bibliographic records, or abstracts may be determined by the size of an objective function value $F(x_1, x_2, \ldots, x_n)$.
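The ordering rule of Embodiment A (score of a text = link value + recent click rate; score of a subset = score of its best text) can be sketched as follows. The field names `link_value` and `recent_clicks`, the function names, and the sample numbers are hypothetical.

```python
def text_score(text):
    """Objective function of one text: link value plus recent click rate."""
    return text["link_value"] + text["recent_clicks"]


def rank_subsets(subsets):
    """Rank subsets by their best text; that text also supplies the example."""
    ranked = []
    for segment, texts in subsets.items():
        best = max(texts, key=text_score)
        ranked.append((segment, text_score(best), best["id"]))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


# Hypothetical subsets keyed by their common adjacent segment.
SUBSETS = {
    "bridge opened": [
        {"id": "t3", "link_value": 4.0, "recent_clicks": 0.5},
    ],
    "basin flood": [
        {"id": "t1", "link_value": 3.0, "recent_clicks": 1.0},
        {"id": "t2", "link_value": 2.0, "recent_clicks": 5.0},
    ],
}

print(rank_subsets(SUBSETS))
```

Other factors, such as a bid for texts carrying advertising content, could be substituted into `text_score` without changing the ranking logic.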

For text with additional ad content, the objective function value can be equal to the corresponding bid.

Since there are many specific processing methods for text sorting in the prior art, it will not be described in detail here.

The keyword index of the embodiment A can adopt a system distributed according to each subset, and does not occupy more storage space than other existing keyword indexing libraries, which is one of its outstanding advantages.

In another Embodiment B, the keyword index library is not distributed by subset. Since the index data structure includes a keyword item and several adjacent segment items, the querier 11 can directly search for and display the indexes belonging to the corresponding subset, based on key statements that combine the keyword segment with one or more adjacent segments. In Embodiment B, only the tree directory of adjacent segments or key statements needs to be arranged, and the original conventional keyword index database need not be changed.

The indexing system of the present invention can also employ a conventional keyword inverted table indexing system or method.

Of course, lower-level subsets of the subsets can also be obtained more generally. For example, if necessary, an adjacent segment tree directory like that shown in Fig. 8 can be used to reflect the relationships between the adjacent segments of a keyword at different levels; displaying it on the screen helps the user understand the overall state of the subsets at each level and adopt a better search strategy. The figure omits the number of texts for each subset. The keyword is "Bulin"; adjacent segment 1, adjacent segment 2, adjacent segment 3, and adjacent segment 4 each consist of a single real word, and they also represent the common adjacent segments contained in the subsets of each level.

Embodiment A may perform a selection operation: the system determines a key statement from the querier's cursor indication on a text, summary, directory, or selection bar on a page of the interactive interface, and then either performs a grouping operation on the extended key statements, extended adjacent segments, indexes, or text summaries corresponding to that key statement, or displays the corresponding indexes, text summaries, or texts, or performs a removal operation that culls or repositions the entries, indexes, text summaries, or texts on that page or other pages that contain the key statement.

Fig. 10 is a partial screen diagram showing a cursor click and a display result (i.e., a selection operation for grouping) in the search process according to an embodiment of the present invention.

The search box 51 is for entering keywords (in this case, "Bulin"), and 52 shows the two options for click operations, 'click to expand' and 'click to remove'; in this case, 'click to expand' is selected. The object clicked here is the summary 55 shown in the summary bar 53 on the screen. While reading, the querier becomes interested in content related to the "Bulin Line Index" and places the cursor 54 on the word "mark", so that "Bulin Line Index", the span from "Bulin" to "mark", is taken as a new key statement, and in a grouping operation several further extended adjacent segments or their respective example sentences 56 are listed.
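Deriving a key statement from the cursor position, as in Fig. 10, might be sketched as below. This is an illustrative assumption-laden sketch: the function name, the sample summary, and the use of whitespace to find the end of the clicked word are hypothetical (for Chinese text the clicked character itself would mark the end of the span).

```python
def key_statement_from_click(summary, keyword, click_pos):
    """Return the span from the keyword through the clicked word, or None.

    The nearest keyword occurrence that ends at or before click_pos is used.
    """
    start = summary.rfind(keyword, 0, click_pos)
    if start < 0:
        return None
    end = click_pos
    while end < len(summary) and not summary[end].isspace():
        end += 1                          # extend to the end of the clicked word
    return summary[start:end]


SUMMARY = "Traders watch the Bulin Line Index for volatility signals"
click = SUMMARY.index("Index") + 2       # cursor somewhere inside "Index"

print(key_statement_from_click(SUMMARY, "Bulin", click))
```

The returned span can then be used as the new key statement for a further grouping operation on the texts that contain it.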

The search method of Embodiment A further includes an ignoring operation: while the querier browses a sequence of indexes, text summaries, or texts containing the original key statement, the system can record or analyze data on page browsing in the interactive interface 2 (such as page changes), or on the "itemous clicks" or "ignoring clicks" made on items or content on the page, and then remove from the subsequent results the key statements that were ignored or went unnoticed within a certain reading time or space, together with their related indexes and abstracts.

In the system of Embodiment A, after user 1 submits a keyword request through the interactive interface 2, the querier 11 queries the database as required and provides the resulting list of related data to the interactive interface 2. If the user wishes to extend a keyword, the keyword expansion component 10 generates the appropriate key statements and provides the desired data, either by extraction or through a search by the querier 11.

The workflow of the search engine 8 (including the querier 11 and the keyword expansion component 10) can be illustrated by Figure 9:

The system starts working at module 41 and checks whether there is a keyword search request (42); if not, it returns (48). If so, module 43 checks whether there is a keyword expansion operation request. If not, module 44 provides an ordinary search result sequence display; if so, module 45 asks for user 1's requirements via a prompt box on the screen of the interactive interface 2, performs the corresponding operation, provides the corresponding information according to module 46, and continues to query the selections and needs of user 1. After a few iterations, the corresponding search information is provided according to module 47, and module 48 returns or the process ends (49), as user 1 intends.

The operation flow of user 1 in the interactive interface, corresponding to the search engine 8, can be represented by a flowchart (Fig. 7): after the interactive interface 2 starts working (31), the user selects a keyword (32) and can either browse normally (34) or choose an extended search (33). If (33) is chosen, that is, the extended key statement search technique is used, the appropriate operation mode must be selected by cursor click, for example the length of the first adjacent segment of the keyword (the number of words it contains). If the length is short, the number of corresponding key statements (subsets) is small, but the content of each subset is complex; if the length is long, there are more kinds of corresponding key statements (subsets), and the core content of each subset is more uniform or concentrated.

Obviously, when the length of the chosen key statement reaches 5 to 6 words, the index or summary sequence mentioned above becomes a "refined sequence" whose core content is essentially non-repeating and has few omissions, and the total number of documents may be reduced by several orders of magnitude.

When a long key statement is selected, the number of distinct indexes at the first level will be larger. The system allows click operations to reduce the number of adjacent words or adjacent segments in a key statement appropriately, which can greatly reduce the number of distinct key statements, abstracts, or example sentences at the first level or at that level.

If the user forgoes selecting the first adjacent segment length of the keyword and the other options, the system automatically performs the grouping operation according to its original settings, for example with each adjacent segment one or two words long, and presents the result (35). At this point, user 1 can choose (37) to open a linked text in the result directly, or (36) select an appropriate extended key statement in the result presented on the screen (see Fig. 10) and obtain the further search results displayed by module 38 (the next-level subset directory and other content).

At this point, User 1 can still choose 40 to open the link text directly, or select 39 to continue selecting the key statement of an extension... and so on until it returns (301).

This step-by-step expansion of the key statement narrows the scope of the search step by step and quickly and effectively locks onto the search target. In Embodiment A, and of course in other embodiments of the method of the present invention, it is possible to record or accumulate, over a certain period of time, the number of clicks made by some or all queriers on content related to the various key statements or adjacent segments of various keywords, or to set up a corresponding statistical module when needed.

In Embodiment C, the above key statement search technology is combined with existing keyword search technology: when sorting the indexes inside a subset, or when selecting each example sentence in the grouping operation, care is taken to respect or maintain the ranking or position of the relevant file in the search results of a given search system. In other words, the technique of the present invention includes applying the principles or methods of existing ranking-based search technology on top of the above basic method and basic structure.

Embodiment B and Embodiment C are substantially the same as Embodiment A except for the points indicated.

The technical features given in the above embodiments are all illustrative; the various technical features of an embodiment can be used independently, and they shall not be construed as limiting the scope of the invention.

Claims


1. A computer-implemented method for processing a plurality of electronic texts containing the same keyword, comprising:

Obtaining multiple electronic texts containing the same keyword;

Specifying the number of words in the adjacent segment or the way in which adjacent segments are truncated;

According to whether the adjacent segment of the keyword in the content of each of some or all of the texts is the same as or different from that of the other texts, dividing the text into the same or a different subset or category as the other texts, or performing correspondingly the same or different processing. The correspondingly same or different processing may include: giving the corresponding texts the same or different distribution positions or storage manners, or the same or different subset marks, or giving their indexes the same or different marks or index items, or the same or different arrangements, or the same or different display modes or positions in the interactive interface, or having one or more adjacent segments or texts of at least some subsets combined or sorted across subsets or displayed in the interactive interface;

The text may be an electronic file or web page, or their abstract, index, title, or heading.

2. The processing method according to claim 1, comprising: for different texts belonging to one or some of the same first level subset or higher subset or contents thereof containing the same keyword and adjacent words, according to The same keyword and other adjacent segments of the adjacent segment are identical or different, and some or all of the text is divided into the same or different lower or multi-level subsets of the subset or corresponding Same or different treatment;

The processing method allows for the merging or separation of successive adjacent segments to reduce or increase the subset hierarchy.

3. The processing method according to claim 1, comprising:

Arranging a directory, tree, or sequence of one or more levels that reflects the parallel or subordinate relationships among the different adjacent segments or indirect adjacent segments of the same keyword in the texts, or among the statements, example sentences, or summary instances containing those segments; the directory, tree, or sequence may include the common adjacent segment or indirect adjacent segment of each of one or more different subsets of the texts, or a statement, example sentence, or summary instance containing that segment, or the common adjacent segment or indirect adjacent segment of each next-level or lower-level subset of those subsets, or a statement, example sentence, or summary instance containing that segment, arranged, distributed, or stored in a parallel or subordinate relationship; the segments, statements, example sentences, or summary instances described therein may be juxtaposed across subsets.

4. The processing method according to claim 1, comprising:

5. The processing method according to claim 1, comprising:

A sequence of a plurality of text or text partial contents containing the same keyword, which contain adjacent words composed of a plurality of words, which are different from each other, or substantially different from each other.

6. The processing method according to claim 1, comprising:

Comparing the different adjacent segments of the same keyword in the texts for similarity, and dividing different adjacent segments that meet a certain similarity requirement into the same similar subset, or dividing different adjacent segments that do not meet the similarity requirement into different similar subsets, or arranging different adjacent segments that do not meet the similarity requirement into sequences or directories of mutually dissimilar items; the common content of the elements of a similar subset may be used as the name or tag of that similar subset, or included in a sequence or directory of similar subset names;

7. The method of processing according to claim 1 or 3 or 4 or 5 or 6, comprising:

The respective respective directories or sequences or their charts are respectively used as independent texts, and the independent texts respectively have a title or an inscription containing the corresponding keywords, or respectively have corresponding keywords and corresponding identical adjacent words. The title or title of the segment or adjacent segment group or similar subset name;

If necessary, the title or the title of the independent text of the catalog or sequence has a prompt character or a combination thereof that prompts its related features in or near the title.

8. The processing method according to claim 7, comprising:

Enabling the search-result sorting device of the relevant search engine or retrieval system to recognize the title characters of the independent texts of the directories or sequences, or the prompt characters or combinations thereof in or near their titles, so that, if necessary, independent texts or partial content having the corresponding characters or combinations thereof in or near the title can be arranged at the front of the search results.

9. The method of processing according to claim 1 or 3 or 4 or 5 or 6, comprising:

The keyword index data of some or all of the catalogue texts provided therein or their titles or titles is distributed in a corresponding specialized area of a search engine or a retrieval system-specific index database or index database.

10. The processing method according to claim 1, comprising:

In the processing method or directory, the specific sorting position of the parallel subset or the parallel adjacent segment or the indirect adjacent segment or the parallel text or the parallel sentence or the example sentence or the summary instance or the representative sequence information, part Or it depends entirely on one or more of the following factors:

The size or click rate of the text or the segment or sentence or the example or summary example or the text of the information, or the rate of occurrence of the keyword, or the keyword occurrence rate,

Or the number of subordinate subsets or the number of subordinate texts of the subset or text or sentence or instance or summary instance, or the click rate of the subset or the average value of the text of the subset, or the size of the page with the highest value of the page link value or the text link value of the other text instance,

Or the click rate or keyword occurrence rate of the text with the highest click rate or the highest keyword occurrence rate of the subset or another text instance,

Or the ranking of the relevant texts, or of the texts in the relevant subsets, in the search results of other search sites or search systems,

Or the source of the text or the rating of the unit or person,

Or how new or old the relevant texts are,

Or whether it belongs to the same subset of a certain level,

Or it can be determined by an objective function value, which depends on one or more variables, and the variable of the objective function partially or completely represents one or more of the factors listed above.

11. A computer data system comprising a storage device, wherein the data structure of part or all of the keyword index held in the storage device, or in the data portion thereof, comprises at least:

A keyword segment;

One or more adjacent segments, each composed of a predetermined number of mutually adjacent words or characters in the corresponding text content or text summary, in order: adjacent segment 1, adjacent segment 2, ... adjacent segment N;

The corresponding text ID segment, or the ID segment of its associated information;

If necessary, a summary segment or a title segment of the corresponding text containing the keyword.
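As a non-limiting sketch, one record of such a keyword index might be represented in Python as follows. The class name, field names, and the `key_statement` helper are hypothetical, and the helper assumes that the adjacent segments follow the keyword in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class KeywordIndexRecord:
    """One index entry: keyword, ordered adjacent segments, and text ID."""
    keyword: str
    adjacent_segments: Tuple[str, ...]   # adjacent segment 1 .. adjacent segment N
    text_id: str                         # ID of the text or of its associated information
    summary: Optional[str] = None        # optional summary segment
    title: Optional[str] = None          # optional title segment

    def key_statement(self, levels: int = 1) -> str:
        """Keyword extended by the first `levels` adjacent segments."""
        return " ".join((self.keyword,) + self.adjacent_segments[:levels])


record = KeywordIndexRecord(
    keyword="search",
    adjacent_segments=("engine", "index"),
    text_id="doc-1",
)
```

Here `record.key_statement(1)` gives the first-level key statement and `record.key_statement(2)` the extended one.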

12. A computer data system comprising a storage device, wherein the storage device or part or all of the keyword index or text summary or text data contained in the data portion thereof is distributed in the following manner:

If necessary, index data that belongs to the same subset, whose text summary or text contains the same extended key statement, and for which the adjacent segments of that statement are the same or different, is located in the same or a different distribution area of a subset one or more levels lower within that same subset.

13. A search method by which a search engine provides a search result desired by a queryer, the search engine system responding to a keyword query request submitted by the queryer via an interactive interface, searching a related information source or database of the system, and providing the text, text summary, index, or related information that meets the keyword requirement; the search method is characterized in that:

The system receives the keyword query request of the queryer via the interactive interface;

The number of words or characters contained in the adjacent segment, or the manner in which the adjacent segment is cut off, is predetermined by the system, or is agreed, defaulted, or selected by the queryer; it may also be determined according to a symbol, word, or character at or near the end of the adjacent segment, or its font, color, or spacing, or according to the position and manner of the queryer's cursor indication on a selection bar of the interactive interface or on the text summary, text, or related content of a specific index; according to the above adjacent segments or key statements, different entries are distinguished by their different adjacent segments or different key statements;

Search results are generated based on the obtained key statements, that is: different indexes, text summaries, texts, or titles containing the same or different key statements are retrieved, processed, arranged, or organized for the queryer to select via the interactive interface.
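The steps of this claim can be sketched in Python as follows, purely as an illustration. The sketch assumes a single-word keyword, word-based tokenization, and adjacent segments taken from the words that follow the keyword; the function names and the `n_adjacent` parameter are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterator, List


def key_statements(text: str, keyword: str, n_adjacent: int = 1) -> Iterator[str]:
    """Yield the keyword joined with up to `n_adjacent` following words."""
    words = re.findall(r"\w+", text.lower())
    kw = keyword.lower()
    for i, word in enumerate(words):
        if word == kw:
            following = words[i + 1 : i + 1 + n_adjacent]
            yield " ".join([kw] + following)


def group_by_key_statement(
    texts: Dict[str, str], keyword: str, n_adjacent: int = 1
) -> Dict[str, List[str]]:
    """Map each distinct key statement to the IDs of the texts containing it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text_id, text in texts.items():
        # A set avoids listing the same text twice under one key statement.
        for statement in set(key_statements(text, keyword, n_adjacent)):
            groups[statement].append(text_id)
    return dict(groups)


texts = {
    "d1": "Apple pie recipe",
    "d2": "An apple pie with cream",
    "d3": "Apple computer history",
}
groups = group_by_key_statement(texts, "apple", n_adjacent=1)
```

The queryer would then be offered the keys of `groups` as the different key statements to choose from, with the texts of each group shown on selection.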

14. A search method according to claim 13 wherein said method can be performed by the system in advance or at the time of enquiry.

15. The search method according to claim 13, further comprising:

The number of words or characters included in the adjacent segment, or the cutoff mode or specific content of the adjacent segment, is predetermined by the system, or is agreed, defaulted, or selected by the queryer;

Generating search results based on the resulting extended key statements, that is: retrieving, processing, arranging, organizing, or separately storing the different indexes, text summaries, texts, or titles containing the same or different extended key statements, for the queryer to select through the interactive interface.

16. The search method according to claim 13, comprising a grouping operation:

That is, the various different key statements, adjacent segments, indexes, text summaries, or texts containing the same keyword, or the various different extended key statements, adjacent segments, indexes, text summaries, or texts containing the same original key statement, are grouped or displayed in a catalog or sequence, with only one or more of the key statements, indexes, text summaries, or texts shown for each adjacent segment.

17. The search method according to claim 13, comprising:

Storing the data of some or all of the keyword indexes, text summaries, or texts in different or identical subset areas, or in different or identical lower-level subset areas, according to their different or identical keywords, key statements, or extended key statements;

So that the corresponding key statement, keyword index, text summary, or text data can be extracted or provided directly when the keyword is queried.

18. The search method according to claim 13, comprising:

Parsing, in the database or in a data collection server attached to the search engine, the text or summary of each text obtained from the Internet or another information source, and generating and storing for that text an index comprising at least the keyword segment, the adjacent segment, and the text ID segment or the ID segment of related content, and including a text summary or title if necessary;

When searching, the corresponding index, summary, or text is retrieved from storage and provided according to the keyword segments and adjacent segments it contains.

19. The search method according to claim 13, comprising:

Arranging, for use in querying, a tree-like directory that reflects the sequential or parallel relationships between the different levels of adjacent segments of the keyword in texts or text summaries having the same keyword, or a tree-like directory that reflects the relationships between key statements at different levels of expansion of the keyword.
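One possible form of such a tree-like directory is sketched below in Python as a nested dictionary, in which each level holds the next adjacent word after the keyword. The function names, the `depth` parameter, and the choice of following words as adjacent segments are assumptions made for the example only.

```python
import re
from typing import Dict, List


def build_adjacent_tree(texts: List[str], keyword: str, depth: int = 2) -> Dict:
    """Build a tree whose level k holds the k-th word following the keyword."""
    root: Dict = {}
    kw = keyword.lower()
    for text in texts:
        words = re.findall(r"\w+", text.lower())
        for i, word in enumerate(words):
            if word != kw:
                continue
            level = root
            for nxt in words[i + 1 : i + 1 + depth]:
                node = level.setdefault(nxt, {"count": 0, "children": {}})
                node["count"] += 1          # occurrences of this key statement
                level = node["children"]    # descend to the next adjacent level
    return root


def render(tree: Dict, prefix: str, indent: int = 0) -> List[str]:
    """List each key statement with its count, indented by expansion level."""
    lines: List[str] = []
    for word in sorted(tree):
        node = tree[word]
        statement = f"{prefix} {word}"
        lines.append("  " * indent + f"{statement} ({node['count']})")
        lines.extend(render(node["children"], statement, indent + 1))
    return lines


tree = build_adjacent_tree(
    ["apple pie recipe", "apple pie crust", "apple computer history"], "apple"
)
directory = render(tree, "apple")
```

Sibling nodes correspond to parallel adjacent segments, and parent-child links correspond to successive levels of expansion of the key statement.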

20. A search method according to claim 13 or 14 or 15, comprising a selection operation:

That is, the system determines a corresponding key statement according to the queryer's cursor indication on a text, text summary, key statement, or adjacent-segment directory on the page of the interactive interface, or in a selection column or box; and, for the determined key statement, performs a grouping operation or catalog presentation of the various different extended key statements, extended adjacent segments, indexes, titles, text summaries, or texts corresponding to it, or retrieves the corresponding indexes, titles, text summaries, or texts; or performs a removal operation according to the determined key statement, culling or repositioning the entries, indexes, titles, text summaries, or texts containing that key statement on the current page or on other pages.

21. The search method according to claim 11, comprising an ignore operation:

While the queryer browses a sequence of indexes, text summaries, titles, or texts containing the original keyword or the original key statement, the system determines, from the queryer's paging operations on the interactive interface, the position in the sequence that the queryer has currently reached; if it can be determined that, within a certain range preceding that position, the indexes, text summaries, or texts containing a certain key statement, or the key statement itself, have gone unopened or unlinked a certain number of times, or have not been clicked, followed, or marked for retention, a removal operation is performed according to that key statement, and the entries, indexes, text summaries, or texts containing the key statement on the current page or on other pages are culled or repositioned.
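A minimal Python sketch of one way the ignore operation could behave is given below. It assumes that each result carries a single key statement and a clicked flag, and it repositions ignored entries to the end of the list rather than culling them; the names `ResultEntry`, `ignore_unclicked`, `window`, and `min_skips`, and their default values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class ResultEntry:
    text_id: str
    key_statement: str
    clicked: bool = False   # True if the queryer opened or followed the entry


def ignore_unclicked(
    results: List[ResultEntry], position: int, window: int = 10, min_skips: int = 3
) -> List[ResultEntry]:
    """Move entries with repeatedly passed-over key statements to the end.

    position:  index of the entry the queryer has currently reached.
    window:    how many entries before `position` are inspected.
    min_skips: unclicked appearances needed before a key statement is ignored.
    """
    seen = results[max(0, position - window) : position]
    skipped: Dict[str, int] = {}
    clicked: Set[str] = set()
    for entry in seen:
        if entry.clicked:
            clicked.add(entry.key_statement)
        else:
            skipped[entry.key_statement] = skipped.get(entry.key_statement, 0) + 1
    ignored = {s for s, n in skipped.items() if n >= min_skips and s not in clicked}

    head, tail = results[:position], results[position:]
    kept = [e for e in tail if e.key_statement not in ignored]
    moved = [e for e in tail if e.key_statement in ignored]
    return head + kept + moved


results = [
    ResultEntry("d1", "apple pie"),
    ResultEntry("d2", "apple pie"),
    ResultEntry("d3", "apple pie"),
    ResultEntry("d4", "apple computer", clicked=True),
    ResultEntry("d5", "apple pie"),
    ResultEntry("d6", "apple computer"),
    ResultEntry("d7", "apple tree"),
]
reordered = ignore_unclicked(results, position=4)
```

In this usage the queryer has passed three "apple pie" entries without clicking, so the remaining "apple pie" entry is moved behind the entries not yet viewed.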

22. A search engine system for providing a desired search result in response to a request by a queryer via an interactive interface, comprising:

a server, the server being coupled, via a communication network or line, to a client where the interactive interface is located; a search engine located at the server, the search engine comprising: a database including a keyword index, and a querier, the querier being able to query the database according to the queryer's keyword requirement and to provide the resulting list of related data to the interactive interface;

Characterized in that:

The database also includes text summaries or texts containing the keywords;

The querier or search engine includes a keyword expansion component that can perform an expansion operation on the keywords of the query: the keywords appearing in the text content or text summaries containing them are taken, together with their adjacent segments, as distinct extended key statements, and the different adjacent segments of the keywords are listed for the queryer to select via the interactive interface, or the different indexes, text summaries, or texts having the same or different extended key statements are retrieved, processed, arranged, or organized for the queryer to select via the interactive interface;

The keyword expansion component can also perform one or more further expansion operations on the key statement, or carry out the corresponding query or processing.

23. The search engine system of claim 22, wherein:

The system includes an adjacent-segment tree directory reflecting the sequential or parallel relationships between the different levels of adjacent segments of the keyword in texts or text summaries having the same keyword, or a tree directory reflecting the relationships between key statements at different levels of expansion of the keyword.

24. The search engine system of claim 22, wherein:

The search engine system may further include a graphical user interaction interface that allows the queryer to add additional query information; the interface may include a dialog box or a selection box to receive the queryer's choice of operation mode and the like; the interface may contain clickable key statements, adjacent words, sentences, paragraphs, action commands, or selectable text, symbols, or graphics, allowing the queryer to add additional query information.

25. A computer-readable medium carrying an adjacent-segment tree directory reflecting the sequential relationship between adjacent segments of the keyword in texts or text summaries having the same keyword, or a tree directory reflecting the relationships between key statements at different levels of expansion of the keyword.

26. A method executed by a computer system or search system, which can obtain different key statements from indexes or file sequences containing any actually submitted or possible keyword, whether in advance or after the event, and arrange the indexes or files containing the same key statement into groups, or obtain the desired directory and example sequence.