CN1920814A

CN1920814A - Extensible intelligent internet search system

Info

Publication number: CN1920814A
Application number: CN 200610026381
Authority: CN
Inventors: 邱致中; 沈超
Original assignee: SHANGHAI TAIKOR MEDIA CO Ltd
Current assignee: SHANGHAI TAIKOR MEDIA CO Ltd
Priority date: 2006-05-09
Filing date: 2006-05-09
Publication date: 2007-02-28

Abstract

The invention relates to an extensible intelligent internet indexing system composed of a base layer, a function layer, a logic layer and data-file software modules. The system obtains content from the internet; cleans the obtained content; analyzes the cleaned content to obtain an expression that represents its meaning; extracts keywords based on the content and the expression; summarizes the content based on the content and the keywords; classifies the content based on the content and the expression; clusters the content into a specified number of clusters, or compares the content with existing clusters; and submits the content to sensors for detection. The invention can support the indexing needs of various networks and can realize more functions by adjusting or replacing components.

Description

Extensible intelligent internet indexing system

Technical field

The present invention relates to mechanisms for automatically collecting internet resources and building an index of them, and in particular to an extensible intelligent internet indexing system.

Background art

As internet information grows, people pay more and more attention to indexing it, since only an index makes efficient querying and retrieval possible. The internet comprises the World Wide Web (WWW), local area networks (LAN) and so on, and carries multiple protocols (such as HTTP, FTP and file) and many types of content (such as Web pages, files, music and films). Common Web search engines (such as Google and Baidu) implement a full-text index of World Wide Web page content and provide a query service over this index. Indexing the internet usually requires the following steps:

1. Obtain the content on the internet;

2. Parse out the text in the content;

3. Build a full-text index of the text in the content.

Queries against the index are then based on character matching against the full-text index.

Common search engines use web crawler software to obtain content. A web crawler (Crawler) is software that automatically collects resources distributed over a network, and is mainly used for the following purposes:

Providing the source of web pages for search engines that index the web page resources on the WWW;

Helping specific users collect specific sets of Web pages;

Helping people carry out statistical analysis of the current state of the internet; and so on.

With the development of society and technology, people place higher demands on internet retrieval. Examples include a search system that can automatically aggregate topics from search results, a system that can retrieve the various files distributed across an intranet, and a search system that can classify content automatically. Present search engines and other software products cannot satisfy these demands well, so an intelligent internet indexing system is currently needed.

Summary of the invention

The objective of the invention is to remedy the above deficiencies of internet retrieval by providing an extensible intelligent internet indexing system.

The system consists of a base layer, a function layer, a logic layer and data files, wherein: the base layer provides a storage, an algorithm library and an event monitor; the function layer has sensors, parsers, cleaners and downloaders; the logic layer has a web crawler, a searcher and an indexer.

The storage holds all or part of the following information: the raw information of the obtained content, the cleaned content, the expression representing the meaning of the content, the keywords of the content, the summary of the content, the classification information of the content, the clustering information of the content, the records of the sensors, additional information (update time, number of links, etc.) and index information. The storage can be based on any file system, database system or other medium.

The algorithm library provides the implementation of all algorithms needed by this method, including algorithms in the CNLU area such as the classifier, the clusterer, the keyword extractor and the summarizer.

The event monitor is responsible for monitoring and recording all system events and errors.

The downloader automatically selects a suitable protocol to obtain content.

The cleaner cleans the content, which includes removing duplicate content, removing advertisements that may be present, and removing useless content.

The parser analyzes the cleaned content and obtains an expression that represents the meaning of the content. This expression may be, but is not limited to: the parsed text; or a feature set extracted from an image, music or film.

The sensor is the component that reacts to specific content.

The indexer is the component that indexes content.

The web crawler is the program that collects network content.

The searcher is the component that accepts query requests and returns search results.

Based on the software structure of base layer, function layer and logic layer, the indexing method of the present invention is:

(1) Is there content that needs processing? If not, end;

(2) If there is content to process, obtain the content;

(3) Check whether this content has been updated. If it has not, calculate the next update time;

(4) If this content has been updated, clean this content;

(5) Parse this content;

(6) Extract keywords;

(7) Extract a summary;

(8) Classify automatically;

(9) Cluster automatically;

(10) Submit to the sensors for detection;

(11) Calculate the next update time;

(12) Store the content and additional information;

(13) Add to or update the index;

(14) Wait for the specified time;

(15) Return to (1).
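The loop in steps (1) to (15) can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `Item`, `run_index_loop`, `next_item`, `fetch`, `is_updated`, `stages` and `finish` are all hypothetical, and every stage is passed in as a plain callable so that it can be swapped, in the spirit of the replaceable components described in the text.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One piece of content moving through the pipeline (hypothetical type)."""
    uri: str
    raw: str = ""
    info: dict = field(default_factory=dict)  # keywords, summary, classes, ...


def run_index_loop(next_item, fetch, is_updated, stages, finish, wait=lambda: None):
    """Repeat steps (1)-(15) until no content is left to process."""
    handled = []
    while True:
        item = next_item()                # step (1): anything left to process?
        if item is None:
            return handled
        item.raw = fetch(item.uri)        # step (2): obtain the content
        if is_updated(item):              # step (3): changed since last visit?
            for name, stage in stages:    # steps (4)-(10): clean, parse, ...
                item.info[name] = stage(item)
        finish(item)                      # steps (11)-(13): schedule, store, index
        handled.append(item.uri)
        wait()                            # step (14), then back to step (1)
```

Passing the stages as an ordered list of `(name, callable)` pairs keeps the loop itself unaware of which cleaner, parser or extractor is in use.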

Wherein: the method of obtaining the content that needs processing is:

(1) Obtain the URI of the content to be fetched;

(2) Analyze the URI in order to select a suitable downloader;

(3) Select the HTTP downloader, the FTP downloader, the file downloader or an extension downloader;

(4) Download and save all or part of the content;

(5) End.
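The downloader selection above can be sketched as a lookup keyed on the URI scheme. This is an illustrative sketch only: the stub downloaders perform no network I/O, and the names `DOWNLOADERS` and `select_downloader` are assumptions, not taken from the patent. An extension downloader is modeled simply as another entry in the registry.

```python
from urllib.parse import urlparse


def http_downloader(uri):
    return f"<content of {uri} fetched over HTTP>"   # stub, no real request


def ftp_downloader(uri):
    return f"<content of {uri} fetched over FTP>"    # stub, no real request


def file_downloader(uri):
    return f"<content of {uri} read from disk>"      # stub, no real read


# Hypothetical registry; extension downloaders would be added here.
DOWNLOADERS = {
    "http": http_downloader,
    "ftp": ftp_downloader,
    "file": file_downloader,
}


def select_downloader(uri, registry=DOWNLOADERS):
    """Steps (2)-(3): analyze the URI and pick the matching downloader."""
    scheme = urlparse(uri).scheme.lower()
    try:
        return registry[scheme]
    except KeyError:
        raise ValueError(f"no downloader registered for scheme {scheme!r}")
```

For example, `select_downloader("http://www.sina.com.cn")` returns `http_downloader`, while a URI with an unregistered scheme raises `ValueError`.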

Wherein: the method of parsing the content is:

(1) Obtain the cleaned content;

(2) Select a suitable parser according to the content type;

(3) Select the HTML parser, the WORD parser, the PDF parser or an extension parser;

(4) Depending on the parser: remove the HTML tags and obtain the Title content; remove the Word formatting information and obtain the text; extract the PDF text content; or remove useless information, so as to obtain the content expression;

(5) Segment the content into words;

(6) End.
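For the HTML case, the parsing steps above might look like the sketch below. It uses Python's standard `html.parser` module; the names `PARSERS`, `parse_html`, `segment` and `parse_content` are hypothetical. Two simplifications are assumed: the word segmenter is a plain regular expression (real Chinese text would need a proper segmenter), and script or style text is not filtered out.

```python
import re
from html.parser import HTMLParser


class _TitleAndText(HTMLParser):
    """Collects the <title> text separately from the remaining text."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        else:
            self.chunks.append(data)


def parse_html(content):
    """Step (4) for HTML: drop the tags, keep the title and the text."""
    parser = _TitleAndText()
    parser.feed(content)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return parser.title.strip(), text


def segment(text):
    """Step (5): naive word segmentation (an assumption, see lead-in)."""
    return re.findall(r"\w+", text.lower())


PARSERS = {"text/html": parse_html}  # WORD, PDF, extension parsers would go here


def parse_content(content, content_type):
    """Steps (2)-(5): choose a parser by content type, then segment."""
    parser = PARSERS.get(content_type)
    if parser is None:
        raise ValueError(f"no parser registered for {content_type!r}")
    title, text = parser(content)
    return title, segment(text)
```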

Wherein: the method of extracting keywords is:

(1) Obtain the word-segmented form of the content;

(2) Count the occurrences of each word;

(3) Remove words whose occurrence count is too high or too low;

(4) Score each word according to the vocabulary;

(5) Take the several highest-scoring words as the keywords of this content.
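A minimal sketch of these five steps follows. The patent does not give the scoring formula, so the one used here (vocabulary score multiplied by occurrence count) is an assumption, as are the function name `extract_keywords` and the threshold parameters `min_count` and `max_count`.

```python
from collections import Counter


def extract_keywords(words, vocab_scores, top_n=3, min_count=1, max_count=5):
    """Return up to top_n keywords from a segmented document.

    vocab_scores maps a word to its score in the vocabulary; words that
    are missing from the vocabulary score 0.
    """
    counts = Counter(words)                                   # step (2)
    kept = [w for w, c in counts.items()
            if min_count <= c <= max_count]                   # step (3)
    # Step (4): assumed formula, vocabulary score times frequency.
    scored = {w: vocab_scores.get(w, 0.0) * counts[w] for w in kept}
    # Step (5): highest score first; ties broken alphabetically.
    ranked = sorted(scored, key=lambda w: (-scored[w], w))
    return ranked[:top_n]
```

The upper bound plays the role of the common-word filter: very frequent words are dropped before scoring, whatever their vocabulary score.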

Wherein: the method of extracting the content summary is:

(1) Obtain the word-segmented form of the content;

(2) Take out the sentences that contain keywords;

(3) Score all the words in each sentence;

(4) Use the total score of all words in a sentence as the score of that sentence;

(5) Sort all sentences by score from high to low;

(6) Output the first sentence as the summary;

(7) Has the summary reached the required number of words? If so, end;

(8) If not, add the next sentence to the summary.
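The summary procedure can be sketched as below, assuming each sentence is already a list of segmented words and that the length requirement is counted in words. The function name `extract_summary` and the parameter `min_words` are hypothetical.

```python
def extract_summary(sentences, keywords, vocab_scores, min_words=10):
    """Pick the highest-scoring keyword-bearing sentences as a summary.

    sentences is a list of word lists; the result is a list of the
    chosen sentences, best first.
    """
    wanted = set(keywords)
    candidates = [s for s in sentences if wanted & set(s)]        # step (2)

    def sentence_score(sentence):                                 # steps (3)-(4)
        return sum(vocab_scores.get(w, 0.0) for w in sentence)

    ranked = sorted(candidates, key=sentence_score, reverse=True)  # step (5)

    summary, length = [], 0
    for sentence in ranked:                                       # steps (6)-(8)
        summary.append(sentence)
        length += len(sentence)
        if length >= min_words:
            break
    return summary
```

Because `sorted` is stable, sentences with equal scores keep their original document order.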

Wherein: the working method of automatic classification is:

(1) Obtain the content to be classified;

(2) Extract the classification features of this content;

(3) Compare them with the features of the existing categories and find all matching categories;

(4) Output the matching categories;

(5) End.
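One way to read these classification steps is as a feature-overlap test against a classification feature library. The sketch below assumes that the features are simply the set of words in the content and that a category matches when at least `min_overlap` of its feature words appear; both choices, and the name `classify`, are assumptions made for illustration.

```python
def classify(words, category_features, min_overlap=2):
    """Return every category whose features overlap the content enough.

    category_features maps a category name to a set of feature words.
    A piece of content may match more than one category.
    """
    features = set(words)                                   # step (2)
    matches = [name for name, wanted in category_features.items()
               if len(features & wanted) >= min_overlap]    # step (3)
    return sorted(matches)                                  # step (4)
```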

Wherein: the working method of the clusterer is:

(1) Obtain the word-segmented form of the content to be clustered;

(2) Vectorize this content according to the vocabulary;

(3) Find the existing cluster whose center vector forms the smallest angle with the content vector and whose angle measure exceeds the minimum value;

(4) Is there an existing cluster whose angle measure exceeds the minimum value?

(5) If so, add the content to this cluster and update the center of this cluster;

(6) If not, create a new cluster with the vector of this content as its center;

(7) End.
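The clustering steps resemble single-pass incremental clustering. In the sketch below, the "angle" criterion is interpreted as cosine similarity that must reach a minimum threshold; that interpretation, the running-mean center update, and the names `vectorize`, `cosine` and `assign_cluster` are assumptions rather than details given in the patent.

```python
import math


def vectorize(words, vocab):
    """Step (2): term-count vector over a fixed vocabulary list."""
    return [float(words.count(term)) for term in vocab]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_cluster(vec, clusters, min_similarity=0.5):
    """Steps (3)-(6): join the closest cluster or start a new one.

    clusters is a list of {"center": [...], "size": int} dicts and is
    modified in place. Returns the index of the chosen cluster.
    """
    best, best_sim = None, min_similarity
    for i, cluster in enumerate(clusters):                 # step (3)
        sim = cosine(vec, cluster["center"])
        if sim >= best_sim:
            best, best_sim = i, sim
    if best is None:                                       # step (6)
        clusters.append({"center": list(vec), "size": 1})
        return len(clusters) - 1
    cluster = clusters[best]                               # step (5)
    n = cluster["size"]
    cluster["center"] = [(c * n + x) / (n + 1)
                         for c, x in zip(cluster["center"], vec)]
    cluster["size"] = n + 1
    return best
```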

Wherein: the method of sensor detection is:

(1) Obtain the content to be detected by the sensors;

(2) Send the content to each sensor;

(3) Keyword sensor 1, keyword sensor 2, similar-content sensor or extension sensor;

(4) For keyword sensor 1, raise an alarm if the content contains the specified keyword;

(5) For keyword sensor 2, raise an alarm if the content contains the specified keyword;

(6) For the similar-content sensor, raise an alarm if the content is similar content;

(7) For other extension sensors, raise an alarm if the alarm condition is satisfied;

(8) Collect the alarm outputs;

(9) End.
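The sensor fan-out can be sketched with sensors modeled as callables that return an alarm string or `None`. The similarity test used here (Jaccard overlap of word sets against a reference document) is an assumption, since the patent does not say how similar content is detected; the factory names are hypothetical as well.

```python
def make_keyword_sensor(keyword):
    """Sensor that alarms when the content contains the given keyword."""
    def sensor(words):
        return f"keyword:{keyword}" if keyword in words else None
    return sensor


def make_similar_content_sensor(reference_words, min_jaccard=0.5):
    """Sensor that alarms when content resembles a reference document."""
    reference = set(reference_words)

    def sensor(words):
        current = set(words)
        union = reference | current
        similarity = len(reference & current) / len(union) if union else 0.0
        return "similar-content" if similarity >= min_jaccard else None
    return sensor


def run_sensors(words, sensors):
    """Steps (2)-(8): send the content to every sensor, collect alarms."""
    alarms = [sensor(words) for sensor in sensors]
    return [alarm for alarm in alarms if alarm is not None]
```

An extension sensor is just one more callable in the `sensors` list, which matches the idea that sensors can be extended or replaced.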

Wherein: the method of building or updating the index for content is:

(1) Obtain the word-segmented form of the content to be indexed;

(2) Build the Term list of this content;

(3) Build the mapping between this content and these Terms;

(4) Save or update the Terms and the mapping;

(5) End.
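The Term-to-content mapping described above is commonly realized as an inverted index. The in-memory sketch below is one possible shape; the class name `InvertedIndex` and its attributes are hypothetical, and a real deployment would persist the data in the storage component.

```python
from collections import defaultdict


class InvertedIndex:
    """Maps each term to the ids of the content items that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of content ids
        self.terms_of = {}                # content id -> set of its terms

    def add_or_update(self, content_id, words):
        terms = set(words)                                    # step (2)
        # On update, drop mappings for terms the content no longer has.
        for stale in self.terms_of.get(content_id, set()) - terms:
            self.postings[stale].discard(content_id)
            if not self.postings[stale]:
                del self.postings[stale]
        for term in terms:                                    # step (3)
            self.postings[term].add(content_id)
        self.terms_of[content_id] = terms                     # step (4)
```

Keeping the reverse map `terms_of` makes an update cheap, because only the terms that disappeared need to be cleaned up.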

Wherein: the method of querying the index is:

(1) Obtain the query request to be retrieved;

(2) Decompose the query request into Terms;

(3) Find the related content according to the mapping between Terms and content;

(4) Output the content that meets the requirements;

(5) End.
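A query against such a Term mapping can be sketched as follows. The patent does not state how multiple Terms are combined, so requiring every Term to match (AND semantics) is an assumption here, as is the regular-expression tokenizer and the function name `search`.

```python
import re


def search(query, postings):
    """Return the sorted ids of content that contains every query term.

    postings maps a term to the set of content ids containing it.
    """
    terms = re.findall(r"\w+", query.lower())        # step (2)
    if not terms:
        return []
    result = None
    for term in terms:                               # step (3)
        ids = postings.get(term, set())
        result = set(ids) if result is None else result & ids
    return sorted(result)                            # step (4)
```

For instance, with `postings = {"web": {"a", "b"}, "crawler": {"a"}}`, the query `"web crawler"` returns `["a"]`.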

In general, qualified content is queried through the index that has been built. This process comprises all or some of the following steps:

1. Obtain content on the internet in a certain order;

2. Clean the obtained content, which includes removing duplicate content, removing advertisements that may be present, and removing useless content;

3. Analyze the cleaned content to obtain an expression that represents the meaning of the content. This expression may be, but is not limited to: the parsed text; or a feature set extracted from an image, music or film;

4. Extract keywords from the content according to the content and the expression, taking a specified number of keywords;

5. Summarize the content according to the content and the expression, obtaining a comparatively brief summary;

6. Classify the content according to the content and the expression; if a category represents a topic or theme, the content belongs to that topic or theme;

7. Cluster the content according to the content and the expression: either gather a set of content into a specified number of clusters, or compare the content with the clusters already formed and add it to one of them or form a new cluster;

8. Submit the content to the sensors;

9. Evaluate the content and determine the time at which to check next whether it has been updated;

10. Store the content and additional information (including keywords, themes, clusters, sensor output, etc.);

11. Build an index for the content;

12. After a certain time, check whether the content has changed, and update the stored content, index and additional information.

The extensibility of the present invention is embodied in the following aspects:

The algorithms in the algorithm library can be customized and replaced at run time;

The storage can be customized and replaced at run time to suit various storage needs, such as files and databases;

The event monitor can be customized and replaced at run time, and can pass events to a real-time monitoring component or record them to a designated storage component;

The downloaders can be extended to support more transport protocols;

The parsers can be extended to parse content in more formats;

The sensors can be extended and replaced at run time, so as to react to specific content;

The indexer can be extended to support more index storage modes, such as files and databases;

The crawler can be extended and replaced at run time to support more content acquisition strategies.

The intelligence of the present invention is embodied in the following aspects:

Content is cleaned before processing, to obtain better results;

Content keywords are extracted automatically, so that content can be associated through shared keywords;

Content is classified, to support category, theme or vertical search;

Content is clustered, to reduce duplicate content in the results and to merge related content into the same entry;

During indexing, the sensors can react to relevant content immediately;

The content acquisition strategy is intelligent: depending on the situation, content can be updated immediately, or the interval between update checks can be calculated according to the value of the content.

The advantages of the present invention are:

1. Versatility. The method and system are applicable to the indexing needs of various networks, and more functions can be realized by adjusting or replacing components.

2. Intelligence, as described above.

3. Extensibility, as described above.

Description of drawings

Fig. 1 is a logical block diagram of the present invention, showing the general structure of the system; the components in it are not limited to any particular technology or form of implementation.

Fig. 2 is a structural diagram of one possible physical deployment of the system, showing its deployment architecture in a distributed environment.

Fig. 3 is the general flow chart of the indexing method, showing the processing steps of the system.

Fig. 4 is a flow chart of one downloader, showing one way in which the system downloads content.

Fig. 5 is a flow chart of one parser, showing one way in which the system parses content.

Fig. 6 is a flow chart of one keyword extractor, showing one way in which the system extracts keywords.

Fig. 7 is a flow chart of one summarizer, showing one way in which the system extracts a content summary.

Fig. 8 is a flow chart of one classifier, showing one way in which the system classifies content automatically.

Fig. 9 is a flow chart of one clusterer, showing one way in which the system clusters content automatically.

Fig. 10 is a flow chart of one sensor arrangement, showing one way in which the system performs sensor detection on content.

Fig. 11 is a flow chart of building and updating the index, showing one way in which the system builds or updates the index for content.

Fig. 12 is a flow chart of querying the index, showing one way in which the system processes a query against the index.

Detailed description of the embodiments

Referring to the flow shown in Fig. 3, in the present embodiment the system keeps a loop running after startup and ends when there is no more content that needs processing. This flow is implemented on the system shown in Fig. 1 and is described in detail below.

As shown in Fig. 1, the system consists of three logical levels: logic layer 101, function layer 102 and base layer 103, wherein: the base layer 103 provides a storage 113, an algorithm library 112 and an event monitor 111; the function layer 102 has sensors 110, parsers 109, cleaners 108 and downloaders 107; the logic layer 101 consists of a web crawler 104, a searcher 105 and an indexer 106. The base layer 103 provides the infrastructure support for running the system, the function layer 102 implements the low-level functions of the system, and the logic layer 101 implements its high-level functions. These three levels serve only to make the relationships between the modules easier to understand and have no effect on the behavior or structure of the system. The data files comprise all the data files the system needs in order to run. In the present embodiment there are three: the vocabulary, which stores the score values used to score words when extracting keywords and summaries; the common-word dictionary, which holds the very high-frequency everyday words that must be removed when extracting keywords; and the classification feature library, which stores the features of each category.

Referring to Fig. 3: after startup, the system determines whether there is content that needs processing. The determination is based on the data held in the storage, which may be retrieved through the searcher.

If there is content to process, the flow enters step 301, obtain content, in which the system downloads the specified content by calling a suitable downloader. As shown in Fig. 4, the present embodiment determines the transport protocol from the URI of the content and selects the corresponding downloader; each downloader downloads all or part of the content in its own way. For example, for the content at "http://www.sina.com.cn", the system decides from the URI to select the HTTP downloader and returns the resulting HTML string as the content.

After all or part of the content has been obtained, the system checks whether the content has been updated since its last visit. If it has, the content is downloaded completely and passed to step 302, clean content; otherwise the flow goes to step 309.

Step 302, clean content, uses the cleaners to clean the content passed on from step 301 and passes the result to step 303, parse content. Several cleaners may be called, and a mixed strategy may be used to clean various kinds of material, such as advertisements and formatting information. For example, for the content at "http://www.sina.com.cn", the system will attempt to remove all advertisements.

Step 303, parse content, calls a suitable parser to extract a meaningful expression from the content. This expression may be, but is not limited to: the parsed text; or a feature set extracted from an image, music or film. Different parsers handle content in different formats. As shown in Fig. 5, after extraction the content is segmented into words for further processing by the subsequent steps. For example, for "http://www.sina.com.cn", the system removes all HTML tags and extracts the string inside the Title tag as a candidate title for the content.

Step 304, extract keywords, calls the keyword extractor in the algorithm library to select from the content suitable words that are relevant to it, as keywords, as shown in Fig. 6. The keywords are passed to the subsequent steps as additional information of the content.

Step 305, extract summary, calls an algorithm in the algorithm library to extract a representative part of the content as the summary, as shown in Fig. 7. The summary is passed to the subsequent steps as additional information of the content.

Step 306, automatic classification, calls an algorithm in the algorithm library to classify the content and associate it with the relevant categories, of which there may be more than one, as shown in Fig. 8. The categories are passed to the subsequent steps as additional information of the content.

Step 307, automatic clustering, calls an algorithm in the algorithm library to gather related content into one cluster, as shown in Fig. 9. The cluster information is passed to the subsequent steps as additional information of the content.

Step 308, sensor detection, submits the content to all sensors in the system. Each sensor that reacts notifies the event monitor 111 of the relevant information about this content, as shown in Fig. 10.

Step 309, calculate the next update time, calls an algorithm in the algorithm library to estimate the time of the next change from the record of past updates of this content. Step 310, store content and additional information, stores the content itself and the additional information produced in the preceding steps into the storage. Step 311, build or update the index, queries through the searcher: if the content is not yet in the index it is added, otherwise the existing index entry is updated; both adding and updating are carried out by the indexer. Step 312, wait for the specified time: at the specified time, usually the next update time calculated in step 309, the web crawler obtains this content again and checks whether it has been updated.

In short: step 309 calculates the next update time; step 310 stores the content and additional information; step 311 adds to or updates the index; step 312 waits for the specified time.
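Step 309 estimates the next change from the record of past updates, but the text does not specify the estimator. One simple possibility, shown here purely as an assumption, is to take the mean interval between past updates and add it to the current time. The function name `next_update_time` and the fallback `default_interval` are hypothetical.

```python
from datetime import datetime, timedelta


def next_update_time(update_history, now, default_interval=timedelta(days=1)):
    """Estimate when to check the content again.

    update_history is a list of datetimes at which the content was seen
    to change. With fewer than two entries there is no interval to
    average, so a default interval is used instead.
    """
    if len(update_history) < 2:
        return now + default_interval
    ordered = sorted(update_history)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    mean_gap = sum(gaps, timedelta()) / len(gaps)
    return now + mean_gap
```

Content that has changed every two to four days would thus be rechecked about three days from now.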

Fig. 2 shows one physical deployment of the system. There may be multiple application servers, and the storage may be distributed, divided into different servers or server clusters according to the information stored; several programs may also be deployed on the same server as circumstances require. In the present embodiment the storage is divided into four clusters: content server 202, which stores the content and additional information; index server 203, which stores the index of the content; host storage 204, which stores all hosts in the network that have been visited; and website storage 205, which stores all websites in the network that have been visited, where a website refers to a set of content. Monitor 206 works together with the event monitor in the system to obtain relevant information from the system.

Claims

1. An extensible intelligent internet indexing system, characterized in that: the system consists of a base layer, a function layer, a logic layer and data files, wherein: the base layer provides a storage, an algorithm library and an event monitor; the function layer has sensors, parsers, cleaners and downloaders; the logic layer consists of a web crawler, a searcher and an indexer.

2. The indexing method of the extensible intelligent internet indexing system according to claim 1, characterized in that the steps of the method are:

(1) Is there content that needs processing? If not, end;

(2) If there is content to process, obtain the content;

(4) If this content has been updated, clean this content;

(5) Parse this content;

(6) Extract keywords;

(7) Extract a summary;

(8) Classify automatically;

(9) Cluster automatically;

(10) Submit to the sensors for detection;

(11) Calculate the next update time;

(12) Store the content and additional information;

(13) Add to or update the index;

(14) Wait for the specified time;

(15) Return to (1).

3. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the steps of obtaining the content that needs processing are:

(1) Obtain the URI of the content to be fetched;

(2) Analyze the URI in order to select a suitable downloader;

(4) Download and save all or part of the content;

(5) End.

4. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the steps of parsing the content are:

(1) Obtain the cleaned content;

(2) Select a suitable parser according to the content type;

(5) Segment the content into words;

(6) End.

5. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the steps of extracting keywords are:

(1) Obtain the word-segmented form of the content;

(2) Count the occurrences of each word;

(3) Remove words whose occurrence count is too high or too low;

(4) Score each word according to the vocabulary;

(5) Take the several highest-scoring words as the keywords of this content.

6. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the steps of extracting the content summary are:

(1) Obtain the word-segmented form of the content;

(2) Take out the sentences that contain keywords;

(3) Score all the words in each sentence;

(4) Use the total score of all words in a sentence as the score of that sentence;

(5) Sort all sentences by score from high to low;

(6) Output the first sentence as the summary;

(7) Has the summary reached the required number of words? If so, end;

(8) If not, add the next sentence to the summary.

7. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the working steps of automatic classification are:

(1) Obtain the content to be classified;

(2) Extract the classification features of this content;

(4) Output the matching categories;

(5) End.

8. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the working steps of the clusterer are:

(1) Obtain the word-segmented form of the content to be clustered;

(2) Vectorize this content according to the vocabulary;

(4) Is there an existing cluster whose angle measure exceeds the minimum value?

(5) If so, add the content to this cluster and update the center of this cluster;

(7) End.

9. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the working steps of sensor detection are:

(1) Obtain the content to be detected by the sensors;

(2) Send the content to each sensor;

(4) For keyword sensor 1, raise an alarm if the content contains the specified keyword;

(5) For keyword sensor 2, raise an alarm if the content contains the specified keyword;

(6) For the similar-content sensor, raise an alarm if the content is similar content;

(7) For other sensors, raise an alarm if the alarm condition is satisfied;

(8) Collect the alarm outputs;

(9) End.

10. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the working steps of building or updating the index for content are:

(1) Obtain the word-segmented form of the content to be indexed;

(2) Build the Term list of this content;

(3) Build the mapping between this content and these Terms;

(4) Save or update the Terms and the mapping;

(5) End.

11. The indexing method of the extensible intelligent internet indexing system according to claim 2, characterized in that the working steps of querying the index are:

(1) Obtain the query request to be retrieved;

(2) Decompose the query request into Terms;

(3) Find the related content according to the mapping between Terms and content;

(4) Output the content that meets the requirements;

(5) End.