CN107220367A - Internet data full-text search method - Google Patents


Info

Publication number
CN107220367A
Authority
CN
China
Prior art keywords
data
affairs
unit
chinese
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710432267.1A
Other languages
Chinese (zh)
Inventor
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd filed Critical BEIJING BLTSFE INFORMATION TECHNOLOGY Co Ltd
Priority to CN201710432267.1A priority Critical patent/CN107220367A/en
Publication of CN107220367A publication Critical patent/CN107220367A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an internet data full-text search method. The method includes: crawling basic data from internet sites with the acquisition module of a search engine and storing it in a distributed manner; and analyzing and processing the collected basic data with the analysis module of the search engine, building a keyword index so that users can search. The method collects data efficiently by means of a transaction control strategy and performs data mining on the coupling relationships among multi-dimensional objects.

Description

Internet data full-text search method
Technical field
The present invention relates to data retrieval, and more particularly to an internet data full-text search method.
Background art
With the continuous development of Web technologies, network information resources are growing at a geometric rate. How to quickly retrieve the data that is useful and relevant to a user from the massive information on the internet has become an urgent problem. Search engines grew out of information retrieval technology. A search engine helps express and store the basic information of the real world and, through analysis of the link information it holds, serves as a useful tool for mining hidden information. Existing search engines rely solely on a limited set of search terms to express the user's need, so the need is expressed incompletely. Even for the same search term, different users may expect different results. Take a microblog system: considering the relationships between microblogs and the objects they interact with, it can be abstracted as a heterogeneous network containing nodes such as microblogs, messages, tags and users. Follow and follower relationships exist between microblogs, posting and forwarding relationships between microblogs and messages, an inclusion relationship between microblogs and tags, and an ownership relationship between users and microblogs. Existing search tools do not take the complex environment formed by such multi-dimensional objects into account when mining data.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes an internet data full-text search method, including:
crawling basic data from internet sites with the acquisition module of a search engine and storing it in a distributed manner;
analyzing and processing the collected basic data with the analysis module of the search engine, and building a keyword index so that users can search.
Preferably, the acquisition module includes a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database;
The transaction management controller creates, starts, controls the running of and destroys multiple crawl transactions. Each transaction has its own independent container for managing the transaction's resources. The data controller handles data exchange between the program and the database and includes a global data source cache unit, a data scheduling unit and a data access management unit. The global crawl cache unit handles the waiting of multiple transactions when they access critical resources; each web crawler has exactly one global crawl cache unit instance. The data access management unit handles data interaction between the database and the program. The data scheduling unit schedules the crawl sites of individual transactions: when a transaction's crawl cache unit holds no crawl sites, the data scheduling unit takes several crawl sites from the global crawl cache unit and places them in that transaction's crawl cache unit. There is exactly one data scheduling unit instance in the whole program.
Preferably, the transaction container further includes:
a crawl-site cache unit, which sets up a queue in memory to cache the site data the transaction is to crawl;
a transaction cache unit, which caches the transaction's own data;
a storage cache unit, which caches the data to be stored in the database;
a collection transaction processing unit, which loads collected data and performs data updating, link deduplication and storage;
a data cleaning and extraction unit, which cleans the collected code, extracts effective information, obtains information relevant to page quality rating and obtains new crawl sites found in the page;
a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses the data and composes the database query string to be executed.
Preferably, the effective information includes the page's title, keywords, abstract and body text, the font types that appear in the page, and media feature information.
Preferably, building the keyword index by the analysis module further includes:
Keywords are extracted from the text in sequence. Consecutive digits or consecutive Chinese numeral characters are treated as a single keyword. English letters are split wherever a non-letter character such as a space is encountered. A sentence composed of Chinese characters is handled in the following order: (1) if the characters form a run of Chinese numerals, the run is kept together; (2) if three consecutive Chinese characters are each independent and do not form a phrase, the three are grouped as one new phrase; (3) if neither (1) nor (2) applies, the text is segmented with a dictionary-based algorithm. Any other computer symbol that is not a digit, an English letter or a simplified Chinese character is treated as a special character, and each special character is one keyword.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes an internet data full-text search method that collects data efficiently by means of a transaction control strategy and performs data mining on the coupling relationships among multi-dimensional objects.
Brief description of the drawings
Fig. 1 is a flow chart of the internet data full-text search method according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principle of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practised according to the claims without some or all of them.
One aspect of the present invention provides an internet data full-text search method. Fig. 1 is a flow chart of the internet data full-text search method according to an embodiment of the present invention.
The search engine of the present invention can be divided into an acquisition module and an analysis module. The acquisition module includes a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database. The transaction management controller creates, starts, controls the running of and destroys multiple crawl transactions. Each transaction has its own independent container for managing the transaction's resources, which specifically includes: a crawl-site cache unit, which sets up a queue in memory to cache the site data the transaction is to crawl; a transaction cache unit, which caches the transaction's own data; a storage cache unit, which caches the data to be stored in the database; a collection transaction processing unit, which loads collected data and performs data updating, link deduplication and storage; a data cleaning and extraction unit, which cleans the collected code, extracts effective information (the page's title, keywords, abstract and body text, the font types that appear in the page, and media feature information), obtains information relevant to page quality rating and obtains new crawl sites found in the page; and a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses the data and composes the database query string to be executed. The data controller handles data exchange between the program and the database and includes a global data source cache unit, a data scheduling unit and a data access management unit. The global crawl cache unit handles the waiting of multiple transactions when they access critical resources and reduces the number of database accesses made by the transactions. Each web crawler has exactly one global crawl cache unit instance. The data access management unit handles data interaction between the database and the program. The data scheduling unit schedules the crawl sites of individual transactions: when a transaction's crawl cache unit holds no crawl sites, the data scheduling unit takes several crawl sites from the global crawl cache unit and places them in that transaction's crawl cache unit. There is exactly one data scheduling unit instance in the whole program.
The web crawlers of search engine operationally, reading program configuration file first, and will when preloading caching collection The data used;Task manager.According to configuration information, each affairs is initialized, and control the operation of affairs;At affairs acquisition Reason task, first carries out crawling link duplicate removal inspection, analysis crawls the type of link, performs different grab types at different places Reason mode, in collection, analysis is new collection affairs or more new task, and after the webpage source code of link is got, it is right The webpage source code collected performs cleaning, filtering, according to info web correlated characteristic rule, extracts effective information;Affairs pair The information extracted carries out conversion process, is cached;When caching data to be saved reach certain amount, affairs perform caching Data loading processing;Task manager timing simultaneously monitors the execution state of each affairs, and management is controlled to abnormal transaction.
Before crawling, all possible combination domain name is traveled through according to domain name create-rule successively, combination domain name is carried out Detect successively, identification effective domain name and invalid domain name, set up rhizosphere name storehouse;Then the webpage source code of navigation website is obtained, according to Rhizosphere name composition rule extracts root site address and link text from webpage source code, updates rhizosphere name storehouse.
The global collection transaction scheduling unit of each web crawlers only one of which.When the save buffer that crawls of affairs is sky When, transactions requests or wait data scheduling unit crawl website from overall situation collection caching website or database transmission.Transaction scheduling Process is specially:Affairs obtain data scheduling unit control authority;Whether be empty, be such as sky if judging global collection caching website, Then a number of crawl is obtained from database and be cached to global collection caching source, if being not sky, then crawl website from the overall situation and obtain Take it is a number of crawl to Current transaction crawl buffer unit.
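The scheduling process described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the class name `GlobalCrawlCache`, the `fetch_batch` callable that stands in for the crawl-site database, and the use of a lock to model "obtaining control" of the scheduling unit are all assumptions. Unlike the text, the sketch also hands out sites immediately after a refill.

```python
import threading
from collections import deque


class GlobalCrawlCache:
    """One shared cache of crawl sites per crawler, refilled from the database in batches."""

    def __init__(self, fetch_batch, batch_size=100):
        self._fetch_batch = fetch_batch   # hypothetical callable: n -> list of up to n sites
        self._batch_size = batch_size
        self._sites = deque()
        self._lock = threading.Lock()     # stands in for "obtaining control" of the scheduler

    def take(self, count):
        """Move up to `count` crawl sites into the calling transaction's own cache."""
        with self._lock:
            if not self._sites:
                # Global cache is empty: refill it with a single batched database read.
                self._sites.extend(self._fetch_batch(self._batch_size))
            taken = []
            while self._sites and len(taken) < count:
                taken.append(self._sites.popleft())
            return taken


# Usage with an in-memory list standing in for the crawl-site database.
pending = [f"http://site{i}.example" for i in range(5)]


def fetch_batch(n):
    batch = pending[:n]
    del pending[:n]
    return batch


cache = GlobalCrawlCache(fetch_batch, batch_size=3)
first = cache.take(2)    # triggers one database read of 3 sites, hands out 2
second = cache.take(2)   # served from the global cache, where only 1 site is left
```

Holding a single lock around both the refill and the hand-out keeps two transactions from refilling the cache at the same time, which is one plausible reading of the single-instance requirement.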
For a single transaction, the collection flow is as follows:
1. The transaction takes a non-empty collection transaction object from the transaction queue. If it gets an empty object, transaction scheduling is performed.
2. Determine whether the depth of the collection transaction exceeds the maximum depth. The transaction obtains from the current collection transaction object the collection depth at which that object sits. If the collection depth exceeds the site collection depth configured in the system, collection for the current transaction ends. If it does not, the transaction continues with step 3.
3. Determine the type of the collection transaction. If it is a web page collection transaction, go to step 4; otherwise go to step 5.
4. Determine whether the link is a new web page or an unfinished web page link. If the access address of this collection transaction is not in the history crawl library, the page is collected as a newly discovered page, i.e. step 7. If it is in the history crawl library, the last collection information for this page address is read from the library: access address, access time, page size, update frequency and root domain name. It is then computed whether the interval between the last access time and the current access time exceeds the update frequency. If not, the page is not collected and collection ends. If so, the current content size of the page is compared with the content size at the last visit; if they are equal the page is not collected, and if they differ the flow continues with step 6.
5. If the link is a media or file link, the corresponding document collection processing is performed; if it is an illegal link, the link is recorded as abnormal.
6. Obtain the page source of this link, update the collection information for this page address in the history access library, and go to step 8.
7. Collect the new page: obtain the page source of this link and add an access record for this page address to the history access library.
8. Perform page cleaning and extraction. This step extracts the specified feature information from the page source: it removes useless information and noise from the source and then extracts the required information from the cleaned data. Further, according to a text similarity algorithm, the title, keywords and descriptive text defined in the page, together with the headings, body text, links and media resources in the page, are extracted for use in ranking. During cleaning, the transaction's collection program first obtains the page code of the current transaction, then starts and initializes the cleaner. Style code and comment code are removed from the page code; script code is removed as well, and the script information is used at the same time to identify whether the page contains media files, which are saved if present. The extracted information is compressed, converted into a form that is easy to store, and stored.
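The depth check of step 2 and the new-page or refresh decision of step 4 can be sketched as a single function. This is an illustration under stated assumptions: the names `CrawlRecord` and `decide` are hypothetical, the "update frequency" is modelled as a number of seconds between refreshes, and the current page size is passed in as a parameter rather than fetched over the network.

```python
from dataclasses import dataclass


@dataclass
class CrawlRecord:
    """Last collection information kept in the history crawl library (fields from step 4)."""
    url: str
    last_access: float      # seconds since the epoch
    page_size: int          # content size in bytes at the last visit
    update_interval: float  # "update frequency", modelled here as seconds between refreshes
    root_domain: str


def decide(url, depth, max_depth, history, now, current_size):
    """Return 'stop', 'collect_new', 'skip' or 'refresh' for one web page link."""
    if depth > max_depth:
        return "stop"                  # step 2: configured site depth exceeded
    record = history.get(url)
    if record is None:
        return "collect_new"           # step 4 -> step 7: address not in the history library
    if now - record.last_access <= record.update_interval:
        return "skip"                  # refresh interval has not been exceeded yet
    if current_size == record.page_size:
        return "skip"                  # same content size, treated as unchanged
    return "refresh"                   # step 4 -> step 6: re-fetch and update the record


history = {
    "http://a.example/": CrawlRecord("http://a.example/", 1000.0, 2048, 3600.0, "a.example"),
}
```

Comparing only the content size is a cheap change signal; a page that changes without changing size would be skipped, which the source accepts as a trade-off.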
To handle duplicate page links, each transaction is given a dedicated deduplication container, and each container stores only the mapping codes of the link addresses that its own transaction has accessed. A deduplication container only needs to record the links accessed under the same root domain site; page addresses that do not belong to this root domain are discarded. When the transaction starts collecting another root site, the container's access history is cleared and access records for the new root site are recorded afresh. The information collector sets a threshold on site crawl depth, so while each transaction runs, the memory actually occupied by its deduplication container can be controlled through the crawl depth threshold.
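A per-transaction deduplication container of this kind might look like the sketch below. The class name `DedupContainer` and its methods are hypothetical, and using a SHA-1 digest as the "mapping code" is an assumption; the source does not say how the code is computed.

```python
import hashlib
from urllib.parse import urlsplit


class DedupContainer:
    """Per-transaction record of links already visited under one root domain."""

    def __init__(self):
        self._root = None
        self._seen = set()

    def start_root(self, root_domain):
        """Switch to a new root site and clear the previous access history."""
        self._root = root_domain.lower()
        self._seen.clear()

    def should_visit(self, url):
        """Return True only for an unseen link that belongs to the current root domain."""
        host = (urlsplit(url).hostname or "").lower()
        if host != self._root and not host.endswith("." + self._root):
            return False               # discard: address is outside the current root domain
        code = hashlib.sha1(url.encode("utf-8")).digest()  # assumed form of the "mapping code"
        if code in self._seen:
            return False               # duplicate link
        self._seen.add(code)
        return True


container = DedupContainer()
container.start_root("example.com")
```

Storing fixed-size digests instead of full URLs bounds the memory used per link, and clearing the set on each root change keeps the container small, as the paragraph above describes.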
To reduce transaction waiting, the present invention uses a multi-layer cache structure, with the size of each layer configured according to the computer's memory. The first layer is the global crawl cache: when connecting to the crawl database, a batch of crawl sites is fetched at once and cached. The second layer is each transaction's own crawl cache: every transaction has its own cache region for collection data sources. The third layer caches data that a transaction produces during processing, including the page and media link addresses already accessed, which are cached for the link deduplication check. The last layer caches data to be saved: only after the data to be saved reaches a certain amount does the transaction write it to the database.
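The last cache layer, which batches records before they are written to the database, can be sketched as follows. The class name `StoreBuffer`, the `save_batch` callable that stands in for the database write, and the default threshold are assumptions for illustration only.

```python
class StoreBuffer:
    """Buffers records to be saved and writes them to the database in batches."""

    def __init__(self, save_batch, threshold=100):
        self._save_batch = save_batch  # hypothetical callable that persists a list of records
        self._threshold = threshold
        self._pending = []

    def add(self, record):
        """Queue one record; write the whole batch once the threshold is reached."""
        self._pending.append(record)
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Write whatever is pending, e.g. when the transaction finishes."""
        if self._pending:
            self._save_batch(list(self._pending))
            self._pending.clear()


# Usage with a list standing in for the database: each element is one batched write.
writes = []
buffer = StoreBuffer(writes.append, threshold=3)
for page in ["p1", "p2", "p3", "p4"]:
    buffer.add(page)
```

After the loop, one batch of three records has been written and `p4` is still pending until the next threshold or an explicit `flush()`.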
The analysis module of the search engine analyzes and processes the basic text and media data returned by collection and builds a keyword index to support search by the retrieval system. When the analysis module extracts keywords from text, it processes the text in sequence. Consecutive digits or consecutive Chinese numeral characters are treated as a single keyword. English letters are split wherever a non-letter character such as a space is encountered. A sentence composed of Chinese characters is handled in the following order: (1) if the characters form a run of Chinese numerals, the run is kept together; (2) if three consecutive Chinese characters are each independent and do not form a phrase, the three are grouped as one new phrase; (3) if neither (1) nor (2) applies, the text is segmented with a dictionary-based algorithm. Any other computer symbol that is not a digit, an English letter or a simplified Chinese character is treated as a special character, and each special character is one keyword.
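A simplified sketch of these extraction rules is given below. Several points are assumptions rather than the patent's method: forward maximum matching is used as the "dictionary-based algorithm", digits and Chinese numerals are kept as separate runs, whitespace is dropped, and rule (2) about three independent characters is omitted for brevity. The function names and the tiny dictionary are hypothetical.

```python
CN_NUMERALS = set("零一二三四五六七八九十百千万亿")


def char_class(ch):
    """Classify one character according to the rules in the text."""
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch in CN_NUMERALS:
        return "cn_numeral"            # checked before the general Chinese range
    if ch.isascii() and ch.isalpha():
        return "letter"
    if "\u4e00" <= ch <= "\u9fff":
        return "chinese"
    if ch.isspace():
        return "space"
    return "special"


def segment_chinese(run, dictionary, max_len=4):
    """Forward maximum matching (an assumed stand-in for the dictionary algorithm)."""
    words, i = [], 0
    while i < len(run):
        for size in range(min(max_len, len(run) - i), 0, -1):
            piece = run[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def extract_keywords(text, dictionary):
    keywords, i = [], 0
    while i < len(text):
        cls = char_class(text[i])
        j = i + 1
        if cls != "special":           # each special character is its own keyword
            while j < len(text) and char_class(text[j]) == cls:
                j += 1
        run = text[i:j]
        if cls == "chinese":
            keywords.extend(segment_chinese(run, dictionary))
        elif cls != "space":
            keywords.append(run)
        i = j
    return keywords


dictionary = {"搜索", "引擎", "搜索引擎"}
keywords = extract_keywords("搜索引擎 v2 三十五页, ok!", dictionary)
```

Because numerals such as 一 are classified before ordinary Chinese characters, a word that contains a numeral is split at that character in this sketch.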
Keyword extraction from text further includes:
Loading the Chinese phrase dictionary. The text to be segmented is obtained from the segmentation object. The analysis position and its character are obtained, and it is determined whether the analysis position is the end of the text to be segmented. If it is the last position, the segmented text is extended with a delimiter and the last character of the text to be segmented to form the new segmented text, and segmentation of the text is complete. Otherwise, starting from the analysis position, the delimiter position is found. Once it is found, the characters between the analysis position and the delimiter position are cut out and appended, together with a delimiter, to the segmented text to form the updated segmented text.
The basic Chinese dictionary is stored as a two-level hash table. The first level uses the first Chinese character of a phrase as the key and another hash table as the value; that second-level hash table stores, for all phrases starting with the key, the remainder of the phrase after the first character. The second level uses the second Chinese character of the word as the key and a linked array as the value. The linked array stores the differing text sequences, starting from the third character, of the dictionary words that share the same first two characters.
Text is segmented on the basis of the dictionary by looking up whether a phrase exists in it. If the phrase exists in the dictionary, matching continues; if it does not, matching ends and the segmentation is made.
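The two-level hash storage and the dictionary lookup can be sketched with nested Python dictionaries. The function names are hypothetical, a plain list stands in for the "linked array", and returning the longest matching word is an assumed matching policy that the source does not spell out.

```python
def build_lexicon(words):
    """First character -> second character -> list of tails (text from the third character on)."""
    lexicon = {}
    for word in words:
        if len(word) < 2:
            continue                   # single characters need no dictionary entry here
        lexicon.setdefault(word[0], {}).setdefault(word[1], []).append(word[2:])
    return lexicon


def contains(lexicon, word):
    """Check whether a phrase of two or more characters exists in the dictionary."""
    if len(word) < 2:
        return False
    return word[2:] in lexicon.get(word[0], {}).get(word[1], [])


def longest_match(lexicon, text, start=0):
    """Return the longest dictionary word starting at `start`, or the single character."""
    best = text[start:start + 1]
    if start + 1 >= len(text):
        return best
    tails = lexicon.get(text[start], {}).get(text[start + 1], [])
    for tail in tails:
        candidate = text[start:start + 2] + tail
        if text.startswith(candidate, start) and len(candidate) > len(best):
            best = candidate
    return best


lexicon = build_lexicon(["搜索", "搜索引擎", "搜集", "数据库"])
```

Two hash lookups narrow the search to the words that share the first two characters, so only a short list of tails has to be compared against the text.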
Further, after a search returns a set of web pages, feature-based similarity graphs are constructed separately from the content features of the searches and of the web pages, and a search-webpage bipartite graph is built from the interest relationships between searches and web pages. Given category labels for a small number of searches and web pages, the categories of the unlabeled searches and web pages are predicted.
First a graph is built. Nodes represent sample data, edges represent the connections between samples as determined by the sample data and their relationships, and edge weights represent how close the connection between samples is. A search engine containing several different kinds of object can be expressed as $G = (V, E)$, where $V = Q \cup D$ is the set of vertices of different types and $E$ is the set of edges connecting the vertices. $Q$ is the set of searches and $D$ is the set of web pages. $E = E_{QQ} \cup E_{QD} \cup E_{DD}$, where $E_{QQ} = Q \times Q$, $E_{QD} = Q \times D$ and $E_{DD} = D \times D$. Let $G_Q = (Q, E_{QQ})$, $G_{QD} = (Q, D, E_{QD})$ and $G_D = (D, E_{DD})$; then $G = G_Q \cup G_{QD} \cup G_D$. Here $G_Q$ is the subgraph built from search nodes, $G_D$ is the subgraph built from web page nodes, and $G_{QD}$ is the subgraph built from search nodes and web page nodes according to their interest relationships.
The weight $w_{ij}$ of an edge is defined by the following distance-based function between nodes:
$$w_{ij} = \exp\left(-\frac{d(x_i, x_j)}{2\sigma^2}\right)$$
The weight applies when $x_i$ and $x_j$ are nodes within each other's $k$ nearest neighbours. Here $d(x_i, x_j)$ is the distance function $\lVert x_i - x_j \rVert$ and $\sigma$ is a tuning parameter. In text computation, the measure used is the cosine of the angle between the two text vectors $x_i$ and $x_j$.
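The edge weight can be computed as in the sketch below, which uses the Euclidean distance and assigns a weight only to pairs that are within each other's $k$ nearest neighbours. Setting all other weights to zero is an assumption, as are the function names; for text vectors the cosine measure mentioned above would replace `distance`.

```python
import math


def distance(x, y):
    """Euclidean distance ||x - y|| between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_weights(points, k, sigma):
    """Weight matrix with w_ij = exp(-d(x_i, x_j) / (2 * sigma^2)) for mutual k-neighbours."""
    n = len(points)
    dist = [[distance(points[i], points[j]) for j in range(n)] for i in range(n)]
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbours.append(set(order[:k]))
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbours[i]:
            if i in neighbours[j]:     # keep the edge only if the relation is mutual
                weights[i][j] = math.exp(-dist[i][j] / (2 * sigma ** 2))
    return weights


points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
weights = knn_weights(points, k=1, sigma=1.0)
```

In this example only the first two points are mutual nearest neighbours, so they share the single non-zero weight and the distant third point stays disconnected.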
The original graph structure is modified according to the discriminative information of the pre-labeled samples:
1. Construct $G_Q$, $G_D$ and $G_{QD}$. Compute $w_q$, the average weight of the edges between all nodes of $G_Q$, and $w_d$, the average weight of the edges between all nodes of $G_D$.
2. Suppose the labeled search data is divided into $c$ classes, expressed as $P_q = \{P_q^1, P_q^2, \ldots, P_q^c\}$, where $P_q^i$ is the core set of labeled searches of the $i$-th class. Let $M_q^i$ denote the pairwise constraint set of the $i$-th search class; if $x \in P_q^i$ and $y \in P_q^i$, then $(x, y)$ is added to $M_q^i$. Suppose the labeled web page data is divided into $c$ classes, expressed as $P_d = \{P_d^1, P_d^2, \ldots, P_d^c\}$, where $P_d^i$ is the core set of labeled web pages of the $i$-th class. Let $M_d^i$ denote the pairwise constraint set of the $i$-th web page class; if $x \in P_d^i$ and $y \in P_d^i$, then $(x, y)$ is added to $M_d^i$.
3. If a search sample pair $(q_l, q_m) \in M_q^i$, $q_k$ is a neighbour of $q_l$ and $q_m$, $w_{lk} < w_q$ and $w_{mk} > w_q$, then add $q_k$ to $P_q^i$ and add $(q_l, q_k)$ and $(q_m, q_k)$ to $M_q^i$.
4. Repeat step 3 until $P_q$ no longer changes.
5. If a web page sample pair $(d_l, d_m) \in M_d^i$, $d_k$ is a neighbour of $d_l$ and $d_m$, $w_{lk} < w_d$ and $w_{mk} > w_d$, then add $d_k$ to $P_d^i$ and add $(d_l, d_k)$ and $(d_m, d_k)$ to $M_d^i$.
6. Repeat step 5 until $P_d$ no longer changes.
7. If a search sample pair $(q_l, q_m) \in M_q^i$, modify $G_Q$ so that $w_{lm} = 1$; if $q_l \in P_q^i$, modify $G_Q$ so that $w_{lm} = 0$. If a web page sample pair $(d_l, d_m) \in M_d^i$, modify $G_D$ so that $w_{lm} = 1$; if $d_l \in P_d^i$, modify $G_D$ so that $w_{lm} = 0$.
8. If search $q_l \in P_q^i$ and web page $d_m \in P_d^j$, modify $G_{QD}$ so that $w_{lm} = 1$.
Here $w_{lk}$, $w_{mk}$ and $w_{lm}$ are the weights between searches $q_l$, $q_m$, $q_l$ and web pages $d_k$, $d_k$, $d_m$ respectively.
The above process enriches the relationships between searches and web pages, making the connections between nodes of the same class tighter and those between nodes of different classes looser, so that classification with the following procedure works better:
1. For classes $j \in \{1, \ldots, c\}$ and nodes $i \in \{1, \ldots, n\}$ in the above subgraphs, construct the $n \times c$ initial label matrix $Y$.
2. Construct the adjacency matrices $W_{QQ}$ and $W_{DD}$ of the homogeneous networks from the similarity measure between nodes of the same type, and construct the adjacency matrix $W_{QD}$ and its transpose $W_{DQ}$ from the relationships between heterogeneous nodes.
3. Construct the matrix $S$ used in step 4; its defining formula is not reproduced in this text.
4. Take $F(0) = Y$ and iterate $F(t+1) = \mu_\alpha S F(t) + (1 - \mu_\alpha) Y$, where $\mu_\alpha$ is a parameter in $(0, 1)$.
5. Let $F^*$ be the limit of the sequence $\{F(t)\}$; each node $v_i$ in the graph $G$ is then labeled according to $y_i = \arg\max_{j \le c} F^*_{ij}$.
In this iterative process, each node of the graph keeps propagating its label information to its neighbours until a stable state is reached.
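The iteration $F(t+1) = \mu_\alpha S F(t) + (1 - \mu_\alpha) Y$ can be sketched in pure Python as follows. Since the definition of $S$ is missing from the text, the sketch assumes the symmetric normalisation $S = D^{-1/2} W D^{-1/2}$ that is common in label propagation, with $D$ the diagonal degree matrix of the combined adjacency matrix $W$; this choice, the function names and the fixed iteration count are all assumptions.

```python
import math


def normalise(W):
    """Assumed normalisation S = D^(-1/2) W D^(-1/2); the source does not define S."""
    n = len(W)
    degree = [sum(row) for row in W]
    return [
        [
            W[i][j] / math.sqrt(degree[i] * degree[j])
            if degree[i] > 0 and degree[j] > 0 else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate(W, Y, mu=0.5, iterations=50):
    """Iterate F(t+1) = mu * S * F(t) + (1 - mu) * Y and return one class index per node."""
    S = normalise(W)
    n, c = len(Y), len(Y[0])
    F = [row[:] for row in Y]          # F(0) = Y
    for _ in range(iterations):
        F = [
            [mu * sum(S[i][k] * F[k][j] for k in range(n)) + (1 - mu) * Y[i][j]
             for j in range(c)]
            for i in range(n)
        ]
    return [max(range(c), key=lambda j: F[i][j]) for i in range(n)]


# Four nodes in a chain; the middle edge is weak, and only the two end nodes are labeled.
W = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
labels = propagate(W, Y)
```

Each unlabeled node ends up with the class of the labeled node it is most strongly connected to, which is the behaviour the paragraph above describes.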
In summary, the present invention proposes an internet data full-text search method that collects data efficiently by means of a transaction control strategy and performs data mining on the coupling relationships among multi-dimensional objects.
Those skilled in the art will understand that the modules and steps of the invention described above can be implemented on a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Alternatively, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The invention is therefore not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the invention serve only to illustrate or explain the principle of the invention and do not limit it. Any modification, equivalent substitution or improvement made without departing from the spirit and scope of the invention should therefore fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or within equivalents of such scope and boundaries.

Claims (5)

1. An internet data full-text search method, characterized by including:
crawling basic data from internet sites with the acquisition module of a search engine and storing it in a distributed manner;
analyzing and processing the collected basic data with the analysis module of the search engine, and building a keyword index so that users can search.
2. The method according to claim 1, characterized in that the acquisition module includes a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database;
the transaction management controller creates, starts, controls the running of and destroys multiple crawl transactions; each transaction has its own independent container for managing the transaction's resources; the data controller handles data exchange between the program and the database and includes a global data source cache unit, a data scheduling unit and a data access management unit; the global crawl cache unit handles the waiting of multiple transactions when they access critical resources, and each web crawler has exactly one global crawl cache unit instance; the data access management unit handles data interaction between the database and the program; the data scheduling unit schedules the crawl sites of individual transactions, and when a transaction's crawl cache unit holds no crawl sites, the data scheduling unit takes several crawl sites from the global crawl cache unit and places them in that transaction's crawl cache unit; there is exactly one data scheduling unit instance in the whole program.
3. The method according to claim 2, characterized in that the transaction container further includes:
a crawl-site cache unit, which sets up a queue in memory to cache the site data the transaction is to crawl;
a transaction cache unit, which caches the transaction's own data;
a storage cache unit, which caches the data to be stored in the database;
a collection transaction processing unit, which loads collected data and performs data updating, link deduplication and storage;
a data cleaning and extraction unit, which cleans the collected code, extracts effective information, obtains information relevant to page quality rating and obtains new crawl sites found in the page;
a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses the data and composes the database query string to be executed.
4. The method according to claim 3, characterized in that the effective information includes the page's title, keywords, abstract and body text, the font types that appear in the page, and media feature information.
5. The method according to claim 1, characterized in that building the keyword index by the analysis module further includes:
extracting keywords from the text in sequence; treating consecutive digits or consecutive Chinese numeral characters as a single keyword; splitting English letters wherever a non-letter character such as a space is encountered; handling a sentence composed of Chinese characters in the following order: (1) if the characters form a run of Chinese numerals, the run is kept together; (2) if three consecutive Chinese characters are each independent and do not form a phrase, the three are grouped as one new phrase; (3) if neither (1) nor (2) applies, the text is segmented with a dictionary-based algorithm; and treating any other computer symbol that is not a digit, an English letter or a simplified Chinese character as a special character, each special character being one keyword.
CN201710432267.1A 2017-06-09 2017-06-09 Internet data full-text search method Pending CN107220367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710432267.1A CN107220367A (en) 2017-06-09 2017-06-09 Internet data full-text search method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710432267.1A CN107220367A (en) 2017-06-09 2017-06-09 Internet data full-text search method

Publications (1)

Publication Number Publication Date
CN107220367A true CN107220367A (en) 2017-09-29

Family

ID=59947568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710432267.1A Pending CN107220367A (en) 2017-06-09 2017-06-09 Internet data full-text search method

Country Status (1)

Country Link
CN (1) CN107220367A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281697A (en) * 2014-10-15 2015-01-14 安徽华贞信息科技有限公司 Semantic-based hadoop system
CN105468744A (en) * 2015-11-25 2016-04-06 浪潮软件集团有限公司 Big data platform for realizing tax public opinion analysis and full text retrieval

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
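周庭安: "Research and Implementation of a Distributed Search Engine" (分布式搜索引擎研究与实现), China Master's Theses Full-text Database, Information Science and Technology Series *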

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190010A (en) * 2018-09-20 2019-01-11 河南智慧云大数据有限公司 Internet data acquisition system based on user-defined keyword acquisition mode
CN109190010B (en) * 2018-09-20 2021-05-11 河南智慧云大数据有限公司 Internet data acquisition system based on user-defined keyword acquisition mode

Similar Documents

Publication Publication Date Title
KR101122942B1 (en) New word collection and system for use in word-breaking
CN102053991B (en) Method and system for multi-language document retrieval
CN110543595B (en) In-station searching system and method
CN112256939B (en) Text entity relation extraction method for chemical field
Fujimura et al. Topigraphy: visualization for large-scale tag clouds
CN107239558A (en) Common interconnection network collecting method
CN106815307A (en) Public Culture knowledge mapping platform and its use method
CN101694658A (en) Method for constructing webpage crawler based on repeated removal of news
JP2009110513A (en) Automatic generation of ontologies using word affinities
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
CN111104801B (en) Text word segmentation method, system, equipment and medium based on website domain name
CN107506472B (en) Method for classifying browsed webpages of students
Geng et al. Evaluating web content quality via multi-scale features
CN110334343A (en) Method and system for extracting personal privacy information from a contract
CN109284441B (en) Dynamic self-adaptive network sensitive information detection method and device
JP2008059442A (en) Document aggregate analyzer, document aggregate analytical method, program mounted with method, and recording medium for storing program
CN112035723A (en) Resource library determination method and device, storage medium and electronic device
CN107220367A (en) Internet data full-text search method
Huang et al. Design a batched information retrieval system based on a concept-lattice-like structure
CN107133366A (en) Data discovery method based on meta-search engine
Hardik et al. Link analysis of Wikipedia documents using mapreduce
Li et al. Geospatial data mining on the web: Discovering locations of emergency service facilities
JP2007041700A (en) Topic extraction device, topic extraction method, topic extraction program, and storage medium
JP5321258B2 (en) Information collecting system, information collecting method and program thereof
Page TBMap: a taxonomic perspective on the phylogenetic database TreeBASE

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170929
