CN107220367A - Internet data full-text search method - Google Patents
- Publication number
- CN107220367A
- Authority
- CN
- China
- Prior art keywords
- data
- affairs
- unit
- chinese
- search
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an internet data full-text search method. The method comprises: crawling basic data from internet sites with the acquisition module of a search engine and storing it in a distributed manner; and analysing and processing the collected basic data with the analysis module of the search engine, building an index on keywords to facilitate user searches. The proposed method uses a transaction control strategy for efficient data acquisition and performs data mining on the coupling relations between multi-dimensional objects.
Description
Technical field
The present invention relates to data retrieval, and in particular to an internet data full-text search method.
Background
With the continuing development of Web technologies, network information resources are growing at a geometric rate. How to quickly retrieve, from the mass of information on the internet, the useful data relevant to a user has become an urgent problem. Search engines grew out of information retrieval technology. A search engine helps us better express and store the essential information of the real world and, by analysing the link information it holds, can serve as a useful tool for mining hidden information. Existing search engines rely solely on a limited set of search terms to express the user's need, so the need is expressed incompletely. Even for the same search term, different users may want different results. Take a microblog system as an example: considering the relations between microblogs and the objects they interact with, the system can be abstracted as a heterogeneous network containing nodes such as microblogs, messages, tags and users. Follow and follower relations exist between microblogs; post and repost relations exist between microblogs and messages; a containment relation exists between microblogs and tags; and an ownership relation exists between users and microblogs. Existing search tools do not take the complex environment formed by such multi-dimensional objects into account when mining data.
Summary of the invention
To solve the above problems of the prior art, the present invention proposes an internet data full-text search method, comprising:
crawling basic data from internet sites with the acquisition module of a search engine and storing it in a distributed manner;
analysing and processing the collected basic data with the analysis module of the search engine, and building an index on keywords to facilitate user searches.
Preferably, the acquisition module comprises a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database;
the transaction management controller is used to create, start, control the running of, and destroy multiple crawl transactions; each transaction has its own independent container for managing the transaction's resources; the data controller handles data exchange between the program and the database, and comprises a global data source cache unit, a data scheduling unit and a data access management unit; the global crawl cache unit handles the waiting of multiple transactions when they access the critical resource, and each web crawler has only one global crawl cache unit instance; the data access management unit handles data interaction between the database and the program; the data scheduling unit schedules the crawling of individual transactions: when a transaction's crawl cache unit has nothing left to crawl, the data scheduling unit fetches a number of crawl sites from the global crawl cache unit into that transaction's crawl cache unit; the data scheduling unit has only one instance in the whole program.
Preferably, the transaction container further comprises:
a crawl cache unit, which builds a queue in memory to cache the site data the transaction is to crawl;
a transaction cache unit, which caches the transaction's own data;
a storage cache unit, which caches the related data to be stored in the database;
an acquisition transaction processing unit, which loads the collected data and implements data updating, link deduplication and storage;
a data cleaning and extraction unit, which cleans the collected code, extracts the effective information, obtains the information relevant to evaluating the web page quality grade, and obtains the new crawl sites found in the web page;
a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses the data, and composes the database query strings to be executed.
Preferably, the effective information includes the web page's title, keywords, summary, body text, the font types appearing in the page, and media feature information.
Preferably, building the index on keywords by the analysis module further comprises:
extracting keywords from the text sequentially; consecutive digits or Chinese numeral characters are treated as a single keyword; English letters are split when a non-letter character such as a space is encountered; a sentence composed of Chinese characters is processed in the following order: ① if the characters form a run of consecutive Chinese numerals, the run is kept together; ② if three consecutive Chinese characters are each independent and form no phrase, the three independent characters are split off together as one new phrase; ③ if neither ① nor ② applies, the text is split with a dictionary-based algorithm; any other computer symbol that is not a digit, an English letter or a simplified Chinese character is treated as a special character, and each special character is one keyword.
Compared with the prior art, the present invention has the following advantage:
The present invention proposes an internet data full-text search method that uses a transaction control strategy for efficient data acquisition and performs data mining on the coupling relations between multi-dimensional objects.
Brief description of the drawings
Fig. 1 is a flow chart of the internet data full-text search method according to an embodiment of the present invention.
Detailed description of the embodiments
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention encompasses many alternatives, modifications and equivalents. Many specific details are set forth in the following description in order to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practised according to the claims without some or all of them.
One aspect of the present invention provides an internet data full-text search method. Fig. 1 is a flow chart of the internet data full-text search method according to an embodiment of the present invention.
The search engine of the present invention can be divided into an acquisition module and an analysis module. The acquisition module comprises a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database. The transaction management controller creates, starts, controls the running of, and destroys multiple crawl transactions. Each transaction has its own independent container for managing the transaction's resources, which specifically comprises: a crawl cache unit, which builds a queue in memory to cache the site data the transaction is to crawl; a transaction cache unit, which caches the transaction's own data; a storage cache unit, which caches the related data to be stored in the database; an acquisition transaction processing unit, which loads the collected data and implements data updating, link deduplication and storage; a data cleaning and extraction unit, which cleans the collected code and extracts the effective information (the web page's title, keywords, summary, body text, the font types appearing in the page and media feature information), and which also obtains the information relevant to evaluating the web page quality grade and the new crawl sites found in the web page; and a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses the data, and composes the database query strings to be executed.
The data controller handles data exchange between the program and the database, and comprises a global data source cache unit, a data scheduling unit and a data access management unit. The global crawl cache unit handles the waiting of multiple transactions when they access the critical resource and reduces the number of database accesses made by the transactions. Each web crawler has only one global crawl cache unit instance. The data access management unit handles data interaction between the database and the program. The data scheduling unit schedules the crawling of individual transactions: when a transaction's crawl cache unit has nothing left to crawl, the data scheduling unit fetches a number of crawl sites from the global crawl cache unit into that transaction's crawl cache unit. The data scheduling unit has only one instance in the whole program.
The web crawlers of search engine operationally, reading program configuration file first, and will when preloading caching collection
The data used;Task manager.According to configuration information, each affairs is initialized, and control the operation of affairs;At affairs acquisition
Reason task, first carries out crawling link duplicate removal inspection, analysis crawls the type of link, performs different grab types at different places
Reason mode, in collection, analysis is new collection affairs or more new task, and after the webpage source code of link is got, it is right
The webpage source code collected performs cleaning, filtering, according to info web correlated characteristic rule, extracts effective information;Affairs pair
The information extracted carries out conversion process, is cached;When caching data to be saved reach certain amount, affairs perform caching
Data loading processing;Task manager timing simultaneously monitors the execution state of each affairs, and management is controlled to abnormal transaction.
Before crawling, all possible combined domain names are enumerated in turn according to the domain name generation rules and probed one by one to distinguish valid from invalid domain names, and a root domain name library is built. The page source of navigation websites is then obtained, root site addresses and link text are extracted from the page source according to the root domain name composition rules, and the root domain name library is updated.
Each web crawler has only one global acquisition transaction scheduling unit. When a transaction's crawl cache is empty, the transaction requests, or waits for, the data scheduling unit to transfer crawl sites from the global acquisition cache or from the database. The transaction scheduling process is as follows: the transaction obtains control of the data scheduling unit; it checks whether the global acquisition cache is empty; if it is empty, a certain number of crawl sites are fetched from the database into the global acquisition cache; if it is not empty, a certain number of crawl sites are taken from the global cache into the current transaction's crawl cache unit.
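The scheduling process above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `DataScheduler`, `refill`, `fetch_from_database`, `batch_size` and `global_batch` are hypothetical, a lock stands in for "control of the data scheduling unit", and, as a simplification, one call both reloads an empty global cache and hands sites to the transaction.

```python
import threading
from collections import deque


class DataScheduler:
    """Single shared scheduler that refills per-transaction crawl caches."""

    def __init__(self, fetch_from_database, batch_size=2, global_batch=4):
        self._lock = threading.Lock()         # stands in for scheduler control
        self._global_cache = deque()          # global acquisition cache
        self._fetch = fetch_from_database     # hypothetical callable: n -> list
        self._batch_size = batch_size         # sites handed to one transaction
        self._global_batch = global_batch     # sites loaded from the database

    def refill(self, transaction_cache):
        """Move up to batch_size crawl sites into a transaction's cache."""
        with self._lock:
            # Global cache empty: load a batch of crawl sites from the database.
            if not self._global_cache:
                self._global_cache.extend(self._fetch(self._global_batch))
            # Hand a limited number of sites to the requesting transaction.
            moved = 0
            while self._global_cache and moved < self._batch_size:
                transaction_cache.append(self._global_cache.popleft())
                moved += 1
            return moved
```

Because only one `DataScheduler` object would be shared by all transactions, the database is hit once per global batch rather than once per transaction request.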
For a single transaction, the acquisition flow is as follows:
1. The transaction takes a non-empty acquisition transaction object from the transaction queue. If an empty transaction object is obtained, transaction scheduling is performed.
2. Determine whether the depth of the acquisition transaction exceeds the maximum depth. The transaction obtains, from the current acquisition transaction object, the acquisition depth at which that object lies. If the acquisition depth exceeds the site acquisition depth configured for the system, acquisition for the current transaction ends. If it does not, the transaction continues with step 3.
3. Determine the type of the acquisition transaction. If it is a web page acquisition transaction, perform step 4; otherwise perform step 5.
4. Determine whether the link is a new web page or an unfinished web page link. If the address accessed by this acquisition transaction is not in the history crawl library, it is acquired as a newly discovered web page, i.e. step 7. If it is in the history crawl library, the last acquisition information for this web page address is obtained from the library: access address, access time, page size, update frequency and root domain name. Compute whether the interval between the last access time and the current time of this access exceeds the update frequency. If it does not, no acquisition is performed and acquisition ends. If it does, compare the current page content size at this address with the previous page content size; if they are equal, no acquisition is performed; if they differ, continue with step 6.
5. If the link is a media or file link, perform the corresponding file acquisition processing; if it is an illegal link, record the link as abnormal.
6. Obtain the page source of this web page link, update the acquisition information for this web page address in the history access library, and perform step 8.
7. Acquire the new-task web page: obtain the page source of this web page link and add an access record for this web page address to the history access library.
8. Perform web page cleaning and extraction. This step extracts the specified feature information from the page source: garbage and noise data are removed from the page source, and the required information is then extracted from the cleaned data. Further, according to a text similarity algorithm, the title, keywords and descriptive text defined in the page, together with the headings, body text, links and media resources in the page, are extracted for use in ranking. During cleaning, the acquisition program first obtains the page code of the current transaction. The cleaner is started and initialised. Style code and comment code are removed from the page code; script code is removed from the page code, and at the same time the script information is used to identify whether the current page contains media files. If any exist, they are saved. The extracted information is compressed, converted into a form that is easy to store, and stored.
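The depth check of step 2 and the re-acquisition decision of step 4 can be sketched as a small function. This is an illustrative sketch only: the `HistoryRecord` layout, the `decide` function and its return values are hypothetical names, and the update frequency is assumed to be stored as a minimum interval in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HistoryRecord:
    """Last acquisition info kept for one web page address (assumed layout)."""
    url: str
    access_time: float       # time of the last visit, seconds since epoch
    page_size: int           # content size seen at the last visit
    update_interval: float   # assumed: minimum seconds between re-acquisitions


def decide(depth: int, max_depth: int, record: Optional[HistoryRecord],
           now: float, current_size: int) -> str:
    """Return 'stop', 'new', 'skip' or 'update' for one web page link."""
    if depth > max_depth:
        return "stop"        # step 2: configured site depth exceeded
    if record is None:
        return "new"         # step 4: not in the history library, go to step 7
    if now - record.access_time <= record.update_interval:
        return "skip"        # interval has not exceeded the update frequency
    if current_size == record.page_size:
        return "skip"        # same size as last time, treated as unchanged
    return "update"          # go to step 6: re-fetch and update the history
```

Comparing page sizes is a cheap change test; it can miss edits that leave the size unchanged, which is a trade-off inherent in the rule as described.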
To handle repeated links within pages, each transaction is given its own dedicated deduplication container, and each container stores only the mapping codes of the link addresses that the transaction itself has accessed. A deduplication container only needs to record the links the transaction has accessed under one root-domain site; web page addresses that do not belong to that root domain are discarded. When the transaction starts acquiring information from another root site, the history access records in the deduplication container are cleared and access records for the new root site are recorded afresh. A threshold is set on the depth to which the information collector crawls a site; while each transaction runs, the memory actually occupied by its deduplication container can be controlled through this crawl depth threshold.
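A per-transaction deduplication container of this kind might look like the sketch below. The class and method names are hypothetical, a truncated SHA-256 digest is assumed as the "mapping code", and the URL's host name is used as a stand-in for the root domain, which a real crawler would derive more carefully.

```python
import hashlib
from urllib.parse import urlsplit


class DedupContainer:
    """Remembers the links one transaction has visited under one root site."""

    def __init__(self):
        self._root = None
        self._seen = set()

    @staticmethod
    def _root_of(url):
        # Assumption: the host name stands in for the root domain.
        return (urlsplit(url).hostname or "").lower()

    def switch_root(self, root):
        """Start a new root site: old access records are dropped."""
        self._root = root.lower()
        self._seen.clear()

    def should_visit(self, url):
        """True only for unseen links that belong to the current root site."""
        if self._root_of(url) != self._root:
            return False                      # foreign address: discard
        code = hashlib.sha256(url.encode("utf-8")).digest()[:8]
        if code in self._seen:
            return False                      # already visited
        self._seen.add(code)
        return True
```

Storing a short fixed-size code instead of the full address keeps memory per link constant, at the cost of a very small collision probability.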
To reduce transaction waiting, the present invention uses a multi-layer cache structure, with the size of each cache layer configured according to the memory size of the computer. The first layer is the global crawl cache: when connecting to the crawl database, a batch of crawl results is fetched at once and cached. The second layer is each transaction's own crawl cache: every transaction has its own acquisition data source cache region. The third layer caches the data that the transaction generates during processing, including the web page and media link addresses already accessed, which are cached during the link deduplication check. The last layer caches the data awaiting storage: only when the data awaiting storage reaches a certain amount does the transaction write the data to storage.
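The last cache layer, which writes to storage only once enough records have accumulated, can be illustrated with a small buffer. This is a sketch under assumptions: `StorageBuffer`, `write_batch` and `threshold` are hypothetical names, and `write_batch` stands for whatever bulk insert the database layer provides.

```python
class StorageBuffer:
    """Collects records and writes them to storage in batches."""

    def __init__(self, write_batch, threshold=100):
        self._write_batch = write_batch   # hypothetical bulk-write callable
        self._threshold = threshold       # records to accumulate before writing
        self._pending = []

    def add(self, record):
        """Cache one record; flush automatically when the threshold is hit."""
        self._pending.append(record)
        if len(self._pending) >= self._threshold:
            self.flush()

    def flush(self):
        """Write everything still pending, for example at transaction end."""
        if self._pending:
            self._write_batch(list(self._pending))
            self._pending.clear()
```

An explicit `flush` is needed when a transaction finishes, otherwise a final partial batch would never reach the database.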
The analysis module of the search engine analyses and processes the basic data (text and media) returned by acquisition, and builds an index on keywords to facilitate searches by the retrieval system. When extracting keywords from text, the analysis module processes the text sequentially. Consecutive digits or Chinese numeral characters are treated as a single keyword. English letters are split when a non-letter character such as a space is encountered. A sentence composed of Chinese characters is processed in the following order: ① if the characters form a run of consecutive Chinese numerals, the run is kept together; ② if three consecutive Chinese characters are each independent and form no phrase, the three independent characters are split off together as one new phrase; ③ if neither ① nor ② applies, the text is split with a dictionary-based algorithm. Any other computer symbol that is not a digit, an English letter or a simplified Chinese character is treated as a special character, and each special character is one keyword.
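The character-class rules above can be sketched as a single pass over the text. This is a simplified illustration: the names `CHINESE_NUMERALS`, `char_class`, `extract_keywords` and `segment_chinese` are hypothetical, whitespace is assumed to be a separator rather than a keyword, digits and Chinese numerals are assumed to form separate runs, rule ② is not implemented, and Chinese runs are passed to a pluggable segmenter whose default simply emits single characters.

```python
CHINESE_NUMERALS = set("零一二三四五六七八九十百千万亿")


def char_class(ch):
    """Classify one character according to the rules described above."""
    if ch.isascii() and ch.isdigit():
        return "digit"
    if ch in CHINESE_NUMERALS:
        return "cn_numeral"
    if ch.isascii() and ch.isalpha():
        return "latin"
    if "\u4e00" <= ch <= "\u9fff":
        return "chinese"
    if ch.isspace():
        return "space"
    return "special"


def extract_keywords(text, segment_chinese=list):
    """Split text into keywords; Chinese runs go to segment_chinese."""
    keywords = []
    i = 0
    while i < len(text):
        cls = char_class(text[i])
        if cls == "special":
            keywords.append(text[i])      # each special character is a keyword
            i += 1
            continue
        j = i + 1
        while j < len(text) and char_class(text[j]) == cls:
            j += 1                        # extend the run of the same class
        run = text[i:j]
        if cls == "chinese":
            keywords.extend(segment_chinese(run))
        elif cls != "space":
            keywords.append(run)          # digit, numeral or letter run
        i = j
    return keywords
```

In practice `segment_chinese` would be replaced by the dictionary-based segmenter, since single characters are rarely useful index terms.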
The step of extracting keywords from text further comprises:
Load the Chinese phrases. Obtain the text to be segmented from the segmentation object. Obtain the analysis position and its character, and determine whether the analysis position is the end of the text to be segmented. If it is the last position, the already segmented text, plus a delimiter, plus the last character of the text to be segmented, forms the new segmented text, and segmentation of the text is complete. Otherwise, starting from the analysis position, find the delimiter position. Once the delimiter position is found, take the characters between the analysis position and the delimiter position; these characters, together with a delimiter, are appended to the text segmented so far to form the segmented text.
The basic Chinese dictionary is stored in a two-level hash table. The first level uses the first Chinese character of a phrase as the key and another hash table as the value; that inner hash table stores the remainder, after the first character is removed, of the phrases that begin with the key. The inner hash table uses the second Chinese character of the word as the key and a linked array as the value. The linked array stores the differing character sequences, starting from the third Chinese character, of the dictionary words whose first two characters are identical.
Text is segmented on the basis of the dictionary by looking up whether a phrase exists in the dictionary. If it exists in the dictionary, matching continues; if it does not, matching ends and the split is made.
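A two-level dictionary of this shape, together with one possible matching strategy, is sketched below. The function names are hypothetical, Python dictionaries and lists stand in for the hash tables and linked arrays, and forward longest matching is an assumption: the text only says that matching continues while a phrase is found.

```python
def build_dictionary(words):
    """first char -> second char -> list of tails (from the third char on)."""
    table = {}
    for word in words:
        if len(word) < 2:
            continue                      # structure covers words of 2+ chars
        tails = table.setdefault(word[0], {}).setdefault(word[1], [])
        if word[2:] not in tails:
            tails.append(word[2:])        # "" marks a two-character word
    return table


def longest_match(table, text, start):
    """Length of the longest dictionary word at start, or 1 if none matches."""
    if start + 1 >= len(text):
        return 1
    tails = table.get(text[start], {}).get(text[start + 1])
    if tails is None:
        return 1
    best = 1
    for tail in tails:
        if text.startswith(tail, start + 2):
            best = max(best, 2 + len(tail))
    return best


def segment(table, text):
    """Split text greedily, always taking the longest matching word."""
    out, i = [], 0
    while i < len(text):
        n = longest_match(table, text, i)
        out.append(text[i:i + n])
        i += n
    return out
```

Two hash lookups narrow the candidates to words sharing the first two characters, so only a short list of tails has to be compared for each position.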
Further, after a search returns a set of web pages, feature-based similarity graphs are constructed separately from the content features of the search queries and of the web pages, and a query-webpage bipartite relation graph is built from the interest relations between queries and web pages. Given a small number of class labels on queries and web pages, the classes of the unlabeled queries and web pages are then predicted.
First a graph is built: nodes represent the sample data, edges represent the relations between samples according to the sample data and their connections, and the weight of an edge represents how closely two samples are related. A search engine containing several different object types can be expressed as $G = (V, E)$, where $V = Q \cup D$ is the set of vertices of the different types and $E$ is the set of edges connecting the vertices. $Q$ is the set of search queries and $D$ is the set of web pages. $E = E_{QQ} \cup E_{QD} \cup E_{DD}$, where $E_{QQ} = Q \times Q$, $E_{QD} = Q \times D$ and $E_{DD} = D \times D$. Let $G_Q = (Q, E_{QQ})$, $G_{QD} = (Q, D, E_{QD})$ and $G_D = (D, E_{DD})$; then $G = G_Q \cup G_{QD} \cup G_D$. Here $G_Q$ is the subgraph built from the query nodes, $G_D$ is the subgraph built from the web page nodes, and $G_{QD}$ is the subgraph built from query nodes and web page nodes according to the interest relation.
The weight $w_{ij}$ of an edge is defined from the following distance function between nodes:
$w_{ij} = \exp\left(-d(x_i, x_j) / 2\sigma^2\right)$
which applies when $x_i$ and $x_j$ are nodes among each other's $k$ nearest neighbours; $d(x_i, x_j)$ is the distance function $\|x_i - x_j\|$ and $\sigma$ is a tuning parameter. In text computation, the cosine of the angle between the two text vectors $x_i$ and $x_j$ is used.
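As a worked illustration of this weighting, the sketch below computes edge weights for a handful of vectors. The function names are hypothetical, the Euclidean norm is used for $d$, the distance is not squared because the formula above does not square it, and an edge is kept only when each node is among the other's $k$ nearest neighbours, which is one reading of the neighbourhood condition.

```python
import math


def euclidean(x, y):
    """d(x, y) = ||x - y|| for two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_weights(points, k, sigma):
    """Weight matrix with w_ij = exp(-d_ij / (2 sigma^2)) on mutual k-NN pairs."""
    n = len(points)
    dist = [[euclidean(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    neighbours = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        neighbours.append(set(order[:k]))     # k nearest neighbours of node i
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Keep the edge only if i and j are neighbours of each other.
            if i != j and j in neighbours[i] and i in neighbours[j]:
                weights[i][j] = math.exp(-dist[i][j] / (2 * sigma ** 2))
    return weights
```

A smaller `sigma` makes weights fall off faster with distance, so it controls how local the resulting graph is.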
The original graph structure is modified according to the discriminative information of the samples labeled in advance:
1. Construct $G_Q$, $G_D$ and $G_{QD}$. Compute $w_q$, the average weight of the edges between all nodes of $G_Q$, and $w_d$, the average weight of the edges between all nodes of $G_D$.
2. Suppose the labeled query data are divided into $c$ classes, expressed as $P_q = \{P_q^1, P_q^2, \dots, P_q^c\}$, where $P_q^i$ is the core set of labeled queries of the $i$-th class. Let $M_q^i$ denote the pairwise constraint set of the $i$-th query class label; if $x \in P_q^i$ and $y \in P_q^i$, then $(x, y)$ is added to $M_q^i$. Suppose the labeled web page data are divided into $c$ classes, expressed as $P_d = \{P_d^1, P_d^2, \dots, P_d^c\}$, where $P_d^i$ is the core set of labeled web pages of the $i$-th class. Let $M_d^i$ denote the pairwise constraint set of the $i$-th web page class label; if $x \in P_d^i$ and $y \in P_d^i$, then $(x, y)$ is added to $M_d^i$.
3. If a query sample pair $(q_l, q_m) \in M_q^i$, $q_k$ is a neighbour of $q_l$ and $q_m$, $w_{lk} < w_q$ and $w_{mk} > w_q$, then add $q_k$ to $P_q^i$ and add $(q_l, q_k)$ and $(q_m, q_k)$ to $M_q^i$.
4. Repeat step 3 until $P_q$ no longer changes.
5. If a web page sample pair $(d_l, d_m) \in M_d^i$, $d_k$ is a neighbour of $d_l$ and $d_m$, $w_{lk} < w_d$ and $w_{mk} > w_d$, then add $d_k$ to $P_d^i$ and add $(d_l, d_k)$ and $(d_m, d_k)$ to $M_d^i$.
6. Repeat step 5 until $P_d$ no longer changes.
7. If a query sample pair $(q_l, q_m) \in M_q^i$, modify $G_Q$ so that $w_{lm} = 1$; if $q_l \in P_q^i$, modify $G_Q$ so that $w_{lm} = 0$. If a web page sample pair $(d_l, d_m) \in M_d^i$, modify $G_D$ so that $w_{lm} = 1$; if $d_l \in P_d^i$, modify $G_D$ so that $w_{lm} = 0$.
8. If query $q_l \in P_q^j$ and web page $d_m \in P_d^j$, then modify $G_{QD}$ so that $w_{lm} = 1$.
Here $w_{lk}$, $w_{mk}$ and $w_{lm}$ are the weights of queries $q_l$, $q_m$ and $q_l$ with respect to web pages $d_k$, $d_k$ and $d_m$, respectively.
The above process enriches the relations between search queries and web pages, making the connections between nodes of the same class tighter and the connections between nodes of different classes looser, so that classification can be performed better with the following procedure:
1. For classes $j \in \{1, \dots, c\}$ and nodes $i \in \{1, \dots, n\}$ in the above subgraphs, construct the $n \times c$ initial label matrix $Y$.
2. Construct the adjacency matrices $W_{QQ}$ and $W_{DD}$ of the homogeneous networks from the similarity measure between nodes of the same type, and construct the adjacency matrix $W_{QD}$ and its transpose $W_{DQ}$ from the relations between nodes of different types.
3. Construct the matrix $S$ [defining formula not preserved in the source text].
4. Take $F(0) = Y$ and iterate $F(t+1) = \mu_\alpha S F(t) + (1 - \mu_\alpha) Y$, where $\mu_\alpha$ is a parameter in $(0, 1)$.
5. Let $F^*$ be the limit of the sequence $\{F(t)\}$; each node $v_i$ of graph $G$ is then labeled with the class $y_i = \arg\max_{j \le c} F^*_{ij}$.
In this iterative process, each node of the graph keeps propagating its label information to its neighbouring nodes until a stable state is reached.
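The iteration $F(t+1) = \mu_\alpha S F(t) + (1 - \mu_\alpha) Y$ can be sketched in a few lines of plain Python. Two points are assumptions rather than part of the text: the definition of $S$ is not preserved here, so the symmetric normalisation $S = D^{-1/2} W D^{-1/2}$ commonly used in label propagation is substituted, and $W$ is taken to be a single adjacency matrix already assembled for all nodes. The function names are hypothetical, and classes are returned as 0-based indices.

```python
import math


def normalise(W):
    """Assumed form of S: D^(-1/2) W D^(-1/2), with D the degree matrix."""
    n = len(W)
    deg = [sum(row) for row in W]
    return [[W[i][j] / math.sqrt(deg[i] * deg[j])
             if deg[i] > 0 and deg[j] > 0 else 0.0
             for j in range(n)] for i in range(n)]


def propagate(W, Y, mu=0.5, iterations=200):
    """Iterate F <- mu * S F + (1 - mu) * Y and return a class per node."""
    S = normalise(W)
    n, c = len(Y), len(Y[0])
    F = [row[:] for row in Y]                 # F(0) = Y
    for _ in range(iterations):
        F = [[mu * sum(S[i][k] * F[k][j] for k in range(n))
              + (1 - mu) * Y[i][j]
              for j in range(c)] for i in range(n)]
    # Label each node with the class that has the largest score in its row.
    return [max(range(c), key=lambda j: F[i][j]) for i in range(n)]
```

For example, on a four-node chain whose middle edge is weak, labeling only the two end nodes is enough for each unlabeled node to take the class of its strongly connected neighbour.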
In summary, the present invention proposes an internet data full-text search method that uses a transaction control strategy for efficient data acquisition and performs data mining on the coupling relations between multi-dimensional objects.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing systems; they can be concentrated in a single computing system or distributed over a network formed by multiple computing systems; optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. Thus the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are used only to illustrate or explain the principles of the invention and do not limit the invention. Therefore any modification, equivalent substitution or improvement made without departing from the spirit and scope of the invention shall fall within its scope of protection. Furthermore, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or the equivalents of such scope and boundaries.
Claims (5)
1. An internet data full-text search method, characterised by comprising:
crawling basic data from internet sites with the acquisition module of a search engine and storing it in a distributed manner;
analysing and processing the collected basic data with the analysis module of the search engine, and building an index on keywords to facilitate user searches.
2. The method according to claim 1, characterised in that the acquisition module comprises a crawl-site database, a crawl-site scheduling unit, a transaction management controller, transaction containers, a data controller and a basic database;
the transaction management controller is used to create, start, control the running of, and destroy multiple crawl transactions; each transaction has its own independent container for managing the transaction's resources; the data controller handles data exchange between the program and the database, and comprises a global data source cache unit, a data scheduling unit and a data access management unit; the global crawl cache unit handles the waiting of multiple transactions when they access the critical resource, and each web crawler has only one global crawl cache unit instance; the data access management unit handles data interaction between the database and the program; the data scheduling unit schedules the crawling of individual transactions: when a transaction's crawl cache unit has nothing left to crawl, the data scheduling unit fetches a number of crawl sites from the global crawl cache unit into that transaction's crawl cache unit; the data scheduling unit has only one instance in the whole program.
3. The method according to claim 2, characterised in that the transaction container further comprises:
a crawl cache unit, which builds a queue in memory to cache the site data the transaction is to crawl;
a transaction cache unit, which caches the transaction's own data;
a storage cache unit, which caches the related data to be stored in the database;
an acquisition transaction processing unit, which loads the collected data and implements data updating, link deduplication and storage;
a data cleaning and extraction unit, which cleans the collected code, extracts the effective information, obtains the information relevant to evaluating the web page quality grade, and obtains the new crawl sites found in the web page;
a data storage analysis unit, which converts the cleaned and extracted data into a form that is easy to store, compresses the data, and composes the database query strings to be executed.
4. The method according to claim 3, characterised in that the effective information includes the web page's title, keywords, summary, body text, the font types appearing in the page, and media feature information.
5. The method according to claim 1, characterised in that building the index on keywords by the analysis module further comprises:
extracting keywords from the text sequentially; consecutive digits or Chinese numeral characters are treated as a single keyword; English letters are split when a non-letter character such as a space is encountered; a sentence composed of Chinese characters is processed in the following order: ① if the characters form a run of consecutive Chinese numerals, the run is kept together; ② if three consecutive Chinese characters are each independent and form no phrase, the three independent characters are split off together as one new phrase; ③ if neither ① nor ② applies, the text is split with a dictionary-based algorithm; any other computer symbol that is not a digit, an English letter or a simplified Chinese character is treated as a special character, and each special character is one keyword.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710432267.1A CN107220367A (en) | 2017-06-09 | 2017-06-09 | Internet data full-text search method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710432267.1A CN107220367A (en) | 2017-06-09 | 2017-06-09 | Internet data full-text search method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107220367A (en) | 2017-09-29 |
Family
ID=59947568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710432267.1A Pending CN107220367A (en) | 2017-06-09 | 2017-06-09 | Internet data full-text search method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107220367A (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104281697A (en) * | 2014-10-15 | 2015-01-14 | 安徽华贞信息科技有限公司 | Semantic-based hadoop system |
CN105468744A (en) * | 2015-11-25 | 2016-04-06 | 浪潮软件集团有限公司 | Big data platform for realizing tax public opinion analysis and full text retrieval |
Non-Patent Citations (1)
Title |
---|
周庭安: ""分布式搜索引擎研究与实现"", 《中国优秀硕士学位论文全文数据库 信息科技辑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109190010A (en) * | 2018-09-20 | 2019-01-11 | 河南智慧云大数据有限公司 | Internet data acquisition system is carried out based on customized keyword acquisition mode |
CN109190010B (en) * | 2018-09-20 | 2021-05-11 | 河南智慧云大数据有限公司 | Internet data acquisition system based on user-defined keyword acquisition mode |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101122942B1 (en) | New word collection and system for use in word-breaking | |
CN102053991B (en) | Method and system for multi-language document retrieval | |
CN110543595B (en) | In-station searching system and method | |
CN112256939B (en) | Text entity relation extraction method for chemical field | |
Fujimura et al. | Topigraphy: visualization for large-scale tag clouds | |
CN107239558A (en) | Common interconnection network collecting method | |
CN106815307A (en) | Public Culture knowledge mapping platform and its use method | |
CN101694658A (en) | Method for constructing webpage crawler based on repeated removal of news | |
JP2009110513A (en) | Automatic generation of ontologies using word affinities | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
CN111104801B (en) | Text word segmentation method, system, equipment and medium based on website domain name | |
CN107506472B (en) | Method for classifying browsed webpages of students | |
Geng et al. | Evaluating web content quality via multi-scale features | |
CN110334343A (en) | The method and system that individual privacy information extracts in a kind of contract | |
CN109284441B (en) | Dynamic self-adaptive network sensitive information detection method and device | |
JP2008059442A (en) | Document aggregate analyzer, document aggregate analytical method, program mounted with method, and recording medium for storing program | |
CN112035723A (en) | Resource library determination method and device, storage medium and electronic device | |
CN107220367A (en) | Internet data full-text search method | |
Huang et al. | Design a batched information retrieval system based on a concept-lattice-like structure | |
CN107133366A (en) | Data based on meta-search engine find method | |
Hardik et al. | Link analysis of Wikipedia documents using mapreduce | |
Li et al. | Geospatial data mining on the web: Discovering locations of emergency service facilities | |
JP2007041700A (en) | Topic extraction device, topic extraction method, topic extraction program, and storage medium | |
JP5321258B2 (en) | Information collecting system, information collecting method and program thereof | |
Page | TBMap: a taxonomic perspective on the phylogenetic database TreeBASE |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170929 |