CN103473369A - Semantic-based information acquisition method and semantic-based information acquisition system - Google Patents

Info

Publication number
CN103473369A
Authority
CN
China
Prior art keywords
network information
semantic
topic
information
network
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013104526558A
Other languages
Chinese (zh)
Inventor
李涓子
祁羽
何巍
焦程波
张鹏
杨瑞兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University

Abstract

The invention relates to the technical field of data mining, and in particular to a semantic-based information acquisition method and a semantic-based information acquisition system. The method comprises the following steps: S1, establishing a network resource abstract data model according to typical characteristics of network resources; S2, acquiring network information from the Internet by means of a search engine, and formatting the acquired network information with the network resource abstract data model; S3, performing cluster analysis on the formatted network information, assigning the network information to corresponding topics according to the cluster analysis result, and extracting a label for each topic; S4, visually displaying the result of step S3. In the method and system provided by the invention, the organization, visual display, download, and offline viewing of network resources are driven by topics, so that network information can be presented along multiple dimensions and shown to the user in a vivid, intuitive way, which improves the user's browsing efficiency.

Description

Semantic-based information acquisition method and system
Technical Field
The present invention relates to the technical field of data mining, and in particular to a semantic-based information acquisition method and system.
Background
Network data (network resources) refers to the sum of the various information resources on the Internet, including knowledge, data, information, and messages in many forms, such as electronic documents, databases, digital documents, digitized bibliographies, electronic newspapers, and online news.
Information on the Internet is large in volume, updated quickly, and highly time-sensitive, and a large amount of new network information is produced every day. To help users escape the predicament of "information explosion", the major portal websites and the main search engine companies currently provide massive amounts of network resources, presenting Internet information comprehensively and from multiple angles within a single page layout, introducing the background of the network resources and analyzing their characteristics. In general, this network data is organized manually by editorial staff.
Automated organization of network data is the process of using information extraction, data mining, and related techniques to give scattered, unordered network data a systematic, ordered structure according to some standard or pattern, so that users can conveniently browse and obtain it. How to organize network data automatically in an effective and reasonable way has therefore become a pressing problem, and it is attracting growing attention: for the major Internet sites, automated organization can replace the manual organization of network data used in the past; for ordinary users of network data, it can use the fast processing power of computers and related mature techniques to improve the way network data is organized and thereby improve browsing efficiency.
Network data contains many different types of information, such as resource category, the types of information a resource contains, time, related persons, places, and organizations. These types of information do not exist in isolation on the network; they depend on one another and are linked by close relationships. How to fuse these different types of information effectively is therefore the key to the automated organization of network data, and it is the goal of the present work.
Among the techniques related to organizing network resources, topic detection can effectively gather and organize scattered network resources. However, because the information within network resources is highly similar, topic detection based on the traditional vector space model performs poorly. A reasonable organization pattern helps users understand and analyze the information in network resources, but existing organization patterns are one-dimensional and have difficulty presenting the multidimensional characteristics of the resources.
Summary of the Invention
(1) Technical problem to be solved
The object of the present invention is to provide a semantic-based information acquisition method and system in which the organization, visual display, download, and offline viewing of network resources are driven by topics, so that network information can be presented along multiple dimensions and shown to the user in a vivid, intuitive way, further improving the user's browsing efficiency.
(2) Technical solution
The technical solution of the present invention is as follows:
A semantic-based information acquisition method comprises the steps of:
S1. establishing a network resource abstract data model according to the typical characteristics of network resources;
S2. collecting network information from the Internet through a search engine, and formatting the collected network information with the network resource abstract data model;
S3. performing cluster analysis on the formatted network information, assigning the network information to corresponding topics according to the cluster analysis result, and extracting a label for each topic;
S4. visually displaying the result of step S3.
Preferably, step S1 further comprises:
summarizing the model elements of the network resource abstract data model from the typical characteristics of network resources, and establishing the network resource abstract data model.
Preferably, step S2 further comprises:
S21. crawling from the Internet the network information returned by the search engine;
S22. parsing the crawled network information with a web page crawling and parsing component and regular-expression rules to obtain text information;
S23. formatting the obtained text information with the network resource abstract data model.
Preferably, step S3 further comprises:
S31. performing word segmentation and part-of-speech tagging on the formatted text information with a Chinese word segmentation tool;
S32. filtering the word segmentation result of step S31 according to a preset candidate keyword standard to obtain candidate keywords;
S33. computing the contribution degree of each candidate keyword to the topic label, performing cluster analysis on the network information, and assigning the network information to corresponding topics according to the cluster analysis result;
S34. sorting the candidate keywords by contribution degree in descending order and extracting the first several candidate keywords to generate the topic label.
Preferably, step S3 further comprises:
S35. establishing a link for each candidate keyword in a knowledge base.
Preferably, step S4 further comprises:
S41. according to the search word provided by the user, presenting the first several items of network information returned by the search engine as a summary, so that the user can judge whether they are the required content: if not, ending; if so, continuing;
S42. according to steps S1 to S3, assigning the network information obtained in step S41 to corresponding topics and generating the corresponding topic labels;
S43. generating a topic entity relationship graph and links to the knowledge base according to the degree of relationship between topics and the ranking of individual items of network information.
Preferably, after step S4, the method further comprises:
S5. selecting the data content to be packaged and downloaded according to the generated topic labels and the network information under each topic label, and building an index for the packaged and downloaded data content.
Preferably, after step S5, the method further comprises:
S6. copying the data content packaged and downloaded in step S5 to a specified folder or directory; automatically decompressing the copied data content and restoring the data; and presenting the result to the user as web pages for browsing.
The present invention also provides a semantic-based information acquisition system that implements any of the semantic-based information acquisition methods described above:
A semantic-based information acquisition system comprises:
an abstract data model construction module, for establishing a network resource abstract data model according to the typical characteristics of network resources;
a network information collection module, for collecting network information from the Internet through a search engine and formatting the collected network information with the network resource abstract data model;
a cluster analysis module, for performing cluster analysis on the formatted network information, assigning the network information to corresponding topics according to the cluster analysis result, and extracting a label for each topic;
an analysis result display module, for visually displaying the result of the cluster analysis module.
Preferably, the system further comprises:
a data content download module, for selecting the data content to be packaged and downloaded according to the generated topic labels and the network information under each topic label, and building an index for the packaged and downloaded data content;
an offline browsing module, for copying the packaged and downloaded data content to a specified folder or directory, automatically decompressing the copied data content and restoring the data, and presenting the result to the user as web pages for browsing.
(3) Beneficial effects
In the semantic-based information acquisition method and system provided by the embodiments of the present invention, the organization, visual display, download, and offline viewing of network resources are driven by topics, so that network information can be presented along multiple dimensions and shown to the user in a vivid, intuitive way, which improves the user's browsing efficiency.
Brief Description of the Drawings
Fig. 1 is a schematic flowchart of the semantic-based information acquisition method in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the hardware structure of the semantic-based information acquisition system in an embodiment of the present invention;
Fig. 3 is a diagram of the implementation result of the semantic-based information acquisition method and system in an embodiment of the present invention.
Detailed Description of the Embodiments
Specific embodiments of the present invention are further described below with reference to the drawings and examples. The following embodiments only illustrate the present invention and are not intended to limit its scope.
Embodiment 1
This embodiment first provides a semantic-based information acquisition method. As shown in Fig. 1, the method mainly comprises the steps of:
S1. summarizing the model elements from the typical characteristics of network resources, and establishing the network resource abstract data model;
S2. collecting network information from the Internet through a search engine, and formatting the collected network information with the network resource abstract data model;
S3. performing cluster analysis on the formatted network information, assigning the network information to corresponding topics according to the cluster analysis result, and extracting a label for each topic;
S4. visually displaying the result of step S3.
In addition, the method may further comprise the following steps:
S5. packaged download of network information: selecting the data content to be packaged and downloaded according to the generated topic labels and the network information under each topic label, and building an index for the packaged and downloaded data content;
S6. offline viewing of network information: copying the data content packaged and downloaded in step S5 to a specified folder or directory; automatically decompressing the copied data content and restoring the data; and presenting the result to the user as web pages for browsing.
The steps of the method in this embodiment are described in further detail below.
Step S1 comprises:
summarizing the model elements from the typical characteristics of network resources and establishing the network resource abstract data model. In this embodiment, this step may specifically be:
The typical characteristics of network resources are compared, summarized, and analyzed to obtain the model elements of the network resource abstract data model. For example, the text of a piece of network information generally includes a topic (Topic), a title (Title), a publication time (Time), a publisher (Author), body content (Content), and a link to the source (URL). Network resources generally contain all of these elements, and they are also what users usually care about, so the network resource abstract data model should be built around them. Establishing the abstract model lets users see the content of a network resource more clearly and easily, makes its meaning easier to understand, and so makes network resources more convenient to use.
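The six model elements named above can be pictured as a simple record type. The following is only a minimal sketch: the class name `NetworkResource`, the helper `format_record`, and the shape of the raw input dictionary are illustrative assumptions, not part of the patent text.

```python
from dataclasses import dataclass, asdict
from datetime import datetime
from typing import Optional


@dataclass
class NetworkResource:
    """One formatted item of network information (hypothetical class name)."""
    topic: str                 # Topic
    title: str                 # Title
    time: Optional[datetime]   # Time (publication time); None if unknown
    author: Optional[str]      # Author (publisher); None if unknown
    content: str               # Content (body text)
    url: str                   # URL (link to the source)


def format_record(raw: dict, topic: str) -> NetworkResource:
    """Map a loosely structured dict (assumed crawler output) onto the model."""
    published = raw.get("time")
    return NetworkResource(
        topic=topic,
        title=raw.get("title", "").strip(),
        time=datetime.fromisoformat(published) if published else None,
        author=raw.get("author"),
        content=raw.get("content", "").strip(),
        url=raw["url"],
    )


record = format_record(
    {"title": " Example headline ", "time": "2013-09-27T08:00:00",
     "content": "Body text.", "url": "http://example.com/a"},
    topic="example topic",
)
```

Missing elements are kept as `None` rather than dropped, so every record exposes the same six fields to the later steps.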
Step S2 further comprises:
S21. taking the word entered by the user as the search word, collecting network information with a search engine such as Baidu or Google, and crawling from the Internet the network information returned by the search engine;
S22. parsing the crawled network information (for example, the tags of HTML pages) with a web page crawling and parsing component and regular-expression rules to obtain text information, while filtering out noise on the Internet (such as advertising text or Flash content);
S23. formatting the extracted text information with the network resource abstract data model established in step S1.
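Step S22 can be illustrated with a few regular expressions that strip noise blocks and tags from a crawled page. This is a rough sketch only: the patent does not give its actual rules, and the choice of which elements count as noise (`script`, `style`, `object`, `embed`, `iframe`) is an assumption made here for illustration.

```python
import html
import re

# Elements treated as noise in this sketch (scripts, styles, embedded objects such as Flash).
NOISE_BLOCKS = re.compile(r"<(script|style|object|embed|iframe)\b.*?</\1\s*>", re.I | re.S)
COMMENTS = re.compile(r"<!--.*?-->", re.S)
TITLE = re.compile(r"<title[^>]*>(.*?)</title>", re.I | re.S)
TAGS = re.compile(r"<[^>]+>")


def extract_text(page: str) -> dict:
    """Return the page title and the visible body text of an HTML string."""
    match = TITLE.search(page)
    title = html.unescape(match.group(1)).strip() if match else ""
    body = COMMENTS.sub(" ", page)
    body = NOISE_BLOCKS.sub(" ", body)   # drop noise blocks together with their content
    body = TITLE.sub(" ", body)          # the title is reported separately
    body = TAGS.sub(" ", body)           # drop the remaining tags, keep their text
    body = re.sub(r"\s+", " ", html.unescape(body)).strip()
    return {"title": title, "content": body}


sample = ('<html><head><title>News &amp; Views</title><style>p{}</style></head>'
          '<body><script>var a=1;</script><p>Hello <b>world</b></p>'
          '<object data="ad.swf">ad</object></body></html>')
parsed = extract_text(sample)
```

Regular expressions are fragile on malformed HTML, which is presumably why the method pairs them with a dedicated parsing component.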
Step S3 further comprises:
S31. performing word segmentation and part-of-speech tagging on the text information with a word segmentation tool such as ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System);
S32. The key information in the network information should be easy for the user to understand and semantically unambiguous. To reduce the ambiguity of keywords, this embodiment also adds technical terms from some specialized fields, and stipulates that, apart from individual chemical elements, general names of animals and plants, and other proper nouns, a keyword may not be a single character. In addition, except for holidays, users are generally not interested in specific dates and times; therefore, unless the text really emphasizes a particular time, words such as "2003" or "March" should not appear in a topic label. This standard is used as the candidate keyword standard for topic labels when processing the segmented data. According to this standard, words that do not meet it (such as function words, numerals and quantifiers, and onomatopoeia) and stop words are removed: the word segmentation result of step S31 is filtered, single-character words and words in the stop word list are discarded, and the candidate keywords are obtained;
S33. saving all candidate keywords, computing the contribution degree of each candidate keyword to the topic label, and performing cluster analysis on the network information with the LDA (Latent Dirichlet Allocation) topic model algorithm. In this embodiment, this step specifically comprises:
In addition to part of speech, each word is considered from three aspects: its word frequency, the positions where it occurs, and its form. Eight contribution degrees are defined for each word; all contribution degrees and their calculation methods are shown in Table 1.
Table 1. Word contribution degrees and their calculation methods
[Table 1 appears only as an image in the original publication (Figure BDA0000389231000000071).]
The contribution degree calculation is also responsible for normalizing some time words, place words, and similar terms. For example, the short and full forms of "Tsinghua University", or the short and full forms of "Peking University", may occur in the same piece of network information and in fact denote the same concept. In this embodiment, according to how commonly each form is used (w.ctf) and how often it occurs in the text (w.tf), one of the two words is merged into the other. Their frequencies are added, and the other contribution degrees take the stronger value; for example, if w.quo is 1 for one of the words, w.quo is also 1 after merging.
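The merging rule just described can be sketched as follows. Several details are assumptions: the alias groups are supplied by the caller, the form with the higher (w.ctf, w.tf) is the one kept, and `quo` stands in for any 0/1 contribution degree from Table 1, whose definitions are not available in this text.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class WordStats:
    word: str
    tf: int      # w.tf: frequency of the word in the text
    ctf: float   # w.ctf: how commonly the word form is used
    quo: int     # w.quo: one of the 0/1 contribution degrees of Table 1


def merge_aliases(stats: Dict[str, WordStats],
                  alias_groups: List[Set[str]]) -> Dict[str, WordStats]:
    """Merge words that denote the same concept; the input dict is not modified."""
    merged = dict(stats)
    for group in alias_groups:
        present = [merged[w] for w in sorted(group) if w in merged]
        if len(present) < 2:
            continue
        # Assumption: keep the more common form, breaking ties by text frequency.
        keep = max(present, key=lambda s: (s.ctf, s.tf))
        combined = WordStats(
            word=keep.word,
            tf=sum(s.tf for s in present),     # frequencies are added
            ctf=keep.ctf,
            quo=max(s.quo for s in present),   # the stronger contribution wins
        )
        for s in present:
            del merged[s.word]
        merged[combined.word] = combined
    return merged


stats = {
    "Tsinghua": WordStats("Tsinghua", tf=3, ctf=0.4, quo=1),
    "Tsinghua University": WordStats("Tsinghua University", tf=2, ctf=0.6, quo=0),
    "lab": WordStats("lab", tf=1, ctf=0.9, quo=0),
}
merged = merge_aliases(stats, [{"Tsinghua", "Tsinghua University"}])
```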
Finally, the network information is assigned to corresponding topics according to the cluster analysis result, and a label is extracted for each topic. Extracting the topic label is specifically:
sorting the candidate keywords by contribution degree in descending order and extracting the first several candidate keywords to generate the topic label.
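Label extraction then reduces to a descending sort followed by a cut-off. The sketch below assumes that the eight contribution degrees have already been combined into a single score per keyword; how they are combined is defined in Table 1 and is not reproduced here. The function name and the alphabetical tie-break are illustrative choices.

```python
from typing import Dict, List


def topic_label(contribution: Dict[str, float], n: int = 3) -> List[str]:
    """Return the n candidate keywords with the highest contribution degree."""
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(contribution.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


label = topic_label(
    {"earthquake": 4.5, "rescue": 3.0, "Sichuan": 5.0, "weather": 0.5},
    n=3,
)
```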
In addition, step S3 in this embodiment further comprises:
S34. for the candidate keywords obtained in step S32, establishing a link for each candidate keyword in the knowledge base, so that the relevant information of the entry in the knowledge base can be viewed through the link.
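The candidate keyword standard of step S32 can be sketched as a filter over (word, part-of-speech) pairs produced by the segmentation tool. Everything concrete below is an assumption for illustration: the single-letter tags are modelled on those commonly used by Chinese taggers, and the stop word list, the single-character whitelist, and the holiday list are tiny placeholders.

```python
import re
from typing import List, Tuple

# Assumed tags: particle, preposition, conjunction, numeral, quantifier,
# onomatopoeia, interjection, modal particle.
EXCLUDED_POS = {"u", "p", "c", "m", "q", "o", "e", "y"}
STOP_WORDS = {"的", "了", "我们"}          # placeholder stop word list
SINGLE_CHAR_ALLOWED = {"铁", "氧", "猫"}   # e.g. chemical elements, animal names
HOLIDAYS = {"春节", "国庆节"}              # dates users do care about
DATE_LIKE = re.compile(r"^[0-9一二三四五六七八九十〇零]+[年月日号]?$")


def is_candidate(word: str, pos: str) -> bool:
    """Apply the candidate keyword standard to one segmented word."""
    if word in HOLIDAYS:
        return True
    if word in STOP_WORDS or pos in EXCLUDED_POS:
        return False
    if DATE_LIKE.match(word):      # e.g. "2003年", "3月"
        return False
    if len(word) == 1 and word not in SINGLE_CHAR_ALLOWED:
        return False
    return True


def filter_candidates(tagged: List[Tuple[str, str]]) -> List[str]:
    return [word for word, pos in tagged if is_candidate(word, pos)]


tagged = [("清华大学", "nt"), ("的", "u"), ("2003年", "t"), ("春节", "t"),
          ("铁", "n"), ("说", "v"), ("地震", "n"), ("三", "m")]
candidates = filter_candidates(tagged)
```

The sketch does not cover the case where a text genuinely emphasizes a specific time; deciding that would need information beyond a single word.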
Step S4 further comprises:
S41. according to the search word provided by the user, presenting the first several items of network information returned by the search engine as a summary, so that the user can judge whether they are the required content: if not, ending; if so, continuing;
S42. according to steps S1 to S3, assigning the network information obtained in step S41 to the corresponding topic labels;
S43. displaying the result of step S42 in an HTML page, and generating a topic entity relationship graph and links to the knowledge base according to the degree of relationship between topics and the ranking of individual items of network information, thereby visually displaying the network information.
For example, in daily life and on the Internet, users are surrounded by many information providers such as newspapers, media, and websites. This mass of information enriches users' lives, but it also brings problems such as information overload and difficulty telling true from false. Visualization means using technical means to present information and data in a graphical, interactive form and thereby extend the user's cognition.
In this embodiment, a relationship network can be generated quickly with a draggable layout from a JavaScript library. First, each topic, person entity, organization entity, and place entity is added to the layout as a node, with a different style for each node type. Link information is then added according to the index between them, which yields the initial relationship graph.
When the user analyzes a node in the relationship graph, the selected node, all nodes associated with it, and the links between them are highlighted, so that the user can analyze these elements easily. The graph also supports dragging; when the user is dissatisfied with the current layout or nodes pile up, the layout can be controlled by adjusting the relationship degree value.
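The data behind such a relationship graph can be sketched independently of the JavaScript layout that draws it. The sketch below assumes that each link carries a numeric relationship degree and that raising a threshold on this degree is how the graph is thinned out; the function names and the sample entities are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_graph(nodes: Dict[str, str],
                links: Iterable[Tuple[str, str, float]],
                min_relation: float = 0.0) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map, keeping links at or above the threshold.

    nodes maps a node id to its type (topic, person, organization, place);
    links are (node_a, node_b, relationship_degree) triples.
    """
    adjacency: Dict[str, Set[str]] = defaultdict(set)
    for a, b, degree in links:
        if degree >= min_relation and a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def highlight(adjacency: Dict[str, Set[str]], selected: str) -> Set[str]:
    """Return the selected node together with every node linked to it."""
    return {selected} | adjacency.get(selected, set())


nodes = {"T1": "topic", "Li": "person", "THU": "organization", "Beijing": "place"}
links = [("T1", "Li", 0.9), ("T1", "THU", 0.4), ("THU", "Beijing", 0.7)]

full_view = highlight(build_graph(nodes, links), "T1")
sparse_view = highlight(build_graph(nodes, links, min_relation=0.5), "T1")
```

Raising `min_relation` removes weak links, which is one way to relieve a crowded layout.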
S5. The data content to be packaged and downloaded is selected according to the generated topic labels and the network information under each topic label. The items to download are chosen with HTML check boxes, the checked items are packaged and downloaded concurrently with multiple threads and multiple tasks, and an index is built for the packaged and downloaded data content.
S6. The data content packaged and downloaded in step S5 is copied to a specified folder or directory; the copied data content is automatically decompressed and the data restored, and the result is presented to the user as web pages for browsing.
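The multi-threaded download of step S5 can be sketched with a thread pool in which a failed or timed-out item is skipped so that the remaining items still complete. This is an illustration in Python rather than the Java used by the described system; the function names, the worker count, and the 10-second timeout are assumptions.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable, List, Tuple


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one page; raises on network errors or when the timeout expires."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls: Iterable[str],
                 fetcher: Callable[[str, float], bytes] = fetch,
                 workers: int = 4,
                 timeout: float = 10.0) -> Tuple[Dict[str, bytes], List[str]]:
    """Download the checked items concurrently; collect failures instead of stopping."""
    pages: Dict[str, bytes] = {}
    failed: List[str] = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetcher, url, timeout): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception:
                failed.append(url)   # skip this item and carry on with the rest
    return pages, failed
```

Passing the fetch function as a parameter keeps the sketch testable without network access, for example by supplying a stub that raises `TimeoutError` for selected URLs.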
This embodiment also provides a semantic-based information acquisition system that implements any of the semantic-based information acquisition methods described above. The system mainly comprises an abstract data model construction module, a network information collection module, a cluster analysis module, and an analysis result display module; it may further comprise a data content download module, an offline browsing module, and so on. The abstract data model construction module establishes a network resource abstract data model according to the typical characteristics of network resources. The network information collection module collects network information from the Internet through a search engine and formats the collected network information with the network resource abstract data model. The cluster analysis module performs cluster analysis on the formatted network information, assigns the network information to corresponding topics according to the cluster analysis result, and extracts a label for each topic. The analysis result display module visually displays the result of the cluster analysis module. The data content download module selects the data content to be packaged and downloaded according to the generated topic labels and the network information under each topic label, and builds an index for the packaged and downloaded data content. The offline browsing module copies the packaged and downloaded data content to a specified folder or directory, automatically decompresses the copied data content and restores the data, and presents the result to the user as web pages for browsing.
Fig. 2 is a schematic diagram of the hardware structure of the semantic-based information acquisition system in this embodiment; Fig. 3 is a diagram of the implementation result of the semantic-based information acquisition method and system in this embodiment. The method and system provided in this embodiment are further described below with an example.
(1) The abstract data model construction module establishes the network resource abstract data model according to the typical characteristics of network resources:
For each related search word, the data texts of the first 40 pages of Baidu search results (about 400 results) are collected as a text set for analyzing the typical characteristics of network resources.
Defective data texts (for example, results with only a title, a video, or a picture) are then deleted, leaving 360 search results as the test source data. The features of network resources are extracted from the test source data, and the features they have in common are taken as the typical characteristics. These typical characteristics serve as the elements from which the network resource abstract data model is built.
(2) The network information collection module collects network information from the Internet through a search engine:
This module and method are implemented on Java EE as a B/S (Browser/Server) system developed in Java, with MySQL as the database. Keyword web search results are obtained from Baidu and Google; the attributes of each search result include the page title, the page URL, and a summary of the page content. The web page crawling and parsing component accesses the URL directly, parses the result page to obtain the document, further analyzes the page structure to obtain the text information, and formats the obtained text information according to the established network resource abstract data model.
(3) The cluster analysis module performs cluster analysis on the network information with the LDA algorithm:
This part can directly use the algorithms of the data analysis part of the NewsMiner (news event mining) project, applying Chinese word segmentation and the LDA probabilistic model algorithm to perform topic analysis on the keyword search result sets obtained from Baidu and Google. The analysis results include topic classification information, the degree of association between topics, and so on.
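To make the LDA step concrete, the following is a compact collapsed Gibbs sampler for LDA over already segmented documents. It is a textbook-style sketch and not the NewsMiner implementation; the hyperparameters `alpha` and `beta`, the iteration count, and the rule of assigning each document to its most frequent topic are assumptions made here.

```python
import random
from typing import Dict, List, Tuple


def lda_gibbs(docs: List[List[str]], n_topics: int, iterations: int = 200,
              alpha: float = 0.1, beta: float = 0.01,
              seed: int = 0) -> Tuple[List[List[int]], List[Dict[str, int]]]:
    """Collapsed Gibbs sampling for LDA; returns doc-topic and topic-word counts."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word: List[Dict[str, int]] = [dict() for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments: List[List[int]] = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] = topic_word[z].get(w, 0) + 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] = topic_word[z].get(w, 0) + 1
                topic_total[z] += 1
    return doc_topic, topic_word


def dominant_topic(row: List[int]) -> int:
    """Index of the topic that holds most of a document's tokens."""
    return max(range(len(row)), key=row.__getitem__)


docs = [["quake", "rescue", "quake"], ["rescue", "team", "quake"],
        ["exam", "school", "exam"], ["school", "student"]]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2, iterations=50, seed=1)
clusters = [dominant_topic(row) for row in doc_topic]
```

In practice a tested library implementation would normally be preferred; the sketch only shows which counts the sampler maintains and how a document ends up in a topic.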
(4) The analysis result display module visually displays the result of the cluster analysis module:
4.1) Display of the first M search results
According to the search word provided by the user, the first several items of network information returned by the search engine are presented as a summary, so that the user can judge whether they are the required content: if not, processing ends; if so, it continues;
4.2) Depending on the user's decision in step 4.1, if processing continues, cluster analysis of the data is performed according to the topic labels, a dynamic classification analysis is carried out based on the number of items retrieved, and topics are generated;
4.3) The result of step 4.2 is displayed in an HTML page, and a topic entity relationship graph and links to the knowledge base are generated according to the degree of relationship between topics and the ranking of individual items of network information, thereby visually displaying the network information.
(5) The data content download module selects the data content to be packaged and downloaded according to the generated topic labels and the network information under each topic label, and builds an index for the packaged and downloaded data content. This part mainly covers five aspects: web page crawling for a task, multi-threaded tasks, progress saving, index building, and file packaging. Some of them are described in detail below.
5.1) Web page crawling for a task
5.1.1) Crawling Baidu search results
A socket is used to send a request with the relevant parameters to the Baidu server and obtain the real URL. To prevent a failed crawl from hanging the system, this embodiment uses a timeout mechanism, so that after a failed download the system skips that item and continues crawling the others. The HTML code of the page is read from the data stream obtained through HttpClient. The first step before saving the file is to detect the page encoding. Web pages generally declare their encoding near the top, in the head tags, so it is obtained here by direct matching. In the final step, the HTML code is saved to a web page file with this encoding, which avoids garbled characters when the user browses the page offline.
The second step is to parse the HTML code and obtain the image links and the links to the page's CSS styles; downloading these files preserves the static appearance of the page as far as possible. HtmlParser is used for parsing here. The links obtained take several forms, some absolute URLs and some relative addresses, so each relative address must first be converted to its real URL before downloading. After a successful download, the file link in the original HTML code is replaced with a local link. In addition, some of the knowledge base information obtained from the knowledge base system contains images that need to be downloaded; these are parsed in the same way as those in web pages.
The final step is to save the HTML code as an HTML file.
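The encoding detection and link localization described in 5.1.1 can be sketched as below. The described system uses HtmlParser in Java; this Python illustration substitutes regular expressions, and the local file naming scheme (`res_0.css`, `res_1.png`, ...) is an assumption. The actual download of each resource is left out.

```python
import posixpath
import re
from typing import Dict, Tuple
from urllib.parse import urljoin, urlparse

# The charset declaration is matched directly in the first bytes of the page.
CHARSET = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w-]+)", re.I)
# src/href attributes of <img> and <link> tags (pictures and CSS files).
RESOURCE = re.compile(r"""(<(?:img|link)\b[^>]*?\b(?:src|href)\s*=\s*["'])([^"']+)(["'])""",
                      re.I)


def detect_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding declared in the page head, or a default."""
    match = CHARSET.search(raw[:4096])
    return match.group(1).decode("ascii").lower() if match else default


def localize(page: str, page_url: str) -> Tuple[str, Dict[str, str]]:
    """Rewrite picture and CSS links to local names.

    Returns the rewritten HTML and a map from absolute URL to local file name,
    which a downloader would then use to fetch and store each resource.
    """
    downloads: Dict[str, str] = {}

    def replace(match: "re.Match[str]") -> str:
        absolute = urljoin(page_url, match.group(2))   # resolve relative addresses
        if absolute not in downloads:
            extension = posixpath.splitext(urlparse(absolute).path)[1]
            downloads[absolute] = f"res_{len(downloads)}{extension}"
        return match.group(1) + downloads[absolute] + match.group(3)

    return RESOURCE.sub(replace, page), downloads


raw = (b'<html><head><meta http-equiv="Content-Type" '
       b'content="text/html; charset=gb2312">'
       b'<link rel="stylesheet" href="/css/a.css"></head>'
       b'<body><img src="img/p.png"><a href="next.html">n</a></body></html>')
charset = detect_charset(raw)
local_html, downloads = localize(raw.decode(charset), "http://example.com/news/page.html")
```

Ordinary hyperlinks (`<a href>`) are deliberately left untouched, since only the resources needed to render the page are stored locally.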
5.1.2) Obtaining Google search results
Through Google's open API (Application Programming Interface), the search results are obtained as network data in JSON (JavaScript Object Notation, a lightweight data interchange format), and data files in the same format as in step 5.1.1 are then generated.
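Converting a JSON response into the common record format might look like the sketch below. The patent does not say which Google API or response layout was used, so the field names `items`, `title`, `link`, and `snippet`, as well as the target keys `title`, `url`, and `summary`, are assumptions for illustration only.

```python
import json
from typing import Dict, List


def results_from_json(payload: str) -> List[Dict[str, str]]:
    """Convert a JSON search response (assumed layout) into common result records."""
    data = json.loads(payload)
    return [
        {
            "title": item.get("title", ""),
            "url": item["link"],
            "summary": item.get("snippet", ""),
        }
        for item in data.get("items", [])
    ]


payload = json.dumps({"items": [
    {"title": "Example page", "link": "http://example.com/1", "snippet": "First result."},
    {"link": "http://example.com/2"},
]})
results = results_from_json(payload)
```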
5.2) Building an index for the crawled data files
The file index can be built with Lucene (an open-source full-text search engine toolkit). The index mainly contains the page title, the page summary, the folder name where the page is stored, the topic category, and the task name, as well as the related knowledge base information. In the index, the title and page summary are segmented with IKAnalyzer (an open-source, lightweight Chinese word segmentation toolkit) so that the offline system can search them.
5.3) Packaging the indexed files
The data files generated in step 5.1 and the index files generated in step 5.2 are packaged together; for example, the whole task folder is packaged in zip format.
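Packaging a task folder in zip format, and restoring it again on the offline side, can be sketched with the standard library. The function names and the folder layout in the usage example are hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def pack_task(folder: Path, archive: Path) -> Path:
    """Pack every file under the task folder into one zip archive."""
    # List the files first so the archive never includes itself.
    files = sorted(path for path in folder.rglob("*") if path.is_file())
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in files:
            bundle.write(path, path.relative_to(folder).as_posix())
    return archive


def restore_task(archive: Path, target: Path) -> List[str]:
    """Unpack an archive into the target folder and return the restored names."""
    with zipfile.ZipFile(archive) as bundle:
        bundle.extractall(target)
        return sorted(bundle.namelist())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    task = root / "task"
    (task / "pages").mkdir(parents=True)
    (task / "pages" / "a.html").write_text("<p>a</p>", encoding="utf-8")
    (task / "index.json").write_text("{}", encoding="utf-8")

    archive = pack_task(task, root / "task.zip")
    restored = restore_task(archive, root / "offline")
    restored_page = (root / "offline" / "pages" / "a.html").read_text(encoding="utf-8")
```

Storing paths relative to the task folder keeps the archive portable, so it unpacks the same way under any target directory.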
(6) The offline browsing module copies the packaged and downloaded data content into a specified folder or directory, automatically decompresses and restores the copied data content, and presents it to the user for browsing in the form of web pages:
6.1) Downloading the packaged file online
The user can log in to his or her own account and download the tasks completed online to the local machine (what is downloaded is the file produced by packaging the network information together with the index files).
6.2) Data restoration and offline viewing
The user logs in to the offline system and uploads the packaged file downloaded in step 6.1. The offline system automatically decompresses the uploaded file, restores the data, and presents it to the user for browsing in the form of web pages.
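The decompression and restoration step might look like the sketch below. It assumes the uploaded package is a zip archive as in the packaging example of step 5.3; the function name `restore_task` and the defensive path check are additions for illustration and are not described in the text.

```python
import zipfile
from pathlib import Path


def restore_task(archive_path, target_dir):
    """Extract an uploaded task archive and list the restored HTML pages."""
    target = Path(target_dir).resolve()
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        # Defensive check: refuse entries that would land outside the target.
        for name in archive.namelist():
            destination = (target / name).resolve()
            if destination != target and target not in destination.parents:
                raise ValueError(f"unsafe archive entry: {name}")
        archive.extractall(target)
    # The restored pages can then be served to the user as web pages.
    return sorted(p.relative_to(target).as_posix() for p in target.rglob("*.html"))
```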
6.3) File retrieval in the offline system
For the packaged file information uploaded in step 6.2, the user can search the task information in detail by entering keywords related to the task name or content in the intelligent information search box. Matching the index built on the network side, retrieval is based on the keywords entered by the user: for the web page part it searches the title and summary of each page, and for the knowledge base part it searches the text content of the knowledge base at the same time. This part uses full-text search technology designed and developed on Lucene and is in effect a search engine; in other words, the data downloaded by each user serves as the source data for the search.
The front-end pages of the entire offline browsing module are in HTML format; JavaScript obtains the related data from the back end to fill the pages, then organizes and displays it. The information browsing page includes a visual presentation of the degree of association between topics; because a topic-based model of the network information is used, the relationships between topics can be seen very clearly.
(7) Experimental results
In the search analysis part, under normal network conditions a search takes about 2-3 minutes before the analysis results are displayed. This is relatively hard for ordinary users to accept. Researchers, however, are used to repeatedly checking and screening results themselves for the content they need when collecting network information, so by comparison they can save a great deal of time by using the information acquisition system provided by the present invention. The system clusters by topic the raw results returned by the search, and researchers can find the content they need more quickly from the topic clustering results, so the search analysis essentially achieves the expected goal.
Although this system is a search analysis system that relies on the databases of other web search engines, it combines topic-based intelligent processing of network information with related knowledge base content and is oriented toward its users; it can also be promoted and deployed, so on the whole it has certain advantages. In addition, the system has considerable room for expansion, and its search content can be enriched and its search quality improved through upgrades.
Compared with other traditional search engines (such as Baidu and Google), although the system takes its search results from Baidu, the topic clustering analysis of the results still lets users see clearly the topics covered by the web pages, making information lookup more convenient.
The knowledge base part of the system can be encyclopedia-type ontology information; that is, the whole system combines web search with encyclopedia ontology search, so that users can understand the related content more conveniently, the search results are richer, and the presented content is more intuitive.
In terms of functions for internal users, the system allows research departments and institutions to obtain network resources more easily and saves researchers the time spent collecting large amounts of resources, thereby improving research efficiency.
In summary, the semantic-based information acquisition method and system provided by the present invention integrate data mining, Semantic Web and natural language processing technologies. Taking text semantics as the core, they use topic analysis and knowledge linking techniques to deeply analyze and reorganize web search results, and provide a fully automatic, intelligent network data download service. Users are freed from the work of manually reading and filtering lengthy information and are provided with a deeper understanding of web search results and a more convenient browsing service, which effectively improves browsing efficiency.
The above embodiments are intended only to illustrate the present invention and not to limit it. Those of ordinary skill in the relevant technical field may make various changes and modifications without departing from the spirit and scope of the present invention; all equivalent technical solutions therefore also fall within the scope of protection of the present invention.

Claims (10)

1. A semantic-based information acquisition method, characterized by comprising the steps of:
S1. establishing a network resource abstract data model according to typical characteristics of network resources;
S2. acquiring network information from the internet by means of a search engine, and formatting the acquired network information using the network resource abstract data model;
S3. performing cluster analysis on the formatted network information, dividing the network information into corresponding topics according to the cluster analysis result, and extracting a label for each topic;
S4. visually displaying the processing result of step S3.
2. The semantic-based information acquisition method according to claim 1, characterized in that step S1 further comprises:
summarizing the model elements of the network resource abstract data model according to the typical characteristics of network resources, and establishing the network resource abstract data model.
3. The semantic-based information acquisition method according to claim 2, characterized in that step S2 further comprises:
S21. capturing from the internet the network information found by the search engine;
S22. parsing and analyzing the captured network information using a web page capture and analysis program component and regular-expression rules, to obtain text information;
S23. formatting the obtained text information using the network resource abstract data model.
4. The semantic-based information acquisition method according to claim 3, characterized in that step S3 further comprises:
S31. performing word segmentation and part-of-speech tagging on the formatted text information using a Chinese word segmentation tool;
S32. filtering the word segmentation result of step S31 according to a preset candidate keyword criterion, to obtain candidate keywords;
S33. computing the contribution degree of each candidate keyword to the topic label, performing cluster analysis on the network information, and dividing the network information into corresponding topics according to the cluster analysis result;
S34. sorting the candidate keywords in descending order of contribution degree and extracting the first several candidate keywords to generate the topic label.
5. The semantic-based information acquisition method according to claim 4, characterized in that step S3 further comprises:
S35. establishing links from the candidate keywords to a knowledge base.
6. The semantic-based information acquisition method according to claim 5, characterized in that step S4 further comprises:
S41. according to the search terms provided by the user, presenting the first several items of network information found by the search engine as a summary, for the user to judge whether they are the required content: if not, ending; if so, continuing;
S42. according to steps S1 to S3, dividing the network information obtained in step S41 into corresponding topics and generating the corresponding topic labels;
S43. generating a topic entity relationship diagram and links to the knowledge base according to the degree of relationship between topics and the ranking of individual items of network information.
7. The semantic-based information acquisition method according to any one of claims 1 to 6, characterized in that, after step S4, the method further comprises:
S5. according to the generated topic labels and the network information under the topic labels, selecting the data content to be packaged and downloaded, and building an index over that data content.
8. The semantic-based information acquisition method according to claim 7, characterized in that, after step S5, the method further comprises:
S6. copying the data content packaged and downloaded in step S5 into a specified folder or directory; automatically decompressing and restoring the copied data content, and presenting it to the user for browsing in the form of web pages.
9. A semantic-based information acquisition system implementing the method according to any one of claims 1 to 8, characterized by comprising:
an abstract data model building module, configured to establish a network resource abstract data model according to typical characteristics of network resources;
a network information acquisition module, configured to acquire network information from the internet by means of a search engine and to format the acquired network information using the network resource abstract data model;
a cluster analysis module, configured to perform cluster analysis on the formatted network information, divide the network information into corresponding topics according to the cluster analysis result, and extract a label for each topic;
an analysis result display module, configured to visually display the processing result of the cluster analysis module.
10. The semantic-based information acquisition system according to claim 9, characterized by further comprising:
a data content download module, configured to select, according to the generated topic labels and the network information under the topic labels, the data content to be packaged and downloaded, and to build an index over that data content;
an offline browsing module, configured to copy the packaged and downloaded data content into a specified folder or directory, automatically decompress and restore the copied data content, and present it to the user for browsing in the form of web pages.
CN2013104526558A 2013-09-27 2013-09-27 Semantic-based information acquisition method and semantic-based information acquisition system Pending CN103473369A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013104526558A CN103473369A (en) 2013-09-27 2013-09-27 Semantic-based information acquisition method and semantic-based information acquisition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013104526558A CN103473369A (en) 2013-09-27 2013-09-27 Semantic-based information acquisition method and semantic-based information acquisition system

Publications (1)

Publication Number Publication Date
CN103473369A true CN103473369A (en) 2013-12-25

Family

ID=49798217

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013104526558A Pending CN103473369A (en) 2013-09-27 2013-09-27 Semantic-based information acquisition method and semantic-based information acquisition system

Country Status (1)

Country Link
CN (1) CN103473369A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101529418A (en) * 2006-01-19 2009-09-09 维里德克斯有限责任公司 Systems and methods for acquiring analyzing mining data and information
CN101788988A (en) * 2009-01-22 2010-07-28 蔡亮华 Information extraction method
US20120030206A1 (en) * 2010-07-29 2012-02-02 Microsoft Corporation Employing Topic Models for Semantic Class Mining
CN102567530A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Intelligent extraction system and intelligent extraction method for article type web pages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王巍 等: "基于网络搜索引擎的网络话题分析框架", 《计算机工程》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317845A (en) * 2014-10-13 2015-01-28 安徽华贞信息科技有限公司 Method and system for automatic extraction of deep web data
CN104951432A (en) * 2015-05-21 2015-09-30 腾讯科技(深圳)有限公司 Information processing method and device
CN104951432B (en) * 2015-05-21 2019-01-11 腾讯科技(深圳)有限公司 The method and device that a kind of pair of information is handled
CN104991897B (en) * 2015-05-29 2018-09-25 百度在线网络技术(北京)有限公司 Weights and measures searching method and device
CN104991897A (en) * 2015-05-29 2015-10-21 百度在线网络技术(北京)有限公司 Method and device for searching weights and measures
CN105677716A (en) * 2015-12-23 2016-06-15 牡丹江师范学院 Computer data acquisition, processing and analysis system
CN105677716B (en) * 2015-12-23 2019-03-29 牡丹江师范学院 A kind of computer data acquiring processing analysis system
CN105447202A (en) * 2015-12-31 2016-03-30 宁波公众信息产业有限公司 Internet information collecting system
CN106844336A (en) * 2016-12-26 2017-06-13 博彦科技股份有限公司 Data model processing method and processing device
CN109947858A (en) * 2017-07-26 2019-06-28 腾讯科技(深圳)有限公司 A kind of method and device of data processing
CN107918644A (en) * 2017-10-31 2018-04-17 北京锐思爱特咨询股份有限公司 News subject under discussion analysis method and implementation system in reputation Governance framework
CN107918644B (en) * 2017-10-31 2020-12-08 北京锐思爱特咨询股份有限公司 News topic analysis method and implementation system in reputation management framework
CN108052527A (en) * 2017-11-08 2018-05-18 中国传媒大学 Method is recommended in film bridge piecewise analysis based on label system
CN110019763A (en) * 2017-12-27 2019-07-16 北京京东尚科信息技术有限公司 Text filtering method, system, equipment and computer readable storage medium
CN110019763B (en) * 2017-12-27 2022-04-12 北京京东尚科信息技术有限公司 Text filtering method, system, equipment and computer readable storage medium
CN110399605A (en) * 2018-04-17 2019-11-01 富士施乐株式会社 Information processing unit and the computer-readable medium for storing program
CN109685158A (en) * 2019-01-08 2019-04-26 东北大学 A kind of cluster result semantic feature extraction and method for visualizing based on strong point collection
CN109685158B (en) * 2019-01-08 2020-10-16 东北大学 Clustering result semantic feature extraction and visualization method based on strong item set
CN110688508A (en) * 2019-09-03 2020-01-14 北京字节跳动网络技术有限公司 Image-text data expansion method and device and electronic equipment
CN110688508B (en) * 2019-09-03 2022-09-02 北京字节跳动网络技术有限公司 Image-text data expansion method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20131225
