CN106682007A - Data acquisition method and device - Google Patents
Data acquisition method and device
- Publication number
- CN106682007A CN201510752393.6A
- Authority
- CN
- China
- Prior art keywords
- source
- web page
- news
- information
- page news
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a data acquisition method and device, relates to the field of network technology, and addresses the low accuracy of the influence data currently obtained for web page news. The method comprises: acquiring the source information of a web page news item; extracting, from a preset web page source library, the source information that corresponds to the source information of the web page news; and determining the weight value corresponding to the extracted source information as the influence data of the web page news. The preset web page source library stores multiple pieces of source information and the weight value corresponding to each of them.
Description
Technical field
The present invention relates to the field of network technology, and in particular to a data acquisition method and device.
Background
With the spread of the Internet and the surge in the number of Internet users, online news has risen rapidly as a new and relatively independent mode of news dissemination, and has become another important channel through which people obtain information. Here, online news refers to news information transmitted over the Internet. Research on the influence of online news is attracting growing attention: computing the influence of online news provides a basis for judging the validity of the news. Compared with other mass media, online media are more complex, and this complexity arises both from media technology and from the spatial characteristics of the network.
At present, the repost rate and reply rate of web page news are used as the indicators of news influence. However, the reply rate and repost rate are proportional to the time that has passed since the news appeared, and they decline after a period of time. This calculation is therefore inaccurate for evaluating the influence of real-time news, and the accuracy of the web page news influence obtained by existing methods is low.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a data acquisition method and device that overcome the above problems or at least partially solve them.
To achieve the above purpose, the present invention mainly provides the following technical solutions:
In one aspect, an embodiment of the present invention provides a data acquisition method, the method comprising:
acquiring the source information of web page news;
extracting, from a preset web page source library, the source information corresponding to the source information of the web page news, where the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information;
determining the weight value corresponding to the extracted source information as the influence data of the web page news.
In another aspect, an embodiment of the present invention further provides a data acquisition device, the device comprising:
an acquiring unit, configured to acquire the source information of web page news;
an extraction unit, configured to extract, from a preset web page source library, the source information corresponding to the source information of the web page news, where the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information;
a determining unit, configured to determine the weight value corresponding to the extracted source information as the influence data of the web page news.
With the above technical solutions, the embodiments of the present invention have at least the following advantages:
The embodiments of the present invention provide a data acquisition method and device. The source information of the web page news is acquired first; the source information corresponding to it is then extracted from a preset web page source library, which stores multiple pieces of source information and the weight value corresponding to each; finally, the weight value corresponding to the extracted source information is determined as the influence data of the web page news. Compared with the current practice of using the repost rate and reply rate of web page news as the indicators of its influence, the present invention converts the evaluation of news influence into an evaluation of the website the news comes from. This makes it possible to identify fake news that has a high repost rate and many comments, to assess the influence of online news in real time, and thereby to improve the accuracy of the web page news influence obtained.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a flowchart of a data acquisition method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of another data acquisition method provided by an embodiment of the present invention;
Fig. 3 is a block diagram of a data acquisition device provided by an embodiment of the present invention;
Fig. 4 is a block diagram of another data acquisition device provided by an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
To make the advantages of the technical solution of the present invention clearer, the present invention is described in detail below with reference to the drawings and embodiments.
An embodiment of the present invention provides a data acquisition method. As shown in Fig. 1, the method includes:
S101: Acquire the source information of the web page news.
The source information of a web page news item indicates which website the news item belongs to. For example, if a news item about "the Fifth Plenary Session of the 18th Central Committee will be held tomorrow" appears on the network, the source information acquired for that web page news is the central government website.
It should be noted that the source information of the web page news is acquired as follows. The web page news is obtained first; a crawler then crawls data from the whole network to determine whether pages identical to the web page news exist. If they exist, the website on which the news first appeared is located among these identical pages, and the source information is obtained from that originating website. If they do not exist, the source information is obtained directly from the web page news.
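The acquisition procedure above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patent's implementation: the `Page` record, the use of the URL host name as "source information", exact body equality as the test for "identical" pages, and the `first_seen` timestamp used to pick the originating page are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Iterable
from urllib.parse import urlparse


@dataclass(frozen=True)
class Page:
    url: str
    body: str
    first_seen: float  # hypothetical: time the page was first observed


def source_of(page: Page) -> str:
    """Assumption: the host name stands in for the page's source information."""
    return urlparse(page.url).netloc


def acquire_source_info(news: Page, crawled: Iterable[Page]) -> str:
    """S101 sketch: use the originating page's source if identical pages exist."""
    # Assumption: "identical" means exactly the same body text.
    identical = [p for p in crawled if p.url != news.url and p.body == news.body]
    if not identical:
        # No identical page found: take the source from the news page itself.
        return source_of(news)
    # Assumption: the earliest-seen copy is treated as the originating page.
    origin = min(identical + [news], key=lambda p: p.first_seen)
    return source_of(origin)
```

In this sketch the news page itself competes with the crawled copies, so a page that is already the earliest copy keeps its own source.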
S102: Extract, from the preset web page source library, the source information corresponding to the source information of the web page news.
The preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information. The weight values in the library may be assigned according to the actual credit rating of the website, according to the government authority level of the website, or according to a combination of the two; the embodiment of the present invention does not specifically limit this. The weight value represents the influence of the web page news: the larger the weight value, the greater the influence of the web page news.
The government authority level mentioned here may refer to the officially published ranks of government departments, or to the ranking between main sites and sub-sites published by each website. The present invention places no restriction on this.
For example, when the preset web page source library is divided according to the government authority level of the website, the source information and the corresponding weight values may be as follows: level 1, central government websites, 50%; level 2, local government websites, 30%; level 2.1, provincial government websites, 15%; level 2.2, municipal government websites, 10%; level 2.3, county-level government websites, 5%; level 3, news websites, 20%; level 3.1, provincial news websites, 10%; level 3.2, municipal news websites, 6%; level 3.3, county-level news websites, 4%.
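One possible way to hold the example library in memory is a pair of lookup tables, as in the sketch below. The level-to-weight values are the ones listed in the example; the host names in `SOURCE_LEVELS` and the function name `lookup_weight` are hypothetical and serve only to make the example self-contained.

```python
from typing import Optional

# Weights from the example above, keyed by level.
LEVEL_WEIGHTS = {
    "1": 0.50,    # central government websites
    "2": 0.30,    # local government websites
    "2.1": 0.15,  # provincial government websites
    "2.2": 0.10,  # municipal government websites
    "2.3": 0.05,  # county-level government websites
    "3": 0.20,    # news websites
    "3.1": 0.10,  # provincial news websites
    "3.2": 0.06,  # municipal news websites
    "3.3": 0.04,  # county-level news websites
}

# Hypothetical source information (host names) mapped to a level.
SOURCE_LEVELS = {
    "www.central-gov.example": "1",
    "www.province-gov.example": "2.1",
    "news.city.example": "3.2",
}


def lookup_weight(source_info: str) -> Optional[float]:
    """S102 sketch: return the weight stored for a source, or None if unknown."""
    level = SOURCE_LEVELS.get(source_info)
    return None if level is None else LEVEL_WEIGHTS[level]
```

In the listed figures, the three sub-level weights under level 2 add up to the 30% given for level 2, and those under level 3 add up to the 20% given for level 3.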
S103: Determine the weight value corresponding to the extracted source information as the influence data of the web page news.
In this embodiment, the source information of the web page news is acquired first; the source information corresponding to it is then extracted from the preset web page source library; finally, the weight value corresponding to the extracted source information is determined as the influence data of the web page news. By converting the evaluation of news influence into an evaluation of the website the news comes from, the present invention can identify fake news that has a high repost rate and many comments, can assess the influence of online news in real time, and thereby improves the accuracy of the web page news influence obtained.
An embodiment of the present invention thus provides a data acquisition method in which the source information of the web page news is acquired first; the source information corresponding to it is then extracted from a preset web page source library, which stores multiple pieces of source information and the weight value corresponding to each; and the weight value corresponding to the extracted source information is finally determined as the influence data of the web page news. Compared with the current practice of using the repost rate and reply rate of web page news as the indicators of its influence, the present invention converts the evaluation of news influence into an evaluation of the website the news comes from. This makes it possible to identify fake news that has a high repost rate and many comments, to assess the influence of online news in real time, and thereby to improve the accuracy of the web page news influence obtained.
An embodiment of the present invention provides another data acquisition method. As shown in Fig. 2, the method includes:
S201: Acquire the source information of the web page news.
The source information of a web page news item indicates which website the news item belongs to. For example, if a news item about "the first batch of 42 enterprises has moved into the entrepreneurship and innovation base in Dadong District, Shenyang" appears on the network, the source information acquired for that web page news is the Liaoning provincial government website.
In this embodiment, step S201 includes: obtaining the web page news; crawling data from the whole network with a crawler and determining whether pages identical to the web page news exist; and, if none exist, obtaining the source information from the web page news.
In this embodiment, after determining whether pages identical to the web page news exist, the method further includes: if such pages exist, extracting the originating web page news from the identical pages, and obtaining the source information from the originating web page news. In other words, the web page news is obtained first, and a crawler then crawls data from the whole network to determine whether identical pages exist. If they exist, the website on which the news first appeared is located among these identical pages, and the source information is obtained from that originating website. If they do not exist, the source information is obtained directly from the web page news.
It should be noted that the originating web page news may be extracted from the identical pages using parameters such as the following: the PR (PageRank) value of the page, where a higher PR value means the page is more likely to be regarded as the original version; the time at which the page was first indexed, where the earlier a search engine indexed the page, the more likely the page is to be regarded as the original source when identical content is found later; the domain registration time, where pages on older domains are more likely to be regarded as the original source; and the Technorati authority of the website, among others. The embodiment of the present invention does not specifically limit this.
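The text lists the parameters but does not say how they are combined. The sketch below shows one possible combination, assumed purely for illustration: candidates are ordered by earliest indexing time, then by higher PR value, then by older domain registration. The `Candidate` record and the `pick_origin` function are hypothetical names, and the Technorati authority is left out for brevity.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    url: str
    pagerank: float          # PR value of the page
    first_indexed: date      # when a search engine first indexed the page
    domain_registered: date  # registration date of the page's domain


def pick_origin(candidates: Sequence[Candidate]) -> Candidate:
    """Pick the most likely originating page among identical pages.

    Assumed ordering (not specified by the source text): earliest indexing
    first, then higher PR value, then older domain registration.
    """
    if not candidates:
        raise ValueError("no candidate pages to choose from")
    return min(
        candidates,
        key=lambda c: (c.first_indexed, -c.pagerank, c.domain_registered),
    )
```

A weighted score over the same parameters would be an equally plausible design; the lexicographic ordering is used here only because it needs no tuning constants.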
S202: Determine whether a history weight value can be extracted from the web page news.
In this embodiment, checking whether a history weight value can be extracted from the web page news improves the efficiency of obtaining the weight value of the web page news. If a history weight value can be extracted from the web page news, it is directly determined as the weight value of the web page news. The weight value corresponding to the web page then no longer has to be obtained by looking it up in the preset web page source library, which improves the efficiency of obtaining the web page news influence data.
S203a: If a history weight value can be extracted from the web page news, determine the extracted history weight value as the influence data of the web page news.
S203b: If no history weight value can be extracted from the web page news, extract, from the preset web page source library, the source information corresponding to the source information of the web page news.
Step S203b is an alternative branch parallel to step S203a. The preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information. The weight values in the library may be assigned according to the actual credit rating of the website, according to the government authority level of the website, or according to a combination of the two; the embodiment of the present invention does not specifically limit this. The weight value represents the influence of the web page news: the larger the weight value, the greater the influence of the web page news.
In this embodiment, each piece of source information in the preset web page source library is assigned its weight value according to the government authority level of the information source. For example, when the preset web page source library is divided according to the government authority level of the website, the source information and the corresponding weight values may be as follows: level 1, central government websites, 50%; level 2, local government websites, 30%; level 2.1, provincial government websites, 15%; level 2.2, municipal government websites, 10%; level 2.3, county government websites, 5%; level 3, news websites, 20%; level 3.1, provincial news websites, 10%; level 3.2, municipal news websites, 6%; level 3.3, county-level news websites, 4%.
S204b: Determine the weight value corresponding to the extracted source information as the influence data of the web page news.
In this embodiment, the source information of the web page news is acquired first, and it is then determined whether a history weight value can be extracted from the web page news. If so, the extracted history weight value is determined as the influence data of the web page news. If not, the source information corresponding to the source information of the web page news is extracted from the preset web page source library, and the weight value corresponding to the extracted source information is determined as the influence data of the web page news. By converting the evaluation of news influence into an evaluation of the website the news comes from, fake news that has a high repost rate and many comments can be identified and the influence of online news can be assessed in real time, which improves both the accuracy and the efficiency of obtaining the web page news influence.
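The branch between the history weight value and the library lookup can be sketched as below. This is an illustration under assumptions: the news item is modelled as a plain dictionary with hypothetical keys `"source"` and `"history_weight"`, and the source library as a dictionary from source information to weight; the text does not say how a history weight value is stored in a web page news item.

```python
from typing import Dict, Optional


def influence_weight(
    news: Dict[str, object], source_library: Dict[str, float]
) -> Optional[float]:
    """S202 to S204b sketch: prefer a stored history weight, else look it up."""
    # S202: check whether a history weight value can be extracted.
    history = news.get("history_weight")
    if history is not None:
        # S203a: the history weight value is used directly.
        return float(history)
    # S203b / S204b: fall back to the preset web page source library.
    return source_library.get(str(news["source"]))
```

For example, with a library `{"www.gov.example": 0.5}`, a news item carrying `"history_weight": 0.3` yields 0.3 without touching the library, while an item without it yields the library value for its source.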
An embodiment of the present invention thus provides another data acquisition method in which the source information of the web page news is acquired first; the source information corresponding to it is then extracted from a preset web page source library, which stores multiple pieces of source information and the weight value corresponding to each; and the weight value corresponding to the extracted source information is finally determined as the influence data of the web page news. Compared with the current practice of using the repost rate and reply rate of web page news as the indicators of its influence, the present invention converts the evaluation of news influence into an evaluation of the website the news comes from. This makes it possible to identify fake news that has a high repost rate and many comments, to assess the influence of online news in real time, and thereby to improve the accuracy of the web page news influence obtained.
Further, an embodiment of the present invention provides a data acquisition device. As shown in Fig. 3, the device includes: an acquiring unit 31, an extraction unit 32, and a determining unit 33.
The acquiring unit 31 is configured to acquire the source information of web page news.
The extraction unit 32 is configured to extract, from a preset web page source library, the source information corresponding to the source information of the web page news, where the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information.
The determining unit 33 is configured to determine the weight value corresponding to the extracted source information as the influence data of the web page news.
It should be noted that, for further description of the functional units of this data acquisition device, reference may be made to the corresponding description of the method shown in Fig. 1, which is not repeated here. It should be understood that the device in this embodiment can implement everything described in the foregoing method embodiment.
An embodiment of the present invention thus provides a data acquisition device with which the source information of the web page news is acquired first; the source information corresponding to it is then extracted from a preset web page source library, which stores multiple pieces of source information and the weight value corresponding to each; and the weight value corresponding to the extracted source information is finally determined as the influence data of the web page news. Compared with the current practice of using the repost rate and reply rate of web page news as the indicators of its influence, the present invention converts the evaluation of news influence into an evaluation of the website the news comes from. This makes it possible to identify fake news that has a high repost rate and many comments, to assess the influence of online news in real time, and thereby to improve the accuracy of the web page news influence obtained.
Further, an embodiment of the present invention provides another data acquisition device. As shown in Fig. 4, the device includes: an acquiring unit 41, an extraction unit 42, and a determining unit 43.
The acquiring unit 41 is configured to acquire the source information of web page news.
The extraction unit 42 is configured to extract, from a preset web page source library, the source information corresponding to the source information of the web page news, where the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information.
The determining unit 43 is configured to determine the weight value corresponding to the extracted source information as the influence data of the web page news.
Further, the acquiring unit 41 includes:
an acquisition module 411, configured to obtain the web page news;
a judging module 412, configured to crawl data from the whole network with a crawler and determine whether pages identical to the web page news exist.
The acquisition module 411 is further configured to obtain the source information from the web page news if no page identical to the web page news exists.
Further, the acquiring unit 41 also includes an extraction module 412.
The extraction module 412 is configured to extract the originating web page news from the identical pages if pages identical to the web page news exist.
The acquisition module 411 is specifically configured to obtain the source information from the originating web page news.
Further, the device also includes a judging unit 44.
The judging unit 44 is configured to determine whether a history weight value can be extracted from the web page news.
The determining unit 43 is specifically configured to determine the extracted history weight value as the influence data of the web page news if a history weight value can be extracted from the web page news.
The extraction unit 42 is specifically configured to extract, from the preset web page source library, the source information corresponding to the source information of the web page news if no history weight value can be extracted from the web page news.
In this embodiment, each piece of source information in the preset web page source library is assigned its weight value according to the government authority level of the information source.
It should be noted that, for further description of the functional units of this other data acquisition device, reference may be made to the corresponding description of the method shown in Fig. 2, which is not repeated here. It should be understood that the device in this embodiment can implement everything described in the foregoing method embodiment.
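As a rough illustration of how the units of Fig. 4 fit together, the sketch below models the device as one object that wires the units to each other. It is a structural sketch under assumptions, not the device's actual implementation: the class name, the use of plain callables for units 41 and 44, and the dictionary-based source library are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class DataAcquisitionDevice:
    # Unit 41: returns the source information of a news item.
    acquiring_unit: Callable[[dict], str]
    # Unit 44: returns a history weight value, or None if there is none.
    judging_unit: Callable[[dict], Optional[float]]
    # Preset web page source library: source information -> weight value.
    source_library: Dict[str, float]

    def extraction_unit(self, source_info: str) -> Optional[float]:
        """Unit 42: look up the weight stored for the matching source."""
        return self.source_library.get(source_info)

    def determining_unit(self, news: dict) -> Optional[float]:
        """Unit 43: decide which weight value is the influence data."""
        history = self.judging_unit(news)
        if history is not None:
            return history
        return self.extraction_unit(self.acquiring_unit(news))
```

A device could then be built as `DataAcquisitionDevice(lambda n: n["source"], lambda n: n.get("history_weight"), {"www.gov.example": 0.5})` and queried through `determining_unit`.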
An embodiment of the present invention thus provides another data acquisition device with which the source information of the web page news is acquired first; the source information corresponding to it is then extracted from a preset web page source library, which stores multiple pieces of source information and the weight value corresponding to each; and the weight value corresponding to the extracted source information is finally determined as the influence data of the web page news. Compared with the current practice of using the repost rate and reply rate of web page news as the indicators of its influence, the present invention converts the evaluation of news influence into an evaluation of the website the news comes from. This makes it possible to identify fake news that has a high repost rate and many comments, to assess the influence of online news in real time, and thereby to improve the accuracy of the web page news influence obtained.
The device includes a processor and a memory. The acquiring unit, extraction unit, determining unit, judging unit, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the accuracy of the web page news influence data is improved by adjusting kernel parameters.
The memory may include volatile memory in computer-readable media, in forms such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps: acquiring the source information of web page news; extracting, from a preset web page source library, the source information corresponding to the source information of the web page news, where the preset web page source library stores multiple pieces of source information and the weight value corresponding to each piece of source information; and determining the weight value corresponding to the extracted source information as the influence data of the web page news.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface, and memory.
Memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.
Claims (10)
1. A data acquisition method, characterized by comprising:
obtaining source information of web page news;
extracting, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece of source information;
determining the weight value corresponding to the extracted source information as the influence data for the web page news.
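The three steps of claim 1 can be sketched as a keyed lookup. This is a minimal illustration, not the applicant's implementation: the library contents, the source names, the weight scale, and the function name `influence_from_source` are all hypothetical.

```python
from typing import Optional

# Hypothetical preset web page source library: source information -> weight value.
# The names and numbers below are illustrative assumptions only.
SOURCE_LIBRARY = {
    "National Wire Service": 0.9,
    "City Daily": 0.6,
    "Personal Blog": 0.2,
}


def influence_from_source(source: str, library: dict) -> Optional[float]:
    """Return the weight stored for `source`, or None if the library has no match."""
    # Step 2: extract the library entry corresponding to the news source.
    # Step 3: its weight value is taken as the influence data for the news.
    return library.get(source.strip())
```

Returning `None` for an unknown source is a design choice of this sketch; the claims do not say what happens when the library has no matching entry.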
2. The method according to claim 1, characterized in that the obtaining source information of web page news comprises:
obtaining the web page news;
crawling network-wide data with a crawler, and judging whether there is a web page identical to the web page news;
if not, obtaining the source information from the web page news.
3. The method according to claim 2, characterized in that after the judging whether there is a web page identical to the web page news, the method further comprises:
if so, extracting the web page news of the originating source from the identical web pages;
obtaining the source information from the web page news of the originating source.
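Claims 2 and 3 together describe a duplicate check followed by a choice of which page supplies the source information. The sketch below assumes two things the claims do not specify: that "identical" means equal content hash after whitespace normalization, and that the originating source is the copy with the earliest publication time. The `Page` type and function names are hypothetical, and the crawler is replaced by a list of already-crawled pages.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Page:
    url: str
    source: str          # source information carried by the page
    body: str
    published: datetime


def fingerprint(body: str) -> str:
    """Hash the whitespace-normalized body (assumed notion of 'identical')."""
    normalized = " ".join(body.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def resolve_source(news: Page, crawled: list) -> str:
    """Return the source information to use for `news`."""
    target = fingerprint(news.body)
    identical = [p for p in crawled
                 if p.url != news.url and fingerprint(p.body) == target]
    if not identical:
        # Claim 2: no identical page exists, so use the news page's own source.
        return news.source
    # Claim 3: take the source from the originating copy. Assumption: the
    # earliest-published copy (the news page itself included) is the origin.
    original = min(identical + [news], key=lambda p: p.published)
    return original.source
```

A production system would more likely use near-duplicate detection than exact hashing, since republished articles often differ in boilerplate.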
4. The method according to claim 3, characterized in that before the extracting, from a preset web page source library, source information corresponding to the source information of the web page news, the method further comprises:
judging whether a history weight value can be extracted from the web page news;
if a history weight value can be extracted from the web page news, determining the extracted history weight value as the influence data for the web page news.
5. The method according to claim 4, characterized in that the extracting, from a preset web page source library, source information corresponding to the source information of the web page news comprises:
if no history weight value can be extracted from the web page news, extracting, from the preset web page source library, source information corresponding to the source information of the web page news.
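The ordering in claims 4 and 5 (history weight value first, library lookup only as a fallback) can be sketched as follows. The dictionary layout of a news record, the `history_weight` key, and the function name are assumptions made for illustration.

```python
from typing import Optional


def determine_influence(news: dict, library: dict) -> Optional[float]:
    """Prefer a history weight carried by the news; otherwise consult the library."""
    history = news.get("history_weight")   # hypothetical field name
    if history is not None:
        # Claim 4: an extractable history weight is used directly.
        return history
    # Claim 5: no history weight, so fall back to the preset source library.
    return library.get(news.get("source", ""))
```

The check uses `is not None` rather than truthiness so that a stored history weight of zero is still honored.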
6. The method according to any one of claims 1-5, characterized in that each piece of source information in the preset web page source library is configured with a corresponding weight value according to the weight level indicated by the information source.
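One way to read claim 6 is that each source is assigned a level, and the level determines its weight value. The sketch below builds the library that way; the level names, the numeric weights, and the function name are hypothetical, since the claim does not enumerate them.

```python
# Hypothetical weight levels and the weight value configured for each.
LEVEL_WEIGHTS = {"national": 1.0, "regional": 0.6, "local": 0.3}


def build_source_library(source_levels: dict, level_weights: dict) -> dict:
    """Map each source to the weight value configured for its level."""
    return {name: level_weights[level] for name, level in source_levels.items()}


# Example: two sources assigned to different levels.
library = build_source_library(
    {"National Wire Service": "national", "City Daily": "local"},
    LEVEL_WEIGHTS,
)
```

Keeping levels separate from weights means the numeric scale can be retuned without re-labelling every source.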
7. A data acquisition device, characterized by comprising:
an acquiring unit, configured to obtain source information of web page news;
an extraction unit, configured to extract, from a preset web page source library, source information corresponding to the source information of the web page news, wherein the preset web page source library stores a plurality of pieces of source information and a weight value corresponding to each piece of source information;
a determining unit, configured to determine the weight value corresponding to the extracted source information as the influence data for the web page news.
8. The device according to claim 7, characterized in that the acquiring unit comprises:
an acquisition module, configured to obtain the web page news;
a judging module, configured to crawl network-wide data with a crawler and judge whether there is a web page identical to the web page news;
the acquisition module being further configured to obtain the source information from the web page news if there is no web page identical to the web page news.
9. The device according to claim 8, characterized in that the acquiring unit further comprises an extraction module;
the extraction module being configured to extract the web page news of the originating source from the identical web pages if there is a web page identical to the web page news;
the acquisition module being specifically configured to obtain the source information from the web page news of the originating source.
10. The device according to claim 9, characterized in that the device further comprises a judging unit;
the judging unit being configured to judge whether a history weight value can be extracted from the web page news;
the determining unit being specifically configured to determine the extracted history weight value as the influence data for the web page news if a history weight value can be extracted from the web page news.
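The unit structure of claims 7 to 10 can be pictured as one object whose methods stand in for the units. This is only a structural sketch under assumptions: the class name, method names, and the dictionary shape of a news record are hypothetical, and the acquiring unit is reduced to reading a `source` field instead of crawling.

```python
from typing import Optional


class DataAcquisitionDevice:
    """Toy composition of the acquiring, judging, extraction and determining units."""

    def __init__(self, library: dict):
        self.library = library  # preset web page source library

    def acquire_source(self, news: dict) -> str:
        # Acquiring unit (simplified): read the source field of the news record.
        return news.get("source", "")

    def judge_history(self, news: dict) -> Optional[float]:
        # Judging unit: is a history weight value extractable from the news?
        return news.get("history_weight")

    def extract_weight(self, source: str) -> Optional[float]:
        # Extraction unit: look up the matching entry in the library.
        return self.library.get(source)

    def determine(self, news: dict) -> Optional[float]:
        # Determining unit: history weight if present, else the library weight.
        history = self.judge_history(news)
        if history is not None:
            return history
        return self.extract_weight(self.acquire_source(news))
```

For example, `DataAcquisitionDevice({"City Daily": 0.6}).determine({"source": "City Daily"})` yields the library weight for that source.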
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510752393.6A CN106682007A (en) | 2015-11-06 | 2015-11-06 | Data acquisition method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106682007A (en) | 2017-05-17 |
Family
ID=58863906
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510752393.6A Pending CN106682007A (en) | 2015-11-06 | 2015-11-06 | Data acquisition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106682007A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101409634A (en) * | 2007-10-10 | 2009-04-15 | 中国科学院自动化研究所 | Quantitative analysis tools and method for internet news influence based on information retrieval |
CN101625693A (en) * | 2009-08-10 | 2010-01-13 | 北京精讯云顿数据软件有限公司 | Method and system of online article statistics |
JP2013015973A (en) * | 2011-07-01 | 2013-01-24 | Kddi Corp | Method and program for extracting small group from social network, and naming and visualizing the same |
CN104598477A (en) * | 2013-10-31 | 2015-05-06 | 北大方正集团有限公司 | News transmission effect determining method and system |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170517 |