CN108255891A - A kind of method and device for differentiating type of webpage - Google Patents
- Publication number
- CN108255891A, CN201611270198.0A, CN201611270198A
- Authority
- CN
- China
- Prior art keywords
- webpage
- type
- standard
- ratio
- web pages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9562—Bookmark management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method for determining the type of a webpage, comprising: obtaining page information of a webpage to be judged; extracting title information from the page information; judging whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type; and, if the title information does not contain the preset keyword, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information. The invention can solve the prior-art problem that webpage type classification relies on manual work and is therefore inefficient. The invention also discloses a device for determining the type of a webpage.
Description
Technical field
The present invention relates to the technical field of webpage classification, and in particular to a method and device for determining the type of a webpage.
Background art
With the rapid development of Internet technology, search engines index more and more webpages, and determining the type of a webpage is becoming increasingly important. The webpage type refers to the media attribute of a webpage and can be divided into news, forum, blog, mhkc, question-and-answer and so on. Classifying webpages by type has many applications, for example: 1. Brand exposure analysis: by collecting statistics on the URLs (Uniform Resource Locators) on which a brand is exposed and analyzing the types of the corresponding websites, one can learn in which media types the brand is exposed most, which helps the brand owner choose exposure media in a more targeted way. 2. Brand public opinion analysis: by collecting statistics on public opinion about a brand, one can understand the positive and negative coverage of the brand in different media types, and respond and release information more effectively. 3. Web crawling: by identifying the webpage type, different page parsing logic can be determined in advance, so that page information is extracted more reasonably. At present, webpage type classification still relies mainly on manual work, which is time-consuming and labor-intensive and clearly cannot keep up with the sharp growth in the number of webpages. How to improve the efficiency of webpage type classification is therefore a problem to be solved urgently.
Summary of the invention
In view of the above problems, the present invention provides a method and device for determining the type of a webpage, to solve the prior-art problem that webpage type classification relies on manual work and is inefficient.
The present invention provides a method for determining the type of a webpage, comprising:
obtaining page information of a webpage to be judged;
extracting title information from the page information;
judging whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type;
if the title information does not contain the preset keyword, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information.
Preferably, the method further comprises:
if the title information contains the preset keyword, taking the webpage type corresponding to the preset keyword as the webpage type of the webpage to be judged.
Preferably, obtaining the page information of the webpage to be judged comprises:
parsing the webpage to be judged and extracting the domain name of the link corresponding to the webpage to be judged;
simulating access to the Uniform Resource Locator (URL) corresponding to the domain name and crawling the page information of the webpage to be judged.
Preferably, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information comprises:
obtaining the page information of several webpages serving as reference standards under at least one known webpage type;
extracting tag information serving as the reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and counting the quantity of reference-standard tag information under each known webpage type;
extracting at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
matching each piece of tag information against the reference-standard tag information, and counting the quantity of successfully matched tag information under each known webpage type;
obtaining, for each known webpage type, the ratio of the quantity of successfully matched tag information to the quantity of reference-standard tag information under that known webpage type, and comparing the ratio with a preset ratio;
if the ratio is greater than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
Preferably, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information comprises:
obtaining the title information of several webpages serving as reference standards under at least one known webpage type;
splitting phrases serving as the reference standard out of the title information of the reference-standard webpages, and counting the quantity of reference-standard phrases under each known webpage type;
splitting at least one phrase out of the title information of the webpage to be judged;
matching each phrase against the reference-standard phrases, and counting the quantity of successfully matched phrases under each known webpage type;
obtaining, for each known webpage type, the ratio of the quantity of successfully matched phrases to the quantity of reference-standard phrases under that known webpage type, and comparing the ratio with a preset ratio;
if the ratio is greater than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
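The title-phrase matching described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the helper names `split_phrases` and `phrase_ratio` are hypothetical, the source does not say how a title is split into phrases, and the sketch simply splits on whitespace and a few separator characters (a Chinese title would need a word segmenter instead). It also counts distinct phrases, which is one possible reading of "quantity of phrases".

```python
import re


def split_phrases(title: str) -> set[str]:
    """Split a title into lower-cased phrases.

    Assumption: whitespace and the characters - _ | , : act as separators.
    """
    return {p for p in re.split(r"[\s\-_|,:]+", title.lower()) if p}


def phrase_ratio(title: str, reference_phrases: set[str]) -> float:
    """Matched phrase count divided by the reference-standard phrase count."""
    if not reference_phrases:
        return 0.0
    matched = split_phrases(title) & reference_phrases
    return len(matched) / len(reference_phrases)


# Hypothetical reference phrases for a "news" type, for illustration only.
news_reference = {"breaking", "news", "report", "daily"}
ratio = phrase_ratio("Daily News: markets rally", news_reference)
print(ratio >= 0.5)  # compare against a preset ratio of 0.5
```

A set keeps the matching step to a single intersection; if repeated phrases should count more than once, a `collections.Counter` could be used instead.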
Preferably, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information comprises:
obtaining the page information of several webpages serving as reference standards under at least one known webpage type;
extracting tag information serving as the reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and counting the quantity of reference-standard tag information under each known webpage type;
extracting at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
matching each piece of tag information against the reference-standard tag information, and counting the quantity of successfully matched tag information under each known webpage type;
obtaining, for each known webpage type, a first ratio of the quantity of successfully matched tag information to the quantity of reference-standard tag information under that known webpage type, and comparing the first ratio with a first preset ratio;
if the first ratio is greater than or equal to the first preset ratio, obtaining the title information of several webpages serving as reference standards under the known webpage type corresponding to the first ratio;
splitting phrases serving as the reference standard out of the title information of the reference-standard webpages, and counting the quantity of reference-standard phrases under each known webpage type;
splitting at least one phrase out of the title information of the webpage to be judged;
matching each phrase against the reference-standard phrases, and counting the quantity of successfully matched phrases under each known webpage type;
obtaining, for each known webpage type, a second ratio of the quantity of successfully matched phrases to the quantity of reference-standard phrases under that known webpage type, and comparing the second ratio with a second preset ratio;
if the second ratio is greater than or equal to the second preset ratio, taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
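The two-stage scheme above (tag ratio first, title-phrase ratio second) can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: the function names, the `references` layout and all sample counts are hypothetical, and the ratio here is taken over all reference-standard items of a type, which is one possible reading of the text.

```python
from collections import Counter
from typing import Optional


def ratio(candidate: Counter, reference: Counter) -> float:
    """Matched item count divided by the reference-standard item count."""
    total = sum(reference.values())
    # An item matches at most as many times as it occurs in the reference.
    matched = sum(min(candidate[k], n) for k, n in reference.items())
    return matched / total if total else 0.0


def classify(tags: Counter, phrases: Counter,
             references: dict[str, tuple[Counter, Counter]],
             first_preset: float, second_preset: float) -> Optional[str]:
    """Return the first known type that passes both ratio checks, else None."""
    for page_type, (ref_tags, ref_phrases) in references.items():
        if ratio(tags, ref_tags) < first_preset:
            continue  # first stage failed: skip the title check for this type
        if ratio(phrases, ref_phrases) >= second_preset:
            return page_type
    return None


# Hypothetical reference standards for two known webpage types.
references = {
    "forum": (Counter(div=4, a=6), Counter(reply=1, thread=1)),
    "news": (Counter(meta=12, p=8), Counter(report=1, daily=1)),
}
page_tags = Counter(meta=10, p=8, div=1)
page_phrases = Counter(daily=1, markets=1)
print(classify(page_tags, page_phrases, references, 0.5, 0.5))
```

Running the cheaper tag check first means the title phrases of a known type are only consulted once the page structure already looks similar.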
A device for determining the type of a webpage, comprising:
an acquisition module, configured to obtain page information of a webpage to be judged;
an extraction module, configured to extract title information from the page information;
a judgment module, configured to judge whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type;
a processing module, configured to, when the title information does not contain the preset keyword, obtain the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information.
Preferably, the processing module is further configured to, if the title information contains the preset keyword, take the webpage type corresponding to the preset keyword as the webpage type of the webpage to be judged.
Preferably, the acquisition module comprises:
a parsing unit, configured to parse the webpage to be judged and extract the domain name of the link corresponding to the webpage to be judged;
a simulated access unit, configured to simulate access to the Uniform Resource Locator (URL) corresponding to the domain name and crawl the page information of the webpage to be judged.
Preferably, the processing module comprises:
a first acquisition unit, configured to obtain the page information of several webpages serving as reference standards under at least one known webpage type;
a first statistics unit, configured to extract tag information serving as the reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and to count the quantity of reference-standard tag information under each known webpage type;
a first extraction unit, configured to extract at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
a first matching unit, configured to match each piece of tag information against the reference-standard tag information, and to count the quantity of successfully matched tag information under each known webpage type;
a first comparison unit, configured to obtain, for each known webpage type, the ratio of the quantity of successfully matched tag information to the quantity of reference-standard tag information under that known webpage type, and to compare the ratio with a preset ratio;
a first output unit, configured to, if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
Preferably, the processing module comprises:
a second acquisition unit, configured to obtain the title information of several webpages serving as reference standards under at least one known webpage type;
a second statistics unit, configured to split phrases serving as the reference standard out of the title information of the reference-standard webpages, and to count the quantity of reference-standard phrases under each known webpage type;
a first splitting unit, configured to split at least one phrase out of the title information of the webpage to be judged;
a second matching unit, configured to match each phrase against the reference-standard phrases, and to count the quantity of successfully matched phrases under each known webpage type;
a second comparison unit, configured to obtain, for each known webpage type, the ratio of the quantity of successfully matched phrases to the quantity of reference-standard phrases under that known webpage type, and to compare the ratio with a preset ratio;
a second output unit, configured to, if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
Preferably, the processing module comprises:
a third acquisition unit, configured to obtain the page information of several webpages serving as reference standards under at least one known webpage type;
a third statistics unit, configured to extract tag information serving as the reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and to count the quantity of reference-standard tag information under each known webpage type;
a second extraction unit, configured to extract at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged;
a third matching unit, configured to match each piece of tag information against the reference-standard tag information, and to count the quantity of successfully matched tag information under each known webpage type;
a third comparison unit, configured to obtain, for each known webpage type, a first ratio of the quantity of successfully matched tag information to the quantity of reference-standard tag information under that known webpage type, and to compare the first ratio with a first preset ratio;
a fourth acquisition unit, configured to, if the first ratio is greater than or equal to the first preset ratio, obtain the title information of several webpages serving as reference standards under the known webpage type corresponding to the first ratio;
a fourth statistics unit, configured to split phrases serving as the reference standard out of the title information of the reference-standard webpages, and to count the quantity of reference-standard phrases under each known webpage type;
a second splitting unit, configured to split at least one phrase out of the title information of the webpage to be judged;
a fourth matching unit, configured to match each phrase against the reference-standard phrases, and to count the quantity of successfully matched phrases under each known webpage type;
a fourth comparison unit, configured to obtain, for each known webpage type, a second ratio of the quantity of successfully matched phrases to the quantity of reference-standard phrases under that known webpage type, and to compare the second ratio with a second preset ratio;
a third output unit, configured to, if the second ratio is greater than or equal to the second preset ratio, take the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
With the above technical solution, in the method for determining the type of a webpage provided by the present invention, when a webpage type needs to be judged, the page information of the webpage to be judged is obtained first, and title information is then extracted from the obtained page information. It is then judged whether the title information contains a preset keyword from which the webpage type can be determined directly. When the title information does not contain a preset keyword, the webpage type of the webpage to be judged is obtained from the page structure information corresponding to the page information and/or the title information. Compared with the prior art, which relies on manual work to classify webpage types, the present invention can classify webpage types automatically and improves the efficiency of webpage type classification.
Description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flowchart of method embodiment 1 of the method for determining the type of a webpage disclosed by the present invention;
Fig. 2 shows a flowchart of method embodiment 2 of the method for determining the type of a webpage disclosed by the present invention;
Fig. 3 shows a flowchart of method embodiment 3 of the method for determining the type of a webpage disclosed by the present invention;
Fig. 4 shows a flowchart of method embodiment 4 of the method for determining the type of a webpage disclosed by the present invention;
Fig. 5 shows a structural diagram of device embodiment 1 of the device for determining the type of a webpage disclosed by the present invention;
Fig. 6 shows a structural diagram of device embodiment 2 of the device for determining the type of a webpage disclosed by the present invention;
Fig. 7 shows a structural diagram of device embodiment 3 of the device for determining the type of a webpage disclosed by the present invention;
Fig. 8 shows a structural diagram of device embodiment 4 of the device for determining the type of a webpage disclosed by the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure can be understood more thoroughly and its scope can be conveyed fully to those skilled in the art.
Fig. 1 is a flowchart of method embodiment 1 of the method for determining the type of a webpage disclosed by the present invention. The method may comprise the following steps:
S101: obtain the page information of the webpage to be judged.
When the type of a webpage needs to be judged, for example whether the webpage is a news page or a forum page, the page information of the webpage to be judged is obtained first. The page information of the webpage to be judged includes title information and page structure information.
Specifically, one way to obtain the page information of the webpage to be judged is to parse the webpage to be judged, extract the domain name of the link corresponding to the webpage, and then simulate access to the Uniform Resource Locator (URL) corresponding to the domain name to crawl the page information of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL. Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string between the leading "http://" of the URL and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing can be example.com. When simulating access to the URL corresponding to the domain name and crawling the page information of the webpage to be judged, a Python crawler library can be used for the simulated access, or another programming language can be used; the information in the page is crawled through the simulated access.
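The domain-name definition above can be illustrated with a short Python sketch. The function names `extract_domain` and `fetch_page` are hypothetical. The source only covers URLs that contain a port, so the fallback for URLs without a ":" after the host (cutting at the first "/" or at the end of the string) is an assumption of this sketch. The source names no particular crawler library; the standard-library `urllib.request` is used here merely as one possible stand-in for the simulated access.

```python
from urllib.request import urlopen


def extract_domain(url: str) -> str:
    """Return the text between a leading "http://" and the first ":" after it.

    Assumption: if no ":" follows the host, cut at the first "/" instead,
    or return the whole remainder when neither character is present.
    """
    prefix = "http://"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with http://")
    rest = url[len(prefix):]
    for i, ch in enumerate(rest):
        if ch in ":/":
            return rest[:i]
    return rest


def fetch_page(domain: str, timeout: float = 10.0) -> str:
    """Fetch the page served at the domain (a plain GET, no browser emulation)."""
    with urlopen("http://" + domain, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


print(extract_domain("http://example.com:1234/test.htm"))  # example.com
```

For general-purpose URL handling, `urllib.parse.urlsplit(url).hostname` is more robust than string slicing; the slicing version is shown only because it mirrors the definition given in the text.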
S102: extract title information from the page information.
After the page information of the webpage to be judged has been obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
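One way to extract the title from a crawled HTML page is sketched below using Python's standard `html.parser` module. The class and function names are hypothetical, and the sketch assumes that the "title information" is the text of the page's first `<title>` element, which the source does not state explicitly.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


page = "<html><head><title> Hello - Forum </title></head><body><p>x</p></body></html>"
print(extract_title(page))
```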
S103: judge whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type.
The extracted title information is then checked for a preset keyword from which the webpage type can be determined directly, for example "Tianya" (literally "ends of the earth"), "forum", "news" or "blog".
S104: if the title information does not contain a preset keyword, obtain the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information.
When the extracted title information does not contain a preset keyword, that is, when the webpage type cannot be determined directly from the extracted title information, the webpage to be judged is classified further on the basis of the page structure information of the extracted HTML (HyperText Markup Language) page and/or the title information, to obtain the webpage type of the webpage to be judged. In other words, when the title information does not contain a preset keyword, the webpage to be judged can be classified by the page structure information corresponding to the page information, by the title information in the page information, or by both the page structure information and the title information in the page information.
It should be noted that when the title information contains a preset keyword, the webpage type corresponding to the preset keyword is taken as the webpage type of the webpage to be judged. For example, if the title information is "As a lipstick addict, come try a shade - Entertainment Gossip - Forum", the title information contains the preset keyword "forum", so the type of the webpage to be judged can be determined as forum.
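The keyword check of S103 and the direct classification just described can be sketched in Python as follows. The table `PRESET_KEYWORDS` and the function name `type_from_title` are hypothetical; the source lists example keywords but does not specify how keywords are mapped to types, so a one-to-one mapping and case-insensitive substring matching are assumptions of this sketch.

```python
from typing import Optional

# Hypothetical keyword-to-type table, for illustration only.
PRESET_KEYWORDS = {
    "forum": "forum",
    "news": "news",
    "blog": "blog",
}


def type_from_title(title: str) -> Optional[str]:
    """Return the webpage type named by a preset keyword in the title, if any."""
    lowered = title.lower()
    for keyword, page_type in PRESET_KEYWORDS.items():
        if keyword in lowered:
            return page_type
    return None  # fall back to page-structure and/or title-phrase matching


print(type_from_title("As a lipstick addict, come try a shade - Entertainment Gossip - Forum"))
```

A `None` result corresponds to the S104 branch, where the page structure information and/or the title information is compared against reference standards.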
In summary, in the above embodiment, when a webpage type needs to be judged, the page information of the webpage to be judged is obtained first, and title information is then extracted from the obtained page information. It is then judged whether the title information contains a preset keyword from which the webpage type can be determined directly. When the title information does not contain a preset keyword, the webpage type of the webpage to be judged is obtained from the page structure information corresponding to the page information and/or the title information. Compared with the prior art, which relies on manual work to classify webpage types, the present invention can classify webpage types automatically and improves the efficiency of webpage type classification.
Fig. 2 is a flowchart of method embodiment 2 of the method for determining the type of a webpage disclosed by the present invention. The method may comprise the following steps:
S201: obtain the page information of the webpage to be judged.
When the type of a webpage needs to be judged, for example whether the webpage is a news page or a forum page, the page information of the webpage to be judged is obtained first. The page information of the webpage to be judged includes title information and page structure information.
Specifically, one way to obtain the page information of the webpage to be judged is to parse the webpage to be judged, extract the domain name of the link corresponding to the webpage, and then simulate access to the Uniform Resource Locator (URL) corresponding to the domain name to crawl the page information of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL. Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string between the leading "http://" of the URL and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing can be example.com. When simulating access to the URL corresponding to the domain name and crawling the page information of the webpage to be judged, a Python crawler library can be used for the simulated access, or another programming language can be used; the information in the page is crawled through the simulated access.
S202: extract title information from the page information.
After the page information of the webpage to be judged has been obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
S203: judge whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type.
The extracted title information is then checked for a preset keyword from which the webpage type can be determined directly, for example "Tianya" (literally "ends of the earth"), "forum", "news" or "blog".
S204: if the title information does not contain a preset keyword, obtain the page information of several webpages serving as reference standards under at least one known webpage type.
When the title information does not contain a preset keyword, that is, when the webpage type cannot be determined directly from the title information of the webpage to be judged, webpages of at least one known webpage type are obtained first, together with the page information of the webpages of each known webpage type, and the obtained page information is used as the reference standard.
S205: extract tag information serving as the reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and count the quantity of reference-standard tag information under each known webpage type.
After the page information of several reference-standard webpages under at least one known webpage type has been obtained, the tag information serving as the reference standard is extracted from the page structure information corresponding to the page information of those webpages, since each piece of page structure information contains multiple pieces of tag information. Taking a webpage of one known webpage type as an example, the tag information it contains is "meta", "link", "span", "a" and "p"; counting the reference-standard tag information gives 12 "meta", 3 "link", 5 "span", 3 "a" and 3 "p".
S206: extract at least one piece of tag information from the page structure information corresponding to the page information of the webpage to be judged.
At the same time, at least one piece of tag information used to determine the page type is extracted from the page structure information corresponding to the page information of the webpage to be judged, for example "meta" and "div".
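The tag counting in S205 and the tag extraction in S206 can both be sketched with Python's standard `html.parser` module. The names `TagCounter` and `count_tags` are hypothetical, and counting every opening tag by name is an assumption about what "tag information" covers.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each tag name opens in an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.counts: Counter = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta/> also reach this method.
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = (
    '<html><head><meta charset="utf-8"><meta name="a" content="b">'
    '<link rel="x" href="y"></head>'
    '<body><p>hi <a href="#">z</a></p></body></html>'
)
print(count_tags(sample))
```

Summing the counters of several reference pages of one known type (`Counter` supports `+`) would give per-type totals of the kind quoted in the example above.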
S207: match each piece of tag information against the reference-standard tag information, and count the quantity of successfully matched tag information under each known webpage type.
Each piece of tag information of the webpage to be judged is matched against the reference-standard tag information. Taking the above example, matching shows that the tag information "meta" of the webpage to be judged matches the reference-standard tag information "meta". The quantity of successfully matched tag information under each known webpage type is then counted; in this example the count gives 10 "meta".
S208: obtain, for each known webpage type, the ratio of the quantity of successfully matched tag information to the quantity of reference-standard tag information under that known webpage type, and compare the ratio with a preset ratio.
The ratio of the quantity of successfully matched tag information to the quantity of reference-standard tag information is then obtained for each known webpage type. Taking the above example, the ratio for the tag information "meta" is 5/6 (10 matched out of 12 in the reference standard). The obtained ratio is then compared with the preset ratio, which is set flexibly according to actual needs; of course, the higher the preset ratio is set, the more accurate the webpage type determined from the obtained ratio.
S209: if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
When the obtained ratio is greater than or equal to the preset ratio, the known webpage type corresponding to the ratio is taken as the webpage type of the webpage to be judged. Taking the above example, the webpage type corresponding to the tag information "meta", "link", "span", "a" and "p" is taken as the webpage type of the webpage to be judged.
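Steps S207 to S209 can be sketched in Python as follows, using the tag counts from the example. The source quotes a per-tag ratio (5/6 for "meta") but does not say how ratios for several tags are combined, so this sketch makes an assumption: `match_ratio` can be restricted to a chosen list of tags, and by default it is computed over all reference-standard tags. The function names, and the choice of returning the best-scoring type among those that reach the preset ratio, are likewise hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Iterable, Optional


def match_ratio(candidate: Counter, reference: Counter,
                tags: Optional[Iterable[str]] = None) -> Fraction:
    """Matched tag count over reference tag count, for the given tags."""
    names = list(reference) if tags is None else list(tags)
    # A tag matches at most as many times as it occurs in the reference.
    matched = sum(min(candidate[t], reference[t]) for t in names)
    total = sum(reference[t] for t in names)
    return Fraction(matched, total) if total else Fraction(0)


def pick_type(candidate: Counter, references: dict[str, Counter],
              preset: Fraction) -> Optional[str]:
    """Return the known type with the highest ratio that reaches the preset."""
    best_type, best_ratio = None, preset
    for page_type, reference in references.items():
        r = match_ratio(candidate, reference)
        if r >= best_ratio:
            best_type, best_ratio = page_type, r
    return best_type


reference = Counter(meta=12, link=3, span=5, a=3, p=3)  # counts from S205
candidate = Counter(meta=10, div=4)                      # tags from S206
print(match_ratio(candidate, reference, ["meta"]))       # per-tag ratio
print(match_ratio(candidate, reference))                 # ratio over all tags
```

`Fraction` keeps the ratios exact, so a value such as 10/12 compares cleanly with a preset ratio without floating-point rounding.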
Fig. 3 is a flowchart of method embodiment 3 of the method for determining the type of a webpage disclosed by the present invention. The method may comprise the following steps:
S301, the page info for obtaining webpage to be judged;
When needing to judge the type of webpage belonging to webpage, for example, judging whether webpage belongs to news category webpage
Or forum's class webpage etc..First, the page info of webpage to be judged is obtained, wherein, the page info of webpage to be judged includes
Heading message and page structure information.
Specifically, one way to obtain the page info of the webpage to be judged is to parse the webpage to be judged, extract the domain name of its link, and then simulate access to the uniform resource locator (URL) corresponding to the domain name to crawl the page info of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL (Uniform Resource Locator). Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string in the URL between the leading "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. To simulate access to the URL corresponding to the domain name and crawl the page info of the webpage to be judged, a Python crawler library or another programming language can be used; the simulated access crawls the information in the page.
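The domain extraction and simulated access described above can be sketched in Python. This is a minimal sketch and not the patent's implementation: it swaps the literal "between http:// and the first colon" rule for the standard-library urllib.parse.urlsplit, so that URLs without an explicit port are also handled. The function names extract_domain and fetch_page, the timeout value, and the UTF-8 assumption are all hypothetical.

```python
from urllib.parse import urlsplit
from urllib.request import urlopen


def extract_domain(url: str) -> str:
    """Return the host name of a URL, without scheme, port, or path."""
    host = urlsplit(url).hostname  # drops the port and lowercases the host
    if host is None:
        raise ValueError(f"no host found in URL: {url!r}")
    return host


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Fetch a page and return its body as text (hypothetical helper).

    Assumes the page is UTF-8 encoded; undecodable bytes are replaced.
    """
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


print(extract_domain("http://example.com:1234/test.htm"))  # example.com
```

A real crawler would usually also set request headers and respect the character set declared by the page; those details are left out here.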
S302, extract the heading message from the page info;
After the page info of the webpage to be judged is obtained, the heading message is extracted from the crawled HTML (HyperText Markup Language) page.
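One possible way to extract the heading message from crawled HTML is sketched below, assuming that the heading message corresponds to the text of the page's title element (the text does not name the element). The class name TitleParser and the function name extract_title are hypothetical; only the standard-library html.parser module is used.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside the <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_title:
            self.title_parts.append(data)


def extract_title(html: str) -> str:
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.title_parts).strip()


page = "<html><head><title> Lipstick forum </title></head><body><p>x</p></body></html>"
print(extract_title(page))  # Lipstick forum
```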
S303, judge whether the heading message contains a preset keyword, the preset keyword being a keyword that contains a type of webpage;
The extracted heading message is then examined to judge whether it contains a preset keyword that directly determines the type of webpage, for example "ends of the earth", "forum", "news", or "blog".
S304, if the heading message does not contain a preset keyword, obtain the heading messages of several webpages serving as the reference standard under at least one known webpage type;
When the heading message does not contain a preset keyword, that is, when the type of webpage cannot be judged directly from the heading message of the webpage to be judged, webpages of at least one known webpage type are obtained first, together with the heading message in the page info of each such webpage; the obtained heading messages serve as the reference standard.
S305, split the heading messages of the reference-standard webpages into phrases serving as the reference standard, and count the number of reference-standard phrases under each known webpage type;
After the heading messages of several reference-standard webpages under at least one known webpage type are obtained, they are split into phrases that serve as the reference standard. For example, take a webpage of one known webpage type whose heading message is "As a lipstick fan, choose the lipstick brand and try the lipstick color". Segmenting the title yields the reference-standard phrases "lipstick", "brand", and "color"; counting each phrase gives 3 for "lipstick", 1 for "brand", and 1 for "color".
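The counting step above can be sketched with collections.Counter. The sketch assumes the title has already been segmented into tokens (for Chinese text this would need a word segmenter, which is outside this sketch) and that a vocabulary of reference-standard phrases is given; how that vocabulary is chosen is not stated in the text, so REFERENCE_PHRASES and count_reference_phrases are hypothetical names.

```python
from collections import Counter

# Hypothetical reference-standard vocabulary for one known webpage type.
REFERENCE_PHRASES = {"lipstick", "brand", "color"}


def count_reference_phrases(segmented_titles, vocabulary=REFERENCE_PHRASES):
    """Count how often each reference-standard phrase occurs in the titles."""
    counts = Counter()
    for tokens in segmented_titles:
        counts.update(token for token in tokens if token in vocabulary)
    return counts


# Pre-segmented tokens of the example title (segmentation is assumed).
title_tokens = ["as", "a", "lipstick", "fan", "choose", "the", "lipstick",
                "brand", "and", "try", "the", "lipstick", "color"]

counts = count_reference_phrases([title_tokens])
print(counts["lipstick"], counts["brand"], counts["color"])  # 3 1 1
```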
S306, split at least one phrase out of the heading message of the webpage to be judged;
Meanwhile, at least one phrase is split out of the heading message of the webpage to be judged. For example, if the heading message of the webpage to be judged is "When choosing a lipstick, there are many types of lipstick", the split phrases include "lipstick" and "type".
S307, match each phrase against the reference-standard phrases, and count the number of successfully matched phrases under each known webpage type;
Each phrase split from the heading message of the webpage to be judged is matched against the reference-standard phrases. In the above example, matching finds that "lipstick" in the heading message of the webpage to be judged matches the reference-standard phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here the count is 2 for "lipstick".
S308, obtain, for each known webpage type, the ratio of the number of successfully matched phrases to the number of reference-standard phrases under that type, and compare the ratio with the default ratio;
Then, for each known webpage type, the ratio of the number of successfully matched phrases to the number of reference-standard phrases under that type is obtained; in the above example, the ratio for the phrase "lipstick" is 2/3. The obtained ratio is then compared with the default ratio, where the default ratio is set flexibly according to actual requirements; the closer the default ratio is to the obtained ratio, the more accurate the determined type of webpage.
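The matching and ratio steps can be sketched as follows. Two points are assumptions rather than statements from the text: a match is counted as the per-item minimum of the two counts, and the ratio is taken over the matched items only, which reproduces the worked values (2/3 for "lipstick", and 5/6 for the "meta" label information used in the label-based examples). The threshold value DEFAULT_RATIO and the function name match_ratio are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def match_ratio(candidate: Counter, reference: Counter) -> Fraction:
    """Ratio of matched occurrences to reference occurrences of matched items."""
    matched = candidate & reference  # per-item minimum of the two counts
    if not matched:
        return Fraction(0)
    reference_total = sum(reference[item] for item in matched)
    return Fraction(sum(matched.values()), reference_total)


reference = Counter({"lipstick": 3, "brand": 1, "color": 1})
candidate = Counter({"lipstick": 2, "type": 1})
DEFAULT_RATIO = Fraction(1, 2)  # hypothetical threshold, set per actual needs

ratio = match_ratio(candidate, reference)
print(ratio, ratio >= DEFAULT_RATIO)  # 2/3 True
```

Using Fraction keeps the comparison exact; a float threshold would work equally well if exactness is not needed.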
S309, if the ratio is greater than or equal to the default ratio, take the known webpage type corresponding to the ratio as the type of webpage of the webpage to be judged.
When the obtained ratio is greater than or equal to the default ratio, the known webpage type corresponding to the ratio is taken as the type of webpage of the webpage to be judged. In the above example, the type of webpage corresponding to the heading message "As a lipstick fan, choose the lipstick brand and try the lipstick color" is taken as the type of webpage of the webpage to be judged.
As shown in figure 4, which is a method flow diagram of method embodiment 4 for differentiating the type of webpage disclosed by the present invention, the method can comprise the following steps:
S401, obtain the page info of the webpage to be judged;
When the type of webpage to which a webpage belongs needs to be judged, for example whether the webpage is a news webpage or a forum webpage, the page info of the webpage to be judged is obtained first. The page info of the webpage to be judged includes the heading message and the page structure information.
Specifically, one way to obtain the page info of the webpage to be judged is to parse the webpage to be judged, extract the domain name of its link, and then simulate access to the uniform resource locator (URL) corresponding to the domain name to crawl the page info of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL (Uniform Resource Locator). Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string in the URL between the leading "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. To simulate access to the URL corresponding to the domain name and crawl the page info of the webpage to be judged, a Python crawler library or another programming language can be used; the simulated access crawls the information in the page.
S402, extract the heading message from the page info;
After the page info of the webpage to be judged is obtained, the heading message is extracted from the crawled HTML (HyperText Markup Language) page.
S403, judge whether the heading message contains a preset keyword, the preset keyword being a keyword that contains a type of webpage;
The extracted heading message is then examined to judge whether it contains a preset keyword that directly determines the type of webpage, for example "ends of the earth", "forum", "news", or "blog".
S404, if the heading message does not contain a preset keyword, obtain the page info of several webpages serving as the reference standard under at least one known webpage type;
When the heading message does not contain a preset keyword, that is, when the type of webpage cannot be judged directly from the heading message of the webpage to be judged, webpages of at least one known webpage type are obtained first, together with the page info of each such webpage; the obtained page info serves as the reference standard.
S405, extract label information serving as the reference standard from the page structure information corresponding to the page info of the reference-standard webpages, and count the number of pieces of reference-standard label information under each known webpage type;
After the page info of several reference-standard webpages under at least one known webpage type is obtained, label information serving as the reference standard is extracted from the page structure information corresponding to that page info; each page structure contains multiple pieces of label information. For example, take a webpage of one known webpage type whose label information includes "meta", "link", "span", "a", and "p". Counting the reference-standard label information gives 12 for "meta", 3 for "link", 5 for "span", 3 for "a", and 3 for "p".
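Counting label information can be sketched with the standard-library html.parser module, assuming that "label information" refers to HTML tag names and that each opening tag counts once. The names TagCounter and count_tags, and the sample page, are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how many times each tag name is opened in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags such as <meta ... />.
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = (
    '<html><head><meta charset="utf-8"><meta name="a" content="b">'
    '<link rel="stylesheet" href="s.css"></head>'
    '<body><p><span>x</span><a href="#">y</a></p></body></html>'
)
tags = count_tags(sample)
print(tags["meta"], tags["link"], tags["span"], tags["a"], tags["p"])  # 2 1 1 1 1
```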
S406, extract at least one piece of label information from the page structure information corresponding to the page info of the webpage to be judged;
Meanwhile, at least one piece of label information used to determine the page type is extracted from the page structure information corresponding to the page info of the webpage to be judged, for example "meta" and "div".
S407, match each piece of label information against the reference-standard label information, and count the number of successfully matched pieces of label information under each known webpage type;
Each piece of label information of the webpage to be judged is matched against the reference-standard label information. In the above example, matching finds that the label information "meta" of the webpage to be judged matches the reference-standard label information "meta". The number of successfully matched pieces of label information under each known webpage type is then counted; here the count is 10 for "meta".
S408, obtain, for each known webpage type, a first ratio of the number of successfully matched pieces of label information to the number of pieces of reference-standard label information under that type, and compare the first ratio with a first default ratio;
Then, for each known webpage type, the first ratio of the number of successfully matched pieces of label information to the number of pieces of reference-standard label information under that type is obtained; in the above example, the ratio for the label information "meta" is 10/12 = 5/6. The obtained first ratio is then compared with the first default ratio, where the first default ratio is set flexibly according to actual requirements; the closer the first default ratio is to the obtained ratio, the more accurate the determined type of webpage.
S409, if the first ratio is greater than or equal to the first default ratio, obtain the heading messages of several webpages serving as the reference standard under the known webpage type corresponding to the first ratio;
When the first ratio is greater than or equal to the first default ratio, webpages of the known webpage type corresponding to the first ratio are further obtained, together with the heading message in the page info of each such webpage; the obtained heading messages serve as the reference standard. It should be noted that there may be multiple webpages of the known webpage type corresponding to the first ratio.
S410, split the heading messages of the reference-standard webpages into phrases serving as the reference standard, and count the number of reference-standard phrases under each known webpage type;
After the heading messages of several reference-standard webpages under at least one known webpage type are obtained, they are split into phrases that serve as the reference standard. For example, take a webpage of one known webpage type whose heading message is "As a lipstick fan, choose the lipstick brand and try the lipstick color". Segmenting the title yields the reference-standard phrases "lipstick", "brand", and "color"; counting each phrase gives 3 for "lipstick", 1 for "brand", and 1 for "color".
S411, split at least one phrase out of the heading message of the webpage to be judged;
Meanwhile, at least one phrase is split out of the heading message of the webpage to be judged. For example, if the heading message of the webpage to be judged is "When choosing a lipstick, there are many types of lipstick", the split phrases include "lipstick" and "type".
S412, match each phrase against the reference-standard phrases, and count the number of successfully matched phrases under each known webpage type;
Each phrase split from the heading message of the webpage to be judged is matched against the reference-standard phrases. In the above example, matching finds that "lipstick" in the heading message of the webpage to be judged matches the reference-standard phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here the count is 2 for "lipstick".
S413, obtain, for each known webpage type, a second ratio of the number of successfully matched phrases to the number of reference-standard phrases under that type, and compare the second ratio with a second default ratio;
Then, for each known webpage type, the second ratio of the number of successfully matched phrases to the number of reference-standard phrases under that type is obtained; in the above example, the ratio for the phrase "lipstick" is 2/3. The obtained second ratio is then compared with the second default ratio, where the second default ratio is set flexibly according to actual requirements; the closer the second default ratio is to the obtained second ratio, the more accurate the determined type of webpage.
S414, if the second ratio is greater than or equal to the second default ratio, take the known webpage type corresponding to the second ratio as the type of webpage of the webpage to be judged.
When the obtained second ratio is greater than or equal to the second default ratio, the known webpage type corresponding to the second ratio is taken as the type of webpage of the webpage to be judged. In the above example, the type of webpage corresponding to the heading message "As a lipstick fan, choose the lipstick brand and try the lipstick color" is taken as the type of webpage of the webpage to be judged.
It should be noted that, in the above embodiment, if the second ratio is less than the second default ratio, the known webpage type corresponding to the first ratio can be taken as the type of webpage of the webpage to be judged.
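The two-stage decision of this embodiment, including the fallback just noted, can be sketched as a small function. The ratios are taken as already computed. The case where the first ratio falls below the first default ratio is not covered by the text, so returning None there is an assumption; the function name decide_type, the type labels, and the threshold values are hypothetical.

```python
from fractions import Fraction
from typing import Optional


def decide_type(first_ratio: Fraction, first_type: str,
                second_ratio: Fraction, second_type: str,
                first_default: Fraction,
                second_default: Fraction) -> Optional[str]:
    """Combine the label-based (first) and phrase-based (second) checks."""
    if first_ratio < first_default:
        # Assumption: the text does not say what happens in this case.
        return None
    if second_ratio >= second_default:
        return second_type  # phrase match confirms a type
    return first_type  # fallback: keep the type found by label matching


# Worked values from the examples: 10/12 for "meta", 2/3 for "lipstick".
result = decide_type(Fraction(10, 12), "type_by_labels",
                     Fraction(2, 3), "type_by_phrases",
                     first_default=Fraction(1, 2),
                     second_default=Fraction(1, 2))
print(result)  # type_by_phrases
```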
As shown in figure 5, which is a structure diagram of device embodiment 1 for differentiating the type of webpage disclosed by the present invention, the device can include:
Acquisition module 501, configured to obtain the page info of the webpage to be judged;
When the type of webpage to which a webpage belongs needs to be judged, for example whether the webpage is a news webpage or a forum webpage, the page info of the webpage to be judged is obtained first. The page info of the webpage to be judged includes the heading message and the page structure information.
Specifically, one way to obtain the page info of the webpage to be judged is to parse the webpage to be judged, extract the domain name of its link, and then simulate access to the uniform resource locator (URL) corresponding to the domain name to crawl the page info of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL (Uniform Resource Locator). Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string in the URL between the leading "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. To simulate access to the URL corresponding to the domain name and crawl the page info of the webpage to be judged, a Python crawler library or another programming language can be used; the simulated access crawls the information in the page.
Extraction module 502, configured to extract the heading message from the page info;
After the page info of the webpage to be judged is obtained, the heading message is extracted from the crawled HTML (HyperText Markup Language) page.
Judgment module 503, configured to judge whether the heading message contains a preset keyword, the preset keyword being a keyword that contains a type of webpage;
The extracted heading message is then examined to judge whether it contains a preset keyword that directly determines the type of webpage, for example "ends of the earth", "forum", "news", or "blog".
Processing module 504, configured to obtain, if the heading message does not contain a preset keyword, the type of webpage of the webpage to be judged based on the page structure information corresponding to the page info and/or the heading message.
When it is judged that the extracted heading message does not contain a preset keyword, that is, when the type of webpage cannot be determined directly from the extracted heading message, the webpage to be judged is further classified based on the page structure information of the extracted HTML (HyperText Markup Language) page and/or the heading message, to obtain the type of webpage of the webpage to be judged. In other words, when the heading message does not contain a preset keyword, the webpage to be judged can be classified by the page structure information corresponding to the page info, by the heading message in the page info, or by both the page structure information and the heading message in the page info.
It should be noted that when the heading message contains a preset keyword, the type of webpage corresponding to the preset keyword is taken as the type of webpage of the webpage to be judged. For example, if the heading message is "As a lipstick fan, come and try a color - Entertainment Gossip - Forum", the heading message contains the preset keyword "forum", so the type of the webpage to be judged can be determined as forum.
The device for differentiating the type of webpage includes a processor and a memory. The acquisition module, extraction module, judgment module, processing module, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels can be provided, and the problem of low webpage classification efficiency is addressed by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
In conclusion in the above-described embodiments, when needing to judge type of webpage, webpage to be judged is obtained first
Page info, heading message is then extracted from the page info got, then further judge be in heading message
It is no to include the preset keyword that directly judge type of webpage, when not including preset keyword in heading message, pass through page
Page structure information and/or heading message corresponding to the information of face obtain the type of webpage of webpage to be judged.Relative to existing skill
The classification that manual type carries out type of webpage is relied in art, the present invention can realize the classification of type of webpage, improve net automatically
The efficiency of page classification of type.
As shown in fig. 6, which is a structure diagram of device embodiment 2 for differentiating the type of webpage disclosed by the present invention, the device can include:
Acquisition module 601, configured to obtain the page info of the webpage to be judged;
When the type of webpage to which a webpage belongs needs to be judged, for example whether the webpage is a news webpage or a forum webpage, the page info of the webpage to be judged is obtained first. The page info of the webpage to be judged includes the heading message and the page structure information.
Specifically, one way to obtain the page info of the webpage to be judged is to parse the webpage to be judged, extract the domain name of its link, and then simulate access to the uniform resource locator (URL) corresponding to the domain name to crawl the page info of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL (Uniform Resource Locator). Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string in the URL between the leading "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. To simulate access to the URL corresponding to the domain name and crawl the page info of the webpage to be judged, a Python crawler library or another programming language can be used; the simulated access crawls the information in the page.
Extraction module 602, configured to extract the heading message from the page info;
After the page info of the webpage to be judged is obtained, the heading message is extracted from the crawled HTML (HyperText Markup Language) page.
Judgment module 603, configured to judge whether the heading message contains a preset keyword, the preset keyword being a keyword that contains a type of webpage;
The extracted heading message is then examined to judge whether it contains a preset keyword that directly determines the type of webpage, for example "ends of the earth", "forum", "news", or "blog".
First acquisition unit 604, configured to obtain, if the heading message does not contain a preset keyword, the page info of several webpages serving as the reference standard under at least one known webpage type;
When the heading message does not contain a preset keyword, that is, when the type of webpage cannot be judged directly from the heading message of the webpage to be judged, webpages of at least one known webpage type are obtained first, together with the page info of each such webpage; the obtained page info serves as the reference standard.
First statistic unit 605, configured to extract label information serving as the reference standard from the page structure information corresponding to the page info of the reference-standard webpages, and to count the number of pieces of reference-standard label information under each known webpage type;
After the page info of several reference-standard webpages under at least one known webpage type is obtained, label information serving as the reference standard is extracted from the page structure information corresponding to that page info; each page structure contains multiple pieces of label information. For example, take a webpage of one known webpage type whose label information includes "meta", "link", "span", "a", and "p". Counting the reference-standard label information gives 12 for "meta", 3 for "link", 5 for "span", 3 for "a", and 3 for "p".
First extraction unit 606, configured to extract at least one piece of label information from the page structure information corresponding to the page info of the webpage to be judged;
Meanwhile, at least one piece of label information used to determine the page type is extracted from the page structure information corresponding to the page info of the webpage to be judged, for example "meta" and "div".
First matching unit 607, configured to match each piece of label information against the reference-standard label information, and to count the number of successfully matched pieces of label information under each known webpage type;
Each piece of label information of the webpage to be judged is matched against the reference-standard label information. In the above example, matching finds that the label information "meta" of the webpage to be judged matches the reference-standard label information "meta". The number of successfully matched pieces of label information under each known webpage type is then counted; here the count is 10 for "meta".
First comparing unit 608, configured to obtain, for each known webpage type, the ratio of the number of successfully matched pieces of label information to the number of pieces of reference-standard label information under that type, and to compare the ratio with the default ratio;
Then, for each known webpage type, the ratio of the number of successfully matched pieces of label information to the number of pieces of reference-standard label information under that type is obtained; in the above example, the ratio for the label information "meta" is 10/12 = 5/6. The obtained ratio is then compared with the default ratio, where the default ratio is set flexibly according to actual requirements; the closer the default ratio is to the obtained ratio, the more accurate the determined type of webpage.
First output unit 609, configured to take, if the ratio is greater than or equal to the default ratio, the known webpage type corresponding to the ratio as the type of webpage of the webpage to be judged.
When the obtained ratio is greater than or equal to the default ratio, the known webpage type corresponding to the ratio is taken as the type of webpage of the webpage to be judged. In the above example, the type of webpage corresponding to the webpage containing the label information "meta", "link", "span", "a", and "p" is taken as the type of webpage of the webpage to be judged.
As shown in fig. 7, which is a structure diagram of device embodiment 3 for differentiating the type of webpage disclosed by the present invention, the device can include:
Acquisition module 701, configured to obtain the page info of the webpage to be judged;
When the type of webpage to which a webpage belongs needs to be judged, for example whether the webpage is a news webpage or a forum webpage, the page info of the webpage to be judged is obtained first. The page info of the webpage to be judged includes the heading message and the page structure information.
Specifically, one way to obtain the page info of the webpage to be judged is to parse the webpage to be judged, extract the domain name of its link, and then simulate access to the uniform resource locator (URL) corresponding to the domain name to crawl the page info of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL (Uniform Resource Locator). Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string in the URL between the leading "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing is example.com. To simulate access to the URL corresponding to the domain name and crawl the page info of the webpage to be judged, a Python crawler library or another programming language can be used; the simulated access crawls the information in the page.
Extraction module 702, configured to extract the heading message from the page info;
After the page info of the webpage to be judged is obtained, the heading message is extracted from the crawled HTML (HyperText Markup Language) page.
Judgment module 703, configured to judge whether the heading message contains a preset keyword, the preset keyword being a keyword that contains a type of webpage;
The extracted heading message is then examined to judge whether it contains a preset keyword that directly determines the type of webpage, for example "ends of the earth", "forum", "news", or "blog".
Second acquisition unit 704, configured to obtain, if the heading message does not contain a preset keyword, the heading messages of several webpages serving as the reference standard under at least one known webpage type;
When the heading message does not contain a preset keyword, that is, when the type of webpage cannot be judged directly from the heading message of the webpage to be judged, webpages of at least one known webpage type are obtained first, together with the heading message in the page info of each such webpage; the obtained heading messages serve as the reference standard.
Second statistics unit 705, configured to split phrases serving as reference standards out of the title information of the reference-standard webpages, and to count the number of reference-standard phrases under each known webpage type;
After the title information of several reference-standard webpages under at least one known webpage type is obtained, the title information is split into phrases, yielding the phrases that serve as the reference standard. For example, take a webpage of one known webpage type whose title information is "As a lipstick fan, choose the brand of lipstick and try out the color of lipstick". The title is segmented into the reference-standard phrases "lipstick", "brand", and "color", and the number of each reference-standard phrase is counted: "lipstick" occurs 3 times, "brand" once, and "color" once.
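The phrase splitting and counting can be sketched as below. This is only an illustration on an English rendering of the example title: the function `split_phrases` and the fixed `vocabulary` of candidate phrases are assumptions, and a real implementation for Chinese titles would need a word segmenter, which the source does not name.

```python
import re
from collections import Counter
from typing import Set


def split_phrases(title: str, vocabulary: Set[str]) -> Counter:
    """Split a title into words and count those that are in the vocabulary."""
    words = re.findall(r"\w+", title.lower())
    return Counter(word for word in words if word in vocabulary)


# English rendering of the reference title from the example above.
reference_title = (
    "As a lipstick fan, choose the brand of lipstick "
    "and try out the color of lipstick"
)
reference_counts = split_phrases(reference_title, {"lipstick", "brand", "color"})
```

With this input, `reference_counts` holds 3 for "lipstick" and 1 each for "brand" and "color", matching the counts in the example.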
First splitting unit 706, configured to split at least one phrase out of the title information of the webpage to be judged;
At the same time, at least one phrase is split out of the title information of the webpage to be judged. For example, if the title information of the webpage to be judged is "When choosing a lipstick, there are many types of lipstick", the phrases split out include "lipstick" and "type".
Second matching unit 707, configured to match each phrase against the reference-standard phrases, and to count the number of successfully matched phrases under each known webpage type;
Each phrase split out of the title information of the webpage to be judged is matched against the reference-standard phrases. In the above example, matching finds that "lipstick" in the title information of the webpage to be judged matches the reference-standard phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here the count is 2 occurrences of "lipstick".
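The matching step can be sketched as follows, under the assumption that a phrase matches when it is identical to a reference-standard phrase and that every occurrence in the title to be judged is counted. The function name `count_matches` is hypothetical.

```python
from collections import Counter


def count_matches(judged: Counter, reference: Counter) -> Counter:
    """Keep the counts of judged phrases that also occur in the reference."""
    return Counter(
        {phrase: n for phrase, n in judged.items() if phrase in reference}
    )


# Phrase counts taken from the lipstick example.
judged_counts = Counter({"lipstick": 2, "type": 1})
reference_counts = Counter({"lipstick": 3, "brand": 1, "color": 1})
matched_counts = count_matches(judged_counts, reference_counts)
```

Here only "lipstick" matches, with a count of 2; "type" has no counterpart among the reference-standard phrases.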
Second comparing unit 708, configured to obtain, for each known webpage type, the ratio of the number of successfully matched phrases to the number of reference-standard phrases under that known webpage type, and to compare the ratio with a preset ratio;
The ratio of the number of successfully matched phrases under each known webpage type to the number of reference-standard phrases under that known webpage type is then obtained. In the above example, the ratio for the phrase "lipstick" is 2/3. The obtained ratio is then compared with the preset ratio, where the preset ratio can be set flexibly according to actual requirements. Naturally, the closer the preset ratio is to the obtained ratio, the more accurate the determined webpage type.
Second output unit 709, configured to, if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
When the obtained ratio is greater than or equal to the preset ratio, the known webpage type corresponding to the ratio is taken as the webpage type of the webpage to be judged. In the above example, the webpage type of the reference webpage whose title information is "As a lipstick fan, choose the brand of lipstick and try out the color of lipstick" is taken as the webpage type of the webpage to be judged.
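The ratio comparison and output steps can be sketched together. The function `classify_by_ratio` and the type names "beauty" and "news" are hypothetical. Picking the type with the highest qualifying ratio when several types pass is an assumption of this sketch; the source only states that a type whose ratio reaches the preset ratio is output.

```python
from fractions import Fraction
from typing import Dict, Optional


def classify_by_ratio(
    matched: Dict[str, int],
    reference: Dict[str, int],
    preset_ratio: Fraction,
) -> Optional[str]:
    """Return the known type whose match ratio meets the preset ratio, if any."""
    best = None  # (page_type, ratio) of the best qualifying type so far
    for page_type, reference_total in reference.items():
        if reference_total == 0:
            continue  # no reference phrases, so no ratio can be formed
        ratio = Fraction(matched.get(page_type, 0), reference_total)
        if ratio >= preset_ratio and (best is None or ratio > best[1]):
            best = (page_type, ratio)
    return best[0] if best else None


# Counts from the lipstick example: 2 matched out of 3 reference occurrences.
matched = {"beauty": 2, "news": 0}
reference = {"beauty": 3, "news": 4}
result = classify_by_ratio(matched, reference, Fraction(1, 2))
```

Exact fractions are used so that a comparison such as 2/3 against a preset 2/3 is not affected by floating-point rounding.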
Figure 8 is a schematic structural diagram of Embodiment 4 of a device for judging a webpage type disclosed by the present invention. The device may include:
Acquiring unit 801, configured to obtain the page information of the webpage to be judged;
This applies when the webpage type to which a webpage belongs needs to be judged, for example, when judging whether the webpage is a news webpage or a forum webpage. First, the page information of the webpage to be judged is obtained, where the page information includes title information and page structure information.
Specifically, one way of obtaining the page information of the webpage to be judged is to parse the webpage to be judged, extract the domain name of the link corresponding to the webpage to be judged, and then simulate access to the uniform resource locator (URL) corresponding to the domain name in order to crawl the page information of the webpage to be judged. The webpage to be judged can be parsed by parsing its original URL (Uniform Resource Locator). Parsing extracts the domain name from the link of the webpage to be judged, where the domain name can be defined as the character string in the URL between the leading "http://" and the first ":" that follows it. For example, if the link of the webpage to be judged is http://example.com:1234/test.htm, the domain name extracted by parsing can be example.com. When simulating access to the URL corresponding to the domain name in order to crawl the page information of the webpage to be judged, a Python crawler library can be used, or the access can be simulated in another programming language; the information on the page is crawled through the simulated access.
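The domain name rule described above can be sketched in two ways. `domain_by_rule` is a literal reading of the stated definition and therefore requires an explicit port; `domain_by_urlsplit` uses the standard library's `urllib.parse.urlsplit` as a more general alternative that the source does not mention. Both function names are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def domain_by_rule(url: str) -> str:
    """Text between the leading 'http://' and the first ':' that follows it."""
    prefix = "http://"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with http://")
    rest = url[len(prefix):]
    end = rest.find(":")
    if end == -1:
        raise ValueError("no ':' found after the scheme")
    return rest[:end]


def domain_by_urlsplit(url: str) -> Optional[str]:
    """Host name as parsed by the standard library (port not required)."""
    return urlsplit(url).hostname


link = "http://example.com:1234/test.htm"
domain = domain_by_rule(link)
```

For the example link, both functions yield example.com. The literal rule fails on links without a port, which is why the `urlsplit` variant may be preferable in practice.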
Extraction module 802, configured to extract title information from the page information;
After the page information of the webpage to be judged is obtained, the title information is extracted from the crawled HTML (HyperText Markup Language) page.
Judgment module 803, configured to judge whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type;
The extracted title information is then examined to determine whether it contains a preset keyword from which the webpage type can be determined directly, for example, a preset keyword such as "ends of the earth", "forum", "news", or "blog".
Third acquiring unit 804, configured to, if the title information does not contain a preset keyword, obtain the page information of several webpages serving as reference standards under at least one known webpage type;
When the title information does not contain a preset keyword, that is, when the webpage type cannot be determined directly from the title information of the webpage to be judged, webpages of at least one known webpage type are obtained first, the page information of each webpage of a known webpage type is obtained, and the obtained page information is used as the reference standard.
Third statistics unit 805, configured to extract tag information serving as a reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and to count the number of reference-standard tag information items under each known webpage type;
After the page information of several reference-standard webpages under at least one known webpage type is obtained, tag information serving as the reference standard is extracted from the page structure information corresponding to that page information, where each page structure contains multiple items of tag information. For example, take a webpage of one known webpage type that contains the tags "meta", "link", "span", "a", and "p". The number of each reference-standard tag is counted: "meta" occurs 12 times, "link" 3 times, "span" 5 times, "a" 3 times, and "p" 3 times.
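Counting the tags of a page can be sketched with Python's standard `html.parser` module. The class `TagCounter` and the function `count_tags` are hypothetical names, and counting every start tag is an assumption about what "tag information" covers.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each start tag occurs in a page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are also routed here.
        self.counts[tag] += 1


def count_tags(html: str) -> Counter:
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


sample = (
    '<html><head><meta charset="utf-8"><meta name="a" content="b">'
    '<link rel="stylesheet" href="s.css"></head>'
    '<body><p><a href="#">x</a></p><span>y</span></body></html>'
)
tag_counts = count_tags(sample)
```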
Second extraction unit 806, configured to extract at least one item of tag information from the page structure information corresponding to the page information of the webpage to be judged;
At the same time, at least one item of tag information used to determine the page type is extracted from the page structure information corresponding to the page information of the webpage to be judged. For example, "meta" and "div" are extracted.
Third matching unit 807, configured to match each item of tag information against the reference-standard tag information, and to count the number of successfully matched tag information items under each known webpage type;
Each item of tag information of the webpage to be judged is matched against the reference-standard tag information. In the above example, matching finds that the tag "meta" of the webpage to be judged matches the reference-standard tag "meta". The number of successfully matched tag information items under each known webpage type is then counted; here the count is 10 occurrences of "meta".
Third comparing unit 808, configured to obtain, for each known webpage type, a first ratio of the number of successfully matched tag information items to the number of reference-standard tag information items under that known webpage type, and to compare the first ratio with a first preset ratio;
The first ratio of the number of successfully matched tag information items under each known webpage type to the number of reference-standard tag information items under that known webpage type is then obtained. In the above example, the ratio for the tag "meta" is 10/12, that is, 5/6. The obtained first ratio is then compared with the first preset ratio, where the first preset ratio can be set flexibly according to actual requirements. Naturally, the closer the first preset ratio is to the obtained ratio, the more accurate the determined webpage type.
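The arithmetic of the tag example can be checked with a short sketch. The function `tag_ratio` and the value chosen for `first_preset` are assumptions for illustration; the source leaves the preset ratio to be set according to actual requirements.

```python
from fractions import Fraction
from typing import Dict

# Tag counts from the example: reference page and matched occurrences.
reference_tags: Dict[str, int] = {"meta": 12, "link": 3, "span": 5, "a": 3, "p": 3}
matched_tags: Dict[str, int] = {"meta": 10}


def tag_ratio(tag: str, matched: Dict[str, int], reference: Dict[str, int]) -> Fraction:
    """Matched occurrences of a tag divided by its reference occurrences."""
    return Fraction(matched.get(tag, 0), reference[tag])


first_ratio = tag_ratio("meta", matched_tags, reference_tags)  # 10/12 reduces to 5/6
first_preset = Fraction(4, 5)  # hypothetical first preset ratio
passes_first_stage = first_ratio >= first_preset
```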
Fourth acquiring unit 809, configured to, if the first ratio is greater than or equal to the first preset ratio, obtain the title information of several reference-standard webpages under the known webpage type corresponding to the first ratio;
When the first ratio is greater than or equal to the first preset ratio, the webpages of the known webpage type corresponding to the first ratio are further obtained, the title information in the page information of each of these webpages is obtained, and the obtained title information is used as the reference standard. It should be noted that there may be multiple webpages of the known webpage type corresponding to the first ratio.
Fourth statistics unit 810, configured to split phrases serving as reference standards out of the title information of the reference-standard webpages, and to count the number of reference-standard phrases under each known webpage type;
After the title information of several reference-standard webpages under at least one known webpage type is obtained, the title information is split into phrases, yielding the phrases that serve as the reference standard. For example, take a webpage of one known webpage type whose title information is "As a lipstick fan, choose the brand of lipstick and try out the color of lipstick". The title is segmented into the reference-standard phrases "lipstick", "brand", and "color", and the number of each reference-standard phrase is counted: "lipstick" occurs 3 times, "brand" once, and "color" once.
Second splitting unit 811, configured to split at least one phrase out of the title information of the webpage to be judged;
At the same time, at least one phrase is split out of the title information of the webpage to be judged. For example, if the title information of the webpage to be judged is "When choosing a lipstick, there are many types of lipstick", the phrases split out include "lipstick" and "type".
Fourth matching unit 812, configured to match each phrase against the reference-standard phrases, and to count the number of successfully matched phrases under each known webpage type;
Each phrase split out of the title information of the webpage to be judged is matched against the reference-standard phrases. In the above example, matching finds that "lipstick" in the title information of the webpage to be judged matches the reference-standard phrase "lipstick". The number of successfully matched phrases under each known webpage type is then counted; here the count is 2 occurrences of "lipstick".
Fourth comparing unit 813, configured to obtain, for each known webpage type, a second ratio of the number of successfully matched phrases to the number of reference-standard phrases under that known webpage type, and to compare the second ratio with a second preset ratio;
The second ratio of the number of successfully matched phrases under each known webpage type to the number of reference-standard phrases under that known webpage type is then obtained. In the above example, the ratio for the phrase "lipstick" is 2/3. The obtained second ratio is then compared with the second preset ratio, where the second preset ratio can be set flexibly according to actual requirements. Naturally, the closer the second preset ratio is to the obtained second ratio, the more accurate the determined webpage type.
Third output unit 814, configured to, if the second ratio is greater than or equal to the second preset ratio, take the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
When the obtained second ratio is greater than or equal to the second preset ratio, the known webpage type corresponding to the second ratio is taken as the webpage type of the webpage to be judged. In the above example, the webpage type of the reference webpage whose title information is "As a lipstick fan, choose the brand of lipstick and try out the color of lipstick" is taken as the webpage type of the webpage to be judged.
It should be noted that, in the above embodiment, if the second ratio is less than the second preset ratio, the known webpage type corresponding to the first ratio can be taken as the webpage type of the webpage to be judged.
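The two-stage decision of this embodiment, including the fallback just described, can be sketched as follows. The function `two_stage_type` and the type names are hypothetical, and choosing the highest ratio when several known types qualify is an assumption of this sketch rather than something the source specifies.

```python
from fractions import Fraction
from typing import Dict, Optional


def two_stage_type(
    tag_ratios: Dict[str, Fraction],
    title_ratios: Dict[str, Fraction],
    first_preset: Fraction,
    second_preset: Fraction,
) -> Optional[str]:
    """Stage 1 filters known types by tag ratio; stage 2 confirms by title ratio."""
    # Stage 1: known types whose first (tag) ratio reaches the first preset ratio.
    candidates = {t: r for t, r in tag_ratios.items() if r >= first_preset}
    if not candidates:
        return None
    # Stage 2: among those, types whose second (title) ratio reaches the second preset.
    confirmed = {
        t: title_ratios.get(t, Fraction(0))
        for t in candidates
        if title_ratios.get(t, Fraction(0)) >= second_preset
    }
    if confirmed:
        return max(confirmed, key=confirmed.get)
    # Fallback: no second ratio is high enough, so use the first-ratio type.
    return max(candidates, key=candidates.get)


tag_ratios = {"forum": Fraction(5, 6), "blog": Fraction(3, 4)}
title_ratios = {"forum": Fraction(1, 3), "blog": Fraction(2, 3)}
confirmed_type = two_stage_type(tag_ratios, title_ratios, Fraction(1, 2), Fraction(1, 2))
fallback_type = two_stage_type(tag_ratios, title_ratios, Fraction(1, 2), Fraction(3, 4))
```

In the first call the title ratio of "blog" reaches the second preset ratio, so "blog" is returned; in the second call no title ratio is high enough, so the type with the qualifying tag ratio, "forum", is returned.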
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code initialized with the following method steps:
Obtaining the page information of a webpage to be judged;
Extracting title information from the page information;
Judging whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type;
If the title information does not contain the preset keyword, obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are only embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.
Claims (12)
- 1. A method for judging a webpage type, characterized by comprising: obtaining page information of a webpage to be judged; extracting title information from the page information; judging whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type; and if the title information does not contain the preset keyword, obtaining the webpage type of the webpage to be judged based on page structure information corresponding to the page information and/or the title information.
- 2. The method according to claim 1, characterized in that the method further comprises: if the title information contains the preset keyword, taking the webpage type corresponding to the preset keyword as the webpage type of the webpage to be judged.
- 3. The method according to claim 1, characterized in that obtaining the page information of the webpage to be judged comprises: parsing the webpage to be judged and extracting the domain name of the link corresponding to the webpage to be judged; and simulating access to the uniform resource locator (URL) corresponding to the domain name to crawl the page information of the webpage to be judged.
- 4. The method according to any one of claims 1-3, characterized in that obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information comprises: obtaining page information of several webpages serving as reference standards under at least one known webpage type; extracting tag information serving as a reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and counting the number of reference-standard tag information items under each known webpage type; extracting at least one item of tag information from the page structure information corresponding to the page information of the webpage to be judged; matching each item of tag information against the reference-standard tag information, and counting the number of successfully matched tag information items under each known webpage type; obtaining, for each known webpage type, the ratio of the number of successfully matched tag information items to the number of reference-standard tag information items under that known webpage type, and comparing the ratio with a preset ratio; and if the ratio is greater than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
- 5. The method according to any one of claims 1-3, characterized in that obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information comprises: obtaining title information of several webpages serving as reference standards under at least one known webpage type; splitting phrases serving as reference standards out of the title information of the reference-standard webpages, and counting the number of reference-standard phrases under each known webpage type; splitting at least one phrase out of the title information of the webpage to be judged; matching each phrase against the reference-standard phrases, and counting the number of successfully matched phrases under each known webpage type; obtaining, for each known webpage type, the ratio of the number of successfully matched phrases to the number of reference-standard phrases under that known webpage type, and comparing the ratio with a preset ratio; and if the ratio is greater than or equal to the preset ratio, taking the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
- 6. The method according to any one of claims 1-3, characterized in that obtaining the webpage type of the webpage to be judged based on the page structure information corresponding to the page information and/or the title information comprises: obtaining page information of several webpages serving as reference standards under at least one known webpage type; extracting tag information serving as a reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and counting the number of reference-standard tag information items under each known webpage type; extracting at least one item of tag information from the page structure information corresponding to the page information of the webpage to be judged; matching each item of tag information against the reference-standard tag information, and counting the number of successfully matched tag information items under each known webpage type; obtaining, for each known webpage type, a first ratio of the number of successfully matched tag information items to the number of reference-standard tag information items under that known webpage type, and comparing the first ratio with a first preset ratio; if the first ratio is greater than or equal to the first preset ratio, obtaining title information of several reference-standard webpages under the known webpage type corresponding to the first ratio; splitting phrases serving as reference standards out of the title information of the reference-standard webpages, and counting the number of reference-standard phrases under each known webpage type; splitting at least one phrase out of the title information of the webpage to be judged; matching each phrase against the reference-standard phrases, and counting the number of successfully matched phrases under each known webpage type; obtaining, for each known webpage type, a second ratio of the number of successfully matched phrases to the number of reference-standard phrases under that known webpage type, and comparing the second ratio with a second preset ratio; and if the second ratio is greater than or equal to the second preset ratio, taking the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
- 7. A device for judging a webpage type, characterized by comprising: an acquisition module, configured to obtain page information of a webpage to be judged; an extraction module, configured to extract title information from the page information; a judgment module, configured to judge whether the title information contains a preset keyword, the preset keyword being a keyword that contains a webpage type; and a processing module, configured to, if the title information does not contain the preset keyword, obtain the webpage type of the webpage to be judged based on page structure information corresponding to the page information and/or the title information.
- 8. The device according to claim 7, characterized in that the processing module is further configured to, if the title information contains the preset keyword, take the webpage type corresponding to the preset keyword as the webpage type of the webpage to be judged.
- 9. The device according to claim 8, characterized in that the acquisition module comprises: a parsing unit, configured to parse the webpage to be judged and extract the domain name of the link corresponding to the webpage to be judged; and a simulated access unit, configured to simulate access to the uniform resource locator (URL) corresponding to the domain name to crawl the page information of the webpage to be judged.
- 10. The device according to any one of claims 7-9, characterized in that the processing module comprises: a first acquiring unit, configured to obtain page information of several webpages serving as reference standards under at least one known webpage type; a first statistics unit, configured to extract tag information serving as a reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and to count the number of reference-standard tag information items under each known webpage type; a first extraction unit, configured to extract at least one item of tag information from the page structure information corresponding to the page information of the webpage to be judged; a first matching unit, configured to match each item of tag information against the reference-standard tag information, and to count the number of successfully matched tag information items under each known webpage type; a first comparing unit, configured to obtain, for each known webpage type, the ratio of the number of successfully matched tag information items to the number of reference-standard tag information items under that known webpage type, and to compare the ratio with a preset ratio; and a first output unit, configured to, if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
- 11. The device according to any one of claims 7-9, characterized in that the processing module comprises: a second acquiring unit, configured to obtain title information of several webpages serving as reference standards under at least one known webpage type; a second statistics unit, configured to split phrases serving as reference standards out of the title information of the reference-standard webpages, and to count the number of reference-standard phrases under each known webpage type; a first splitting unit, configured to split at least one phrase out of the title information of the webpage to be judged; a second matching unit, configured to match each phrase against the reference-standard phrases, and to count the number of successfully matched phrases under each known webpage type; a second comparing unit, configured to obtain, for each known webpage type, the ratio of the number of successfully matched phrases to the number of reference-standard phrases under that known webpage type, and to compare the ratio with a preset ratio; and a second output unit, configured to, if the ratio is greater than or equal to the preset ratio, take the known webpage type corresponding to the ratio as the webpage type of the webpage to be judged.
- 12. The device according to any one of claims 7-9, characterized in that the processing module comprises: a third acquiring unit, configured to obtain page information of several webpages serving as reference standards under at least one known webpage type; a third statistics unit, configured to extract tag information serving as a reference standard from the page structure information corresponding to the page information of the reference-standard webpages, and to count the number of reference-standard tag information items under each known webpage type; a second extraction unit, configured to extract at least one item of tag information from the page structure information corresponding to the page information of the webpage to be judged; a third matching unit, configured to match each item of tag information against the reference-standard tag information, and to count the number of successfully matched tag information items under each known webpage type; a third comparing unit, configured to obtain, for each known webpage type, a first ratio of the number of successfully matched tag information items to the number of reference-standard tag information items under that known webpage type, and to compare the first ratio with a first preset ratio; a fourth acquiring unit, configured to, if the first ratio is greater than or equal to the first preset ratio, obtain title information of several reference-standard webpages under the known webpage type corresponding to the first ratio; a fourth statistics unit, configured to split phrases serving as reference standards out of the title information of the reference-standard webpages, and to count the number of reference-standard phrases under each known webpage type; a second splitting unit, configured to split at least one phrase out of the title information of the webpage to be judged; a fourth matching unit, configured to match each phrase against the reference-standard phrases, and to count the number of successfully matched phrases under each known webpage type; a fourth comparing unit, configured to obtain, for each known webpage type, a second ratio of the number of successfully matched phrases to the number of reference-standard phrases under that known webpage type, and to compare the second ratio with a second preset ratio; and a third output unit, configured to, if the second ratio is greater than or equal to the second preset ratio, take the known webpage type corresponding to the second ratio as the webpage type of the webpage to be judged.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611270198.0A CN108255891B (en) | 2016-12-29 | 2016-12-29 | Method and device for judging webpage type |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611270198.0A CN108255891B (en) | 2016-12-29 | 2016-12-29 | Method and device for judging webpage type |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108255891A (en) | 2018-07-06 |
CN108255891B (en) | 2020-08-28 |
Family
ID=62721846
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611270198.0A Active CN108255891B (en) | 2016-12-29 | 2016-12-29 | Method and device for judging webpage type |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108255891B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727500A (en) * | 2010-01-15 | 2010-06-09 | 清华大学 | Text classification method of Chinese web page based on steam clustering |
CN101814083A (en) * | 2010-01-08 | 2010-08-25 | 上海复歌信息科技有限公司 | Automatic webpage classification method and system |
WO2012083874A1 (en) * | 2010-12-22 | 2012-06-28 | 北大方正集团有限公司 | Webpage information detection method and system |
CN103309862A (en) * | 2012-03-07 | 2013-09-18 | 腾讯科技(深圳)有限公司 | Webpage type recognition method and system |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110287409A (en) * | 2019-06-05 | 2019-09-27 | 新华三信息安全技术有限公司 | A kind of webpage type identification method and device |
CN110287409B (en) * | 2019-06-05 | 2022-07-22 | 新华三信息安全技术有限公司 | Webpage type identification method and device |
WO2021253252A1 (en) * | 2020-06-17 | 2021-12-23 | 深圳市欢太数字科技有限公司 | Method and apparatus for testing webpage, and electronic device and storage medium |
CN113297525A (en) * | 2021-06-17 | 2021-08-24 | 恒安嘉新(北京)科技股份公司 | Webpage classification method and device, electronic equipment and storage medium |
CN113297525B (en) * | 2021-06-17 | 2023-12-12 | 恒安嘉新(北京)科技股份公司 | Webpage classification method, device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN108255891B (en) | 2020-08-28 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US20220197923A1 (en) | Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information | |
CN109145216A (en) | Network public-opinion monitoring method, device and storage medium | |
CN108255862B (en) | A kind of search method and device of judgement document | |
WO2014101783A1 (en) | Method and server for performing cloud detection for malicious information | |
CN113282955B (en) | Method, system, terminal and medium for extracting privacy information in privacy policy | |
Sarne et al. | Unsupervised topic extraction from privacy policies | |
CN102609412A (en) | RSS (Really Simple Syndication)-based multi-thread graphic information synchronization crawling control method and system | |
US20150100877A1 (en) | Method or system for automated extraction of hyper-local events from one or more web pages | |
CN108255891A (en) | A kind of method and device for differentiating type of webpage | |
Cardoso et al. | An efficient language-independent method to extract content from news webpages | |
Tavakoli et al. | Metadata analysis of open educational resources | |
CN112818200A (en) | Data crawling and event analyzing method and system based on static website | |
CN105183843B (en) | list page identification system and method | |
CN105786929B (en) | A kind of information monitoring method and device | |
Siddiqui et al. | Developing an Arabic plagiarism detection corpus | |
CN116108776A (en) | Method for improving completeness of chip verification test plan | |
CN112559754A (en) | Judgment result processing method and device | |
CN106462614B (en) | Information analysis system, information analysis method, and information analysis program | |
CN105868346A (en) | Picture extraction method and device applied to web page | |
Bosse et al. | Web Data Mining 1: Collecting textual data from web pages using R | |
Alqahtani | Automated Extraction of Security Concerns from Bug Reports | |
CN106649337A (en) | Method and device for identifying webpage column | |
CN107220362A (en) | A kind of web crawlers for network documentation extracts URL and the framework for indexing and being mapped with keyword | |
CN108062337A (en) | A kind of method and device to label to reptile seed | |
CN108536688A (en) | It was found that the whole network multi-language website and the method for obtaining parallel corpora |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | |