CN110020042A - A web-based image acquiring method and device - Google Patents

Publication number
CN110020042A
Authority
CN
China
Prior art keywords
image
logical line
body matter
text
ratio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710744731.0A
Other languages
Chinese (zh)
Other versions
CN110020042B (en
Inventor
管国辰
刘中军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201710744731.0A
Publication of CN110020042A
Application granted
Publication of CN110020042B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the invention disclose a web-based image acquiring method and device. The position of every image in the grabbed webpage information is determined; for each position, body matter associated with the position is identified, and the second image subject of the image at the position is determined according to that body matter. If the degree of association between the second image subject and the image subject of the image the user wants to obtain (the first image subject) meets a preset rule, the image at the position is obtained. First, the scheme does not merely associate the webpage information with the first image subject; it associates the second image subject of each image in the webpage information with the first image subject and obtains only the images associated with the first image subject, which improves the accuracy of image acquisition. Second, the second image subject of an image is determined from the body matter associated with the image's position in the webpage information, which improves the accuracy of the determined image subject.

Description

A web-based image acquiring method and device
Technical field
The present invention relates to the field of Internet technology, and in particular to a web-based image acquiring method and device.
Background
Existing schemes can automatically obtain image resources, such as video images or single images, from Internet webpages according to a user's needs. An existing image acquisition scheme generally works as follows: first determine the image subject of the image the user wants to obtain, then grab webpage information related to that image subject from Internet webpages, and download all image resources in the webpage information.
In the above scheme, although the grabbed webpage information is related to the image subject, the image resources in that webpage information are not necessarily all related to the image subject. In other words, some of the downloaded image resources are not what the user needs, so the accuracy of image acquisition is poor.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a web-based image acquiring method and device that improve the accuracy of image acquisition.
To achieve the above purpose, an embodiment of the invention discloses a web-based image acquiring method, comprising:
determining the first image subject of the image to be obtained;
grabbing webpage information in a network;
determining the position of every image in the webpage information;
for each determined position, identifying body matter associated with the position and determining, according to the body matter, the second image subject of the image at the position; calculating the degree of association between the second image subject and the first image subject; and, if the degree of association meets a preset rule, obtaining the image at the position.
Optionally, identifying, for each determined position, body matter associated with the position may include:
for each determined position, determining the text content whose distance from the position is within a preset range;
identifying the body matter in the text content as the body matter associated with the position.
Optionally, determining, for each determined position, the text content whose distance from the position is within a preset range may include:
for each determined position, taking the position as a starting point and searching a preset number of logical lines before and after it;
and identifying the body matter in the text content as the body matter associated with the position comprises:
for each logical line found, judging whether the logical line is body matter;
if so, taking the logical line as body matter associated with the position.
Optionally, judging whether the logical line is body matter comprises:
determining the text byte number of the logical line, that is, the number of bytes of text containing hypertext markup language (HTML) tags in the logical line;
if the text byte number is less than a first preset threshold, the logical line is body matter;
or,
determining the label text ratio of the logical line, the label text ratio being the ratio of the number of bytes of text containing HTML tags to the total number of bytes of the logical line;
if the label text ratio is less than a second preset threshold, the logical line is body matter;
or,
determining the hyperlink ratio of the logical line, the hyperlink ratio being the ratio of the number of hyperlink bytes to the total number of bytes;
if the hyperlink ratio is less than a third preset threshold, the logical line is body matter;
or,
determining the text byte number, the label text ratio and the hyperlink ratio of the logical line;
if the text byte number is less than the first preset threshold, the label text ratio is less than the second preset threshold, and the hyperlink ratio is less than the third preset threshold, the logical line is body matter.
Optionally, determining, according to the body matter, the second image subject of the image at the position may include:
for each logical line in the body matter, determining the weight of the logical line according to its text byte number and/or label text ratio and/or hyperlink ratio, and according to the distance between the logical line and the image;
determining the second image subject of the image at the position according to the content and weight of each logical line in the body matter.
Optionally, determining, according to the body matter, the second image subject of the image at the position may include:
obtaining one or more of the following: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image;
inputting the obtained content and the body matter into a vector model constructed in advance to obtain the second image subject of the image at the position, wherein the vector model includes a weight corresponding to each input item.
Optionally, calculating the degree of association between the second image subject and the first image subject comprises:
calculating the direct association degree between the second image subject and the first image subject;
obtaining the indirect association degree between the second image subject and the first image subject, the indirect association degree being the degree of association between the upper-level webpage information of the webpage information and the first image subject;
combining the direct association degree and the indirect association degree to calculate the degree of association between the second image subject and the first image subject.
Optionally, determining the position of every image in the webpage information may include:
filtering the noise information in the webpage information;
determining the position of every image in the filtered webpage information.
To achieve the above purpose, an embodiment of the invention also discloses a web-based image acquiring device, comprising:
a first determining module, configured to determine the first image subject of the image to be obtained;
a grabbing module, configured to grab webpage information in a network;
a second determining module, configured to determine the position of every image in the webpage information;
an identification module, configured to identify, for each determined position, body matter associated with the position;
a third determining module, configured to determine, according to the body matter, the second image subject of the image at the position;
a computing module, configured to calculate the degree of association between the second image subject and the first image subject;
an obtaining module, configured to obtain the image at the position when the degree of association meets a preset rule.
Optionally, the identification module may include:
a determining submodule, configured to determine, for each determined position, the text content whose distance from the position is within a preset range;
an identification submodule, configured to identify the body matter in the text content as the body matter associated with the position.
Optionally, the determining submodule may be specifically configured to: for each determined position, take the position as a starting point and search a preset number of logical lines before and after it;
and the identification submodule may be specifically configured to: for each logical line found, judge whether the logical line is body matter; if so, take the logical line as body matter associated with the position.
Optionally, the identification submodule may be specifically configured to:
for each logical line found, determine the text byte number of the logical line, that is, the number of bytes of text containing HTML tags in the logical line; if the text byte number is less than a first preset threshold, the logical line is body matter and is taken as body matter associated with the position;
or,
for each logical line found, determine the label text ratio of the logical line, the label text ratio being the ratio of the number of bytes of text containing HTML tags to the total number of bytes of the logical line; if the label text ratio is less than a second preset threshold, the logical line is body matter and is taken as body matter associated with the position;
or,
for each logical line found, determine the hyperlink ratio of the logical line, the hyperlink ratio being the ratio of the number of hyperlink bytes to the total number of bytes; if the hyperlink ratio is less than a third preset threshold, the logical line is body matter and is taken as body matter associated with the position;
or,
for each logical line found, determine the text byte number, the label text ratio and the hyperlink ratio of the logical line; if the text byte number is less than the first preset threshold, the label text ratio is less than the second preset threshold, and the hyperlink ratio is less than the third preset threshold, the logical line is body matter and is taken as body matter associated with the position.
Optionally, the third determining module may be specifically configured to:
for each logical line in the body matter, determine the weight of the logical line according to its text byte number and/or label text ratio and/or hyperlink ratio, and according to the distance between the logical line and the image;
determine the second image subject of the image at the position according to the content and weight of each logical line in the body matter.
Optionally, the third determining module may be specifically configured to:
obtain one or more of the following: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image;
input the obtained content and the body matter into a vector model constructed in advance to obtain the second image subject of the image at the position, wherein the vector model includes a weight corresponding to each input item.
Optionally, the computing module may be specifically configured to:
calculate the direct association degree between the second image subject and the first image subject;
obtain the indirect association degree between the second image subject and the first image subject, the indirect association degree being the degree of association between the upper-level webpage information of the webpage information and the first image subject;
combine the direct association degree and the indirect association degree to calculate the degree of association between the second image subject and the first image subject.
Optionally, the second determining module may be specifically configured to: filter the noise information in the webpage information; and determine the position of every image in the filtered webpage information.
To achieve the above purpose, an embodiment of the invention also discloses an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another over the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement any of the above web-based image acquiring methods when executing the program stored in the memory.
To achieve the above purpose, an embodiment of the invention also discloses a computer-readable storage medium in which a computer program is stored; when executed by a processor, the computer program implements any of the above web-based image acquiring methods.
With the embodiments of the present invention, the position of every image in the grabbed webpage information is determined; for each position, body matter associated with the position is identified, and the second image subject of the image at the position is determined according to that body matter. If the degree of association between the second image subject and the image subject of the image the user wants to obtain (the first image subject) meets a preset rule, the image at the position is obtained. It can be seen that, first, the scheme does not merely associate the webpage information with the first image subject; it associates the second image subject of every image in the webpage information with the first image subject and obtains only the images associated with the first image subject, which improves the accuracy of image acquisition. Second, the second image subject of an image is determined from the body matter associated with the image's position in the webpage information, which improves the accuracy of the determined image subject.
Of course, a product or method implementing the present invention does not necessarily need to achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of a web-based image acquiring method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of an embodiment provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of a web-based image acquiring device provided by an embodiment of the present invention;
Fig. 4 is a structural diagram of an electronic device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
To solve the above technical problem, the embodiments of the invention provide a web-based image acquiring method and device. The method can be applied to various electronic devices such as mobile phones and computers, without specific limitation.
The web-based image acquiring method provided by the embodiments of the present invention is described in detail first.
Fig. 1 is a flow diagram of a web-based image acquiring method provided by an embodiment of the present invention, comprising:
S101: the first image subject of image to be obtained is determined.
For ease of distinction, in this embodiment the image subject of the image to be obtained is called the first image subject, and the image subject of an image in the grabbed webpage information, discussed later, is called the second image subject.
As one implementation, the first image subject can be parsed directly from an image acquisition instruction sent by the user. For example, a search interface is provided in which the user can enter information about the image to be obtained; that information serves as the image acquisition instruction, and parsing it yields the first image subject of the image to be obtained.
Alternatively, as another implementation, the first image subject can be parsed from content the user is interested in, based on the user's habits. For example, if the user often retrieves content in a certain technical field, the first image subject can be determined from that technical field. There are many ways to determine an image subject, too many to list here.
In one implementation, an image subject may consist of one or more groups of keywords, each group including one or more semantically similar keywords. Alternatively, in simpler cases, an image subject may include only one keyword.
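The keyword-group representation described above can be pictured with a small data structure. This is only an illustrative sketch: the class name `ImageSubject`, its field and the sample keywords are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ImageSubject:
    """Hypothetical container: each group holds semantically similar keywords."""
    keyword_groups: list[frozenset[str]] = field(default_factory=list)

    def keywords(self) -> set[str]:
        """Flatten all keyword groups into a single set of keywords."""
        result: set[str] = set()
        for group in self.keyword_groups:
            result |= group
        return result


# A subject made of two groups, and the simpler single-keyword case.
subject = ImageSubject([frozenset({"camera", "webcam"}), frozenset({"surveillance"})])
simple_subject = ImageSubject([frozenset({"camera"})])
```

Keeping the groups separate, rather than storing one flat keyword list, would let a later matching step treat synonyms inside one group as a single concept.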
S102: webpage information is grabbed in a network.
In this embodiment, webpage information can be grabbed in the network using a web crawler or by other means. Those skilled in the art will understand that the grabbed webpage information usually differs from the webpage a user views: the grabbed webpage information may be in HTML (HyperText Markup Language) format or in another format. The following description takes the HTML format as an example, which does not limit the present invention.
S103: the position of every image in the webpage information is determined.
Those skilled in the art will understand that webpage information in HTML format usually carries <image> tags, that is, image tags, each indicating that an image URL is located there; in HTML-format webpage information, an image URL represents an image. Therefore, <image> tags can be identified in the HTML-format webpage information and, based on the identification result, the position of each image URL in the webpage information, that is, the position of every image, can be determined.
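One way to locate image tags and record their positions is sketched below with Python's standard `html.parser` module. Several things here are assumptions rather than details from the patent: a "position" is taken to be the source line number of the tag, the image URL is read from the `src` attribute, and both `img` (the standard HTML element name) and `image` (the name used in the text above) are accepted.

```python
from html.parser import HTMLParser


class ImagePositionFinder(HTMLParser):
    """Collects (line_number, image_url) for every image tag encountered."""

    def __init__(self):
        super().__init__()
        self.positions = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; self-closing tags are routed here too.
        if tag in ("img", "image"):
            src = dict(attrs).get("src")
            if src:
                line, _column = self.getpos()  # 1-based line of the tag start
                self.positions.append((line, src))


def find_image_positions(html: str):
    finder = ImagePositionFinder()
    finder.feed(html)
    finder.close()
    return finder.positions


sample = "<p>intro</p>\n<img src=\"a.jpg\" alt=\"cat\">\n<p>text</p>\n<img src='b.png'/>"
positions = find_image_positions(sample)
```

Using line numbers as positions fits the later steps, which search logical lines before and after each image.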
As one implementation, S103 may include: filtering the noise information in the webpage information; and determining the position of every image in the filtered webpage information.
Those skilled in the art will understand that, in HTML-format webpage information, the content inside the <body> tag is usually the main part of the webpage, so the content inside the <body> tag can be extracted first. This content generally contains noise information, such as inline style text, JavaScript code and HTML comments; such information is unrelated to the image content and can be regarded as noise. In this embodiment, the noise information is filtered out first and the image positions are determined afterwards, which improves processing efficiency.
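The filtering step can be sketched with regular expressions. This is a simplified illustration, not the patent's implementation: the function name `extract_clean_body` is hypothetical, "inline style text" is interpreted as `<style>` blocks, and regex-based stripping is an assumption that will not cope with every malformed page.

```python
import re

# Patterns for the noise types named in the text: scripts, styles, HTML comments.
_NOISE_PATTERNS = [
    re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<style\b.*?</style\s*>", re.IGNORECASE | re.DOTALL),
    re.compile(r"<!--.*?-->", re.DOTALL),
]
_BODY_PATTERN = re.compile(r"<body\b[^>]*>(.*?)</body\s*>", re.IGNORECASE | re.DOTALL)


def extract_clean_body(html: str) -> str:
    """Return the <body> content with script, style and comment noise removed."""
    match = _BODY_PATTERN.search(html)
    body = match.group(1) if match else html  # fall back to the whole document
    for pattern in _NOISE_PATTERNS:
        body = pattern.sub("", body)
    return body


page = (
    "<html><head><style>p{}</style></head><body><p>hi</p>"
    "<script>var a=1;</script><!-- note --><img src='a.jpg'></body></html>"
)
clean = extract_clean_body(page)
```

Because removing multi-line blocks shifts line numbers, image positions would be computed on the filtered text, which matches the order of operations described above.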
S104: being directed to identified each position, identifies body matter associated with the position.
As one implementation, for each determined position, the text content whose distance from the position is within a preset range can be taken as the body matter associated with the position.
As another implementation, for each determined position, the text content whose distance from the position is within a preset range can be determined, and the body matter in that text content can be identified as the body matter associated with the position.
Specifically, for each determined position, the position can be taken as a starting point and a preset number of logical lines searched before and after it; for each logical line found, it is judged whether the logical line is body matter; if so, the logical line is taken as body matter associated with the position.
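The search around an image position can be sketched as a simple window over a list of lines. The sketch assumes that a logical line corresponds to one entry in a list of lines and that distance is the difference in line index; the function name and the `window` parameter are hypothetical.

```python
def neighbouring_lines(lines, image_index, window=3):
    """Return (index, distance, line) for up to `window` lines on each side
    of the image line, excluding the image line itself."""
    start = max(0, image_index - window)
    end = min(len(lines), image_index + window + 1)
    return [
        (i, abs(i - image_index), lines[i])
        for i in range(start, end)
        if i != image_index
    ]


lines = ["a", "b", "<img src='x.jpg'>", "c", "d", "e"]
nearby = neighbouring_lines(lines, image_index=2, window=2)
```

Returning the distance together with each line keeps the information needed later, when lines closer to the image are given more weight.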
There are many schemes for judging whether a logical line is body matter, for example:
1. Determine the text byte number of the logical line, that is, the number of bytes of text containing hypertext markup language tags (HTML tags) in the logical line; if the text byte number is less than a first preset threshold, the logical line is body matter.
2. Determine the label text ratio of the logical line, the label text ratio being the ratio of the number of bytes of text containing HTML tags to the total number of bytes of the logical line; if the label text ratio is less than a second preset threshold, the logical line is body matter.
3. Determine the hyperlink ratio of the logical line, the hyperlink ratio being the ratio of the number of hyperlink bytes to the total number of bytes; if the hyperlink ratio is less than a third preset threshold, the logical line is body matter. The total number of bytes may be that of the webpage information or that of the logical line, without specific limitation.
Alternatively, the above three schemes can be combined in any way, for example:
4. Determine the text byte number, the label text ratio and the hyperlink ratio of the logical line; if the text byte number is less than the first preset threshold, the label text ratio is less than the second preset threshold, and the hyperlink ratio is less than the third preset threshold, the logical line is body matter.
5. Determine the text byte number and the label text ratio of the logical line; if the text byte number is less than the first preset threshold and the label text ratio is less than the second preset threshold, the logical line is body matter.
6. Determine the label text ratio and the hyperlink ratio of the logical line; if the label text ratio is less than the second preset threshold and the hyperlink ratio is less than the third preset threshold, the logical line is body matter.
7. Determine the text byte number and the hyperlink ratio of the logical line; if the text byte number is less than the first preset threshold and the hyperlink ratio is less than the third preset threshold, the logical line is body matter.
It can be understood that the fewer text bytes containing HTML tags a logical line has, the smaller its label text ratio and the smaller its hyperlink ratio, the greater the probability that the logical line is body matter. If a logical line is body matter, it is determined to be body matter associated with the position, that is, body matter associated with the image at the position.
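The combined check (scheme 4) might look like the sketch below. The interpretation is an assumption: "text bytes containing HTML tags" is read as the number of UTF-8 bytes occupied by tag markup, hyperlink bytes as the bytes of complete `<a>…</a>` elements, and the total as the bytes of the logical line. The threshold values are placeholders, since the patent only speaks of preset thresholds.

```python
import re

TAG = re.compile(r"<[^>]+>")
LINK = re.compile(r"<a\b[^>]*>.*?</a\s*>", re.IGNORECASE | re.DOTALL)


def _nbytes(text: str) -> int:
    return len(text.encode("utf-8"))


def line_metrics(line: str):
    """Return (tag_bytes, label_text_ratio, hyperlink_ratio) for one logical line."""
    total = _nbytes(line)
    if total == 0:
        return 0, 0.0, 0.0
    tag_bytes = sum(_nbytes(m.group(0)) for m in TAG.finditer(line))
    link_bytes = sum(_nbytes(m.group(0)) for m in LINK.finditer(line))
    return tag_bytes, tag_bytes / total, link_bytes / total


def is_body_matter(line, max_tag_bytes=40, max_tag_ratio=0.3, max_link_ratio=0.3):
    """Scheme 4: all three metrics must fall below their (placeholder) thresholds."""
    tag_bytes, tag_ratio, link_ratio = line_metrics(line)
    return (
        bool(line.strip())
        and tag_bytes < max_tag_bytes
        and tag_ratio < max_tag_ratio
        and link_ratio < max_link_ratio
    )


prose = "<p>The new camera records sharp video at night.</p>"
navigation = '<a href="/home">Home</a> <a href="/news">News</a>'
```

Under these assumptions the prose line passes the check, while the navigation line, which is mostly tag and hyperlink bytes, is rejected.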
S105: according to the body matter, the second image subject of the image at the position is determined.
As one implementation, S105 may include: for each logical line in the body matter, determining the weight of the logical line according to its text byte number and/or label text ratio and/or hyperlink ratio, and according to the distance between the logical line and the image; and determining the second image subject of the image at the position according to the content and weight of each logical line in the body matter.
For example, for the position of each image URL in the webpage information, the body matter associated with the position obtained with the above scheme may take the following form:
ImgText = {(text(1), s_1), (text(2), s_2), …, (text(i), s_i), …, (text(n), s_n)},
where text(i) denotes the i-th logical line and s_i denotes the distance from the i-th logical line to the position.
It can be understood that the closer a logical line is to the image, the more relevant it is likely to be to the image content; in addition, the greater the probability that a logical line is body matter, the more relevant it is likely to be to the image content. The weight of a logical line can therefore be determined by considering both aspects together. The second image subject is then determined by combining the determined weights with the content of the logical lines.
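The patent does not give a formula for the weight, so the sketch below uses a purely hypothetical one: the weight falls with distance from the image and with the label text ratio and hyperlink ratio, and the weights are normalised to sum to 1. The function name and the formula are illustrative assumptions.

```python
def line_weights(lines):
    """lines: list of (distance, label_text_ratio, hyperlink_ratio) tuples.

    Hypothetical weighting: closer lines and lines that look more like body
    matter (lower ratios) receive larger weights; weights are normalised.
    """
    raw = [
        (1.0 / (1 + distance)) * (1 - tag_ratio) * (1 - link_ratio)
        for distance, tag_ratio, link_ratio in lines
    ]
    total = sum(raw)
    return [value / total for value in raw] if total else [0.0] * len(raw)


# Two clean lines at distances 1 and 3, then two equally distant lines
# where the second contains far more tag and hyperlink bytes.
by_distance = line_weights([(1, 0.0, 0.0), (3, 0.0, 0.0)])
by_quality = line_weights([(1, 0.1, 0.0), (1, 0.6, 0.5)])
```

Any monotone combination of the two aspects would serve the same purpose; the multiplicative form is chosen here only because it is short.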
As an example, assume that the body matter associated with the position of image A includes four logical lines: text(1), text(2), text(3) and text(4). Word segmentation and integration processing of these four logical lines determines their content topics: the content topic of text(1) is "education", that of text(2) is "tourism", that of text(3) is "finance", and that of text(4) is "amusement". In addition, assume the weights of the four logical lines are 30% for text(1), 40% for text(2), 20% for text(3) and 10% for text(4).
The second image subject can be a set carrying probability information, such as {30% "education", 40% "tourism", 20% "finance", 10% "amusement"}. Alternatively, a vector operation can be performed on this set to determine one or more subjects as the second image subject. Alternatively, the content topic with the largest weight can be taken directly as the second image subject, and so on; the specific way of determining it is not limited.
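The worked example above can be reproduced in a few lines. The sketch assumes each logical line has already been reduced to a single content topic; the function name `topic_distribution` is hypothetical.

```python
from collections import defaultdict


def topic_distribution(line_topics):
    """Sum line weights per content topic; returns {topic: total_weight}."""
    scores = defaultdict(float)
    for topic, weight in line_topics:
        scores[topic] += weight
    return dict(scores)


# The four logical lines of image A with the weights from the example.
image_a = [("education", 0.30), ("tourism", 0.40), ("finance", 0.20), ("amusement", 0.10)]
distribution = topic_distribution(image_a)

# Simplest option mentioned in the text: take the topic with the largest weight.
top_topic = max(distribution, key=distribution.get)
```

Here `distribution` corresponds to the set carrying probability information, and `top_topic` to the option of choosing the content topic with the largest weight.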
As another implementation, other content can also be considered when determining the second image subject, for example the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image. The image tag attributes may include the src attribute and the alt attribute in the <image> tag; the src attribute indicates the image name and the alt attribute indicates the image description.
Specifically, one or more of the following can be obtained: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image. Those skilled in the art will understand that this content can be obtained by means such as text word segmentation and keyword selection.
The obtained content and the body matter are input into a vector model constructed in advance to obtain the second image subject of the image at the position, wherein the vector model includes a weight corresponding to each input item.
A weight is set for each input item in the vector model. The input items may include the body matter and one or more of the following: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image. Each input item is treated as a vector and combined with its corresponding weight to determine the final second image subject. That is, the output of the vector model is the final second image subject.
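The patent does not specify the vector model, so the following is only one plausible reading: a weighted bag of keywords in which every input item contributes its tokens, scaled by that item's weight. The input names, the weights in `INPUT_WEIGHTS` and the function name are all hypothetical.

```python
from collections import Counter

# Hypothetical per-input weights; the patent only says each input item has one.
INPUT_WEIGHTS = {"body": 0.5, "page_title": 0.2, "image_title": 0.2, "alt": 0.1}


def subject_vector(inputs, weights=INPUT_WEIGHTS):
    """inputs: {input_name: [keyword, ...]}. Returns {keyword: weighted score}."""
    vector = Counter()
    for name, tokens in inputs.items():
        weight = weights.get(name, 0.0)  # unknown input items contribute nothing
        for token in tokens:
            vector[token] += weight
    return dict(vector)


vector = subject_vector({
    "body": ["camera", "night"],
    "page_title": ["camera"],
    "alt": ["dome"],
})
```

The keywords are assumed to come from a prior word segmentation and keyword selection step; the resulting keyword-to-score mapping plays the role of the second image subject.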
S106: the degree of association of second image subject and first image subject is calculated.
Specifically, an association algorithm or a similarity algorithm can be used to calculate the direct association degree between the second image subject and the first image subject determined in S101.
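The text leaves the choice of association or similarity algorithm open. As one common option, and not necessarily the one intended by the patent, cosine similarity between two keyword-weight mappings can serve as the direct association degree; the mapping representation is an assumption of this sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two {keyword: weight} mappings."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty subject is treated as unrelated
    return dot / (norm_a * norm_b)


first_subject = {"camera": 1.0, "surveillance": 1.0}
second_subject = {"camera": 0.7, "night": 0.5}
direct_degree = cosine_similarity(first_subject, second_subject)
```

With non-negative weights the result lies between 0 and 1, so a preset rule such as "degree of association above a threshold" can be applied to it directly.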
As one implementation, the direct association degree can be taken as the degree of association between the second image subject and the first image subject.
Alternatively, as another implementation, the indirect association degree between the second image subject and the first image subject can be obtained, the indirect association degree being the degree of association between the upper-level webpage information of the webpage information grabbed in S102 and the first image subject; the direct association degree and the indirect association degree are then combined to calculate the degree of association between the second image subject and the first image subject.
Those skilled in the art will understand that there may be a structural relationship, in other words a hierarchical relationship, between the grabbed pieces of webpage information. For example, webpage information X contains some links, and the webpage information corresponding to those links is the lower-level webpage information of X. Under normal circumstances, the contents of webpage information at different levels are also related, so the degree of association between the upper-level webpage information of the current-level webpage information and the first image subject can be used as the indirect association degree.
Specifically, for each level of webpage information, after the degree of association between the second image subject of each image it contains and the first image subject has been determined, the determined degree of association can be propagated, with a certain decay factor, to each lower-level webpage information connected to it. Likewise, the current-level webpage information can receive the degree of association propagated from its upper-level webpage information, which is the indirect association degree described above.
The decay factor a can be understood as a weight. For example, for the current-level webpage information, suppose the calculated direct association degree is P, the obtained indirect association degree is Q, and the decay factor is 30%; the finally determined degree of association is then S = 70%·P + 30%·Q. S is then propagated to the lower-level webpage information, which uses S as its indirect association degree.
Direct correlation degree P in above-mentioned example can be the second image subject and the of all images in the same level webpage information The average value of the degree of association of one image subject;Alternatively, an image can also be randomly choosed in the same level webpage information, by this Second image subject of image and the degree of association of the first image subject are as P;Alternatively, can also be random in the same level webpage information Multiple images are selected, using the average value of the second image subject of multiple images and the degree of association of the first image subject as P, etc. Deng specifically without limitation.
In addition, if the same level webpage information has corresponded to multiple higher level's webpage informations, it can be by this multiple higher level's webpage information The average value of the degree of association of propagation is as indirect association degree, alternatively, also can be randomly selected one is used as indirect association degree, etc. Deng specifically without limitation.
It should be noted that above-mentioned " higher level's webpage information " can only include upper level webpage information, that is to say, that above-mentioned The degree of association is only propagated between adjacent level.It will be understood by those skilled in the art that the degree of association can by certain decay factor into Row is propagated, even if only being propagated between adjacent level, the degree of association information for the also not just adjacent level which carries, Carry the degree of association information of other levels.As in above-mentioned example, next stage webpage information is using S as indirect association degree, in the S Also the degree of association Q of the upper two-stage webpage information comprising the next stage webpage information.
Alternatively, above-mentioned " higher level's webpage information " may include the webpage information of preset quantity level, specifically without limitation.
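The propagation described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the `Page` class, the `propagate` function, and the choice to use P alone for a top-level page with no upper level are all hypothetical. The combination rule S = (1 - a)P + aQ with a = 30% follows the example in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    """One level of webpage information; `direct` is its direct degree of association P."""
    name: str
    direct: float
    children: List["Page"] = field(default_factory=list)
    combined: Optional[float] = None  # S, filled in by propagate()


def propagate(page: Page, inherited: Optional[float] = None, decay: float = 0.3) -> None:
    """Compute S = (1 - decay) * P + decay * Q, then pass S down as each child's Q."""
    if inherited is None:
        # Assumption: a top-level page has no upper level, so S is just P.
        page.combined = page.direct
    else:
        page.combined = (1 - decay) * page.direct + decay * inherited
    for child in page.children:
        propagate(child, page.combined, decay)


# Hypothetical three-level hierarchy.
leaf = Page("grandchild", direct=0.2)
mid = Page("child", direct=0.5, children=[leaf])
root = Page("root", direct=0.9, children=[mid])
propagate(root)
```

Because each page passes its combined value S downward, the lowest page's result still reflects the top page's value, even though propagation only happens between adjacent levels.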
S107: If the degree of association meets a preset rule, obtain the image at the position.
The preset rule may be a threshold or another condition, set according to the actual situation. If the degree of association between the second image subject and the first image subject meets the preset rule, the image is downloaded. The image may then be provided to the user.
Alternatively, in one implementation, after S107 the first image subject from S101 and the image obtained in S107 may be stored in correspondence with each other. This classifies and labels the image. This solution has many application scenarios, which are not enumerated here.
With the embodiment of the present invention shown in Fig. 1, first, this solution does not merely associate the webpage information with the first image subject; it associates the second image subject of every image in the webpage information with the first image subject and obtains only the images associated with the first image subject, which improves the accuracy of image acquisition. Second, whereas existing solutions download all image resources in the webpage information, this solution downloads only those image resources that meet the preset rule, which reduces wasted resources. Third, this solution determines the second image subject of an image from the body matter associated with the image's position in the webpage information, which improves the accuracy of the determined image subject.
A specific embodiment is described below, as shown in Fig. 2:
1. Grab webpage information from the network.
2. Parse the grabbed webpage information: extract the content of the <body> tag, filter out noise information, and determine the webpage title, the <image> tags (image tags), and so on. From each <image> tag, determine the image URL, the image title, and the image tag attributes (such as the src and alt attributes). Perform image context identification: starting from the image URL, search a preset number of logical lines before and after it and judge whether each logical line is body matter; if so, determine that logical line to be body matter associated with the image URL.
3. From the parsing result of step 2, determine the second image subject of every image in the grabbed webpage information.
4. Calculate the degree of association between the second image subject of each image determined in step 3 and the predetermined first image subject, which is the image subject of the image the user wants to obtain. The degree of association may combine the direct and indirect degrees of association, or may be calculated directly (direct degree of association only).
5. Judge whether the degree of association obtained in step 4 meets the preset rule. If not, discard the corresponding image; if so, download the image.
6. Store the first image subject and the image in correspondence with each other.
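The parsing in step 2 can be illustrated with Python's standard `html.parser` module. This is a sketch under assumptions: it treats the image tag as HTML's `<img>` element, the class name `ImageTagParser` and the sample page are hypothetical, and noise filtering is omitted.

```python
from html.parser import HTMLParser


class ImageTagParser(HTMLParser):
    """Collects the page title and, for each <img>, its src, alt and source line."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.images = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "img":
            attributes = dict(attrs)
            self.images.append({
                "src": attributes.get("src"),
                "alt": attributes.get("alt"),
                "line": self.getpos()[0],  # 1-based line number of the tag
            })

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


# Hypothetical page used only for illustration.
html_doc = """<html><head><title>Mountain hikes</title></head>
<body>
<p>A trail above the lake.</p>
<img src="https://example.com/lake.jpg" alt="alpine lake">
</body></html>"""

parser = ImageTagParser()
parser.feed(html_doc)
```

The recorded line number gives a starting point for the context search over neighbouring logical lines, if a logical line is taken to be one line of the source.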
Corresponding to the method embodiment above, an embodiment of the present invention also provides a web-based image acquiring device, as shown in Fig. 3, comprising:
First determining module 301, for determining the first image subject of the image to be obtained;
Handling module 302, for grabbing webpage information from the network;
Second determining module 303, for determining the position of every image in the webpage information;
Identification module 304, for identifying, for each determined position, the body matter associated with the position;
Third determining module 305, for determining the second image subject of the image at the position according to the body matter;
Computing module 306, for calculating the degree of association between the second image subject and the first image subject;
Obtaining module 307, for obtaining the image at the position when the degree of association meets the preset rule.
In one implementation, identification module 304 may include a determining submodule and an identification submodule (not shown in the figure), wherein:
The determining submodule is for determining, for each determined position, the text content whose distance from the position is within a preset range;
The identification submodule is for identifying the body matter in the text content as the body matter associated with the position.
In one implementation, the determining submodule may specifically be used to: for each determined position, search a preset number of logical lines before and after the position, taking the position as the starting point;
The identification submodule may specifically be used to: for each logical line found, judge whether the logical line is body matter; if so, take the logical line as body matter associated with the position.
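The search performed by the determining submodule can be sketched as a simple window over a list of logical lines. The function name `context_lines` and the list-of-strings representation are assumptions made for illustration.

```python
def context_lines(lines, image_index, n):
    """Return up to n logical lines before and n after the image's own line."""
    start = max(0, image_index - n)  # clamp at the top of the page
    before = lines[start:image_index]
    after = lines[image_index + 1:image_index + 1 + n]  # slicing clamps at the end
    return before + after


# Hypothetical page split into logical lines; the image sits at index 2.
page_lines = ["a", "b", "<img>", "c", "d", "e"]
nearby = context_lines(page_lines, image_index=2, n=1)
```

Each returned line would then be judged individually to decide whether it is body matter.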
In one implementation, the identification submodule may specifically be used to:
For each logical line found, determine the text byte number of the logical line, that is, the number of bytes of text containing hypertext markup language (HTML) tags; if the text byte number is less than a first preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position;
Alternatively,
For each logical line found, determine the label text ratio of the logical line, which is the ratio of the number of bytes of text containing HTML tags to the total number of bytes in the logical line; if the label text ratio is less than a second preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position;
Alternatively,
For each logical line found, determine the hyperlink ratio of the logical line, which is the ratio of the number of hyperlink bytes to the total number of bytes; if the hyperlink ratio is less than a third preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position;
Alternatively,
For each logical line found, determine the text byte number, the label text ratio, and the hyperlink ratio of the logical line; if the text byte number is less than the first preset threshold, the label text ratio is less than the second preset threshold, and the hyperlink ratio is less than the third preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position.
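The combined check in the last alternative can be sketched as below. Several points are assumptions rather than details from the text: tag bytes are counted with a simple regular expression, hyperlink bytes are taken to be the bytes of whole `<a>...</a>` spans, an empty line is treated as not being body matter, and the three threshold values are placeholders.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r"<a\b[^>]*>.*?</a>", re.IGNORECASE | re.DOTALL)


def nbytes(text):
    """Length of the text in UTF-8 bytes."""
    return len(text.encode("utf-8"))


def line_metrics(line):
    """Return (tag bytes, tag bytes / total bytes, hyperlink bytes / total bytes)."""
    total = nbytes(line)
    tag_bytes = sum(nbytes(m) for m in TAG_RE.findall(line))
    link_bytes = sum(nbytes(m) for m in LINK_RE.findall(line))
    if total == 0:
        return tag_bytes, 0.0, 0.0
    return tag_bytes, tag_bytes / total, link_bytes / total


def is_body_matter(line, max_tag_bytes=40, max_tag_ratio=0.3, max_link_ratio=0.3):
    """All three values must fall below their (placeholder) thresholds."""
    if not line.strip():
        return False  # assumption: an empty line carries no body matter
    tag_bytes, tag_ratio, link_ratio = line_metrics(line)
    return (tag_bytes < max_tag_bytes
            and tag_ratio < max_tag_ratio
            and link_ratio < max_link_ratio)


paragraph = "<p>The lake sits at 2,000 metres and freezes every winter.</p>"
navigation = '<a href="/home">Home</a> <a href="/about">About</a>'
```

With these placeholder thresholds, a paragraph of prose passes, while a navigation line made almost entirely of links does not.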
In one implementation, third determining module 305 may specifically be used to:
For each logical line in the body matter, determine the weight of the logical line according to its text byte number and/or label text ratio and/or hyperlink ratio, together with the distance between the logical line and the image;
Determine the second image subject of the image at the position according to the content and weight of each logical line in the body matter.
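As an illustration of weighting logical lines, the sketch below uses only the distance to the image, with an assumed weight of 1 / (1 + distance), and takes the highest-scoring keywords as the second image subject. The weighting formula, the crude length-based word filter, and the function name `second_image_subject` are assumptions; the text also allows the byte number and ratios to feed into the weight, which is not shown here.

```python
import re
from collections import Counter


def second_image_subject(body_lines, top_k=3):
    """body_lines: (text, distance) pairs; distance is in logical lines from the image."""
    scores = Counter()
    for text, distance in body_lines:
        weight = 1.0 / (1 + abs(distance))  # assumed: closer lines count more
        for word in re.findall(r"[a-z]+", text.lower()):
            if len(word) > 3:  # crude filter for short function words (assumption)
                scores[word] += weight
    return [word for word, _ in scores.most_common(top_k)]


# Hypothetical body matter around one image.
body = [
    ("Alpine lake at dawn", 1),
    ("The lake trail starts here", 2),
    ("Parking costs extra", 5),
]
subject = second_image_subject(body, top_k=1)
```

A word that appears in several nearby lines accumulates weight from each of them, so it outranks words that appear once or only far from the image.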
In one implementation, third determining module 305 may specifically be used to:
Obtain one or more of the following: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image;
Input the obtained content and the body matter into a pre-built vector model to obtain the second image subject of the image at the position, wherein the vector model includes a weight for each input item.
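The text does not specify the vector model. One simple reading, shown here purely as an assumption, is a weighted term vector in which each input item (webpage title, image title, alt attribute, body matter) contributes its terms scaled by a per-item weight. The weight values and field names below are hypothetical.

```python
from collections import Counter

# Hypothetical per-input weights; the text only says each input item has a weight.
FIELD_WEIGHTS = {"page_title": 2.0, "image_title": 3.0, "alt": 3.0, "body": 1.0}


def weighted_term_vector(fields):
    """Build a term vector where each field adds its terms scaled by the field weight."""
    vector = Counter()
    for name, text in fields.items():
        weight = FIELD_WEIGHTS.get(name, 0.0)  # unknown fields contribute nothing
        for term in text.lower().split():
            vector[term] += weight
    return vector


vector = weighted_term_vector({
    "page_title": "lake hikes",
    "alt": "alpine lake",
    "body": "the lake freezes",
})
```

The resulting vector can serve as the second image subject and be compared with a vector for the first image subject.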
In one implementation, computing module 306 may specifically be used to:
Calculate the direct degree of association between the second image subject and the first image subject;
Obtain the indirect degree of association between the second image subject and the first image subject, the indirect degree of association being the degree of association between the upper-level webpage information of the webpage information and the first image subject;
Combine the direct and indirect degrees of association to calculate the degree of association between the second image subject and the first image subject.
In one implementation, second determining module 303 may specifically be used to: filter out the noise information in the webpage information; and determine the position of every image in the filtered webpage information.
An embodiment of the present invention also provides an electronic device, as shown in Fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, wherein the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404;
Memory 403, for storing a computer program;
Processor 401, for implementing any of the web-based image acquiring methods described above when executing the program stored in memory 403.
The communication bus of the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one magnetic disk storage. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the web-based image acquiring methods described above.
With the illustrated embodiments of the present invention, first, this solution does not merely associate the webpage information with the first image subject; it associates the second image subject of every image in the webpage information with the first image subject and obtains only the images associated with the first image subject, which improves the accuracy of image acquisition. Second, whereas existing solutions download all image resources in the webpage information, this solution downloads only those image resources that meet the preset rule, which reduces wasted resources. Third, this solution determines the second image subject of an image from the body matter associated with the image's position in the webpage information, which improves the accuracy of the determined image subject.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described in a related manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, the web-based image acquiring device embodiment shown in Fig. 3, the electronic device embodiment shown in Fig. 4, and the computer-readable storage medium embodiment above are substantially similar to the web-based image acquiring method embodiment shown in Figs. 1-2, so they are described relatively briefly; for related details, refer to the description of the method embodiment shown in Figs. 1-2.
Those of ordinary skill in the art will understand that all or some of the steps of the method embodiments above can be carried out by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within its scope of protection.

Claims (17)

1. A web-based image acquiring method, characterized by comprising:
Determining a first image subject of an image to be obtained;
Grabbing webpage information from a network;
Determining the position of every image in the webpage information;
For each determined position: identifying body matter associated with the position, and determining, according to the body matter, a second image subject of the image at the position; calculating the degree of association between the second image subject and the first image subject; and, if the degree of association meets a preset rule, obtaining the image at the position.
2. The method according to claim 1, characterized in that identifying, for each determined position, the body matter associated with the position comprises:
For each determined position, determining the text content whose distance from the position is within a preset range;
Identifying the body matter in the text content as the body matter associated with the position.
3. The method according to claim 2, characterized in that determining, for each determined position, the text content whose distance from the position is within a preset range comprises:
For each determined position, searching a preset number of logical lines before and after the position, taking the position as the starting point;
And identifying the body matter in the text content as the body matter associated with the position comprises:
For each logical line found, judging whether the logical line is body matter;
If so, taking the logical line as body matter associated with the position.
4. The method according to claim 3, characterized in that judging whether the logical line is body matter comprises:
Determining the text byte number of the logical line, being the number of bytes of text containing hypertext markup language tags;
If the text byte number is less than a first preset threshold, the logical line is body matter;
Alternatively,
Determining the label text ratio of the logical line, the label text ratio being the ratio of the number of bytes of text containing hypertext markup language tags to the total number of bytes in the logical line;
If the label text ratio is less than a second preset threshold, the logical line is body matter;
Alternatively,
Determining the hyperlink ratio of the logical line, the hyperlink ratio being the ratio of the number of hyperlink bytes to the total number of bytes;
If the hyperlink ratio is less than a third preset threshold, the logical line is body matter;
Alternatively,
Determining the text byte number, the label text ratio, and the hyperlink ratio of the logical line;
If the text byte number is less than the first preset threshold, the label text ratio is less than the second preset threshold, and the hyperlink ratio is less than the third preset threshold, the logical line is body matter.
5. The method according to claim 4, characterized in that determining, according to the body matter, the second image subject of the image at the position comprises:
For each logical line in the body matter, determining the weight of the logical line according to its text byte number and/or label text ratio and/or hyperlink ratio, together with the distance between the logical line and the image;
Determining the second image subject of the image at the position according to the content and weight of each logical line in the body matter.
6. The method according to claim 1, characterized in that determining, according to the body matter, the second image subject of the image at the position comprises:
Obtaining one or more of the following: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image;
Inputting the obtained content and the body matter into a pre-built vector model to obtain the second image subject of the image at the position, wherein the vector model includes a weight for each input item.
7. The method according to claim 1, characterized in that calculating the degree of association between the second image subject and the first image subject comprises:
Calculating the direct degree of association between the second image subject and the first image subject;
Obtaining the indirect degree of association between the second image subject and the first image subject, the indirect degree of association being the degree of association between the upper-level webpage information of the webpage information and the first image subject;
Combining the direct and indirect degrees of association to calculate the degree of association between the second image subject and the first image subject.
8. The method according to claim 1, characterized in that determining the position of every image in the webpage information comprises:
Filtering out the noise information in the webpage information;
Determining the position of every image in the filtered webpage information.
9. A web-based image acquiring device, characterized by comprising:
First determining module, for determining a first image subject of an image to be obtained;
Handling module, for grabbing webpage information from a network;
Second determining module, for determining the position of every image in the webpage information;
Identification module, for identifying, for each determined position, the body matter associated with the position;
Third determining module, for determining the second image subject of the image at the position according to the body matter;
Computing module, for calculating the degree of association between the second image subject and the first image subject;
Obtaining module, for obtaining the image at the position when the degree of association meets a preset rule.
10. The device according to claim 9, characterized in that the identification module comprises:
A determining submodule, for determining, for each determined position, the text content whose distance from the position is within a preset range;
An identification submodule, for identifying the body matter in the text content as the body matter associated with the position.
11. The device according to claim 10, characterized in that the determining submodule is specifically used to: for each determined position, search a preset number of logical lines before and after the position, taking the position as the starting point;
The identification submodule is specifically used to: for each logical line found, judge whether the logical line is body matter; if so, take the logical line as body matter associated with the position.
12. The device according to claim 11, characterized in that the identification submodule is specifically used to:
For each logical line found, determine the text byte number of the logical line, being the number of bytes of text containing hypertext markup language tags; if the text byte number is less than a first preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position;
Alternatively,
For each logical line found, determine the label text ratio of the logical line, the label text ratio being the ratio of the number of bytes of text containing hypertext markup language tags to the total number of bytes in the logical line; if the label text ratio is less than a second preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position;
Alternatively,
For each logical line found, determine the hyperlink ratio of the logical line, the hyperlink ratio being the ratio of the number of hyperlink bytes to the total number of bytes; if the hyperlink ratio is less than a third preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position;
Alternatively,
For each logical line found, determine the text byte number, the label text ratio, and the hyperlink ratio of the logical line; if the text byte number is less than the first preset threshold, the label text ratio is less than the second preset threshold, and the hyperlink ratio is less than the third preset threshold, the logical line is body matter, and the logical line is taken as body matter associated with the position.
13. The device according to claim 12, characterized in that the third determining module is specifically used to:
For each logical line in the body matter, determine the weight of the logical line according to its text byte number and/or label text ratio and/or hyperlink ratio, together with the distance between the logical line and the image;
Determine the second image subject of the image at the position according to the content and weight of each logical line in the body matter.
14. The device according to claim 9, characterized in that the third determining module is specifically used to:
Obtain one or more of the following: the webpage title of the grabbed webpage information, the image title of each image in the webpage information, and the image tag attributes of each image;
Input the obtained content and the body matter into a pre-built vector model to obtain the second image subject of the image at the position, wherein the vector model includes a weight for each input item.
15. The device according to claim 9, characterized in that the computing module is specifically used to:
Calculate the direct degree of association between the second image subject and the first image subject;
Obtain the indirect degree of association between the second image subject and the first image subject, the indirect degree of association being the degree of association between the upper-level webpage information of the webpage information and the first image subject;
Combine the direct and indirect degrees of association to calculate the degree of association between the second image subject and the first image subject.
16. The device according to claim 9, characterized in that the second determining module is specifically used to: filter out the noise information in the webpage information; and determine the position of every image in the filtered webpage information.
17. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
Memory, for storing a computer program;
Processor, for implementing the method steps of any one of claims 1-8 when executing the program stored in the memory.
CN201710744731.0A 2017-08-25 2017-08-25 Image acquisition method and device based on webpage Active CN110020042B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710744731.0A CN110020042B (en) 2017-08-25 2017-08-25 Image acquisition method and device based on webpage

Publications (2)

Publication Number Publication Date
CN110020042A true CN110020042A (en) 2019-07-16
CN110020042B CN110020042B (en) 2021-09-10

Family

ID=67186123

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710744731.0A Active CN110020042B (en) 2017-08-25 2017-08-25 Image acquisition method and device based on webpage

Country Status (1)

Country Link
CN (1) CN110020042B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078098A1 (en) * 2000-12-19 2002-06-20 Nec Corporation Document filing method and system
US20120163707A1 (en) * 2010-12-28 2012-06-28 Microsoft Corporation Matching text to images
CN102955795A (en) * 2011-08-24 2013-03-06 句容今太科技园有限公司 Web information acquisition system
CN102999489A (en) * 2011-09-08 2013-03-27 腾讯科技(深圳)有限公司 Method and system for image search of community website page
CN103425644A (en) * 2012-05-14 2013-12-04 腾讯科技(深圳)有限公司 Method and device for extracting pictures in webpage content
CN103544186A (en) * 2012-07-16 2014-01-29 富士通株式会社 Method and equipment for discovering theme key words in picture
CN103810303A (en) * 2014-03-18 2014-05-21 苏州大学 Image search method and system based on focus object recognition and theme semantics
CN104063489A (en) * 2014-07-04 2014-09-24 百度在线网络技术(北京)有限公司 Method and device for determining webpage image relevancy and displaying retrieved result
CN104281629A (en) * 2013-07-12 2015-01-14 贝壳网际(北京)安全技术有限公司 Method and device for extracting picture from webpage and client equipment
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郑莉霞: ""基于文本的Web图像检索技术研究"", 《中国优秀硕士学位论文全文数据库 (信息科技辑)》 *

Also Published As

Publication number Publication date
CN110020042B (en) 2021-09-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant