CN109101491B - Author information extraction method and device, computer device and computer readable storage medium - Google Patents

Author information extraction method and device, computer device and computer readable storage medium

Info

Publication number
CN109101491B
CN109101491B CN201810816328.9A CN201810816328A CN109101491B CN 109101491 B CN109101491 B CN 109101491B CN 201810816328 A CN201810816328 A CN 201810816328A CN 109101491 B CN109101491 B CN 109101491B
Authority
CN
China
Prior art keywords
text
webpage
author information
word
information extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810816328.9A
Other languages
Chinese (zh)
Other versions
CN109101491A (en)
Inventor
郑敏
王志超
赫中翮
毛建云
周忠诚
段炼
郭建京
曾琰
陈敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Xinghan Shuzhi Technology Co ltd
Original Assignee
Hunan Xinghan Shuzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Xinghan Shuzhi Technology Co ltd filed Critical Hunan Xinghan Shuzhi Technology Co ltd
Priority to CN201810816328.9A priority Critical patent/CN109101491B/en
Publication of CN109101491A publication Critical patent/CN109101491A/en
Application granted granted Critical
Publication of CN109101491B publication Critical patent/CN109101491B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering

Abstract

The invention relates to the technical field of the Internet and provides an author information extraction method, an author information extraction device, a computer device and a computer readable storage medium. The author information extraction method comprises the following steps: acquiring a webpage text, and extracting words in the webpage text that match a preset keyword set; acquiring a preset author information extraction rule matched with the position of the word in the webpage text; and extracting the author information of the webpage text according to the preset author information extraction rule. The method improves both the accuracy and the efficiency of author information extraction and is applicable to a wide range of webpage types.

Description

Author information extraction method and device, computer device and computer readable storage medium
Technical Field
The invention belongs to the technical field of internet, and particularly relates to an author information extraction method, an author information extraction device, a computer device and a computer readable storage medium.
Background
With the rapid development of Internet technology, the volume of online information is growing explosively. Massive webpage data often contains a large amount of valuable information, and author information is one such kind of information. On the one hand, author information can serve as the basis for associating persons in person-relationship analysis; on the other hand, it can be used to judge whether a webpage is worth reading further.
Existing author information extraction methods mainly target journal literature. For example, the invention patent No. 201410437424.4 discloses a method for extracting author affiliation information from English literature published by Chinese authors, and the invention patent No. 201210072645.7 discloses an author information mining method and system for academic journal papers. Because journal papers have a specific text structure, their author information is short, fixed in format and standardized in wording, so extraction is relatively simple. Webpages, by contrast, have complex structures and varied styles, and the position of the author information is not fixed, so both extraction accuracy and extraction efficiency are low. Therefore, there is a need for an author information extraction method, an author information extraction device, a computer device and a computer readable storage medium that can extract author information from webpages as accurately as possible.
Disclosure of Invention
The embodiment of the invention provides an author information extraction method, an author information extraction device, a computer device and a computer readable storage medium, and aims to solve the problems of low author information extraction accuracy and low efficiency in the prior art.
The invention is realized as follows. An author information extraction method includes:
acquiring a webpage text, and extracting words in the webpage text that match a preset keyword set;
acquiring a preset author information extraction rule matched with the position of the word in the webpage text, and extracting author information of the webpage text according to the preset author information extraction rule; this specifically comprises the following steps:
determining the position of the word in the webpage text;
when the word is between the title and the body of the webpage text, intercepting the content between the title and the body, and acquiring the position of the word in the intercepted content; intercepting, according to the position of the word, the content between the word and the line separator; and filtering the content intercepted the second time according to a preset rule to obtain the author information;
when the word is in the body of the webpage text, acquiring the row index and the in-row index of the word in the word segmentation set of the webpage text; and acquiring, as the author information, words in the word segmentation set that have the same row index as the word, whose in-row index differs from that of the word by no more than a preset threshold, and whose part of speech is person;
when the word is after the body of the webpage text, intercepting the content after the body, and acquiring the position of the word in the intercepted content; intercepting, according to the position of the word, the content between the word and the line separator; and filtering the content intercepted the second time according to a preset rule to obtain the author information.
The present invention also provides an author information extraction apparatus, comprising:
the word extraction module is used for acquiring a webpage text and extracting words in the webpage text that match a preset keyword set;
the information extraction module is used for acquiring a preset author information extraction rule matched with the position of the word in the webpage text and extracting author information of the webpage text according to the preset author information extraction rule; the information extraction module specifically comprises:
the word position determining unit is used for determining the position of the word in the webpage text;
the first information extraction unit is used for, when the word is between the title and the body of the webpage text, intercepting the content between the title and the body and acquiring the position of the word in the intercepted content; intercepting, according to the position of the word, the content between the word and the line separator; and filtering the content intercepted the second time according to a preset rule to obtain the author information;
the second information extraction unit is used for, when the word is in the body of the webpage text, acquiring the row index and the in-row index of the word in the word segmentation set of the webpage text; and acquiring, as the author information, words in the word segmentation set that have the same row index as the word, whose in-row index differs from that of the word by no more than a preset threshold, and whose part of speech is person;
the third information extraction unit is used for, when the word is after the body of the webpage text, intercepting the content after the body and acquiring the position of the word in the intercepted content; intercepting, according to the position of the word, the content between the word and the line separator; and filtering the content intercepted the second time according to a preset rule to obtain the author information.
The invention also provides a computer device, which comprises a processor, wherein the processor is used for realizing the steps of the author information extraction method when executing the computer program in the memory.
The present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the author information extraction method as described above.
According to the author information extraction method provided by the invention, a webpage text is acquired and words in the webpage text that match a preset keyword set are extracted; a preset author information extraction rule matched with the position of the word in the webpage text is then acquired; and finally the author information of the webpage text is extracted according to that rule. The method places no restriction on text structure, can extract author information from different types of webpages, and therefore has a wide application range; because different extraction rules are matched according to the position of the author information, both the accuracy and the efficiency of author information extraction are improved.
Drawings
Fig. 1 is a flowchart illustrating an implementation of an author information extraction method according to an embodiment of the present invention;
Fig. 2 is a flowchart illustrating an implementation of preprocessing a webpage according to an embodiment of the present invention;
Fig. 3 is a flowchart illustrating an implementation of determining the positions of the body and the title of a webpage text according to an embodiment of the present invention;
Fig. 4 is a flowchart illustrating an implementation of extracting words in a webpage text that match a preset keyword set according to an embodiment of the present invention;
Fig. 5 is a flowchart illustrating a first implementation of extracting author information from a webpage text according to a preset author information extraction rule according to an embodiment of the present invention;
Fig. 6 is a flowchart illustrating a second implementation of extracting author information from a webpage text according to a preset author information extraction rule according to an embodiment of the present invention;
Fig. 7 is a flowchart illustrating a third implementation of extracting author information from a webpage text according to a preset author information extraction rule according to an embodiment of the present invention;
Fig. 8 is a schematic structural diagram of an author information extraction apparatus according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a webpage preprocessing module according to an embodiment of the present invention;
Fig. 10 is a schematic structural diagram of a DOM tree parsing module according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a word extraction module according to an embodiment of the present invention;
Fig. 12 is a schematic structural diagram of a first information extraction module according to an embodiment of the present invention;
Fig. 13 is a schematic structural diagram of a second information extraction module according to an embodiment of the present invention;
Fig. 14 is a schematic structural diagram of a third information extraction module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
According to the author information extraction method provided by the embodiment of the invention, a webpage text is acquired and words in the webpage text that match a preset keyword set are extracted; a preset author information extraction rule matched with the position of the word in the webpage text is then acquired; and finally the author information of the webpage text is extracted according to that rule. The method places no restriction on text structure, can extract author information from different types of webpages, and therefore has a wide application range; because different extraction rules are matched according to the position of the author information, both the accuracy and the efficiency of author information extraction are improved.
Fig. 1 is a flowchart illustrating an implementation of an author information extraction method according to an embodiment of the present invention; the author information extraction method comprises the following steps:
In step S101, a webpage is acquired and preprocessed to obtain webpage information; the webpage information comprises the webpage text, the language of the webpage text and the webpage source code.
In the embodiment of the invention, the webpage is obtained through a webpage parser; the webpage parser includes, but is not limited to, Jsoup and Python's lxml.html. In this embodiment, Jsoup is used to implement the structured parsing function.
As an embodiment of the invention, the webpage text comprises the webpage title and the webpage body, and the language of the webpage text can be Chinese, English, Spanish and the like.
In step S102, a DOM tree is constructed from the webpage source code.
The DOM tree in the embodiments of the present invention is common general knowledge in the art and is not described here.
In step S103, the DOM tree is parsed, and the positions of the body and the title in the webpage text are determined.
In this embodiment, parsing the DOM tree explicitly yields the article title and the body content; the positions of the body and the title in the webpage text are then obtained through operations such as text segmentation and text similarity matching. The specific process is described in detail with reference to Fig. 3.
In step S104, the webpage text is acquired, and words in the webpage text that match a preset keyword set are extracted.
In the embodiment of the present invention, the preset keywords are preset keywords related to "author information", for example, "writing", "written by", "writing on", and the like.
In step S105, a preset author information extraction rule matched with the position of the word in the webpage text is acquired, and the author information of the webpage text is extracted according to the preset author information extraction rule.
In the embodiment of the invention, the word can stand in one of three positional relations to the webpage text: in the first, the word is between the title and the body of the webpage text; in the second, the word is in the body of the webpage text; in the third, the word is after the body of the webpage text. Matching a different preset author information extraction rule to each position gives good extraction accuracy and high extraction efficiency.
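The three positional relations can be illustrated with a small dispatcher. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the function name `classify_position`, the `Position` enum and the use of line indices for the title and the body boundaries are all hypothetical.

```python
from enum import Enum


class Position(Enum):
    BETWEEN_TITLE_AND_BODY = 1
    IN_BODY = 2
    AFTER_BODY = 3


def classify_position(word_line, title_line, body_first, body_last):
    """Map the line index of a keyword hit to one of the three cases.

    All arguments are line indices in the webpage text (hypothetical
    representation). Returns None for a hit on or before the title line,
    a situation the description does not cover.
    """
    if title_line < word_line < body_first:
        return Position.BETWEEN_TITLE_AND_BODY
    if body_first <= word_line <= body_last:
        return Position.IN_BODY
    if word_line > body_last:
        return Position.AFTER_BODY
    return None
```

Each returned case would then select the matching extraction rule of step S105.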
Referring to Fig. 2, step S101 includes:
Step S1011: parsing the webpage through an open-source webpage parsing tool class to obtain the webpage text;
Step S1012: performing natural language processing on the webpage text and identifying the language of the webpage text;
Step S1013: performing word segmentation on the webpage text to obtain the word segmentation set corresponding to the webpage text.
Through the above steps, the webpage text, the language of the webpage text and the word segmentation set corresponding to the webpage text can be obtained. For example, the webpage content is: "Here is the last bend of the nine-bend Yellow River - Xinhuanet; Zhangzhuang, wind-blown sand, Lankao, rural revitalization, poverty alleviation; Here is the last bend of the nine-bend Yellow River: Zhangzhuang Village, Lankao, Henan, an ordinary yet special village right beside the last bend of the nine-bend Yellow River. Like thousands of villages in China, with the implementation of the poverty alleviation and rural revitalization strategies, a brand-new Zhangzhuang Village is presented to the world."
After the above webpage content is parsed by Jsoup, lxml.html or another mainstream parsing tool, the following webpage text is obtained: "<title> Here is the last bend of the nine-bend Yellow River - Xinhuanet; <keywords content> Zhangzhuang, wind-blown sand, Lankao, rural revitalization, poverty alleviation; <description content> Here is the last bend of the nine-bend Yellow River: Zhangzhuang Village, Lankao, Henan, an ordinary yet special village right beside the last bend of the nine-bend Yellow River. Like thousands of villages in China, with the implementation of the poverty alleviation and rural revitalization strategies, a brand-new Zhangzhuang Village is presented to the world." Natural language processing of the webpage text identifies its language as "Chinese"; segmenting the title of the webpage text yields the corresponding word segmentation set {here, nine-bend Yellow River, last, one, bend, Xinhuanet}.
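The three preprocessing steps can be sketched as follows. This is only an illustration under stated assumptions: the embodiment uses Jsoup and an unspecified natural language processing tool, whereas this sketch substitutes Python's standard `html.parser`, a crude CJK-character test for language identification, and a regular expression in place of a real word segmenter. The names `TextExtractor` and `preprocess` are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text fragments, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.lines.append(text)


def preprocess(html):
    """Return (webpage text, language guess, per-line token lists)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "\n".join(parser.lines)
    # Assumption: any CJK ideograph means Chinese; a real system would
    # use a proper language identifier.
    language = "zh" if re.search(r"[\u4e00-\u9fff]", text) else "other"
    # Assumption: \w+ runs stand in for word segmentation; Chinese text
    # would need a dedicated segmenter.
    tokens = [re.findall(r"\w+", line) for line in parser.lines]
    return text, language, tokens
```

The per-line token lists play the role of the word segmentation set, with the outer index as the row index and the inner index as the in-row index.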
Referring to Fig. 3, step S103 includes:
Step S1031: acquiring the body and the title of the webpage text;
Step S1032: dividing the body into lines and storing them in a body set; dividing the webpage text into lines and storing them in a webpage text set;
Step S1033: matching the first element and the last element of the body set against the elements of the webpage text set one by one to obtain the positions of the first line and the last line of the body in the webpage text;
Step S1034: matching the title against the elements of the webpage text set one by one to obtain the position of the title in the webpage text.
In this embodiment, the body and the title of the webpage text are first obtained through specific tags; for example, the text in the <title> tag is the article title, and the text in <p> tags is mostly body content. The body and the webpage text are then split into lines at the line separators by string interception. Finally, text similarity is calculated between each line of the body and each line of the webpage text, thereby determining the positions of the body and the title in the webpage text.
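Steps S1031 to S1034 can be sketched with a line-by-line similarity match. The description does not name a similarity measure, so this sketch assumes `difflib.SequenceMatcher` from the Python standard library; the function names `best_match` and `locate` are hypothetical.

```python
from difflib import SequenceMatcher


def best_match(line, page_lines):
    """Index of the page line most similar to `line` (first one on ties)."""
    scores = [SequenceMatcher(None, line, candidate).ratio()
              for candidate in page_lines]
    return max(range(len(page_lines)), key=scores.__getitem__)


def locate(title, body_text, page_text):
    """Find the title line and the first and last body lines in the page."""
    body_lines = [ln for ln in body_text.splitlines() if ln.strip()]
    page_lines = page_text.splitlines()
    return {
        "title": best_match(title, page_lines),
        "body_first": best_match(body_lines[0], page_lines),
        "body_last": best_match(body_lines[-1], page_lines),
    }
```

Only the first and last body lines are matched, as in step S1033; every line between them is then treated as body.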
Referring to Fig. 4, step S104 includes:
Step S1041: judging whether the preset keyword set contains a word that exists in the word segmentation set or in the webpage text;
Step S1042: when the judgment result is yes, acquiring the preset author information extraction rule matched with the position of the word in the webpage text.
In this embodiment, after the webpage text is acquired, whether the preset keyword set contains a word that exists in the word segmentation set or in the webpage text is judged as follows: the word segmentation set or the webpage text is traversed for each preset keyword; when a matching word is found (that is, the judgment result is yes), the position of the word in the webpage text is acquired and matched with the corresponding preset author information extraction rule, so that the author information of the webpage text can be extracted; when no matching word is found (that is, the judgment result is no), the loop proceeds to the next preset keyword.
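The keyword traversal can be sketched as a nested loop over keywords and lines. This is an illustrative sketch only: the function name `find_keyword_hits` is hypothetical, and case-insensitive matching is an assumption that the description does not state.

```python
def find_keyword_hits(page_lines, keywords):
    """Return (keyword, line index, offset) for every keyword occurrence.

    The outer loop runs over the preset keywords, so a keyword with no
    match simply falls through to the next keyword. Offsets are measured
    in the lower-cased line (assumption: case-insensitive matching).
    """
    hits = []
    for keyword in keywords:
        needle = keyword.lower()
        for line_index, line in enumerate(page_lines):
            offset = line.lower().find(needle)
            if offset != -1:
                hits.append((keyword, line_index, offset))
    return hits
```

The line index and offset of each hit correspond to the position of the word that is later used to select an extraction rule.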
Referring to Fig. 5, step S105 includes:
Step S1051: determining the position of the word in the webpage text;
Step S1052: when the word is between the title and the body of the webpage text, intercepting the content between the title and the body, and acquiring the position of the word in the intercepted content; intercepting, according to the position of the word, the content between the word and the line separator; and filtering the content intercepted the second time according to a preset rule to obtain the author information.
In this embodiment, the position of the word in the webpage text is determined by the index of the webpage text set element (line) that contains the word and by the offset of the word within that line.
In this embodiment, "position" refers to the paragraph line, and "separator" refers to characters such as "," and ";"; these terms are not explained again below.
As a preferred embodiment of the present invention, the filtering rule is selected according to the language. Specifically: when the language of the webpage text is Chinese and the content intercepted the second time contains no more than 6 characters, the content is considered to contain author information; otherwise it is discarded. When the language of the webpage text is English or Spanish and the content intercepted the second time contains no more than 15 characters, the content is considered to contain author information; otherwise it is discarded.
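The interception and length filter of the first rule can be sketched as follows. The length limits (6 characters for Chinese, 15 for English and Spanish) come from the description; everything else is an assumption made for illustration, including the function name `extract_after_keyword`, the exact separator set, taking the content that follows the keyword, and stripping a leading colon.

```python
import re

# Length limits from the description, keyed by hypothetical language codes.
MAX_LEN = {"zh": 6, "en": 15, "es": 15}
# Assumption: ASCII and full-width commas and semicolons, plus newline.
SEPARATORS = r"[,;，；\n]"


def extract_after_keyword(lines, keyword, language):
    """Return the short span following `keyword`, or None if filtered out.

    `lines` is the content already intercepted between title and body.
    """
    for line in lines:
        offset = line.find(keyword)
        if offset == -1:
            continue
        # Second interception: from the keyword up to the next separator.
        tail = line[offset + len(keyword):]
        candidate = re.split(SEPARATORS, tail, maxsplit=1)[0].strip(" :：")
        # Preset filtering rule: keep only sufficiently short candidates.
        if candidate and len(candidate) <= MAX_LEN[language]:
            return candidate
    return None
```

The same routine could serve the third rule, applied to the lines intercepted after the body instead.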
Referring to Fig. 6, step S105 includes:
Step S1051: determining the position of the word in the webpage text;
Step S1053: when the word is in the body of the webpage text, acquiring the row index and the in-row index of the word in the word segmentation set of the webpage text; and acquiring, as the author information, words in the word segmentation set that have the same row index as the word, whose in-row index differs from that of the word by no more than a preset threshold, and whose part of speech is person.
It is to be understood that "row index" in this embodiment refers to the index of the row in which the word is located, and "in-row index" refers to the index of the word within that row.
In this embodiment, the preset threshold of the in-row index difference is between 0 and 5. For example, if the row index of the word in the word segmentation set of the webpage text is "10" and its in-row index is "4", then the words in the word segmentation set whose row index is "10", whose in-row index differs from "4" by no more than the preset threshold, and whose part of speech is person are acquired and used as the author information.
As a preferred embodiment of the present invention, the preset threshold of the in-row index difference is 2 for Chinese and 5 for other languages such as English and Spanish.
In this embodiment, there is another processing method for the case in which the word is in the body of the webpage text: acquiring the line index of the word in the webpage text set; acquiring the corresponding line of text from the webpage text set according to the line index, and intercepting, within that line, the content between the word and the line separator (such as "," or ";"); and filtering the intercepted content according to a preset rule to obtain the author information.
Of the two processing methods, the first (extraction from the segmented text) is used preferentially; if it yields no result, the second (string interception) is used. This prevents author information from being missed when the word that contains it is not tagged with the part of speech Person in the segmented text.
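The index-window rule can be sketched as a filter over one row of tagged tokens. The thresholds (2 for Chinese, 5 for other languages) come from the preferred embodiment; the function name `person_near_keyword`, the `(word, tag)` token representation and the tag label `"PERSON"` are hypothetical, and a part-of-speech tagger is assumed to have run beforehand.

```python
def person_near_keyword(tokens, row, col, language):
    """Collect person-tagged words near the keyword at tokens[row][col].

    `tokens` is a list of rows; each row is a list of (word, tag) pairs.
    """
    threshold = 2 if language == "zh" else 5
    result = []
    for index, (word, tag) in enumerate(tokens[row]):
        # Same row, in-row index within the threshold, tagged as a person.
        if index != col and abs(index - col) <= threshold and tag == "PERSON":
            result.append(word)
    return result
```

An empty result is the signal to fall back to the string interception method.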
Referring to Fig. 7, step S105 includes:
Step S1051: determining the position of the word in the webpage text;
Step S1054: when the word is after the body of the webpage text, intercepting the content after the body, and acquiring the position of the word in the intercepted content; intercepting, according to the position of the word, the content between the word and the line separator; and filtering the content intercepted the second time according to a preset rule to obtain the author information.
In this embodiment, the content after the body is intercepted, and the next operation is performed when the word appears within lines 1 to 15 after the body; a word beyond this range is unlikely to be author information and is not considered.
Different extraction rules are matched according to the three positional relations of the word in the webpage text. The method is not limited to structured text types such as academic journals and papers and therefore generalizes well, improving both the accuracy and the efficiency of extraction; filtering the extracted information through the preset rule further improves the accuracy of author information extraction.
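The 15-line window after the body can be sketched as a bounded scan. The limit of 15 lines comes from the description; the function name `keyword_line_after_body` and the line-index representation of the body boundary are hypothetical.

```python
MAX_LINES_AFTER_BODY = 15  # range stated in the description


def keyword_line_after_body(page_lines, body_last, keywords):
    """Return (distance, line) for the first keyword hit after the body.

    `body_last` is the index of the last body line. Only lines 1 to 15
    after the body are scanned; later hits are ignored.
    """
    start = body_last + 1
    window = page_lines[start:start + MAX_LINES_AFTER_BODY]
    for distance, line in enumerate(window, start=1):
        for keyword in keywords:
            if keyword in line:
                return distance, line
    return None
```

The returned line would then go through the second interception and the length filter.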
Fig. 8 is a schematic structural diagram of an author information extraction apparatus 100 according to an embodiment of the present invention; for convenience of description, only the portions relevant to the embodiment of the present invention are shown. The author information extraction apparatus 100 includes:
the webpage preprocessing module 110 is configured to acquire a webpage and preprocess the webpage to obtain webpage information; the webpage information comprises the webpage text, the language of the webpage text and the webpage source code;
In the embodiment of the present invention, the webpage preprocessing module 110 obtains the webpage through a webpage parser; the webpage parser includes, but is not limited to, Jsoup and Python's lxml.html. In this embodiment, Jsoup is used to implement the structured parsing function.
As an embodiment of the invention, the webpage text comprises the webpage title and the webpage body, and the language of the webpage text can be Chinese, English, Spanish and the like.
A DOM tree building module 120, configured to build a DOM tree from the webpage source code;
The DOM tree in the embodiments of the present invention is common general knowledge in the art and is not described here.
A DOM tree parsing module 130, configured to parse the DOM tree and determine the positions of the body and the title in the webpage text.
In this embodiment, parsing the DOM tree explicitly yields the article title and the body content; the positions of the body and the title in the webpage text are then obtained through operations such as text segmentation and text similarity matching. The specific process is described in detail with reference to Fig. 10.
The word extraction module 140 is configured to acquire the webpage text and extract words in the webpage text that match a preset keyword set;
In the embodiment of the present invention, the preset keywords are preset keywords related to "author information", for example, "writing", "written by", "writing on", and the like.
The information extraction module 150 is configured to acquire a preset author information extraction rule matched with the position of the word in the webpage text, and extract the author information of the webpage text according to the preset author information extraction rule;
In the embodiment of the invention, the word can stand in one of three positional relations to the webpage text: in the first, the word is between the title and the body of the webpage text; in the second, the word is in the body of the webpage text; in the third, the word is after the body of the webpage text. Matching a different preset author information extraction rule to each position gives good extraction accuracy and high extraction efficiency.
Referring to Fig. 9, the webpage preprocessing module 110 includes:
the webpage parsing unit 111 is configured to parse the webpage through an open-source webpage parsing tool class to obtain the webpage text;
the language identification unit 112 is configured to perform natural language processing on the webpage text and identify the language of the webpage text;
the text word segmentation unit 113 is configured to perform word segmentation on the webpage text to obtain the word segmentation set corresponding to the webpage text.
The webpage text, the language of the webpage text and the word segmentation set corresponding to the webpage text can be obtained through the webpage preprocessing module 110. For example, the webpage content is: "Here is the last bend of the nine-bend Yellow River - Xinhuanet; Zhangzhuang, wind-blown sand, Lankao, rural revitalization, poverty alleviation; Here is the last bend of the nine-bend Yellow River: Zhangzhuang Village, Lankao, Henan, an ordinary yet special village right beside the last bend of the nine-bend Yellow River. Like thousands of villages in China, with the implementation of the poverty alleviation and rural revitalization strategies, a brand-new Zhangzhuang Village is presented to the world."
After the above webpage content is parsed by Jsoup, lxml.html or another mainstream parsing tool in the webpage parsing unit 111, the following webpage text is obtained: "<title> Here is the last bend of the nine-bend Yellow River - Xinhuanet; <keywords content> Zhangzhuang, wind-blown sand, Lankao, rural revitalization, poverty alleviation; <description content> Here is the last bend of the nine-bend Yellow River: Zhangzhuang Village, Lankao, Henan, an ordinary yet special village right beside the last bend of the nine-bend Yellow River. Like thousands of villages in China, with the implementation of the poverty alleviation and rural revitalization strategies, a brand-new Zhangzhuang Village is presented to the world." The language identification unit 112 performs natural language processing on the webpage text and identifies its language as "Chinese"; the text word segmentation unit 113 segments the title of the webpage text to obtain the corresponding word segmentation set {here, nine-bend Yellow River, last, one, bend, Xinhuanet}.
Referring to Fig. 10, the DOM tree parsing module 130 includes:
a body and title obtaining unit 131, configured to obtain the body and the title of the webpage text;
a line block dividing unit 132, configured to divide the body into lines and store them in a body set, and to divide the webpage text into lines and store them in a webpage text set;
an element matching unit 133, configured to match the first element and the last element of the body set against the elements of the webpage text set one by one to obtain the positions of the first line and the last line of the body in the webpage text;
a title position determining unit 134, configured to match the title against the elements of the webpage text set one by one to obtain the position of the title in the webpage text.
In this embodiment, the body and title obtaining unit 131 obtains the body and the title of the webpage text through specific tags; for example, the text in the <title> tag is the article title, and the text in <p> tags is mostly body content. The line block dividing unit 132 then splits the body and the webpage text into lines at the line separators by string interception. The element matching unit 133 calculates the text similarity between each line of the body and each line of the webpage text to obtain the start and end positions of the body. The title position determining unit 134 calculates the similarity between the title and each line of the webpage text set, thereby determining the position of the title.
Referring to fig. 11, the word extraction module 140 includes:
a word judgment unit 141, configured to judge whether a preset keyword set contains a word that exists in the word segmentation set or exists in the web page text;
and the rule matching unit 142 is configured to, if the determination result is yes, obtain a preset author information extraction rule that matches the position of the word in the web page text.
In this embodiment, after the web page text is acquired, whether the preset keyword set contains a word that exists in the word segmentation set or in the web page text is determined as follows: the word segmentation set or the web page text is traversed for each preset keyword in turn. When a matching word is found (that is, the judgment result is yes), the position of the word in the web page text is acquired and matched with the corresponding preset author information extraction rule, so as to extract the author information of the web page text. When no matching word is found (that is, the judgment result is no), the loop proceeds to the next preset keyword.
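The keyword traversal and rule selection above can be sketched as follows. This is a sketch under stated assumptions: the keyword strings, the rule labels, and the function names `find_keyword` and `classify_position` are hypothetical, and the title and body positions are assumed to have been determined beforehand.

```python
def find_keyword(keywords, page_lines):
    """Return (keyword, line_index, offset) for the first preset keyword
    found in the page lines, or None if no keyword occurs."""
    for keyword in keywords:                       # one pass per preset keyword
        for index, line in enumerate(page_lines):
            offset = line.find(keyword)
            if offset != -1:                       # judgment result is "yes"
                return keyword, index, offset
    return None                                    # no keyword matched


def classify_position(line_index, title_index, body_start, body_end):
    """Map the keyword's line index to one of the three positional cases."""
    if title_index < line_index < body_start:
        return "between_title_and_body"
    if body_start <= line_index <= body_end:
        return "in_body"
    if line_index > body_end:
        return "after_body"
    return "unknown"
```

The returned label would then select which of the three extraction rules is applied to the line.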
Referring to fig. 12, the information extraction module 150 includes:
a word position determination unit 151, configured to determine a position of the word in the web page text;
a first information extraction unit 152, configured to, when the word is between the title and the body of the web page text, extract the content between the title and the body and obtain the position of the word in the extracted content; extract, according to the position of the word, the content between the word and the line separator; and filter the content extracted the second time according to a preset rule to obtain the author information.
In the present embodiment, the word position determination unit 151 determines the position of the word in the web page text by the index and the in-line offset of the word in the web page text set element.
In this embodiment, "position" refers to a paragraph line and "separator" refers to characters such as "," and ";"; these terms are not explained again hereinafter.
As a preferred embodiment of the present invention, the filtering rule is selected according to the language as follows: when the language of the web page text is Chinese and the content extracted the second time has 6 characters or fewer, the content is considered to contain author information; otherwise it is discarded. When the language of the web page text is English or Spanish and the content extracted the second time has 15 characters or fewer, the content is considered to contain author information; otherwise it is discarded.
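The second extraction (from the keyword to the next separator) and the language-dependent length filter can be sketched as follows. The language codes, the function names, and the choice to discard text in unlisted languages are assumptions made for illustration; the character limits of 6 and 15 come from the description.

```python
import re

SEPARATORS = ",;"                               # separators named in the description
MAX_CHARS = {"zh": 6, "en": 15, "es": 15}       # language codes are assumed labels


def text_up_to_separator(line, keyword):
    """Return the text between the keyword and the next separator in the line."""
    start = line.find(keyword)
    if start == -1:
        return None
    rest = line[start + len(keyword):]
    pattern = "[" + re.escape(SEPARATORS) + "]"
    return re.split(pattern, rest, maxsplit=1)[0].strip()


def filter_by_length(candidate, language):
    """Keep the candidate only if it is within the limit for its language."""
    limit = MAX_CHARS.get(language)
    if not candidate or limit is None or len(candidate) > limit:
        return None                             # discarded as unlikely author info
    return candidate
```

For example, `filter_by_length(text_up_to_separator(line, "Author:"), "en")` would chain the two steps for an English page.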
Referring to fig. 13, the information extraction module 150 includes:
a word position determination unit 151, configured to determine a position of the word in the web page text;
a second information extraction unit 153, configured to obtain, when the word is in the body of the web page text, a row index and an in-row index of the word in the word segmentation set of the web page text; and to acquire, as author information, words in the word segmentation set that have the same row index as the word, whose in-row index differs from that of the word by no more than a preset threshold, and whose part of speech is person;
it is to be understood that "row index" in this embodiment refers to the index of the row in which the word is located, and "in-row index" refers to the index of the word within the current row.
In this embodiment, the "preset threshold" of the in-row index difference is between 0 and 5. For example, if the row index of the acquired word in the word segmentation set of the web page text is "10" and its in-row index is "4", then a word in the word segmentation set whose row index is "10", whose in-row index is within the preset threshold of "4", and whose part of speech is person is acquired and used as the author information.
As a preferred embodiment of the present invention, the preset threshold of the in-row index difference is 2 for Chinese and 5 for other languages such as English and Spanish.
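The row index and in-row index rule can be sketched as follows. The `Token` structure, the `"person"` tag string, the language codes, and the function name `nearby_persons` are hypothetical; the thresholds of 2 for Chinese and 5 for other languages come from the description. A real system would obtain the tokens and tags from a word segmenter and part-of-speech tagger, which are not shown here.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    pos: str    # part-of-speech tag; "person" marks a personal name (assumed label)
    row: int    # row index: the line containing the word
    col: int    # in-row index: position of the word within that line


THRESHOLDS = {"zh": 2}      # Chinese uses 2
DEFAULT_THRESHOLD = 5       # English, Spanish, and other languages use 5


def nearby_persons(tokens, keyword_token, language):
    """Return person-tagged words on the keyword's row whose in-row index
    differs from the keyword's by no more than the language threshold."""
    limit = THRESHOLDS.get(language, DEFAULT_THRESHOLD)
    return [
        t.word
        for t in tokens
        if t is not keyword_token
        and t.row == keyword_token.row
        and t.pos == "person"
        and abs(t.col - keyword_token.col) <= limit
    ]
```

Returning a list rather than a single word allows for multi-token names and co-authors; the description does not say how several matches should be combined.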
In this embodiment, the second information extraction unit 153 may be further configured to obtain the line index of the word in the body set of the web page text; acquire the corresponding line in the body set according to the line index, and extract the content between the word and the line separator (such as ")") within that line; and filter the content from the word to the line separator according to a preset rule to obtain the author information.
Of the two modes of the second information extraction unit 153 defined above, the first (extraction from the word-segmented text) is used preferentially. If extraction from the word-segmented text yields no result, the second mode (locating the word and extracting the content up to the separator) is used, so that author information is not missed when the word carrying it is not tagged with the part of speech person.
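The priority between the two modes can be sketched as a simple fallback chain. This is a generic illustration: `first_result` is a hypothetical helper, and the two strategies are passed in as zero-argument callables so that the part-of-speech mode and the separator mode can be any concrete implementations.

```python
from typing import Callable, Optional, Sequence

Extractor = Callable[[], Optional[str]]


def first_result(extractors: Sequence[Extractor]) -> Optional[str]:
    """Run the extractors in priority order and return the first
    non-empty result, or None if every extractor comes back empty."""
    for extract in extractors:
        result = extract()
        if result:
            return result
    return None
```

In use, the first callable would wrap the part-of-speech mode and the second the separator mode, so the second only runs when the first finds no person-tagged word.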
Referring to fig. 14, the information extraction module 150 includes:
a word position determination unit 151, configured to determine a position of the word in the web page text;
a third information extraction unit 154, configured to, when the word is after the body of the web page text, extract the content after the body and obtain the position of the word in the extracted content; extract, according to the position of the word, the content between the word and the line separator; and filter the content extracted the second time according to a preset rule to obtain the author information.
In this embodiment, the content after the body is extracted, and the next operation is carried out only when the word appears within lines 1 to 15 after the body; when the word lies beyond this range, it is less likely to be author information and is not considered.
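The 15-line window after the body can be sketched as follows. The function name `keyword_line_after_body` and the zero-based `body_end` index are assumptions for illustration; the limit of 15 lines comes from the description.

```python
MAX_LINES_AFTER_BODY = 15   # window size stated in the description


def keyword_line_after_body(page_lines, body_end, keyword):
    """Return the index of the first line containing the keyword within
    1 to 15 lines after the last body line, or None if there is none."""
    window = page_lines[body_end + 1 : body_end + 1 + MAX_LINES_AFTER_BODY]
    for distance, line in enumerate(window, start=1):
        if keyword in line:
            return body_end + distance
    return None                 # keyword absent or too far from the body
```

A hit returned here would then go through the same separator-based extraction and length filter as the other positional cases.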
Different extraction rules are matched according to the three different positional relations of the word in the web page text. The extraction method is therefore not limited to structured text types such as academic journals and papers and has good universality, and both the accuracy and the efficiency of extraction are improved. Filtering the extracted information through the preset rule further improves the accuracy of author information extraction.
An embodiment of the present invention provides a computer apparatus, where the computer apparatus includes a processor, and the processor is configured to implement the steps of the webpage text extraction method provided in each of the above method embodiments when executing a computer program in a memory.
Illustratively, a computer program can be partitioned into one or more modules, which are stored in memory and executed by a processor to implement the present invention. One or more of the modules may be a sequence of computer program instruction segments for describing the execution of a computer program in a computer device that is capable of performing certain functions. For example, the computer program may be divided into the steps of the web page text extraction method provided by the various method embodiments described above.
Those skilled in the art will appreciate that the above description of a computer apparatus is by way of example only and is not intended to be limiting of computer apparatus, and that the apparatus may include more or less components than those described, or some of the components may be combined, or different components may be included, such as input output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor; the processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or other volatile solid state storage device.
The modules/units integrated by the computer device may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the flow of the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer readable storage medium and used by a processor to implement the steps of the embodiments of the web page text extraction method. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier wave signals, electrical signals, software distribution medium, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (14)

1. An author information extraction method, comprising:
acquiring a webpage text, and extracting words which accord with a preset keyword set in the webpage text;
acquiring a preset author information extraction rule matched with the position of the word in the webpage text, and extracting author information of the webpage text according to the preset author information extraction rule; the method specifically comprises the following steps:
determining a position of the word in the webpage text;
when the words are between the titles and the texts of the webpage texts, intercepting the contents between the titles and the texts to obtain the positions of the words in the intercepted contents; intercepting the content of the words between the line separators according to the positions of the words; filtering the content intercepted for the second time according to the preset author information extraction rule to obtain author information;
when the words are in the text of the webpage text, acquiring the row index and the in-row index of the words in the word segmentation set of the webpage text; acquiring, as author information, words in the word segmentation set that have the same row index as the words, whose in-row index differs by no more than a preset threshold, and whose part of speech is person;
when the word is behind the body of the webpage text, intercepting the content behind the body, and acquiring the position of the word in the intercepted content; intercepting the content of the words between the line separators according to the positions of the words; and filtering the content intercepted for the second time according to the preset author information extraction rule to obtain the author information.
2. The author information extraction method of claim 1 wherein when the term is within the body of the web page text, a line index of the term in a body set of the web page text is obtained; acquiring corresponding line texts in a text set according to the line indexes, and intercepting contents between the words and the line separators in the text lines; and filtering the content from the words to the line separators according to the preset author information extraction rule to obtain the author information.
3. The author information extraction method of claim 1, prior to the acquiring the web page text, further comprising:
acquiring a webpage, and preprocessing the webpage to acquire webpage information; the webpage information comprises a webpage text, a webpage text language and a webpage source code;
constructing a DOM tree for the webpage source code;
and analyzing the DOM tree, and determining the positions of the text body and the title of the webpage text.
4. The author information extraction method of claim 3, wherein the preprocessing the web page comprises:
analyzing the webpage through an open source webpage analyzing tool class to obtain a webpage text;
carrying out natural language processing on the webpage text, and identifying the language of the webpage text;
and performing word segmentation on the webpage text to obtain a word segmentation set corresponding to the webpage text.
5. The author information extraction method of claim 4, wherein the extracting words in the webpage text that conform to a preset keyword set comprises:
judging whether a preset keyword set contains words existing in the word segmentation set or the webpage text;
and when the judgment result is yes, acquiring a preset author information extraction rule matched with the position of the word in the webpage text.
6. The author information extraction method of claim 3, wherein the determining the positions of the body and the title of the web page text comprises:
acquiring the text and the title of a webpage text;
dividing the text into lines and storing the lines into a text set; dividing the webpage text into lines and storing the lines into a webpage text set;
matching a first element and a last element of a text set with elements of a webpage text set one by one to obtain the positions of a first line and a last line of the text in the webpage text;
and matching the title with the webpage text set elements one by one to obtain the position of the title in the webpage text.
7. An author information extraction apparatus, comprising:
the word extraction module is used for acquiring a webpage text and extracting words which accord with a preset keyword set in the webpage text;
the information extraction module is used for acquiring a preset author information extraction rule matched with the position of the word in the webpage text and extracting author information of the webpage text according to the preset author information extraction rule; the information extraction module specifically comprises:
the word position determining unit is used for determining the position of the word in the webpage text;
the first information extraction unit is used for intercepting the content between the title and the text when the word is between the title and the text of the webpage text to obtain the position of the word in the intercepted content; intercepting the content of the words between the line separators according to the positions of the words; filtering the content intercepted for the second time according to the preset author information extraction rule to obtain author information;
the second information extraction unit is used for acquiring a row index and an in-row index of the word in a word segmentation set of the webpage text when the word is in the text of the webpage text; and acquiring, as author information, words in the word segmentation set that have the same row index as the word, whose in-row index differs by no more than a preset threshold, and whose part of speech is person;
the third information extraction unit is used for intercepting the content of the text when the word is behind the text of the webpage text and acquiring the position of the word in the intercepted content; intercepting the content of the words between the line separators according to the positions of the words; and filtering the content intercepted for the second time according to the preset author information extraction rule to obtain the author information.
8. The author information extracting apparatus of claim 7, wherein the second information extracting unit is configured to acquire a line index of the word in a body set of the web text when the word is within the body of the web text; acquiring corresponding line texts in a text set according to the line indexes, and intercepting contents between the words and the line separators in the text lines; and filtering the content from the words to the line separators according to the preset author information extraction rule to obtain the author information.
9. The author information extracting apparatus as claimed in claim 7, further comprising:
the webpage preprocessing module is used for acquiring a webpage and preprocessing the webpage to acquire webpage information; the webpage information comprises a webpage text, a webpage text language and a webpage source code;
the DOM tree building module is used for building a DOM tree for the webpage source code;
and the DOM tree analyzing module is used for analyzing the DOM tree and determining the positions of the text body and the title of the webpage text.
10. The author information extraction apparatus of claim 9, wherein the preprocessing module comprises:
the webpage analyzing unit is used for analyzing the webpage through an open-source webpage analyzing tool class to obtain a webpage text;
a language identification unit, configured to perform natural language processing on the web page text, and identify a language of the web page text;
and the text word segmentation unit is used for segmenting the webpage text to obtain a word segmentation set corresponding to the webpage text.
11. The author information extraction apparatus of claim 10, wherein the word extraction module comprises:
the word judgment unit is used for judging whether a preset keyword set contains words existing in the word segmentation set or the webpage text;
and the rule matching unit is used for acquiring a preset author information extraction rule matched with the position of the word in the webpage text when the judgment result is yes.
12. The author information extraction apparatus of claim 9, wherein the DOM tree parsing module comprises:
the text and title acquisition unit is used for acquiring the text and title of the webpage text;
the line block segmentation unit is used for segmenting the text into lines and storing the lines into a text set; dividing the webpage text into lines and storing the lines into a webpage text set;
the element matching unit is used for matching the first element and the last element of the text set with the elements of the webpage text set one by one to obtain the positions of the first line and the last line of the text in the webpage text;
and the title position determining unit is used for matching the titles with the webpage text set elements one by one to obtain the positions of the titles in the webpage text.
13. A computer arrangement, characterized in that the computer arrangement comprises a processor for implementing the steps of the author information extraction method as claimed in any one of claims 1-6 when executing a computer program in a memory.
14. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program, when being executed by a processor, carries out the steps of the author information extraction method as set forth in any one of claims 1-6.
CN201810816328.9A 2018-07-24 2018-07-24 Author information extraction method and device, computer device and computer readable storage medium Active CN109101491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810816328.9A CN109101491B (en) 2018-07-24 2018-07-24 Author information extraction method and device, computer device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810816328.9A CN109101491B (en) 2018-07-24 2018-07-24 Author information extraction method and device, computer device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN109101491A CN109101491A (en) 2018-12-28
CN109101491B true CN109101491B (en) 2021-12-17

Family

ID=64847301

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810816328.9A Active CN109101491B (en) 2018-07-24 2018-07-24 Author information extraction method and device, computer device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN109101491B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111476037B (en) * 2020-04-14 2023-03-31 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium
CN111581549B (en) * 2020-05-09 2023-11-03 腾讯科技(深圳)有限公司 Corpus collection method, device and storage medium based on artificial intelligence
CN111737623A (en) * 2020-06-19 2020-10-02 深圳市小满科技有限公司 Webpage information extraction method and related equipment
CN113298914B (en) * 2021-07-28 2021-10-15 北京明略软件系统有限公司 Knowledge chunk extraction method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345532A (en) * 2013-07-26 2013-10-09 人民搜索网络股份公司 Method and device for extracting webpage information
CN108153851A (en) * 2017-12-21 2018-06-12 北京工业大学 A kind of rule-based and semantic universal forum topic post page info abstracting method
CN108268433A (en) * 2018-02-26 2018-07-10 杭州数梦工场科技有限公司 Title abstracting method and device based on webpage article

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080162275A1 (en) * 2006-08-21 2008-07-03 Logan James D Author-assisted information extraction

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
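Research on author information extraction based on the deep belief network algorithm; Lu Mingyi; China Master's Theses Full-text Database (Basic Sciences); 20170415; full text *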

Also Published As

Publication number Publication date
CN109101491A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN109101491B (en) Author information extraction method and device, computer device and computer readable storage medium
CN109062874B (en) Financial data acquisition method, terminal device and medium
CN110020422B (en) Feature word determining method and device and server
CN108763591B (en) Webpage text extraction method and device, computer device and computer readable storage medium
US9323839B2 (en) Classification rule generation device, classification rule generation method, classification rule generation program, and recording medium
US8577155B2 (en) System and method for duplicate text recognition
CN105279277A (en) Knowledge data processing method and device
CN104881458B (en) A kind of mask method and device of Web page subject
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
CN103544255A (en) Text semantic relativity based network public opinion information analysis method
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
WO2019080402A1 (en) Text information extraction method for structured text, storage medium and server
RU2666277C1 (en) Text segmentation
CN109635288A (en) A kind of resume abstracting method based on deep neural network
US20110302179A1 (en) Using Context to Extract Entities from a Document Collection
WO2011148571A1 (en) Information extraction system, method, and program
CN109635297A (en) A kind of entity disambiguation method, device, computer installation and computer storage medium
CN109033282B (en) Webpage text extraction method and device based on extraction template
CN103679012A (en) Clustering method and device of portable execute (PE) files
CN105653984A (en) File fingerprint check method and apparatus
CN113569050A (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN110738033B (en) Report template generation method, device and storage medium
CN111159497A (en) Regular expression generation method and regular expression-based data extraction method
CN107436931B (en) Webpage text extraction method and device
CN106126496B (en) A kind of information segmenting method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant