CN116561295A - Internet data extraction system - Google Patents


Info

Publication number
CN116561295A
Authority
CN
China
Prior art keywords
sentences
abstract
text
sentence
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310350019.8A
Other languages
Chinese (zh)
Inventor
林松林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lu'an Yili Innovation Technology Co ltd
Original Assignee
Lu'an Yili Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lu'an Yili Innovation Technology Co ltd
Priority to CN202310350019.8A
Publication of CN116561295A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an Internet data extraction system comprising a data acquisition module, an automatic abstracting module and a system function module. Data acquisition comprises keyword acquisition and key media acquisition. The system function module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module. Keyword acquisition uses search engine technology to search preset keywords automatically and applies URL deduplication, key information extraction, storage in a database and other processing to the search results, so as to monitor sensitive information on the Internet. The system defines two modes for searching Internet information, breadth search and depth search; breadth search calls the top-ranked search engines in the Internet industry to search the keywords. The automatic abstract extracts the main ideas or central content of the original text automatically; it is concise, objective, understandable and readable, and can be applied in any field.

Description

Internet data extraction system
Technical Field
The invention relates to the technical field of Internet, in particular to an Internet data extraction system.
Background
Compared with data held in databases, Internet data is a new data source that is low in cost, quick to respond, highly real-time and rich in information, and it further extends the category of machine data. It is the full-volume, bidirectional communication data carried over the network fiber that connects clients and servers. Once processed, it is a high-value data source available to the business, and a reference data resource that shows the operating condition of a system's services comprehensively, objectively, in real time and visually. Reconstructing the data transmitted in a mass network into structured data in real time helps IT operations staff establish behavior baselines, detect abnormal behavior, and locate and eliminate performance faults in real time.

The use of interconnection data lets IT itself drive business innovation more effectively through the introduction of technology. The approach stays close to the business, does not require a development team to modify applications, and has zero impact on the production system. Unlike Internet big data, it is more real-time, comprehensive and in-depth, and it can show the state of the application stack as well as the state of the whole delivery chain. In the past, however, there has been insufficient practice worldwide in analyzing the service data and user behavior contained in interconnection data. In particular, the systematic application of interconnection data still requires foundational work in methodology, best practice, engineering technology, technology-stack maturity management and other areas. An extraction system is needed when extracting Internet data.

Existing extraction systems cannot conveniently extract the main ideas or central content of the original text and have poor applicability. An Internet data extraction system is therefore proposed to solve these problems.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides an Internet data extraction system.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
The Internet data extraction system comprises a data acquisition module, an automatic abstracting module and a system function module, wherein the data acquisition module comprises a keyword acquisition module and a key media acquisition module, and the system function module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module.
Preferably, keyword collection uses search engine technology to search preset keywords automatically and applies URL deduplication, keyword information extraction, storage in a database and other processing to the search results, so as to monitor sensitive information on the Internet. The system defines two modes for searching Internet information: breadth search and depth search. Breadth search calls the top-ranked search engines in the Internet industry to search the keywords, then integrates, deduplicates and classifies the search results to maximize the capability of searching Internet information. Depth search uses the open-source crawler Nutch to mine a user-specified website in depth and find the web page information that matches the keywords.
Preferably, in the Internet data extraction system, the keyword collection function includes the following points:
1) Collecting URLs from search engines for the keywords in the existing keyword library, and for user-defined keywords;
2) Deduplicating the collected URLs by means of URL verification;
3) A URL collection crawler that supports a depth-first algorithm and a breadth-first algorithm, with configurable crawl depth and user permissions;
4) A URL tag parsing function that extracts and classifies the content under specific tags, including title, date, author, metadata, body text, etc.;
5) Extracting key information from specific tags in the search results;
6) A body-text extraction function for news web pages, which uses a general-purpose extraction algorithm to extract the page body.
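Point 2) above does not say how URL verification is performed. As a minimal sketch, assuming the verification is a hash fingerprint of a normalized URL, deduplication could look like the following. The function names and the normalization rules (lower-casing the scheme and host, dropping the fragment) are illustrative assumptions, not details taken from the patent.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce trivially different spellings of a URL to one form (assumed rules)."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # The fragment is dropped because it does not change which page is fetched.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def url_fingerprint(url: str) -> str:
    """Hash of the normalized URL, used here as the verification value."""
    return hashlib.sha1(normalize_url(url).encode("utf-8")).hexdigest()


def deduplicate(urls):
    """Keep the first occurrence of every distinct fingerprint, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        fingerprint = url_fingerprint(url)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(url)
    return unique
```

Storing fixed-length fingerprints instead of full URLs keeps the lookup set compact, which matters once the crawler has collected a large number of links.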
Preferably, in the Internet data extraction system, the key media collection function includes the following points:
1.1 Chinese Multi-document automatic abstract
1) According to the number of texts processed, automatic abstracting is divided into single-document abstracting and multi-document abstracting;
2) Multi-document abstracting is an important component technology of intelligent search engines. One recent hot spot in search engine research is the open-domain question answering system, in which people ask questions in natural language and the answers are presented to them as paragraphs or sentences. In effect, the documents returned by the search engine are reprocessed and the related documents are integrated into a coherent whole, which is exactly the task of multi-document automatic abstracting research. From this angle, research on multi-document abstracting greatly promotes the development of a new generation of search engines;
3) From the user's perspective, an abstract serves just two purposes: first, the information should be concise and comprehensive; second, it should be expressed in fluent language. The research perspective reflects the same ideas of removing redundancy, extracting the main information and generating a fluent abstract, so research on multi-document abstracting covers two main aspects: extracting the main information and generating the abstract;
4) For information extraction, research that takes paragraphs as the unit has little room left to develop, and research that takes sentences as the unit is the mainstream;
5) Overall structure of the multi-document automatic abstracting system: first, the original text is preprocessed, including sentence splitting and word segmentation; next, the similarity between text units is calculated, the text units being at the sentence level; then abstract sentences that summarize the topic are extracted from the sentence set and ordered, and finally the abstract is generated;
6) Calculating the similarity between sentences: the meaning of similarity varies with the application. In example-based machine translation, similarity mainly measures how far words in a text can be substituted for one another; in information retrieval, it reflects how well a text matches the user query in meaning; in automatic question answering, it reflects how well a question matches an answer; and in a multi-document abstracting system, it reflects how closely local topic information fits;
7) Extracting abstract sentences: abstract sentences are extracted on the basis of the similarity matrix between sentences. Most researchers cluster the sentences by their similarity using a suitable clustering method and then generate abstract sentences by extracting the center of each cluster. The most prominent feature of a multi-document set is redundant information, so clustering is an effective strategy for eliminating redundancy;
8) Abstract generation: generating the abstract is in fact a process of ordering the abstract sentences. With multiple documents the document boundaries are broken, so sentences from different documents have no inherent order. When the extracted sentences are assembled into an abstract, the time information of the source document and the position of each sentence within it generally need to be considered. Because the objects studied here are mostly reports on the same event from different websites at the same time, the ordering of abstract sentences relies mainly on the position of each sentence in its document;
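Points 6) and 7) can be illustrated with a small sketch. The patent names neither a similarity measure nor a clustering method, so the bag-of-words cosine similarity, the single-pass greedy clustering and the threshold value below are all assumptions chosen for brevity. Each sentence is represented as a list of already segmented words.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two bag-of-words vectors (assumed measure)."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm = math.sqrt(sum(c * c for c in vec_a.values())) * math.sqrt(
        sum(c * c for c in vec_b.values())
    )
    return dot / norm if norm else 0.0


def greedy_clusters(sentences, threshold=0.5):
    """Group near-duplicate sentences; returns lists of sentence indices.

    Simplification: the first member of a cluster stands in for its center,
    whereas the text speaks of extracting the center of each class.
    """
    clusters = []
    for index, tokens in enumerate(sentences):
        for cluster in clusters:
            if cosine_similarity(tokens, sentences[cluster[0]]) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters
```

Keeping one sentence per cluster removes the redundancy that arises when several websites report the same event in nearly the same words.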
1.2 Automatic abstracting technique for Chinese documents
1.2.1 Automatic extraction: the text is treated as a linear sequence of sentences, and each sentence as a linear sequence of words
1) Calculate the weight of each word;
2) Calculate the weight of each sentence;
3) Sort all sentences of the original text in descending order of weight and select the several highest-weighted sentences as abstract sentences;
4) Output the abstract sentences in the order in which they appear in the original text;
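The four steps above can be sketched as follows. The weighting formulas are assumptions, since the patent does not define them: a word's weight is taken to be its frequency in the document, and a sentence's weight the average weight of its words. Sentences are given as lists of already segmented words.

```python
from collections import Counter


def select_abstract_sentences(sentences, top_n=2):
    """Return the indices of the chosen abstract sentences in original order."""
    # Step 1: word weight = frequency across the whole document (assumed formula).
    word_weight = Counter(token for tokens in sentences for token in tokens)

    # Step 2: sentence weight = mean weight of its words (assumed formula).
    def sentence_weight(tokens):
        return sum(word_weight[t] for t in tokens) / len(tokens) if tokens else 0.0

    weights = [sentence_weight(tokens) for tokens in sentences]

    # Step 3: rank sentence indices by descending weight and keep the top ones.
    ranked = sorted(range(len(sentences)), key=lambda i: weights[i], reverse=True)

    # Step 4: restore the order of appearance in the original text.
    return sorted(ranked[:top_n])
```

Averaging rather than summing the word weights is one way to keep long sentences from winning purely on length; a real system would tune this choice.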
1.2.2 Chinese automatic abstract design
1) Text preprocessing: punctuation marks are used to divide the original text into chapters, paragraphs and sentences, converting the input text into a sequence of sentences;
2) Filtering: irrelevant sentences are removed;
3) Word segmentation: the document is segmented using a given Chinese vocabulary; words that cannot be matched are handled as single characters, and no part-of-speech judgment is needed; invalid content words are removed according to the stop word list;
4) Statistical analysis: word weights are calculated from statistics on the terms in each sentence, and the document keywords are determined;
5) Abstract extraction: sentence weights are calculated and the sentences are sorted by weight;
6) Abstract output: the abstract is output according to the user's requirements.
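Step 3) describes vocabulary-based segmentation with a single-character fallback but does not name a matching strategy. The sketch below assumes forward maximum matching, a common dictionary-based method; the vocabulary, the stop word list and the maximum word length are illustrative values only.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary word."""
    tokens = []
    position = 0
    while position < len(text):
        longest = min(max_len, len(text) - position)
        for size in range(longest, 0, -1):
            piece = text[position:position + size]
            # A single character is always accepted, so unmatched text
            # falls back to character-by-character handling.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                position += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token not in stop_words]


if __name__ == "__main__":
    vocab = {"互联网", "数据", "提取", "系统"}
    words = segment("互联网数据的提取", vocab)
    print(remove_stop_words(words, {"的"}))
```

No part-of-speech tagging is involved, which matches the statement in step 3) that word classes do not need to be judged.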
Preferably, the main task of the text preprocessing module is to divide the document into chapters, paragraphs, sentences and so on, using punctuation marks as the main basis for division. Punctuation marks may strongly affect grammar or semantics, but for text preprocessing they simply mark sentence boundaries, and the input text is labeled with its chapter, paragraph and sentence information. In addition, abstract sentences are mainly declarative; special sentence types such as exclamations and questions generally do not express the central theme of an article directly, so these sentence types are not processed during preprocessing. The difference between full-width and half-width punctuation marks is taken into account when dividing the document, to ensure accurate text recognition. The module processes the various punctuation marks in the text, identifies the structure of the text, and finally splits the text into sentence units.
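A minimal sketch of the sentence-splitting part of this module is given below. The set of sentence-ending marks is an assumption; it covers both the full-width and the half-width form of each mark, as the paragraph above requires. Treating every half-width period as a boundary is a simplification that would wrongly split decimal numbers and abbreviations.

```python
# Sentence-ending marks in full-width and half-width form (assumed set).
END_MARKS = "。！？.!?"
NON_DECLARATIVE_ENDINGS = ("？", "?", "！", "!")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    sentences = []
    current = []
    for char in text:
        current.append(char)
        if char in END_MARKS:
            sentence = "".join(current).strip()
            if sentence:
                sentences.append(sentence)
            current = []
    tail = "".join(current).strip()
    if tail:  # text after the last mark still counts as a sentence
        sentences.append(tail)
    return sentences


def is_declarative(sentence):
    """Questions and exclamations are skipped when choosing abstract sentences."""
    return not sentence.rstrip().endswith(NON_DECLARATIVE_ENDINGS)
```

Chapter and paragraph division would need additional cues such as blank lines or headings, which this sketch leaves out.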
Preferably, the main function of the statistical analysis module is to count word frequencies, calculate the weight of each term and extract keywords.
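As a hedged illustration of this module, the sketch below counts word frequencies, turns them into weights and returns the highest-weighted terms as keywords. Using relative frequency as the weight is an assumption; the patent does not give the formula, and a production system might use a measure such as TF-IDF instead.

```python
from collections import Counter


def extract_keywords(tokens, stop_words=frozenset(), top_k=3):
    """Return the top_k (term, weight) pairs, heaviest first."""
    counts = Counter(token for token in tokens if token not in stop_words)
    total = sum(counts.values())
    if total == 0:
        return []
    # Weight = relative frequency of the term in the document (assumed formula).
    weights = {term: count / total for term, count in counts.items()}
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]
```

The input is expected to be the token list produced by word segmentation, so the quality of the keywords depends directly on the vocabulary and stop word list used earlier.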
Preferably, the abstract extraction module is the foundation and the core module of the automatic abstracting system. Its main function is to assign weights to the sentences in the document using a weight assignment algorithm and to extract abstract sentences: sentences that carry the main content of the document can serve as abstract sentences and together form the abstract. Whether the abstract sentences are well chosen directly determines the quality of the abstract, so the abstract sentence extraction module is very important.
Preferably, in the Internet data extraction system, the automatic abstracting flow is as follows:
1) First, the main content of the article, i.e. its keywords, is grasped. The system scans and matches the whole text against a Chinese vocabulary, removes the words in the document that appear in the stop word list, extracts the remaining words that appear in the vocabulary, and then screens them to determine the keywords;
2) Under current technical conditions, a computer can in principle compose complete sentences from an analysis of the keywords, but this is complex to implement and the technology is very immature. The simpler approach in current use is therefore still to extract original sentences from the source text as key sentences. The system uses a sentence-weight statistics method: the relevant sentences are weighted according to predefined rules, and abstract sentences are then selected according to the weighting result;
3) Combining and outputting the article abstract: because an abstract consists of declarative sentences, questions and exclamations are removed first. The weights are then combined appropriately, the selected sentences are sorted by weight, and the sentences whose weight exceeds the threshold are arranged in their original order to form the document abstract, which is output.
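Step 3) of the flow can be sketched as below. The sentence weights are taken as given inputs, because the weighting rules are not spelled out in the text; the threshold value and the function name are likewise assumptions. Since selection is by threshold, a single pass over the sentences in their original order yields the same result as sorting by weight and then restoring the order.

```python
NON_DECLARATIVE_ENDINGS = ("？", "?", "！", "!")


def assemble_abstract(sentences, weights, threshold):
    """Return the abstract sentences in their original order.

    sentences: sentence strings in order of appearance.
    weights:   one precomputed weight per sentence (hypothetical input).
    """
    abstract = []
    for sentence, weight in zip(sentences, weights):
        # Questions and exclamations are dropped before weights are compared.
        if sentence.rstrip().endswith(NON_DECLARATIVE_ENDINGS):
            continue
        if weight > threshold:
            abstract.append(sentence)
    return abstract
```

Returning a list rather than a joined string leaves the choice of separator to the output module, which matters because Chinese text is joined without spaces while English text is not.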
The beneficial effects of the invention are as follows: the automatic abstract extracts the main ideas or central content of the original text automatically; the abstract is concise, objective, understandable and readable, and can be applied in any field.
Drawings
Fig. 1 is a schematic operation diagram of an internet data extraction system according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention.
Referring to Fig. 1, an Internet data extraction system comprises a data acquisition module, an automatic abstracting module and a system function module, wherein data acquisition comprises keyword acquisition and key media acquisition, and the system function module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module.
Keyword collection uses search engine technology to search preset keywords automatically and applies URL deduplication, keyword information extraction, storage in a database and other processing to the search results, so as to monitor sensitive information on the Internet. The system defines two modes for searching Internet information: breadth search and depth search. Breadth search calls the top-ranked search engines in the Internet industry to search the keywords, then integrates, deduplicates and classifies the search results to maximize the capability of searching Internet information. Depth search uses the open-source crawler Nutch to mine a user-specified website in depth and find the web page information that matches the keywords.
In the Internet data extraction system, the keyword collection function includes the following points:
1) Collecting URLs from search engines for the keywords in the existing keyword library, and for user-defined keywords;
2) Deduplicating the collected URLs by means of URL verification;
3) A URL collection crawler that supports a depth-first algorithm and a breadth-first algorithm, with configurable crawl depth and user permissions;
4) A URL tag parsing function that extracts and classifies the content under specific tags, including title, date, author, metadata, body text, etc.;
5) Extracting key information from specific tags in the search results;
6) A body-text extraction function for news web pages, which uses a general-purpose extraction algorithm to extract the page body.
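Point 3) mentions depth-first and breadth-first crawling with a configurable crawl depth. The sketch below shows the difference in visiting order on an in-memory link graph, so that it runs without network access; the graph, the function name and the depth limit are illustrative assumptions and do not reflect how Nutch itself is configured.

```python
from collections import deque


def crawl_order(links, seed, max_depth, breadth_first=True):
    """Return the order in which pages are visited.

    links: dict mapping a URL to the list of URLs it links to (stand-in for fetching).
    """
    frontier = deque([(seed, 0)])
    visited = set()
    order = []
    while frontier:
        # A queue gives breadth-first order; a stack gives depth-first order.
        url, depth = frontier.popleft() if breadth_first else frontier.pop()
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth >= max_depth:
            continue  # configured crawl depth reached; do not follow links
        children = links.get(url, [])
        if not breadth_first:
            children = reversed(children)  # so the first link is explored first
        for child in children:
            if child not in visited:
                frontier.append((child, depth + 1))
    return order
```

Breadth-first order reaches every page near the seed before going deeper, which suits broad coverage, while depth-first order follows one chain of links to the depth limit before backtracking.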
In the Internet data extraction system, the key media collection function includes the following points:
1.1 Chinese Multi-document automatic abstract
1) According to the number of texts processed, automatic abstracting is divided into single-document abstracting and multi-document abstracting;
2) Multi-document abstracting is an important component technology of intelligent search engines. One recent hot spot in search engine research is the open-domain question answering system, in which people ask questions in natural language and the answers are presented to them as paragraphs or sentences. In effect, the documents returned by the search engine are reprocessed and the related documents are integrated into a coherent whole, which is exactly the task of multi-document automatic abstracting research. From this angle, research on multi-document abstracting greatly promotes the development of a new generation of search engines;
3) From the user's perspective, an abstract serves just two purposes: first, the information should be concise and comprehensive; second, it should be expressed in fluent language. The research perspective reflects the same ideas of removing redundancy, extracting the main information and generating a fluent abstract, so research on multi-document abstracting covers two main aspects: extracting the main information and generating the abstract;
4) For information extraction, research that takes paragraphs as the unit has little room left to develop, and research that takes sentences as the unit is the mainstream;
5) Overall structure of the multi-document automatic abstracting system: first, the original text is preprocessed, including sentence splitting and word segmentation; next, the similarity between text units is calculated, the text units being at the sentence level; then abstract sentences that summarize the topic are extracted from the sentence set and ordered, and finally the abstract is generated;
6) Calculating the similarity between sentences: the meaning of similarity varies with the application. In example-based machine translation, similarity mainly measures how far words in a text can be substituted for one another; in information retrieval, it reflects how well a text matches the user query in meaning; in automatic question answering, it reflects how well a question matches an answer; and in a multi-document abstracting system, it reflects how closely local topic information fits;
7) Extracting abstract sentences: abstract sentences are extracted on the basis of the similarity matrix between sentences. Most researchers cluster the sentences by their similarity using a suitable clustering method and then generate abstract sentences by extracting the center of each cluster. The most prominent feature of a multi-document set is redundant information, so clustering is an effective strategy for eliminating redundancy;
8) Abstract generation: generating the abstract is in fact a process of ordering the abstract sentences. With multiple documents the document boundaries are broken, so sentences from different documents have no inherent order. When the extracted sentences are assembled into an abstract, the time information of the source document and the position of each sentence within it generally need to be considered. Because the objects studied here are mostly reports on the same event from different websites at the same time, the ordering of abstract sentences relies mainly on the position of each sentence in its document;
1.2 Automatic abstracting technique for Chinese documents
1.2.1 Automatic extraction: the text is treated as a linear sequence of sentences, and each sentence as a linear sequence of words
1) Calculate the weight of each word;
2) Calculate the weight of each sentence;
3) Sort all sentences of the original text in descending order of weight and select the several highest-weighted sentences as abstract sentences;
4) Output the abstract sentences in the order in which they appear in the original text;
1.2.2 Chinese automatic abstract design
1) Text preprocessing: punctuation marks are used to divide the original text into chapters, paragraphs and sentences, converting the input text into a sequence of sentences;
2) Filtering: irrelevant sentences are removed;
3) Word segmentation: the document is segmented using a given Chinese vocabulary; words that cannot be matched are handled as single characters, and no part-of-speech judgment is needed; invalid content words are removed according to the stop word list;
4) Statistical analysis: word weights are calculated from statistics on the terms in each sentence, and the document keywords are determined;
5) Abstract extraction: sentence weights are calculated and the sentences are sorted by weight;
6) Abstract output: the abstract is output according to the user's requirements.
Preferably, the main task of the text preprocessing module is to divide the document into chapters, paragraphs, sentences and so on, using punctuation marks as the main basis for division. Punctuation marks have a relatively large influence on grammar or semantics, but for text preprocessing they simply mark sentence boundaries, and the input text is labeled with its chapter, paragraph and sentence information. In addition, abstract sentences are mainly declarative; special sentence types such as exclamations and questions generally do not express the central theme of an article directly, so these sentence types are not processed during preprocessing. The difference between full-width and half-width punctuation marks is taken into account when dividing the document, to ensure accurate text recognition. The module processes the various punctuation marks in the text, identifies the structure of the text, and finally splits the text into sentence units.
The main function of the statistical analysis module is to count word frequencies, calculate the weight of each term and extract keywords.
The abstract extraction module is the foundation and the core module of the automatic abstracting system. Its main function is to assign weights to the sentences in the document using a weight assignment algorithm and to extract abstract sentences: sentences that carry the main content of the document can serve as abstract sentences and together form the abstract. Whether the abstract sentences are well chosen directly determines the quality of the abstract, so the abstract sentence extraction module is very important.
In the Internet data extraction system, the automatic abstracting flow is as follows:
1) First, the main content of the article, i.e. its keywords, is grasped. The system scans and matches the whole text against a Chinese vocabulary, removes the words in the document that appear in the stop word list, extracts the remaining words that appear in the vocabulary, and then screens them to determine the keywords;
2) Under current technical conditions, a computer can in principle compose complete sentences from an analysis of the keywords, but this is complex to implement and the technology is very immature. The simpler approach in current use is therefore still to extract original sentences from the source text as key sentences. The system uses a sentence-weight statistics method: the relevant sentences are weighted according to predefined rules, and abstract sentences are then selected according to the weighting result;
3) Combining and outputting the article abstract: because an abstract consists of declarative sentences, questions and exclamations are removed first. The weights are then combined appropriately, the selected sentences are sorted by weight, and the sentences whose weight exceeds the threshold are arranged in their original order to form the document abstract, which is output.
The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical solution and the inventive concept of the present invention, shall fall within the scope of protection of the present invention.

Claims (8)

1. The Internet data extraction system comprises a data acquisition module, an automatic abstract compiling module and a system functional module, and is characterized in that the data acquisition module comprises a keyword acquisition module and a key media acquisition module, and the system functional module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module.
2. The internet data extraction system according to claim 1, wherein the keyword collection uses search engine technology to automatically search for preset keywords and performs URL de-duplication, keyword information extraction, storage in the database, and other processing on the search results, thereby monitoring sensitive information on the internet; the system defines two modes for searching internet information, breadth search and depth search; in breadth search, the system calls the top-ranked search engines in the internet industry to search for the keywords and integrates, de-duplicates, and classifies the search results, thereby maximizing the capability of searching internet information; in depth search, the open-source crawler Nutch is used to deeply mine user-specified websites and find the web page information that matches the keywords.
3. The internet data extraction system of claim 1, wherein the keyword collection function comprises the following:
1) Providing collection of URLs from search engines for the keywords in the existing keyword library, and providing collection for user-defined keywords;
2) The system de-duplicates the collected URLs by means of URL verification;
3) The URL collection crawler supports a depth-first algorithm and a breadth-first algorithm, with configurable crawl depth and user permissions;
4) Providing a URL tag parsing function: content under specific tags, including title, date, author, metadata, and body text, is extracted and classified;
5) Providing extraction of key information within specific tags of the search results;
6) The body-text extraction function for news web pages uses a general-purpose extraction algorithm to extract the web page body text.
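Item 2) does not say how the URL verification works. One plausible reading, sketched here purely as an assumption, is to normalize each URL and check it against the set of URLs already seen. The names `normalize_url` and `deduplicate_urls` and the particular normalization rules are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lower-case the scheme and host, drop the fragment, strip a trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def deduplicate_urls(urls):
    """Keep the first occurrence of each URL, compared in normalized form."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

A set lookup keeps each check at roughly constant cost; a crawler handling very large URL volumes might instead store hashes or use a Bloom filter, which this sketch does not attempt.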
4. The internet data extraction system of claim 1, wherein the key media collection function comprises the following:
1.1 Chinese multi-document automatic abstract
1) According to the number of texts processed, automatic abstracts can be divided into single-document abstracts and multi-document abstracts;
2) The multi-document abstract is an important component technology of intelligent search engines. One recent hot spot of search engine research is the open-domain question-answering system, in which people ask questions in natural language and the answers obtained are presented to the user as paragraphs or sentences; in effect, the documents returned by the search engine are reprocessed and the related documents are organically integrated, which is precisely the task studied in multi-document automatic abstracting. From this angle, research on multi-document abstract technology greatly promotes the development of a new generation of search engines;
3) From the user's perspective, an abstract serves just two purposes: first, the information should be concise and comprehensive; second, it should be expressed in fluent language. The research perspective reflects the same ideas, namely removing redundancy, extracting the main information, and generating a fluent abstract, so research on multi-document abstracts mainly covers two aspects: extracting the main information and generating the abstract;
4) For information extraction, research that takes paragraphs as the unit has little room left, and research that takes sentences as the unit is the mainstream;
5) Overall structure of the multi-document automatic abstract system: first, the original text is preprocessed, including sentence splitting and word segmentation; then the similarity between text units is calculated, the text units being processed at the sentence level; next, abstract sentences that summarize the topic are extracted from the sentence set; finally, the extracted abstract sentences are ordered and the abstract is generated;
6) Calculating the similarity between sentences: the meaning of similarity varies with the specific application. For example, in example-based machine translation, similarity mainly measures the degree to which words in the text can be substituted; in information retrieval, it reflects how well the text matches the user query in meaning; in automatic question answering, it reflects how well the question and the answer match; and in a multi-document abstract system, it reflects how closely local topic information fits together;
7) Extracting abstract sentences: abstract sentences are extracted on the basis of the similarity matrix between sentences. For this part, most researchers cluster the sentences by a suitable clustering method according to their similarity information and then generate abstract sentences by extracting the center of each class. The most prominent characteristic of a multi-document set is the redundancy of information, so clustering is an effective strategy for eliminating redundancy;
8) Abstract generation: generating the abstract is in effect the process of ordering the abstract sentences. With multiple documents, the boundaries between documents are broken, so sentences from different documents have no inherent order; to form an abstract from the extracted sentences, one generally needs to consider the time information of the source document of each sentence and the position of the sentence within that document. Because the objects studied here are mostly reports on the same event at the same time from different websites, the ordering of abstract sentences mainly refers to the position of each sentence within its document;
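Items 6) to 8) above name no specific similarity measure or clustering method. The following Python sketch is therefore only one possible instance: it assumes cosine similarity over bag-of-words counts, a simple greedy single-pass grouping in place of the unspecified clustering method, and the list index as a stand-in for sentence position. All function names and the `min_similarity` default are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine similarity between two sentences given as lists of words."""
    count_a, count_b = Counter(words_a), Counter(words_b)
    dot = sum(count_a[w] * count_b[w] for w in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(c * c for c in count_a.values()))
    norm_b = math.sqrt(sum(c * c for c in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def greedy_clusters(sentences, min_similarity):
    """Put each sentence into the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; the first one is the seed
    for i, words in enumerate(sentences):
        for cluster in clusters:
            if cosine_similarity(words, sentences[cluster[0]]) >= min_similarity:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def pick_representatives(sentences, min_similarity=0.5):
    """Keep one sentence index per cluster, returned in position order."""
    chosen = []
    for cluster in greedy_clusters(sentences, min_similarity):
        # The "center" is taken to be the member most similar to its cluster.
        center = max(
            cluster,
            key=lambda i: sum(
                cosine_similarity(sentences[i], sentences[j]) for j in cluster
            ),
        )
        chosen.append(center)
    return sorted(chosen)
```

Keeping one sentence per group is what removes the redundancy typical of a multi-document set; sorting the chosen indices mirrors ordering the abstract sentences by position.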
1.2 Automatic abstracting technique for Chinese documents
1.2.1 Automatic extraction: automatic extraction treats the text as a linear sequence of sentences and each sentence as a linear sequence of words
1) Calculating the weight of the word;
2) Calculating the weight of sentences;
3) Arranging all sentences in the original text in descending order of weight, and determining a plurality of sentences with highest weight as abstract sentences;
4) Outputting all abstract sentences according to the appearance sequence of the abstract sentences in the original text;
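The four steps above can be sketched in Python as follows. The text does not define the weighting formulas, so this sketch assumes that a word's weight is its frequency in the document and that a sentence's weight is the average weight of its words; `extract_abstract` and `top_n` are hypothetical names.

```python
from collections import Counter


def extract_abstract(segmented_sentences, top_n):
    """Return the indices of the abstract sentences in order of appearance.

    segmented_sentences: list of word lists, one per sentence, in text order.
    top_n:               number of highest-weight sentences to keep.
    """
    # 1) Word weight: assumed here to be the word's frequency in the document.
    word_weight = Counter(w for sentence in segmented_sentences for w in sentence)
    # 2) Sentence weight: assumed here to be the mean weight of its words.
    sentence_weight = [
        sum(word_weight[w] for w in sentence) / len(sentence) if sentence else 0.0
        for sentence in segmented_sentences
    ]
    # 3) Rank all sentences by weight, descending, and keep the top_n.
    ranked = sorted(
        range(len(segmented_sentences)),
        key=lambda i: sentence_weight[i],
        reverse=True,
    )
    chosen = ranked[:top_n]
    # 4) Output the chosen sentences in their order of appearance.
    return sorted(chosen)
```

Averaging rather than summing avoids favoring long sentences; either choice is compatible with the steps as stated.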
1.2.2 Chinese automatic abstract design
1) Text preprocessing: punctuation marks can be used to divide the original text into chapters, paragraphs, and sentences, converting the input text into a sequence of sentences;
2) Filtering: removing irrelevant sentences;
3) Word segmentation: the document is segmented using a given Chinese vocabulary; words that cannot be processed are handled as single characters, and no part-of-speech judgment is needed; invalid content words are removed according to the stop-word list;
4) Statistical analysis: word weights are calculated by analyzing and counting the term information of the sentences, and the document keywords are determined;
5) Extracting the abstract: the weights of the sentences are calculated, and the sentences are sorted by weight;
6) Outputting the abstract: the abstract is output according to the user's requirements.
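The word segmentation step 3) can be illustrated with a short Python sketch. The matching strategy is not stated in the text; forward maximum matching is assumed here as one common vocabulary-based approach, with unmatched characters emitted as single-character words. The function name `segment`, the `max_word_length` parameter, and the example vocabulary are hypothetical.

```python
def segment(text, vocabulary, stop_words, max_word_length=4):
    """Segment text by forward maximum matching, then drop stop words.

    vocabulary: set of known words.
    stop_words: set of words to remove after segmentation.
    """
    words = []
    position = 0
    while position < len(text):
        longest = min(max_word_length, len(text) - position)
        # Try the longest candidate first and shrink until the vocabulary
        # matches; a single character is always accepted as a fallback.
        for length in range(longest, 0, -1):
            candidate = text[position:position + length]
            if length == 1 or candidate in vocabulary:
                words.append(candidate)
                position += length
                break
    return [w for w in words if w not in stop_words]


# Example with a tiny hypothetical vocabulary and stop-word list.
vocabulary = {"互联网", "数据", "提取", "系统"}
stop_words = {"的"}
print(segment("互联网数据的提取系统", vocabulary, stop_words))
```

No part-of-speech tagging is performed, in line with the step as described; the output is just the list of remaining words.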
5. The internet data extraction system according to claim 1, wherein the main task of the text preprocessing module is to divide the document into chapters, paragraphs, and sentences, mainly on the basis of punctuation marks; these marks may have a considerable influence on grammar or semantics, but for text preprocessing they serve as sentence boundaries, and the input original text is marked up according to the chapter, paragraph, and sentence information; in addition, abstract sentences are mainly declarative, and special sentence types such as exclamatory and interrogative sentences generally do not directly express the central theme of the article, so, taking these factors into account, such sentences are not processed during document preprocessing and analysis; the distinction between full-width and half-width punctuation marks is taken into account when classifying the document so as to ensure accurate text recognition; the various punctuation marks of the text are processed and the structure of the text is recognized, finally achieving the purpose of dividing the text into sentence units.
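The sentence division described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the claimed implementation: the set of sentence-ending marks is chosen for the example, the half-width period is deliberately left out so that decimals are not split, and the names `split_sentences` and `is_declarative` are hypothetical.

```python
import re

# Sentence-ending marks in full-width and half-width forms (assumed set).
# The half-width "." is omitted on purpose so that numbers such as 3.14
# are not split; a real system would need a more careful rule.
_SENTENCE_PATTERN = re.compile(r"[^。！？!?]+[。！？!?]*")


def split_sentences(text):
    """Split text into sentences, keeping the ending mark on each sentence."""
    parts = _SENTENCE_PATTERN.findall(text)
    return [part.strip() for part in parts if part.strip()]


def is_declarative(sentence):
    """Treat sentences ending in a question or exclamation mark as non-declarative."""
    return not sentence.rstrip().endswith(("？", "?", "！", "!"))


text = "系统已上线。效果如何?太好了！继续观察"
sentences = split_sentences(text)
declarative = [s for s in sentences if is_declarative(s)]
```

Listing both the full-width and half-width forms in one pattern means mixed punctuation in web text is handled in a single pass, without a separate normalization step.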
6. The internet data extraction system according to claim 1, wherein the main function of the statistical analysis module is to count word frequencies, calculate the weights of terms, and extract keywords.
7. The internet data extraction system according to claim 1, wherein the abstract extraction module is the foundation of the automatic abstract system and also its core module; its main function is to assign weights to the sentences in the document using a weight assignment algorithm and to extract abstract sentences: sentences containing the main content of the document can serve as abstract sentences and form the abstract; whether the abstract sentences are properly selected directly determines the quality of the abstract, so the abstract sentence extraction module is very important.
8. The internet data extraction system of claim 1, wherein the automatic summarization process is as follows:
1) First, the main content of the article, namely its key sentences, is identified: the system scans the whole text and matches it against a Chinese vocabulary, removes the words of the document that appear in the stop-word list, extracts the remaining words that appear in the vocabulary, and then screens them to determine the keywords;
2) Under current technical conditions, a computer can in principle compose a complete sentence from the key sentences by analyzing the keywords, but this is complex to implement and the technology is still very immature; the simpler current method therefore still extracts original sentences from the source text as key sentences. The system uses a sentence-weight statistics method: the relevant sentences are weighted according to formulated rules, and abstract sentences are then selected according to the weighting result;
3) Combining and outputting the article abstract: because abstract sentences are declarative, the interrogative and exclamatory sentences are removed first; the weights are then combined appropriately, the selected sentences are sorted by weight, and the sentences whose weight exceeds the threshold are arranged in their order of appearance in the original text to form the document abstract, which is output.
CN202310350019.8A 2023-04-04 2023-04-04 Internet data extraction system Pending CN116561295A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310350019.8A CN116561295A (en) 2023-04-04 2023-04-04 Internet data extraction system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310350019.8A CN116561295A (en) 2023-04-04 2023-04-04 Internet data extraction system

Publications (1)

Publication Number Publication Date
CN116561295A true CN116561295A (en) 2023-08-08

Family

ID=87499059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310350019.8A Pending CN116561295A (en) 2023-04-04 2023-04-04 Internet data extraction system

Country Status (1)

Country Link
CN (1) CN116561295A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117150106A (en) * 2023-10-31 2023-12-01 北京大学 Data processing method, system and electronic equipment
CN117271710A (en) * 2023-11-17 2023-12-22 山东接力教育集团有限公司 Teaching assistance hot spot data intelligent analysis system based on big data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117150106A (en) * 2023-10-31 2023-12-01 北京大学 Data processing method, system and electronic equipment
CN117150106B (en) * 2023-10-31 2024-02-13 北京大学 Data processing method, system and electronic equipment
CN117271710A (en) * 2023-11-17 2023-12-22 山东接力教育集团有限公司 Teaching assistance hot spot data intelligent analysis system based on big data
CN117271710B (en) * 2023-11-17 2024-01-30 山东接力教育集团有限公司 Teaching assistance hot spot data intelligent analysis system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination