CN116561295A

CN116561295A - Internet data extraction system

Info

Publication number: CN116561295A
Application number: CN202310350019.8A
Authority: CN
Inventors: 林松林
Original assignee: Lu'an Yili Innovation Technology Co ltd
Current assignee: Lu'an Yili Innovation Technology Co ltd
Priority date: 2023-04-04
Filing date: 2023-04-04
Publication date: 2023-08-08

Abstract

The invention discloses an Internet data extraction system comprising a data acquisition module, an automatic abstract compiling module and a system function module. The data acquisition module comprises keyword acquisition and key media acquisition, and the system function module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module. Keyword acquisition uses search engine technology to automatically search for preset keywords and performs URL deduplication, key information extraction, storage in the database and other processing on the search results, thereby monitoring sensitive information on the Internet. The system defines two modes for searching Internet information, breadth search and depth search; breadth search calls the top-ranked search engines in the Internet industry to search for the keywords. The automatic abstract can automatically extract the main ideas or central content of the original text, is concise, objective, understandable and readable, and can be applied to any field.

Description

Internet data extraction system

Technical Field

The invention relates to the technical field of Internet, in particular to an Internet data extraction system.

Background

Relative to the data held in databases, Internet data is a new data source that is low in cost, quick to respond, highly real-time and rich in information, and it further extends the category of machine data. It is the full-volume, bidirectional communication data carried by the network optical fibers that connect clients and servers, and, once processed, it is a high-value data source usable by the business. It is a reference data resource that shows the operating condition of a system's services comprehensively and objectively, in real time and in visual form. Data transmitted over the network in massive volume is reconstructed into structured data in real time to help IT operations staff establish behavior baselines, detect abnormal behavior, and locate and eliminate performance faults in real time. Using Internet data lets IT better drive business innovation through technology: it stays close to the business, requires no development team to modify applications, and has zero impact on the production system. Unlike Internet big data, it is more real-time, comprehensive and in-depth, and it can show the condition of the application stack as well as that of the whole delivery chain. In the past, however, there has been insufficient practice worldwide in analyzing the business data and user behavior contained in Internet data. For the systematic application of Internet data in particular, foundations for innovation still need to be established in methodology, best practice, engineering technology, technology stack maturity management and other respects. An extraction system is needed when extracting Internet data.

Existing extraction systems cannot conveniently extract the main ideas or central content of the original text and have poor applicability. An Internet data extraction system is therefore proposed to solve these problems.

Disclosure of Invention

The invention aims to solve the problems in the prior art and provides an Internet data extraction system.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

the Internet data extraction system comprises a data acquisition module, an automatic abstract compiling module and a system function module, wherein the data acquisition module comprises a keyword acquisition module and a key media acquisition module, and the system function module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module.

Preferably, keyword collection uses search engine technology to automatically search for preset keywords and performs URL deduplication, key information extraction, storage in the database and other processing on the search results, thereby monitoring sensitive information on the Internet. The system defines two modes for searching Internet information: breadth search and depth search. Breadth search calls the top-ranked search engines in the Internet industry to search for the keywords and then integrates, deduplicates and classifies the search results, maximizing the capability to search Internet information. Depth search uses the open-source crawler Nutch to mine user-specified websites in depth and find the web pages that match the keywords.
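The integration and deduplication step of breadth search can be illustrated with a minimal Python sketch. It is not the implementation described in the invention: the function names `normalize_url` and `merge_results`, the shape of the result records, and the normalization rules (lower-casing the host, dropping the fragment and any trailing slash) are all assumptions made for illustration, and the classification step is omitted.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to a canonical form so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it never changes which page is fetched.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def merge_results(results_by_engine: dict[str, list[dict]]) -> list[dict]:
    """Integrate result lists from several engines, keeping one entry per URL."""
    merged: dict[str, dict] = {}
    for engine, results in results_by_engine.items():
        for item in results:
            key = normalize_url(item["url"])
            # The first engine to report a URL supplies its title (an assumed policy).
            entry = merged.setdefault(key, {"url": key, "title": item["title"], "engines": []})
            entry["engines"].append(engine)
    return list(merged.values())
```

Recording which engines returned each URL is one possible input for later ranking or classification; the source does not specify how those steps are performed.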

Preferably, in the Internet data extraction system, the keyword collection function includes the following points:

1) Providing URL collection on search engines for the keywords in the existing keyword library, and providing collection for user-defined keywords;

2) The system deduplicates the collected URLs by means of URL verification;

3) The URL collection crawler includes a depth-first algorithm and a breadth-first algorithm, and the crawl depth and user permissions are configurable;

4) Providing a URL tag parsing function: content under specific tags, including title, date, author, metadata and body text, is extracted and classified;

5) Providing extraction of key information from specific tags in the search results;

6) Providing a body-text extraction function for news web pages, which uses a general-purpose extraction algorithm to extract the body text of the page.
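The difference between the depth-first and breadth-first crawling strategies in point 3, together with a configurable crawl depth, can be sketched as follows. This is a simplified illustration and not Nutch's behaviour: the function `crawl_order` is hypothetical, and the link graph is an in-memory dictionary standing in for pages that a real crawler would fetch and parse. User permissions are not modelled.

```python
from collections import deque


def crawl_order(links: dict[str, list[str]], seed: str, max_depth: int,
                breadth_first: bool = True) -> list[str]:
    """Return the order in which pages would be visited, honouring a depth limit."""
    frontier = deque([(seed, 0)])
    seen = set()
    order = []
    while frontier:
        # A queue (popleft) gives breadth-first order; a stack (pop) gives depth-first.
        url, depth = frontier.popleft() if breadth_first else frontier.pop()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)
        if depth >= max_depth:
            continue  # do not follow links beyond the configured depth
        children = links.get(url, [])
        # Reverse for the stack so children are still expanded left to right.
        for child in (children if breadth_first else reversed(children)):
            if child not in seen:
                frontier.append((child, depth + 1))
    return order
```

With the same seed and depth limit, the two strategies visit the same kind of frontier in different orders; the `seen` set plays the role of URL deduplication during the crawl.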

Preferably, in the Internet data extraction system, the key media collection function includes the following points:

1.1 Chinese multi-document automatic abstract

1) According to the number of texts processed, automatic abstracting can be divided into single-document abstracting and multi-document abstracting;

2) Multi-document abstracting is an important component technology of intelligent search engines. One recent hot spot in search engine research is the open-domain question answering system, in which people ask questions in natural language and the answers obtained are presented to the user as paragraphs or sentences. In effect, the documents returned by the search engine are reprocessed and the related documents are organically integrated, which is exactly the task studied in multi-document automatic abstracting. From this angle, research on multi-document abstracting technology greatly promotes the development of a new generation of search engines;

3) From the user's perspective, an abstract serves just two purposes: first, the information should be concise and comprehensive; second, it should be expressed in fluent language. The same ideas (removing redundancy, extracting the main information, and generating a fluent abstract) are reflected on the research side, where work on multiple documents mainly covers two aspects: extracting the main information and generating the abstract;

4) In information extraction, research that takes the paragraph as the unit is now uncommon, and research that takes the sentence as the unit is the mainstream;

5) Overall structure of the multi-document automatic abstract system: first, the original text is preprocessed, including sentence splitting and word segmentation; next, the similarity between text units is calculated, the text units being at the sentence level; then abstract sentences that can summarize the topic are extracted from the sentence set and ordered, and finally the abstract is generated;

6) Calculating the similarity between sentences: the meaning of similarity varies with the application. In example-based machine translation, similarity mainly measures the degree to which words in a text can be substituted for one another; in information retrieval, it reflects how well a text matches the user's query in meaning; in automatic question answering, it reflects how well a question matches an answer; and in a multi-document abstract system, it can reflect how well sentences fit the same local topic;

7) Extracting abstract sentences: abstract sentences are extracted on the basis of the similarity matrix between sentences. Most researchers cluster the sentences according to their similarity using a suitable clustering method and then generate the abstract sentences by extracting the center of each cluster. The most prominent feature of a multi-document set is redundancy of information, so clustering is an effective strategy for eliminating redundancy;

8) Abstract generation: generating the abstract is essentially a process of ordering the abstract sentences. With multiple documents the boundaries between documents are broken, so sentences from different documents have no inherent order. To form an abstract from the extracted sentences, it is generally necessary to consider the time information of the original document containing each sentence and the position of the sentence within that document. Because the objects studied here are mostly reports of the same event published by different websites at about the same time, the ordering of abstract sentences mainly relies on the position of each sentence within its document;
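The pipeline in points 5 to 8 (sentence similarity, clustering, extraction of cluster centers, ordering by position) can be sketched in Python under several assumptions that the source does not state: similarity is taken to be cosine similarity over bag-of-words counts, clustering is a greedy single pass with a fixed threshold, and the center of a cluster is the member with the highest total similarity to its cluster. The function names and the input layout are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def summarize_documents(docs: list[list[list[str]]], threshold: float = 0.5) -> list[list[str]]:
    """docs[d][p] is the token list of the sentence at position p of document d."""
    sentences = [(d, p, toks) for d, doc in enumerate(docs) for p, toks in enumerate(doc)]
    vectors = [Counter(toks) for _, _, toks in sentences]

    # Greedy clustering: join the first cluster whose seed sentence is similar enough.
    clusters: list[list[int]] = []
    for i in range(len(sentences)):
        for cluster in clusters:
            if cosine(vectors[i], vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])

    # One center per cluster: the member most similar, in total, to its cluster.
    centers = [
        max(cluster, key=lambda i: sum(cosine(vectors[i], vectors[j]) for j in cluster))
        for cluster in clusters
    ]

    # Order abstract sentences by position within their document, then by document.
    centers.sort(key=lambda i: (sentences[i][1], sentences[i][0]))
    return [sentences[i][2] for i in centers]
```

Because each cluster contributes a single sentence, near-duplicate reports of the same fact from different websites collapse into one abstract sentence, which is the redundancy-removal effect described in point 7.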

1.2 Automatic abstracting technique for Chinese documents

1.2.1 Automatic extraction: automatic extraction treats the text as a linear sequence of sentences and each sentence as a linear sequence of words

1) Calculating the weight of the word;

2) Calculating the weight of sentences;

3) Arranging all sentences in the original text in descending order of weight, and determining a plurality of sentences with highest weight as abstract sentences;

4) Outputting all abstract sentences according to the appearance sequence of the abstract sentences in the original text;
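The four steps above can be sketched as a short Python function. The weighting scheme is an assumption, since the source does not define one: a word's weight is taken to be its relative frequency in the text, and a sentence's weight the mean weight of its words. The name `extract_abstract` and the parameter `top_n` are hypothetical.

```python
from collections import Counter


def extract_abstract(sentences: list[list[str]], top_n: int = 2) -> list[int]:
    """Return the indices of the top_n highest-weight sentences, in original order."""
    # Step 1: word weight = relative frequency in the whole text (assumed scheme).
    counts = Counter(word for sent in sentences for word in sent)
    total = sum(counts.values())
    word_weight = {w: c / total for w, c in counts.items()}

    # Step 2: sentence weight = mean weight of the words in the sentence.
    sent_weight = [
        sum(word_weight[w] for w in sent) / len(sent) if sent else 0.0
        for sent in sentences
    ]

    # Step 3: rank sentences by descending weight and keep the best top_n.
    ranked = sorted(range(len(sentences)), key=lambda i: sent_weight[i], reverse=True)

    # Step 4: output the chosen sentences in their order of appearance.
    return sorted(ranked[:top_n])
```

Returning indices rather than text keeps the original sentences untouched, so the caller can print them exactly as they appeared in the source.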

1.2.2 Chinese automatic abstract design

1) Text preprocessing: punctuation marks are used to divide the original text into chapters, paragraphs and sentences, converting the input text into a sequence of sentences;

2) Filtering: removing irrelevant sentences;

3) Word segmentation: the document is segmented using a given Chinese vocabulary; text that cannot be matched is treated as single characters, and no part-of-speech judgment is needed; invalid content words are removed according to the stop word list;

4) Statistical analysis: word weights are calculated by collecting and analyzing statistics on the terms in the sentences, and the document keywords are determined;

5) Extracting the abstract: the weight of each sentence is calculated and the sentences are sorted by weight;

6) Outputting the abstract: the abstract is output according to the user's requirements.
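The word segmentation step (point 3) can be illustrated with forward maximum matching, a common dictionary-based technique. The source only says that a given Chinese vocabulary is used and that unmatched text is handled as single characters, so the choice of forward maximum matching is an assumption, as are the function name `segment` and the tiny vocabulary in the usage example.

```python
def segment(text: str, vocabulary: set[str], stop_words: set[str]) -> list[str]:
    """Forward maximum matching against a vocabulary, followed by stop-word removal."""
    max_len = max([len(word) for word in vocabulary] + [1])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; a single character always matches,
        # which is the fallback for text the vocabulary cannot cover.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return [token for token in tokens if token not in stop_words]
```

For example, with the vocabulary `{"互联网", "数据", "抽取", "系统"}` and the stop word list `{"的"}`, the text `"互联网数据的抽取系统"` is segmented into four words and the particle `的` is removed.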

Preferably, the main task of the text preprocessing module is to divide the document into chapters, paragraphs and sentences, mainly on the basis of punctuation marks. These marks may differ considerably in their grammatical or semantic effect, but for text preprocessing they all serve as sentence delimiters, and the input text is marked up with its chapter, paragraph and sentence information. In addition, abstract sentences are mainly declarative sentences; special sentence types such as exclamatory and interrogative sentences generally do not directly express the central topic of the article, so these sentence types are not processed during preprocessing. When the document is segmented, full-width and half-width punctuation marks are distinguished to ensure accurate recognition of the text. The various punctuation marks are processed and the structure of the text is recognized, so that the text is finally split into sentence units.
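Sentence splitting on both full-width and half-width punctuation can be sketched with a regular expression. The set of terminators and the labelling of each sentence as a statement, question or exclamation are assumptions added for illustration; the label is only recorded here so that a later stage can decide what to do with non-declarative sentences. A half-width period is treated as a terminator, so this simple version would wrongly split decimal numbers such as 3.14.

```python
import re

# A sentence is a run of non-terminators followed by an optional terminator.
# Both full-width (。！？) and half-width (.!?) marks are recognised.
_SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_sentences(text: str) -> list[tuple[str, str]]:
    """Split text into (sentence, kind) pairs using sentence-ending punctuation."""
    result = []
    for match in _SENTENCE.finditer(text):
        sentence = match.group().strip()
        if not sentence:
            continue
        if sentence[-1] in "？?":
            kind = "question"
        elif sentence[-1] in "！!":
            kind = "exclamation"
        else:
            kind = "statement"
        result.append((sentence, kind))
    return result
```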

Preferably, the main function of the statistical analysis module is to count word frequencies, calculate the weights of terms and extract keywords.
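A minimal sketch of the statistical analysis module follows, assuming that a term's weight is simply its relative frequency among the segmented tokens; the source does not give the weighting formula. The function name `extract_keywords` and the parameter `top_k` are hypothetical.

```python
from collections import Counter


def extract_keywords(tokens: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Count term frequencies, turn them into weights, and return the top_k terms."""
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {term: count / total for term, count in counts.items()}
    # Sort by descending weight, breaking ties alphabetically for determinism.
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))[:top_k]
```

The input is expected to be the output of word segmentation with stop words already removed, so that function words do not dominate the frequency counts.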

Preferably, the abstract extraction module is the foundation and the core module of the automatic abstract system. Its main function is to assign weights to the sentences in the document using a weight assignment algorithm and to extract abstract sentences: sentences that contain the main content of the document can serve as abstract sentences and form the abstract. Whether the abstract sentences are selected properly directly determines the quality of the abstract, so the abstract sentence extraction module is very important.

Preferably, in the Internet data extraction system, the automatic abstract flow is as follows:

1) First, the main content of the article, namely its keywords, is identified: the system scans the full text and matches it against a Chinese vocabulary, removes the words that appear in the stop word list, extracts the words that appear in the vocabulary, and then screens them to determine the keywords;

2) Under current technical conditions, a computer can compose complete sentences from keywords by analyzing them, but this is complex to implement and the technology is still very immature. The simpler current approach therefore remains to extract original sentences from the source text as key sentences. The system uses a sentence-weight statistics method: the relevant sentences are weighted according to predefined rules, and abstract sentences are then selected according to the weighting results;

3) Combining and outputting the article abstract: because abstract sentences are declarative, interrogative and exclamatory sentences are removed first; the weights are then combined appropriately and the selected sentences are sorted by their weighted scores; finally, the sentences whose weight exceeds the threshold are arranged in their order of appearance in the original text to form the document abstract, which is output.
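The final selection step of this flow can be sketched as follows, under stated assumptions: the weight of a sentence is taken to be the sum of the weights of the keywords it contains, which is one possible "predefined rule" and not necessarily the one used by the invention, and the keyword weights and threshold are supplied by the caller. The function name `build_abstract` is hypothetical.

```python
def build_abstract(sentences: list[str], keywords: dict[str, float], threshold: float) -> str:
    """Keep declarative sentences whose keyword weight exceeds the threshold."""
    selected = []
    for sentence in sentences:
        # Abstract sentences are declarative, so questions and exclamations are dropped.
        if sentence.rstrip().endswith(("?", "？", "!", "！")):
            continue
        # Sentence weight = sum of the weights of the keywords it contains (assumed rule).
        weight = sum(w for keyword, w in keywords.items() if keyword in sentence)
        if weight > threshold:
            selected.append(sentence)
    # Sentences were scanned in document order, so the abstract keeps that order.
    return "".join(selected)
```

Filtering against a threshold while scanning in document order gives the same result as sorting by weight, cutting at the threshold and then restoring the original order.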

The beneficial effects of the invention are as follows: the automatic abstract can automatically extract the main ideas or central content of the original text; the abstract is concise, objective, understandable and readable, and can be applied to any field.

Drawings

Fig. 1 is a schematic operation diagram of an internet data extraction system according to the present invention.

Detailed Description

The technical schemes in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention.

Referring to Fig. 1, an Internet data extraction system comprises a data acquisition module, an automatic abstract compiling module and a system function module, wherein the data acquisition module comprises keyword acquisition and key media acquisition, and the system function module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module.

Keyword acquisition uses search engine technology to automatically search for preset keywords and performs URL deduplication, key information extraction, storage in the database and other processing on the search results, thereby monitoring sensitive information on the Internet. The system defines two modes for searching Internet information: breadth search and depth search. Breadth search calls the top-ranked search engines in the Internet industry to search for the keywords and then integrates, deduplicates and classifies the search results, maximizing the capability to search Internet information. Depth search uses the open-source crawler Nutch to mine user-specified websites in depth and find the web pages that match the keywords.

In the Internet data extraction system, the keyword acquisition function includes the following points:

In the Internet data extraction system, the key media collection function includes the following points:

1.1 Chinese multi-document automatic abstract

1.2 Automatic abstracting technique for Chinese documents

1) Calculating the weight of the word;

2) Calculating the weight of sentences;

1.2.2 Chinese automatic abstract design

2) Filtering: removing irrelevant sentences;

Preferably, the main task of the text preprocessing module is to divide the document into chapters, paragraphs and sentences, mainly on the basis of punctuation marks. These marks may differ considerably in their grammatical or semantic effect, but for text preprocessing they all serve as sentence delimiters, and the input text is marked up with its chapter, paragraph and sentence information. In addition, abstract sentences are mainly declarative sentences; special sentence types such as exclamatory and interrogative sentences generally do not directly express the central topic of the article, so these sentence types are not processed during preprocessing. When the document is segmented, full-width and half-width punctuation marks are distinguished to ensure accurate recognition of the text. The various punctuation marks are processed and the structure of the text is recognized, so that the text is finally split into sentence units.

The statistical analysis module is mainly used for counting word frequencies, calculating the weights of terms and extracting keywords.

The abstract extraction module is the foundation and the core module of the automatic abstract system. Its main function is to assign weights to the sentences in the document using a weight assignment algorithm and to extract abstract sentences: sentences that contain the main content of the document can serve as abstract sentences and form the abstract. Whether the abstract sentences are selected properly directly determines the quality of the abstract, so the abstract sentence extraction module is very important.

In the Internet data extraction system, the automatic abstract flow is as follows:

The foregoing is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical scheme and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. The Internet data extraction system comprises a data acquisition module, an automatic abstract compiling module and a system functional module, and is characterized in that the data acquisition module comprises a keyword acquisition module and a key media acquisition module, and the system functional module comprises a text preprocessing module, a word segmentation module, a statistical analysis module, an abstract extraction module and an abstract output module.

2. The Internet data extraction system according to claim 1, wherein keyword collection uses search engine technology to automatically search for preset keywords and performs URL deduplication, key information extraction, storage in the database and other processing on the search results, thereby monitoring sensitive information on the Internet; the system defines two modes for searching Internet information, namely breadth search and depth search; breadth search calls the top-ranked search engines in the Internet industry to search for the keywords and integrates, deduplicates and classifies the search results, thereby maximizing the capability to search Internet information; and depth search uses the open-source crawler Nutch to mine user-specified websites in depth and find the web pages that match the keywords.

3. The internet data extraction system of claim 1, wherein the keyword collection function comprises the following:

4) Providing a URL tag parsing function: content under specific tags, including title, date, author, metadata and body text, is extracted and classified;

4. The internet data extraction system of claim 1, wherein the key media collection function comprises the following:

1.1 Chinese multi-document automatic abstract

1.2 Automatic abstracting technique for Chinese documents

1) Calculating the weight of the word;

2) Calculating the weight of sentences;

1.2.2 Chinese automatic abstract design

2) Filtering: removing irrelevant sentences;

5. The Internet data extraction system according to claim 1, wherein the main task of the text preprocessing module is to divide the document into chapters, paragraphs and sentences, mainly on the basis of punctuation marks; these marks may differ considerably in their grammatical or semantic effect, but for text preprocessing they all serve as sentence delimiters, and the input text is marked up with its chapter, paragraph and sentence information; in addition, abstract sentences are mainly declarative sentences, and special sentence types such as exclamatory and interrogative sentences generally do not directly express the central topic of the article, so these sentence types are not processed during preprocessing; when the document is segmented, full-width and half-width punctuation marks are distinguished to ensure accurate recognition of the text; and the various punctuation marks are processed and the structure of the text is recognized, so that the text is finally split into sentence units.

6. The internet data extraction system according to claim 1, wherein the statistical analysis module has a main function of counting word frequencies, calculating weights of terms, and extracting keywords.

7. The Internet data extraction system according to claim 1, wherein the abstract extraction module is the foundation and the core module of the automatic abstract system, and its main function is to assign weights to the sentences in the document using a weight assignment algorithm and to extract abstract sentences: sentences that contain the main content of the document can serve as abstract sentences and form the abstract, and whether the abstract sentences are selected properly directly determines the quality of the abstract, so the abstract sentence extraction module is very important.

8. The internet data extraction system of claim 1, wherein the automatic summarization process is as follows: