CN117592470A

CN117592470A - Low-cost gazette data extraction method driven by large language model

Info

Publication number: CN117592470A
Application number: CN202311475771.1A
Authority: CN
Inventors: 伍三威
Original assignee: Shenzhen Xiaoying Network Technology Co ltd
Current assignee: Shenzhen Xiaoying Network Technology Co ltd
Priority date: 2023-11-08
Filing date: 2023-11-08
Publication date: 2024-02-23

Abstract

The invention relates to a method and system capable of efficiently extracting predefined index data from gazette web pages on conventional computer equipment. The method mainly comprises four steps. First, the content of the gazette text is obtained by using a crawler tool and a text extraction algorithm. Second, a table picture recognition tool is used to recognize the table pictures in the report, and all information is integrated into plain text. Third, the text is divided into segments, and the index names and data in each segment are extracted through a large language model API with specific prompt words. Finally, word vectors are generated for all extracted index names and for the predefined index names, word vector matching is used for screening, and the large language model API is called to make a secondary judgment that determines and records whether each predefined index exists among the extracted indexes. In this way, the system automatically extracts a large number of data indexes from web page gazettes released in different regions and remarkably improves market research efficiency.

Description

Low-cost gazette data extraction method driven by large language model

1. Technical field

The invention belongs to the field of computer data processing, and particularly relates to a method for extracting index data from online gazette information, involving deep learning, the application of large language models, and natural language processing.

2. Background art

In market research activities, it is often necessary to compile a number of predefined indexes for each province and region. Predefined indexes are the index data that have been selected for collection and analysis before the market research or data analysis is carried out. They are typically preset for evaluating a particular performance, effect, or other relevant factor. For example, in consumer market research, predefined indexes such as "resident population", "per capita consumption expenditure", and "per capita disposable income" are set in advance.

However, these index data are scattered across the statistical gazettes published on different websites. During research they often have to be looked up manually one by one, and they are not available as well-organized tables. To extract them into tables for statistical analysis, a data extraction system is needed that can accurately understand and extract the indexes in the text. Large language models are commonly used for this kind of problem. A large language model is a deep learning model trained on massive text data that can understand the input text and requirements and output corresponding answers. Compared with earlier language models, large language models greatly expand the parameter count, the pre-training data, and the total computation, have a stronger ability to understand and process text, can give answers that meet the requirements in light of the context, and can complete many natural language processing tasks such as question-answering dialogue, text summarization, text polishing, and information extraction.

Investigation shows that extracting gazette data with the tools and methods currently on the market still faces some problems:

(1) Directly calling a large language model API to accomplish the task. A large language model API (Application Programming Interface) is a software interface that allows a developer to send information to the large language model over the network for processing and to obtain the output; it is the dominant way of using large language models. However, when the entire report is input to the large language model API with a request to determine whether the predefined indexes are present and to extract their data, a context limit is encountered. For example, the text memory length of the domestic large language model product "text-to-speech" is limited to 3000 words, whereas gazettes and similar reports usually exceed 7000 words and may reach tens of thousands of words. Together with the large number of predefined index names that must also be input, this makes it difficult for current mainstream large language models to take in and process so much information at one time.

(2) Combining a large language model API with a word vector database for knowledge retrieval. For example, the document browsing tool in "religion" reads all the content of a text, generates a word vector for each sentence, and records them all in a database; it then uses word vector matching to find the content related to the predefined indexes the user wants to extract and inputs that content to the large language model API to generate a reply. This method requires a word vector to be generated for every sentence to support retrieval, which is costly, demands high-end hardware, and extracts index data very inefficiently. In the current environment where GPU computing power is scarce, the cost of completing a large number of data extraction tasks is relatively high, so the method is poorly suited to individual users and small enterprises.

Therefore, there is still no low-cost method for extracting gazette content data in batches that can run as multiple concurrent instances on a conventional networked computer under limited GPU computing power, and this patent is the first to propose one.

3. Summary of the invention

In view of the defects of the prior art, the invention aims to meet the market analysis and research needs of individual users and small enterprises for extracting predefined index data from gazettes, and provides a low-cost method that can run as multiple concurrent instances on a conventional networked computer and can complete data extraction tasks for large batches of online gazette content.

To achieve the above purpose, the invention is mainly realized by four modules: web page text acquisition, report information analysis, index data extraction, and word vector matching and language model judgment.

(1) The web page text acquisition module acquires all dynamically loaded web page information from a specified gazette website, then obtains the gazette text from the web page information using a web page text extraction algorithm, and downloads any attachment documents in the gazette web page for later processing.

(2) The report information analysis module parses the text in attachment documents in formats such as PDF and DOC, recognizes the table information in the table pictures in the report, and combines all the information, in its original order, into one piece of plain text.

(3) The index data extraction module divides the consolidated text into a plurality of paragraphs whose length does not exceed a set length, calls a large language model API with a specific prompt word to extract all index data in each paragraph, the index data comprising index names and index values, and finally reads all index data returned in a JSON list format and stores it in a local database.

(4) The word vector matching and large language model judgment module generates word vectors for the names of all index data actually extracted by the index data extraction module, and generates word vectors for the names of the predefined indexes and their synonymous names. Using word vector matching, it calculates the similarity between the word vectors corresponding to each predefined index name and all the word vectors of the extracted names, and screens out, from all the actually extracted indexes, those whose similarity is higher than a set threshold. Finally, for each predefined index, it calls a large language model API to make a secondary judgment on the screened index results and determine whether data for the predefined index exists.

Further, the web page text acquisition module uses a fully automatic crawler tool to acquire all dynamically loaded web page information and a text-density-based web page text extraction algorithm to obtain the content of the gazette text; it can start running once the gazette website address is input.

Further, the report information analysis module has a picture recognition function and an attachment document content extraction function. The picture recognition function uses a lightweight AI model that can run on a local conventional computer and obtains the text of each cell and the cell position information in a table picture.

Further, the index data extraction module inputs the text of each segment together with the specific prompt word to the large language model API to extract the data indexes in that segment. The specific prompt word comprises a requirement to extract index data and a requirement to return it in a JSON list format, where the JSON list format is of the form "[ { index name: index value }, … ]".

Furthermore, in the word vector matching and large language model judgment module, predefined index names and their synonymous names can be added manually before the method is run, and a word vector match on a synonymous name also counts as a match for the corresponding predefined index name. The word vector matching function uses a lightweight word vector model that can run on a local conventional computer. A group of indexes whose similarity is higher than a set threshold is screened out through word vector matching, and a large language model API is then called to judge whether any of them is synonymous with the predefined index name; if so, the data is recorded in the table, and if not, the entry is left blank.

Compared with the prior art, the method first uses the large language model to extract the indexes from the full text in a single pass and then generates word vectors only for the index names, which greatly reduces the amount of content for which word vectors must be generated and the hardware consumption. Word vector matching is then used to screen out the index data that is highly similar to the predefined indexes, and a large language model makes the final secondary judgment, which reduces the amount of information that must be input into the large language model.

The method can complete the task of extracting predefined index data from a large number of web page gazettes even when the context memory length of the large language model is limited and the GPU (graphics processing unit) resources of the local conventional computer are insufficient.

4. Description of the drawings

The drawings serve to further illustrate the invention, but the embodiments in the drawings do not constitute any limitation of the invention, and other drawings can be obtained by a person skilled in the art without inventive effort from the following drawings.

FIG. 1 is a schematic diagram of the overall module operation flow of the present invention;

FIG. 2 shows a recognition result for a table picture in the report information analysis module, in which FIG. 2A is an example table picture and FIG. 2B is an example of the HTML text result, with cell position information, recognized from FIG. 2A;

FIG. 3 is an example of the specific prompt word used in step two of the index data extraction module to direct the large language model to extract all of the index data in each segment;

FIG. 4 is an example of the index data extraction result returned by the large language model API according to the JSON list format in step three of the index data extraction module;

FIG. 5 is an example of a template that supplements each predefined indicator name with multiple synonym names in the word vector matching and language model decision module;

FIG. 6 is an example of the final result of sorting the decision results of all predefined indicators into a table in the word vector matching and language model decision module.

5. Detailed description of the preferred embodiments

It is to be understood that, based on the technical solution of the present invention, those skilled in the art may propose various interchangeable structures and implementations without departing from the true spirit of the invention. Accordingly, the following detailed description and the accompanying drawings merely illustrate the technical solution and are not to be taken as an exhaustive or limiting definition of it.

In order that the contents of the present invention may be more clearly understood, embodiments of the present invention will be further described in detail with reference to the accompanying drawings.

A large language model driven low-cost gazette data extraction method comprises a web page text acquisition module, a report information analysis module, an index data extraction module, and a word vector matching and language model judgment module. The flow is shown in FIG. 1: after the gazette website address of each region and the names of the predefined indexes are input in batches, the four modules process them in sequence to produce a statistical table of the predefined index data for each region.

(1) The webpage text acquisition module comprises the following working steps:

Step one: each time one of the batch-input gazette website links is read, an automated crawler tool (such as Selenium) is called to open the link and acquire all the dynamically loaded web page information.

Step two: a text-density-based web page text extraction algorithm is applied to filter out information such as navigation columns and decorations and obtain the report body text from the web page. The body of the report may include table pictures and attachment documents, which are downloaded automatically when the body is obtained.
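By way of illustration only, the text-density filtering in step two might be sketched as follows. The source does not specify the algorithm; the line-based heuristic, the names `line_density` and `extract_body`, and the two thresholds are assumptions introduced for this sketch.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
# Script and style blocks never contain visible body text, so drop them first.
NOISE_RE = re.compile(r"<(script|style)\b.*?</\1>", re.IGNORECASE | re.DOTALL)


def line_density(line: str) -> float:
    """Share of a raw HTML line that is visible text rather than markup."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def extract_body(html: str, min_density: float = 0.5, min_chars: int = 10) -> str:
    """Keep lines that are mostly text and long enough to be body content."""
    cleaned = NOISE_RE.sub("", html)
    kept = []
    for line in cleaned.splitlines():
        text = TAG_RE.sub("", line).strip()
        # Navigation bars and decorations are short and markup-heavy.
        if len(text) >= min_chars and line_density(line) >= min_density:
            kept.append(text)
    return "\n".join(kept)
```

Lines made up mostly of links and tags score a low density and are discarded, while long paragraphs of report text pass through. A production extractor would likely work on DOM blocks rather than raw lines.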

Because the large language model can only process plain text, whereas the report content may contain pictures and attachments such as PDF (Portable Document Format) files in addition to plain text, the information in the pictures (mostly table pictures) and in the attachment documents must be fully parsed and recognized as plain text.

(2) The report information analysis module comprises the following working steps:

Step one: parse the attachment documents. Since no suitable tool for parsing PDF files directly is available at present, the open-source document conversion tool pdf2docx is used to convert PDF files into docx files, and the Windows application automation tool win32com.client is used to control Microsoft Word to convert doc files into docx. Finally, the open-source tool docx is used to parse the text and pictures in the converted docx files, which are saved for later use.
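The conversion chain in step one could look roughly like the sketch below. It assumes the pdf2docx, pywin32 (`win32com.client`), and python-docx (`docx`) packages are installed and, for doc files, that Microsoft Word is available on Windows. The helper names are hypothetical, `FileFormat=16` is Word's `wdFormatDocumentDefault` constant, and only paragraph text is read here; picture extraction is left out.

```python
from pathlib import Path


def conversion_steps(path: str) -> list[str]:
    """Name the tools needed to turn an attachment into parsed docx content."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return ["pdf2docx", "docx"]
    if suffix == ".doc":
        return ["word_com", "docx"]
    if suffix == ".docx":
        return ["docx"]
    return []  # unsupported attachment type: skip it


def pdf_to_docx(pdf_path: str, docx_path: str) -> None:
    from pdf2docx import Converter  # third-party, imported lazily
    converter = Converter(pdf_path)
    try:
        converter.convert(docx_path)
    finally:
        converter.close()


def doc_to_docx(doc_path: str, docx_path: str) -> None:
    import win32com.client  # pywin32; requires Windows with Microsoft Word
    word = win32com.client.Dispatch("Word.Application")
    try:
        # Word's COM interface expects absolute paths.
        document = word.Documents.Open(str(Path(doc_path).resolve()))
        document.SaveAs(str(Path(docx_path).resolve()), FileFormat=16)
        document.Close()
    finally:
        word.Quit()


def read_docx_text(docx_path: str) -> list[str]:
    from docx import Document  # python-docx
    return [p.text for p in Document(docx_path).paragraphs if p.text.strip()]
```

The lazy imports keep the dispatcher usable on machines where only some of the converters are installed.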

Step two: the pictures in the body and the pictures parsed from the attachments are recognized by a table picture recognition tool (such as PaddleOCR) as text in HTML format (shown in FIG. 2), and each is listed separately as a piece of text. All text and picture recognition results are then recombined, in their original order, into plain text.

The method uses a table picture recognition tool equipped with AI models to recognize the table information in pictures. The tool uses a lightweight table recognition model, text detection model, text recognition model, and layout analysis model, all of which can run on a conventional personal computer.
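Once a table picture has been recognized as HTML, it still has to be folded back into the plain text. A minimal sketch of that flattening step, using only the Python standard library, is given below. The class and function names and the " | " cell separator are assumptions of this sketch, and merged cells (`rowspan`, `colspan`) are not handled.

```python
from html.parser import HTMLParser


class TableTextParser(HTMLParser):
    """Collect the cell text of an HTML table row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._in_cell = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._in_cell = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._in_cell:
            if not self.rows:  # tolerate cells that appear without a <tr>
                self.rows.append([])
            self.rows[-1].append("".join(self._buffer).strip())
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._buffer.append(data)


def table_html_to_text(html: str) -> str:
    """Render each table row as one line of text, cells separated by ' | '."""
    parser = TableTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows if row)
```

Keeping one row per line preserves the pairing of index names and values, which is what the later extraction step relies on.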

(3) The index data extraction module comprises the following working steps:

Step one: all the plain text obtained is divided into segments of no more than 1000 words each, and each segment must end with a complete sentence.
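One way to realize this segmentation is sketched below. It assumes that length is counted in characters and that sentences end at Chinese terminal punctuation or a line break; both are assumptions of this sketch, as is the function name `split_into_segments`.

```python
import re

# Zero-width split point right after a sentence terminator or a line break.
SENTENCE_END = re.compile(r"(?<=[。！？；\n])")


def split_into_segments(text: str, max_len: int = 1000) -> list[str]:
    """Greedily pack whole sentences into segments of at most max_len characters."""
    sentences = [s for s in SENTENCE_END.split(text) if s]
    segments, current = [], ""
    for sentence in sentences:
        # Close the current segment if the next sentence would overflow it.
        if current and len(current) + len(sentence) > max_len:
            segments.append(current)
            current = ""
        # A single sentence longer than max_len is hard-split as a last resort.
        while len(sentence) > max_len:
            segments.append(sentence[:max_len])
            sentence = sentence[max_len:]
        current += sentence
    if current:
        segments.append(current)
    return segments
```

Concatenating the returned segments reproduces the input exactly, so the segment index can later be used to trace an index value back to its source text.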

Step two: a specific prompt word (Prompt) is added to each text segment, which is then input to the large language model API with the requirement that all data indexes be extracted. Prompt words are words or phrases used to guide a large language model in understanding and processing a particular task; they help the model understand the requirements and context of the current task and thereby generate more accurate results. The prompt words used in this step are shown in FIG. 3.
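The prompting step might be wired up as in the following sketch. The prompt wording here is hypothetical (the prompt actually used is the one in FIG. 3), and `call_llm` stands for whatever client function sends a prompt to the chosen large language model API and returns its text reply.

```python
from typing import Callable

# Hypothetical wording; it only mirrors the two stated requirements:
# extract the index data, and return it as a JSON list.
PROMPT_TEMPLATE = (
    "Extract every statistical indicator and its value from the text below.\n"
    'Return only a JSON list in the form [{{"indicator name": "indicator value"}}, ...].\n'
    "If the text contains no indicators, return [].\n\n"
    "Text:\n{segment}"
)


def build_extraction_prompt(segment: str) -> str:
    """Attach the extraction instructions to one text segment."""
    return PROMPT_TEMPLATE.format(segment=segment)


def extract_from_segments(segments: list[str], call_llm: Callable[[str], str]) -> list[str]:
    """Send one prompt per segment and collect the raw replies in segment order."""
    return [call_llm(build_extraction_prompt(segment)) for segment in segments]
```

Passing the API client in as a function keeps the sketch independent of any particular vendor and makes it easy to substitute a stub during testing.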

Step three: the index data returned in the JSON list format (shown in FIG. 4) is extracted and stored in Excel or a database. The segment from which each index comes can be recorded at the same time to facilitate later checking.
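Reading the returned JSON list and tagging each index with its source segment can be sketched as follows. The record layout (`segment`, `name`, `value`) and the function names are assumptions of this sketch; the resulting records could then be written to Excel or a database.

```python
import json
import re


def parse_indicator_list(reply: str) -> list[tuple[str, str]]:
    """Pull (name, value) pairs out of a reply shaped like [{"name": "value"}, ...]."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)  # tolerate prose around the list
    if match is None:
        return []
    try:
        items = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []  # malformed reply: treat the segment as yielding nothing
    pairs = []
    for item in items:
        if isinstance(item, dict):
            pairs.extend((str(name), str(value)) for name, value in item.items())
    return pairs


def collect_records(replies: list[str]) -> list[dict]:
    """Flatten per-segment replies into records that remember their segment."""
    records = []
    for segment_index, reply in enumerate(replies):
        for name, value in parse_indicator_list(reply):
            records.append({"segment": segment_index, "name": name, "value": value})
    return records
```

Model replies are not guaranteed to be valid JSON, so the parser returns an empty list rather than raising when a reply cannot be decoded.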

(4) In a gazette, an index that means the same as a predefined index is often named with different wording, so the actually extracted indexes cannot be matched one-to-one with the required predefined indexes directly, and a word vector matching method is used instead. Word vectors are a technique for converting words into vector data such that the vectors of synonyms, or of words that frequently occur together, are closer to each other in cosine distance, so the similarity of two words can be obtained by computing the cosine of the angle between their vectors.

The calculation formula is as follows:

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}$$

where A and B are two word vectors, A · B is the dot product of the two vectors, and ‖A‖ and ‖B‖ are the moduli of the vectors A and B, respectively (i.e., the lengths of the vectors).
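The cosine similarity described above can be computed directly from its definition, as in this short sketch. Returning 0.0 for a zero vector is a convention chosen here, not something stated in the source.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # similarity is undefined for a zero vector
    return dot / (norm_a * norm_b)
```

For example, the vectors (3, 4) and (4, 3) have dot product 24 and lengths 5 and 5, giving a similarity of 24/25 = 0.96.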

The word vector matching and language model judging module comprises the following working steps:

Step one: generate word vectors for all the extracted index names, and generate word vectors for the predefined index names and their synonymous names.

Step two: using a word vector matching tool (such as the text2vec library), calculate the similarity between the group of word vectors for each predefined index and its synonymous names and the word vectors of the extracted index names obtained in step one, and screen out, from all the extracted indexes, those whose similarity is higher than a set threshold. In this implementation, for each predefined index, the matching results whose similarity is greater than a threshold of 0.6 are screened from all the extracted indexes.

Because of the performance limitations of the word vector model, word vector matching may miss names that are highly relevant. To ensure that the matching results fully cover synonyms, several synonymous names can be set in advance for each predefined index name, as shown in FIG. 5. The word vectors generated from the synonymous names are also treated as word vectors of the predefined index name, and their matching results are also treated as similarity matches for the predefined index name, which effectively compensates for this weakness of word vector matching.
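The screening with synonymous names can be sketched as below. The `toy_embed` function is only a stand-in based on character bigram counts so that the example runs without a model; in practice the `embed` argument would wrap a real word vector model such as one from text2vec. All names here are hypothetical, and the 0.6 threshold follows the value given in the text.

```python
import math
from collections import Counter
from typing import Callable

Vector = dict[str, float]


def toy_embed(name: str) -> Vector:
    """Stand-in embedding: character-bigram counts. NOT a real word vector model."""
    padded = f" {name.lower()} "
    return dict(Counter(padded[i:i + 2] for i in range(len(padded) - 1)))


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def screen_candidates(
    predefined: dict[str, list[str]],
    extracted_names: list[str],
    embed: Callable[[str], Vector] = toy_embed,
    threshold: float = 0.6,
) -> dict[str, list[str]]:
    """For each predefined index, keep extracted names similar to it or to any synonym."""
    extracted_vecs = {name: embed(name) for name in extracted_names}
    result = {}
    for indicator, synonyms in predefined.items():
        # The index name and its synonyms are queried as one group.
        query_vecs = [embed(q) for q in [indicator, *synonyms]]
        result[indicator] = [
            name for name, vec in extracted_vecs.items()
            if max(cosine(q, vec) for q in query_vecs) > threshold
        ]
    return result
```

Only the short index names are embedded, never the full gazette text, which is where the hardware savings claimed for the method come from.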

Step three: call the large language model API, input each predefined index name together with its matching results for the large language model to judge, determine whether data conforming to the predefined index exists among them, and again require the result to be returned in the JSON list format. In this way it can be determined whether each predefined index is present, and all the judgment results are organized into a table as the final result, as illustrated in FIG. 6.
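The secondary judgment might be organized as in the following sketch. The prompt wording, the candidate record layout (`name`, `value`), and the function names are assumptions; `call_llm` again stands for a client function for the chosen large language model API. An index for which no match is confirmed is left blank in the result row.

```python
import json
from typing import Callable


def build_judgment_prompt(indicator: str, candidates: list[dict]) -> str:
    # Hypothetical wording; the source does not give the prompt used for this step.
    listing = json.dumps(candidates, ensure_ascii=False)
    return (
        f'Target indicator: "{indicator}".\n'
        f"Candidate indicators extracted from the gazette: {listing}\n"
        "If one candidate means the same as the target, return it as a JSON list "
        'like [{"name": "...", "value": "..."}]; otherwise return [].'
    )


def judge_indicators(
    screened: dict[str, list[dict]], call_llm: Callable[[str], str]
) -> dict[str, str]:
    """Return one table row: predefined index -> value, or "" when absent."""
    row = {}
    for indicator, candidates in screened.items():
        row[indicator] = ""  # blank unless the model confirms a match
        if not candidates:
            continue  # nothing passed the screening, so no API call is needed
        try:
            verdict = json.loads(call_llm(build_judgment_prompt(indicator, candidates)))
        except json.JSONDecodeError:
            continue
        if isinstance(verdict, list) and verdict and isinstance(verdict[0], dict):
            row[indicator] = str(verdict[0].get("value", ""))
    return row
```

Because only the few screened candidates are sent for each predefined index, each judgment prompt stays far below the context limit of the model.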

Using word vector matching for preliminary screening in this module further reduces the amount of information input into the large language model, improves the efficiency of information extraction, and reduces the cost of using the large language model API. The final result is then determined in combination with the large language model's judgment, which ensures the accuracy of data extraction.

The foregoing is merely a preferred embodiment of the present invention, provided so that those skilled in the art can understand or practice the invention, and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present patent shall fall within its protection scope.

Claims

1. A large language model driven low-cost gazette data extraction method, characterized in that data is extracted by corresponding modules, the extraction method comprising the following steps:

a. the webpage text acquisition module is used for:

i. acquiring all dynamically loaded web page information from a specified gazette website;

ii. acquiring the gazette text from the web page information using a web page text extraction algorithm;

iii. downloading the attachment documents in the gazette for later processing;

b. the report information analysis module is used for:

i. parsing the text in attachment documents in formats such as PDF and DOC;

ii. recognizing the table information in the table pictures in the report;

iii. combining all the information, in its original order, into one piece of plain text;

c. the index data extraction module is used for:

i. dividing the consolidated text into a plurality of paragraphs whose length does not exceed a set length;

ii. calling a large language model API with a specific prompt word to extract all index data in each paragraph, the index data comprising index names and index values;

iii. reading and storing all index data returned in a JSON list format;

d. the word vector matching and large language model judging module is used for:

i. generating word vectors for the names of all index data actually extracted in the previous step;

ii. generating word vectors for the names of the predefined indexes and their synonymous names, calculating, by word vector matching, the similarity between the word vectors corresponding to each predefined index name and all word vectors obtained in the previous step, and screening out, from all the actually extracted indexes, those whose similarity is higher than a set threshold;

iii. calling a large language model API for each predefined index to make a secondary judgment on the index results screened in the previous step, and determining whether data for the predefined index exists.

2. The large language model driven low-cost gazette data extraction method according to claim 1, wherein the web page text acquisition module uses a fully automatic crawler tool to acquire all dynamically loaded web page information and a text-density-based web page text extraction algorithm to obtain the content of the gazette text, and can start running once the gazette website address is input.

3. The large language model driven low-cost gazette data extraction method according to claim 1, wherein the report information analysis module has a picture recognition function and an attachment document content extraction function, and the picture recognition function uses a lightweight AI model that can run on a local conventional computer and obtains the text of each cell and the cell position information in a table picture.

4. The large language model driven low-cost gazette data extraction method according to claim 1, wherein the index data extraction module inputs the text of each segment together with a specific prompt word to a large language model API to extract the data indexes in that segment; the specific prompt word comprises a requirement to extract index data and a requirement to return it in a JSON list format; and the JSON list format is of the form "[ { index name: index value }, … ]".

5. The large language model driven low-cost gazette data extraction method according to claim 1, wherein in the word vector matching and large language model judgment module, a plurality of predefined index names and their synonymous names can be added manually before the method is run; the word vector matching function uses a lightweight word vector model that can run on a local conventional computer; and a group of indexes whose similarity is higher than a set threshold is screened out through word vector matching, a large language model API is called to judge whether any of them is synonymous with the predefined index name, the data is recorded in a table if so, and the entry is left blank if not.

6. The large language model driven low-cost gazette data extraction method according to claim 1, wherein the method is capable of running fully automatically from start to finish on a networked conventional computer, and the performance of a single device is sufficient to run multiple instances of the program simultaneously.