CN113127595B - Method, device, equipment and storage medium for extracting viewpoints and details from a research report abstract

Info

Publication number: CN113127595B
Application number: CN202110451466.3A
Authority: CN
Inventors: 王静; 贾宁
Original assignee: Chinascope Shanghai Technology Co ltd
Current assignee: Chinascope Shanghai Technology Co ltd
Priority date: 2021-04-26
Filing date: 2021-04-26
Publication date: 2022-08-16
Anticipated expiration: 2041-04-26
Also published as: CN113127595A

Abstract

The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a storage medium for extracting viewpoints and details from a research report abstract. The method comprises the following steps: acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file; searching for the abstract part in the text blocks, and acquiring the abstract text from the corresponding text data; analyzing effective classification features in the abstract text according to text features, and classifying the abstract text in paragraph order according to the effective classification features; and extracting the viewpoints and details of the abstract text according to the classification. The invention can process research reports in various complex formats, can accurately delimit the abstract part of a research report, and can adaptively select the features used for classification when dividing viewpoints and details.

Description

Method, device, equipment and storage medium for extracting viewpoints and details from a research report abstract

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a storage medium for extracting viewpoints and details from a research report abstract.

Background

Company research reports on the market are numerous and information-dense. To help professional investors consolidate the key viewpoints in this mass of research reports, the viewpoints and their corresponding details need to be extracted from the abstract part of each research report. Research reports issued by different brokerages differ in format and wording, which makes extracting viewpoints and details difficult.

Chinese patent CN 107358208A, "Structured information extraction method and device for PDF document", proposes a structured information extraction method for PDF documents. The method first deletes the table of contents, header and footer information from the original pages to obtain the actual pages, and then extracts each title and the body content belonging to it from the actual pages according to title level. When obtaining the actual pages, whether a page is a table of contents and whether headers and footers exist are judged by keywords or rules, and this reliance on keywords or rules limits the generalization ability of the method.

Disclosure of Invention

The technical problem to be solved by the invention is that viewpoints and their corresponding details are difficult to extract from company research reports that are information-dense and complex in format. The invention aims to provide a method, a device, equipment and a storage medium for extracting viewpoints and details from a research report abstract, which can accurately extract viewpoints and details from research reports in various complex formats.

The method for extracting viewpoints and details from a research report abstract comprises the following steps:

acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file;

searching for the abstract part in the text blocks, and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part;

analyzing effective classification features in the abstract text according to text features, and classifying the abstract text in paragraph order according to the effective classification features;

and extracting the viewpoints and details of the abstract text according to the classification.

Optionally, the acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file includes:

acquiring the research report file, and parsing it with a preset parsing tool to obtain complete text paragraph data containing text features, coordinates and page numbers; dividing the complete text paragraph data into a plurality of text data according to the text features, wherein the text features comprise at least one of character color, character size, character font or bold, and the coordinates of the text data comprise an X coordinate, a Y coordinate, a text width and a text height;

and performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks, wherein the target detection result is the coordinates, page numbers and category information of a plurality of targets in the research report file, and the coordinates of a text block comprise an X coordinate, a Y coordinate, the width of the text block and the height of the text block.

Optionally, the category information includes at least one of, or a combination of, a report title, a special structure, a statistical chart, a structure chart, a table, a chart title, a chart annotation, a header, a footer, body text or a body title.

Optionally, the performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks includes:

acquiring the research report file, converting each page of the research report file into a picture to obtain picture files, calling the target detection model, and inputting the picture files into the target detection model to obtain the target detection result.

Optionally, after the performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks, the method further includes:

screening out, from the plurality of text blocks, the text blocks whose category information is body text or body title.

Optionally, after the acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file, the method further includes filtering the text data according to the text blocks to obtain filtered text data:

reconstructing the coordinates of the text data and of the screened text blocks respectively, wherein the reconstruction is as follows: starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from the Y coordinate on each subsequent page, and the result is then accumulated onto the Y coordinate of the preceding pages;

and judging whether the coordinates of each text data fall within the coordinate range of a text block; if so, keeping the text data, and otherwise discarding the text data.

Optionally, the searching for the abstract part in the text blocks and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part includes:

taking the starting position of the text blocks as the start of the abstract part; if any text block contains a preset abstract end mark, the abstract part extends up to and includes the text block containing the abstract end mark;

if no text block contains the abstract end mark, obtaining the end position according to the distance between two adjacent text blocks, and determining the text block at which the abstract part ends.

Optionally, the obtaining the end position according to the distance between two adjacent text blocks and determining the text block at which the abstract part ends includes:

when the distance between two adjacent text blocks is greater than a preset distance threshold, determining that the abstract part ends at the previous text block;

and when consecutive picture-type or table-type blocks lie between two adjacent text blocks, determining that the abstract part ends at the previous text block.

Optionally, the analyzing effective classification features in the abstract text according to text features and classifying the abstract text in paragraph order according to the effective classification features includes:

counting the text features of all abstract text in the research report file, and analyzing the quantity and distribution of each text feature; if any text feature meets a preset quantity condition and a preset distribution condition, regarding that text feature as an effective classification feature, and classifying, in paragraph order, whether each paragraph of the abstract text contains a title;

and if none of the text features in the research report file meets the conditions, classifying every paragraph of the abstract text as a paragraph containing a title.

Optionally, the extracting the viewpoints and details of the abstract text according to the classification includes:

if a paragraph is classified as not containing a title, merging it into the details of the previous paragraph, the first paragraph being treated by default as containing a title;

if a paragraph is classified as containing a title, judging whether the text at the beginning of the paragraph is in bold font; if it is, the bold text is the viewpoint and the rest is the details; otherwise, the first sentence of the paragraph is the viewpoint and the rest is the details.

An apparatus for extracting viewpoints and details from a research report abstract, comprising:

the data acquisition module is used for acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file;

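the abstract text acquisition module is used for searching for the abstract part in the text blocks and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part;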

the classification module is used for analyzing effective classification features in the abstract text according to text features and classifying the abstract text in paragraph order according to the effective classification features;

and the viewpoint and detail extraction module is used for extracting the viewpoints and details of the abstract text according to the classification.

A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above-described method for extracting viewpoints and details from a research report abstract.

A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described method for extracting viewpoints and details from a research report abstract.

The positive effects of the invention are as follows: with the method, the device, the equipment and the storage medium for extracting viewpoints and details from a research report abstract, the invention can process research reports in various complex formats, can accurately delimit the abstract part of a research report, and can adaptively select the features used for classification when dividing viewpoints and details.

Drawings

FIG. 1 is a schematic flow chart of the method of the present invention;

FIG. 2 is a schematic flow chart of extracting viewpoints and details according to the present invention.

Detailed Description

In order to make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to the drawings.

Referring to FIG. 1, a method for extracting viewpoints and details from a research report abstract includes:

S1, acquiring data: acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file.

The research report in this step is a research report issued by a brokerage, which is usually published as a PDF file. For a research report PDF file, the text data must first be parsed and the text blocks detected before viewpoints and details can be extracted.

In one embodiment, step S1 includes:

S101, parsing the research report file to acquire text data: acquiring the research report file, and parsing it with a preset parsing tool to obtain complete text paragraph data containing text features, coordinates and page numbers; dividing the complete text paragraph data into a plurality of text data according to the text features, wherein the text features comprise at least one of character color, character size, character font or bold, and the coordinates of the text data comprise an X coordinate, a Y coordinate, a text width and a text height.

When parsing the research report file, existing parsing tools such as pdfminer or pdfplumber may be used. The complete text paragraph data produced by the parsing tool is discrete, so it is divided into a plurality of text data according to the different text features, and each text data is stored separately.
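The division of parsed output into text data by text feature can be sketched as follows. This is a minimal illustration, not the patented implementation: the character record layout (keys `text`, `fontname`, `size`, `color`, `x0`, `x1`, `top`, `bottom`) is an assumption loosely modeled on what PDF parsers commonly return, and `TextData` and `split_by_features` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class TextData:
    """One run of text sharing the same features (hypothetical structure)."""
    text: str
    x: float
    y: float
    width: float
    height: float
    page: int
    features: Tuple  # (font name, font size, color)

    @property
    def bold(self) -> bool:
        # Assumption: bold faces carry "bold" in their font name.
        return "bold" in str(self.features[0]).lower()


def split_by_features(chars: List[Dict], page: int) -> List[TextData]:
    """Merge consecutive characters whose text features are identical."""
    runs: List[TextData] = []
    for ch in chars:
        feats = (ch["fontname"], round(ch["size"], 1), ch.get("color"))
        if runs and runs[-1].features == feats:
            run = runs[-1]
            # Compute the enlarged bounding box before mutating the run.
            right = max(run.x + run.width, ch["x1"])
            bottom = max(run.y + run.height, ch["bottom"])
            run.text += ch["text"]
            run.x = min(run.x, ch["x0"])
            run.y = min(run.y, ch["top"])
            run.width = right - run.x
            run.height = bottom - run.y
        else:
            runs.append(TextData(
                text=ch["text"], x=ch["x0"], y=ch["top"],
                width=ch["x1"] - ch["x0"], height=ch["bottom"] - ch["top"],
                page=page, features=feats,
            ))
    return runs
```

Each resulting `TextData` keeps its own X coordinate, Y coordinate, width and height, so it can later be matched against the detected text blocks by position.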

S102, acquiring text blocks through target detection: performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks, wherein the target detection result is the coordinates, page numbers and category information of a plurality of targets in the research report file, and the coordinates of a text block comprise an X coordinate, a Y coordinate, the width of the text block and the height of the text block.

The category information in this step includes at least one of, or a combination of, a report title, a special structure, a statistical chart, a structure chart, a table, a chart title, a chart annotation, a header, a footer, body text or a body title.

Performing target detection on the research report file through the preset target detection model to obtain the target detection result as the text blocks includes: acquiring the research report file, converting each page of the research report file into a picture to obtain picture files, calling the target detection model, and inputting the picture files into the target detection model to obtain the target detection result.

Before this step, the target detection model may be trained in advance to obtain a trained target detection model. Before training, research reports from multiple brokerages are collected and their PDF pages are converted into a large number of pictures; for example, research reports from 30 brokerages are collected and converted into 5000 pictures. All the pictures are annotated to form a training set and a test set. The model is preferably trained with the YOLO target detection algorithm, and the model parameters are adjusted repeatedly until the model achieves the desired effect on the test set.

After this step, the method further includes: screening out, from the plurality of text blocks, the text blocks whose category information is body text or body title.

A research report contains text such as the title, the analysts, headers and footers, whereas the extraction of abstract viewpoints and details targets only the body of the research report. The body part therefore needs to be delimited so that the extracted content contains no redundant text and loses no content. The invention divides out the effective text blocks using a target detection model trained with a target detection algorithm, and selects the text blocks whose category information is body text or body title.
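The screening of detection results by category can be illustrated with a short sketch. The `Block` structure and the category strings `"body_text"` and `"body_title"` are assumptions made for illustration; a trained detector would emit whatever label names were used during annotation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical label names for the two categories that are kept.
BODY_CATEGORIES = {"body_text", "body_title"}


@dataclass
class Block:
    """One detected target: category, page number and bounding box."""
    category: str
    page: int
    x: float
    y: float
    width: float
    height: float


def keep_body_blocks(detections: List[Block]) -> List[Block]:
    """Keep only blocks classified as body text or body title."""
    return [b for b in detections if b.category in BODY_CATEGORIES]
```

Headers, footers, report titles, charts and tables are dropped at this point, which is what later lets the coordinate check discard any text data lying outside the body.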

In one embodiment, step S1 further includes:

S103, filtering the text data according to the text blocks to obtain filtered text data: reconstructing the coordinates of the text data and of the screened text blocks respectively, wherein the reconstruction is as follows: starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from the Y coordinate on each subsequent page, and the result is then accumulated onto the Y coordinate of the preceding pages; and judging whether the coordinates of each text data fall within the coordinate range of a text block; if so, keeping the text data, and otherwise discarding the text data.

The result obtained from the target detection algorithm is the coordinates of each text block and the page number on which it lies, the coordinates comprising the X coordinate, the Y coordinate, the width of the text block and the height of the text block. The Y coordinates of the text data and of the text blocks are therefore reconstructed in the same manner: starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from the Y coordinate, and the result is accumulated onto the Y coordinate of the preceding pages. After reconstruction, the page number information of the text data and the text blocks is no longer needed, which simplifies later use and keeps the code concise. Finally, if the coordinates of a text data fall within the range of a text block, the data is retained; otherwise it is discarded.
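One way to read the Y-coordinate reconstruction and the containment check is sketched below. It is an interpretation under stated assumptions, not the patent's exact formula: Y is measured downward from the top of each page, every page has a known height, header height and footer height (`PageLayout` is a hypothetical structure), and a small tolerance `tol` absorbs detector imprecision.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Box:
    page: int  # 1-based page number
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class PageLayout:
    height: float
    header: float
    footer: float


def content_ends(layouts: List[PageLayout]) -> List[float]:
    """Global Y at which each page's content ends (footer excluded)."""
    ends: List[float] = []
    for i, p in enumerate(layouts):
        if i == 0:
            ends.append(p.height - p.footer)
        else:
            ends.append(ends[-1] + p.height - p.header - p.footer)
    return ends


def to_global(box: Box, layouts: List[PageLayout]) -> Box:
    """Move a box into one continuous coordinate space across pages."""
    if box.page == 1:
        return box  # the first page is left unchanged
    ends = content_ends(layouts)
    current = layouts[box.page - 1]
    return replace(box, y=ends[box.page - 2] + box.y - current.header)


def inside(inner: Box, outer: Box, tol: float = 1.0) -> bool:
    """True when `inner` lies within `outer`, allowing a small tolerance."""
    return (outer.x - tol <= inner.x
            and inner.x + inner.width <= outer.x + outer.width + tol
            and outer.y - tol <= inner.y
            and inner.y + inner.height <= outer.y + outer.height + tol)


def filter_text(text_boxes: List[Box], blocks: List[Box],
                layouts: List[PageLayout]) -> List[Box]:
    """Keep text boxes that fall inside any screened text block."""
    global_blocks = [to_global(b, layouts) for b in blocks]
    kept = []
    for t in text_boxes:
        g = to_global(t, layouts)
        if any(inside(g, b) for b in global_blocks):
            kept.append(g)
    return kept
```

Because text data and text blocks go through the same `to_global` mapping, the comparison no longer needs page numbers, which matches the stated motivation for the reconstruction.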

S2, acquiring the abstract text: searching for the abstract part in the text blocks, and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part.

Not all of the data in the text blocks belongs to the abstract of the research report; for example, the "analyst commitment" or the detailed content of the report does not. The start of the text blocks can be taken as the start of the abstract, but the end position of the abstract still has to be determined. In some research reports the abstract lies entirely on the first page, while in others it runs onto later pages, for example up to page 5, so both the start position and the end position of the abstract need to be determined.

In one embodiment, step S2 includes:

taking the starting position of the text blocks as the start of the abstract part; if any text block contains a preset abstract end mark, the abstract part extends up to and includes the text block containing the abstract end mark; if no text block contains the abstract end mark, the end position is obtained according to the distance between two adjacent text blocks, and the text block at which the abstract part ends is determined.

Most research report abstracts end with a "risk warning" heading or with the specific content of the risk warning. Therefore, when the text blocks are used to filter the text and search for the abstract part, if a text block contains an abstract end mark such as risk warning content, the abstract part of the research report file extends up to and includes that text block.

In other reports the abstract lies far from the rest of the text, or a picture or table sits between the abstract and the rest of the text; in these cases the end position of the abstract part can be determined from the distance between two text blocks.

In one embodiment, obtaining the end position according to the distance between two adjacent text blocks and determining the text block at which the abstract part ends includes:

when the distance between two adjacent text blocks is greater than a preset distance threshold, determining that the abstract part ends at the previous text block; when consecutive picture-type or table-type blocks lie between two adjacent text blocks, determining that the abstract part ends at the previous text block.

If two text blocks are far apart, or consecutive pictures or tables lie between them, this marks the boundary between the abstract and the detailed content, and the abstract ends at the previous text block.

After the abstract part has been found among the text blocks, the text blocks have coordinates that correspond to the coordinates in the text data, so the corresponding abstract text can be found directly from the text data according to the coordinates of the abstract part.
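The end-of-abstract rules described above can be sketched as a single pass over the detected blocks. The sketch makes several assumptions that the text does not fix: the end-mark strings, the distance threshold `max_gap`, reading "consecutive" pictures or tables as a run of at least `min_visual_run` such blocks, and the category label names are all illustrative.

```python
from dataclasses import dataclass
from typing import List, Sequence

TEXT_CATEGORIES = {"body_text", "body_title"}  # hypothetical labels
VISUAL_CATEGORIES = {"figure", "table"}        # hypothetical labels


@dataclass
class Block:
    category: str
    y: float       # reconstructed (cross-page) top coordinate
    height: float
    text: str = ""


def find_abstract_blocks(
    blocks: List[Block],
    end_markers: Sequence[str] = ("风险提示", "Risk warning"),
    max_gap: float = 80.0,
    min_visual_run: int = 2,
) -> List[Block]:
    """Return the text blocks that make up the abstract, in reading order."""
    abstract: List[Block] = []
    visual_run = 0  # pictures/tables seen since the last accepted text block
    for b in sorted(blocks, key=lambda blk: blk.y):
        if b.category in VISUAL_CATEGORIES:
            visual_run += 1
            continue
        if b.category not in TEXT_CATEGORIES:
            continue
        if abstract:
            prev = abstract[-1]
            gap = b.y - (prev.y + prev.height)
            # Rule 2: a large gap or a run of pictures/tables ends the abstract
            # at the previous text block.
            if gap > max_gap or visual_run >= min_visual_run:
                break
        visual_run = 0
        abstract.append(b)
        # Rule 1: a block carrying an end mark is the last abstract block.
        if any(marker in b.text for marker in end_markers):
            break
    return abstract
```

The coordinates of the returned blocks can then be used to pull the matching text data, since both share the same reconstructed coordinate space.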

S3, classifying the abstract text: analyzing effective classification features in the abstract text according to text features, and classifying the abstract text in paragraph order according to the effective classification features.

After the abstract text in the research report file has been delimited, the viewpoints of the research report and the details belonging to each viewpoint are extracted. In the abstract text, a viewpoint and its details sometimes share one paragraph, and sometimes a viewpoint is followed by two paragraphs of details. Because research report files differ in format, there is no unified standard for dividing viewpoints from details, so a classification standard specific to the current research report is needed. By counting the text features in the abstract text, the usable text features are identified as effective classification features, and each paragraph is classified according to the effective classification features by whether or not it contains a title.

In one embodiment, step S3 includes:

counting the text features of all abstract text in the research report file, and analyzing the quantity and distribution of each text feature; if any text feature meets a preset quantity condition and a preset distribution condition, regarding that text feature as an effective classification feature, and classifying, in paragraph order, whether each paragraph of the abstract text contains a title; if none of the text features in the research report file meets the conditions, classifying every paragraph of the abstract text as a paragraph containing a title.

First, the text features of all abstract text in the current research report file are counted, and the quantity of each text feature and its distribution across the whole abstract text are analyzed. The text features include whether the characters are bold, the character color, whether the text is a section title, and so on. If a text feature meets the preset quantity condition and the preset distribution condition, it is an effective classification feature. For example, if the character color feature satisfies the conditions that at least 3 paragraphs have red font and that the red-font paragraphs are interspersed among other paragraphs, red font is considered an effective classification feature. Once an effective classification feature has been identified, each paragraph is judged for whether it contains a title and is thereby classified into one of two categories: paragraphs containing a title and paragraphs not containing a title. If no text feature in the abstract satisfies the conditions, every paragraph is classified as a paragraph containing a title.
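The adaptive feature selection can be sketched as below. The quantity condition follows the "at least 3 paragraphs" example in the text; the distribution condition is only one possible reading of "interspersed among other paragraphs" (at least one paragraph without the feature lies between the first and last paragraph that has it). The function names and the per-paragraph dictionary of feature flags are hypothetical.

```python
from typing import Dict, List, Sequence


def is_effective(flags: List[bool], min_count: int = 3) -> bool:
    """Check the quantity and distribution conditions for one text feature."""
    positions = [i for i, has_feature in enumerate(flags) if has_feature]
    if len(positions) < min_count:
        return False  # quantity condition not met
    # Distribution condition (assumed): the featured paragraphs are not one
    # contiguous run, i.e. other paragraphs sit between them.
    span = positions[-1] - positions[0] + 1
    return span > len(positions)


def classify_has_title(paragraphs: List[Dict[str, bool]],
                       feature_names: Sequence[str]) -> List[bool]:
    """Return, per paragraph, whether it is classified as containing a title."""
    for name in feature_names:
        flags = [bool(p.get(name, False)) for p in paragraphs]
        if is_effective(flags):
            return flags
    # Fallback from the text: no effective feature, so every paragraph
    # is treated as containing a title.
    return [True] * len(paragraphs)
```

For instance, `classify_has_title(paragraphs, ["bold_start", "red_font"])` tries the features in the given order and uses the first one that qualifies; the order of preference is itself an assumption of this sketch.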

S4, extracting viewpoints and details: extracting the viewpoints and details of the abstract text according to the classification.

In this step, after the paragraphs have been classified in step S3 by whether or not they contain a title, the viewpoints and details are extracted according to that classification.

In one embodiment, step S4 includes:

If a paragraph is classified as not containing a title, it is merged into the details of the previous paragraph, the first paragraph being treated by default as containing a title. If a paragraph is classified as containing a title, it is judged whether the text at the beginning of the paragraph is in bold font; if it is, the bold text is the viewpoint and the rest is the details; otherwise, the first sentence of the paragraph is the viewpoint and the rest is the details.

In this way, accurate viewpoints and details can be extracted from every paragraph of the abstract text in the research report file.
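The extraction rules of step S4 can be sketched as follows. The `Paragraph` structure, the `bold_prefix` field (the bold run at the start of a paragraph, if any) and the sentence-boundary pattern are assumptions for illustration; the text does not specify how the first sentence is delimited.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Paragraph:
    text: str
    has_title: bool
    bold_prefix: str = ""  # bold text at the start of the paragraph, if any


# Assumed sentence terminators: Chinese full stops and marks, or an ASCII
# period followed by whitespace or the end of the text.
_SENTENCE_END = re.compile(r"[。！？!?]|\.(?=\s|$)")


def split_paragraph(p: Paragraph) -> Tuple[str, str]:
    """Split a title-bearing paragraph into (viewpoint, details)."""
    if p.bold_prefix and p.text.startswith(p.bold_prefix):
        return p.bold_prefix.strip(), p.text[len(p.bold_prefix):].strip()
    match = _SENTENCE_END.search(p.text)
    if match is None:
        return p.text.strip(), ""
    return p.text[:match.end()].strip(), p.text[match.end():].strip()


def extract(paragraphs: List[Paragraph]) -> List[Tuple[str, str]]:
    """Return (viewpoint, details) pairs in paragraph order."""
    results: List[List[str]] = []
    for i, p in enumerate(paragraphs):
        if i == 0 or p.has_title:  # the first paragraph defaults to a title
            results.append(list(split_paragraph(p)))
        else:
            # No title: merge into the details of the previous viewpoint.
            previous = results[-1]
            previous[1] = (previous[1] + "\n" + p.text.strip()).strip()
    return [(view, detail) for view, detail in results]
```

A paragraph without a title never creates a new viewpoint; it only lengthens the details of the most recent one, which is how a single viewpoint can end up with two paragraphs of details.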

In one embodiment, referring to FIG. 2, the process of extracting the viewpoints and details of a research report file includes the following steps:

1) acquiring the text data and the text block information;

2) reconstructing the Y coordinates of the text data and of the text blocks;

3) judging whether each text data lies within a text block, and discarding the text data if it does not;

4) if the text data lies within the text blocks, judging whether the text blocks contain the abstract end mark; if not, obtaining the end position according to the distance between text blocks and then acquiring the abstract text;

5) if so, acquiring the abstract text directly according to the abstract end mark;

6) retaining the abstract text data and counting its text features;

7) analyzing the usable features and using them as effective classification features for classification;

8) judging, in paragraph order, whether the current paragraph contains a title; if not, merging the current paragraph into the details of the previous paragraph;

9) if so, judging whether the text at the beginning of the current paragraph is in bold font; if not, taking the first sentence of the paragraph as the viewpoint and the remaining sentences as the details;

10) if the bold font exists, taking the bold text as the viewpoint and the rest as the details.

By using the text block information, the invention automatically filters out information such as the report title, the rating, the analysts, headers and footers, so that no redundant text is extracted, and it determines the end position of the abstract by combining the abstract end mark with the distance between text blocks. The accuracy and completeness of the abstract part are thereby guaranteed. Because research report formats are diverse, viewpoints and details cannot be divided in a single uniform manner. When dividing viewpoints and details, the invention first counts the text features of the abstract text and analyzes which classification basis suits that abstract, which solves the problem that no single division basis can be applied across many research report formats. The method can process research reports in various complex formats, can accurately delimit the abstract part of a research report, and can adaptively select the features used for classification when dividing viewpoints and details.

In one embodiment, an apparatus for extracting viewpoints and details from a research report abstract is provided, including:

the abstract text acquisition module is used for searching for the abstract part in the text blocks and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part;

the classification module is used for analyzing effective classification features in the abstract text according to text features and classifying the abstract text in paragraph order according to the effective classification features;

and the viewpoint and detail extraction module is used for extracting the viewpoints and details of the abstract text according to the classification.

In one embodiment, a computer device is provided, which includes a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above method for extracting viewpoints and details from a research report abstract.

In one embodiment, a storage medium storing computer-readable instructions is provided. The computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the method for extracting viewpoints and details from a research report abstract of the above embodiments. The storage medium may be a nonvolatile storage medium.

Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: Read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.

The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A method for extracting viewpoints and details from a research report abstract, characterized by comprising the following steps:

searching for the abstract part in the text blocks, and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part, specifically comprising:

if no text block contains the abstract end mark, obtaining the end position according to the distance between two adjacent text blocks, and determining the text block at which the abstract part ends;

analyzing effective classification features in the abstract text according to text features, and classifying the abstract text in paragraph order according to the effective classification features, specifically comprising:

if none of the text features in the research report file meets the conditions, classifying every paragraph of the abstract text as a paragraph containing a title;

2. The method for extracting viewpoints and details from a research report abstract according to claim 1, wherein the acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file comprises:

3. The method for extracting viewpoints and details from a research report abstract according to claim 2, wherein the category information includes at least one of, or a combination of, a report title, a special structure, a statistical chart, a structure chart, a table, a chart title, a chart annotation, a header, a footer, body text or a body title.

4. The method for extracting viewpoints and details from a research report abstract according to claim 2, wherein the performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks comprises:

5. The method for extracting viewpoints and details from a research report abstract according to claim 2, wherein the performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks comprises:

6. The method for extracting viewpoints and details from a research report abstract according to claim 1, wherein after the acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file, the method further comprises filtering the text data according to the text blocks to obtain filtered text data:

7. The method for extracting viewpoints and details from a research report abstract according to claim 1, wherein the obtaining the end position according to the distance between two adjacent text blocks and determining the text block at which the abstract part ends comprises:

8. The method for extracting viewpoints and details from a research report abstract according to claim 1, wherein the extracting the viewpoints and details of the abstract text according to the classification comprises:

if a paragraph is classified as not containing a title, merging it into the details of the previous paragraph, the first paragraph being treated by default as containing a title;

9. An apparatus for extracting viewpoints and details from a research report abstract, comprising:

the abstract text acquisition module is configured to search for the abstract part in the text blocks and acquire the abstract text from the corresponding text data according to the coordinates of the abstract part, and specifically includes:

the classification module is used for analyzing effective classification features in the abstract text according to text features and classifying the abstract text in paragraph order according to the effective classification features, and specifically includes:

10. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for extracting viewpoints and details from a research report abstract according to any one of claims 1 to 8.

11. A storage medium having computer-readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of the method for extracting viewpoints and details from a research report abstract according to any one of claims 1 to 8.