CN113127595B - Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract - Google Patents

Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Publication number
CN113127595B
Authority
CN
China
Prior art keywords
text
abstract
research
details
report
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110451466.3A
Other languages
Chinese (zh)
Other versions
CN113127595A (en)
Inventor
王静
贾宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinascope Shanghai Technology Co ltd
Original Assignee
Chinascope Shanghai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinascope Shanghai Technology Co ltd
Priority to CN202110451466.3A
Publication of CN113127595A
Application granted
Publication of CN113127595B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing and relates in particular to a method, an apparatus, a device and a storage medium for extracting viewpoints and details from the abstract of a research report. The method comprises: acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file; locating the abstract part among the text blocks, and acquiring the abstract text from the corresponding text data; analyzing the text features to identify effective classification features in the abstract text, and classifying the abstract text in paragraph order according to the effective classification features; and extracting the viewpoints and details of the abstract text according to the classification. The invention can process research reports in a variety of complex formats, can accurately delimit the abstract part of a research report, and adaptively selects the features used for classification when separating viewpoints from details.

Description

Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method, a device, equipment and a storage medium for extracting viewpoint details of a research and report abstract.
Background
Research reports on listed companies are published in large numbers and carry dense, complex information. To help professional investors consolidate the key viewpoints scattered across this mass of reports, the viewpoints and their corresponding details need to be extracted from the abstract of each report. Research reports issued by different securities firms differ in format and wording, which makes extracting viewpoints and details difficult.
Chinese patent CN 107358208A, "Structured information extraction method and device for PDF document", proposes a method for extracting structured information from PDF documents. The method first deletes the table of contents, headers and footers from the original pages to obtain the actual pages, and then extracts each title, together with the text belonging to it, from the actual pages according to title level. When obtaining the actual pages, whether a page is a table of contents and whether headers or footers are present is decided by keywords or rules, and this reliance on keywords and rules limits how well the method generalizes.
Disclosure of Invention
The technical problem to be solved is that viewpoints and their corresponding details are difficult to extract from company research reports, which are numerous and carry complex information. The invention therefore provides a method, an apparatus, a device and a storage medium for extracting viewpoints and details from research report abstracts, capable of accurately extracting viewpoints and details from research reports in a variety of complex formats.
The method for extracting the viewpoint details of the research and report abstract comprises the following steps:
acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file;
searching an abstract part in the text block, and acquiring an abstract text from the corresponding text data according to the coordinate of the abstract part;
analyzing effective classification features in the abstract text according to text features, and classifying the abstract text according to paragraph sequences according to the effective classification features;
and extracting the viewpoint and the details of the abstract text according to the classification.
Optionally, acquiring the research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file includes:
acquiring a research report file and parsing it with a preset parsing tool to obtain complete text paragraph data containing text features, coordinates and page numbers, and dividing the complete text paragraph data into a plurality of text data according to the text features, wherein the text features comprise at least one of character color, character size, font or bold, and the coordinates of the text data comprise an X coordinate, a Y coordinate, a text width and a text height;
and performing target detection on the research report file with a preset target detection model, the target detection result serving as the text blocks, wherein the target detection result comprises the coordinates, page number and category information of a plurality of targets in the research report file, and the coordinates of a text block comprise an X coordinate, a Y coordinate, a text block width and a text block height.
Optionally, the category information includes at least one of, or a combination of, a report title, a special structure, a statistical chart, a structure chart, a table, a chart title, a chart annotation, a header, a footer, a text or a text title.
Optionally, performing target detection on the research report file with the preset target detection model to obtain the text blocks includes:
acquiring the research report file, converting each page of the research report file into a picture to obtain picture files, calling the target detection model, and inputting the picture files into the target detection model to obtain the target detection result.
Optionally, after performing target detection on the research report file with the preset target detection model to obtain the text blocks, the method further includes:
screening out, from the plurality of text blocks, the text blocks whose category information is the text or the text title.
Optionally, after acquiring the plurality of text data and the plurality of text blocks from the research report file, the method further includes filtering the text data according to the text blocks to obtain filtered text data:
reconstructing the coordinates of the text data and of the screened text blocks respectively, wherein, starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from each Y coordinate on the current page, and the result is added to the accumulated Y coordinate of the preceding pages;
and judging whether the coordinates of each text data fall within the coordinate range of a text block; if so, keeping the text data, otherwise discarding it.
Optionally, the searching for the abstract portion in the text block and obtaining the abstract text from the corresponding text data according to the coordinates of the abstract portion includes:
taking the starting position of the text blocks as the start of the abstract part; if any text block contains a preset abstract ending mark, the abstract part ends at that text block, which itself belongs to the abstract part;
if no text block contains the abstract ending mark, obtaining the ending position according to the distance between two adjacent text blocks, and thereby determining the text block at which the abstract part ends.
Optionally, obtaining the ending position according to the distance between two adjacent text blocks and determining the text block at which the abstract part ends includes:
when the distance between two adjacent text blocks is greater than a preset distance threshold, determining that the abstract part ends at the former text block;
and when consecutive blocks of the picture type or table type lie between two adjacent text blocks, determining that the abstract part ends at the former text block.
Optionally, the analyzing effective classification features in the abstract text according to text features, and classifying the abstract text according to paragraph orders according to the effective classification features include:
counting the text features of all the abstract text in the research report file and analyzing the number and distribution of each text feature; if any text feature meets a preset number condition and a preset distribution condition, treating that text feature as an effective classification feature, and classifying, in paragraph order, each paragraph of the abstract text as containing a title or not;
and if no text feature in the research report file meets the conditions, classifying every paragraph of the abstract text as containing a title.
Optionally, extracting the viewpoints and details of the abstract text according to the classification includes:
if a paragraph is classified as not containing a title, merging it into the details of the previous paragraph, the first paragraph being treated as containing a title by default;
if a paragraph is classified as containing a title, judging whether the beginning of the paragraph is in bold; if so, the bold text is the viewpoint and the rest is the details; otherwise, the first sentence of the paragraph is the viewpoint and the rest is the details.
An opinion detail extracting apparatus for a research summary, comprising:
the data acquisition module is used for acquiring a research report and acquiring a plurality of text data and a plurality of text blocks from the research report file;
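the abstract text acquisition module is used for searching the abstract part in the text blocks and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part;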
the classification module is used for analyzing effective classification features in the abstract text according to text features and classifying the abstract text according to paragraph sequences according to the effective classification features;
and the viewpoint and detail extracting module is used for extracting the viewpoints and details of the abstract text according to classification.
A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the above-described method of opinion detail extraction of a research and report summary.
A storage medium having stored thereon computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above-described method of opinion detail extraction of a research summary.
The beneficial effects of the invention are as follows: with the above method, apparatus, device and storage medium for extracting viewpoints and details from research report abstracts, the invention can process research reports in a variety of complex formats, can accurately delimit the abstract part of a research report, and adaptively selects the features used for classification when separating viewpoints from details.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic flow chart of extracting viewpoints and details according to the invention.
Detailed Description
To make the technical means, inventive features, objectives and effects of the invention easy to understand, the invention is further described below with reference to the drawings.
Referring to FIG. 1, a method for extracting viewpoints and details from a research report abstract includes:
S1, acquiring data: acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file.
The research report in this step is one issued by a securities firm, in most cases published as a PDF file. For a research report PDF file, the text data must first be parsed and the text blocks detected before viewpoints and details can be extracted.
In one embodiment, step S1 includes:
S101, parsing the research report file to acquire text data: acquiring a research report file and parsing it with a preset parsing tool to obtain complete text paragraph data containing text features, coordinates and page numbers, and dividing the complete text paragraph data into a plurality of text data according to the text features, wherein the text features comprise at least one of character color, character size, font or bold, and the coordinates of the text data comprise an X coordinate, a Y coordinate, a text width and a text height.
Existing parsing tools such as pdfminer or pdfplumber may be used to parse the research report. The complete text paragraph data produced by the parsing tool is fragmented, so it is divided into a plurality of text data according to their differing text features, and each is stored separately.
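The division of parsed text into text data by text features can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Char` record, its field names and the rule that a bold face has "bold" in its font name are all assumptions standing in for whatever a parsing tool actually returns.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Char:
    """One parsed character; field names are illustrative assumptions."""
    text: str
    x: float
    y: float
    width: float
    height: float
    page: int
    font: str
    size: float
    color: tuple

    @property
    def bold(self) -> bool:
        # Assumption: bold faces carry "bold" in the font name.
        return "bold" in self.font.lower()


def split_by_features(chars):
    """Group consecutive characters sharing page, font, size and color."""
    def key(c):
        return (c.page, c.font, c.size, c.color)

    runs = []
    for _, group in groupby(chars, key=key):
        group = list(group)
        # Bounding box of the run: X, Y, width and height.
        x0 = min(c.x for c in group)
        y0 = min(c.y for c in group)
        x1 = max(c.x + c.width for c in group)
        y1 = max(c.y + c.height for c in group)
        first = group[0]
        runs.append({
            "text": "".join(c.text for c in group),
            "page": first.page,
            "x": x0, "y": y0, "width": x1 - x0, "height": y1 - y0,
            "bold": first.bold, "size": first.size, "color": first.color,
        })
    return runs
```

Each returned item carries its own coordinates and features, so a bold lead-in and the regular text that follows it on the same line end up as two separate text data items.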
S102, acquiring a text block through target detection: and carrying out target detection on the research and report file through a preset target detection model to obtain a target detection result as a text block, wherein the target detection result is the coordinate, page number and category information of a plurality of targets in the research and report file, and the coordinate of the text block comprises an X coordinate, a Y coordinate, the width of the text block and the height of the text block.
The category information in this step includes at least one or a combination of a report title, a special structure, a statistical chart, a structure chart, a table, a chart title, a chart annotation, a header, a footer, a text or a text title.
Performing target detection on the research report file with the preset target detection model to obtain the text blocks includes: acquiring the research report file, converting each page of the research report file into a picture to obtain picture files, calling the target detection model, and inputting the picture files into the target detection model to obtain the target detection result.
Before this step, the target detection model may be trained in advance. To prepare for training, research reports from multiple securities firms are collected and their PDF pages are converted into a large number of pictures; for example, research reports from 30 securities firms are converted into 5000 pictures. All the pictures are annotated and split into a training set and a test set. The model is preferably trained with the YOLO target detection algorithm, and the model parameters are adjusted repeatedly until the model performs satisfactorily on the test set.
After this step, also include: and screening out the text blocks of which the category information is the text or the text title from the plurality of text blocks.
A research report contains text such as the title, analyst information, headers and footers, whereas viewpoint and detail extraction targets only the body of the report. The body therefore has to be delimited so that the extracted content contains no redundant text and loses no content. The invention delimits the effective text blocks with a target detection model trained using a target detection algorithm, and selects the text blocks whose category information is the text or the text title.
In one embodiment, step S1 further includes:
S103, filtering the text data according to the text blocks to obtain filtered text data: reconstructing the coordinates of the plurality of text data and of the screened text blocks respectively, wherein, starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from each Y coordinate on the current page, and the result is added to the accumulated Y coordinate of the preceding pages; and judging whether the coordinates of each text data fall within the coordinate range of a text block; if so, keeping the text data, otherwise discarding it.
The target detection algorithm outputs, for each text block, its coordinates and the page number on which it lies, the coordinates comprising an X coordinate, a Y coordinate, a text block width and a text block height. The Y coordinates of the text data and of the text blocks are therefore reconstructed in the same way: starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from each Y coordinate on the current page, and the result is added to the accumulated Y coordinate of the preceding pages. After reconstruction, the page numbers of the text data and text blocks are no longer needed, which simplifies later processing and keeps the code concise. Finally, a text data item is kept if its coordinates fall within the range of a text block and discarded otherwise.
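One way to read the Y-coordinate reconstruction and the containment filter is sketched below. Several things here are assumptions rather than details given in the text: Y is measured downward from the top of each page, each page's full height and header and footer heights are known, and a small tolerance `tol` absorbs detector imprecision. The names `Page`, `Box`, `page_offsets` and `filter_text_data` are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Page:
    height: float
    header_height: float
    footer_height: float


@dataclass(frozen=True)
class Box:
    x: float
    y: float          # measured downward from the top of its page
    width: float
    height: float
    page: int         # 1-based page number


def page_offsets(pages):
    """Offset to add to Y on each page so that page bodies stack end to end."""
    offsets = [0.0]
    for prev, cur in zip(pages, pages[1:]):
        # Drop the previous footer and the current header, then accumulate.
        step = prev.height - prev.footer_height - cur.header_height
        offsets.append(offsets[-1] + step)
    return offsets


def to_global(box, offsets):
    """Return a copy of the box with its Y coordinate reconstructed."""
    return replace(box, y=box.y + offsets[box.page - 1])


def inside(inner, outer, tol=1.0):
    """True if inner lies within outer, allowing a small tolerance."""
    return (inner.x >= outer.x - tol
            and inner.y >= outer.y - tol
            and inner.x + inner.width <= outer.x + outer.width + tol
            and inner.y + inner.height <= outer.y + outer.height + tol)


def filter_text_data(text_boxes, block_boxes, pages):
    """Keep only text data that falls inside some screened text block."""
    offsets = page_offsets(pages)
    blocks = [to_global(b, offsets) for b in block_boxes]
    kept = []
    for t in text_boxes:
        g = to_global(t, offsets)
        if any(inside(g, b) for b in blocks):
            kept.append(g)
    return kept
```

With two 800-unit pages, a 50-unit header and a 40-unit footer, the second page's offset is 800 - 40 - 50 = 710, so text at Y = 60 on page 2 moves to Y = 770 and header text at Y = 10 is dropped because no text block covers it.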
S2, acquiring abstract text: and searching the abstract part in the text block, and acquiring the abstract text from the corresponding text data according to the coordinates of the abstract part.
Not all of the data in the text blocks belongs to the abstract of the research report; some of it is, for example, the "analyst commitment" or the specific content of the report. The start of the text blocks can be taken as the start of the abstract, but the end position of the abstract still has to be determined. In some research reports the abstract lies entirely on the first page, while in others it continues onto later pages, for example up to page 5, so both the start position and the end position of the abstract must be determined.
In one embodiment, step S2 includes:
taking the starting position of the text block as the starting part of the abstract part, and if any text block contains a preset abstract ending mark, the text block containing the abstract ending mark belongs to the abstract part; if the text block does not contain the abstract ending mark, the ending position is obtained according to the distance between two adjacent text blocks, and the text block with the abstract ending is determined.
Most research report abstracts end with a "risk warning" heading or with the specific content of the risk warning. Therefore, when the text blocks are searched for the abstract part, a text block that contains an abstract ending mark, such as risk warning content, indicates that the abstract part of the research report file extends up to and including that text block.
In other research reports the abstract is separated from the rest of the text by a large distance, or by a picture or a table; in these cases the ending position of the abstract part can be determined from the distance between two text blocks.
In one embodiment, obtaining a cut-off position according to a distance between two adjacent text blocks, and determining the text block with the cut-off abstract part, includes:
when the distance between two adjacent text blocks is greater than a preset distance threshold, determining that the abstract part is cut off from the previous text block; when two adjacent text blocks have continuous picture types or table types, the summary part is determined to be cut off from the previous text block.
If two text blocks are far apart, or consecutive pictures or tables lie between them, this marks the boundary between the abstract and the specific content, and the abstract ends at the former text block.
After the abstract part is found in all the text blocks, the text blocks have coordinates and have corresponding relation with the coordinates in the text data, so that the corresponding abstract text can be directly found from the text data according to the coordinates of the abstract part.
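The cutoff rules of step S2 can be sketched as a single pass over the detected blocks in reading order. This is an illustrative reading only: the `Block` record, the category names, the ending mark "风险提示" (risk warning), the threshold `max_gap` and the parameter `min_media_run` are assumed values, and the function takes all detected blocks (including pictures and tables) with reconstructed Y coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    category: str     # assumed labels: "text", "text_title", "figure", "table"
    y: float          # top edge in reconstructed coordinates
    height: float
    text: str = ""


TEXT = {"text", "text_title"}
MEDIA = {"figure", "table"}


def find_abstract_blocks(blocks, end_marks=("风险提示",), max_gap=60.0,
                         min_media_run=1):
    """Return the leading text blocks that form the abstract part."""
    abstract = []
    media_run = 0  # pictures/tables seen since the last text block
    for block in blocks:
        if block.category in MEDIA:
            media_run += 1
            continue
        if block.category not in TEXT:
            continue
        if abstract:
            prev = abstract[-1]
            gap = block.y - (prev.y + prev.height)
            # A large gap, or pictures/tables in between, ends the abstract
            # at the previous text block.
            if gap > max_gap or media_run >= min_media_run:
                break
        media_run = 0
        abstract.append(block)
        # A block containing the ending mark is the last abstract block.
        if any(mark in block.text for mark in end_marks):
            break
    return abstract
```

Setting `min_media_run=2` would require at least two consecutive pictures or tables before the abstract is considered ended; which value matches a given set of reports is a tuning decision.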
S3, classifying the abstract text: effective classification features are analyzed in the abstract text according to the text features, and the abstract text is classified according to the effective classification features and paragraph sequences.
After the abstract text of the research report file has been delimited, the viewpoints of the research report and the details belonging to each viewpoint are extracted. Within the abstract text, some viewpoints share a paragraph with their details, while others are followed by two paragraphs of details. Because research report files differ in format and there is no uniform standard for separating viewpoints from details, a classification standard specific to the current research report is needed. The text features of the abstract text are counted to find usable features, which serve as the effective classification features, and each paragraph is then classified according to these features as containing a title or not.
In one embodiment, step S3 includes:
counting the text features of all the abstract text in the research report file and analyzing the number and distribution of each text feature; if any text feature meets a preset number condition and a preset distribution condition, treating that text feature as an effective classification feature, and classifying, in paragraph order, each paragraph of the abstract text as containing a title or not; if no text feature in the research report file meets the conditions, classifying every paragraph of the abstract text as containing a title.
The text features of all the abstract text in the current research report file are counted first, and the number of occurrences of each text feature and its distribution across the whole abstract text are analyzed. The text features include whether the characters are bold, the character color, whether the text is a chapter title, and so on. A text feature that meets the preset number condition and the preset distribution condition is an effective classification feature. For example, if there are at least 3 paragraphs in red font and they are interspersed with other paragraphs, the red font is considered an effective classification feature. Once the effective classification features have been identified, each paragraph is judged as containing a title or not, so that the paragraphs fall into two classes: paragraphs containing a title and paragraphs not containing a title. If no text feature in the whole abstract meets the conditions, every paragraph is classified as containing a title.
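The adaptive feature selection can be sketched as below. Each paragraph is represented by the set of feature names observed at its start; the feature names, the threshold `min_count=3` (taken from the red-font example) and the concrete reading of "interspersed" as "at least one paragraph without the feature lies between the first and last paragraph that has it" are assumptions made for illustration.

```python
def effective_features(paragraphs, candidates=("bold", "red", "heading"),
                       min_count=3):
    """Return the candidate features that qualify as classification features.

    paragraphs: list of sets of feature names, in paragraph order.
    """
    effective = []
    for feature in candidates:
        hits = [i for i, feats in enumerate(paragraphs) if feature in feats]
        # Number condition: the feature must occur often enough.
        if len(hits) < min_count:
            continue
        # Distribution condition (assumed): some paragraph lacking the
        # feature sits between the first and last paragraph having it.
        if hits[-1] - hits[0] + 1 > len(hits):
            effective.append(feature)
    return effective


def classify_paragraphs(paragraphs, **kwargs):
    """Return one boolean per paragraph: True if it contains a title."""
    effective = effective_features(paragraphs, **kwargs)
    if not effective:
        # No usable feature: every paragraph counts as containing a title.
        return [True] * len(paragraphs)
    return [any(f in feats for f in effective) for feats in paragraphs]
```

For five paragraphs in which the first, third and fifth open in red, "red" qualifies and the result alternates between title and non-title paragraphs; for an abstract with a single bold paragraph, no feature qualifies and every paragraph is treated as containing a title.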
S4, extracting viewpoints and details: and extracting the viewpoint and the details of the abstract text according to the classification.
In this step, after the paragraph is classified by whether or not the title is included in step S3, the viewpoint and the details are extracted according to the classification.
In one embodiment, step S4 includes:
if a paragraph is classified as not containing a title, it is merged into the details of the previous paragraph, the first paragraph being treated as containing a title by default. If a paragraph is classified as containing a title, it is judged whether the beginning of the paragraph is in bold; if so, the bold text is the viewpoint and the rest is the details; otherwise, the first sentence of the paragraph is the viewpoint and the rest is the details.
In this way, accurate viewpoints and details can be extracted from every paragraph of the abstract text in the research report file.
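The extraction rules of step S4 can be sketched as follows. The `Paragraph` record, its `bold_prefix` field (the bold text at the start of the paragraph, empty if there is none) and the sentence-boundary pattern are assumptions for illustration; real reports may need a more careful sentence splitter.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Paragraph:
    text: str
    has_title: bool        # result of the classification step
    bold_prefix: str = ""  # leading bold text, if any (assumed field)


# Chinese or Western sentence enders; a "." only counts when followed by
# whitespace or the end of the text, so decimals such as 35.2 stay intact.
_SENTENCE_END = re.compile(r"[。！？!?]|\.(?=\s|$)")


def split_first_sentence(text):
    """Return (first sentence, remainder)."""
    m = _SENTENCE_END.search(text)
    if m is None:
        return text.strip(), ""
    return text[:m.end()].strip(), text[m.end():].strip()


def extract_viewpoints(paragraphs):
    """Return a list of {"viewpoint": str, "details": [str, ...]}."""
    results = []
    for i, p in enumerate(paragraphs):
        # The first paragraph is treated as containing a title by default.
        if not (p.has_title or i == 0):
            results[-1]["details"].append(p.text)
            continue
        if p.bold_prefix and p.text.startswith(p.bold_prefix):
            viewpoint = p.bold_prefix.strip()
            rest = p.text[len(p.bold_prefix):].strip()
        else:
            viewpoint, rest = split_first_sentence(p.text)
        results.append({"viewpoint": viewpoint,
                        "details": [rest] if rest else []})
    return results
```

A paragraph without a title never starts a new entry; it is appended to the details of the most recent viewpoint, which is why the first paragraph must always open one.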
In one embodiment, referring to FIG. 2, the process of extracting the viewpoints and details of a research report file includes the following steps:
1) acquiring text data and text block information;
2) reconstructing the text data and the Y coordinate of the text block;
3) judging whether the text data is in a text block, and if the text data is not in any text block, discarding the text data;
4) if the text data are in the text blocks, judging whether the text blocks contain the abstract stop marks, and if not, acquiring the abstract text after acquiring the stop positions according to the distance between the text blocks;
5) if yes, directly acquiring the abstract text according to the abstract cutoff mark;
6) keeping abstract text data and counting text features;
7) analyzing the available features as effective classification features for classification;
8) judging whether the current paragraph contains a title or not from the paragraph sequence, and if not, merging the current paragraph into the details of the previous paragraph;
9) if yes, judging whether the text beginning of the current paragraph has a bold font, if not, taking the first sentence of the text as a viewpoint, and taking the other sentences as details;
10) if the bold font exists, the bold font is taken as a viewpoint, and the other is taken as details.
By using the text block information, the invention automatically filters out information such as the title, rating, analyst, headers and footers of a research report, so no redundant text is extracted, and it determines the ending position of the abstract by combining the abstract ending mark with the distance between text blocks. The accuracy and completeness of the abstract part are thereby ensured. Because research report formats are diverse, viewpoints and details cannot be separated in one uniform way. When separating them, the invention first counts the text features of the abstract text and derives a classification basis suited to that abstract, which solves the problem that no single basis works across the many research report formats. The method can process research reports in a variety of complex formats, can accurately delimit the abstract part of a research report, and adaptively selects the features used for classification when separating viewpoints from details.
In one embodiment, an apparatus for extracting viewpoints and details from a research report abstract is provided, including:
the data acquisition module is used for acquiring a research report and acquiring a plurality of text data and a plurality of text blocks from the research report file;
the text abstract obtaining module is used for searching an abstract part in the text block and obtaining an abstract text from corresponding text data according to the coordinates of the abstract part;
the classification module is used for analyzing effective classification characteristics in the abstract text according to the text characteristics and classifying the abstract text according to the effective classification characteristics and paragraph sequence;
and the viewpoint and detail extracting module is used for extracting the viewpoints and details of the abstract text according to the classification.
In one embodiment, a computer device is provided, which includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the above method for extracting viewpoints and details from a research report abstract.
In one embodiment, a storage medium storing computer-readable instructions is provided; the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the steps of the method for extracting viewpoints and details from a research report abstract according to the above embodiments. The storage medium may be a nonvolatile storage medium.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing associated hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described, but any combination should be considered within the scope of the present specification as long as the combined technical features do not contradict each other.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims (11)

1. A method for extracting viewpoint details of a research and report abstract is characterized by comprising the following steps:
acquiring a research report file, and acquiring a plurality of text data and a plurality of text blocks from the research report file;
searching a summary part in the text block, and acquiring a summary text from the corresponding text data according to the coordinates of the summary part, specifically comprising:
taking the starting position of the text block as the starting part of the abstract part, if any text block contains a preset abstract ending mark, the text block containing the abstract ending mark belongs to the abstract part;
if no text block contains the abstract ending mark, acquiring a cut-off position according to the distance between two adjacent text blocks, and determining the text block at which the abstract part is cut off;
analyzing effective classification features in the abstract text according to text features, and classifying the abstract text according to the effective classification features and paragraph order, specifically comprising:
counting the text features of all abstract texts in the research report file and analyzing the quantity and distribution of the text features; if any text feature meets a preset quantity condition and a preset distribution condition, considering that text feature to be an effective classification feature, and classifying, in paragraph order, each paragraph of all the abstract texts as containing a title or not;
if none of the text features in the research report file meets the conditions, classifying every paragraph of the abstract text as containing a title;
and extracting the viewpoint and the details of the abstract text according to the classification.
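The adaptive classification step of claim 1 can be sketched as follows. The claim does not state the preset quantity and distribution conditions, so the ones used here (the feature opens at least `min_count` paragraphs, and it opens the first paragraph) are assumptions chosen purely for illustration, as is the `lead_features` representation of a paragraph.

```python
from typing import Dict, List, Optional, Set

# Assumed representation: each paragraph lists the text features seen at its
# beginning, e.g. {"bold"} or {"color:#c00000"}.
Paragraph = Dict[str, Set[str]]


def pick_effective_feature(paragraphs: List[Paragraph],
                           min_count: int = 2) -> Optional[str]:
    """Return a feature that passes both (assumed) conditions, else None."""
    counts: Dict[str, int] = {}
    for p in paragraphs:
        for feature in p["lead_features"]:
            counts[feature] = counts.get(feature, 0) + 1
    # Most frequent first; ties broken by name so the result is deterministic.
    for feature in sorted(counts, key=lambda f: (-counts[f], f)):
        quantity_ok = counts[feature] >= min_count                 # assumed condition
        distribution_ok = feature in paragraphs[0]["lead_features"]  # assumed condition
        if quantity_ok and distribution_ok:
            return feature
    return None


def classify_paragraphs(paragraphs: List[Paragraph],
                        min_count: int = 2) -> List[bool]:
    """One flag per paragraph, in paragraph order: does it contain a title?"""
    feature = pick_effective_feature(paragraphs, min_count)
    if feature is None:
        # Fallback from the claim: every paragraph is treated as titled.
        return [True] * len(paragraphs)
    return [feature in p["lead_features"] for p in paragraphs]
```

With three paragraphs of which the first and third open in bold, `classify_paragraphs` marks the first and third as containing a title; when no feature qualifies, all paragraphs are marked as containing a title.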
2. The method of claim 1, wherein the acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file comprises:
acquiring the research report file, parsing the research report file with a preset parsing tool to obtain complete text paragraph data containing text features, coordinates and page numbers, and dividing the complete text paragraph data into a plurality of text data according to the text features, wherein the text features comprise at least one of character color, character size, character font or bold, and the coordinates of the text data comprise an X coordinate, a Y coordinate, a text width and a text height;
and performing target detection on the research report file through a preset target detection model to obtain a target detection result as the text blocks, wherein the target detection result is the coordinates, page number and category information of a plurality of targets in the research report file, and the coordinates of a text block comprise an X coordinate, a Y coordinate, the width of the text block and the height of the text block.
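The two kinds of records named in claim 2 can be modelled as small data classes, and the division of paragraph data by text features as a grouping of consecutive characters that share one style. This is a sketch under assumptions: the class and field names are hypothetical, and the input format (a list of character and style pairs) stands in for whatever the preset parsing tool actually returns.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List, Tuple


@dataclass(frozen=True)
class Style:
    color: str
    size: float
    font: str
    bold: bool


@dataclass
class TextData:
    text: str
    style: Style
    x: float
    y: float
    width: float
    height: float
    page: int


@dataclass
class TextBlock:
    x: float
    y: float
    width: float
    height: float
    page: int
    category: str  # e.g. "body", "header", "footer", "table"


def split_by_style(chars: List[Tuple[str, Style]]) -> List[Tuple[str, Style]]:
    """Merge consecutive characters with identical style into one run."""
    return [
        ("".join(ch for ch, _ in group), style)
        for style, group in groupby(chars, key=lambda item: item[1])
    ]
```

Coordinates are left out of `split_by_style` to keep the sketch short; a fuller version would also merge the character boxes of each run into one `TextData` record.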
3. The method of extracting opinion details of a research abstract according to claim 2, wherein the category information includes at least one or a combination of a research title, a special structure, a statistical chart, a structure chart, a table, a chart title, a chart annotation, a header, a footer, a body or a body title.
4. The method as claimed in claim 2, wherein the step of performing target detection on the research report file by using a preset target detection model to obtain a target detection result as the text block comprises:
and acquiring the research and report file, converting the page of each page in the research and report file into a picture to obtain a picture file, calling the target detection model, and inputting the picture file into the target detection model to obtain the target detection result.
5. The method as claimed in claim 2, wherein the step of performing target detection on the research report file by using a preset target detection model to obtain a target detection result as the text block comprises:
and screening out, from the plurality of text blocks, the text blocks whose category information is the body or the body title.
6. The method of claim 1, wherein after the acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file, the method further comprises filtering the text data according to the text blocks to obtain filtered text data, comprising:
respectively reconstructing the coordinates of the text data and of the screened text blocks, wherein, starting from the second page, the footer height of the previous page and the header height of the current page are subtracted from the Y coordinate on the current page, and the result is then added to the accumulated Y coordinate of the preceding pages;
and judging whether the coordinates of the text data are in the coordinate range of the text block, if so, keeping the text data, otherwise, discarding the text data.
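The coordinate reconstruction and filtering of claim 6 can be sketched as follows. The translated claim wording is ambiguous, so this follows one plausible reading: pages are stacked into a single vertical strip with the footer of each earlier page and the header of each later page removed. The names `Box`, `PageLayout` and the tolerance `tol` are hypothetical, and treating "within the coordinate range" as full rectangle containment is likewise an assumption.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float
    page: int  # 1-based page number


@dataclass(frozen=True)
class PageLayout:
    height: float
    header: float  # header height on this page
    footer: float  # footer height on this page


def page_offsets(pages: List[PageLayout]) -> List[float]:
    """Value to add to a page-local Y to get the merged Y, per page."""
    offsets, running = [], 0.0
    for i, p in enumerate(pages):
        header = p.header if i > 0 else 0.0  # reconstruction starts at page 2
        offsets.append(running - header)
        running += p.height - header - p.footer
    return offsets


def to_global(box: Box, offsets: List[float]) -> Box:
    return replace(box, y=box.y + offsets[box.page - 1])


def inside(inner: Box, outer: Box, tol: float = 1.0) -> bool:
    return (inner.x >= outer.x - tol
            and inner.y >= outer.y - tol
            and inner.x + inner.width <= outer.x + outer.width + tol
            and inner.y + inner.height <= outer.y + outer.height + tol)


def filter_text(text_boxes: List[Box], block_boxes: List[Box],
                pages: List[PageLayout], tol: float = 1.0) -> List[Box]:
    """Keep text boxes that fall inside any screened text block."""
    offsets = page_offsets(pages)
    blocks = [to_global(b, offsets) for b in block_boxes]
    return [t for t in text_boxes
            if any(inside(to_global(t, offsets), b, tol) for b in blocks)]
```

Under this reading, text that sits just below the header of page 2 lands directly after the last body line of page 1, so the distance between text blocks remains meaningful across page breaks.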
7. The method of claim 1, wherein the acquiring a cut-off position according to the distance between two adjacent text blocks and determining the text block at which the abstract part is cut off comprises:
when the distance between two adjacent text blocks is greater than a preset distance threshold, determining that the abstract part is cut off at the previous text block;
and when two adjacent text blocks are consecutively of a picture type or a table type, determining that the abstract part is cut off at the previous text block.
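The cut-off rules of claims 1 and 7 can be sketched as a single function. The ending mark takes priority; the distance and category rules apply only when no block contains a mark. The `Block` fields, the category names in `NON_TEXT` and the example mark string are illustrative assumptions, as is the reading that a run of two picture or table blocks ends the abstract at the block before the run.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Block:
    text: str
    y: float        # top edge in the merged (cross-page) coordinate system
    height: float
    category: str   # e.g. "body", "table", "statistical chart"


# Assumed set of picture/table categories.
NON_TEXT = {"table", "statistical chart", "structure chart"}


def abstract_blocks(blocks: Sequence[Block], end_marks: Sequence[str],
                    max_gap: float) -> List[Block]:
    """Return the leading blocks that form the abstract part."""
    # Rule 1: a block containing a preset ending mark closes the abstract
    # and itself belongs to it.
    for i, block in enumerate(blocks):
        if any(mark in block.text for mark in end_marks):
            return list(blocks[: i + 1])
    # Rules 2 and 3 apply only when no ending mark was found.
    for i in range(1, len(blocks)):
        prev, cur = blocks[i - 1], blocks[i]
        # Two consecutive picture/table blocks: stop before the run.
        if prev.category in NON_TEXT and cur.category in NON_TEXT:
            return list(blocks[: i - 1])
        # Vertical gap larger than the threshold: stop at the previous block.
        if cur.y - (prev.y + prev.height) > max_gap:
            return list(blocks[:i])
    return list(blocks)
```

For example, with a threshold of 50 and no ending mark, two body blocks 10 units apart followed by a block 190 units further down yield the first two blocks as the abstract.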
8. The method for extracting opinion details of a research and report summary according to claim 1, wherein said extracting opinions and details of said summary text according to classification comprises:
if a paragraph is classified as not containing a title, merging the paragraph into the details of the previous paragraph, wherein the first paragraph is classified by default as containing a title;
if a paragraph is classified as containing a title, judging whether the beginning of the text of the current paragraph is in a bold font; if the beginning of the text is in a bold font, the bold text is the viewpoint and the rest is the details; otherwise, the first sentence of the current paragraph is the viewpoint and the rest is the details.
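The extraction rule of claim 8 can be sketched as follows. The field `bold_prefix_len` (the number of leading bold characters, assumed to be derivable from the text features) and the sentence-ending pattern are illustrative assumptions; the claim does not say how the first sentence is delimited.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Paragraph:
    text: str
    has_title: bool
    bold_prefix_len: int = 0  # assumed: count of leading characters in bold


@dataclass
class Opinion:
    viewpoint: str
    details: str


# Assumed delimiter: CJK or Latin end punctuation; a Latin full stop only
# counts when followed by whitespace or end of text (so "3.5%" is not split).
_SENTENCE_END = re.compile(r"[。！？!?]|\.(?=\s|$)")


def split_first_sentence(text: str) -> Tuple[str, str]:
    match = _SENTENCE_END.search(text)
    if match is None:
        return text, ""
    return text[: match.end()], text[match.end():]


def extract_opinions(paragraphs: List[Paragraph]) -> List[Opinion]:
    result: List[Opinion] = []
    for i, p in enumerate(paragraphs):
        # The first paragraph is treated as titled by default.
        if i > 0 and not p.has_title:
            last = result[-1]
            last.details = "\n".join(filter(None, [last.details, p.text.strip()]))
            continue
        if p.bold_prefix_len > 0:
            viewpoint = p.text[: p.bold_prefix_len]
            details = p.text[p.bold_prefix_len:]
        else:
            viewpoint, details = split_first_sentence(p.text)
        result.append(Opinion(viewpoint.strip(), details.strip()))
    return result
```

A paragraph without a title never starts a new entry; its text is appended to the details of the most recent titled paragraph.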
9. An apparatus for extracting viewpoint details of a research and report abstract, comprising:
the data acquisition module is used for acquiring a research report file and acquiring a plurality of text data and a plurality of text blocks from the research report file;
the abstract text obtaining module is configured to search an abstract portion in the text block, and obtain an abstract text from the corresponding text data according to coordinates of the abstract portion, and specifically includes:
taking the starting position of the text block as the starting part of the abstract part, if any text block contains a preset abstract ending mark, the text block containing the abstract ending mark belongs to the abstract part;
if no text block contains the abstract ending mark, acquiring a cut-off position according to the distance between two adjacent text blocks, and determining the text block at which the abstract part is cut off;
the classification module is used for analyzing effective classification features in the abstract text according to text features and classifying the abstract text according to the effective classification features and paragraph sequence, and specifically comprises the following steps:
counting the text features of all abstract texts in the research report file and analyzing the quantity and distribution of the text features; if any text feature meets a preset quantity condition and a preset distribution condition, considering that text feature to be an effective classification feature, and classifying, in paragraph order, each paragraph of all the abstract texts as containing a title or not;
if none of the text features in the research report file meets the conditions, classifying every paragraph of the abstract text as containing a title;
and the viewpoint and detail extracting module is used for extracting the viewpoints and details of the abstract text according to classification.
10. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the method of opinion detail extraction of a research summary according to any of claims 1 to 8.
11. A storage medium having computer readable instructions stored thereon which, when executed by one or more processors, cause the one or more processors to perform the steps of the method of opinion detail extraction of a research summary according to any of claims 1 to 8.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110451466.3A CN113127595B (en) 2021-04-26 2021-04-26 Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract

Publications (2)

Publication Number Publication Date
CN113127595A CN113127595A (en) 2021-07-16
CN113127595B true CN113127595B (en) 2022-08-16

Family

ID=76780131

Country Status (1)

Country Link
CN (1) CN113127595B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535942B (en) * 2021-07-21 2022-08-19 北京海泰方圆科技股份有限公司 Text abstract generating method, device, equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106951400A (en) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 The information extraction method and device of a kind of pdf document
CN107291723A (en) * 2016-03-30 2017-10-24 阿里巴巴集团控股有限公司 The method and apparatus of web page text classification, the method and apparatus of web page text identification
CN107315797A (en) * 2017-06-19 2017-11-03 江西洪都航空工业集团有限责任公司 A kind of Internet news is obtained and text emotion forecasting system
CN109388804A (en) * 2018-10-22 2019-02-26 平安科技(深圳)有限公司 Report core views extracting method and device are ground using the security of deep learning model
CN110717044A (en) * 2019-10-08 2020-01-21 创新奇智(南京)科技有限公司 Text classification method for research and report text
CN111831802A (en) * 2020-06-04 2020-10-27 北京航空航天大学 Urban domain knowledge detection system and method based on LDA topic model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108959254A (en) * 2018-06-29 2018-12-07 中教汇据(北京)科技有限公司 A kind of analytic method for article content in periodical pdf document
CN109816118B (en) * 2019-01-25 2022-12-06 上海深杳智能科技有限公司 Method and terminal for creating structured document based on deep learning model
CN110348294B (en) * 2019-05-30 2024-04-16 平安科技(深圳)有限公司 Method and device for positioning chart in PDF document and computer equipment

Also Published As

Publication number Publication date
CN113127595A (en) 2021-07-16

Similar Documents

Publication Publication Date Title
US8356045B2 (en) Method to identify common structures in formatted text documents
CN109062874B (en) Financial data acquisition method, terminal device and medium
CN109753909B (en) Resume analysis method based on content blocking and BilSTM model
CN107391457B (en) Document segmentation method and device based on text line
EP1907946A1 (en) A method for finding text reading order in a document
CN100432996C (en) System, method and program for extracting web page core content based on web page layout
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
CN112861489A (en) Method and device for processing word document
CN112990110B (en) Method for extracting key information from research report and related equipment
CN113127595B (en) Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN114359943A (en) OFD format document paragraph identification method and device
CN112084451A (en) Webpage LOGO extraction system and method based on visual blocking
CN115204129A (en) Automatic matching and identifying method for key parameters of drilling operation report
Lee et al. Detecting and dismantling composite visualizations in the scientific literature
CN108038441A (en) A kind of System and method for based on image recognition
US9049400B2 (en) Image processing apparatus, and image processing method and program
CN113360603B (en) Contract similarity and compliance detection method and device
CN112990091A (en) Research and report analysis method, device, equipment and storage medium based on target detection
CN110889274B (en) Information quality evaluation method, device, equipment and computer readable storage medium
CN109472020A (en) A kind of feature alignment Chinese word cutting method
CN111966640A (en) Document file identification method and system
CN113779218B (en) Question-answer pair construction method, question-answer pair construction device, computer equipment and storage medium
CN108255866B (en) Method and device for checking links in website
CN114417820A (en) Content filtering method for target object
Rastan et al. A PDF wrapper for table processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant