CN110287458A - Automatic title labeling system for annual report text - Google Patents

Automatic title labeling system for annual report text

Info

Publication number
CN110287458A
Authority
CN
China
Prior art keywords
title
level
mark
template
report text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910416616.XA
Other languages
Chinese (zh)
Other versions
CN110287458B (en)
Inventor
梁倬骞
潘定
罗旭
龙舜
伍旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201910416616.XA
Publication of CN110287458A
Application granted
Publication of CN110287458B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/125 Finance or payroll

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Technology Law (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic title labeling system for annual report text and relates to the technical field of annual report text title labeling. The method is as follows: A. first-level titles and second-level titles, which are of different levels, are labeled in two passes; B. first-level titles are matched: in the first labeling pass, titles that match the title template exactly and first-level titles whose computed similarity reaches a threshold are labeled M and S respectively, and cross-level labels are screened out; in the second labeling pass, the titles that were labeled by similarity calculation are matched a second time. The system identifies the layout of financial report text using machine vision and combines machine vision with rule-based and statistical text extraction, thereby addressing the difficulty of labeling titles accurately.

Description

Automatic title labeling system for annual report text
Technical field
The present invention relates to the technical field of annual report text title labeling, and specifically to an automatic title labeling system for annual report text.
Background
Financial report text is strictly standardized and carefully structured, and its chapters and paragraphs contain rich disclosure information. One approach is to use the font styles in the PDF as features for extracting titles from a financial report. Analysis shows that deep parsing of the PDF format makes it possible to recognize the Chinese text in a PDF. However, there is no strictly unified template that imposes format requirements such as fonts on the PDFs that enterprises submit, so even if the font of each title level in a PDF is obtained, it is difficult to unify this font information into a single extraction rule.
When people look at financial report text, they can always judge directly, from some prior knowledge, which parts are titles and which are body text. Likewise, when reading an annual report, people can judge the structure and summary content of a PDF financial report from the standardized disclosure pattern of annual reports.
Inspired by this, and drawing on the principles of mathematical morphology, the inventors convert the vision-based problem of identifying the annual report outline into a filtering operation based on mathematical morphology. Considering the real operating environment, the invention uses machine vision to identify the layout of financial report text and combines machine vision with rule-based and statistical text extraction, so as to address the difficulty of labeling titles accurately.
Summary of the invention
(1) Technical problem to be solved
In view of the deficiencies of the prior art, the present invention provides an automatic title labeling system for annual report text, which addresses the difficulty of labeling titles accurately.
(2) Technical solution
To achieve the above object, the invention is realized by the following technical solution: an automatic title labeling system for annual report text, the method being as follows:
A. First-level titles and second-level titles, which are of different levels, are labeled in two passes;
B. First-level titles are matched: in the first labeling pass, first-level titles that match the title template exactly and first-level titles whose computed similarity reaches the threshold are labeled M and S respectively, and cross-level labels are screened out; in the second labeling pass, the titles that were labeled by similarity calculation are matched a second time;
C. For a title that has second-level titles, the matching degree between its second-level titles and all second-level titles of the corresponding first-level title in the template is calculated, and, drawing on the characteristics of financial reports, it is decided whether the title finally receives a label; for a title without second-level titles, its adjacent preceding and following titles are matched against the corresponding first-level titles in the template, and whether a label is finally added is decided by judging whether they match exactly or reach a certain similarity value.
Preferably, the first labeling pass in step B comprises the following steps:
Step 1. Titles in the financial report text that are identical to a title in the template are labeled M.
Step 2. For a title that has no identical match in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S.
Step 3. After step 2 is completed, some titles of a different level may also have been labeled. The labels on titles of a different level are removed: the styles of the titles labeled M are counted to obtain the title style.
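Steps 1 and 2 of the first pass can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the template is assumed to be a non-empty list of title strings, similarity is taken to be a cosine coefficient over character bigrams (the source names the cosine coefficient but does not specify tokenization; bigrams are chosen here because Chinese text has no whitespace), and the threshold of 0.5 is a placeholder. The names `bigrams`, `cosine_sim` and `first_pass` are hypothetical.

```python
import math
from collections import Counter


def bigrams(text):
    """Character bigram counts; a single character falls back to itself."""
    text = text.strip()
    if len(text) < 2:
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def cosine_sim(a, b):
    """Cosine coefficient between the bigram count vectors of two strings."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[k] * vb[k] for k in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values())) * \
        math.sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def first_pass(report_lines, template, threshold=0.5):
    """Return (line, label, template_title) triples; label is 'M', 'S' or None.

    Assumes `template` is a non-empty list of title strings.
    """
    template_set = set(template)
    results = []
    for line in report_lines:
        if line in template_set:
            # Step 1: identical to a template title -> M
            results.append((line, "M", line))
            continue
        # Step 2: most similar template title; S if it reaches the threshold
        best = max(template, key=lambda t: cosine_sim(line, t))
        if cosine_sim(line, best) >= threshold:
            results.append((line, "S", best))
        else:
            results.append((line, None, None))
    return results
```

Lines labeled S here are only candidates; the second pass decides whether each one is promoted to M or dropped.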
Preferably, the second labeling pass in step B screens out the titles that do not meet the requirements among the titles that received an S label through similarity calculation.
Preferably, the second labeling pass comprises the following steps:
Step 1. Read the content of a title labeled S;
Step 2. Obtain total, the number of second-level titles of the template title that corresponds to the current title;
Step 3. If total is equal to 0, judge whether the first-level titles in the context of the current title exist in the template and are located at similar positions; if so, delete the S label and add an M label, on the grounds that the title meets the matching requirement;
If total is not equal to 0, calculate the similarity between the second-level titles of the current title and all second-level titles of the corresponding title in the template. If the similarity value reaches the threshold, delete the S label and add an M label; otherwise, only delete the S label;
Step 4. Obtain the final title labeling results.
Preferably, the title template covers only the financial report.
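The second pass can be sketched as below. Several details are assumptions made for illustration only: the template is a dict that maps each first-level title to its list of second-level titles (insertion order is taken as disclosure order); each report title is a dict with hypothetical keys `text`, `label`, `match` and `children`; `difflib.SequenceMatcher` stands in for the similarity measure; the subtitle score is the mean best-match similarity; a context check passes if either neighbour sits at the adjacent template position; and an S title that fails the context check loses its label, which the source does not state explicitly.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity in [0, 1]; the source uses a cosine coefficient."""
    return SequenceMatcher(None, a, b).ratio()


def children_similarity(children, template_children):
    """Mean best-match similarity of each report subtitle against the template."""
    if not children or not template_children:
        return 0.0
    best = [max(similarity(c, t) for t in template_children) for c in children]
    return sum(best) / len(best)


def second_pass(titles, template, threshold=0.6):
    """Return the final label ('M' or None) for each title, in document order."""
    order = list(template)            # template titles in disclosure order
    final = []
    for i, t in enumerate(titles):
        if t["label"] != "S":         # step 1: only S titles are re-checked
            final.append(t["label"])
            continue
        tmpl_children = template[t["match"]]
        total = len(tmpl_children)    # step 2
        if total == 0:
            # step 3, no subtitles: check the neighbouring titles' positions
            pos = order.index(t["match"])
            prev_ok = (i > 0 and pos > 0
                       and titles[i - 1]["match"] == order[pos - 1])
            next_ok = (i + 1 < len(titles) and pos + 1 < len(order)
                       and titles[i + 1]["match"] == order[pos + 1])
            final.append("M" if prev_ok or next_ok else None)
        else:
            # step 3, subtitles present: compare them with the template's
            score = children_similarity(t["children"], tmpl_children)
            final.append("M" if score >= threshold else None)
    return final                      # step 4
```

Whether one matching neighbour is enough, or both are required, is a design choice the source leaves open; requiring both would be stricter at the first and last titles of a section.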
(3) Beneficial effects
The present invention provides an automatic title labeling system for annual report text, with the following beneficial effects:
(1) The system proposes a title labeling method, based on text similarity calculation, that takes the contextual information of titles into account. Annual report titles are short text (for example "Taxes", a title consisting of a single word), so it is difficult to label them accurately using cosine-coefficient similarity alone. Financial report disclosures, however, usually follow contextual requirements and a standardized disclosure order. Exploiting this feature, the method takes the contextual information of annual report titles into consideration before calculating the similarity between a title and the text lines of the template, so that a title can be mapped accurately to the corresponding template row and the annual report can be labeled using the template. Combining contextual information in this way can greatly improve the coverage and accuracy of labeling.
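The difficulty with short titles can be illustrated with a small Python sketch. It assumes a word-level cosine coefficient and a hypothetical tie-breaking rule in which the candidate closest to the template row following the previously matched row wins; the rule and the names `cosine` and `best_with_context` are illustrative, not taken from the source.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine coefficient over lowercase word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(v * v for v in va.values()) *
                     sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def best_with_context(title, template, prev_index):
    """Index of the most similar template row, or None if nothing overlaps.

    Ties are broken by closeness to the row after the previously matched
    row, reflecting the standardized disclosure order (an assumption).
    """
    scores = [cosine(title, t) for t in template]
    top = max(scores)
    if top == 0.0:
        return None
    tied = [i for i, s in enumerate(scores) if math.isclose(s, top)]
    return min(tied, key=lambda i: abs(i - (prev_index + 1)))
```

With the template `["Deferred taxes", "Inventory", "Goodwill", "Taxes payable"]`, the one-word title "Taxes" scores the same against rows 0 and 3, so the cosine coefficient alone cannot choose between them; the position of the previously matched row settles it.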
Brief description of the drawings
Fig. 1 is a flowchart of preliminary title matching in the first labeling pass of the invention;
Fig. 2 is a flowchart of screening partially labeled titles in the second labeling pass of the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the invention without creative effort fall within the protection scope of the invention.
Referring to Figs. 1-2, an embodiment of the invention provides a technical solution: an automatic title labeling system for annual report text, the method being as follows:
A. First-level titles and second-level titles, which are of different levels, are labeled in two passes;
B. First-level titles are matched: in the first labeling pass, first-level titles that match the title template exactly and first-level titles whose computed similarity reaches the threshold are labeled M and S respectively, and cross-level labels are screened out; in the second labeling pass, the titles that were labeled by similarity calculation are matched a second time;
C. For a title that has second-level titles, the matching degree between its second-level titles and all second-level titles of the corresponding first-level title in the template is calculated, and, drawing on the characteristics of financial reports, it is decided whether the title finally receives a label; for a title without second-level titles, its adjacent preceding and following titles are matched against the corresponding first-level titles in the template, and whether a label is finally added is decided by judging whether they match exactly or reach a certain similarity value;
Further, the first labeling pass in step B comprises the following steps:
Step 1. Titles in the financial report text that are identical to a title in the template are labeled M;
Step 2. For a title that has no identical match in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S;
Step 3. After step 2 is completed, some titles of a different level may also have been labeled. The labels on titles of a different level are removed: the styles of the titles labeled M are counted to obtain the title style;
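Step 3 of the first pass can be sketched as follows, under the assumption that every line carries a hashable style descriptor (here a hypothetical `(font, size)` tuple) and that the most frequent style among M-labeled lines is taken as the title style for the document. The function name `remove_cross_level_labels` and the dict keys are illustrative.

```python
from collections import Counter


def remove_cross_level_labels(lines):
    """Drop labels from lines whose style differs from the dominant M style.

    lines: list of dicts with keys 'text', 'label' ('M', 'S' or None), 'style'.
    Returns (cleaned_lines, title_style); title_style is None if no line is M.
    """
    m_styles = Counter(l["style"] for l in lines if l["label"] == "M")
    if not m_styles:
        return lines, None
    title_style, _ = m_styles.most_common(1)[0]
    cleaned = [
        {**l, "label": l["label"] if l["style"] == title_style else None}
        for l in lines
    ]
    return cleaned, title_style
```

Because the style is derived from each document's own M-labeled titles, no cross-document font rule is needed, which is consistent with the observation in the background section that submitted PDFs do not share a unified format.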
Further, the second labeling pass in step B screens out the titles that do not meet the requirements among the titles that received an S label through similarity calculation;
Further, the second labeling pass comprises the following steps:
Step 1. Read the content of a title labeled S;
Step 2. Obtain total, the number of second-level titles of the template title that corresponds to the current title;
Step 3. If total is equal to 0, judge whether the first-level titles in the context of the current title exist in the template and are located at similar positions; if so, delete the S label and add an M label, on the grounds that the title meets the matching requirement;
If total is not equal to 0, calculate the similarity between the second-level titles of the current title and all second-level titles of the corresponding title in the template. If the similarity value reaches the threshold, delete the S label and add an M label; otherwise, only delete the S label;
Step 4. Obtain the final title labeling results;
Further, the title template covers only the financial report. The invention operates mainly on the "Financial Report" module of annual report analysis, so the title template covers only that module. Within an annual report, "Financial Report" is itself a first-level title, under which there are second-level and third-level titles. For the title template, a first-level title corresponds to a second-level title within "Financial Report", and likewise a second-level title corresponds to a third-level title within "Financial Report". The title template was extracted after manual analysis of 100 annual reports from various fields. The final title labeling results show that the template has good coverage and a high title matching rate: the matching degree of first-level and second-level titles can reach 80% or more.
To reflect the logical relationship between titles of different levels while keeping the coding practical, the title template is coded according to the following rules:
First-level titles: codes start at "10000" and increase in steps of "10000"; mathematically, a first-level title code is characterized by being divisible by "10000";
Second-level titles: codes start from the code of the parent first-level title plus "100" and increase in steps of "100"; mathematically, a second-level title code is characterized by being divisible by "100" but not by "10000".
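The coding rules can be sketched in Python as follows. The template layout (a list of pairs of a first-level title and its second-level titles) and the function names `assign_codes`, `level_of` and `parent_of` are assumptions made for illustration. One consequence of the arithmetic is that a first-level title can hold at most 99 second-level titles before its codes would run into the next first-level code; the sketch checks for this.

```python
def assign_codes(template):
    """Map codes to titles following the 10000 / 100 increment rules.

    template: list of (first_level_title, [second_level_titles]) pairs.
    """
    codes = {}
    for i, (first, children) in enumerate(template, start=1):
        if len(children) > 99:
            raise ValueError("more than 99 second-level titles would collide "
                             "with the next first-level code")
        base = i * 10000                   # 10000, 20000, ...
        codes[base] = first
        for j, second in enumerate(children, start=1):
            codes[base + j * 100] = second  # base + 100, base + 200, ...
    return codes


def level_of(code):
    """Return 1 or 2 according to the divisibility rules."""
    if code % 10000 == 0:
        return 1
    if code % 100 == 0:
        return 2
    raise ValueError(f"{code} is not a valid title code")


def parent_of(code):
    """Code of the first-level title that a code belongs to."""
    return code - code % 10000
```

For example, with "Accounts receivable" as the second first-level title, its subtitles receive 20100 and 20200, and `parent_of(20200)` recovers 20000 without any lookup table.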
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. In the absence of more restrictions.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, substitutions and variations can be made to these embodiments without departing from the principles and spirit of the invention; the scope of the invention is defined by the appended claims.

Claims (5)

1. An automatic title labeling system for annual report text, the method being as follows:
A. First-level titles and second-level titles, which are of different levels, are labeled in two passes;
B. First-level titles are matched: in the first labeling pass, first-level titles that match the title template exactly and first-level titles whose computed similarity reaches the threshold are labeled M and S respectively, and cross-level labels are screened out; in the second labeling pass, the titles that were labeled by similarity calculation are matched a second time;
C. For a title that has second-level titles, the matching degree between its second-level titles and all second-level titles of the corresponding first-level title in the template is calculated, and, drawing on the characteristics of financial reports, it is decided whether the title finally receives a label; for a title without second-level titles, its adjacent preceding and following titles are matched against the corresponding first-level titles in the template, and whether a label is finally added is decided by judging whether they match exactly or reach a certain similarity value.
2. The automatic title labeling system for annual report text according to claim 1, characterized in that the first labeling pass in step B comprises the following steps:
Step 1. Titles in the financial report text that are identical to a title in the template are labeled M.
Step 2. For a title that has no identical match in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S.
Step 3. After step 2 is completed, some titles of a different level may also have been labeled. The labels on titles of a different level are removed: the styles of the titles labeled M are counted to obtain the title style.
3. The automatic title labeling system for annual report text according to claim 2, characterized in that the second labeling pass in step B screens out the titles that do not meet the requirements among the titles that received an S label through similarity calculation.
4. The automatic title labeling system for annual report text according to claim 3, characterized in that the second labeling pass comprises the following steps:
Step 1. Read the content of a title labeled S;
Step 2. Obtain total, the number of second-level titles of the template title that corresponds to the current title;
Step 3. If total is equal to 0, judge whether the first-level titles in the context of the current title exist in the template and are located at similar positions; if so, delete the S label and add an M label, on the grounds that the title meets the matching requirement;
If total is not equal to 0, calculate the similarity between the second-level titles of the current title and all second-level titles of the corresponding title in the template. If the similarity value reaches the threshold, delete the S label and add an M label; otherwise, only delete the S label;
Step 4. Obtain the final title labeling results.
5. The automatic title labeling system for annual report text according to claim 1, characterized in that the title template covers only the financial report.
CN201910416616.XA 2019-05-20 2019-05-20 Automatic annual report text title labeling system Active CN110287458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910416616.XA CN110287458B (en) 2019-05-20 2019-05-20 Automatic annual newspaper text title labeling system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910416616.XA CN110287458B (en) 2019-05-20 2019-05-20 Automatic annual newspaper text title labeling system

Publications (2)

Publication Number Publication Date
CN110287458A true CN110287458A (en) 2019-09-27
CN110287458B CN110287458B (en) 2023-05-02

Family

ID=68002115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910416616.XA Active CN110287458B (en) 2019-05-20 2019-05-20 Automatic annual report text title labeling system

Country Status (1)

Country Link
CN (1) CN110287458B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102780647A (en) * 2012-07-21 2012-11-14 上海量明科技发展有限公司 Method, client and system for implementing mind map function by instant messaging tool
CN105809458A (en) * 2014-12-29 2016-07-27 苏宁云商集团股份有限公司 Advertisement accurate delivery method and system in e-commerce site
CN105045847A (en) * 2015-07-01 2015-11-11 广州市万隆证券咨询顾问有限公司 Method for extracting Chinese institutional unit name from text information
CN109101538A (en) * 2018-06-29 2018-12-28 中译语通科技股份有限公司 A kind of entity abstracting method and system towards Chinese patent text
CN109766524A (en) * 2018-12-28 2019-05-17 重庆邮电大学 A kind of merger & reorganization class notice information abstracting method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026849A (en) * 2019-12-17 2020-04-17 北京百度网讯科技有限公司 Data processing method and device
CN111026849B (en) * 2019-12-17 2023-09-19 北京百度网讯科技有限公司 Data processing method and device

Also Published As

Publication number Publication date
CN110287458B (en) 2023-05-02

Similar Documents

Publication Publication Date Title
WO2021042521A1 (en) Contract automatic generation method, computer device and computer non-volatile storage medium
CN102067128A (en) Data processing device, data processing method, program, and integrated circuit
CN104809117A (en) Video data aggregation processing method, aggregation system and video searching platform
CN106897364B (en) Chinese reference corpus construction method based on events
CN103823824A (en) Method and system for automatically constructing text classification corpus by aid of internet
CN108763591A (en) A kind of webpage context extraction method, device, computer installation and computer readable storage medium
CN110941720A (en) Knowledge base-based specific personnel information error correction method
CN106339481A (en) Chinese compound new-word discovery method based on maximum confidence coefficient
CN103473217A (en) Method and device for extracting keywords from text
WO2022089227A1 (en) Address parameter processing method, and related device
CN110378911A (en) Weakly supervised image, semantic dividing method based on candidate region and neighborhood classification device
CN107967494A (en) A kind of image-region mask method of view-based access control model semantic relation figure
CN110134781A (en) A kind of automatic abstracting method of finance text snippet
CN105678244B (en) A kind of near video search method based on improved edit-distance
CN110287458A (en) A kind of annual report text header automatic marking system
CN105955960B (en) Grounding grid defect text mining method based on semantic frame
CN112148735A (en) Construction method for structured form data knowledge graph
CN115113919A (en) Software scale measurement intelligent informatization system based on BERT model and Web technology
CN109543712A (en) Entity recognition method on temporal dataset
CN109325159A (en) A kind of microblog hot event method for digging
CN110516069B (en) Fasttext-CRF-based quotation metadata extraction method
CN114495138A (en) Intelligent document identification and feature extraction method, device platform and storage medium
CN112435108A (en) Method and system for managing storage of sodium system and electronic equipment
CN105426388A (en) Apparatus for extracting and comparing webpage text
CN109657684A (en) A kind of image, semantic analytic method based on Weakly supervised study

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant