CN114398853A - Extraction method and device of document numerical index based on machine learning - Google Patents

Extraction method and device of document numerical index based on machine learning

Info

Publication number
CN114398853A
Authority
CN
China
Prior art keywords
indexes
key
index
paragraphs
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111487097.XA
Other languages
Chinese (zh)
Inventor
赖文波
柯学
张汉林
林康
谭则涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gf Securities Co ltd
Original Assignee
Gf Securities Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gf Securities Co ltd
Priority to CN202111487097.XA
Publication of CN114398853A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and a device for extracting numerical indexes from documents based on machine learning. The method comprises: dividing a document to be processed into key paragraphs and non-key paragraphs according to preset rules; constructing an index extraction model using natural language processing and deep learning, inputting the key paragraphs and the non-key paragraphs into the index extraction model separately, and obtaining key paragraph indexes and non-key paragraph indexes as its respective outputs; and integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model that screens them, and outputting an index extraction result. The invention improves the efficiency and accuracy of index extraction.

Description

Extraction method and device of document numerical index based on machine learning
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a device for extracting a document numerical index based on machine learning.
Background
With the advent of the data age, the financial industry must process exponentially growing volumes of data. Besides structured data that can be used directly, this includes data that cannot, such as huge numbers of PDF and Word documents that are unstructured and whose key information is buried in lengthy text. Business departments have a strong demand for intelligent processing of such unstructured data, but because it is difficult to process, the work is currently done mainly by hand and is inefficient.
Unstructured data cannot be applied directly to data analysis; it can be used effectively only after structured extraction, and that extraction is difficult because expressions vary widely, forms are diverse, and structures are complex. The prior art falls into two types. In the first, a person searches a document by keyword in a document reader, locates a series of hits, and reads the text at each hit to judge whether it contains the index to be extracted. If it does, the value and unit are extracted from the text and normalized; if not, the person checks the remaining hits and may have to repeat the search with several other keywords. In the second, all text paragraphs are extracted from the document, regular expressions or keywords are matched against them, and the recalled information is then processed manually.
Disclosure of Invention
The invention aims to provide a method and a device for extracting a document numerical index based on machine learning, which aim to solve the problem of low extraction efficiency of the document numerical index in the prior art.
In order to achieve the above object, the present invention provides a method for extracting document numerical indicators based on machine learning, including:
dividing a document to be processed into key paragraphs and non-key paragraphs according to preset rules;
adopting natural language processing and deep learning to construct an index extraction model, respectively inputting the key paragraphs and the non-key paragraphs into the index extraction model, and respectively outputting key paragraph indexes and non-key paragraph indexes by the index extraction model;
and integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen them, and outputting an index extraction result.
Preferably, the method for extracting a document numerical indicator based on machine learning further includes:
inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence, and sorting the indexes by confidence, wherein each output entry comprises an index name, an index value, an index unit, an index confidence, the short sentence hitting the index, and the paragraph text.
Preferably, dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules includes:
determining the format of a document to be input, converting the pdf document format into a word format, and generating an input document;
analyzing the input document to obtain a text and a table element set;
and positioning the key paragraphs by adopting a rule matching method according to the text and the table element set, and outputting the key paragraphs.
Preferably, integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen them, and outputting an index extraction result includes:
labeling the indexes among the key paragraph indexes and the non-key paragraph indexes, organizing and normalizing the index features, and inputting them into a logistic regression model for training to obtain the preset index feature scoring model;
integrating the key paragraph indexes and the non-key paragraph indexes, inputting the integrated indexes into the preset index feature scoring model, filtering out indexes whose score is lower than a first threshold, and outputting the index extraction result.
The invention also provides a device for extracting the document numerical index based on machine learning, which comprises:
the dividing module is used for dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules;
the extraction module is used for constructing an index extraction model by adopting natural language processing and deep learning, inputting the key paragraphs and the non-key paragraphs into the index extraction model respectively, and outputting key paragraph indexes and non-key paragraph indexes by the index extraction model respectively;
and the screening module is used for integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen them, and outputting an index extraction result.
Preferably, the device further comprises a calculation module, wherein the calculation module is used for:
inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence, and sorting the indexes by confidence, wherein each output entry comprises an index name, an index value, an index unit, an index confidence, the short sentence hitting the index, and the paragraph text.
Preferably, the extraction module is further configured to:
determining the format of a document to be input, converting the pdf document format into a word format, and generating an input document;
analyzing the input document to obtain a text and a table element set;
and positioning the key paragraphs by adopting a rule matching method according to the text and the table element set, and outputting the key paragraphs.
Preferably, the screening module is further configured to:
labeling the indexes among the key paragraph indexes and the non-key paragraph indexes, organizing and normalizing the index features, and inputting them into a logistic regression model for training to obtain the preset index feature scoring model;
integrating the key paragraph indexes and the non-key paragraph indexes, inputting the integrated indexes into the preset index feature scoring model, filtering out indexes whose score is lower than a first threshold, and outputting the index extraction result.
The present invention also provides a terminal device, including:
one or more processors;
a memory coupled to the processor for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting document numerical indexes based on machine learning as described in any of the above.
The present invention also provides a computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for extracting a document numerical indicator based on machine learning according to any one of the above items.
Compared with the prior art, the invention has the beneficial effects that:
A document to be processed is divided into key paragraphs and non-key paragraphs according to preset rules; an index extraction model is constructed using natural language processing and deep learning, the key paragraphs and the non-key paragraphs are input into it separately, and it outputs key paragraph indexes and non-key paragraph indexes respectively; the key paragraph indexes and the non-key paragraph indexes are then integrated and input into a preset index feature scoring model that screens them, and an index extraction result is output. Through this layered extraction method, which separates out the key paragraphs, the invention greatly improves the focus of extraction and reduces the workload of traditional index extraction, and by building an index extraction model based on natural language processing and deep learning it improves accuracy over traditional index extraction.
Further, historical sequence data are input into a preset prediction model to obtain a prediction result, the mean of the prediction result and the screened index result is calculated as a confidence, and the indexes are sorted by confidence, with each output entry comprising an index name, an index value, an index unit, an index confidence, the short sentence hitting the index, and the paragraph text. By building the preset prediction model and comparing its prediction with the screened index result, erroneous indexes are further screened out and the stability of index extraction is improved.
Drawings
In order to illustrate the technical solution of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating a method for extracting document numerical indicators based on machine learning according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating key section extraction according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a process for index extraction according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating an example of an index extraction method according to another embodiment of the present invention;
FIG. 5 is a flowchart illustrating a method for extracting document numerical indicators based on machine learning according to another embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for extracting a document numerical indicator based on machine learning according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the step numbers used herein are for convenience of description only and are not intended as limitations on the order in which the steps are performed.
It is to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The terms "comprises" and "comprising" indicate the presence of the described features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
The term "and/or" refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1, an embodiment of the present invention provides a method for extracting a document numerical indicator based on machine learning. As shown in fig. 1, the extraction method of the document numerical index based on machine learning includes steps S10 to S30. The method comprises the following steps:
S10: Dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules.
Referring to fig. 2, the key paragraphs are extracted from a document that may run to a large number of pages, and the few extracted key paragraphs are stored as a new document, which makes manual processing after the subsequent steps easier.
The format of the document to be input is determined, a pdf document is converted into word format to generate an input document, the input document is parsed to obtain a set of text and table elements, and the key paragraphs are located in that set by rule matching and output.
The document to be processed may be a word document or a pdf document. A pdf document is first converted into word format, using the existing tool ilovepdf, before further processing. After conversion, elements such as text, tables and diagrams are parsed with the docx package, and the key paragraphs are then located by rule matching. The rule is: paragraph-feature regular expression + keyword, for example "{any numeric heading} credit business", where {any numeric heading} is the paragraph-feature regular expression designed for this purpose, for example: r '^ is [ (\[ [ d ] \\ d \ three, five, six, seven, eight ninety, chapter no ] {1,3} [) ] \\\ \ s \ t ] {0,4}', and the keywords are the key-paragraph keywords provided by the business side. The document is thereby divided into key paragraphs and non-key paragraphs.
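The rule "numeric heading + keyword" can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the pattern `NUMBER`, the cut-off `MAX_HEADING_LEN`, the function name `split_key_paragraphs` and the sample paragraphs are all assumptions introduced here, and the heading pattern is a stand-in rather than a recovery of the expression quoted above. The sketch also assumes the paragraphs have already been read out of the converted document as plain strings.

```python
import re

# Hypothetical numbering prefix: optional opening bracket, 1-3 digits or
# Chinese numerals, optional closing bracket. Not the patent's own pattern.
NUMBER = r"[(（]?[\d一二三四五六七八九十]{1,3}[)）]?"
MAX_HEADING_LEN = 40  # assumed cut-off so long numeric sentences are not headings


def split_key_paragraphs(paragraphs, keywords):
    """Return (key, non_key) lists of paragraphs.

    A numbered heading followed by one of the keywords opens a key section;
    the next numbered heading without a keyword closes it.
    """
    alternatives = "|".join(re.escape(k) for k in keywords)
    key_heading = re.compile(r"^" + NUMBER + r"[、.\s]{0,4}(?:" + alternatives + r")")
    other_heading = re.compile(r"^" + NUMBER + r"[、.\s]{1,4}\S")
    key, non_key, in_key = [], [], False
    for paragraph in paragraphs:
        text = paragraph.strip()
        if key_heading.match(text):
            in_key = True
        elif len(text) <= MAX_HEADING_LEN and other_heading.match(text):
            in_key = False
        (key if in_key else non_key).append(paragraph)
    return key, non_key


paragraphs = [
    "1. Company overview",
    "The company was founded in 1998.",
    "2. Credit business",
    "The margin loan balance was 3.5 billion yuan.",
    "3. Risk factors",
    "Market volatility may affect results.",
]
key, non_key = split_key_paragraphs(paragraphs, ["Credit business"])
```

Here the section opened by "2. Credit business" ends up in `key` and everything else in `non_key`. Treating a key section as running until the next numbered heading is a design choice of this sketch; the source only states that rule matching locates the key paragraphs.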
S20: Constructing an index extraction model using natural language processing and deep learning, inputting the key paragraphs and the non-key paragraphs into the index extraction model separately, and obtaining key paragraph indexes and non-key paragraph indexes as its respective outputs.
Referring to fig. 3 and 4, an index extraction model is constructed through natural language processing and deep learning. An intelligent method for constructing regular expressions is provided so that the business side can build and maintain index regular expressions more conveniently and have them generated automatically. FIG. 4 shows an example of the generation method for a business input item (comprising an index subject A and an index dimension U), where: N (Neighbour) is any run of characters of a configurable length; NUM (random_num) matches any number and may contain common numeric characters; '|' is the OR operator; and an arrow represents a character string.
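As a rough illustration of generating a regular expression from the building blocks A, N, NUM and U, the sketch below composes one pattern of the form "A, then a short gap N, then NUM, then U". FIG. 4 is not reproduced in the text, so the composition order, the treatment of U as a term that trails the number, and the names `NUM`, `neighbour` and `build_index_regex` are all assumptions made for this example.

```python
import re

# NUM: a number with optional thousands separators and an optional decimal part.
NUM = r"(?P<value>\d[\d,]*(?:\.\d+)?)"


def neighbour(max_len):
    """N: any run of at most `max_len` characters, matched lazily."""
    return r".{0,%d}?" % max_len


def build_index_regex(subject_terms, dimension_terms, max_gap=10):
    """Compose A + N + NUM + U; alternatives inside A and U are joined by '|'."""
    a = "(?:" + "|".join(re.escape(t) for t in subject_terms) + ")"
    u = "(?P<unit>" + "|".join(re.escape(t) for t in dimension_terms) + ")"
    return re.compile(a + neighbour(max_gap) + NUM + r"\s*" + u)


pattern = build_index_regex(
    ["margin loan balance", "financing balance"],
    ["million yuan", "billion yuan"],
)
hit = pattern.search("The margin loan balance reached 3,500.5 million yuan at year end.")
miss = pattern.search(
    "The margin loan balance, as discussed in the previous section, was 5 million yuan."
)
```

In the first sentence the number sits within ten characters of the subject, so `hit` captures the value and the unit; in the second the gap is too long and `miss` is `None`. Making the gap length a parameter mirrors the "configurable length" of N.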
The text set is matched item by item against the regular expressions constructed as shown in FIG. 4, and the hit text, the numerical value, and a hit-type feature marked as "regular" are recorded. For keyword recall in the key paragraphs and non-key paragraphs, the text set is recalled by keyword, and multiple keywords are supported (combined with AND or OR). For syntactic analysis, an index candidate set is built in the key paragraphs and non-key paragraphs: the sentences recalled by keyword are parsed with Baidu's open-source ddparser syntactic analysis package, and code extracts the SVO, ATT_N and ADV_V relations, where SVO denotes the subject-verb-object relation, ATT_N an attributive (modifier-noun) relation, and ADV_V an adverbial (adverb-verb) relation.
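Extracting the three relation types from a dependency parse can be sketched as follows. The sketch does not call the parser; it works on a dictionary with `word`, `head` (1-based, 0 for the root) and `deprel` lists, which is assumed here to resemble the parser's output, and it assumes the labels SBV (subject), VOB (object), ATT (attributive) and ADV (adverbial). The function name `extract_relations` and the English sample parse are invented for illustration.

```python
def extract_relations(parse):
    """Collect SVO triples plus ATT_N and ADV_V pairs from one parsed sentence."""
    words, heads, rels = parse["word"], parse["head"], parse["deprel"]
    subjects, objects = {}, {}
    att_n, adv_v = [], []
    for i, (head, rel) in enumerate(zip(heads, rels)):
        if head == 0:  # the root has no governing word
            continue
        governor = head - 1  # heads are 1-based
        if rel == "SBV":
            subjects.setdefault(governor, []).append(words[i])
        elif rel == "VOB":
            objects.setdefault(governor, []).append(words[i])
        elif rel == "ATT":
            att_n.append((words[i], words[governor]))
        elif rel == "ADV":
            adv_v.append((words[i], words[governor]))
    # A verb yields a triple only when it has both a subject and an object.
    svo = [
        (s, words[v], o)
        for v in sorted(subjects)
        if v in objects
        for s in subjects[v]
        for o in objects[v]
    ]
    return {"SVO": svo, "ATT_N": att_n, "ADV_V": adv_v}


sample = {
    "word": ["margin", "balance", "steadily", "reached", "record", "level"],
    "head": [2, 4, 4, 0, 6, 4],
    "deprel": ["ATT", "SBV", "ADV", "HED", "ATT", "VOB"],
}
relations = extract_relations(sample)
```

For the sample, the subject "balance" and object "level" share the verb "reached", giving one SVO triple, while the modifiers give two ATT_N pairs and one ADV_V pair.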
Evaluation and ranking of the index candidate set: the similarity between an index name X recalled by syntactic analysis and the index name I provided by the business side is calculated as jieba.cut_for_search(I) & jieba.cut_for_search(X). That is, both names are segmented with jieba's search-engine segmentation mode, the intersection of the two segmentation results is taken, and the proportion of I's words that fall in the intersection is computed. The higher the proportion, the higher the similarity and the more likely X is the target index.
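The similarity measure reduces to a small set computation, sketched below with the `&` in the formula read as set intersection. To keep the example self-contained, a whitespace tokenizer stands in for jieba; for Chinese text one would presumably pass `jieba.lcut_for_search` as the `tokenize` argument instead. The names `name_similarity` and `whitespace_tokenize` are illustrative.

```python
def name_similarity(target_name, candidate_name, tokenize):
    """Share of the target name's tokens that also occur in the candidate name."""
    target_tokens = set(tokenize(target_name))
    if not target_tokens:
        return 0.0
    shared = target_tokens & set(tokenize(candidate_name))
    return len(shared) / len(target_tokens)


def whitespace_tokenize(text):
    """Stand-in tokenizer for the example; not a word segmenter."""
    return text.lower().split()


similarity = name_similarity(
    "net margin loan balance",          # I: name provided by the business side
    "margin loan balance at year end",  # X: name recalled by syntactic analysis
    whitespace_tokenize,
)
```

Three of the four tokens of I appear in X, so the similarity is 0.75. Dividing by the size of I rather than of X means extra words in the recalled name do not lower the score.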
The TextCNN model scores sentences at semantic granularity: a TextCNN model is built that outputs a score between 0 and 1, where 0 means the sentence is semantically not a target index sentence and 1 means it is.
Finally, the index names, index values, hit types, evaluation scores and other outputs of the two components are integrated (features belonging to the same sentence are merged) and passed to the next step.
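Merging the features that belong to the same sentence might look like the sketch below. The record layout (`sentence`, `hit_type`, `index_name`, `index_value`, `score`), the function name `merge_sentence_features` and the rule that the first component to supply a field wins are all assumptions of this example; the source does not specify how conflicts are resolved.

```python
def merge_sentence_features(*components):
    """Merge hit records from several components, keyed by sentence text."""
    merged = {}
    for hits in components:
        for hit in hits:
            record = merged.setdefault(
                hit["sentence"], {"sentence": hit["sentence"], "hit_types": []}
            )
            record["hit_types"].append(hit["hit_type"])
            # Assumed conflict rule: keep the first value seen for each field.
            for field in ("index_name", "index_value", "score"):
                if field in hit and field not in record:
                    record[field] = hit[field]
    return list(merged.values())


regex_hits = [
    {"sentence": "s1", "hit_type": "regular",
     "index_name": "margin loan balance", "index_value": "3.5"},
]
syntax_hits = [
    {"sentence": "s1", "hit_type": "syntax", "index_name": "loan balance", "score": 0.75},
    {"sentence": "s2", "hit_type": "syntax", "index_name": "net profit", "score": 0.4},
]
merged = merge_sentence_features(regex_hits, syntax_hits)
```

Sentence "s1" is hit by both components, so its merged record carries both hit types, the value from the regular-expression hit and the score from the syntactic hit.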
S30: Integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen them, and outputting an index extraction result.
The indexes among the key paragraph indexes and the non-key paragraph indexes are labeled, and the index features are organized, normalized and input into a logistic regression model for training to obtain the preset index feature scoring model. The key paragraph indexes and the non-key paragraph indexes are then integrated and input into the preset index feature scoring model, indexes scoring below a first threshold are filtered out, and the index extraction result is output.
Specifically, the indexes of the key paragraphs and non-key paragraphs are integrated according to the index names, index values, hit types, evaluation scores and other features obtained in step S20. The recalled index names, index values and features are labeled to form a data set, with correct indexes labeled 1 and wrong indexes labeled 0; the features are organized and normalized and input into a logistic regression model for training, which yields the preset index feature scoring model. The integrated indexes are then input into the preset index feature scoring model for screening: the model predicts a score for each newly recalled index and its features, and indexes scoring below the first threshold of 0.5 are filtered out. At least the index with the highest predicted score is always retained; if its score is below the first threshold of 0.5, it is marked so that the business side can correct it manually. The index extraction result is then output.
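The scoring and screening step can be illustrated with a small self-contained sketch. A plain-Python stochastic gradient descent stands in for whatever logistic regression implementation was actually used, min-max scaling is assumed as the normalization, and the two toy features (name similarity and sentence score) and all function names are chosen for this example only.

```python
import math


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; a constant column maps to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def predict(weights, bias, features):
    """Logistic regression score in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def train(rows, labels, learning_rate=0.5, epochs=500):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def screen(scored, threshold=0.5):
    """Keep (name, score) pairs scoring at least `threshold`.

    If none qualifies, keep the single best one and flag it for manual review.
    Each kept item is (name, score, needs_review).
    """
    kept = [(name, s, False) for name, s in scored if s >= threshold]
    if not kept and scored:
        name, s = max(scored, key=lambda pair: pair[1])
        kept = [(name, s, True)]
    return kept


# Toy training set: [name similarity, sentence score]; label 1 = correct index.
rows = min_max_normalize([[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3]])
weights, bias = train(rows, [1, 1, 0, 0])
```

With scores in hand, `screen([("a", 0.3), ("b", 0.4)])` keeps only `"b"` and flags it for review, because nothing reaches 0.5 but the best candidate is always retained.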
Making use of unstructured data has long been a difficult problem in artificial intelligence. Current data analysis techniques mainly target structured data, while unstructured data varies widely in expression, takes diverse forms, and has complex structure, so existing techniques use it with low efficiency and low accuracy. The invention aims to solve the problem of intelligently extracting indexes from unstructured data in the financial field.
Through the layered extraction method that separates out key paragraphs, the invention combines expert experience with model capability and greatly improves the focus of extraction; compared with the large number of pages in the original document, the extracted key paragraphs greatly reduce the workload of business personnel. By building an index extraction model based on natural language processing and deep learning, the invention improves accuracy over traditional index extraction, makes extraction more comprehensive, and reduces invalid recalls, which reduces the workload of business personnel and improves working efficiency.
Referring to fig. 5, in an embodiment, after step S30 outputs the index extraction result, the method further includes the following step:
S40: Inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence, and sorting the indexes by confidence, wherein each output entry comprises an index name, an index value, an index unit, an index confidence, the short sentence hitting the index, and the paragraph text.
An ARIMA model is used: the historical sequence of each index is input, the model is trained, and the next value P of the sequence is predicted (P is extremely unlikely to be zero, but for logical consistency it is set to 0.01 if it is). P is then compared with the index value R returned by the index extraction model by computing 1 - (R - P)^2 / P^2; if the result is less than 0, it is set to 0.
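The comparison between the extracted value R and the predicted value P is a one-line formula with two guards, sketched below. The ARIMA forecast itself is not reproduced; P is simply taken as an input, and the function name `plausibility` is an assumption of this example.

```python
def plausibility(extracted, predicted):
    """Score 1 - (R - P)^2 / P^2, floored at 0, with P = 0 replaced by 0.01."""
    p = predicted if predicted != 0 else 0.01
    value = 1.0 - (extracted - p) ** 2 / p ** 2
    return max(value, 0.0)


close = plausibility(110, 100)   # extracted value 10% above the forecast
far = plausibility(300, 100)     # extracted value three times the forecast
zero_forecast = plausibility(0.01, 0)
```

A value 10% off the forecast scores about 0.99, while a value at or beyond twice the forecast (or at or below zero when the forecast is positive) is floored at 0, so the measure penalizes relative rather than absolute deviation.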
Finally, the mean of the results of the index feature scoring model and the index historical sequence prediction model is used as the confidence, the extracted indexes are sorted by it, and the results are displayed to the business side with the index name, index value, index unit, index confidence, and the short sentence and paragraph text that hit the index, so that they can be browsed quickly.
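The final confidence and ordering can be sketched with a small record type. The class name `IndexResult`, its field names and the sample values are illustrative assumptions; only the list of displayed fields and the use of the mean as the confidence come from the text.

```python
from dataclasses import dataclass


@dataclass
class IndexResult:
    name: str
    value: float
    unit: str
    feature_score: float  # output of the index feature scoring model
    plausibility: float   # output of the historical sequence check
    phrase: str           # short sentence that hit the index
    paragraph: str        # paragraph text containing the phrase

    @property
    def confidence(self):
        """Mean of the two model results."""
        return (self.feature_score + self.plausibility) / 2


def rank_results(results):
    """Sort extracted indexes by confidence, highest first."""
    return sorted(results, key=lambda r: r.confidence, reverse=True)


ranked = rank_results([
    IndexResult("net profit", 1.2, "billion yuan", 0.6, 0.5,
                "net profit was 1.2 billion yuan", "..."),
    IndexResult("margin loan balance", 3.5, "billion yuan", 0.9, 0.99,
                "margin loan balance was 3.5 billion yuan", "..."),
])
```

The margin loan balance entry has confidence 0.945 and therefore sorts ahead of the net profit entry at 0.55.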
By further building a sequence prediction model based on ARIMA, the method gains a common-sense check on whether an extracted value is reasonable, filters out erroneous indexes, and improves the stability of index extraction. Because the index historical sequence prediction model allows the extracted value to be checked against a prediction, it can also improve the accuracy of index extraction.
Referring to fig. 6, another embodiment of the present invention provides an apparatus for extracting document numerical indicators based on machine learning, including:
The dividing module 11 is configured to divide the document to be processed into key paragraphs and non-key paragraphs according to preset rules.
The extraction module 12 is configured to construct an index extraction model by using natural language processing and deep learning, input the key paragraphs and the non-key paragraphs into the index extraction model, and output key paragraph indexes and non-key paragraph indexes by the index extraction model.
The screening module 13 is configured to integrate the key paragraph indexes and the non-key paragraph indexes, input them into a preset index feature scoring model to screen them, and output an index extraction result.
Preferably, the device further comprises a calculation module, wherein the calculation module is used for: inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the mean of the prediction result and the screened index result as a confidence, and sorting the indexes by confidence, wherein each output entry comprises an index name, an index value, an index unit, an index confidence, the short sentence hitting the index, and the paragraph text.
Preferably, the extraction module is further configured to: determining the format of a document to be input, converting the pdf document format into a word format, and generating an input document;
analyzing the input document to obtain a text and a table element set;
and positioning the key paragraphs by adopting a rule matching method according to the text and the table element set, and outputting the key paragraphs.
Preferably, the screening module is further configured to: label the indexes among the key paragraph indexes and the non-key paragraph indexes, organize and normalize the index features, and input them into a logistic regression model for training to obtain the preset index feature scoring model;
integrating the key paragraph indexes and the non-key paragraph indexes, inputting the integrated indexes into the preset index feature scoring model, filtering the indexes lower than a first threshold value, and outputting the index extraction result.
For specific limitations of the extraction device for the document numerical indicator based on machine learning, reference may be made to the above limitations on the extraction method for the document numerical indicator based on machine learning, and details thereof are not repeated here. The modules in the extraction device of the document numerical indicators based on machine learning can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Referring to fig. 7, an embodiment of the present invention provides a terminal device, including:
one or more processors;
a memory coupled to the processor for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting document numerical indexes based on machine learning as described above.
The processor is used for controlling the overall operation of the terminal device so as to complete all or part of the steps of the extraction method of the document numerical index based on the machine learning. The memory is used to store various types of data to support operation at the terminal device, and these data may include, for example, instructions for any application or method operating on the terminal device, as well as application-related data. The Memory may be implemented by any type of volatile or non-volatile Memory device or combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic disk, or optical disk.
In an exemplary embodiment, the terminal device may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, and is configured to perform the method for extracting document numerical indexes based on machine learning according to any one of the embodiments described above, so as to achieve technical effects consistent with that method.
In another exemplary embodiment, a computer-readable storage medium is also provided, which includes a computer program that, when executed by a processor, implements the steps of the method for extracting document numerical indexes based on machine learning according to any one of the above embodiments. For example, the computer-readable storage medium may be the above-mentioned memory including a computer program, and the computer program may be executed by a processor of the terminal device to implement the method according to any one of the above embodiments and achieve technical effects consistent with that method.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A method for extracting document numerical indicators based on machine learning is characterized by comprising the following steps:
dividing a document to be processed into key paragraphs and non-key paragraphs according to preset rules;
adopting natural language processing and deep learning to construct an index extraction model, respectively inputting the key paragraphs and the non-key paragraphs into the index extraction model, and respectively outputting key paragraph indexes and non-key paragraph indexes by the index extraction model;
and integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen the key paragraph indexes and the non-key paragraph indexes, and outputting an index extraction result.
2. The extraction method of the document numerical index based on machine learning according to claim 1, further comprising:
inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the average of the prediction result and the screened index result as a confidence level, and sorting by the average, wherein the sorted output comprises index names, index values, index units, index confidence levels, the short sentences in which indexes were hit, and paragraph texts.
3. The extraction method of document numerical indicators based on machine learning according to claim 2, wherein the dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules comprises:
determining the format of a document to be input, converting a document in PDF format into Word format, and generating an input document;
analyzing the input document to obtain a text and a table element set;
and positioning the key paragraphs by adopting a rule matching method according to the text and the table element set, and outputting the key paragraphs.
4. The method for extracting document numerical indicators based on machine learning according to claim 3, wherein the integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen the key paragraph indexes and the non-key paragraph indexes, and outputting an index extraction result comprises:
labeling the indexes among the key paragraph indexes and the non-key paragraph indexes, organizing and normalizing the index features, and inputting them into a logistic regression model for training to obtain the preset index feature scoring model;
integrating the key paragraph indexes and the non-key paragraph indexes, inputting the integrated indexes into the preset index feature scoring model, filtering the indexes lower than a first threshold value, and outputting the index extraction result.
5. An extraction device of a document numerical index based on machine learning, comprising:
the dividing module is used for dividing the document to be processed into key paragraphs and non-key paragraphs according to preset rules;
the extraction module is used for constructing an index extraction model by adopting natural language processing and deep learning, inputting the key paragraphs and the non-key paragraphs into the index extraction model respectively, and outputting key paragraph indexes and non-key paragraph indexes by the index extraction model respectively;
and the screening module is used for integrating the key paragraph indexes and the non-key paragraph indexes, inputting them into a preset index feature scoring model to screen the key paragraph indexes and the non-key paragraph indexes, and outputting an index extraction result.
6. The extraction apparatus of document numerical indicators based on machine learning according to claim 5, further comprising a calculation module, said calculation module is configured to:
inputting historical sequence data into a preset prediction model to obtain a prediction result, calculating the average of the prediction result and the screened index result as a confidence level, and sorting by the average, wherein the sorted output comprises index names, index values, index units, index confidence levels, the short sentences in which indexes were hit, and paragraph texts.
7. The device for extracting document numerical indicators based on machine learning according to claim 6, wherein the extracting module is further configured to:
determining the format of a document to be input, converting a document in PDF format into Word format, and generating an input document;
analyzing the input document to obtain a text and a table element set;
and positioning the key paragraphs by adopting a rule matching method according to the text and the table element set, and outputting the key paragraphs.
8. The extraction apparatus of document numerical indicators based on machine learning according to claim 7, wherein the filtering module is further configured to:
labeling the indexes among the key paragraph indexes and the non-key paragraph indexes, organizing and normalizing the index features, and inputting them into a logistic regression model for training to obtain the preset index feature scoring model;
integrating the key paragraph indexes and the non-key paragraph indexes, inputting the integrated indexes into the preset index feature scoring model, filtering the indexes lower than a first threshold value, and outputting the index extraction result.
9. A terminal device, comprising:
one or more processors;
a memory coupled to the processor for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for extracting document numerical indicators based on machine learning of any one of claims 1-4.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the method for extracting a document numerical indicator based on machine learning according to any one of claims 1 to 4.
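The flow recited in claims 1, 2 and 4 (rule-based division into key and non-key paragraphs, index extraction, scoring and threshold filtering, then averaging with a prediction result and sorting) can be illustrated with a minimal Python sketch. It is not the implementation of the invention. Everything specific in it is an assumption made for illustration: the keyword rules, the regular expression that stands in for the natural language processing and deep learning extraction model, the hand-set logistic weights (a real scoring model would be trained on labeled indexes), the 0.6 threshold, and the representation of the prediction result as a per-index score between 0 and 1.

```python
import math
import re
from dataclasses import dataclass

# Hypothetical preset rules for locating key paragraphs (not taken from the patent).
KEY_PATTERNS = [re.compile(p) for p in (r"revenue", r"net profit")]

# Hypothetical regex standing in for the NLP / deep learning extraction model.
INDEX_RE = re.compile(
    r"(?P<name>revenue|net profit|headcount) (?:was|of) "
    r"(?P<value>\d+(?:\.\d+)?) (?P<unit>million yuan|people)"
)


@dataclass
class Index:
    name: str
    value: float
    unit: str
    paragraph: str
    is_key: bool
    confidence: float = 0.0


def split_paragraphs(paragraphs):
    """Rule matching: a paragraph is 'key' if any preset pattern hits it."""
    key, non_key = [], []
    for p in paragraphs:
        (key if any(r.search(p) for r in KEY_PATTERNS) else non_key).append(p)
    return key, non_key


def extract(paragraphs, is_key):
    """Return every (name, value, unit) hit found in the given paragraphs."""
    return [
        Index(m["name"], float(m["value"]), m["unit"], p, is_key)
        for p in paragraphs
        for m in INDEX_RE.finditer(p)
    ]


def score(index, weights=(2.0, 1.0), bias=-1.0):
    """Logistic score over two toy features; the weights are made up, not trained."""
    features = (1.0 if index.is_key else 0.0, 1.0 if index.unit else 0.0)
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def screen_and_rank(indexes, predictions, threshold=0.6):
    """Drop indexes scoring below the threshold, then average each remaining
    score with its prediction result and sort by that confidence, descending."""
    kept = []
    for idx in indexes:
        s = score(idx)
        if s < threshold:
            continue
        # Fall back to the score itself when no prediction is available.
        idx.confidence = (s + predictions.get(idx.name, s)) / 2.0
        kept.append(idx)
    return sorted(kept, key=lambda i: i.confidence, reverse=True)
```

In this sketch, an index found in a non-key paragraph receives a score of 0.5 and is filtered out by the assumed 0.6 threshold, while indexes from key paragraphs survive and are ordered by the averaged confidence.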
CN202111487097.XA 2021-12-06 2021-12-06 Extraction method and device of document numerical index based on machine learning Pending CN114398853A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111487097.XA CN114398853A (en) 2021-12-06 2021-12-06 Extraction method and device of document numerical index based on machine learning

Publications (1)

Publication Number Publication Date
CN114398853A true CN114398853A (en) 2022-04-26

Family

ID=81226051

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111487097.XA Pending CN114398853A (en) 2021-12-06 2021-12-06 Extraction method and device of document numerical index based on machine learning

Country Status (1)

Country Link
CN (1) CN114398853A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484920A (en) * 2016-11-21 2017-03-08 北京恒华伟业科技股份有限公司 A kind of abstracting method of evaluation document index
US10410224B1 (en) * 2014-03-27 2019-09-10 Amazon Technologies, Inc. Determining item feature information from user content

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHANGHUI LIU, YIYAN CHEN, YUANFEI DAI, CHENHAO GUO, ZUWEN ZHANG & XING CHEN: "Syntactic and Semantic Features Based Relation Extraction in Agriculture Domain", WEB INFORMATION SYSTEMS AND APPLICATIONS, 20 November 2018 (2018-11-20), pages 252 - 258, XP047501543, DOI: 10.1007/978-3-030-02934-0_23 *
郭少卿;乐小虬;: "科技论文中数值指标实际取值识别", 数据分析与知识发现, no. 01, 25 January 2018 (2018-01-25), pages 21 - 28 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination