CN115238217B

CN115238217B - Method for extracting numerical information from bulletin text and terminal

Info

Publication number: CN115238217B
Application number: CN202211161206.3A
Authority: CN
Inventors: 赵一宁; 朱效民; 王新明; 王茂励; 邹敏; 王琪; 杨航
Original assignee: Shandong Qilu Big Data Research Institute
Current assignee: Shandong Qilu Big Data Research Institute
Priority date: 2022-09-23
Filing date: 2022-09-23
Publication date: 2022-12-20
Anticipated expiration: 2042-09-23
Also published as: CN115238217A

Abstract

The invention provides a method and a terminal for extracting numerical information from bulletin text, relating to the technical field of data identification in natural language processing. A set news bulletin webpage is loaded in simulation by a crawler tool to obtain the content of the webpage; all sentences in the webpage content are traversed, and whether adjacent sentences need to be spliced is judged according to a preset rule; valid sentences containing numerical information are extracted, and the numerical information in the valid sentences is extracted by a preset extraction algorithm to form numerical relation tuples; the processed numerical relation tuples are stored in memory as a list and displayed in a preset form. The invention can extract numerical relations of finer granularity from large-scale unstructured bulletin text and meets users' needs for deeper and finer-grained information.

Description

Method for extracting numerical information from bulletin text and terminal

Technical Field

The invention relates to the technical field of data identification of natural language processing technology, in particular to a method and a terminal for extracting numerical value information from bulletin texts.

Background

The volume of data on the Internet is too large to estimate, and the information it expresses takes many forms. Numerical values are information obtained from data through statistics, analysis, processing and calculation, and they allow data to be described more accurately and more intuitively.

Many industries currently require the use of industry data to plan, or summarize industry development status. The importance of the numerical values is self-evident, since economic situations and industrial situations are often numerically shown, for example in some government reports or business reports.

At present, numerical information is mainly extracted with templates: character, grammatical or semantic patterns expressed in the text are defined, and relation instances are extracted mainly by matching these patterns against the text. Although the accuracy is high, the portability is poor. Because text data comes from many sources, differs in structure and is huge in quantity, texts in different fields have different language characteristics; some are colloquial and some are highly specialized, so one rule template can only fit one kind of text, and a universal rule template is difficult to establish. In addition, existing extraction methods cannot accurately identify the numerical values the user needs, so the extracted values contain much noise, which affects the accuracy of later analysis. Since values must be extracted from many websites, the extraction efficiency of existing methods is also low. How to quickly and accurately acquire the numerical information a user wants from large-scale text has therefore become a problem to be solved.

Disclosure of Invention

The invention provides a method for extracting numerical information in bulletin texts, which is used for collecting the bulletin texts from the Internet, preprocessing the bulletin texts and extracting various items of numerical information contained in the bulletin texts to form various accurate data items and meet the use requirements of users.

The method for extracting the numerical information in the bulletin text comprises the following steps:

step one, performing simulated loading of a set news bulletin webpage based on a crawler tool so as to obtain the content of the webpage;

step two, traversing all sentences in the webpage content of the news bulletin, and judging whether adjacent sentences in the obtained webpage content need to be spliced according to a preset rule;

if splicing is required, splicing the two adjacent sentences so as to obtain the bulletin text;

step three, extracting valid sentences containing numerical information, and extracting the numerical information in the valid sentences based on a preset extraction algorithm to form numerical relation tuples;

the extraction method comprises the following steps:

(1) Combining nouns, verbs and entity pairs to extract a candidate relation tuple comprising a numerical value main body, a numerical value and a relation between the main body and the numerical value;

(2) Filtering the candidate relation triple through the relation indicating words;

(3) Filtering out noise in the candidate relation triples based on a restriction on the position of the relation indicator relative to the entity pair;

(4) Expanding the candidate relation triple through semantic features;

(5) Expanding the candidate relationship triples by syntactic analysis;

and step four, extracting the processed numerical value relation tuples, storing the numerical value relation tuples in a memory in a list form, and displaying the numerical value relation tuples in a preset form.

It should be further noted that the first step further includes: disguising the crawler program by adding an appropriate request header;

parsing the webpage by using the WebDriver of Selenium, setting a waiting time, and waiting for all elements of the webpage to be loaded;

acquiring a webpage source code, and extracting the content of an html element corresponding to the webpage source code according to an xpath expression;

the extracted content comprises a webpage body, and the webpage body is formed by splicing the contents of all text labels in the webpage.

It should be further noted that step two further includes:

judging whether to splice adjacent sentences on the basis of punctuation marks, specifically including the following conditions:

if the character at the end of the sentence is any one of the following punctuation marks: comma, colon, opening quotation mark, opening parenthesis, or opening title mark, or is either of the words "and" and "including", splicing the sentence with the adjacent following sentence;

if the character at the beginning of the sentence is any one of the following punctuation marks: comma, colon, semicolon, closing quotation mark, exclamation mark, period, percent sign, bracket, title mark, enumeration comma, question mark, or "&", splicing the sentence with the adjacent preceding sentence;

if a plurality of punctuation mark pairs exist in the sentence and the number of opening marks is greater than the number of closing marks, splicing the sentence with the adjacent following sentence;

if a plurality of punctuation mark pairs exist in the sentence and the number of opening marks is less than the number of closing marks, splicing the sentence with the adjacent preceding sentence;

if one punctuation mark pair exists in the sentence and its two halves appear in reversed order, splicing the sentence with both the adjacent preceding sentence and the adjacent following sentence;

and after the sentences are spliced, dividing the obtained bulletin text into a set of sentences according to the long-sentence separators.

It should be further noted that word segmentation, part-of-speech tagging, syntactic parsing, semantic role labeling and named entity recognition are performed on the sentences in the set through a preset language model;

and extracting numerical values by taking sentences as units.

It should be further noted that, the extracting of valid sentences with numerical information in step three includes:

(I) a numeral exists in the sentence;

(II) the sentence contains a user-defined numerical relation trigger word;

if the input sentence is identified as situation (I), retaining the sentence in which the numeral is located, then inputting all such sentences into the numerical relation extraction system in one pass, and identifying the numerical subject, the numerical value and the relation between them;

when the input sentence is judged not to belong to situation (I), judging whether it belongs to situation (II);

if situation (II) is met, retaining the sentence in which the trigger word is located, and identifying and acquiring the numerical information from the sentence.

It should be further noted that if the numerical information is a null value, no numerical relation can be obtained from the sentence, and the sentence is not a valid sentence;

otherwise, the sentence is recognized as a valid sentence, and the sentence, the numerical subject, the numerical value and the relation between them are stored.

It should be further noted that, the extraction manner of the third step further includes:

step (1) further comprises: based on the POS tagging result, taking a verb or a verb-plus-noun compound as the center, selecting entity pairs located no more than two entities away to the left or right, and combining them to form candidate relation triples;

step (2) further comprises: for the candidate relation triples extracted in the previous step, extracting the relation indicator in each triple, counting the occurrence frequency of each relation indicator, generating a ranking function from these frequencies, and setting a threshold to filter out the candidate relation triples whose relation indicator ranks beyond the threshold;

step (3) further comprises: according to the relative position between the relation indicator and the entity pair, the relation indicator has three possible positions in the sentence, namely between the entity pair, on the right side of the entity pair, and on the left side of the entity pair; noise in the candidate relation triples is filtered according to the expression characteristics of the bulletin text;

step (4) further comprises: analyzing the relation between each component of the sentence and the predicate based on the SRL tagging result, and defining three relation types A0, A1 and A2 to expand the relation triples; if A0 and A1 have a semantic relation with the same predicate, directly obtaining the subject-predicate-object triple [A0, pred1, A1];

similarly, if A1 and A2 have semantic relations with the same predicate, extracting the relation triple [A1, pred2, A2];

step (5) further comprises: based on the POS and DP tagging results, defining four types of syntactic features to expand the relation triples.
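Steps (2) and (3) above can be sketched in Python as follows. This is only an illustration under stated assumptions: the triple layout (E1, R, E2), the function names and the use of token indices to describe positions are hypothetical and are not taken from the source.

```python
from collections import Counter
from typing import List, Tuple

Triple = Tuple[str, str, str]  # assumed layout: (E1, relation indicator R, E2)


def filter_by_indicator_rank(triples: List[Triple], max_rank: int) -> List[Triple]:
    """Keep triples whose relation indicator is among the max_rank most frequent."""
    counts = Counter(indicator for _, indicator, _ in triples)
    # most_common() sorts indicators by descending frequency (the ranking step).
    top = {indicator for indicator, _ in counts.most_common(max_rank)}
    return [t for t in triples if t[1] in top]


def indicator_position(e1_idx: int, r_idx: int, e2_idx: int) -> str:
    """Classify where the indicator sits relative to the entity pair."""
    lo, hi = sorted((e1_idx, e2_idx))
    if r_idx < lo:
        return "left"
    if r_idx > hi:
        return "right"
    return "between"
```

Which of the three positions counts as noise depends on the expression characteristics of the bulletin text, so the position label would be fed into a rule chosen for the corpus at hand.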

It should be further noted that, in step four, the Python Excel module XlsxWriter is used for data output;

the method specifically comprises the following steps:

【1】 Importing the extracted list of numerical relation tuples;

【2】 Creating an Excel workbook through the xlsxwriter.Workbook function, and creating a sheet through the workbook.add_worksheet function;

【3】 Customizing a header and writing the header into the Excel table;

【4】 Writing the numerical relation tuples into the rows of the Excel table through the worksheet.write_row function and displaying them.

The invention also provides a terminal for implementing the method for extracting numerical information from bulletin text, comprising: a memory, used for storing a computer program and the method for extracting numerical information from bulletin text;

and a processor, used for executing the computer program and the method for extracting numerical information from bulletin text, so as to implement the steps of the method.

According to the technical scheme, the invention has the following advantages:

the invention establishes a numerical information extraction method with strong operability, good adaptability and high accuracy, realizes rapid, real-time data updating and monitoring, provides an important reference for decision-making by enterprises and other organizations, and better serves economic and social development.

The method for extracting the numerical information in the bulletin text provided by the invention is a beneficial supplement of other information acquisition means from the viewpoint of meeting the information requirements of users. And a numerical relation with smaller granularity is extracted from a large-scale unstructured bulletin text, so that the information requirements of a user on deeper level and finer granularity are met.

From the technical implementation point of view, the method for extracting numerical information in the bulletin text provides support for other information processing technologies. The invention is used as a technology for automatically converting unstructured information into structured information, and lays a foundation for further information processing such as database query, data analysis and the like.

The extraction method of the numerical information in the bulletin text provided by the invention plays an important role in the application fields of information collection, health service, commodity sales, economic situation prediction and the like from the engineering perspective.

Drawings

In order to more clearly illustrate the technical solution of the present invention, the drawings required to be used in the description will be briefly introduced below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.

FIG. 1 is a flow chart of a method for extracting numerical information from a bulletin text;

FIG. 2 is a flow chart of valid sentence extraction;

FIG. 3 is a flow chart of numeric relationship tuple extraction.

Detailed Description

The invention provides a method for extracting numerical information in a bulletin text, which extracts the numerical information from the bulletin text. The invention aims to extract the location, time, the main body of the numerical information, the numerical value and the association relationship between the main body and the numerical value from the bulletin text.

The extraction method provided by the invention aims at the defects of the prior art, combines rules and statistics, identifies and extracts various numerical information in the bulletin text based on a numerical information extraction mode of a Natural Language Processing (NLP) technology, converts the numerical information into structured data and stores the structured data into excel, and integrates the data in a uniform form so as to facilitate the query and comparison of users.

The method is different from the traditional manual recording and inputting mode, and is based on the establishment of a set of numerical information extraction method with strong operability, good adaptability and high accuracy, so that the data can be updated and monitored quickly and in real time, an important reference basis is provided for making decisions for enterprises and other organizations, and economic and social development can be better served.

Specifically, as shown in FIG. 1, the method for extracting numerical information from bulletin text provided by the present invention includes: step one, performing simulated loading of a set news bulletin webpage based on a crawler tool so as to obtain the content of the webpage;

Based on the crawler tool Selenium, the URLs pointing to news bulletin webpages are loaded in simulation to obtain the content of the webpages. Selenium runs directly in the browser and simulates the user's operations.

The specific crawling method is as follows: first, the crawler program is disguised by adding an appropriate request header, which prevents the website from identifying the crawler and banning its IP address. Then, the webpage is parsed with Selenium's WebDriver, a waiting time is set, and the program waits until all elements of the webpage are loaded. Finally, the webpage source code is obtained and the content of the corresponding HTML elements is extracted according to an XPath expression; the extracted content includes the webpage body, which is formed by splicing the contents of all text tags in the webpage.
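The crawling procedure described above might look roughly like the following Python sketch. It assumes the Selenium Python bindings and a Chrome driver are available; the User-Agent string, the XPath expression passed in and the function names are hypothetical, and the source does not specify which browser is used.

```python
from typing import List


def join_text_blocks(blocks: List[str]) -> str:
    """Splice the text of all body elements into one page body, skipping empty blocks."""
    return "\n".join(b.strip() for b in blocks if b and b.strip())


def fetch_page_body(url: str, body_xpath: str, wait_seconds: int = 10) -> str:
    """Load a dynamic page in a headless browser and return the spliced body text."""
    # Imported lazily so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    # Hypothetical User-Agent value: this is how the request header is disguised here.
    options.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the elements matched by the XPath expression are present.
        WebDriverWait(driver, wait_seconds).until(
            EC.presence_of_all_elements_located((By.XPATH, body_xpath))
        )
        elements = driver.find_elements(By.XPATH, body_xpath)
        return join_text_blocks([e.text for e in elements])
    finally:
        driver.quit()
```

A call such as `fetch_page_body(url, "//p")` would return the text of all paragraph elements joined into one body; the XPath actually needed depends on the structure of the target site.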

Step two, traversing all sentences in the news bulletin webpage content, and judging whether adjacent sentences in the obtained webpage content need to be spliced according to a preset rule; if splicing is required, splicing the two adjacent sentences so as to obtain the bulletin text;

In step two, whether to splice adjacent sentences is judged on the basis of punctuation marks, which includes the following cases:

(1) If the character at the end of the sentence is any one of the following punctuation marks: comma, colon, opening quotation mark, opening parenthesis, or opening title mark, or is either of the words "and" and "including", the sentence is spliced with the adjacent following sentence;

(2) If the character at the beginning of the sentence is any one of the following punctuation marks: comma, colon, semicolon, closing quotation mark, exclamation mark, period, percent sign, bracket, title mark, enumeration comma, question mark, or "&", the sentence is spliced with the adjacent preceding sentence;

(3) If a plurality of punctuation mark pairs exist in the sentence and the number of opening marks is greater than the number of closing marks, the sentence is spliced with the adjacent following sentence;

(4) If a plurality of punctuation mark pairs exist in the sentence and the number of opening marks is less than the number of closing marks, the sentence is spliced with the adjacent preceding sentence;

(5) If one punctuation mark pair exists in the sentence and its two halves appear in reversed order, the sentence is spliced with both the adjacent preceding sentence and the adjacent following sentence.

After the sentences are spliced, the obtained bulletin text is divided into a set of sentences according to the long-sentence separators (semicolon, period, question mark, exclamation mark and ellipsis). Then, Chinese word segmentation (CWS), part-of-speech tagging (POS), dependency parsing (DP), semantic role labeling (SRL) and named entity recognition (NER) are performed on the sentences in the set through the language models provided by the Language Technology Platform (LTP) of Harbin Institute of Technology. Finally, numerical values are extracted sentence by sentence.

Step three, extracting valid sentences containing numerical information, and extracting the numerical information in the valid sentences based on a preset extraction algorithm to form numerical relation tuples. Specifically, the valid sentences are first obtained from the preprocessed sentence set. A valid sentence is a sentence that contains numerical information, as determined by whether numerical information exists in it, and takes either of the following forms:

(1) a numeral exists in the sentence;

(2) the sentence contains a user-defined numerical relation trigger word;

A numerical relation trigger word is a word that can introduce numerical information; trigger words are customized according to the bulletin type and stored in a word list.

If the input sentence is identified as situation (1), the sentence in which the numeral is located is retained, all such sentences are then input into the numerical relation extraction system in one pass, and the numerical subject, the numerical value and the relation between them are identified. Because numerals are identified by part-of-speech tags, the tagger may occasionally label a numeral, which should be tagged "m", as a common noun "n". Therefore, when the input sentence is judged not to belong to situation (1), whether it belongs to situation (2) is judged; if so, the sentence in which the trigger word is located is retained, and numerical information is identified and acquired from it. Finally, if the numerical information is null, no numerical relation can be obtained from the sentence and the sentence is not a valid sentence; otherwise, the sentence is recognized as a valid sentence, and the sentence, the numerical subject, the numerical value and the relation between them are stored.
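The two-stage check described above can be illustrated with a short Python sketch. It assumes each sentence has already been tagged and is represented as a list of (word, tag) pairs; the trigger words and function names below are hypothetical placeholders, since the source states only that the trigger list is customized per bulletin type.

```python
from typing import Iterable, List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical trigger words; the real list is customised per bulletin type.
TRIGGER_WORDS = frozenset({"increase", "decrease", "reach"})


def is_candidate_sentence(tokens: Sequence[Token],
                          triggers: Iterable[str] = TRIGGER_WORDS) -> bool:
    """Return True when a sentence should be passed on to relation extraction."""
    # Situation (1): at least one token carries the numeral tag "m".
    if any(tag == "m" for _, tag in tokens):
        return True
    # Situation (2): checked only when (1) fails, e.g. a numeral mis-tagged as "n".
    trigger_set = set(triggers)
    return any(word in trigger_set for word, _ in tokens)


def select_candidates(sentences: Sequence[Sequence[Token]]) -> List[Sequence[Token]]:
    """Keep only the sentences that satisfy situation (1) or situation (2)."""
    return [s for s in sentences if is_candidate_sentence(s)]
```

A candidate only becomes a valid sentence once the extraction step returns non-null numerical information for it, so this filter is a pre-selection rather than the final decision.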

The numerical relation extraction algorithm and its parameters are as follows:

Algorithm: numerical relation extraction

Input: a valid sentence set preText;

Output: a set of relation triples Rt; // relation triple format [E1, R, E2]

1. candidateTriples ← extractTriples(preText) // extract candidate relation tuples from the valid sentences in turn

2. relationKeywords ← generateRK(candidateTriples) // extract relation indicators from the candidate relation tuples

3. preTriples ← sortFilter(M) // M denotes the number of occurrences of a relation indicator

4. for each canRt in preTriples do { // canRt denotes each element of the set preTriples

5. fa ← rkFilter(canRt);

6. fb ← rlExpand(canRt);

7. fc ← sfExpand(canRt);

8. }

9. Rt ← getCombin(fa, fb, fc);

10. output Rt.
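As an illustration of line 1 of the algorithm (extractTriples), the following Python sketch shows one possible reading of candidate extraction. It is an assumption-laden simplification: it uses only plain verbs as centers, pairs entities on opposite sides of the verb, and treats the tags in ENTITY_TAGS as entities; none of these choices is confirmed by the source.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Assumed set of tags treated as entities (nouns, named entities, numerals).
ENTITY_TAGS = frozenset({"n", "nh", "ni", "ns", "nz", "m"})


def extract_candidate_triples(tokens: Sequence[Token],
                              max_entities: int = 2) -> List[Tuple[str, str, str]]:
    """Pair entities lying within max_entities entities of each verb center."""
    triples = []
    for i, (word, tag) in enumerate(tokens):
        if tag != "v":
            continue
        # Nearest entities to the left and to the right of the verb.
        left = [w for w, t in tokens[:i] if t in ENTITY_TAGS][-max_entities:]
        right = [w for w, t in tokens[i + 1:] if t in ENTITY_TAGS][:max_entities]
        for e1 in left:
            for e2 in right:
                triples.append((e1, word, e2))
    return triples
```

The over-generation is deliberate: the later filtering steps are what remove noisy candidates.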

The extraction of the numerical value information is realized by combining an extraction algorithm. The specific extraction operation comprises:

(1) Candidate relation tuples, each comprising a numerical subject, a numerical value and the relation between them, are extracted by combining nouns, verbs and entity pairs. Based on the POS tagging result, with a verb or a verb-plus-noun compound as the center, entity pairs located no more than two entities away to the left or right are selected and combined to form candidate relation triples.

(2) The candidate relation triples are filtered by relation indicators. For the candidate relation triples extracted in the previous step, the relation indicator in each triple is extracted and its occurrence frequency is counted; a ranking function is generated from these frequencies, and a threshold is set to filter out the candidate relation triples whose relation indicator ranks beyond the threshold.

(3) The candidate relation triples are filtered by a restriction on the position of the relation indicator relative to the entity pair. According to this relative position, the relation indicator has three possible positions in the sentence, namely between the entity pair, on the right side of the entity pair, and on the left side of the entity pair; noise in the candidate relation triples is filtered according to the expression characteristics of the bulletin text.

(4) The candidate relation triples are expanded through semantic features. The relation between each component of the sentence and the predicate is analyzed based on the SRL tagging result, and three relation types A0, A1 and A2 are defined to expand the relation triples. If A0 and A1 have a semantic relation with the same predicate, the subject-predicate-object triple [A0, pred1, A1] is obtained directly. Similarly, if A1 and A2 have semantic relations with the same predicate, the relation triple [A1, pred2, A2] is extracted.

(5) The candidate relation triples are expanded through syntactic analysis. Based on the POS and DP tagging results, four types of syntactic features are defined to expand the relation triples. The tags and their meanings are as follows:

SBV denotes the subject-verb relation

VOB denotes the verb-object relation

ATT denotes the attribute (modifier-head) relation

POB denotes the preposition-object relation

ADV denotes the adverbial (adverbial-head) structure

COO denotes the coordinate relation

n denotes a noun

v denotes a verb

p denotes a preposition
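The semantic expansion of step (4) can be sketched as follows. The input format is an assumption made for illustration: each predicate is paired with a dictionary that maps role labels to argument text, which is not necessarily how the SRL model returns its results.

```python
from typing import Dict, List, Tuple

Frame = Tuple[str, Dict[str, str]]  # (predicate, {role label: argument text})


def expand_by_srl(frames: List[Frame]) -> List[Tuple[str, str, str]]:
    """Build [A0, pred, A1] and [A1, pred, A2] triples from SRL frames."""
    triples = []
    for predicate, roles in frames:
        # A0 and A1 attached to the same predicate give a subject-predicate-object triple.
        if "A0" in roles and "A1" in roles:
            triples.append((roles["A0"], predicate, roles["A1"]))
        # A1 and A2 attached to the same predicate give a second triple.
        if "A1" in roles and "A2" in roles:
            triples.append((roles["A1"], predicate, roles["A2"]))
    return triples
```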

The syntactic features used to extract relation triples are defined as follows (the formal expressions are not reproduced in this text):

SF1: subject-verb-object structure;

SF2: dependency relation structure;

SF3: subject-verb-object structure containing a preposition-object relation;

SF4: coordinate relation structure.
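Because the formal expressions are missing here, the following Python sketch shows only one plausible reading of the subject-verb-object feature SF1: a verb that governs both an SBV arc and a VOB arc yields a triple. The arc representation (0-based head index, -1 for the root) and the function name are assumptions and do not reproduce the patent's definition.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (index of the head word, dependency label); -1 marks the root


def extract_svo(words: List[str], arcs: List[Arc]) -> List[Tuple[str, str, str]]:
    """Return (subject, verb, object) triples for verbs with SBV and VOB dependents."""
    subjects: Dict[int, List[str]] = {}
    objects: Dict[int, List[str]] = {}
    for i, (head, label) in enumerate(arcs):
        if label == "SBV":
            subjects.setdefault(head, []).append(words[i])
        elif label == "VOB":
            objects.setdefault(head, []).append(words[i])
    triples = []
    for verb_idx, subs in subjects.items():
        for subj in subs:
            for obj in objects.get(verb_idx, []):
                triples.append((subj, words[verb_idx], obj))
    return triples
```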

In addition, the bulletin publication time, publication place and issuing unit can be extracted independently of the numerical relation tuples by using regular expressions and part-of-speech tags.

Each extracted numerical relation tuple is stored in memory in a list. To display the data better, the data is presented in an Excel table. The method uses the Python Excel module XlsxWriter for data output.

The specific operation comprises the following steps:

(1) Importing the extracted relation tuple list;

(2) Creating an Excel workbook through the xlsxwriter.Workbook function, and creating a sheet through the workbook.add_worksheet function;

(3) Customizing a header and writing the header into the Excel table;

(4) Writing the data into the rows of the Excel table through the worksheet.write_row function.
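The four steps above can be sketched with XlsxWriter as follows. The header names and the tuple layout are hypothetical, since the source says only that the header is customized; the XlsxWriter calls used are `xlsxwriter.Workbook`, `add_worksheet`, `write_row` and `close`.

```python
from typing import List, Sequence

# Hypothetical header; the source leaves the column names to the implementer.
HEADER = ["sentence", "subject", "relation", "value"]


def build_rows(tuples: Sequence[Sequence[str]]) -> List[List[str]]:
    """Put the custom header first, followed by one row per relation tuple."""
    return [list(HEADER)] + [list(t) for t in tuples]


def export_to_excel(tuples: Sequence[Sequence[str]], path: str) -> None:
    """Write the relation tuples to an .xlsx file, one tuple per row."""
    import xlsxwriter  # third-party package, imported lazily

    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet()
    for row_idx, row in enumerate(build_rows(tuples)):
        # write_row(row, col, data) writes a list across one row starting at col.
        worksheet.write_row(row_idx, 0, row)
    workbook.close()
```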

The method for extracting the numerical information in the bulletin text, provided by the invention, is used for constructing a numerical information extraction model with strong operability, good adaptability and high accuracy.

Compared with the prior art, the extraction method for the numerical information in the bulletin text provided by the invention is a beneficial supplement of other information acquisition means from the viewpoint of meeting the information requirements of users. Numerical value relations with smaller granularity are extracted from large-scale unstructured bulletin texts, and the requirements of users on deeper information and finer information are met.

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The present invention is further described with reference to fig. 1 to 3, taking numerical information extraction in the national economic development bulletin as an example, and the specific flow is as follows:

(I) Data crawling

Since the webpages of the national economic development bulletin use dynamic page technology, their text content cannot be obtained by requesting the pages directly. Therefore, Selenium is used to load the webpage in simulation so as to obtain all the content in the webpage. Selenium runs directly in the browser and simulates the user's operations. The specific crawling method is as follows:

(1) Adding an appropriate request header to prevent the website from identifying the crawler program and banning its IP address; parsing the webpage with Selenium's WebDriver and waiting until all elements of the webpage are loaded; and, if the request fails, retrying after extending the waiting time.

(2) Extracting the content of the corresponding HTML elements with Selenium's XPath method. The extracted content includes: the webpage title, the issuing authority, the issuing time and the webpage body. The webpage body is formed by splicing the contents of all text tags.

(II) data preprocessing

(1) Sentence concatenation

The crawled text is presented as a list, in which the same sentence or paragraph is sometimes broken across entries. After manual review, the invention judges whether adjacent sentences should be spliced according to punctuation marks. The invention summarizes the following 6 possible cases, where "|" means "or" and separates the items in each list. For the input i-th sentence:

Case 1: the last character is one of the punctuation marks or words [ comma | colon | opening quotation mark | opening parenthesis | opening square bracket | opening brace | opening title mark | "and" | "including" ]

Case 2: the first character of sentence i+1 is one of the punctuation marks [ comma | semicolon | colon | period | closing quotation mark | exclamation mark | percent sign | "&" | parenthesis | closing brace | closing title mark | enumeration comma | question mark ]

Case 3: the sentence contains paired punctuation marks [ () | [] | {} | 《》 | “” ], and the number of opening marks is greater than the number of closing marks, as in "xx (xx) xx (xx"

Case 4: sentence i+1 contains paired punctuation marks [ () | [] | {} | 《》 | “” ], and the number of opening marks is less than the number of closing marks, as in "xx) xx (xx) xx"

Case 5: a pair of symbols exists, but in reversed order, as in "xxx) xxx (xxx"

Case 6: the last character is the Chinese character rendered here as "it", and the first character of sentence i+1 is the Chinese character "Zhong" (presumably "其" and "中", which together form "其中", "among which")

If the input i-th sentence meets any one of the above 6 cases, the i-th sentence is spliced with the (i+1)-th sentence. Otherwise, the input sentence does not need to be spliced with the next sentence.
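The splicing decision can be sketched in Python as follows. The character sets are assumptions reconstructed from the descriptions of the cases, the English words "and" and "including" stand in for the original Chinese words, and Case 6 is omitted because its characters are not reliably recoverable from this text.

```python
from typing import List

OPENERS = "(（[{《“"
CLOSERS = ")）]}》”"
END_MARKS = set(",，:：") | set(OPENERS)                       # Case 1 (assumed set)
END_WORDS = ("and", "including")                              # stand-ins for the Chinese words
START_MARKS = set(",，;；:：.。!！?？%&、") | set(CLOSERS)      # Case 2 (assumed set)


def pair_balance(text: str) -> int:
    """Number of opening marks minus number of closing marks."""
    return sum(text.count(c) for c in OPENERS) - sum(text.count(c) for c in CLOSERS)


def has_reversed_pair(text: str) -> bool:
    """True for a balanced pair whose closing mark comes first, e.g. 'x) y (z'."""
    depth, went_negative = 0, False
    for ch in text:
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth -= 1
            went_negative = went_negative or depth < 0
    return went_negative and depth == 0


def should_splice(current: str, following: str) -> bool:
    if not current or not following:
        return False
    return (
        current[-1] in END_MARKS or current.endswith(END_WORDS)       # Case 1
        or following[0] in START_MARKS                                 # Case 2
        or pair_balance(current) > 0                                   # Case 3
        or pair_balance(following) < 0                                 # Case 4
        or has_reversed_pair(current) or has_reversed_pair(following)  # Case 5
    )


def splice(fragments: List[str]) -> List[str]:
    """Merge each fragment into its predecessor whenever a splice case applies."""
    merged: List[str] = []
    for fragment in fragments:
        if merged and should_splice(merged[-1], fragment):
            merged[-1] += fragment
        else:
            merged.append(fragment)
    return merged
```

Because the balance is recomputed on the already merged text, a fragment that closes an open parenthesis stops the chain from continuing into the next unrelated fragment.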

(2) Text segmentation

For the input bulletin text, the marks [ ; | . | ? | ! | … ] are used as separators to divide the text into a set of long sentences, and information is extracted sentence by sentence. Punctuation marks enclosed in symbol pairs such as (), [] and “” are not considered during segmentation.
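A minimal Python sketch of this segmentation is given below. The separator and bracket sets are assumptions; in particular the ASCII full stop is left out so that decimals such as 15.8 are not split, which is a choice made for this sketch and is not stated in the source.

```python
from typing import List

SEPARATORS = set(";；。?？!！…")   # assumed long-sentence separators
OPENERS = "(（[《“"
CLOSERS = ")）]》”"


def split_long_sentences(text: str) -> List[str]:
    """Split text at separators, ignoring separators nested inside symbol pairs."""
    sentences: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in text:
        current.append(ch)
        if ch in OPENERS:
            depth += 1
        elif ch in CLOSERS:
            depth = max(depth - 1, 0)
        elif ch in SEPARATORS and depth == 0:
            piece = "".join(current).strip()
            # A run of separators (e.g. a two-character ellipsis) stays with the previous sentence.
            if sentences and all(c in SEPARATORS for c in piece):
                sentences[-1] += piece
            else:
                sentences.append(piece)
            current = []
    tail = "".join(current).strip()
    if tail:
        sentences.append(tail)
    return [s for s in sentences if s]
```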

(3) LTP language technology platform

Loading the LTP models: the cws, pos, parser, pisrl_win and ner models are used for word segmentation, part-of-speech tagging, syntactic parsing, semantic role labeling and named entity recognition of sentences, respectively. Meanwhile, in order to better recognize industry terms, the invention adds a total of 180 major classes from the national economy industry classification and the product classification for statistics, as provided by an official website, to a custom dictionary (named user _ fact.). Part of the content of the custom dictionary is as follows: agriculture; forestry; animal husbandry; fishery; agriculture, forestry, animal husbandry and fishery service industry; coal mining and washing industry; oil and gas exploration industry; ferrous metal ore mining and dressing; non-ferrous metal ore mining and dressing; non-metal ore mining and dressing; other mining industries; agricultural and sideline food processing industry; food manufacturing industry; beverage manufacturing industry; tobacco products; textile industry; textile garment, shoe and cap manufacturing; leather, fur, feather (down) and their products; wood processing and wood, bamboo, rattan, palm and grass products industry; furniture manufacturing industry; paper and paper products industry; printing and reproduction of recording media; cultural, educational and sporting goods manufacturing; petroleum processing, coking and nuclear fuel processing industry; chemical raw materials and chemical products manufacturing; pharmaceutical manufacturing industry;

(III) Extraction of each item of numerical information

The information contained in a national economic development bulletin comprises: valid sentences, time, place, website address and the various items of numerical information. Their specific meanings are as follows:

Valid sentence: a sentence that contains numerical information; validity is determined by whether numerical information is present.

Time: divided into two categories. One is the announcement publication time, determined by the regular expression "(\ d {4 }) \ s \ [ \/year- ] \ s \ [ \/month- ] \ s \ [ \/d {1,2 }) \ s \ [ # ]"; the other is the time mentioned in the body of the announcement, determined in the valid sentence by the part-of-speech tag "nt".

Place: divided into two categories. One is the province or city issuing the announcement, determined in the announcement title by the part-of-speech tag "ns"; the other is a place mentioned in the body of the bulletin, which is treated as data information.
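The regular expression printed above appears to have lost characters, so the following Python sketch is only a hedged reconstruction of its apparent intent: a four-digit year, a month and a one- or two-digit day, separated by "/", "-" or the characters 年 and 月. The exact pattern, the optional trailing 日 and the helper name `publish_date` are assumptions.

```python
import re

# Assumed reconstruction: year / month / day with "/", "-" or 年, 月 as separators.
DATE_RE = re.compile(r"(\d{4})\s*[/年-]\s*(\d{1,2})\s*[/月-]\s*(\d{1,2})\s*日?")


def publish_date(text):
    """Return (year, month, day) as integers for the first date found, else None."""
    match = DATE_RE.search(text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

For example, `publish_date("2021-03-01 09:00")` and `publish_date("发布时间：2021年3月1日")` both yield the same triple of integers.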

Website address: the source of the announcement web page to which the valid sentence belongs.

Numerical information: the numerical subject and the numerical value appearing in the bulletin, together with the relation between that subject and the value, determined from the noun entity-pair relations, the semantics and the syntactic features of the valid sentence.

In the design of the system, the extraction of each item of numerical information is realized mainly by combining the valid sentences with the time and the numerical relation extraction algorithm applied to those sentences.

In the design of the present system, the present invention assumes two types of valid sentences.

Case 1: the sentence contains a numeral tagged "m", as in "software business exports 15.8 billion dollars in daylight". The invention treats any sentence containing a word tagged "m" as essentially valid numerical information.

Case 2: the sentence contains a numerical relation trigger word and numerical information. A numerical relation trigger word is a word that introduces numerical information; for example, "increased" in "precipitation increased by manual operation 4.97 billion cubic meters" is a trigger word that introduces the numerical subject and its specific value.

By browsing economic development bulletins, the system builds the following list of numerical relation trigger words: increasing; added value; increasing; growth rate; higher than; increasing; newly added; rising; reaching; exceeding; declining; rate of decline; reduction; falling; export; income; investment; reducing pressure; accounting for; proportion; contribution rate; total amount; cumulative; loss.
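The two cases above can be sketched as a small classifier over POS-tagged tokens. This is an illustrative sketch only: the trigger words are English stand-ins for a subset of the list, and `classify_sentence` is a hypothetical name.

```python
# Illustrative English stand-ins for a subset of the trigger word list.
TRIGGER_WORDS = {"increased", "reaches", "exceeds", "declined", "accounts for"}


def classify_sentence(tokens):
    """tokens: list of (word, pos_tag) pairs.

    Returns "case1" if any word is tagged "m", "case2" if a trigger word
    is present, and None if the sentence is not treated as valid.
    """
    if any(pos == "m" for _, pos in tokens):
        return "case1"
    if any(word in TRIGGER_WORDS for word, _ in tokens):
        return "case2"
    return None
```

Case 1 is checked first, so the trigger word list is consulted only for sentences without an "m"-tagged word.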

When a sentence is input and identified as case 1, the sentence containing the word tagged "m" is kept directly and the time mentioned in the sentence is saved; if no time is mentioned, the time in the announcement title is saved instead. All such sentences are then input into the numerical relation extraction system, which identifies the numerical subject, the numerical value and the relation between them. If the input sentence does not belong to case 1, it is checked against case 2. The system first checks whether a trigger word exists; if not, the sentence is treated as invalid by default. If a trigger word exists, the sentence containing it is kept and the time mentioned in the sentence is recorded. The numerical subject, the numerical value and the relation between them are then identified by the numerical relation tuple extraction algorithm. In practice, the invention extracts a value and the unit that follows it as a whole. The specific process is as follows:

Step 1: Input a valid sentence.

Step 2: Based on the POS tagging results, take verbs and verb-plus-noun compounds as centers, and select entity pairs that lie no more than two entities away to the left or right, combining them into candidate relation triples. For example, in the sentence "insurance premium|n income|v 3482.5 million|m yuan|q", "income" is a verb connecting a subject and an object, so the relation triple [insurance premium, income, 3482.5 million yuan] can be extracted. In the sentence "railway, road and waterway|n complete|v passenger transportation volume|n 3.0 hundred million|m times|q", the compound of "complete|v" and "passenger transportation volume|n" can be regarded as a relation word, giving the relation triple [railway, road and waterway, complete passenger transportation volume, 3.0 hundred million times].
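A minimal Python sketch of this candidate step follows. It simplifies several points and should be read as an assumption-laden illustration: a numeral (m) and the quantifier (q) after it are merged into one numeric entity, the "two entity" distance is approximated by a window of two tokens, and the function names are hypothetical.

```python
def merge_numbers(tokens):
    """Merge a numeral (m) and a following quantifier (q) into one 'num' entity."""
    merged, i = [], 0
    while i < len(tokens):
        word, pos = tokens[i]
        if pos == "m":
            if i + 1 < len(tokens) and tokens[i + 1][1] == "q":
                word = word + " " + tokens[i + 1][0]
                i += 1                      # consume the quantifier as well
            merged.append((word, "num"))
        else:
            merged.append((word, pos))
        i += 1
    return merged


def candidate_triples(tokens, window=2):
    """Build [entity, relation word, value] candidates centered on verbs."""
    toks = merge_numbers(tokens)
    triples = []
    for i, (word, pos) in enumerate(toks):
        if pos != "v":
            continue
        rel, end = word, i
        # Verb-plus-noun compound: absorb the noun when a number follows it.
        if i + 2 < len(toks) and toks[i + 1][1] == "n" and toks[i + 2][1] == "num":
            rel, end = word + " " + toks[i + 1][0], i + 1
        left = [w for w, p in toks[max(0, i - window):i] if p == "n"]
        right = [w for w, p in toks[end + 1:end + 1 + window] if p == "num"]
        if left and right:
            triples.append((left[-1], rel, right[0]))   # nearest entity on each side
    return triples
```

Applied to the two tagged sentences above, the sketch yields one triple each, with "complete passenger transportation volume" treated as a single relation word in the second case.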

Step 3: Extract the relation indicator from each candidate relation triple, count how often each indicator occurs, and rank the indicators programmatically by frequency. The system sets the threshold to 7 and filters out the candidate relation tuples whose indicator ranks beyond 7.
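The frequency filter can be sketched with `collections.Counter`. The sketch assumes that rank 1 is the most frequent indicator and that ties keep first-seen order; `filter_by_indicator_rank` is a hypothetical name.

```python
from collections import Counter


def filter_by_indicator_rank(triples, threshold=7):
    """Keep triples whose relation indicator is among the top `threshold` by frequency."""
    counts = Counter(rel for _, rel, _ in triples)
    # most_common() sorts by descending count; equal counts keep insertion order.
    kept = {rel for rel, _ in counts.most_common(threshold)}
    return [t for t in triples if t[1] in kept]
```

With the default threshold of 7, candidates whose indicator is not among the seven most frequent indicators are dropped.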

Step 4: According to the relative position of the relation indicator and the entity pair, there are three possibilities for the position of the relation indicator in the sentence: between the entity pair, to the right of the entity pair, or to the left of the entity pair. Noise in the candidate relation triples is filtered according to the expression characteristics of the bulletin text.
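The three positions can be distinguished from token indices alone, as in the sketch below. Which positions count as noise is not specified in the text, so the function only classifies; `indicator_position` is a hypothetical name.

```python
def indicator_position(e1_idx, rel_idx, e2_idx):
    """Classify the relation indicator as 'left', 'between' or 'right' of the entity pair."""
    low, high = sorted((e1_idx, e2_idx))
    if rel_idx < low:
        return "left"
    if rel_idx > high:
        return "right"
    return "between"
```

A filter built on top of this could, for instance, keep only the positions that the bulletin's writing style is known to use.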

Step 5: Based on the SRL results, analyze the relation between each sentence component and the predicate, and define three relation types A0, A1 and A2 to expand the relation triples. If A0 and A1 have a semantic relation with the same predicate, extract the relation triple [A0, Pred1, A1]. For example, in the sentence "the national television sales amount reaches 9 million this year", "reaches" is the predicate, "the national television sales amount" is the subject and "9 million" is the object, so the relation triple [national television sales amount, reaches, 9 million] is extracted. Similarly, if A1 and A2 have a semantic relation with the same predicate, extract the relation triple [A1, Pred2, A2].
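The SRL-based expansion can be sketched as follows, assuming the labeling result has already been converted into (predicate, role dictionary) pairs. That input shape and the name `srl_triples` are assumptions for illustration, not the output format of any particular SRL tool.

```python
def srl_triples(frames):
    """frames: list of (predicate, {"A0": text, "A1": text, "A2": text}) pairs."""
    triples = []
    for predicate, roles in frames:
        if "A0" in roles and "A1" in roles:
            triples.append((roles["A0"], predicate, roles["A1"]))
        if "A1" in roles and "A2" in roles:
            triples.append((roles["A1"], predicate, roles["A2"]))
    return triples
```

Both rules are applied per predicate, so a frame carrying all three roles contributes two triples.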

Step 6: based on the results of the POS and DP tagging, four types of syntactic features are defined to extend the relationship triplets, as described below.

SF1: subject-predicate-object structure, with the following logical expression

The relation triple is denoted as [E1, Pred, E2].

For example, tagging the sentence "the resident consumption price is increased by 2.8% more than the last year" gives "resident consumption price|n-SBV-increase|v-VOB-2.8%|n". According to the logical expression of this syntactic feature, the relation triple [resident consumption price, increase, 2.8%] can be extracted.
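A sketch of SF1 over dependency arcs is given below. It assumes each token is a (word, head index, label) tuple with 0-based head indices and -1 for the root, which is an illustrative convention rather than the parser's native output; `sf1_triples` is a hypothetical name.

```python
def sf1_triples(arcs):
    """arcs: list of (word, head_index, label); returns [E1, Pred, E2] triples."""
    triples = []
    for pred_idx, (pred, _, _) in enumerate(arcs):
        subjects = [w for w, h, lab in arcs if h == pred_idx and lab == "SBV"]
        objects = [w for w, h, lab in arcs if h == pred_idx and lab == "VOB"]
        for subj in subjects:
            for obj in objects:
                triples.append((subj, pred, obj))
    return triples
```

A word becomes a predicate in a triple only if it has both an SBV dependent and a VOB dependent.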

SF2: attributive (ATT) structure, with the following logical expression

The relation triple is denoted as [E1, AttWord, E2].

For example, tagging the sentence "financial institution bad loan balance 1986.2 million yuan" gives "financial institution bad loan|n-ATT-balance|n-ATT-1986.2 million yuan|n", from which the relation triple [financial institution bad loan, balance, 1986.2 million yuan] can be obtained based on this syntactic feature.
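SF2 can be sketched as a chain of two ATT arcs. The arc direction used here (E1 modifies AttWord, which in turn modifies E2) is an assumption read off the tagged example, and the (word, head index, label) format with -1 for the root is an illustrative convention.

```python
def sf2_triples(arcs):
    """arcs: list of (word, head_index, label); returns [E1, AttWord, E2] triples."""
    triples = []
    for word, head, label in arcs:
        if label != "ATT" or head < 0:
            continue
        att_word, att_head, att_label = arcs[head]
        # The head must itself be an attributive modifier of a third word.
        if att_label == "ATT" and att_head >= 0:
            triples.append((word, att_word, arcs[att_head][0]))
    return triples
```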

SF3: subject-predicate-object structure containing a preposition-object (POB) relation, with the following logical expression

The relation triple is denoted as [E1, Prep + E2 + Pred, E3].

For example, tagging the sentence "yield increased 20% over the last year." gives "yield|n-SBV-than|p-POB-last year|n-ADV-increase|v-VOB-20%|n", and the relation triple [yield, than last year increase, 20%] can be obtained based on this syntactic feature.
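A sketch of SF3 follows, assuming the preposition attaches to the predicate with an ADV arc and its object attaches to the preposition with a POB arc. The (word, head index, label) format with -1 for the root and the name `sf3_triples` are illustrative assumptions.

```python
def sf3_triples(arcs):
    """arcs: list of (word, head_index, label); returns [E1, Prep + E2 + Pred, E3] triples."""
    triples = []
    for pred_idx, (pred, _, _) in enumerate(arcs):
        subjects = [w for w, h, lab in arcs if h == pred_idx and lab == "SBV"]
        objects = [w for w, h, lab in arcs if h == pred_idx and lab == "VOB"]
        for prep_idx, (prep, head, label) in enumerate(arcs):
            if head != pred_idx or label != "ADV":
                continue
            prep_objects = [w for w, h, lab in arcs if h == prep_idx and lab == "POB"]
            for e2 in prep_objects:
                for subj in subjects:
                    for obj in objects:
                        triples.append((subj, f"{prep} {e2} {pred}", obj))
    return triples
```

The relation word is assembled in the order preposition, prepositional object, predicate, matching the [E1, Prep + E2 + Pred, E3] pattern.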

SF4: coordinate (COO) structure

The coordinate structure is divided into two cases according to where the COO tags are located; the content and matching rule of each case are described below.

Case 1: a COO-SBV-VOB-COO structure, with the following logical expression

The relation tuples are denoted as [E1, Pred, E3] & [E2, Pred, E4].

Case 2: an SBV-COO-VOB structure, with the following logical expression

The relation triple is denoted as [E1, Pred2, E2].

Coordinated verbs mainly describe several different actions performed by the same entity, and usually occur in coordinate sentences that contain them. For example, in the sentence "highway mileage reaches 7473.4 km, and 1026 km is newly added", the coordinated predicates "reaches|v" and "newly added|v" express that the entity "highway mileage|n" participates in two actions. The entity relation involving Pred1 (predicate 1) yields the relation triple [highway mileage, reaches, 7473.4 km] through syntactic feature SF1, but the entity relation involving Pred2 (predicate 2) has to be resolved with the second case of syntactic feature SF4. From the logical expression "highway mileage|n-SBV-reaches|v-COO-newly added|v-VOB-1026 km|n", a new relation triple [highway mileage, newly added, 1026 km] is obtained.

Step 7: Store the extracted information.
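The second case of SF4 can be sketched as follows: the subject of Pred1 is shared with any verb coordinated to Pred1, and that verb's own object completes the triple. The (word, head index, label) arc format with -1 for the root and the name `sf4_case2_triples` are illustrative assumptions.

```python
def sf4_case2_triples(arcs):
    """arcs: list of (word, head_index, label); returns [E1, Pred2, E2] triples."""
    triples = []
    for pred1_idx in range(len(arcs)):
        subjects = [w for w, h, lab in arcs if h == pred1_idx and lab == "SBV"]
        if not subjects:
            continue
        for pred2_idx, (pred2, head, label) in enumerate(arcs):
            if head != pred1_idx or label != "COO":
                continue
            # Objects attached to the coordinated verb, not to Pred1.
            objects = [w for w, h, lab in arcs if h == pred2_idx and lab == "VOB"]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj, pred2, obj))
    return triples
```

The triple for Pred1 itself is deliberately not produced here, since the text assigns it to SF1.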

(IV) Information output and display

Each extracted numerical relation tuple is stored in memory in list form and, to present the data more clearly, output as an Excel spreadsheet. The invention uses the Python Excel module XlsxWriter to output the data. The specific operations are as follows:

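(1) Import the extracted relation tuple list;

(2) Create an Excel workbook through the xlsxwriter.Workbook function, and create a sheet through the workbook.add_worksheet function;

(3) Customize a header and write it into the Excel sheet;

(4) Write the data into rows of the Excel sheet through the worksheet.write_row function.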
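The output steps can be sketched with XlsxWriter as follows. The header names, the sheet name and the helper name `export_tuples` are hypothetical, since the text does not list the columns; only `xlsxwriter.Workbook`, `add_worksheet`, `write_row` and `close` are library calls.

```python
import xlsxwriter


def export_tuples(tuples, path, header=("subject", "relation", "value")):
    """Write a header row followed by one relation tuple per row."""
    workbook = xlsxwriter.Workbook(path)
    worksheet = workbook.add_worksheet("tuples")
    worksheet.write_row(0, 0, header)            # row 0 holds the custom header
    for row_idx, relation_tuple in enumerate(tuples, start=1):
        worksheet.write_row(row_idx, 0, relation_tuple)
    workbook.close()                             # the file is written on close
```

A call such as `export_tuples([("highway mileage", "reaches", "7473.4 km")], "tuples.xlsx")` produces a one-sheet workbook with the header in the first row.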

Taking a text titled "2020 national economic development announcement" as an example, sentences in the text are taken as input and the output is a set of relation tuples. The input sentences and output results are as follows.

Some of the input sentences:

1. Rural e-commerce developed rapidly, achieving online retail sales of agricultural products of 360.3 hundred million yuan, an increase of 22.3% over the previous year.

2. Among the ten strong industries, the added values of new-generation information technology manufacturing, new energy materials, high-end equipment and the like increased by 14.5 percent and 19.6 percent respectively, which is 9.5 percent and 14.6 percent higher, in that order, than industries above designated size.

3. The operating pressure on enterprises eased: the cost per hundred yuan of business income of industrial enterprises above designated size was 86.5 yuan, 0.4 yuan lower than the previous year.

4. The traditional marine industry was transformed and upgraded: 10 national-level marine ranching demonstration areas were newly added, bringing the cumulative number to 54, which accounts for 39.7 percent of the national total.

Partial output results are shown in Table 1.

Table 1: Partial output results

Note: the time in the output results refers to the time mentioned in the sentence.

This embodiment extracts finer-grained numerical relations from large-scale unstructured bulletin text and meets users' needs for deeper and finer-grained information. It also lays a foundation for further information processing such as database queries and data analysis.

Based on the above method, the invention also provides a terminal for implementing the method for extracting numerical information from bulletin text, comprising: a memory for storing a computer program and the method for extracting numerical information from bulletin text; and a processor for executing the computer program and the method, so as to carry out the steps of the method for extracting numerical information from bulletin text.

A terminal may be implemented in various forms. For example, the terminal described in the embodiments of the present invention may include a mobile terminal such as a smart phone, a notebook computer, a Personal Digital Assistant (PDA), a tablet computer (PAD), and the like, and a fixed terminal such as a Digital TV, a desktop computer, and the like.

For the terminal implementing the method for extracting numerical information from bulletin text, the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of both. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

Those skilled in the art will appreciate that aspects of the method for extracting numerical information from bulletin text may be implemented as a system, method, or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects, all of which may generally be referred to herein as a "circuit," "module" or "system."

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for extracting numerical information in a bulletin text is characterized by comprising the following steps:

step two, traversing all sentences in the news bulletin webpage content, and judging whether adjacent sentences in the obtained webpage content need to be spliced or not according to a preset rule;

the extraction method comprises the following steps:

step (1): based on the result of part-of-speech tagging, selecting an entity pair with a distance not exceeding two entities to the left or right by using a verb and a verb plus noun combined word as a center, and combining to form a candidate relation triple;

step (2): extracting the relation indicator in each relation triple from the candidate relation triples extracted in the previous step, counting the occurrence times of the relation indicators, generating a sorting function according to the occurrence times of the relation indicators, and setting a threshold to filter out the candidate relation triples of which the relation indicator rank is greater than the threshold;

step (3): according to the relative position between the relation indicator and the entity pair, there are three possibilities for the position of the relation indicator in the sentence, namely between the entity pair, to the right of the entity pair, and to the left of the entity pair; noise in the candidate relation triples is filtered according to the expression characteristics of the bulletin text;

step (4): analyzing the relation between each component in the sentence and the predicate based on the result of semantic role labeling, defining three relation types A0, A1 and A2 to expand the relation triples, and directly obtaining a subject-predicate-object triple [A0, Pred1, A1] if A0 and A1 have a semantic relation with the same predicate;

step (5): defining four types of syntactic features to expand the relation triples based on the results of part-of-speech tagging and syntactic parsing;

2. The method for extracting numerical information in a bulletin text as claimed in claim 1,

the first step further comprises the following steps: disguising the crawler program by adding an appropriate request header;

the extracted content comprises a webpage text, and the webpage text is formed by splicing the contents of all text labels in the webpage.

3. The method for extracting numerical information in a bulletin text as claimed in claim 1,

the second step further comprises:

the method for judging whether to splice front and back adjacent sentences by taking punctuation marks as a basis specifically comprises the following conditions:

if a group of punctuation mark pairs exist in the sentence and the directions of the punctuation mark pairs are opposite, splicing the sentence with the adjacent front and back sentences;

and after the sentences are spliced, the obtained bulletin texts are segmented into a plurality of sentence sets according to the long sentence separators.

4. The method for extracting numerical information in a bulletin text as claimed in claim 3,

performing word segmentation, part-of-speech tagging, syntactic parsing, semantic role labeling and named entity recognition on the sentences in the set through a preset language model;

and extracting numerical values by taking sentences as units.

5. The method for extracting numerical information in a bulletin text as claimed in claim 1,

extracting valid sentences with numerical information in step three comprises:

(I) a numeral exists in the sentence;

(II) the sentence contains a self-defined numerical relation trigger word;

if the input sentence is identified as the situation (I), the sentence where the quantity word is located is reserved, then all sentences are input into a numerical value relation extraction system once, and the numerical value main body, the numerical value and the relation between the numerical value main body and the numerical value are identified;

6. The method for extracting numerical information in a bulletin text as claimed in claim 5,

if the numerical information is a null value, the numerical relationship cannot be obtained from the sentence, and the sentence does not belong to the valid sentence;

7. The method for extracting numerical information in a bulletin text as claimed in claim 1,

step four uses the Python Excel module XlsxWriter to output data;

the method specifically comprises the following steps:

【1】 Importing the extracted numerical value relation tuple list;

【2】 Creating an Excel workbook through the xlsxwriter.Workbook function, and creating a sheet through the workbook.add_worksheet function;

【3】 Customizing a header and writing the header into an excel table;

【4】 Writing the numerical relation tuples into rows of the Excel table through the worksheet.write_row function, and displaying them.

8. A terminal for implementing the method for extracting numerical information from bulletin text, characterized by comprising:

a memory for storing a computer program and the method for extracting numerical information from bulletin text;

a processor for executing the computer program and the method for extracting numerical information in a bulletin text to realize the steps of the method for extracting numerical information in a bulletin text as claimed in any one of claims 1 to 7.