CN113468890A - Sedimentology literature mining method based on NLP information extraction and part-of-speech rules - Google Patents
Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
- Publication number: CN113468890A
- Application number: CN202110818775.XA
- Authority: CN (China)
- Prior art keywords: text, representing, entity, download, time
- Prior art date: 2021-07-20
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/295: Named entity recognition (under G06F40/20 Natural language analysis, G06F40/279 Recognition of textual entities)
- G06F40/242: Dictionaries (under G06F40/237 Lexical tools)
- G06N3/044: Recurrent networks, e.g. Hopfield networks (under G06N3/02 Neural networks, G06N3/04 Architecture, e.g. interconnection topology)
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A sedimentology literature mining method based on NLP information extraction and part-of-speech rules comprises the following steps. Step 1: download the relevant files according to the lowest flow limit and the expected value of the expected download time. Step 2: recognize the text content using machine vision. Step 3: analyze the context segments of the document and acquire a user-defined dictionary list of multi-class entity keywords in the sentence text. Step 4: identify keywords of the same part of speech in the text according to their keyword types, using cosine similarity measurement analysis, to generate an unstructured multi-class text. Step 5: perform multi-path matching on the classified entities, record the entity label attributes, and generate a large-sample training data set. Step 6: for the large-sample data set generated in Step 5 and the document text to be recognized in Step 3, perform named entity recognition with a bidirectional long short-term memory neural network model combined with a conditional random field, so that the required entities are recognized and the entities in the text are screened out for storage.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a sedimentology literature mining method based on NLP information extraction and part-of-speech rules.
Background
Natural language processing is an interdisciplinary field that integrates linguistics, computer science, mathematics, and related areas. Natural language processing technology has gradually permeated many industries and is used for text data mining and information storage. Currently, a large number of enterprises and organizations use natural language processing techniques to screen out, in whole or in part, the valuable core hotspots from ever-growing data, in order to reduce retrieval time and improve information analysis capability. From the perspective of named entity recognition, data extensibility needs to be guaranteed while the analysis and understanding of unstructured text are satisfied. The amount of text data to be analyzed in the field of sedimentology keeps increasing; in the big-data era in particular, data mining requires learning from and analyzing massive labeled data sets. To cope with this growing analytical demand, the field of sedimentology requires domain experts to build large rule templates and dictionaries. At present, the field generally depends on manual annotation of text information, which consumes a large amount of time, affects data timeliness, and restricts the dynamic development of the information industry.
To meet the urgent need to save time and cost in the field of sedimentology, how to realize a text information mining method oriented to heterogeneous data sources, supported by natural language processing, has become a hot topic in both industry and academia. Named entity recognition makes it possible to screen the key information of a text. The development of part-of-speech analysis technology promotes efficient processing of text data. Part-of-speech analysis allows key information to be user-defined, which reduces the amount of interfering information in the data and the label noise produced by multi-path matching during data set generation. However, in the part-of-speech analysis process, besides the errors caused by reading characters from the source text, the noise conflicts caused by part-of-speech rules also need to be considered comprehensively. Therefore, an appropriate part-of-speech analysis technique needs to be designed to realize named entity recognition on text data.
Text information extraction techniques have been widely used for text data mining and storage. Information screening for different keyword hotspots can be realized through part-of-speech analysis. For example, the document "N. Piazza, Classification Between Machine Translated Text and Original Text By Part Of Speech Tagging Representation, 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, NSW, Australia, 2020, pp. 739-740" mainly uses word tags to create a frequency probability distribution model with BIO letters to reduce the use of data dictionaries. The document "F. Hussain, U. Qamar and S. Zeb, A Novel Approach for Searching Linear Synthesis of Partial Parts of Speech Tagging, 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Omaha, NE, USA, 2016, pp. 465-468, doi: 10.1109/WI.2016.0076" proposes a part-of-speech tagging method for open (short-text) tagging data that realizes information extraction of similar text sentences through synonyms. Current part-of-speech analysis ignores the identification of valuable information during data mining in professional domains. With the wide application of part-of-speech analysis, the amount of domain text data generating hotspots keeps increasing, which poses a technical challenge to the construction of domain data set labels. Therefore, a text mining method based on keywords and extensible parts of speech needs to be designed to realize dynamic extraction of text data.
Disclosure of Invention
The invention provides a sedimentology literature mining method based on NLP information extraction and part-of-speech rules, aimed at the increasingly prominent time cost of labeling data in the field of sedimentology, and suited to information acquisition from heterogeneous data.
In order to achieve the purpose, the invention adopts the following technical scheme:
A sedimentology literature mining method based on NLP information extraction and part-of-speech rules comprises the following steps:
Step 1: download files containing relevant sedimentology content from websites read in a distributed manner by RPA, according to the lowest flow limit and the expected value of the expected download time;
Step 2: identify the file downloaded in Step 1 with machine vision to obtain the geometric and text attributes of the content objects, judge the type of each content object with a heuristic algorithm to obtain the physical and logical structure of the document, and recognize the text content;
Step 3: analyze the context segments of the text content to obtain a dictionary list of user-defined multi-class entity keywords in the text content;
Step 4: using the text content obtained in Step 2 and the dictionary list obtained in Step 3, identify entity keywords of the same part of speech in the text content according to the entity keyword types by cosine similarity measurement analysis, and generate an unstructured multi-class text;
Step 5: perform multi-path matching on the classified entities using the unstructured multi-class text output in Step 4, record the entity label attributes, and generate a large-sample training data set;
Step 6: for the large-sample data set generated in Step 5, perform NER pre-training with BiLSTM combined with CRF to realize entity recognition on the readable text content of the document from Step 2, and screen the entity keywords according to the context.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, when a task requests a network service system to download a file, the network platform generates an access request record, the record comprises a local network IP address and expected downloading time, the communication system is accessed by utilizing the non-invasive characteristic of the RPA to provide cross access to related sedimentology research content hotspots, and multiple paths of IP addresses are copied to a server in a cross mode;
when the server receives a download request at any time interval, calculating the response time T of a single download task as follows:
T=tdeparture-tarrival;
in the formula, tdepartureRepresenting request arrival time, tarrivalRepresenting the request completion time, wherein an exponential random variable of a single download task response time T is e.r.v, e represents an expected value of expected download time under the single download task response time T, r represents the proportion of download flow of a server to the total bandwidth, and v represents the speed of a download hotspot;
when the download request application is successful and the download hotspot is cross-accessed, the download request is carried out within any period of time x being more than or equal to TimeN being less than or equal to y, and the response time of the download request is betanTherefore, the minimum response time β of the download request(x,y)The expression of (a) is:
in the formula, x is the lowest download time, y is the highest download time, and TimeN is any download time in the time period; expected value of expected download time E [ T ] under single download task response time T(r,t)]The expression of (a) is:
where β (T +1,1/r) represents the minimum response time from time T +1 to time 1/r, μ represents the response rate between different download request commands and the server, E [ T [ T ] ](r,t)]And e both represent expected values of expected download times at a single download task response time T;
selecting a value satisfying the expected value E [ T ](r,t)]The service IP address of (a) is downloaded in multiple ways.
Further, the specific procedure for recognizing the text content is as follows:
the document is identified with machine vision, where in the document (x_nn, y_nn) denotes the upper-left corner coordinates of a character and (x_nm, y_nm) its lower-right corner coordinates, and in the physical structure (x_mn, y_mn) denotes the upper-left corner coordinates and (x_mm, y_mm) the lower-right corner coordinates; the spatial overlap X_overlap between characters and the physically set threshold area Y_overlap are calculated as follows:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn));
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn));
where min(x_nm, x_mm) takes the smaller of the two lower-right x coordinates of the selected character box and the physical-structure box, max(x_nn, x_mn) takes the larger of the two upper-left x coordinates, and min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent shared by the two boxes; likewise, min(y_nm, y_mm) takes the smaller lower-right y coordinate of the two boxes, max(y_nn, y_mn) the larger upper-left y coordinate, and min(y_nm, y_mm) - max(y_nn, y_mn) gives the vertical extent shared by the boxes;
from X_overlap and Y_overlap the maximum character structured area S_overlap is obtained as:
S_overlap = X_overlap × Y_overlap;
because the character structured area S_overlap is much smaller than the physical-structure area, characters are clustered into words, words into sentences, and sentences into paragraphs by comparing the overlap ratio Ratio_overlap, so that the text content is read; Ratio_overlap is expressed in terms of x_nn and x_nm, the x-axis coordinates of the upper-left and lower-right corners of a character in the document, and y_nn and y_nm, the corresponding y-axis coordinates.
Further, the context segments of the text content are analyzed to obtain dictionary lists of user-defined multi-class entity keywords in the text sentences, which are integrated into a dictionary-list data set ER = {er_1, er_2, ..., er_N}, where er_1 is the entity keyword dictionary list of the first category, er_2 that of the second category, and er_N that of the N-th category;
each category record of the entity keyword dictionary list is a multi-attribute tuple, and the tuple of the entity keyword dictionary list of the n-th category in ER is represented as er_n = (tim_n, geo_n, nat_n, org_n, per_n), with 1 ≤ n ≤ N, where tim_n denotes the time at which the entity keyword exists, geo_n the geographic location where the entity keyword was discovered, nat_n the name of the entity keyword corresponding to er_n, org_n the organization that discovered the entity keyword, and per_n the discoverer of the entity keyword.
Further, the relation probability between the text content and the terms is calculated with a large text corpus, and words with the same meaning are set to the same part of speech;
cosine similarity measurement is used to mine the given entity keyword dictionary list in the large text corpus and determine the semantic proximity and word vectors of the entity keywords; er_n is represented in the word vector as vc_n = (B-vc_n, I-vc_n), where B-vc_n denotes the beginning position of the multi-attribute tuple in the word-vector space and I-vc_n its middle position, and regular expressions are used to expand the er_n attribute tuples with the English characters [A-Z] and the numeric characters [0-9];
the cosine similarity cos(θ) is calculated as
cos(θ) = Σ_{i=1}^{m} vc_i·wc_i / ( √(Σ_{i=1}^{m} vc_i²) · √(Σ_{i=1}^{m} wc_i²) ),
where vc_i denotes the i-th word-vector variable among the m word vectors and wc_i the i-th text-sentence phrase word-vector variable among the m text-sentence phrase word vectors; when cos(θ) equals 1, the word wc_i required by the text corpus can be found in the entity keyword dictionary list corresponding to vc_i, thereby realizing mining of the entity keyword dictionary list;
after the entity keyword dictionary list has been mined from the text corpus, the relevant content is extracted from the text content, generating the unstructured multi-class text.
Further, the text sentences of the unstructured multi-class text are classification-matched against er_n; if a text sentence cannot be matched with er_n, it is labeled O;
the text-sentence entity sets matched by all multi-class structure subsets of er_n are represented as ER_n = {B-er_n, O, I-er_n}, thereby generating a training data set with BIO labels.
Further, a CRF is set as the output layer of the BiLSTM; for each input label ER_n the corresponding output label PL_n is obtained, and the probability that the input ER_n is continuously correct is predicted as Score(ER_n, PL_n):
Score(ER_n, PL_n) = Σ_{i=1}^{R} P_(ER_i, PL_i) + Σ_{i=1}^{R} A_(PL_i, PL_i+1),
where R is the total number of labels in the training data set, P_(ER_i, PL_i) is the probability that the i-th input label ER_i yields the output PL_i, and A_(PL_i, PL_i+1) is the transition probability from PL_i to PL_i+1;
the continuously correct probabilities Score(ER_n, PL_n) of all input labels ER_n are determined, and the Viterbi algorithm is used to apply probability normalization P_(PL_n|ER_n) to the input labels ER_n and output labels PL_n, completing the training and mining of the text data, where the probability normalization is
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'_n} exp(Score(ER_n, PL'_n)),
where exp(Score(ER_n, PL_n)) denotes the exponential of the predicted continuously correct probability of the i-th input label ER_i, and the denominator covers the rate of obtaining a wrong output label for the i-th input label ER_i and the exponentials of the continuous probabilities of mispredicted input labels ER_i.
The invention has the beneficial effects that:
1: Text data is downloaded during the cross configuration of multiple IP addresses in the server, which better fits the practical minimum flow limit and the expected download time.
2: In the text content recognition process, a heuristic method is adopted to select the target text, which improves character recognition accuracy and makes the recognition target quicker and more convenient to find.
3: In the keyword part-of-speech analysis, part-of-speech rules are preferentially combined with the keyword dictionary list, which improves the overall mining efficiency for sedimentology keywords and reduces the time cost of manual labeling.
4: In mining the sedimentology data set, a bidirectional long short-term memory neural network is combined with a conditional random field, which improves the accuracy of the sedimentology literature mining strategy design and reduces the recognition noise caused by wrong labels in the data set.
Drawings
FIG. 1 is a flow chart of the overall process steps of the present invention.
FIG. 2 is the accuracy of the training-set test in the BiLSTM model combined with the conditional random field CRF according to the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
The invention discloses a sedimentology literature mining method based on NLP (Natural Language Processing) information extraction and part-of-speech rules, which comprises the following steps. Step 1: according to the lowest flow limit and the expected value of the expected download time, Robotic Process Automation (RPA) is used to read, in a distributed manner, research and conference files containing relevant sedimentology content from websites. Step 2: the files from Step 1 are identified with machine vision to obtain the geometric and text attributes of the content objects, and the type of each content object is judged by a heuristic algorithm to obtain the physical and logical structure of the document. Step 3: the context segments of the document are analyzed, and a user-defined dictionary list of multi-class entity keywords in the sentence text is acquired. Step 4: using the document file obtained in Step 2 and the keyword dictionary obtained in Step 3, keywords of the same part of speech in the text are identified according to the keyword types by cosine similarity measurement analysis, generating an unstructured multi-class text. Step 5: multi-path matching is performed on the classified entities using the unstructured multi-class text output in Step 4, the entity label attributes are recorded, and a large-sample training data set is generated. Step 6: for the large-sample data set generated in Step 5 and the document text to be recognized from Step 3, a bidirectional Long Short-Term Memory neural network model (BiLSTM) combined with a Conditional Random Field (CRF) is used for Named Entity Recognition (NER), so that the required entities are recognized and the entities in the text are screened out and stored.
The sedimentology literature mining method based on NLP information extraction and part-of-speech rules provided by the invention comprises the following steps, and the flow is shown in a figure 1-2:
step 1: and reading research and conference files containing relevant chemistry contents in the website by utilizing RPA distribution from the lowest flow limit MF and the expected value of the expected download time ET.
When a task requests the network service system to download files, the network platform generates an access request record that includes the local network address and the expected download time. Based on the availability codes of the existing network service system and the analysis of distributed storage and expected download time, the communication system can be accessed in a non-intrusive mode through RPA, cross access is performed on relevant depositional research content hotspots, and multiple IP addresses are cross-copied into the server; the multi-IP set is denoted LR = (lr1, lr2, …, lrN), where N is the number of cross IPs in LR.
When the server receives a download request in any time interval, the lowest flow limit is calculated by combining the size of the requested resource with the current degree of bandwidth congestion. As shown in formula (1), T is the response time of a single download task, t_departure denotes the request completion (departure) time, and t_arrival denotes the request arrival time. The lowest flow limit affects the arrival and completion times of the request, and the response time of a single download task is longest under the lowest flow limit; the lowest flow limit defaults to the broadband traffic that meets the lowest download requirement of the website file and can accommodate a download task request at any time point;
T = t_departure - t_arrival (1).
When the request application succeeds, cross access is performed on the download hotspot. T is assumed to be independent, and the response rate between different requests and the server is assumed to be μ; T is an exponential random variable (e.r.v.), namely T = r/v, where e denotes the expected value of the download time, r denotes the ratio of the download server's download traffic to the total bandwidth, and v denotes the download speed of the hotspot file. As shown in formula (2), for a download request within any period x ≤ TimeN ≤ y, the download response time is β_n, where x is the lowest download time and y is the highest download time. The lowest download time x is the minimum time required for the download request during the highest broadband service, and it satisfies server safety so that the download service is not stopped because the request time is too short; the highest download time y is the longest response time of the server to the download request, and server safety ensures that the download service is not suspended because the download time is too long and the service times out. TimeN is any download time within this period. Therefore the minimum response time of the request is β_(x,y), which, as formula (2) shows, integrates over the download speed from 0 to 1 between the lowest download time x and the highest download time y, where the minimum response time is found by taking the fastest speed as 1 (i.e., 100%) and the slowest as 0.
The expected download value is linearly inversely proportional to the download rate: when a high-frequency traffic signal is deployed, the download rate indicates that the service is fully loaded and the expected download value decreases along a curve, and β_(x,y) deduces an available cross IP address for the low-traffic state. S denotes the download time of an arbitrary random variable e.r.v., so S(r, v) ~ Exp(μ) is uniformly distributed in the system. The distribution value P is solved as in formula (3) for the event that T is larger than the download time S of the arbitrary random variable, and formula (4) gives the expression of the expected value E of the expected download time at time T; a service IP address satisfying the expected value is selected for multi-path downloading, which reduces the time required for downloading and downloads as many texts as possible per unit time. Here,
P{T_(r,t) > S} = exp(-μs)(1 - (1 - exp(-μs))^r)^t (3);
where P{T_(r,t) > S} denotes the probability that the response time T of a single download task, related to the server download-traffic ratio r and the download time t in the distribution value P, is larger than the download time S of the arbitrary random variable, and exp(-μs) is the exponential of the negative product of the random-variable download time s and the assumed rate μ; β(t+1, 1/r) denotes the minimum response time from time t+1 to time 1/r, μ denotes the response rate between different download request instructions and the server, and E[T_(r,t)] and e both denote the expected value of the expected download time under the response time T of a single download task.
The meaning of the expected download time can be illustrated as follows: if buses depart at 20-minute intervals and the departure time is uniformly distributed over [0, 20], the expected waiting time is 10 minutes. Similarly, the expected time decreases as the number of IP addresses increases. The expected value estimates the distributed IP value; the actual time may be smaller than, or slightly larger than, the expected value, but it is neither infinite nor infinitesimal, and it satisfies both the shortest download time and the longest delay-stop time of the target website server, otherwise the download cannot proceed.
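To make the role of the expected download time concrete, the following is a minimal Python sketch (not part of the patent) that treats the single-task response time as an exponential random variable with rate μ, estimates an expected download time by simulation, and keeps only the candidate IP addresses whose estimate satisfies a target expected value. The scaling by the traffic ratio r, the IP addresses and the rate values are illustrative assumptions; the closed-form expressions (3) and (4) are not reproduced here.

```python
import numpy as np

def expected_download_time(mu: float, r: float, n_samples: int = 100_000) -> float:
    """Estimate E[T] for a single download task whose response time is exponentially
    distributed with rate mu, scaled by the traffic ratio r (illustrative model)."""
    rng = np.random.default_rng(0)
    samples = rng.exponential(scale=1.0 / mu, size=n_samples)  # raw response times
    return float(np.mean(samples) * r)                         # scale by traffic share

def select_ip_addresses(ip_rates: dict, expected_value: float, r: float) -> list:
    """Keep the IP addresses whose estimated expected download time
    does not exceed the target expected value."""
    return [ip for ip, mu in ip_rates.items()
            if expected_download_time(mu, r) <= expected_value]

# Hypothetical usage: three candidate server IPs with different response rates mu.
candidates = {"10.0.0.1": 2.0, "10.0.0.2": 0.5, "10.0.0.3": 1.2}
print(select_ip_addresses(candidates, expected_value=1.0, r=0.8))
```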
Step 2: identify the downloaded standard file with machine vision to obtain the geometric and text attributes of the tables and pictures in the text, judge the type of each content object with a heuristic algorithm to obtain the physical and logical structure of the document, and recognize the text of the standard file.
First, machine vision is used (machine vision does not require setting the distance between physical structures as a parameter). As shown in formula (5), (x_nn, y_nn) denotes the upper-left corner coordinates of a character in the document and (x_nm, y_nm) its lower-right corner coordinates; as shown in formula (6), (x_mn, y_mn) and (x_mm, y_mm) denote the upper-left and lower-right corner coordinates in the physical structure, respectively. The spatial overlap X_overlap between characters and the physically set threshold area Y_overlap are calculated:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn)) (5);
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn)) (6);
A character is treated as a rectangular box: min(x_nm, x_mm) takes the smaller of the two lower-right x coordinates of the character box and the physical-structure box, max(x_nn, x_mn) the larger of the two upper-left x coordinates, and their difference min(x_nm, x_mm) - max(x_nn, x_mn) is the horizontal extent shared by the two boxes, from which the spatial overlap of the character is obtained. Likewise, min(y_nm, y_mm) takes the smaller lower-right y coordinate, max(y_nn, y_mn) the larger upper-left y coordinate, and min(y_nm, y_mm) - max(y_nn, y_mn) is the vertical extent shared by the boxes, from which the area of the character's physical-structure frame is obtained.
Then, the maximum character structured area S_overlap is calculated from X_overlap and Y_overlap, as shown in formula (7):
S_overlap = X_overlap × Y_overlap (7);
Finally, because the character structured area S_overlap is much smaller than the physical-structure area, characters are clustered by comparing the overlap ratio Ratio_overlap, so that the text content can be read, as shown in formula (8), which is expressed in terms of x_nn and x_nm, the x-axis coordinates of the upper-left and lower-right corners of a character in the document, and y_nn and y_nm, the corresponding y-axis coordinates.
and step 3: and acquiring a self-defined multi-class entity keyword dictionary list in the sentence text according to the document context language segment.
Let the multi-class entity keyword dictionary-list data set be denoted ER; it is a set of records of entity categories, ER = {er_1, er_2, ..., er_N}, where N denotes the number of entity keyword dictionary lists in ER;
each record of an entity category is a multi-attribute tuple, and the n-th tuple (1 ≤ n ≤ N) in ER is represented as er_n = (tim_n, geo_n, nat_n, org_n, per_n), where tim_n denotes the time at which the entity exists, geo_n the geographic location where the entity was discovered, nat_n the name of the entity corresponding to er_n, org_n the organization that discovered the entity, and per_n the discoverer of the entity.
Step 4: identify the keywords of the same part of speech in the text according to the keyword types by cosine similarity measurement analysis, and generate the unstructured multi-class text.
A large text corpus is used to compute the relation probability between documents and terms; words with the same meaning produce similar text, i.e., the same kind of part of speech. Then, cosine similarity measurement is used to mine the given dictionary document against the database and determine semantic proximity and word vectors. er_n is expressed in the word-vector space as vc_n = (B-vc_n, I-vc_n), where B-vc_n denotes the beginning position of the multi-attribute tuple in the word-vector space and I-vc_n its middle position, and regular expressions are used to expand the er_n attribute tuples with the English characters [A-Z] and the numeric characters [0-9]. In these expressions, '?' matches the preceding expression at most once, '*' matches the preceding expression any number of times, '^' matches from the current position, and '$' matches the preceding expression at the end of the string. For example, the time part of speech VB has three expressions, as shown in formulas (9)-(11):
VB1 = r'^~?[0-9]' + r'^[A-Z].*$' (9);
VB2 = r'^±' + r'^~?[0-9]+(.[0-9]+)?$' + r'.*' (10);
VB3 = r'^~?[0-9]+(.[0-9]+)?$' + r'and$' + r'.*' (11);
The process of formula (9) is as follows:
First step: '~?' matches a sentence that contains the ~ symbol or one that does not.
Second step: if the first step is satisfied, [0-9] matches any digit between 0 and 9.
Third step: if the second step is satisfied, [A-Z] matches any letter between A and Z.
Fourth step: if the third step is satisfied, '.*$' matches the letter of the third step any number of times up to the end of the sentence, e.g. ~9 Ma or 9 Ma.
The process of formula (10) is as follows:
First step: '^±' matches a sentence that contains the ± symbol.
Second step: if the first step is satisfied, '~?' matches a sentence with or without the ~ symbol.
Third step: if the second step is satisfied, [0-9] matches any digit between 0 and 9.
Fourth step: if the third step is satisfied, '(.[0-9]+)?' matches the decimal point first and then any digits between 0 and 9.
Fifth step: if the fourth step is satisfied, '.*' matches the previous step any number of times, e.g. ±9.38 or ±9.1.
The process of formula (11) is as follows:
First step: '~?' matches a sentence that contains the ~ symbol or one that does not.
Second step: if the first step is satisfied, [0-9] matches any digit between 0 and 9.
Third step: if the second step is satisfied, '(.[0-9]+)?' matches the decimal point first and then any digits between 0 and 9.
Fourth step: if the third step is satisfied, 'and$' matches a sentence containing "and".
Fifth step: if the fourth step is satisfied, '.*' matches the previous step any number of times, e.g. ~1 and 2 or 1.5 and 1.68.
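The following sketch shows how time expressions of this kind could be matched in Python. The concatenated forms of formulas (9)-(11) are kept verbatim above; the single regular expressions below are an interpretation of them for illustration, and the sample strings are hypothetical.

```python
import re

# Interpreted, single-pattern versions of the time part-of-speech expressions
# VB1-VB3 from formulas (9)-(11); these rewritten patterns are an assumption.
VB1 = re.compile(r'^~?[0-9]+(\.[0-9]+)?\s*[A-Z].*$')          # e.g. "~9 Ma", "9 Ma"
VB2 = re.compile(r'^±~?[0-9]+(\.[0-9]+)?$')                    # e.g. "±9.38", "±9.1"
VB3 = re.compile(r'^~?[0-9]+(\.[0-9]+)?\s*and\s*~?[0-9]+(\.[0-9]+)?$')  # e.g. "1.5 and 1.68"

samples = ["~9 Ma", "9 Ma", "±9.38", "1.5 and 1.68", "sandstone"]
for s in samples:
    tags = [name for name, pat in (("VB1", VB1), ("VB2", VB2), ("VB3", VB3)) if pat.match(s)]
    print(s, "->", tags or ["no time tag"])
```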
The highest cosine similarity value cos(θ) relates the components vc_i (1 ≤ i ≤ m) of vc_n to the components wc_i (1 ≤ i ≤ m) of the phrase word-vector attribute wc_n of each sentence of the text, where vc_i denotes the i-th word-vector variable among the m word vectors and wc_i denotes the i-th text-sentence phrase word-vector variable among the m text-sentence phrase word vectors, as shown in formula (12):
cos(θ) = Σ_{i=1}^{m} vc_i·wc_i / ( √(Σ_{i=1}^{m} vc_i²) · √(Σ_{i=1}^{m} wc_i²) ) (12);
when cos(θ) equals 1, the variables point to the same word-vector space; pointing to the same space means that the given dictionary document is successfully matched with the database and the word required by the database is found in the document;
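A small sketch of the cosine-similarity matching between dictionary word vectors vc_i and sentence-phrase word vectors wc_i; the three-dimensional toy embeddings and the near-1 threshold standing in for cos(θ) = 1 are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(vc: np.ndarray, wc: np.ndarray) -> float:
    """cos(theta) between a dictionary-entry vector vc and a sentence-phrase vector wc."""
    denom = np.linalg.norm(vc) * np.linalg.norm(wc)
    return float(np.dot(vc, wc) / denom) if denom else 0.0

def match_keywords(dictionary_vectors: dict, phrase_vector: np.ndarray, threshold: float = 0.99) -> list:
    """Return dictionary keywords whose vectors are (near-)collinear with the phrase vector;
    the patent treats cos(theta) = 1 as a match, so a high threshold stands in for it."""
    return [word for word, vec in dictionary_vectors.items()
            if cosine_similarity(vec, phrase_vector) >= threshold]

# Hypothetical 3-dimensional toy embeddings.
dictionary = {"mudstone": np.array([1.0, 0.0, 0.0]), "Cretaceous": np.array([0.0, 1.0, 0.0])}
print(match_keywords(dictionary, np.array([2.0, 0.0, 0.0])))  # ['mudstone']
```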
and 5: and outputting and generating an unstructured multi-classification text through cos (theta), traversing the multi-classification text to perform multi-path matching on classification entities respectively, recording entity label attributes, and generating a large sample training data set.
The text sentences are classification-matched against er_n; if a text sentence cannot be matched with er_n, it is labeled O. The sentence entity sets corresponding to all multi-class structure subsets of er_n (here a multi-class structure subset has the same meaning as the set of entity keyword dictionary lists of multiple categories above) are denoted ER_n = {B-er_n, O, I-er_n}, generating training-set data with BIO labels; B-er_n marks the beginning of a sentence entity, and I-er_n marks each remaining part of the entity after the beginning. The labels of the generated data set are B-tim, B-geo, B-nat, O, I-tim, I-geo and I-nat; B-er_n comprises B-tim, B-geo and B-nat, and I-er_n comprises I-tim, I-geo and I-nat.
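A minimal sketch of the BIO labeling produced by multi-path matching against the keyword dictionary; the dictionary contents and the example sentence are hypothetical.

```python
def bio_label(tokens: list, keyword_dict: dict) -> list:
    """Assign BIO labels to a token list using a keyword dictionary that maps
    entity phrases to their category suffix (tim / geo / nat); tokens outside
    any matched phrase get O."""
    labels = ["O"] * len(tokens)
    for phrase, cat in keyword_dict.items():
        words = phrase.split()
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] == words:
                labels[start] = f"B-{cat}"
                for k in range(start + 1, start + len(words)):
                    labels[k] = f"I-{cat}"
    return labels

tokens = "The Sichuan Basin mudstone formed 145 Ma ago".split()
keywords = {"Sichuan Basin": "geo", "mudstone": "nat", "145 Ma": "tim"}
print(list(zip(tokens, bio_label(tokens, keywords))))
```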
Step 6: perform NER pre-training with BiLSTM combined with CRF to recognize the sedimentology entities, so that valuable hotspots are screened out according to the context information.
First, the transition matrix in the CRF is used to avoid multiple consecutive B-er_n labels, with the CRF serving as the output layer of the BiLSTM. As shown in formula (13), for each input ER_n the corresponding predicted label PL_n is finally obtained, and the probability that the input ER_n is continuously correct is predicted as Score(ER_n, PL_n) (for example, if the input label is O, the corresponding output label is obtained and the probability that consecutive inputs are O is predicted); P_(ER_i, PL_i) is the probability, between 0 and 1, that the output at the i-th position is PL_i, and A_(PL_i, PL_i+1) is the transition probability from PL_i to PL_i+1:
Score(ER_n, PL_n) = Σ_{i=1}^{R} P_(ER_i, PL_i) + Σ_{i=1}^{R} A_(PL_i, PL_i+1) (13);
where R is the total number of labels in the training data set. Consecutive inputs of O or of I-er_n are correct, while B-er_n is incorrect if it occurs three times consecutively. The transition probability covers jumps such as going directly from O to I-er_n, which is an error under the BIO labels above, and normalizing I-er_n to O is a similar error.
Then, for each ER_n, the Score(ER_n, PL_n) of all PL_n is found, and the Viterbi algorithm is used to apply probability normalization P_(PL_n|ER_n) to the input and output labels, so that the text data is mined, as shown in formula (14):
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'_n} exp(Score(ER_n, PL'_n)) (14);
where exp(Score(ER_n, PL_n)) in the numerator is the exponential of the predicted continuously correct probability of the i-th input label ER_i, and the sum in the denominator additionally covers the exponentials of the continuous probabilities of mispredicted input labels ER_i, i.e. the rate at which a wrong output label is obtained for the i-th input label ER_i (for example, if the original ER_n obtains the corresponding output label PL_i with a probability of 0.7 of being correct, then it is incorrect with a probability of 0.3).
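As a toy illustration of formulas (13) and (14), the sketch below scores a label sequence as the sum of emission and transition scores and normalizes it over all candidate sequences by brute force (standing in for the Viterbi/normalization step); the emission and transition values are made-up numbers, not trained BiLSTM-CRF parameters.

```python
import itertools
import numpy as np

# Toy emission (P) and transition (A) scores for labels {O, B-nat, I-nat},
# standing in for BiLSTM outputs and the CRF transition matrix (values are made up).
LABELS = ["O", "B-nat", "I-nat"]
P = np.array([[0.8, 0.1, 0.1],    # token 1
              [0.2, 0.7, 0.1],    # token 2
              [0.1, 0.2, 0.7]])   # token 3
A = np.array([[0.5, 0.4, 0.1],    # transitions from O
              [0.1, 0.1, 0.8],    # from B-nat
              [0.3, 0.1, 0.6]])   # from I-nat

def score(path: tuple) -> float:
    """Score(ER_n, PL_n) = sum of emission scores + sum of transition scores (formula 13)."""
    emit = sum(P[i, lab] for i, lab in enumerate(path))
    trans = sum(A[path[i], path[i + 1]] for i in range(len(path) - 1))
    return emit + trans

def normalized_probability(path: tuple) -> float:
    """P(PL_n | ER_n): softmax of the path score over all candidate label sequences (formula 14)."""
    all_paths = list(itertools.product(range(len(LABELS)), repeat=P.shape[0]))
    z = sum(np.exp(score(p)) for p in all_paths)
    return float(np.exp(score(path)) / z)

best = max(itertools.product(range(len(LABELS)), repeat=P.shape[0]), key=score)  # brute-force decode
print([LABELS[i] for i in best], normalized_probability(best))
```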
The idea of the invention is as follows: first, according to the lowest flow limit and the expected value of the expected download time, research and conference files containing relevant depositional content are read from websites in a distributed manner using robotic process automation; then, with machine vision, the geometric and text attributes of the studied content objects are acquired, and a heuristic algorithm judges the type of each content object to obtain the physical and logical structure of the document; further, the context segments of the document are analyzed and a user-defined dictionary list of multi-class entity keywords in the sentence text is acquired; on this basis, cosine similarity measurement analysis identifies keywords of the same part of speech in the text according to the keyword types and generates an unstructured multi-class text, which is then output; next, for the unstructured text, multi-path matching is performed on the classified entities, the entity label attributes are recorded, and a large-sample training data set is generated; finally, on the large-sample data set, named entities are recognized using a bidirectional long short-term memory neural network model combined with a conditional random field, so that the required entities are recognized and the entities in the text are screened out for storage.
Example: this example selects depositional text-matching data as the input data set for the experiment and uses TensorFlow as the simulation platform.
The parameters involved in the experimental environment are shown in table 1.
Table 1 parameter settings involved in the execution of the method
Experimental parameter | Value
Beginning of substance | B-nat
Middle of substance | I-nat
Beginning of time | B-tim
Middle of time | I-tim
Beginning of location | B-geo
Middle of location | I-geo
Others | O
Number of data set records | 274292
FIG. 2 shows the accuracy of the training-set test in the BiLSTM model combined with the conditional random field CRF according to the present invention.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive change to the technical content, are also to be regarded as within the implementable scope of the invention.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.
Claims (7)
1. A sedimentology literature mining method based on NLP information extraction and part-of-speech rules, characterized by comprising the following steps:
Step 1: downloading files containing relevant sedimentology content from websites read in a distributed manner by RPA, according to the lowest flow limit and the expected value of the expected download time;
Step 2: identifying the file downloaded in Step 1 with machine vision to obtain the geometric and text attributes of the content objects, judging the type of each content object with a heuristic algorithm to obtain the physical and logical structure of the document, and recognizing the text content;
Step 3: analyzing the context segments of the text content to obtain a dictionary list of user-defined multi-class entity keywords in the text content;
Step 4: using the text content obtained in Step 2 and the dictionary list obtained in Step 3, identifying entity keywords of the same part of speech in the text content according to the entity keyword types by cosine similarity measurement analysis, and generating an unstructured multi-class text;
Step 5: performing multi-path matching on the classified entities using the unstructured multi-class text output in Step 4, recording the entity label attributes, and generating a large-sample training data set;
Step 6: for the large-sample data set generated in Step 5, performing NER pre-training with BiLSTM combined with CRF to realize entity recognition on the readable text content of the document from Step 2, and screening the entity keywords according to the context.
2. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, characterized in that:
when a task requests the network service system to download files, the network platform generates an access request record that includes the local network IP address and the expected download time; the communication system is accessed using the non-intrusive characteristic of RPA to provide cross access to relevant sedimentology research content hotspots, and multiple IP addresses are cross-copied to the server;
when the server receives a download request in any time interval, the response time T of a single download task is calculated as:
T = t_departure - t_arrival;
where t_departure denotes the request completion (departure) time and t_arrival denotes the request arrival time; the response time T of a single download task is an exponential random variable (e.r.v.), where e denotes the expected value of the expected download time under the single-task response time T, r denotes the ratio of the server's download traffic to the total bandwidth, and v denotes the download speed of the hotspot;
when the download request application succeeds and the download hotspot is cross-accessed, the download request is issued within any period x ≤ TimeN ≤ y and its response time is β_n; the minimum response time of the download request is therefore β_(x,y), where x is the lowest download time, y is the highest download time, and TimeN is any download time within the period;
the expected value E[T_(r,t)] of the expected download time under the single-task response time T is expressed through β(t+1, 1/r), the minimum response time from time t+1 to time 1/r, and μ, the response rate between the different download request instructions and the server, where E[T_(r,t)] and e both denote the expected value of the expected download time under the single-task response time T;
a service IP address satisfying the expected value E[T_(r,t)] is selected for multi-path downloading.
3. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, characterized in that the specific procedure for recognizing the text content is:
the document is identified with machine vision, where in the document (x_nn, y_nn) denotes the upper-left corner coordinates of a character and (x_nm, y_nm) its lower-right corner coordinates, and in the physical structure (x_mn, y_mn) denotes the upper-left corner coordinates and (x_mm, y_mm) the lower-right corner coordinates; the spatial overlap X_overlap between characters and the physically set threshold area Y_overlap are calculated as follows:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn));
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn));
where min(x_nm, x_mm) takes the smaller of the two lower-right x coordinates of the selected character box and the physical-structure box, max(x_nn, x_mn) takes the larger of the two upper-left x coordinates, and min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent shared by the two boxes; likewise, min(y_nm, y_mm) takes the smaller lower-right y coordinate of the two boxes, max(y_nn, y_mn) the larger upper-left y coordinate, and min(y_nm, y_mm) - max(y_nn, y_mn) gives the vertical extent shared by the boxes;
from X_overlap and Y_overlap the maximum character structured area S_overlap is obtained as:
S_overlap = X_overlap × Y_overlap;
because the character structured area S_overlap is much smaller than the physical-structure area, characters are clustered into words, words into sentences, and sentences into paragraphs by comparing the overlap ratio Ratio_overlap, so that the text content is read, where Ratio_overlap is expressed in terms of x_nn and x_nm, the x-axis coordinates of the upper-left and lower-right corners of a character in the document, and y_nn and y_nm, the corresponding y-axis coordinates.
4. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, characterized in that:
the context segments of the text content are analyzed to obtain dictionary lists of user-defined multi-class entity keywords in the text sentences, which are integrated into a dictionary-list data set ER = {er_1, er_2, ..., er_N}, where er_1 is the entity keyword dictionary list of the first category, er_2 that of the second category, and er_N that of the N-th category;
each category record of the entity keyword dictionary list is a multi-attribute tuple, and the tuple of the entity keyword dictionary list of the n-th category in ER is represented as er_n = (tim_n, geo_n, nat_n, org_n, per_n), with 1 ≤ n ≤ N, where tim_n denotes the time at which the entity keyword exists, geo_n the geographic location where the entity keyword was discovered, nat_n the name of the entity keyword corresponding to er_n, org_n the organization that discovered the entity keyword, and per_n the discoverer of the entity keyword.
5. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 4, characterized in that:
the relation probability between the text content and the terms is calculated with a large text corpus, and words with the same meaning are set to the same part of speech;
cosine similarity measurement is used to mine the given entity keyword dictionary list in the large text corpus and determine the semantic proximity and word vectors of the entity keywords; er_n is represented in the word vector as vc_n = (B-vc_n, I-vc_n), where B-vc_n denotes the beginning position of the multi-attribute tuple in the word-vector space and I-vc_n its middle position, and regular expressions are used to expand the er_n attribute tuples with the English characters [A-Z] and the numeric characters [0-9];
the cosine similarity cos(θ) is calculated as
cos(θ) = Σ_{i=1}^{m} vc_i·wc_i / ( √(Σ_{i=1}^{m} vc_i²) · √(Σ_{i=1}^{m} wc_i²) ),
where vc_i denotes the i-th word-vector variable among the m word vectors and wc_i the i-th text-sentence phrase word-vector variable among the m text-sentence phrase word vectors; when cos(θ) equals 1, the word wc_i required by the text corpus can be found in the entity keyword dictionary list corresponding to vc_i, thereby realizing mining of the entity keyword dictionary list;
after the entity keyword dictionary list has been mined from the text corpus, the relevant content is extracted from the text content, generating the unstructured multi-class text.
6. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 5, characterized in that
the text sentences of the unstructured multi-class text are classification-matched against er_n, and if a text sentence cannot be matched with er_n it is labeled O;
the text-sentence entity sets matched by all multi-class structure subsets of er_n are represented as ER_n = {B-er_n, O, I-er_n}, thereby generating a training data set with BIO labels.
7. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 6, characterized in that
a CRF is set as the output layer of the BiLSTM; for each input label ER_n the corresponding output label PL_n is obtained, and the probability that the input ER_n is continuously correct is predicted as Score(ER_n, PL_n):
Score(ER_n, PL_n) = Σ_{i=1}^{R} P_(ER_i, PL_i) + Σ_{i=1}^{R} A_(PL_i, PL_i+1),
where R is the total number of labels in the training data set, P_(ER_i, PL_i) is the probability that the i-th input label ER_i yields the output PL_i, and A_(PL_i, PL_i+1) is the transition probability from PL_i to PL_i+1;
the continuously correct probabilities Score(ER_n, PL_n) of all input labels ER_n are determined, and the Viterbi algorithm is used to apply probability normalization P_(PL_n|ER_n) to the input labels ER_n and output labels PL_n, completing the training and mining of the text data, where the probability normalization is
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'_n} exp(Score(ER_n, PL'_n)),
where exp(Score(ER_n, PL_n)) denotes the exponential of the predicted continuously correct probability of the i-th input label ER_i, and the denominator covers the rate of obtaining a wrong output label for the i-th input label ER_i and the exponentials of the continuous probabilities of mispredicted input labels ER_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110818775.XA CN113468890B (en) | 2021-07-20 | 2021-07-20 | Sedimentology literature mining method based on NLP information extraction and part-of-speech rules |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113468890A (en) | 2021-10-01
CN113468890B (en) | 2023-05-26
Family
ID=77881608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110818775.XA Active CN113468890B (en) | 2021-07-20 | 2021-07-20 | Sedimentology literature mining method based on NLP information extraction and part-of-speech rules |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113468890B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20200044176A (en) * | 2018-10-05 | 2020-04-29 | 동아대학교 산학협력단 | System and Method for Korean POS Tagging Using the Concatenation of Jamo and Syllable Embedding
CN109672613A (en) * | 2018-12-12 | 2019-04-23 | 北京数码视讯软件技术发展有限公司 | Adaptive access method, apparatus and electronic equipment |
CN109558569A (en) * | 2018-12-14 | 2019-04-02 | 昆明理工大学 | Laotian part-of-speech tagging method based on BiLSTM+CRF model
CN111950287A (en) * | 2020-08-20 | 2020-11-17 | 广东工业大学 | Text-based entity identification method and related device |
CN112417880A (en) * | 2020-11-30 | 2021-02-26 | 太极计算机股份有限公司 | Court electronic file oriented case information automatic extraction method |
CN112632228A (en) * | 2020-12-30 | 2021-04-09 | 深圳供电局有限公司 | Text mining-based auxiliary bid evaluation method and system |
CN112817561A (en) * | 2021-02-02 | 2021-05-18 | 山东省计算中心(国家超级计算济南中心) | Structured extraction method and system for transaction function points of software requirement document |
CN112801010A (en) * | 2021-02-07 | 2021-05-14 | 华南理工大学 | Visual rich document information extraction method for actual OCR scene |
Non-Patent Citations (2)
Title |
---|
K.E. Ravikumar: "BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences", Database *
刘炜 et al.: "An automatic text corpus annotation method for emergency events", Journal of Chinese Information Processing (中文信息学报) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114117061A (en) * | 2021-10-27 | 2022-03-01 | 南京信息工程大学 | River facies knowledge graph reverse-deducing method based on data mining and tree structure |
CN114625885A (en) * | 2022-03-07 | 2022-06-14 | 南京信息工程大学 | Entity dependency extraction and identification method, system and device based on NLP and trigger and storage medium |
CN114625885B (en) * | 2022-03-07 | 2024-10-18 | 南京信息工程大学 | Entity affiliation extraction and identification method, system, device and storage medium based on NLP and trigger |
CN117076703A (en) * | 2023-10-11 | 2023-11-17 | 中邮消费金融有限公司 | Automatic card structured information extraction technical method and system |
CN117076703B (en) * | 2023-10-11 | 2024-02-06 | 中邮消费金融有限公司 | Automatic card structured information extraction technical method |
Also Published As
Publication number | Publication date |
---|---|
CN113468890B (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298033B (en) | Keyword corpus labeling training extraction system | |
CN109858010B (en) | Method and device for recognizing new words in field, computer equipment and storage medium | |
CN107992585B (en) | Universal label mining method, device, server and medium | |
US8156053B2 (en) | Automated tagging of documents | |
CN110727779A (en) | Question-answering method and system based on multi-model fusion | |
CN112100356A (en) | Knowledge base question-answer entity linking method and system based on similarity | |
CN113468890B (en) | Sedimentology literature mining method based on NLP information extraction and part-of-speech rules | |
CN110162771B (en) | Event trigger word recognition method and device and electronic equipment | |
CN112101040A (en) | Ancient poetry semantic retrieval method based on knowledge graph | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
CN110728151B (en) | Information depth processing method and system based on visual characteristics | |
CN112818093A (en) | Evidence document retrieval method, system and storage medium based on semantic matching | |
KR20220134695A (en) | System for author identification using artificial intelligence learning model and a method thereof | |
CN111274822A (en) | Semantic matching method, device, equipment and storage medium | |
CN107357765A (en) | Word document flaking method and device | |
CN113961666A (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN113947086A (en) | Sample data generation method, training method, corpus generation method and apparatus | |
CN114840685A (en) | Emergency plan knowledge graph construction method | |
CN118132719A (en) | Intelligent dialogue method and system based on natural language processing | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
CN113946668A (en) | Semantic processing method, system and device based on edge node and storage medium | |
CN113656429A (en) | Keyword extraction method and device, computer equipment and storage medium | |
CN116955534A (en) | Intelligent complaint work order processing method, intelligent complaint work order processing device, intelligent complaint work order processing equipment and storage medium | |
WO2023083176A1 (en) | Sample processing method and device and computer readable storage medium | |
Eswaraiah et al. | A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||