CN113468890A - Sedimentology literature mining method based on NLP information extraction and part-of-speech rules - Google Patents
Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
- Publication number: CN113468890A
- Application number: CN202110818775.XA
- Authority: CN (China)
- Prior art keywords: text, representing, entity, download, time
- Prior art date: 2021-07-20
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F40/295: Named entity recognition (under G06F40/20 Natural language analysis, G06F40/279 Recognition of textual entities)
- G06F40/242: Dictionaries (under G06F40/237 Lexical tools)
- G06N3/044: Recurrent networks, e.g. Hopfield networks (under G06N3/02 Neural networks, G06N3/04 Architecture, e.g. interconnection topology)
- Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A sedimentology literature mining method based on NLP information extraction and part-of-speech rules comprises the following steps. Step 1: download the relevant files according to the lowest flow limit and the expected value of the expected download time. Step 2: recognize the text content using machine vision. Step 3: analyze the context segments of the document and acquire a user-defined dictionary list of multi-class entity keywords in the sentence text. Step 4: identify keywords of the same part of speech in the text according to their keyword types, using cosine similarity measurement analysis, to generate an unstructured multi-class text. Step 5: perform multi-path matching on the classified entities, record the entity label attributes, and generate a large-sample training data set. Step 6: for the large-sample data set generated in Step 5 and the document text to be recognized in Step 3, perform named entity recognition with a bidirectional long short-term memory neural network model combined with a conditional random field, so that the required entities are recognized and the entities in the text are screened out for storage.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a sedimentology literature mining method based on NLP information extraction and part-of-speech rules.
Background
Natural language processing is an interdisciplinary field that integrates linguistics, computer science, mathematics, and related areas. Natural language processing technology has gradually permeated many industries and is used for text data mining and information storage. Currently, a large number of enterprises and organizations use natural language processing techniques to screen out, in whole or in part, the valuable core hotspots from ever-growing data, in order to reduce retrieval time and improve information analysis capability. From the perspective of named entity recognition, data extensibility needs to be guaranteed while the analysis and understanding of unstructured text are satisfied. The amount of text data to be analyzed in the field of sedimentology keeps increasing; in the big-data era in particular, data mining requires learning from and analyzing massive labeled data sets. To cope with this growing analytical demand, the field of sedimentology requires domain experts to build large rule templates and dictionaries. At present, the field generally depends on manual annotation of text information, which consumes a large amount of time, affects data timeliness, and restricts the dynamic development of the information industry.
To meet the urgent need to save time and cost in the field of sedimentology, how to realize a text information mining method oriented to heterogeneous data sources, supported by natural language processing, has become a hot topic in both industry and academia. Named entity recognition makes it possible to screen the key information of a text. The development of part-of-speech analysis technology promotes efficient processing of text data. Part-of-speech analysis allows key information to be user-defined, which reduces the amount of interfering information in the data and the label noise produced by multi-path matching during data set generation. However, in the part-of-speech analysis process, besides the errors caused by reading characters from the source text, the noise conflicts caused by part-of-speech rules also need to be considered comprehensively. Therefore, an appropriate part-of-speech analysis technique needs to be designed to realize named entity recognition on text data.
Text information extraction techniques have been widely used for text data mining and storage. Information screening for different keyword hotspots can be realized through part-of-speech analysis. For example, the document "N. Piazza, Classification Between Machine Translated Text and Original Text By Part Of Speech Tagging Representation, 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, NSW, Australia, 2020, pp. 739-740" mainly uses word tags to create a frequency probability distribution model with BIO letters to reduce the use of data dictionaries. The document "F. Hussain, U. Qamar and S. Zeb, A Novel Approach for Searching Linear Synthesis of Partial Parts of Speech Tagging, 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Omaha, NE, USA, 2016, pp. 465-468, doi: 10.1109/WI.2016.0076" proposes a part-of-speech tagging method for open (short-text) tagging data that realizes information extraction of similar text sentences through synonyms. Current part-of-speech analysis ignores the identification of valuable information during data mining in professional domains. With the wide application of part-of-speech analysis, the amount of domain text data generating hotspots keeps increasing, which poses a technical challenge to the construction of domain data set labels. Therefore, a text mining method based on keywords and extensible parts of speech needs to be designed to realize dynamic extraction of text data.
Disclosure of Invention
The invention provides a sedimentology literature mining method based on NLP information extraction and part-of-speech rules, aimed at the increasingly prominent time cost of labeling data in the field of sedimentology, and suited to information acquisition from heterogeneous data.
In order to achieve the purpose, the invention adopts the following technical scheme:
A sedimentology literature mining method based on NLP information extraction and part-of-speech rules comprises the following steps:
Step 1: download files containing relevant sedimentology content from websites read in a distributed manner by RPA, according to the lowest flow limit and the expected value of the expected download time;
Step 2: identify the file downloaded in Step 1 with machine vision to obtain the geometric and text attributes of the content objects, judge the type of each content object with a heuristic algorithm to obtain the physical and logical structure of the document, and recognize the text content;
Step 3: analyze the context segments of the text content to obtain a dictionary list of user-defined multi-class entity keywords in the text content;
Step 4: using the text content obtained in Step 2 and the dictionary list obtained in Step 3, identify entity keywords of the same part of speech in the text content according to the entity keyword types by cosine similarity measurement analysis, and generate an unstructured multi-class text;
Step 5: perform multi-path matching on the classified entities using the unstructured multi-class text output in Step 4, record the entity label attributes, and generate a large-sample training data set;
Step 6: for the large-sample data set generated in Step 5, perform NER pre-training with BiLSTM combined with CRF to realize entity recognition on the readable text content of the document from Step 2, and screen the entity keywords according to the context.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, when a task requests a network service system to download a file, the network platform generates an access request record, the record comprises a local network IP address and expected downloading time, the communication system is accessed by utilizing the non-invasive characteristic of the RPA to provide cross access to related sedimentology research content hotspots, and multiple paths of IP addresses are copied to a server in a cross mode;
when the server receives a download request at any time interval, calculating the response time T of a single download task as follows:
T=tdeparture-tarrival;
in the formula, tdepartureRepresenting request arrival time, tarrivalRepresenting the request completion time, wherein an exponential random variable of a single download task response time T is e.r.v, e represents an expected value of expected download time under the single download task response time T, r represents the proportion of download flow of a server to the total bandwidth, and v represents the speed of a download hotspot;
when the download request application is successful and the download hotspot is cross-accessed, the download request is carried out within any period of time x being more than or equal to TimeN being less than or equal to y, and the response time of the download request is betanTherefore, the minimum response time β of the download request(x,y)The expression of (a) is:
in the formula, x is the lowest download time, y is the highest download time, and TimeN is any download time in the time period; expected value of expected download time E [ T ] under single download task response time T(r,t)]The expression of (a) is:
where β (T +1,1/r) represents the minimum response time from time T +1 to time 1/r, μ represents the response rate between different download request commands and the server, E [ T [ T ] ](r,t)]And e both represent expected values of expected download times at a single download task response time T;
selecting a value satisfying the expected value E [ T ](r,t)]The service IP address of (a) is downloaded in multiple ways.
Further, the specific procedure for recognizing the text content is as follows:
the document is identified with machine vision, where in the document (x_nn, y_nn) denotes the upper-left corner coordinates of a character and (x_nm, y_nm) its lower-right corner coordinates, and in the physical structure (x_mn, y_mn) denotes the upper-left corner coordinates and (x_mm, y_mm) the lower-right corner coordinates; the spatial overlap X_overlap between characters and the physically set threshold area Y_overlap are calculated as follows:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn));
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn));
where min(x_nm, x_mm) takes the smaller of the two lower-right x coordinates of the selected character box and the physical-structure box, max(x_nn, x_mn) takes the larger of the two upper-left x coordinates, and min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent shared by the two boxes; likewise, min(y_nm, y_mm) takes the smaller lower-right y coordinate of the two boxes, max(y_nn, y_mn) the larger upper-left y coordinate, and min(y_nm, y_mm) - max(y_nn, y_mn) gives the vertical extent shared by the boxes;
from X_overlap and Y_overlap the maximum character structured area S_overlap is obtained as:
S_overlap = X_overlap × Y_overlap;
because the character structured area S_overlap is much smaller than the physical-structure area, characters are clustered into words, words into sentences, and sentences into paragraphs by comparing the overlap ratio Ratio_overlap, so that the text content is read; Ratio_overlap is expressed in terms of x_nn and x_nm, the x-axis coordinates of the upper-left and lower-right corners of a character in the document, and y_nn and y_nm, the corresponding y-axis coordinates.
Further, the context segments of the text content are analyzed to obtain dictionary lists of user-defined multi-class entity keywords in the text sentences, which are integrated into a dictionary-list data set ER = {er_1, er_2, ..., er_N}, where er_1 is the entity keyword dictionary list of the first category, er_2 that of the second category, and er_N that of the N-th category;
each category record of the entity keyword dictionary list is a multi-attribute tuple, and the tuple of the entity keyword dictionary list of the n-th category in ER is represented as er_n = (tim_n, geo_n, nat_n, org_n, per_n), with 1 ≤ n ≤ N, where tim_n denotes the time at which the entity keyword exists, geo_n the geographic location where the entity keyword was discovered, nat_n the name of the entity keyword corresponding to er_n, org_n the organization that discovered the entity keyword, and per_n the discoverer of the entity keyword.
Further, the relation probability between the text content and the terms is calculated with a large text corpus, and words with the same meaning are set to the same part of speech;
cosine similarity measurement is used to mine the given entity keyword dictionary list in the large text corpus and determine the semantic proximity and word vectors of the entity keywords; er_n is represented in the word vector as vc_n = (B-vc_n, I-vc_n), where B-vc_n denotes the beginning position of the multi-attribute tuple in the word-vector space and I-vc_n its middle position, and regular expressions are used to expand the er_n attribute tuples with the English characters [A-Z] and the numeric characters [0-9];
the cosine similarity cos(θ) is calculated as
cos(θ) = Σ_{i=1}^{m} vc_i·wc_i / ( √(Σ_{i=1}^{m} vc_i²) · √(Σ_{i=1}^{m} wc_i²) ),
where vc_i denotes the i-th word-vector variable among the m word vectors and wc_i the i-th text-sentence phrase word-vector variable among the m text-sentence phrase word vectors; when cos(θ) equals 1, the word wc_i required by the text corpus can be found in the entity keyword dictionary list corresponding to vc_i, thereby realizing mining of the entity keyword dictionary list;
after the entity keyword dictionary list has been mined from the text corpus, the relevant content is extracted from the text content, generating the unstructured multi-class text.
Further, the text sentences of the unstructured multi-class text are classification-matched against er_n; if a text sentence cannot be matched with er_n, it is labeled O;
the text-sentence entity sets matched by all multi-class structure subsets of er_n are represented as ER_n = {B-er_n, O, I-er_n}, thereby generating a training data set with BIO labels.
Further, a CRF is set as the output layer of the BiLSTM; for each input label ER_n the corresponding output label PL_n is obtained, and the probability that the input ER_n is continuously correct is predicted as Score(ER_n, PL_n):
Score(ER_n, PL_n) = Σ_{i=1}^{R} P_(ER_i, PL_i) + Σ_{i=1}^{R} A_(PL_i, PL_i+1),
where R is the total number of labels in the training data set, P_(ER_i, PL_i) is the probability that the i-th input label ER_i yields the output PL_i, and A_(PL_i, PL_i+1) is the transition probability from PL_i to PL_i+1;
the continuously correct probabilities Score(ER_n, PL_n) of all input labels ER_n are determined, and the Viterbi algorithm is used to apply probability normalization P_(PL_n|ER_n) to the input labels ER_n and output labels PL_n, completing the training and mining of the text data, where the probability normalization is
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'_n} exp(Score(ER_n, PL'_n)),
where exp(Score(ER_n, PL_n)) denotes the exponential of the predicted continuously correct probability of the i-th input label ER_i, and the denominator covers the rate of obtaining a wrong output label for the i-th input label ER_i and the exponentials of the continuous probabilities of mispredicted input labels ER_i.
The invention has the beneficial effects that:
1: Text data is downloaded during the cross configuration of multiple IP addresses in the server, which better fits the practical minimum flow limit and the expected download time.
2: In the text content recognition process, a heuristic method is adopted to select the target text, which improves character recognition accuracy and makes the recognition target quicker and more convenient to find.
3: In the keyword part-of-speech analysis, part-of-speech rules are preferentially combined with the keyword dictionary list, which improves the overall mining efficiency for sedimentology keywords and reduces the time cost of manual labeling.
4: In mining the sedimentology data set, a bidirectional long short-term memory neural network is combined with a conditional random field, which improves the accuracy of the sedimentology literature mining strategy design and reduces the recognition noise caused by wrong labels in the data set.
Drawings
FIG. 1 is a flow chart of the overall process steps of the present invention.
FIG. 2 is the accuracy of the training-set test in the BiLSTM model combined with the conditional random field CRF according to the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
The invention discloses a sedimentology literature mining method based on NLP (Natural Language Processing) information extraction and part-of-speech rules, which comprises the following steps. Step 1: according to the lowest flow limit and the expected value of the expected download time, Robotic Process Automation (RPA) is used to read, in a distributed manner, research and conference files containing relevant sedimentology content from websites. Step 2: the files from Step 1 are identified with machine vision to obtain the geometric and text attributes of the content objects, and the type of each content object is judged by a heuristic algorithm to obtain the physical and logical structure of the document. Step 3: the context segments of the document are analyzed, and a user-defined dictionary list of multi-class entity keywords in the sentence text is acquired. Step 4: using the document file obtained in Step 2 and the keyword dictionary obtained in Step 3, keywords of the same part of speech in the text are identified according to the keyword types by cosine similarity measurement analysis, generating an unstructured multi-class text. Step 5: multi-path matching is performed on the classified entities using the unstructured multi-class text output in Step 4, the entity label attributes are recorded, and a large-sample training data set is generated. Step 6: for the large-sample data set generated in Step 5 and the document text to be recognized from Step 3, a bidirectional Long Short-Term Memory neural network model (BiLSTM) combined with a Conditional Random Field (CRF) is used for Named Entity Recognition (NER), so that the required entities are recognized and the entities in the text are screened out and stored.
The sedimentology literature mining method based on NLP information extraction and part-of-speech rules provided by the invention comprises the following steps, and the flow is shown in a figure 1-2:
step 1: and reading research and conference files containing relevant chemistry contents in the website by utilizing RPA distribution from the lowest flow limit MF and the expected value of the expected download time ET.
When a task requests the network service system to download files, the network platform generates an access request record that includes the local network address and the expected download time. Based on the availability codes of the existing network service system and the analysis of distributed storage and expected download time, the communication system can be accessed in a non-intrusive mode through RPA, cross access is performed on relevant depositional research content hotspots, and multiple IP addresses are cross-copied into the server; the multi-IP set is denoted LR = (lr1, lr2, …, lrN), where N is the number of cross IPs in LR.
When the server receives a download request in any time interval, the lowest flow limit is calculated by combining the size of the requested resource with the current degree of bandwidth congestion. As shown in formula (1), T is the response time of a single download task, t_departure denotes the request completion (departure) time, and t_arrival denotes the request arrival time. The lowest flow limit affects the arrival and completion times of the request, and the response time of a single download task is longest under the lowest flow limit; the lowest flow limit defaults to the broadband traffic that meets the lowest download requirement of the website file and can accommodate a download task request at any time point;
T = t_departure - t_arrival (1).
When the request application succeeds, cross access is performed on the download hotspot. T is assumed to be independent, and the response rate between different requests and the server is assumed to be μ; T is an exponential random variable (e.r.v.), namely T = r/v, where e denotes the expected value of the download time, r denotes the ratio of the download server's download traffic to the total bandwidth, and v denotes the download speed of the hotspot file. As shown in formula (2), for a download request within any period x ≤ TimeN ≤ y, the download response time is β_n, where x is the lowest download time and y is the highest download time. The lowest download time x is the minimum time required for the download request during the highest broadband service, and it satisfies server safety so that the download service is not stopped because the request time is too short; the highest download time y is the longest response time of the server to the download request, and server safety ensures that the download service is not suspended because the download time is too long and the service times out. TimeN is any download time within this period. Therefore the minimum response time of the request is β_(x,y), which, as formula (2) shows, integrates over the download speed from 0 to 1 between the lowest download time x and the highest download time y, where the minimum response time is found by taking the fastest speed as 1 (i.e., 100%) and the slowest as 0.
The expected download value is linearly inversely proportional to the download rate: when a high-frequency traffic signal is deployed, the download rate indicates that the service is fully loaded and the expected download value decreases along a curve, and β_(x,y) deduces an available cross IP address for the low-traffic state. S denotes the download time of an arbitrary random variable e.r.v., so S(r, v) ~ Exp(μ) is uniformly distributed in the system. The distribution value P is solved as in formula (3) for the event that T is larger than the download time S of the arbitrary random variable, and formula (4) gives the expression of the expected value E of the expected download time at time T; a service IP address satisfying the expected value is selected for multi-path downloading, which reduces the time required for downloading and downloads as many texts as possible per unit time. Here,
P{T_(r,t) > S} = exp(-μs)(1 - (1 - exp(-μs))^r)^t (3);
where P{T_(r,t) > S} denotes the probability that the response time T of a single download task, related to the server download-traffic ratio r and the download time t in the distribution value P, is larger than the download time S of the arbitrary random variable, and exp(-μs) is the exponential of the negative product of the random-variable download time s and the assumed rate μ; β(t+1, 1/r) denotes the minimum response time from time t+1 to time 1/r, μ denotes the response rate between different download request instructions and the server, and E[T_(r,t)] and e both denote the expected value of the expected download time under the response time T of a single download task.
The meaning of the expected download time can be illustrated as follows: if buses depart at 20-minute intervals and the departure time is uniformly distributed over [0, 20], the expected waiting time is 10 minutes. Similarly, the expected time decreases as the number of IP addresses increases. The expected value estimates the distributed IP value; the actual time may be smaller than, or slightly larger than, the expected value, but it is neither infinite nor infinitesimal, and it satisfies both the shortest download time and the longest delay-stop time of the target website server, otherwise the download cannot proceed.
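To make the role of the expected download time concrete, the following is a minimal Python sketch (not part of the patent) that treats the single-task response time as an exponential random variable with rate μ, estimates an expected download time by simulation, and keeps only the candidate IP addresses whose estimate satisfies a target expected value. The scaling by the traffic ratio r, the IP addresses and the rate values are illustrative assumptions; the closed-form expressions (3) and (4) are not reproduced here.

```python
import numpy as np

def expected_download_time(mu: float, r: float, n_samples: int = 100_000) -> float:
    """Estimate E[T] for a single download task whose response time is exponentially
    distributed with rate mu, scaled by the traffic ratio r (illustrative model)."""
    rng = np.random.default_rng(0)
    samples = rng.exponential(scale=1.0 / mu, size=n_samples)  # raw response times
    return float(np.mean(samples) * r)                         # scale by traffic share

def select_ip_addresses(ip_rates: dict, expected_value: float, r: float) -> list:
    """Keep the IP addresses whose estimated expected download time
    does not exceed the target expected value."""
    return [ip for ip, mu in ip_rates.items()
            if expected_download_time(mu, r) <= expected_value]

# Hypothetical usage: three candidate server IPs with different response rates mu.
candidates = {"10.0.0.1": 2.0, "10.0.0.2": 0.5, "10.0.0.3": 1.2}
print(select_ip_addresses(candidates, expected_value=1.0, r=0.8))
```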
Step 2: identify the downloaded standard file with machine vision to obtain the geometric and text attributes of the tables and pictures in the text, judge the type of each content object with a heuristic algorithm to obtain the physical and logical structure of the document, and recognize the text of the standard file.
First, machine vision is used (machine vision does not require setting the distance between physical structures as a parameter). As shown in formula (5), (x_nn, y_nn) denotes the upper-left corner coordinates of a character in the document and (x_nm, y_nm) its lower-right corner coordinates; as shown in formula (6), (x_mn, y_mn) and (x_mm, y_mm) denote the upper-left and lower-right corner coordinates in the physical structure, respectively. The spatial overlap X_overlap between characters and the physically set threshold area Y_overlap are calculated:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn)) (5);
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn)) (6);
A character is treated as a rectangular box: min(x_nm, x_mm) takes the smaller of the two lower-right x coordinates of the character box and the physical-structure box, max(x_nn, x_mn) the larger of the two upper-left x coordinates, and their difference min(x_nm, x_mm) - max(x_nn, x_mn) is the horizontal extent shared by the two boxes, from which the spatial overlap of the character is obtained. Likewise, min(y_nm, y_mm) takes the smaller lower-right y coordinate, max(y_nn, y_mn) the larger upper-left y coordinate, and min(y_nm, y_mm) - max(y_nn, y_mn) is the vertical extent shared by the boxes, from which the area of the character's physical-structure frame is obtained.
Then, the maximum character structured area S_overlap is calculated from X_overlap and Y_overlap, as shown in formula (7):
S_overlap = X_overlap × Y_overlap (7);
Finally, because the character structured area S_overlap is much smaller than the physical-structure area, characters are clustered by comparing the overlap ratio Ratio_overlap, so that the text content can be read, as shown in formula (8), which is expressed in terms of x_nn and x_nm, the x-axis coordinates of the upper-left and lower-right corners of a character in the document, and y_nn and y_nm, the corresponding y-axis coordinates.
and step 3: and acquiring a self-defined multi-class entity keyword dictionary list in the sentence text according to the document context language segment.
Let the multi-class entity keyword dictionary-list data set be denoted ER; it is a set of records of entity categories, ER = {er_1, er_2, ..., er_N}, where N denotes the number of entity keyword dictionary lists in ER;
each record of an entity category is a multi-attribute tuple, and the n-th tuple (1 ≤ n ≤ N) in ER is represented as er_n = (tim_n, geo_n, nat_n, org_n, per_n), where tim_n denotes the time at which the entity exists, geo_n the geographic location where the entity was discovered, nat_n the name of the entity corresponding to er_n, org_n the organization that discovered the entity, and per_n the discoverer of the entity.
Step 4: identify the keywords of the same part of speech in the text according to the keyword types by cosine similarity measurement analysis, and generate the unstructured multi-class text.
A large text corpus is used to compute the relation probability between documents and terms; words with the same meaning produce similar text, i.e., the same kind of part of speech. Then, cosine similarity measurement is used to mine the given dictionary document against the database and determine semantic proximity and word vectors. er_n is expressed in the word-vector space as vc_n = (B-vc_n, I-vc_n), where B-vc_n denotes the beginning position of the multi-attribute tuple in the word-vector space and I-vc_n its middle position, and regular expressions are used to expand the er_n attribute tuples with the English characters [A-Z] and the numeric characters [0-9]. In these expressions, '?' matches the preceding expression at most once, '*' matches the preceding expression any number of times, '^' matches from the current position, and '$' matches the preceding expression at the end of the string. For example, the time part of speech VB has three expressions, as shown in formulas (9)-(11):
VB1 = r'^~?[0-9]' + r'^[A-Z].*$' (9);
VB2 = r'^±' + r'^~?[0-9]+(.[0-9]+)?$' + r'.*' (10);
VB3 = r'^~?[0-9]+(.[0-9]+)?$' + r'and$' + r'.*' (11);
The process of formula (9) is as follows:
First step: '~?' matches a sentence that contains the ~ symbol or one that does not.
Second step: if the first step is satisfied, [0-9] matches any digit between 0 and 9.
Third step: if the second step is satisfied, [A-Z] matches any letter between A and Z.
Fourth step: if the third step is satisfied, '.*$' matches the letter of the third step any number of times up to the end of the sentence, e.g. ~9 Ma or 9 Ma.
The process of formula (10) is as follows:
First step: '^±' matches a sentence that contains the ± symbol.
Second step: if the first step is satisfied, '~?' matches a sentence with or without the ~ symbol.
Third step: if the second step is satisfied, [0-9] matches any digit between 0 and 9.
Fourth step: if the third step is satisfied, '(.[0-9]+)?' matches the decimal point first and then any digits between 0 and 9.
Fifth step: if the fourth step is satisfied, '.*' matches the previous step any number of times, e.g. ±9.38 or ±9.1.
The process of formula (11) is as follows:
First step: '~?' matches a sentence that contains the ~ symbol or one that does not.
Second step: if the first step is satisfied, [0-9] matches any digit between 0 and 9.
Third step: if the second step is satisfied, '(.[0-9]+)?' matches the decimal point first and then any digits between 0 and 9.
Fourth step: if the third step is satisfied, 'and$' matches a sentence containing "and".
Fifth step: if the fourth step is satisfied, '.*' matches the previous step any number of times, e.g. ~1 and 2 or 1.5 and 1.68.
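The following sketch shows how time expressions of this kind could be matched in Python. The concatenated forms of formulas (9)-(11) are kept verbatim above; the single regular expressions below are an interpretation of them for illustration, and the sample strings are hypothetical.

```python
import re

# Interpreted, single-pattern versions of the time part-of-speech expressions
# VB1-VB3 from formulas (9)-(11); these rewritten patterns are an assumption.
VB1 = re.compile(r'^~?[0-9]+(\.[0-9]+)?\s*[A-Z].*$')          # e.g. "~9 Ma", "9 Ma"
VB2 = re.compile(r'^±~?[0-9]+(\.[0-9]+)?$')                    # e.g. "±9.38", "±9.1"
VB3 = re.compile(r'^~?[0-9]+(\.[0-9]+)?\s*and\s*~?[0-9]+(\.[0-9]+)?$')  # e.g. "1.5 and 1.68"

samples = ["~9 Ma", "9 Ma", "±9.38", "1.5 and 1.68", "sandstone"]
for s in samples:
    tags = [name for name, pat in (("VB1", VB1), ("VB2", VB2), ("VB3", VB3)) if pat.match(s)]
    print(s, "->", tags or ["no time tag"])
```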
The highest cosine similarity value cos(θ) relates the components vc_i (1 ≤ i ≤ m) of vc_n to the components wc_i (1 ≤ i ≤ m) of the phrase word-vector attribute wc_n of each sentence of the text, where vc_i denotes the i-th word-vector variable among the m word vectors and wc_i denotes the i-th text-sentence phrase word-vector variable among the m text-sentence phrase word vectors, as shown in formula (12):
cos(θ) = Σ_{i=1}^{m} vc_i·wc_i / ( √(Σ_{i=1}^{m} vc_i²) · √(Σ_{i=1}^{m} wc_i²) ) (12);
when cos(θ) equals 1, the variables point to the same word-vector space; pointing to the same space means that the given dictionary document is successfully matched with the database and the word required by the database is found in the document;
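A small sketch of the cosine-similarity matching between dictionary word vectors vc_i and sentence-phrase word vectors wc_i; the three-dimensional toy embeddings and the near-1 threshold standing in for cos(θ) = 1 are assumptions for illustration.

```python
import numpy as np

def cosine_similarity(vc: np.ndarray, wc: np.ndarray) -> float:
    """cos(theta) between a dictionary-entry vector vc and a sentence-phrase vector wc."""
    denom = np.linalg.norm(vc) * np.linalg.norm(wc)
    return float(np.dot(vc, wc) / denom) if denom else 0.0

def match_keywords(dictionary_vectors: dict, phrase_vector: np.ndarray, threshold: float = 0.99) -> list:
    """Return dictionary keywords whose vectors are (near-)collinear with the phrase vector;
    the patent treats cos(theta) = 1 as a match, so a high threshold stands in for it."""
    return [word for word, vec in dictionary_vectors.items()
            if cosine_similarity(vec, phrase_vector) >= threshold]

# Hypothetical 3-dimensional toy embeddings.
dictionary = {"mudstone": np.array([1.0, 0.0, 0.0]), "Cretaceous": np.array([0.0, 1.0, 0.0])}
print(match_keywords(dictionary, np.array([2.0, 0.0, 0.0])))  # ['mudstone']
```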
and 5: and outputting and generating an unstructured multi-classification text through cos (theta), traversing the multi-classification text to perform multi-path matching on classification entities respectively, recording entity label attributes, and generating a large sample training data set.
The text sentences are classification-matched against er_n; if a text sentence cannot be matched with er_n, it is labeled O. The sentence entity sets corresponding to all multi-class structure subsets of er_n (here a multi-class structure subset has the same meaning as the set of entity keyword dictionary lists of multiple categories above) are denoted ER_n = {B-er_n, O, I-er_n}, generating training-set data with BIO labels; B-er_n marks the beginning of a sentence entity, and I-er_n marks each remaining part of the entity after the beginning. The labels of the generated data set are B-tim, B-geo, B-nat, O, I-tim, I-geo and I-nat; B-er_n comprises B-tim, B-geo and B-nat, and I-er_n comprises I-tim, I-geo and I-nat.
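A minimal sketch of the BIO labeling produced by multi-path matching against the keyword dictionary; the dictionary contents and the example sentence are hypothetical.

```python
def bio_label(tokens: list, keyword_dict: dict) -> list:
    """Assign BIO labels to a token list using a keyword dictionary that maps
    entity phrases to their category suffix (tim / geo / nat); tokens outside
    any matched phrase get O."""
    labels = ["O"] * len(tokens)
    for phrase, cat in keyword_dict.items():
        words = phrase.split()
        for start in range(len(tokens) - len(words) + 1):
            if tokens[start:start + len(words)] == words:
                labels[start] = f"B-{cat}"
                for k in range(start + 1, start + len(words)):
                    labels[k] = f"I-{cat}"
    return labels

tokens = "The Sichuan Basin mudstone formed 145 Ma ago".split()
keywords = {"Sichuan Basin": "geo", "mudstone": "nat", "145 Ma": "tim"}
print(list(zip(tokens, bio_label(tokens, keywords))))
```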
Step 6: perform NER pre-training with BiLSTM combined with CRF to recognize the sedimentology entities, so that valuable hotspots are screened out according to the context information.
First, the transition matrix in the CRF is used to avoid multiple consecutive B-er_n labels, with the CRF serving as the output layer of the BiLSTM. As shown in formula (13), for each input ER_n the corresponding predicted label PL_n is finally obtained, and the probability that the input ER_n is continuously correct is predicted as Score(ER_n, PL_n) (for example, if the input label is O, the corresponding output label is obtained and the probability that consecutive inputs are O is predicted); P_(ER_i, PL_i) is the probability, between 0 and 1, that the output at the i-th position is PL_i, and A_(PL_i, PL_i+1) is the transition probability from PL_i to PL_i+1:
Score(ER_n, PL_n) = Σ_{i=1}^{R} P_(ER_i, PL_i) + Σ_{i=1}^{R} A_(PL_i, PL_i+1) (13);
where R is the total number of labels in the training data set. Consecutive inputs of O or of I-er_n are correct, while B-er_n is incorrect if it occurs three times consecutively. The transition probability covers jumps such as going directly from O to I-er_n, which is an error under the BIO labels above, and normalizing I-er_n to O is a similar error.
Then, for each ER_n, the Score(ER_n, PL_n) of all PL_n is found, and the Viterbi algorithm is used to apply probability normalization P_(PL_n|ER_n) to the input and output labels, so that the text data is mined, as shown in formula (14):
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'_n} exp(Score(ER_n, PL'_n)) (14);
where exp(Score(ER_n, PL_n)) in the numerator is the exponential of the predicted continuously correct probability of the i-th input label ER_i, and the sum in the denominator additionally covers the exponentials of the continuous probabilities of mispredicted input labels ER_i, i.e. the rate at which a wrong output label is obtained for the i-th input label ER_i (for example, if the original ER_n obtains the corresponding output label PL_i with a probability of 0.7 of being correct, then it is incorrect with a probability of 0.3).
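As a toy illustration of formulas (13) and (14), the sketch below scores a label sequence as the sum of emission and transition scores and normalizes it over all candidate sequences by brute force (standing in for the Viterbi/normalization step); the emission and transition values are made-up numbers, not trained BiLSTM-CRF parameters.

```python
import itertools
import numpy as np

# Toy emission (P) and transition (A) scores for labels {O, B-nat, I-nat},
# standing in for BiLSTM outputs and the CRF transition matrix (values are made up).
LABELS = ["O", "B-nat", "I-nat"]
P = np.array([[0.8, 0.1, 0.1],    # token 1
              [0.2, 0.7, 0.1],    # token 2
              [0.1, 0.2, 0.7]])   # token 3
A = np.array([[0.5, 0.4, 0.1],    # transitions from O
              [0.1, 0.1, 0.8],    # from B-nat
              [0.3, 0.1, 0.6]])   # from I-nat

def score(path: tuple) -> float:
    """Score(ER_n, PL_n) = sum of emission scores + sum of transition scores (formula 13)."""
    emit = sum(P[i, lab] for i, lab in enumerate(path))
    trans = sum(A[path[i], path[i + 1]] for i in range(len(path) - 1))
    return emit + trans

def normalized_probability(path: tuple) -> float:
    """P(PL_n | ER_n): softmax of the path score over all candidate label sequences (formula 14)."""
    all_paths = list(itertools.product(range(len(LABELS)), repeat=P.shape[0]))
    z = sum(np.exp(score(p)) for p in all_paths)
    return float(np.exp(score(path)) / z)

best = max(itertools.product(range(len(LABELS)), repeat=P.shape[0]), key=score)  # brute-force decode
print([LABELS[i] for i in best], normalized_probability(best))
```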
The idea of the invention is as follows: first, according to the lowest flow limit and the expected value of the expected download time, research and conference files containing relevant depositional content are read from websites in a distributed manner using robotic process automation; then, with machine vision, the geometric and text attributes of the studied content objects are acquired, and a heuristic algorithm judges the type of each content object to obtain the physical and logical structure of the document; further, the context segments of the document are analyzed and a user-defined dictionary list of multi-class entity keywords in the sentence text is acquired; on this basis, cosine similarity measurement analysis identifies keywords of the same part of speech in the text according to the keyword types and generates an unstructured multi-class text, which is then output; next, for the unstructured text, multi-path matching is performed on the classified entities, the entity label attributes are recorded, and a large-sample training data set is generated; finally, on the large-sample data set, named entities are recognized using a bidirectional long short-term memory neural network model combined with a conditional random field, so that the required entities are recognized and the entities in the text are screened out for storage.
Example: this example selects depositional text-matching data as the input data set for the experiment and uses TensorFlow as the simulation platform.
The parameters involved in the experimental environment are shown in table 1.
Table 1 parameter settings involved in the execution of the method
Experimental parameter | Value
Beginning of substance | B-nat
Middle of substance | I-nat
Beginning of time | B-tim
Middle of time | I-tim
Beginning of location | B-geo
Middle of location | I-geo
Others | O
Number of data set records | 274292
FIG. 2 shows the accuracy of the training-set test in the BiLSTM model combined with the conditional random field CRF according to the present invention.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the invention; changes or adjustments of their relative relationships, without substantive change to the technical content, are also to be regarded as within the implementable scope of the invention.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.
Claims (7)
1. A sedimentology literature mining method based on NLP information extraction and part-of-speech rules, characterized by comprising the following steps:
Step 1: downloading files containing relevant sedimentology content from websites read in a distributed manner by RPA, according to the lowest flow limit and the expected value of the expected download time;
Step 2: identifying the file downloaded in Step 1 with machine vision to obtain the geometric and text attributes of the content objects, judging the type of each content object with a heuristic algorithm to obtain the physical and logical structure of the document, and recognizing the text content;
Step 3: analyzing the context segments of the text content to obtain a dictionary list of user-defined multi-class entity keywords in the text content;
Step 4: using the text content obtained in Step 2 and the dictionary list obtained in Step 3, identifying entity keywords of the same part of speech in the text content according to the entity keyword types by cosine similarity measurement analysis, and generating an unstructured multi-class text;
Step 5: performing multi-path matching on the classified entities using the unstructured multi-class text output in Step 4, recording the entity label attributes, and generating a large-sample training data set;
Step 6: for the large-sample data set generated in Step 5, performing NER pre-training with BiLSTM combined with CRF to realize entity recognition on the readable text content of the document from Step 2, and screening the entity keywords according to the context.
2. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, characterized in that:
when a task requests the network service system to download files, the network platform generates an access request record that includes the local network IP address and the expected download time; the communication system is accessed using the non-intrusive characteristic of RPA to provide cross access to relevant sedimentology research content hotspots, and multiple IP addresses are cross-copied to the server;
when the server receives a download request in any time interval, the response time T of a single download task is calculated as:
T = t_departure - t_arrival;
where t_departure denotes the request completion (departure) time and t_arrival denotes the request arrival time; the response time T of a single download task is an exponential random variable (e.r.v.), where e denotes the expected value of the expected download time under the single-task response time T, r denotes the ratio of the server's download traffic to the total bandwidth, and v denotes the download speed of the hotspot;
when the download request application succeeds and the download hotspot is cross-accessed, the download request is issued within any period x ≤ TimeN ≤ y and its response time is β_n; the minimum response time of the download request is therefore β_(x,y), where x is the lowest download time, y is the highest download time, and TimeN is any download time within the period;
the expected value E[T_(r,t)] of the expected download time under the single-task response time T is expressed through β(t+1, 1/r), the minimum response time from time t+1 to time 1/r, and μ, the response rate between the different download request instructions and the server, where E[T_(r,t)] and e both denote the expected value of the expected download time under the single-task response time T;
a service IP address satisfying the expected value E[T_(r,t)] is selected for multi-path downloading.
3. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, characterized in that the specific procedure for recognizing the text content is:
the document is identified with machine vision, where in the document (x_nn, y_nn) denotes the upper-left corner coordinates of a character and (x_nm, y_nm) its lower-right corner coordinates, and in the physical structure (x_mn, y_mn) denotes the upper-left corner coordinates and (x_mm, y_mm) the lower-right corner coordinates; the spatial overlap X_overlap between characters and the physically set threshold area Y_overlap are calculated as follows:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn));
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn));
where min(x_nm, x_mm) takes the smaller of the two lower-right x coordinates of the selected character box and the physical-structure box, max(x_nn, x_mn) takes the larger of the two upper-left x coordinates, and min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent shared by the two boxes; likewise, min(y_nm, y_mm) takes the smaller lower-right y coordinate of the two boxes, max(y_nn, y_mn) the larger upper-left y coordinate, and min(y_nm, y_mm) - max(y_nn, y_mn) gives the vertical extent shared by the boxes;
from X_overlap and Y_overlap the maximum character structured area S_overlap is obtained as:
S_overlap = X_overlap × Y_overlap;
because the character structured area S_overlap is much smaller than the physical-structure area, characters are clustered into words, words into sentences, and sentences into paragraphs by comparing the overlap ratio Ratio_overlap, so that the text content is read, where Ratio_overlap is expressed in terms of x_nn and x_nm, the x-axis coordinates of the upper-left and lower-right corners of a character in the document, and y_nn and y_nm, the corresponding y-axis coordinates.
4. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, characterized in that:
the context segments of the text content are analyzed to obtain dictionary lists of user-defined multi-class entity keywords in the text sentences, which are integrated into a dictionary-list data set ER = {er_1, er_2, ..., er_N}, where er_1 is the entity keyword dictionary list of the first category, er_2 that of the second category, and er_N that of the N-th category;
each category record of the entity keyword dictionary list is a multi-attribute tuple, and the tuple of the entity keyword dictionary list of the n-th category in ER is represented as er_n = (tim_n, geo_n, nat_n, org_n, per_n), with 1 ≤ n ≤ N, where tim_n denotes the time at which the entity keyword exists, geo_n the geographic location where the entity keyword was discovered, nat_n the name of the entity keyword corresponding to er_n, org_n the organization that discovered the entity keyword, and per_n the discoverer of the entity keyword.
5. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 4, characterized in that:
the relation probability between the text content and the terms is calculated with a large text corpus, and words with the same meaning are set to the same part of speech;
cosine similarity measurement is used to mine the given entity keyword dictionary list in the large text corpus and determine the semantic proximity and word vectors of the entity keywords; er_n is represented in the word vector as vc_n = (B-vc_n, I-vc_n), where B-vc_n denotes the beginning position of the multi-attribute tuple in the word-vector space and I-vc_n its middle position, and regular expressions are used to expand the er_n attribute tuples with the English characters [A-Z] and the numeric characters [0-9];
the cosine similarity cos(θ) is calculated as
cos(θ) = Σ_{i=1}^{m} vc_i·wc_i / ( √(Σ_{i=1}^{m} vc_i²) · √(Σ_{i=1}^{m} wc_i²) ),
where vc_i denotes the i-th word-vector variable among the m word vectors and wc_i the i-th text-sentence phrase word-vector variable among the m text-sentence phrase word vectors; when cos(θ) equals 1, the word wc_i required by the text corpus can be found in the entity keyword dictionary list corresponding to vc_i, thereby realizing mining of the entity keyword dictionary list;
after the entity keyword dictionary list has been mined from the text corpus, the relevant content is extracted from the text content, generating the unstructured multi-class text.
6. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 5, characterized in that
the text sentences of the unstructured multi-class text are classification-matched against er_n, and if a text sentence cannot be matched with er_n it is labeled O;
the text-sentence entity sets matched by all multi-class structure subsets of er_n are represented as ER_n = {B-er_n, O, I-er_n}, thereby generating a training data set with BIO labels.
7. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 6, characterized in that
a CRF is set as the output layer of the BiLSTM; for each input label ER_n the corresponding output label PL_n is obtained, and the probability that the input ER_n is continuously correct is predicted as Score(ER_n, PL_n):
Score(ER_n, PL_n) = Σ_{i=1}^{R} P_(ER_i, PL_i) + Σ_{i=1}^{R} A_(PL_i, PL_i+1),
where R is the total number of labels in the training data set, P_(ER_i, PL_i) is the probability that the i-th input label ER_i yields the output PL_i, and A_(PL_i, PL_i+1) is the transition probability from PL_i to PL_i+1;
the continuously correct probabilities Score(ER_n, PL_n) of all input labels ER_n are determined, and the Viterbi algorithm is used to apply probability normalization P_(PL_n|ER_n) to the input labels ER_n and output labels PL_n, completing the training and mining of the text data, where the probability normalization is
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'_n} exp(Score(ER_n, PL'_n)),
where exp(Score(ER_n, PL_n)) denotes the exponential of the predicted continuously correct probability of the i-th input label ER_i, and the denominator covers the rate of obtaining a wrong output label for the i-th input label ER_i and the exponentials of the continuous probabilities of mispredicted input labels ER_i.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110818775.XA CN113468890B (en) | 2021-07-20 | 2021-07-20 | Sedimentology literature mining method based on NLP information extraction and part-of-speech rules |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113468890A (en) | 2021-10-01
CN113468890B (en) | 2023-05-26
Family
ID=77881608
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110818775.XA Active CN113468890B (en) | 2021-07-20 | 2021-07-20 | Sedimentology literature mining method based on NLP information extraction and part-of-speech rules |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113468890B (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20200044176A (en) * | 2018-10-05 | 2020-04-29 | 동아대학교 산학협력단 | System and Method for Korean POS Tagging Using the Concatenation of Jamo and Syllable Embedding
CN109672613A (en) * | 2018-12-12 | 2019-04-23 | 北京数码视讯软件技术发展有限公司 | Adaptive access method, apparatus and electronic equipment |
CN109558569A (en) * | 2018-12-14 | 2019-04-02 | 昆明理工大学 | Laotian part-of-speech tagging method based on BiLSTM+CRF model
CN111950287A (en) * | 2020-08-20 | 2020-11-17 | 广东工业大学 | Text-based entity identification method and related device |
CN112417880A (en) * | 2020-11-30 | 2021-02-26 | 太极计算机股份有限公司 | Court electronic file oriented case information automatic extraction method |
CN112632228A (en) * | 2020-12-30 | 2021-04-09 | 深圳供电局有限公司 | Text mining-based auxiliary bid evaluation method and system |
CN112817561A (en) * | 2021-02-02 | 2021-05-18 | 山东省计算中心(国家超级计算济南中心) | Structured extraction method and system for transaction function points of software requirement document |
CN112801010A (en) * | 2021-02-07 | 2021-05-14 | 华南理工大学 | Visual rich document information extraction method for actual OCR scene |
Non-Patent Citations (2)
Title |
---|
K.E. Ravikumar: "BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences", Database *
刘炜 et al.: "An automatic text corpus annotation method for emergency events", Journal of Chinese Information Processing (中文信息学报) *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114117061A (en) * | 2021-10-27 | 2022-03-01 | 南京信息工程大学 | River facies knowledge graph reverse-deducing method based on data mining and tree structure |
CN114625885A (en) * | 2022-03-07 | 2022-06-14 | 南京信息工程大学 | Entity dependency extraction and identification method, system and device based on NLP and trigger and storage medium |
CN114625885B (en) * | 2022-03-07 | 2024-10-18 | 南京信息工程大学 | Entity affiliation extraction and identification method, system, device and storage medium based on NLP and trigger |
CN117076703A (en) * | 2023-10-11 | 2023-11-17 | 中邮消费金融有限公司 | Automatic card structured information extraction technical method and system |
CN117076703B (en) * | 2023-10-11 | 2024-02-06 | 中邮消费金融有限公司 | Automatic card structured information extraction technical method |
Also Published As
Publication number | Publication date |
---|---|
CN113468890B (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298033B (en) | Keyword corpus labeling training extraction system | |
CN109858010B (en) | Method and device for recognizing new words in field, computer equipment and storage medium | |
CN107992585B (en) | Universal label mining method, device, server and medium | |
US8156053B2 (en) | Automated tagging of documents | |
CN110727779A (en) | Question-answering method and system based on multi-model fusion | |
CN112100356A (en) | Knowledge base question-answer entity linking method and system based on similarity | |
CN113468890B (en) | Sedimentology literature mining method based on NLP information extraction and part-of-speech rules | |
CN110162771B (en) | Event trigger word recognition method and device and electronic equipment | |
CN112101040A (en) | Ancient poetry semantic retrieval method based on knowledge graph | |
CN114254653A (en) | Scientific and technological project text semantic extraction and representation analysis method | |
CN110728151B (en) | Information depth processing method and system based on visual characteristics | |
CN112818093A (en) | Evidence document retrieval method, system and storage medium based on semantic matching | |
KR20220134695A (en) | System for author identification using artificial intelligence learning model and a method thereof | |
CN111274822A (en) | Semantic matching method, device, equipment and storage medium | |
CN107357765A (en) | Word document flaking method and device | |
CN113961666A (en) | Keyword recognition method, apparatus, device, medium, and computer program product | |
CN113947086A (en) | Sample data generation method, training method, corpus generation method and apparatus | |
CN114840685A (en) | Emergency plan knowledge graph construction method | |
CN118132719A (en) | Intelligent dialogue method and system based on natural language processing | |
CN116644148A (en) | Keyword recognition method and device, electronic equipment and storage medium | |
CN113946668A (en) | Semantic processing method, system and device based on edge node and storage medium | |
CN113656429A (en) | Keyword extraction method and device, computer equipment and storage medium | |
CN116955534A (en) | Intelligent complaint work order processing method, intelligent complaint work order processing device, intelligent complaint work order processing equipment and storage medium | |
WO2023083176A1 (en) | Sample processing method and device and computer readable storage medium | |
Eswaraiah et al. | A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||