CN113468890A - Sedimentology literature mining method based on NLP information extraction and part-of-speech rules - Google Patents

Sedimentology literature mining method based on NLP information extraction and part-of-speech rules

Info

Publication number
CN113468890A
CN113468890A (application CN202110818775.XA; granted as CN113468890B)
Authority
CN
China
Prior art keywords
text, representing, entity, download, time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110818775.XA
Other languages
Chinese (zh)
Other versions
CN113468890B (en)
Inventor
胡志臣
许小龙
胡祥奔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology
Priority to CN202110818775.XA
Publication of CN113468890A
Application granted
Publication of CN113468890B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/237 Lexical tools
    • G06F 40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A sedimentology literature mining method based on NLP information extraction and part-of-speech rules comprises the following steps. Step 1: download the related files according to the lowest flow limit and the expected value of the expected download time. Step 2: recognize the text contents using machine vision. Step 3: analyze the context language segments of the document and acquire a user-defined multi-class entity keyword dictionary list in the sentence text. Step 4: identify the keywords with the same part of speech in the text according to the keyword types, using a cosine similarity measurement analysis technique, to generate an unstructured multi-classification text. Step 5: perform multi-path matching on the classified entities respectively, record entity label attributes, and generate a large sample training data set. Step 6: for the large sample data set generated in step 5 and the document text to be recognized in step 3, perform named entity recognition using a bidirectional long short-term memory neural network model combined with a conditional random field, realizing recognition of the required entities, and screen out the entities in the text for storage.

Description

Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a sedimentology literature mining method based on NLP information extraction and part-of-speech rules.
Background
Natural language processing is a cross-discipline integrating linguistics, computer science, mathematics and related fields, and natural language processing technology has gradually permeated many industries for text data mining and information storage. Currently, a large number of enterprises and organizations use natural language processing techniques to screen valuable core hotspots, wholly or partly, out of ever-growing data, so as to reduce retrieval time and improve information analysis capability. From the perspective of named entity recognition, data extensibility needs to be guaranteed while satisfying the analysis and understanding of unstructured text. The amount of text data to be analyzed in the field of sedimentology keeps increasing, and in the big-data era in particular, data mining requires learning and analyzing massive labeled data sets. To cope with this growing analytical demand, the field of sedimentology requires domain experts to build large rule templates and dictionaries. At present the field generally depends on manual annotation of text information, which consumes a large amount of time, harms data timeliness and restricts the dynamic development of the information industry.
To meet the urgent need to save time and expense in the field of sedimentology, how to realize a text information mining method oriented to heterogeneous data sources, supported by natural language processing, has become a focus of both industry and academia. Named entity recognition makes it possible to screen the key information of a text. The development of part-of-speech analysis technology promotes efficient processing of text data. Through part-of-speech analysis, key information can be user-defined, reducing the amount of interfering information in the data and the label noise produced by multi-path matching during data set generation. However, in the part-of-speech analysis process, besides the errors caused by reading characters from standard text, the noise conflicts caused by part-of-speech rules must also be considered comprehensively. Therefore, a suitable part-of-speech analysis technique needs to be designed to realize named entity recognition on text data.
Text information extraction techniques have been widely used for text data mining and storage. Information screening for different keyword hotspots can be realized through part-of-speech analysis. For example, the document "N. Piazza, Classification Between Machine Transformed Text and Original Text by Part-of-Speech Labeling Reproduction, 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA), Sydney, NSW, Australia, 2020, pp. 739-740" mainly uses word tags to create a frequency probability distribution model with BIO letters to reduce the use of data dictionaries. The document "F. Hussain, U. Qamar and S. Zeb, A Novel Approach for Searching Linear Synthesis of Partial Parts of Speech Tagging, 2016 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Omaha, NE, USA, 2016, pp. 465-468, doi: 10.1109/WI.2016.0076" proposes a part-of-speech tagging method for open text data, namely short-text tagging data, and realizes information extraction of similar text sentences via synonyms. Current part-of-speech analysis ignores the identification of valuable information during data mining in professional domains. With the wide application of part-of-speech analysis technology, the volume of domain text data generating hotspots keeps increasing, which will pose a technical challenge for constructing domain data set labels. Therefore, a text mining method based on keywords and extensible parts of speech needs to be designed to realize dynamic extraction of text data.
Disclosure of Invention
The invention provides a sedimentology literature mining method based on NLP information extraction and part-of-speech rules, aimed at the increasingly prominent time cost of labeling data in the field of sedimentology, and the method is suitable for information acquisition from heterogeneous data.
In order to achieve the purpose, the invention adopts the following technical scheme:
a sedimentology literature mining method based on NLP information extraction and part-of-speech rules,
step 1: downloading files containing sedimentology-related content by using RPA to read websites in a distributed manner, according to the lowest flow limit and the expected value of the expected download time;
step 2: identifying the file downloaded in the step 1 according to machine vision so as to obtain the geometric attribute and the text attribute of the content object, judging the type of the content object by a heuristic algorithm to obtain the physical structure and the logical structure of the document, and identifying the text content;
step 3: analyzing context language segments of the text content to obtain a dictionary list of user-defined multi-class entity keywords in the text content;
step 4: recognizing entity keywords with the same part of speech in the text content according to the entity keyword types, using the text content obtained in step 2, the dictionary list obtained in step 3 and a cosine similarity measurement analysis technique, to generate an unstructured multi-classification text;
step 5: performing multi-path matching on the classified entities respectively with the unstructured multi-classification text output in step 4, recording entity label attributes, and generating a large sample training data set;
step 6: performing NER pre-training with BiLSTM combined with CRF on the large sample data set generated in step 5 to realize entity recognition on the readable text content of the document in step 2, and screening entity keywords according to the context.
In order to optimize the technical scheme, the specific measures adopted further comprise:
further, when a task requests a network service system to download a file, the network platform generates an access request record, the record comprises a local network IP address and expected downloading time, the communication system is accessed by utilizing the non-invasive characteristic of the RPA to provide cross access to related sedimentology research content hotspots, and multiple paths of IP addresses are copied to a server in a cross mode;
when the server receives a download request in any time interval, the response time T of a single download task is calculated as:
T = t_departure - t_arrival
where t_departure denotes the request completion (departure) time and t_arrival denotes the request arrival time; the exponential random variable of the single-download-task response time T is e.r.v, where e represents the expected value of the expected download time under the single-download-task response time T, r represents the proportion of the server's download traffic in the total bandwidth, and v represents the speed of the download hotspot;
when the download request is granted and the download hotspot is cross-accessed, the download request is carried out within any period x ≤ TimeN ≤ y, and the response time of an individual download request is β_n; the minimum response time of the download request, β_(x,y), is therefore given by:
[formula image: expression for the minimum response time β_(x,y)]
where x is the lowest download time, y is the highest download time, and TimeN is any download time in the period; the expected value E[T_(r,t)] of the expected download time under the single-download-task response time T is given by:
[formula image: expression for the expected value E[T_(r,t)]]
where β(t+1, 1/r) represents the minimum response time from time t+1 to time 1/r, μ represents the response rate between the different download request commands and the server, and E[T_(r,t)] and e both denote the expected value of the expected download time under the single-download-task response time T;
a service IP address satisfying the expected value E[T_(r,t)] is selected for multi-path downloading.
Further, the specific content of recognizing the text content is as follows:
the document is identified using machine vision, in which (x_nn, y_nn) represents the coordinates of the upper left corner of a character and (x_nm, y_nm) the coordinates of its lower right corner, while in the physical structure (x_mn, y_mn) represents the coordinates of the upper left corner and (x_mm, y_mm) the coordinates of the lower right corner; the space area X_overlap between characters and the physically set threshold area Y_overlap are calculated as follows:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn));
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn));
where min(x_nm, x_mm) is the smaller of the two right-edge x coordinates of the selected character frame and the physical-structure frame, max(x_nn, x_mn) is the larger of their two left-edge x coordinates, and min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent of the overlap between the frames; likewise, min(y_nm, y_mm) is the smaller of the two bottom-edge y coordinates, max(y_nn, y_mn) the larger of the two top-edge y coordinates, and min(y_nm, y_mm) - max(y_nn, y_mn) the vertical extent of the overlap;
from X_overlap and Y_overlap, the maximum character structured area S_overlap is obtained as:
S_overlap = X_overlap × Y_overlap
since the character structured area S_overlap is much smaller than the physical structure area, the overlap ratio Ratio_overlap is compared to cluster characters into words, words into sentences and sentences into paragraphs, thereby reading the text content; Ratio_overlap is expressed as:
[formula image: expression for Ratio_overlap in terms of S_overlap and the character corner coordinates]
where x_nn is the x-axis coordinate of the upper left corner of a character in the document, x_nm the x-axis coordinate of its lower right corner, y_nn the y-axis coordinate of its upper left corner, and y_nm the y-axis coordinate of its lower right corner.
Further, the context language segments of the text content are analyzed to obtain dictionary lists of user-defined multi-class entity keywords in the text sentences, which are integrated into a dictionary list data set ER = {er_1, er_2, ..., er_N}, where er_1 is the entity keyword dictionary list of the first category, er_2 the entity keyword dictionary list of the second category, and er_N the entity keyword dictionary list of the Nth category;
each category record of the entity keyword dictionary list is a multi-attribute tuple, and the tuple of the entity keyword dictionary list of the nth category in ER is expressed as ER_n = (tim_n, geo_n, nat_n, org_n, per_n), where 1 ≤ n ≤ N, tim_n represents the time at which the entity keyword exists, geo_n the geographic location where the entity keyword was discovered, nat_n the name of the corresponding entity keyword, org_n the organization that discovered the entity keyword, and per_n the discoverer of the entity keyword.
Further, the relation probability between the text content and terms is calculated using a large text corpus, and words with the same meaning are assigned the same part of speech;
cosine similarity measurement is used to perform text mining on the given entity keyword dictionary list in the large text corpus and to determine the semantic proximity and word vectors of the entity keywords; er_n is denoted in the word-vector space as vc_n = (B-vc_n, I-vc_n), where B-vc_n represents the beginning position of the multi-attribute tuple in the word vector space and I-vc_n the middle position, and regular expressions are used to extend the er_n attribute tuples with expressions over the English characters [A-Z] and the numeric characters [0-9];
the cosine similarity cos(θ) is calculated as:
cos(θ) = Σ_{i=1..m}(vc_i × wc_i) / (sqrt(Σ_{i=1..m} vc_i²) × sqrt(Σ_{i=1..m} wc_i²))
where vc_i denotes the ith word vector variable among a total of m word vectors and wc_i the ith text-sentence phrase word vector variable among a total of m text-sentence phrase word vectors; when the cosine similarity cos(θ) equals 1, the word required by the text corpus can be found in the entity keyword dictionary list corresponding to vc_i through wc_i, realizing mining of the entity keyword dictionary list;
after the entity keyword dictionary list has been mined with the text corpus, the relevant contents are extracted from the text content, thereby generating the unstructured multi-classification text.
Further, the text sentences of the unstructured multi-classification text are matched against er_n by category, and any part of a text sentence that cannot be matched to er_n is represented as O;
the text-sentence entity sets matched to all multi-classification structure subsets of er_n are expressed as ER_n = {B-er_n, O, I-er_n}, thereby generating a training data set with BIO labels.
Further, let the CRF be the output layer of the BiLSTM; for each input label ER_n, the corresponding output label PL_n is obtained, and the probability that the input ER_n is continuously predicted correctly is Score(ER_n, PL_n), expressed as:
Score(ER_n, PL_n) = Σ_{i=1..R} P_(ER_i, PL_i) + Σ_{i=1..R-1} A_(PL_i, PL_{i+1})
where R is the total number of labels in the training data set, P_(ER_i, PL_i) is the probability that the ith input label ER_i is output as PL_i, and A_(PL_i, PL_{i+1}) is the transition probability from PL_i to PL_{i+1};
after the continuously correct probabilities Score(ER_n, PL_n) of all input labels ER_n have been determined, the Viterbi algorithm is used to perform the probability normalization P_(PL_n|ER_n) on the input labels ER_n and output labels PL_n so as to complete training and mine the text data, where the probability normalization P_(PL_n|ER_n) is expressed as:
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'} exp(Score(ER_n, PL'))
where exp(Score(ER_i, PL_i)) denotes the exponential (index value) of the continuously correct prediction probability for the ith input label ER_i, and the terms exp(Score(ER_i, PL')) in the denominator are the exponentials of the continuous probabilities of the mispredicted output labels for the ith input label ER_i.
The invention has the beneficial effects that:
1: in the process of cross configuration of multiple IP addresses in the server, the text data is downloaded, and the method is more suitable for the practical minimum flow limit and the expected downloading time.
2: in the text content identification process, a heuristic method is adopted to select the target text, so that the character identification accuracy is improved, and the identification target can be found more quickly and conveniently.
3: in the keyword part-of-speech analysis, part-of-speech rules are preferentially combined with a keyword dictionary list, so that the overall mining efficiency of the sedimentary keywords is improved, and the time cost caused by manual labeling is reduced.
4: in the process of data mining for a sedimentology data set, a bidirectional long-short-term neural network is adopted and a conditional random field is combined, so that the accuracy of the sedimentology literature mining strategy design is improved, and the identification noise caused by the wrong label of the data set is reduced.
Drawings
FIG. 1 is a flow chart of the overall process steps of the present invention.
FIG. 2 shows the accuracy of the training set test with the BiLSTM model combined with the conditional random field CRF according to the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings.
The invention discloses a sedimentology literature mining method based on NLP (natural language processing) information extraction and part-of-speech rules, which comprises the following steps. Step 1: according to the lowest flow limit and the expected value of the expected download time, robotic process automation (RPA) is used to read, in a distributed manner, research and conference files containing sedimentology-related content on websites. Step 2: the files from step 1 are identified using machine vision to obtain the geometric attributes and text attributes of the content objects, and a heuristic algorithm judges the types of the content objects to obtain the physical structure and logical structure of the document. Step 3: the context language segments of the document are analyzed to acquire a user-defined multi-class entity keyword dictionary list in the sentence text. Step 4: using the document files obtained in step 2 and the keyword dictionary obtained in step 3, keywords with the same part of speech in the text are identified according to the keyword types with a cosine similarity measurement analysis technique, generating an unstructured multi-classification text. Step 5: with the unstructured multi-classification text output in step 4, multi-path matching is performed on the classified entities respectively, entity label attributes are recorded, and a large sample training data set is generated. Step 6: for the large sample data set generated in step 5 and the document text identified in step 3, a bidirectional long short-term memory neural network model (BiLSTM) combined with a conditional random field (CRF) performs named entity recognition (NER), realizing recognition of the required entities, and the entities in the text are screened out and stored.
The sedimentology literature mining method based on NLP information extraction and part-of-speech rules provided by the invention comprises the following steps; the flow is shown in FIGS. 1-2.
Step 1: research and conference files containing sedimentology-related content are read from websites in a distributed manner using RPA, subject to the lowest flow limit MF and the expected value ET of the expected download time.
When a task requests the network service system to download files, the network platform generates an access request record containing the local network address and the expected download time. On the basis of the existing availability codes of the network service system and an analysis of distributed storage and expected download time, the communication system can be accessed non-intrusively using RPA, related sedimentology research content hotspots are cross-accessed, and multiple IP addresses are cross-copied into the server; the multi-IP set is denoted LR = (lr1, lr2, ..., lrN), where N represents the number of cross IPs in LR.
When the server receives a download request in any time interval, the lowest flow limit is calculated from the size of the requested resource and the current bandwidth congestion, as shown in formula (1), where T is the response time of a single download task, t_departure denotes the request completion (departure) time and t_arrival denotes the request arrival time. The lowest flow limit influences the arrival time and completion time of requests, and under the lowest flow limit the response time of a single download task is longest; the lowest flow limit defaults to the broadband traffic that satisfies the minimum download requirement of the website file, and it can accommodate a download task request at any time point;
T = t_departure - t_arrival (1).
and when the request application is successful, performing cross access on the download hotspot. T is assumed to be independent, response rate is assumed to be mu between different requests and servers, exponential random variable is e.r.v, namely T ═ r/v, wherein e represents download time expected value, r represents download server download flow rate to total bandwidth ratio, v represents download speed of hot spot file, as shown in formula (2), for download request x ≦ TimeN ≦ y in any period of time, download time in response is βnWherein x is the lowest download time, y is the highest download time, the lowest download time x is the minimum time for the required download request during the highest broadband service, and it meets the server safety and will not stop downloading service because of too short request time, the highest download time y is the longest response time of the server to the download request, and the server safety will not pause downloading service because of too long download time and service overtime, TimON is any download time during this time period, therefore, the minimum response time beta in the request(x,y)
[formula image (2): expression for the minimum response time β_(x,y); per the accompanying explanation, the lowest download time x and the highest download time y are integrated with respect to the download speed over the range 0 to 1, where the minimum response time is found assuming the fastest speed is 1, i.e. 100%, and the slowest speed is 0.]
The expected download value is linearly inversely proportional to the download rate: when a high-frequency traffic signal is deployed, the download rate shows that the service is fully loaded, the expected download value follows a decreasing curve, and β_(x,y) infers an available cross IP address for the low-traffic state. Let S represent the download time of an arbitrary random variable e.r.v, so that S(r, v) ~ Exp(μ) is uniformly distributed in the system; the distribution value P is solved as shown in formula (3), where T is larger than the download time S of the arbitrary random variable, and formula (4) gives the expression of the expected value E of the expected download time at time t. A service IP address satisfying the expected value is selected for multi-path downloading, which reduces the time required for downloading and downloads as many texts as possible per unit time:
P{T_(r,t) > S} = exp(-μs)·(1 - (1 - exp(-μs))^r)^t (3);
[formula image (4): expression for the expected value E[T_(r,t)] of the expected download time]
In the formulas, P{T_(r,t) > S} denotes the probability that the single-download-task response time T, related to the proportion r of the download server's traffic in the total bandwidth and the download time t, is larger than the download time S of the arbitrary random variable, and exp(-μs) denotes the exponential of the negative product of the random-variable download time s and the assumed rate μ; β(t+1, 1/r) represents the minimum response time from time t+1 to time 1/r, μ represents the response rate between different download request instructions and the server, and E[T_(r,t)] and e both denote the expected value of the expected download time under the single-download-task response time T.
The meaning of the expected download time is as follows: for example, for a bus with a 20-minute departure interval, the expected waiting time is 10 minutes because the departure time is uniformly distributed on [0, 20]. Similarly, the expected time decreases as the number of IP addresses increases; the expected value estimates the distributed IP value, and the actual time can be somewhat less or somewhat greater than the expected value, but it is neither infinite nor infinitesimal, satisfying both the shortest download time and the longest delay before the remote website server stops the service, otherwise the download cannot be performed.
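As an illustrative sketch only, the idea of keeping the service IP addresses whose estimated expected response time meets the target expected value can be outlined in Python as follows; the IP addresses, response rates and the Monte-Carlo estimation below are assumptions for illustration and are not prescribed by the method:

```python
import random

def expected_response_time(mu: float, trials: int = 10000) -> float:
    # Monte-Carlo estimate of the mean response time of one service IP,
    # assuming the response time is exponentially distributed with rate mu
    return sum(random.expovariate(mu) for _ in range(trials)) / trials

def select_ips(candidate_rates: dict, expected_value: float) -> list:
    # keep the service IP addresses whose estimated expected download time
    # does not exceed the target expected value e
    return [ip for ip, mu in candidate_rates.items()
            if expected_response_time(mu) <= expected_value]

if __name__ == "__main__":
    # hypothetical response rates (responses per second) for three mirror IPs
    rates = {"10.0.0.1": 2.0, "10.0.0.2": 0.5, "10.0.0.3": 1.2}
    print(select_ips(rates, expected_value=1.0))
    # with these hypothetical rates: ['10.0.0.1', '10.0.0.3']
```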
Step 2: the downloaded standard file is identified using machine vision to obtain the geometric attributes and text attributes of the tables and pictures in the text, a heuristic algorithm judges the type of each content object to obtain the physical structure and logical structure of the document, and the text of the standard file is recognized.
First, machine vision is used (machine vision does not require setting a parameter for the distance between physical structures). As shown in formula (5), (x_nn, y_nn) represents the coordinates of the upper left corner of a character in the document and (x_nm, y_nm) the coordinates of its lower right corner; as shown in formula (6), (x_mn, y_mn) and (x_mm, y_mm) represent the upper left and lower right corner coordinates of the physical structure, respectively. The space area X_overlap between characters and the physically set threshold area Y_overlap are calculated as:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn)) (5);
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn)) (6);
A character is treated as a rectangular box, where min(x_nm, x_mm) is the smaller of the two right-edge x coordinates of the selected character frame and the physical-structure frame and max(x_nn, x_mn) is the larger of their two left-edge x coordinates, so that min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent of the overlap between the character and the physical structure; similarly, min(y_nm, y_mm) is the smaller of the two bottom-edge y coordinates and max(y_nn, y_mn) the larger of the two top-edge y coordinates, so that min(y_nm, y_mm) - max(y_nn, y_mn) gives the vertical extent of the overlap with the physical-structure frame.
Then, from X_overlap and Y_overlap, the maximum character structured area S_overlap is calculated as shown in formula (7):
S_overlap = X_overlap × Y_overlap (7);
Finally, since the character structured area S_overlap is much smaller than the physical structure area, the overlap ratio Ratio_overlap is compared to cluster the characters and thereby read the text content, as shown in formula (8):
[formula image (8): expression for the overlap ratio Ratio_overlap in terms of S_overlap and the character corner coordinates (x_nn, y_nn), (x_nm, y_nm)]
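The character clustering step can be illustrated with the following minimal Python sketch of formulas (5)-(8); since the expression of formula (8) is reproduced only as an image, the denominator of the ratio is assumed here to be the character's own box area (x_nm - x_nn) × (y_nm - y_nn):

```python
def overlap_ratio(char_box, struct_box):
    # each box is (x_top_left, y_top_left, x_bottom_right, y_bottom_right)
    xnn, ynn, xnm, ynm = char_box
    xmn, ymn, xmm, ymm = struct_box
    x_overlap = max(0, min(xnm, xmm) - max(xnn, xmn))    # formula (5)
    y_overlap = max(0, min(ynm, ymm) - max(ynn, ymn))    # formula (6)
    s_overlap = x_overlap * y_overlap                    # formula (7)
    char_area = (xnm - xnn) * (ynm - ynn)                # assumed denominator
    return s_overlap / char_area if char_area else 0.0   # formula (8), assumed form

# characters whose boxes fall largely inside the same physical-structure box
# (ratio close to 1) are clustered into the same word/sentence region
print(overlap_ratio((10, 10, 20, 22), (8, 8, 40, 30)))   # 1.0
```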
and step 3: and acquiring a self-defined multi-class entity keyword dictionary list in the sentence text according to the document context language segment.
Let the multiclass entity keyword dictionary list dataset be denoted as ER, which is a set of records for entity classes, denoted ER ═ ER1,er2,...,erNN represents the number of entity keyword dictionary lists in the ER;
the record of the entity class is a multi-attribute tuple, and the nth (1 ≦ N ≦ N) tuple in the ER is represented as ERn=(timn,geon,natn,orgn,pern) Wherein timnRepresenting the time of existence of an entity, geonRepresenting the geographical location, nat, of the discovered entitynRepresenting devicenCorresponding entity name inScale, orgnRepresenting an organization of discovered entities, pernRepresenting the entity finder.
Step 4: keywords with the same part of speech in the text are identified according to the keyword types with a cosine similarity measurement analysis technique, generating the unstructured multi-classification text.
A large text corpus is used to compute the probability of relationships between documents and terms; words of the same meaning produce similar text, i.e. the same kind of part of speech. Subsequently, using cosine similarity measurements, text mining is performed on the given dictionary document and the database to determine semantic proximity and word vectors. er_n is expressed in the word-vector space as vc_n = (B-vc_n, I-vc_n), where B-vc_n represents the beginning of the multi-attribute tuple in the word vector space and I-vc_n its middle position, and regular expressions are used to extend the er_n attribute tuples with expressions over the English characters [A-Z] and the numeric characters [0-9]. In these expressions, '?' matches the preceding expression at most once, '*' matches the preceding expression any number of times, '^' matches from the current (start) position, and '$' matches the end of the expression. For example, the time part of speech VB has three expressions, as shown in formulas (9)-(11):
VB1 = r'^~?[0-9]' + r'^[A-Z].*$' (9);
VB2 = r'^±' + r'^~?[0-9]+(.[0-9]+)?$' + r'.*' (10);
VB3 = r'^~?[0-9]+(.[0-9]+)?$' + r'and$' + r'.*' (11);
The process for formula (9) is as follows:
The first step: '~?' matches sentences that contain the symbol as well as sentences without it.
The second step: if the first step is satisfied, [0-9] matches any one digit from 0 to 9.
The third step: if the second step is satisfied, [A-Z] matches any letter from A to Z.
The fourth step: if the third step is satisfied, '.*$' indicates that the letters of the third step may be matched multiple times up to the end of the sentence, e.g. -9 Ma or 9 Ma.
The process for formula (10) is as follows:
The first step: '^±' matches sentences that contain the ± symbol.
The second step: if the first step is satisfied, '~?' matches sentences with or without the symbol.
The third step: if the second step is satisfied, [0-9] matches any one digit from 0 to 9.
The fourth step: if the third step is satisfied, '(.[0-9]+)?' matches a decimal point first and then any digits from 0 to 9.
The fifth step: if the fourth step is satisfied, '.*' matches the preceding expression any number of times, e.g. ±9.38 or ±9.1.
The process for formula (11) is as follows:
The first step: '~?' matches sentences with or without the symbol.
The second step: if the first step is satisfied, [0-9] matches any one digit from 0 to 9.
The third step: if the second step is satisfied, '(.[0-9]+)?' matches a decimal point first and then any digits from 0 to 9.
The fourth step: if the third step is satisfied, 'and$' matches sentences containing 'and'.
The fifth step: if the fourth step is satisfied, '.*' matches the preceding expression any number of times, e.g. -1 and 2 or 1.5 and 1.68.
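Because the exact patterns of formulas (9)-(11) are only partially legible in the published text, the following Python sketch re-creates three approximate patterns that behave like the walk-throughs above on the quoted examples (-9 Ma, ±9.38, 1.5 and 1.68); the patterns themselves are assumptions made for illustration, not the patent's literal expressions:

```python
import re

VB1 = re.compile(r'^[~±-]?[0-9]+\s*[A-Za-z]+$')                      # e.g. "-9 Ma", "9 Ma"
VB2 = re.compile(r'^±\s*[0-9]+(\.[0-9]+)?$')                         # e.g. "±9.38", "±9.1"
VB3 = re.compile(r'^[0-9]+(\.[0-9]+)?\s+and\s+[0-9]+(\.[0-9]+)?$')   # e.g. "1.5 and 1.68"

for s in ["-9 Ma", "±9.38", "1.5 and 1.68", "turbidite"]:
    tags = [name for name, pat in (("VB1", VB1), ("VB2", VB2), ("VB3", VB3))
            if pat.match(s)]
    print(s, "->", tags or ["no time expression"])
```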
The highest cosine similarity value cos(θ) relates the components vc_i (1 ≤ i ≤ m) of vc_n to the components wc_i (1 ≤ i ≤ m) of the phrase word-vector attribute wc_n of each text sentence, where vc_i denotes the ith word vector variable among the m word vectors and wc_i the ith text-sentence phrase word vector variable among the m text-sentence phrase word vectors, as shown in formula (12); when cos(θ) equals 1, the variables point to the same word-vector space, and since they point to the same space, the given dictionary document is successfully matched with the database and the words required from the database are found in the document;
cos(θ) = Σ_{i=1..m}(vc_i × wc_i) / (sqrt(Σ_{i=1..m} vc_i²) × sqrt(Σ_{i=1..m} wc_i²)) (12)
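A minimal Python sketch of formula (12) is shown below; the three-dimensional vectors are toy values chosen only to illustrate that parallel vectors yield a similarity of 1:

```python
import math

def cosine_similarity(vc, wc):
    # cos(theta) between a dictionary-entry word vector vc and a
    # text-sentence phrase word vector wc, as in formula (12)
    dot = sum(a * b for a, b in zip(vc, wc))
    norm = math.sqrt(sum(a * a for a in vc)) * math.sqrt(sum(b * b for b in wc))
    return dot / norm if norm else 0.0

# toy vectors pointing in the same direction give a similarity of (about) 1,
# i.e. the dictionary entry is matched in the sentence
print(cosine_similarity([0.2, 0.4, 0.4], [0.1, 0.2, 0.2]))   # ~1.0
```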
and 5: and outputting and generating an unstructured multi-classification text through cos (theta), traversing the multi-classification text to perform multi-path matching on classification entities respectively, recording entity label attributes, and generating a large sample training data set.
Go the text sentence on ernClassification matching, if the text sentence can not be matched, ernThen, it is represented as O; er (a)nThe sentence entity sets corresponding to all the multi-classification structure subsets (here, the multi-classification structure subset and the former er)nSet of entity keyword dictionary lists representing multiple categories have the same meaning), denoted as ERn={B-ern,O,I-ern} to generate training set data with labeled BIO; wherein B-ernExpressed as the first letter of the beginning of the sentence entity, I-ernThe expression is the expression mode of each residual letter except the first letter, and the generated data set is expressed as B-tim, B-geo, B-nat, O, I-tim, I-geo, I-nat and B-ernComprises B-tim, B-geo, B-nat, I-ernIncluding I-tim, I-geo, I-nat.
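The multi-path matching that produces the BIO-labeled training data can be sketched in Python as follows; the dictionary entries and the sentence are hypothetical examples:

```python
def bio_tag(tokens, dictionary):
    # dictionary maps an entity class (tim / geo / nat / ...) to a set of
    # keyword token tuples; tokens outside any matched entity keep label O
    tags = ["O"] * len(tokens)
    for cls, keywords in dictionary.items():
        for kw in keywords:
            n = len(kw)
            for i in range(len(tokens) - n + 1):
                if tuple(tokens[i:i + n]) == kw and tags[i] == "O":
                    tags[i] = "B-" + cls
                    for j in range(i + 1, i + n):
                        tags[j] = "I-" + cls
    return list(zip(tokens, tags))

d = {"nat": {("turbidite",)}, "geo": {("Sichuan", "Basin")}}
print(bio_tag("The turbidite in the Sichuan Basin".split(), d))
# [('The', 'O'), ('turbidite', 'B-nat'), ('in', 'O'), ('the', 'O'),
#  ('Sichuan', 'B-geo'), ('Basin', 'I-geo')]
```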
Step 6: NER pre-training is carried out using BiLSTM combined with CRF to identify the sedimentology entities, so that valuable hotspots are screened out according to the context information.
First, the transition matrix in the CRF is used to avoid multiple consecutive B-er_n labels, with the CRF as the output layer of the BiLSTM, as shown in equation (13): for each input ER_n the corresponding predicted label PL_n is eventually obtained, and the probability that the input ER_n is continuously predicted correctly is Score(ER_n, PL_n) (for example, if the input label is O, the corresponding output label is obtained and the probability that consecutive inputs are O is predicted), where P_(ER_i, PL_i) is the probability (between 0 and 1) that the ith position is output as PL_i and A_(PL_i, PL_{i+1}) is the transition probability from PL_i to PL_{i+1};
Score(ER_n, PL_n) = Σ_{i=1..R} P_(ER_i, PL_i) + Σ_{i=1..R-1} A_(PL_i, PL_{i+1}) (13)
Consecutive inputs of O or of I-er_n are correct, whereas three consecutive occurrences of B-er_n are incorrect. Regarding the transition probability, in the BIO tagging above an error may occur by jumping directly from O to I-er_n, and normalizing I-er_n to O is a similar error.
Then, for each ER_n, the scores Score(ER_n, PL_n) of all PL_n are found, and the Viterbi algorithm is used to perform the probability normalization P_(PL_n|ER_n) of the output labels given the input labels, so that the text data is mined as shown in equation (14):
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'} exp(Score(ER_n, PL')) (14)
where exp(Score(ER_i, PL_i)) denotes the exponential of the continuously correct prediction probability for the ith input label ER_i, and the denominator also sums the exponentials of the continuous probabilities of the mispredicted label sequences for ER_i, i.e. the rate of obtaining a wrong output label for the ith input label (for example, if the original ER_n yields the output label PL_i with a correctness probability of 0.7, the probability that it is incorrect is 0.3).
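The scoring and normalization of equations (13)-(14) follow the standard BiLSTM-CRF formulation; a brute-force Python sketch is given below for illustration (a full implementation would use the forward algorithm and Viterbi decoding rather than enumerating all paths). The emission and transition values are hypothetical:

```python
from itertools import product
import math

def path_score(emissions, transitions, tags):
    # sum of emission scores P[i][tag_i] and transition scores A[tag_i][tag_{i+1}]
    s = sum(emissions[i][t] for i, t in enumerate(tags))
    s += sum(transitions[tags[i]][tags[i + 1]] for i in range(len(tags) - 1))
    return s

def normalized_probability(emissions, transitions, tags):
    # P(PL | ER): exp(score of the given tag path) divided by the sum of
    # exp(score) over every possible tag path (brute force for illustration)
    n, k = len(emissions), len(emissions[0])
    z = sum(math.exp(path_score(emissions, transitions, list(p)))
            for p in product(range(k), repeat=n))
    return math.exp(path_score(emissions, transitions, tags)) / z

# toy example: 3 tokens, 2 tags (0 = O, 1 = B-nat); all values are hypothetical
E = [[1.0, 0.2], [0.1, 2.0], [0.8, 0.3]]   # BiLSTM emission scores
A = [[0.5, 0.1], [0.2, 0.0]]               # tag transition scores
print(normalized_probability(E, A, [0, 1, 0]))
```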
The idea of the invention is as follows: first, according to the lowest flow limit and the expected value of the expected download time, research and conference files containing sedimentology-related content on websites are read in a distributed manner using robotic process automation; then, using machine vision, the geometric attributes and text attributes of the studied content objects are acquired, and a heuristic algorithm judges the types of the content objects to obtain the physical structure and logical structure of the document; further, the context language segments of the document are analyzed, and a user-defined multi-class entity keyword dictionary list is acquired from the sentence text; on this basis, a cosine similarity measurement analysis technique identifies keywords with the same part of speech in the text according to the keyword types and generates an unstructured multi-classification text, which is then output; next, for the unstructured text, multi-path matching is performed on the classified entities respectively, entity label attributes are recorded, and a large sample training data set is generated; finally, on the large sample data set, a bidirectional long short-term memory neural network model combined with a conditional random field performs named entity recognition to realize recognition of the required entities, and the entities in the text are screened out for storage.
Example: this example selects sedimentology text matching data as the input data set for the experiment and selects TensorFlow as the simulation platform.
The parameters involved in the experimental environment are shown in table 1.
Table 1. Parameter settings involved in executing the method

Experimental parameter        Value
Beginning of substance        B-nat
Middle of substance           I-nat
Beginning of time             B-tim
Middle of time                I-tim
Beginning of location         B-geo
Middle of location            I-geo
Others                        O
Number of data set records    274292
FIG. 2 shows the accuracy of the training set test with the BiLSTM model combined with the conditional random field CRF according to the present invention.
It should be noted that terms such as "upper", "lower", "left", "right", "front" and "back" used in the present invention are for clarity of description only and are not intended to limit the implementable scope of the present invention; changes or adjustments of their relative relationships, without substantial change of the technical content, shall also be regarded as within the scope of the present invention.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims (7)

1. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules is characterized by comprising the following steps:
step 1: downloading files containing sedimentology-related content by using RPA to read websites in a distributed manner, according to the lowest flow limit and the expected value of the expected download time;
step 2: identifying the file downloaded in the step 1 according to machine vision so as to obtain the geometric attribute and the text attribute of the content object, judging the type of the content object by a heuristic algorithm to obtain the physical structure and the logical structure of the document, and identifying the text content;
step 3: analyzing context language segments of the text content to obtain a dictionary list of user-defined multi-class entity keywords in the text content;
step 4: recognizing entity keywords with the same part of speech in the text content according to the entity keyword types, using the text content obtained in step 2, the dictionary list obtained in step 3 and a cosine similarity measurement analysis technique, to generate an unstructured multi-classification text;
step 5: performing multi-path matching on the classified entities respectively with the unstructured multi-classification text output in step 4, recording entity label attributes, and generating a large sample training data set;
step 6: performing NER pre-training with BiLSTM combined with CRF on the large sample data set generated in step 5 to realize entity recognition on the readable text content of the document in step 2, and screening entity keywords according to the context.
2. The sedimentology document mining method based on NLP information extraction and part-of-speech rules according to claim 1, wherein:
when a task requests a network service system to download files, the network platform generates an access request record containing the local network IP address and the expected download time, the non-intrusive characteristic of RPA is used to access the communication system and provide cross access to related sedimentology research content hotspots, and multiple IP addresses are cross-copied to the server;
when the server receives a download request in any time interval, the response time T of a single download task is calculated as:
T = t_departure - t_arrival
where t_departure denotes the request completion (departure) time and t_arrival denotes the request arrival time; the exponential random variable of the single-download-task response time T is e.r.v, where e represents the expected value of the expected download time under the single-download-task response time T, r represents the proportion of the server's download traffic in the total bandwidth, and v represents the speed of the download hotspot;
when the download request is granted and the download hotspot is cross-accessed, the download request is carried out within any period x ≤ TimeN ≤ y, and the response time of an individual download request is β_n; the minimum response time of the download request, β_(x,y), is therefore given by:
[formula image: expression for the minimum response time β_(x,y)]
where x is the lowest download time, y is the highest download time, and TimeN is any download time in the period;
the expected value E[T_(r,t)] of the expected download time under the single-download-task response time T is given by:
[formula image: expression for the expected value E[T_(r,t)]]
where β(t+1, 1/r) represents the minimum response time from time t+1 to time 1/r, μ represents the response rate between the different download request instructions and the server, and E[T_(r,t)] and e both denote the expected value of the expected download time under the single-download-task response time T;
a service IP address satisfying the expected value E[T_(r,t)] is selected for multi-path downloading.
3. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 1, wherein the specific content of recognizing the text content is:
identifying the document using machine vision, in which (x_nn, y_nn) represents the coordinates of the upper left corner of a character and (x_nm, y_nm) the coordinates of its lower right corner, while in the physical structure (x_mn, y_mn) represents the coordinates of the upper left corner and (x_mm, y_mm) the coordinates of the lower right corner; the space area X_overlap between characters and the physically set threshold area Y_overlap are calculated as follows:
X_overlap = max(0, min(x_nm, x_mm) - max(x_nn, x_mn));
Y_overlap = max(0, min(y_nm, y_mm) - max(y_nn, y_mn));
where min(x_nm, x_mm) is the smaller of the two right-edge x coordinates of the selected character frame and the physical-structure frame, max(x_nn, x_mn) is the larger of their two left-edge x coordinates, and min(x_nm, x_mm) - max(x_nn, x_mn) gives the horizontal extent of the overlap between the frames; likewise, min(y_nm, y_mm) is the smaller of the two bottom-edge y coordinates, max(y_nn, y_mn) the larger of the two top-edge y coordinates, and min(y_nm, y_mm) - max(y_nn, y_mn) the vertical extent of the overlap;
from X_overlap and Y_overlap, the maximum character structured area S_overlap is obtained as:
S_overlap = X_overlap × Y_overlap
since the character structured area S_overlap is much smaller than the physical structure area, the overlap ratio Ratio_overlap is compared to cluster characters into words, words into sentences and sentences into paragraphs, thereby reading the text content; Ratio_overlap is expressed as:
[formula image: expression for the overlap ratio Ratio_overlap]
where x_nn is the x-axis coordinate of the upper left corner of a character in the document, x_nm the x-axis coordinate of its lower right corner, y_nn the y-axis coordinate of its upper left corner, and y_nm the y-axis coordinate of its lower right corner.
4. The sedimentology document mining method based on NLP information extraction and part-of-speech rules according to claim 1, wherein:
analyzing the context language segments of the text content to obtain dictionary lists of user-defined multi-class entity keywords in the text sentences, integrated into a dictionary list data set ER = {er_1, er_2, ..., er_N}, where er_1 is the entity keyword dictionary list of the first category, er_2 the entity keyword dictionary list of the second category, and er_N the entity keyword dictionary list of the Nth category;
each category record of the entity keyword dictionary list is a multi-attribute tuple, and the tuple of the entity keyword dictionary list of the nth category in ER is expressed as ER_n = (tim_n, geo_n, nat_n, org_n, per_n), where 1 ≤ n ≤ N, tim_n represents the time at which the entity keyword exists, geo_n the geographic location where the entity keyword was discovered, nat_n the name of the corresponding entity keyword, org_n the organization that discovered the entity keyword, and per_n the discoverer of the entity keyword.
5. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 4, wherein:
calculating the relation probability between the text content and terms using a large text corpus, and setting words with the same meaning as the same part of speech;
using cosine similarity measurement to perform text mining on the given entity keyword dictionary list in the large text corpus and determine the semantic proximity and word vectors of the entity keywords; er_n is denoted in the word-vector space as vc_n = (B-vc_n, I-vc_n), where B-vc_n represents the beginning position of the multi-attribute tuple in the word vector space and I-vc_n the middle position, and regular expressions are used to extend the er_n attribute tuples with expressions over the English characters [A-Z] and the numeric characters [0-9];
the cosine similarity cos(θ) is calculated as:
cos(θ) = Σ_{i=1..m}(vc_i × wc_i) / (sqrt(Σ_{i=1..m} vc_i²) × sqrt(Σ_{i=1..m} wc_i²))
where vc_i denotes the ith word vector variable among a total of m word vectors and wc_i the ith text-sentence phrase word vector variable among a total of m text-sentence phrase word vectors; when the cosine similarity cos(θ) equals 1, the word required by the text corpus can be found in the entity keyword dictionary list corresponding to vc_i through wc_i, realizing mining of the entity keyword dictionary list;
after the entity keyword dictionary list has been mined with the text corpus, the relevant contents are extracted from the text content, thereby generating the unstructured multi-classification text.
6. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 5, wherein:
the text sentences of the unstructured multi-classification text are matched against er_n by category, and any part of a text sentence that cannot be matched to er_n is represented as O;
the text-sentence entity sets matched to all multi-classification structure subsets of er_n are expressed as ER_n = {B-er_n, O, I-er_n}, thereby generating a training data set with BIO labels.
7. The sedimentology literature mining method based on NLP information extraction and part-of-speech rules according to claim 6, wherein:
let the CRF be the output layer of the BiLSTM; for each input label ER_n, the corresponding output label PL_n is obtained, and the probability that the input ER_n is continuously predicted correctly is Score(ER_n, PL_n), expressed as:
Score(ER_n, PL_n) = Σ_{i=1..R} P_(ER_i, PL_i) + Σ_{i=1..R-1} A_(PL_i, PL_{i+1})
where R is the total number of labels in the training data set, P_(ER_i, PL_i) is the probability that the ith input label ER_i is output as PL_i, and A_(PL_i, PL_{i+1}) is the transition probability from PL_i to PL_{i+1};
after the continuously correct probabilities Score(ER_n, PL_n) of all input labels ER_n have been determined, the Viterbi algorithm is used to perform the probability normalization P_(PL_n|ER_n) on the input labels ER_n and output labels PL_n so as to complete training and mine the text data, where the probability normalization P_(PL_n|ER_n) is expressed as:
P_(PL_n|ER_n) = exp(Score(ER_n, PL_n)) / Σ_{PL'} exp(Score(ER_n, PL'))
where exp(Score(ER_i, PL_i)) denotes the exponential (index value) of the continuously correct prediction probability for the ith input label ER_i, and the terms exp(Score(ER_i, PL')) in the denominator are the exponentials of the continuous probabilities of the mispredicted output labels for the ith input label ER_i.

Priority Applications (1)

Application Number: CN202110818775.XA
Priority Date: 2021-07-20
Filing Date: 2021-07-20
Title: Sedimentology literature mining method based on NLP information extraction and part-of-speech rules (granted as CN113468890B)

Applications Claiming Priority (1)

Application Number: CN202110818775.XA
Priority Date: 2021-07-20
Filing Date: 2021-07-20
Title: Sedimentology literature mining method based on NLP information extraction and part-of-speech rules (granted as CN113468890B)

Publications (2)

Publication Number Publication Date
CN113468890A 2021-10-01
CN113468890B (en) 2023-05-26

Family

ID=77881608

Family Applications (1)

Application Number: CN202110818775.XA (Active; granted as CN113468890B)
Title: Sedimentology literature mining method based on NLP information extraction and part-of-speech rules

Country Status (1)

Country Link
CN (1) CN113468890B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117061A (en) * 2021-10-27 2022-03-01 南京信息工程大学 River facies knowledge graph reverse-deducing method based on data mining and tree structure
CN114625885A (en) * 2022-03-07 2022-06-14 南京信息工程大学 Entity dependency extraction and identification method, system and device based on NLP and trigger and storage medium
CN117076703A (en) * 2023-10-11 2023-11-17 中邮消费金融有限公司 Automatic card structured information extraction technical method and system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109558569A (en) * 2018-12-14 2019-04-02 昆明理工大学 A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model
CN109672613A (en) * 2018-12-12 2019-04-23 北京数码视讯软件技术发展有限公司 Adaptive access method, apparatus and electronic equipment
KR20200044176A (en) * 2018-10-05 2020-04-29 동아대학교 산학협력단 System and Method for Korean POS Taging Using the Concatenation of Jamo and Sylable Embeding
CN111950287A (en) * 2020-08-20 2020-11-17 广东工业大学 Text-based entity identification method and related device
CN112417880A (en) * 2020-11-30 2021-02-26 太极计算机股份有限公司 Court electronic file oriented case information automatic extraction method
CN112632228A (en) * 2020-12-30 2021-04-09 深圳供电局有限公司 Text mining-based auxiliary bid evaluation method and system
CN112801010A (en) * 2021-02-07 2021-05-14 华南理工大学 Visual rich document information extraction method for actual OCR scene
CN112817561A (en) * 2021-02-02 2021-05-18 山东省计算中心(国家超级计算济南中心) Structured extraction method and system for transaction function points of software requirement document

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200044176A (en) * 2018-10-05 2020-04-29 동아대학교 산학협력단 System and Method for Korean POS Taging Using the Concatenation of Jamo and Sylable Embeding
CN109672613A (en) * 2018-12-12 2019-04-23 北京数码视讯软件技术发展有限公司 Adaptive access method, apparatus and electronic equipment
CN109558569A (en) * 2018-12-14 2019-04-02 昆明理工大学 A kind of Laotian part-of-speech tagging method based on BiLSTM+CRF model
CN111950287A (en) * 2020-08-20 2020-11-17 广东工业大学 Text-based entity identification method and related device
CN112417880A (en) * 2020-11-30 2021-02-26 太极计算机股份有限公司 Court electronic file oriented case information automatic extraction method
CN112632228A (en) * 2020-12-30 2021-04-09 深圳供电局有限公司 Text mining-based auxiliary bid evaluation method and system
CN112817561A (en) * 2021-02-02 2021-05-18 山东省计算中心(国家超级计算济南中心) Structured extraction method and system for transaction function points of software requirement document
CN112801010A (en) * 2021-02-07 2021-05-14 华南理工大学 Visual rich document information extraction method for actual OCR scene

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
K. E. Ravikumar et al., "BELMiner: adapting a rule-based relation extraction system to extract biological expression language statements from bio-medical literature evidence sentences", Database.
刘炜 et al., "A method for automatic annotation of text corpora oriented to emergencies", Journal of Chinese Information Processing (中文信息学报).

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114117061A (en) * 2021-10-27 2022-03-01 南京信息工程大学 River facies knowledge graph reverse-deducing method based on data mining and tree structure
CN114625885A (en) * 2022-03-07 2022-06-14 南京信息工程大学 Entity dependency extraction and identification method, system and device based on NLP and trigger and storage medium
CN114625885B (en) * 2022-03-07 2024-10-18 南京信息工程大学 Entity affiliation extraction and identification method, system, device and storage medium based on NLP and trigger
CN117076703A (en) * 2023-10-11 2023-11-17 中邮消费金融有限公司 Automatic card structured information extraction technical method and system
CN117076703B (en) * 2023-10-11 2024-02-06 中邮消费金融有限公司 Automatic card structured information extraction technical method

Also Published As

Publication number Publication date
CN113468890B (en) 2023-05-26

Similar Documents

Publication Publication Date Title
CN110298033B (en) Keyword corpus labeling training extraction system
CN109858010B (en) Method and device for recognizing new words in field, computer equipment and storage medium
CN107992585B (en) Universal label mining method, device, server and medium
US8156053B2 (en) Automated tagging of documents
CN110727779A (en) Question-answering method and system based on multi-model fusion
CN112100356A (en) Knowledge base question-answer entity linking method and system based on similarity
CN113468890B (en) Sedimentology literature mining method based on NLP information extraction and part-of-speech rules
CN110162771B (en) Event trigger word recognition method and device and electronic equipment
CN112101040A (en) Ancient poetry semantic retrieval method based on knowledge graph
CN114254653A (en) Scientific and technological project text semantic extraction and representation analysis method
CN110728151B (en) Information depth processing method and system based on visual characteristics
CN112818093A (en) Evidence document retrieval method, system and storage medium based on semantic matching
KR20220134695A (en) System for author identification using artificial intelligence learning model and a method thereof
CN111274822A (en) Semantic matching method, device, equipment and storage medium
CN107357765A (en) Word document flaking method and device
CN113961666A (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN113947086A (en) Sample data generation method, training method, corpus generation method and apparatus
CN114840685A (en) Emergency plan knowledge graph construction method
CN118132719A (en) Intelligent dialogue method and system based on natural language processing
CN116644148A (en) Keyword recognition method and device, electronic equipment and storage medium
CN113946668A (en) Semantic processing method, system and device based on edge node and storage medium
CN113656429A (en) Keyword extraction method and device, computer equipment and storage medium
CN116955534A (en) Intelligent complaint work order processing method, intelligent complaint work order processing device, intelligent complaint work order processing equipment and storage medium
WO2023083176A1 (en) Sample processing method and device and computer readable storage medium
Eswaraiah et al. A Hybrid Deep Learning GRU based Approach for Text Classification using Word Embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant