CN112052305A - Information extraction method and device, computer equipment and readable storage medium - Google Patents

Publication number
CN112052305A
Authority
CN
China
Prior art keywords
information
data
processed
extracted
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010909654.1A
Other languages
Chinese (zh)
Inventor
李贤杰
王昊
罗水权
刘剑
高寒冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Asset Management Co Ltd
Original Assignee
Ping An Asset Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Asset Management Co Ltd
Priority to CN202010909654.1A
Publication of CN112052305A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an information extraction method, an information extraction device, computer equipment and a readable storage medium, which relate to the field of artificial intelligence. The method comprises the following steps: acquiring a text to be processed, and analyzing the text to be processed to obtain data to be processed; providing a preset database, wherein the database comprises a plurality of keywords; matching the text to be processed in the database to obtain a keyword corresponding to the text to be processed as an information point to be extracted; and extracting key information from the data to be processed by adopting a pre-trained information extraction model based on the information point to be extracted, to obtain target information. Because the key information in the text to be processed is extracted by a machine-learning-based information extraction model, the method solves the problems that a large number of regular expressions need to be written manually when key content is extracted from information, and that later maintenance and use are time-consuming and labor-intensive. The invention also relates to blockchain technology, and the method can be applied to the financial technology field.

Description

Information extraction method and device, computer equipment and readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an information extraction method, an information extraction device, computer equipment and a readable storage medium.
Background
In the financial field, a large amount of unstructured document data exists, such as prospectuses for bonds, stocks and shares. Most of these documents are long and contain many types of information, so the extraction process is error-prone. In the prior art, the key content in the information is usually extracted with regular expressions: a different regular expression is written for each pattern of information. Because each regular expression is written for information of a specific pattern, a specially customized regular expression can extract the required content quickly and accurately from a small amount of information of that pattern.
However, regular expressions cannot exhaust the various special cases of information. When a large number of information texts in different patterns must be handled, a large number of regular expressions need to be written manually; writing them requires a certain level of expertise and is time-consuming and labor-intensive.
Disclosure of Invention
The invention aims to provide an information extraction method, an information extraction device, computer equipment and a readable storage medium, which are used for solving the problems that a large number of regular expressions need to be written manually when key contents in information are extracted, and time and labor are wasted in later-period maintenance and use.
In order to achieve the above object, the present invention provides an information extraction method, including:
acquiring a text to be processed, and analyzing the text to be processed to obtain data to be processed;
matching the data to be processed with keywords in a preset database, and taking the matched keywords corresponding to the text to be processed as information points to be extracted, wherein the database is pre-stored with a plurality of keywords;
and extracting key information of the data to be processed by adopting a pre-trained information extraction model based on the information points to be extracted to obtain target information.
Further, analyzing the text to be processed to obtain data to be processed, further comprising:
carrying out normalization processing on the text to be processed to obtain a first processed text;
and matching a corresponding analysis rule according to the type of the text to be processed, and processing the first processed text by adopting the analysis rule to obtain at least one piece of first processed data.
Further, performing normalization processing on the text to be processed to obtain a first processed text;
and matching a corresponding analysis rule according to the type of the text to be processed, and processing the first processed text by adopting the analysis rule to obtain data to be processed.
Further, before extracting key information from the data to be processed based on the information points to be extracted by using a pre-trained information extraction model, the method further includes:
acquiring a training sample, wherein the training sample comprises a training data set of information to be extracted, sample information points to be extracted and sample extraction information;
splicing the to-be-extracted sample information with each training data to obtain first sample data corresponding to each training data;
coding each first sample data, classifying the coded first sample data by adopting a first neural network to obtain a classification result of the first sample data, and obtaining second sample data containing sample extraction information based on the classification result;
splicing sample information points to be extracted and sample data, and then coding to obtain a processing vector;
determining target sample data corresponding to the to-be-extracted sample information point by adopting a classifier based on the processing vector;
marking the first sample data according to the sample extraction information to obtain sample comparison data; and after comparing the sample comparison data with the classification result and comparing the target sample data with the sample extraction information, adjusting model parameters until the training is finished to obtain a pre-trained information extraction model.
Further, extracting key information of the data to be processed by adopting a pre-trained information extraction model based on the information points to be extracted to obtain target information, and the method comprises the following steps:
positioning target key information based on the information point to be extracted to obtain second processing data containing the target key information;
extracting information of the second processing data to obtain target key information corresponding to the information point to be extracted;
and splicing the target key information and the second processing data to obtain target information.
Further, locating the target key information based on the information point to be extracted to obtain second processing data including the target key information, including:
splicing the information points to be extracted and the data to be processed and then coding to obtain third processed data;
and processing by adopting a first neural network based on the third processing data to obtain second processing data containing the target key information.
Further, extracting information from the second processed data to obtain target key information corresponding to the information point to be extracted, including:
splicing the information points to be extracted and the second processing data and then coding to obtain fourth processing data;
and predicting a sequence index of the target key information based on the fourth processing data by adopting a classifier, and acquiring the target key information corresponding to the information point to be extracted based on the sequence index.
Further, after extracting key information of the data to be processed by using a pre-trained information extraction model based on the information points to be extracted to obtain target information, the method further includes:
providing a preset synonym library, wherein the synonym library comprises a plurality of synonyms related to each keyword in the database;
when the target key information in the target information is empty;
obtaining synonyms associated with the information points to be extracted from the synonym library as candidate information points;
replacing the information points to be extracted with the candidate information points to obtain new information points to be extracted;
and extracting key information of the data to be processed by adopting an information extraction model based on the new information point to be extracted to obtain target information.
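The synonym fallback described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `extract` is a hypothetical callable standing in for the information extraction model, `synonyms` is a hypothetical dictionary standing in for the synonym library, and trying the candidates one after another is an assumption about how several synonyms would be handled.

```python
def extract_with_synonyms(point, data, extract, synonyms):
    """Run extraction; if the result is empty, retry with synonyms of the point.

    `extract(point, data)` is a hypothetical stand-in for the trained model and
    is assumed to return an empty value when nothing is found.
    """
    result = extract(point, data)
    if result:
        return point, result
    # The target key information is empty: replace the information point with
    # each candidate from the synonym library and extract again.
    for candidate in synonyms.get(point, []):
        result = extract(candidate, data)
        if result:
            return candidate, result
    return point, None
```

The function returns the information point that finally produced a result together with the result, so a caller can see whether a synonym was used.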
In order to achieve the above object, the present invention also provides an information extracting apparatus, comprising:
the preprocessing module is used for acquiring a text to be processed, analyzing the text to be processed and acquiring data to be processed;
the information point to be extracted determining module is used for matching the text to be processed in the database according to the text to be processed to obtain a keyword corresponding to the text to be processed as an information point to be extracted;
and the processing module is used for extracting key information of the data to be processed by adopting a pre-trained information extraction model based on the information points to be extracted to obtain target information.
To achieve the above object, the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the above information extraction method when executing the computer program.
In order to achieve the above object, the present invention further provides a computer-readable storage medium comprising a plurality of storage media, each storage medium having a computer program stored thereon, the computer programs stored in the storage media collectively implementing the steps of the above information extraction method when executed by a processor.
According to the information extraction method, the information extraction device, the computer equipment and the readable storage medium provided by the invention, the text to be processed is analyzed to obtain data to be processed that the information extraction model can process, the information points to be extracted corresponding to the text to be processed are obtained by matching in the database, and key information is extracted from the data to be processed by adopting the information extraction model. Extraction of the key information in the text to be processed is thereby realized with a machine-learning-based information extraction model, which solves the problems that a large number of regular expressions need to be written manually when key content in information is extracted, and that later maintenance and use are time-consuming and labor-intensive.
Drawings
FIG. 1 is a flowchart of a first embodiment of an information extraction method according to the present invention;
fig. 2 is a flowchart of analyzing the text to be processed to obtain data to be processed in the first embodiment of the information extraction method according to the present invention;
fig. 3 is a flowchart of a training process for training the information extraction model before extracting key information according to a first embodiment of the information extraction method of the present invention;
fig. 4 is a flowchart of extracting key information from the data to be processed by using a pre-trained information extraction model based on the information points to be extracted to obtain target information according to the first embodiment of the information extraction method of the present invention;
fig. 5 is a flowchart of processing the data to be processed based on the information points to be extracted to obtain second processed data including target key information corresponding to the information points to be extracted in the first embodiment of the information extraction method according to the present invention;
fig. 6 is a flowchart of information extraction performed on the second processed data to obtain target key information according to the first embodiment of the information extraction method of the present invention;
fig. 7 is a flowchart of the steps performed after extracting key information from the data to be processed by using a pre-trained information extraction model based on the information points to be extracted to obtain target information, according to the first embodiment of the information extraction method of the present invention;
FIG. 8 is a schematic diagram of program modules of a second embodiment of an information extraction apparatus according to the present invention;
fig. 9 is a schematic diagram of a hardware structure of a computer device according to a third embodiment of the present invention.
Reference numerals:
5. Information extraction device; 51. Preprocessing module; 52. Information point to be extracted determining module;
53. Execution module; 531. First processing unit; 532. Second processing unit;
533. Splicing unit; 54. Replacement module; 6. Computer device;
61. Memory; 62. Processor; 63. Network interface
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides an information extraction method, an information extraction device, computer equipment and a readable storage medium, which are suitable for the field of machine learning in artificial intelligence and are based on a preprocessing module, a to-be-extracted information point determining module and an execution module. The preprocessing module analyzes the text to be processed to obtain data to be processed that the information extraction model can process; the to-be-extracted information point determining module then obtains, by matching in the database, the information point to be extracted corresponding to the text to be processed; and the execution module then extracts key information from the data to be processed by adopting the information extraction model. Extraction of the key information in the text to be processed is thus realized based on machine learning, which solves the problems that a large number of regular expressions need to be written manually when key content in information is extracted and that later maintenance is time-consuming and labor-intensive. Regular expressions do not need to be written manually; later maintenance only requires adding labeled data to the training samples, and no rules need to be written, which reduces the maintenance difficulty for non-professional technicians. When the execution module extracts key information from the data to be processed, the first processing unit first classifies the data to be processed through the BiLSTM network of the information extraction model to locate the paragraph where the target key information is located, and the second processing unit then extracts information from that paragraph through the classifier of the information extraction model to obtain the target data. The training process improves the generalization capability of the information extraction model in information positioning and information extraction.
Example one
Referring to fig. 1, the information extraction method of this embodiment is applied to a server side and is used in scenarios where information is extracted from existing unstructured document data. The present patent proposes an information positioning and extraction scheme based on machine learning, which includes the following steps:
s100: acquiring a text to be processed, and analyzing the text to be processed to obtain data to be processed;
in this scheme, the text to be processed may be a document such as a share offering prospectus or a bond prospectus, or may be other unstructured document data, that is, data whose structure is irregular or incomplete, which has no predefined data model and which is not conveniently represented by a two-dimensional logical table of a database.
Specifically, the method for analyzing the text to be processed to obtain the data to be processed further includes the following steps, referring to fig. 2:
s110: carrying out normalization processing on the text to be processed to obtain a first processed text;
in the above embodiment, the normalization processing converts the text to be processed into a preset format, so that the resulting first processed text can be analyzed according to an analysis rule in the subsequent step S120. The normalization processing includes the following: converting the data in the text to be processed into a waterfall flow form, which has no page numbers; setting the minimum unit in the text to be processed to a paragraph, wherein each table or picture is converted into one paragraph; and dividing each paragraph in the text to be processed into at least one piece of data, each piece of data corresponding to a unique index number. More specifically, the chapter structure is extracted according to the identifiers of the chapter titles, such as "one", "two", "three", "1", "2", "3", "1.1", "1.2", "1.3", "1.1.1" and the like, and according to bold font, and each index is mapped to the corresponding primary title, secondary title, tertiary title and so on.
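The indexing and chapter-structure mapping described above might look roughly like the sketch below. It is only an illustration under assumptions: the `HEADING` pattern recognizes dotted numeric identifiers such as "1", "1.2" and "1.2.3" only, it ignores font information, and the function names and record layout are hypothetical.

```python
import re

# Hypothetical pattern: dotted numeric headings such as "1", "1.2", "1.2.3".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)[\s.、]")

def heading_level(paragraph):
    """Return the nesting depth of a numbered heading, or 0 for body text."""
    match = HEADING.match(paragraph)
    return len(match.group(1).split(".")) if match else 0

def index_paragraphs(paragraphs):
    """Give every paragraph an index and the chain of titles it sits under."""
    titles = []   # current title chain as (index, text) pairs, outermost first
    records = {}
    for index, text in enumerate(paragraphs):
        level = heading_level(text)
        if level:
            # A new heading closes every open heading at the same or deeper level.
            titles = titles[: level - 1] + [(index, text)]
        records[index] = {"text": text, "titles": list(titles)}
    return records
```

A body paragraph that happens to begin with a number would be misread as a heading by this pattern, which is one reason a real normalizer would also need the font cue mentioned in the text.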
S120: and matching a corresponding analysis rule according to the type of the text to be processed, and processing the first processed text by adopting the analysis rule to obtain data to be processed.
Specifically, the analysis rule is matched to the type of the text to be processed, and the analysis rules include, but are not limited to, the following:
(1) for a doc document, converting the doc document into a docx document by using the win32com component in a Windows environment, or extracting information such as text directly by using antiword in a Linux environment;
(2) for a docx document, analyzing the document by using the python-docx package;
(3) for a non-scanned pdf document, analyzing the document by using the fitz package;
(4) for a scanned pdf document, analyzing the document by using a commercially available OCR tool.
In addition to the four analysis rules above, texts in other forms can be analyzed by using tools or analysis rules commonly used in the prior art, and a data set is obtained after the analysis.
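The rule matching above amounts to a dispatch on document type. The sketch below only selects the name of the tool listed in the text and does not call any of those tools; the function name, the `scanned` flag and the `platform` argument are hypothetical, and how a scanned pdf is detected is left open.

```python
from pathlib import Path

def choose_parser(path, scanned=False, platform="linux"):
    """Pick the analysis rule for a document based on its type (illustrative)."""
    suffix = Path(path).suffix.lower()
    if suffix == ".doc":
        # Windows: convert to docx through win32com; Linux: read text with antiword.
        return "win32com" if platform == "windows" else "antiword"
    if suffix == ".docx":
        return "python-docx"
    if suffix == ".pdf":
        # Scanned pdf files carry images instead of text, so they need OCR.
        return "ocr" if scanned else "fitz"
    return "other"
```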
It should be noted that the obtained data to be processed includes each piece of data in the text to be processed. For ease of understanding, take the piece of data with index 486 as an example: "486": "4. XX sales organization × XX plan transaction structure and related party profile#&#461#&#|4. XX sales organization#&#486#&#". In this data, the leading number is the index number of the piece of data in the corresponding document; "4. XX sales organization" is the current piece of data; and "XX plan transaction structure and related party profile#&#461#&#|4. XX sales organization#&#486#&#" gives the chapters in which the current piece of data is located. The titles of the chapters are spliced with "|", so this piece of data has two titles: the primary title "XX plan transaction structure and related party profile#&#461#&#" and the secondary title "4. XX sales organization#&#486#&#". The "#&#" in a title splices the title with the index number at which the title is located, so the index numbers corresponding to the two titles are 461 and 486 respectively.
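To make the title encoding concrete, the following sketch splits a chapter string back into titles and index numbers. It assumes, following the description, that titles are joined with "|" and that "#&#" joins a title with its index number; the function name and the returned tuple layout are hypothetical.

```python
TITLE_SEP = "|"     # joins the titles of the enclosing chapters
INDEX_SEP = "#&#"   # joins a title with the index number of its paragraph

def parse_title_chain(chain):
    """Split a spliced chapter string into (title, index) pairs, outermost first."""
    titles = []
    for part in chain.split(TITLE_SEP):
        # "title#&#461#&#" splits into ["title", "461", ""]; drop the empty tail.
        fields = [field for field in part.split(INDEX_SEP) if field]
        if len(fields) == 2:
            title, index = fields
            titles.append((title.strip(), int(index)))
    return titles
```

Applied to the chapter string of the example with index 486, this yields the primary title with index 461 followed by the secondary title with index 486.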
Providing a preset database, wherein the database comprises a plurality of keywords;
it should be noted that the keywords are used for matching with the text to be processed and serving as information points to be extracted, and the sources of the keywords include, but are not limited to, titles, special paragraphs, and the like in the text to be processed, or words related to the field of the text to be processed.
S200: matching the text to be processed in the database to obtain a keyword corresponding to the text to be processed as an information point to be extracted;
it should be noted that one or more keywords may correspond to the text to be processed, and the matching may follow a preset rule such as, but not limited to, a mapping table. When there are multiple keywords, there are correspondingly multiple information points to be extracted, and the subsequent key information extraction with the information extraction model is performed once for each information point to be extracted.
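A minimal sketch of step S200 follows, assuming that matching means plain substring search of each database keyword in the parsed text. The patent leaves the matching rule open (a mapping table is mentioned as one option), so the function name and the matching strategy here are illustrative assumptions.

```python
def match_information_points(paragraphs, keyword_database):
    """Return the keywords from the database that occur in the parsed text."""
    text = "\n".join(paragraphs)
    # Keep database order so that downstream extraction runs are reproducible.
    return [keyword for keyword in keyword_database if keyword in text]
```

Each returned keyword becomes one information point to be extracted, and the extraction model would then be run once per keyword.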
S300: and extracting key information of the data to be processed by adopting a pre-trained information extraction model based on the information points to be extracted to obtain target information.
It should be emphasized that, in the present embodiment, an information extraction model is adopted to extract key information, which is different from a regular expression mode in the prior art, and meanwhile, the information extraction model performs target key information positioning based on the information point to be extracted to obtain a paragraph containing the target key information, and then performs target key information acquisition on the paragraph.
Before the key information is extracted, the information extraction model is trained. Referring to fig. 3, the training process includes the following steps:
s300-1: acquiring a training sample, wherein the training sample comprises a training data set of information to be extracted, sample information points to be extracted and sample extraction information;
it should be noted that the information point of the sample to be extracted and the sample extraction information are both marked in advance, and can be manually completed in advance or directly obtained from a database.
S300-2: splicing the to-be-extracted sample information with each training data to obtain first sample data corresponding to each training data;
specifically, splicing the sample information point to be extracted with each piece of training data means placing the sample information point to be extracted at the head of each piece of training data. For ease of understanding, take the data with index 486 obtained in S120 above as an example: if the information point to be extracted is "legal representative", the spliced data is presented as [ legal representative, "4. XX sales organization × XX plan transaction structure and related party profile#&#461#&#|4. XX sales organization#&#486#&#" ]. The splicing in the subsequent processing steps is performed in the same way.
S300-3: coding each first sample data, classifying the coded first sample data by adopting a first neural network to obtain a classification result of the first sample data, and obtaining second sample data containing sample extraction information based on the classification result;
specifically, the coding uses a BERT encoder and the first neural network is a BiLSTM network. The first sample data is classified through the BiLSTM network to obtain the positioning information of all sample information points to be extracted in the training sample (i.e. all data containing sample extraction information can be obtained).
S300-4: splicing sample information points to be extracted and sample data, and then coding to obtain a processing vector;
s300-5: determining target sample data corresponding to the to-be-extracted sample information point by adopting a classifier based on the processing vector;
the step of obtaining the target sample data by using the classifier specifically uses two classifiers: one classifier predicts the start sequence index of the target sample data in the sample data, and the other predicts the end sequence index of the target sample data in the sample data. Whether the start sequence index is less than or equal to the end sequence index is then determined, and if so, the target sample data is obtained.
S300-6: marking the first sample data according to the sample extraction information to obtain sample comparison data; and after comparing the sample comparison data with the classification result and comparing the target sample data with the sample extraction information, adjusting model parameters until training is finished.
In the present embodiment, the first sample data is labeled, with a positive example defined as 1 and a negative example defined as 0. The information point to be extracted is denoted item, and the text of the paragraph (i.e. the first sample data) is denoted para. Because a paragraph may contain several information points or only one, item + para is constructed into a character string sent when labeling; if item can be extracted from sent, it is labeled 1, otherwise it is labeled 0.
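The labeling rule can be sketched as follows. The patent does not spell out how "can be extracted" is decided during annotation; this sketch assumes it means that the annotated answer for the item appears in the paragraph, and the function name and the `answer` parameter are hypothetical.

```python
def build_location_samples(item, paragraphs, answer):
    """Build (sent, label) training pairs for the paragraph-locating classifier."""
    samples = []
    for para in paragraphs:
        sent = [item, para]   # the information point is placed at the head
        # Assumption: a paragraph is a positive example when it contains the
        # annotated answer for this information point.
        label = 1 if answer in para else 0
        samples.append((sent, label))
    return samples
```

With the item "XX organization", the answer "YY company" and three paragraphs of which only the second names YY company, the labels come out as 0, 1, 0.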
Specifically, after training is completed, extracting key information from the data to be processed by using the pre-trained information extraction model based on the information point to be extracted to obtain target information includes the following steps, with reference to fig. 4:
s310: processing the data to be processed based on the information points to be extracted to obtain second processing data containing target key information corresponding to the information points to be extracted;
specifically, processing the to-be-processed data based on the to-be-extracted information point to obtain second processed data including target key information corresponding to the to-be-extracted information point, with reference to fig. 5, includes:
s311: splicing the information points to be extracted and the data to be processed and then coding to obtain third processed data;
it should be noted that splicing the information point to be extracted and the data to be processed specifically is to place the information point to be extracted at the head of the data to be processed.
S312: and processing by adopting a first neural network based on the third processing data to obtain second processing data containing target key information corresponding to the information point to be extracted.
By way of example and not limitation, let XX be the mechanism, the target information point to which the XX should correspond is YY, and let the third processed data and the corresponding information point to be extracted form a binary set, which is:
[ XX organization, "4, XX sales organization: & ] XX planned transaction structure and related party profiles # & #461# & #4, XX sales organization # & #486# & #" ]
[ XX mechanism, "name: YY company x planned transaction structure and associated side profile # & #461# & #4, Xx sales organization # & #486# & # "]
[ XX organization, "legal representatives: TT × Xx planned transaction structure and related party profiles # & #461# & #4, Xx sales organization # & #486# & # "]
The classification results for the above statements are 0, 1 and 0, respectively (0 means the data does not contain the target key information; 1 means it does);
the corresponding comparison data is:
[ XX sales organization, "4, XX sales organization: &, # 0 ] with related party profiles # & #461# & #, 4, XX sales organization # & #486# & #, 0 &, #
[ XX sales organization, "name: YY company x planned transaction structure and associated side profile # & #461# & #4, Xx sales organization # & #486# & #, 1
[ XX sales organization, "legal representatives: TT × Xx planned transaction structure and related party profiles # & #461# & #4, Xx sales organization # & #486# & # ", 0
The second processed data thus obtained is [ XX sales organization, "name: YY company x planned trading structure and associated side profile # & #461# & #4, Xx sales organization # & #486# & # "].
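The paragraph-locating step (S311 and S312) can be sketched as below. The trained first neural network is replaced by a hypothetical `score` callable that returns the probability that a spliced string contains the target key information; the function name, the separator, the threshold and the choice of keeping only the highest-scoring paragraph are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple


def locate_paragraph(
    item: str,
    paragraphs: List[str],
    score: Callable[[str], float],  # stand-in for the trained classifier
    threshold: float = 0.5,
    sep: str = " | ",
) -> Optional[Tuple[str, str]]:
    """Return (item, paragraph) for the best paragraph classified as 1."""
    best_pair, best_prob = None, threshold
    for para in paragraphs:
        third = item + sep + para  # information point placed at the head
        prob = score(third)
        if prob >= best_prob:  # classified as 1 and at least as good so far
            best_pair, best_prob = (item, para), prob
    return best_pair


def toy_score(text: str) -> float:
    """Toy scorer used only to make the sketch runnable."""
    return 0.9 if "Name:" in text else 0.1


second = locate_paragraph(
    "XX sales organization",
    ["4. XX sales organization", "Name: YY company", "Legal representative: TT"],
    toy_score,
)
```

When no paragraph reaches the threshold the function returns None, which corresponds to the case where no second processed data is found.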
S320: extracting information from the second processed data to obtain the target key information;
Specifically, with reference to Fig. 6, this extraction includes:
S321: splicing the information point to be extracted with the second processed data and then encoding the result to obtain fourth processed data;
S322: predicting, with a classifier, the sequence index of the target key information based on the fourth processed data, and obtaining the target key information based on the sequence index.
In a specific implementation, two classifiers are used: the first predicts the start sequence index of the key information in the fourth processed data, and the second predicts the end sequence index of the key information in the fourth processed data. It is then judged whether the start sequence index is smaller than the end sequence index, and if so, the key information is obtained.
The classifier may be an SVM classifier, a Bayesian classifier, or any other prior-art classifier capable of determining the sequence index of the target key information.
By way of example and not limitation, taking the prediction of "sea YY company" as an example: if the character "sea" corresponds to the start sequence index and the character "si" corresponds to the end sequence index in the fourth processed data, the key information "sea YY company" lying between the start sequence index and the end sequence index can be obtained. For the second processed data described above, [ XX sales organization, "name: the YY company x XX planned trading structure and the related party profiles # & #461# & #4, XX sales organization # & #486# & # "], the key information is YY company.
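The start/end index prediction can be illustrated with a minimal sketch. The per-token scores below stand in for the outputs of the two classifiers; the function name, the token list and the score values are hypothetical. The check follows the text literally (start index strictly smaller than end index), so a single-token answer would be rejected; allowing equality would be a variant.

```python
from typing import List, Optional


def extract_span(
    tokens: List[str],
    start_scores: List[float],  # output of the first classifier (assumed)
    end_scores: List[float],    # output of the second classifier (assumed)
    joiner: str = " ",
) -> Optional[str]:
    """Pick the best start and end positions and return the text between them."""
    if not tokens:
        return None
    start = max(range(len(tokens)), key=lambda i: start_scores[i])
    end = max(range(len(tokens)), key=lambda i: end_scores[i])
    if start >= end:  # the text requires start index < end index
        return None
    return joiner.join(tokens[start:end + 1])


tokens = ["Name", ":", "YY", "company", "*", "XX", "plan"]
start_scores = [0.01, 0.01, 0.90, 0.05, 0.01, 0.01, 0.01]
end_scores = [0.01, 0.01, 0.05, 0.90, 0.01, 0.01, 0.01]
key_info = extract_span(tokens, start_scores, end_scores)
```

Here the start score peaks at "YY" and the end score peaks at "company", so the span between them is returned.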
S330: splicing the target key information and the second processed data to obtain the target information.
Because the target key information is spliced with the second processed data, the target information directly shows both the information point to be extracted and the target key information. This reduces mismatches between the information point to be extracted and the target key information in the output target information and further improves extraction accuracy.
As an example, the second process data described above is [ XX sales organization, "name: YY company x planned trading structure and related side profile # & #461# & #4, Xx sales organization # & #486# & "], the target information is [ Xx sales organization," name: YY company x XX planned trading structure and related party profiles # & #461# & #4, XX sales organization # & #486# & #, YY company.
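A small sketch of the splicing in S330 follows. The `TargetInfo` record and its field names are hypothetical; the only point illustrated is that the output keeps the information point, the located paragraph and the extracted key information together.

```python
from typing import NamedTuple, Optional, Tuple


class TargetInfo(NamedTuple):
    info_point: str          # information point to be extracted
    paragraph: str           # paragraph text from the second processed data
    key_info: Optional[str]  # extracted target key information, if any


def splice_target(second_data: Tuple[str, str], key_info: Optional[str]) -> TargetInfo:
    """Append the key information to the (info point, paragraph) pair."""
    info_point, paragraph = second_data
    return TargetInfo(info_point, paragraph, key_info)


target = splice_target(("XX sales organization", "Name: YY company"), "YY company")
```

Keeping the three parts in one record means a consumer never has to re-associate an extracted value with the information point it answers.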
In the above process, the first neural network is used to obtain the paragraph data (i.e., the second processed data) in which the target key information is located. The classifier then predicts the start sequence index and the end sequence index of the target key information within the second processed data, yielding the target key information corresponding to the information point to be extracted. Finally, the target key information is spliced with the paragraph data in which it is located (i.e., the second processed data) to generate the target information, thereby extracting the key data from the text to be processed.
After key information has been extracted from the data to be processed with the pre-trained information extraction model based on the information point to be extracted to obtain the target information, the method further includes the following steps (see Fig. 7):
providing a preset synonym library, wherein the synonym library comprises a plurality of synonyms associated with each keyword in the database;
S410: when the target key information in the target information is empty, obtaining synonyms associated with the information point to be extracted from the synonym library as candidate information points;
Specifically, step S410 requires identifying whether the target key information is empty, which can be done by checking whether it contains any characters immediately after it is obtained. In this embodiment, empty target key information means that no corresponding information could be obtained from the data to be processed for the information point to be extracted. One possible cause is that the information point to be extracted differs so much from the way the target key information is described in the data to be processed that automatic matching fails. A synonym library is therefore established; synonyms close in meaning to the keyword (i.e., the information point to be extracted) are obtained from it according to a preset rule (such as a mapping table), and the information extraction model is then applied again to extract the target key information.
S420: replacing the information point to be extracted with a candidate information point to obtain a new information point to be extracted;
It should be noted that there may be one or more candidate information points. When several information points to be extracted have been obtained through matching, the synonyms associated with each of them are matched one by one as candidate information points and substituted one by one, yielding several new information points to be extracted.
S430: extracting key information from the data to be processed with the information extraction model based on the new information point to be extracted to obtain the target information.
Specifically, extracting key information from the data to be processed with the information extraction model based on the new information point to be extracted follows the same steps as S310 to S330 in S300. Providing the synonym library in steps S410 to S430 helps ensure that all target key information associated with the information points to be extracted can be obtained from the text to be processed, further improving the accuracy and comprehensiveness of extraction.
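The synonym fallback of S410 to S430 can be sketched as follows. The `extract` callable stands in for the whole information extraction model, and the synonym library is modeled as a plain dictionary (one possible form of the mapping table mentioned above). All names are hypothetical, and stopping at the first synonym that yields a result is an assumption of this sketch.

```python
from typing import Callable, Dict, List, Optional, Tuple


def extract_with_synonyms(
    info_point: str,
    data: str,
    extract: Callable[[str, str], Optional[str]],  # stand-in for the model
    synonyms: Dict[str, List[str]],                # stand-in synonym library
) -> Tuple[str, Optional[str]]:
    """Run extraction, retrying with synonyms when the result is empty."""
    key_info = extract(info_point, data)
    if key_info:
        return info_point, key_info
    for candidate in synonyms.get(info_point, []):  # S410: candidate points
        key_info = extract(candidate, data)         # S420 and S430: replace, rerun
        if key_info:
            return candidate, key_info
    return info_point, None


def toy_extract(point: str, data: str) -> Optional[str]:
    """Toy extractor: finds a 'point: value' line. Illustration only."""
    for line in data.splitlines():
        name, _, value = line.partition(":")
        if name.strip() == point and value.strip():
            return value.strip()
    return None


doc = "Distributor: YY company\nLegal representative: TT"
library = {"Sales organization": ["Seller", "Distributor"]}
result = extract_with_synonyms("Sales organization", doc, toy_extract, library)
```

In this toy run the original information point and the first synonym find nothing, and the second synonym succeeds.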
A group of data generated from the text to be processed and the corresponding target information can be uploaded to a blockchain for later use as a reference sample or a training sample. Uploading the data to the blockchain ensures security as well as fairness and transparency to the user. The user equipment can download the digest information from the blockchain to check whether the priority list has been tampered with, and can later download from the blockchain the voice file obtained for the corresponding amount data for voice broadcasting without a generation process, which effectively improves voice processing efficiency.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, in which each data block contains information about a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
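The tamper check mentioned above relies on comparing digests. The sketch below is not a blockchain implementation; it only shows, under assumed record fields and function names, how a SHA-256 digest over a record (including the previous block's hash) can be recomputed and compared with a stored value.

```python
import hashlib
import json


def record_digest(text: str, target_info: dict, prev_hash: str = "") -> str:
    """Digest over one (text, target information) record and the previous hash."""
    payload = json.dumps(
        {"text": text, "target": target_info, "prev": prev_hash},
        sort_keys=True,       # stable key order so the digest is reproducible
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_untampered(text: str, target_info: dict, prev_hash: str, stored: str) -> bool:
    """Recompute the digest and compare it with the stored one."""
    return record_digest(text, target_info, prev_hash) == stored


info = {"info_point": "XX sales organization", "key_info": "YY company"}
stored = record_digest("Name: YY company", info)
ok = is_untampered("Name: YY company", info, "", stored)
tampered = is_untampered("Name: ZZ company", info, "", stored)
```

Any change to the text, the target information or the previous hash produces a different digest, which is what makes the comparison useful as a tamper check.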
In this scheme, key information is extracted from the text to be processed with an information extraction model. Specifically, a BiLSTM network classifies the text to locate the data in which the key information resides, and the key information is then extracted from that data. Training the model improves the generalization capability of both information locating and information extraction. Unlike the regular expressions commonly used in the prior art, this approach does not require a large number of regular expressions to be written by hand: later maintenance only requires adding labeled data to the training samples, with no rules to write, which reduces the maintenance difficulty for non-professional technicians.
Example two:
Referring to Fig. 8, the information extraction apparatus 5 of the present embodiment includes the following:
The preprocessing module 51 is configured to obtain a text to be processed and analyze the text to be processed to obtain data to be processed;
in this embodiment, the text to be processed may be a document such as a bid amount instruction or a bond recruitment instruction, or may be other unstructured document data.
The to-be-extracted information point determining module 52 is configured to obtain, according to matching of the to-be-processed text in the database, a keyword corresponding to the to-be-processed text as an information point to be extracted;
A preset database is provided that contains a plurality of keywords. The keywords may be titles, special paragraphs and the like in the text to be processed, or words related to the field of the text to be processed. One or more keywords may correspond to the text to be processed, and the matching may follow a preset rule such as, but not limited to, a mapping table. When several keywords match, the information points to be extracted likewise include the several corresponding keywords.
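Keyword matching against the preset database can be sketched as below. The database is modeled as a plain list of keywords and the match is a case-insensitive substring test; both are assumptions, since the application only says that matching follows a preset rule such as a mapping table.

```python
from typing import Iterable, List


def match_info_points(text: str, keywords: Iterable[str]) -> List[str]:
    """Return every database keyword found in the text, in database order."""
    lowered = text.lower()
    matched: List[str] = []
    for kw in keywords:
        # Skip empty entries and duplicates; keep the first occurrence only.
        if kw and kw.lower() in lowered and kw not in matched:
            matched.append(kw)
    return matched


points = match_info_points(
    "4. XX sales organization\nName: YY company",
    ["Sales organization", "Custodian", "Name"],
)
```

Each matched keyword then serves as one information point to be extracted.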
And the execution module 53 is configured to extract key information of the to-be-processed data by using a pre-trained information extraction model based on the to-be-extracted information point, so as to obtain target information.
The information extraction model is trained before it is used for processing.
Preferably, the execution module 53 includes the following:
The first processing unit 531 is configured to process the data to be processed based on the information point to be extracted, and obtain second processed data containing the target key information corresponding to the information point to be extracted;
Specifically, the first processing unit splices and encodes the information point to be extracted and the data to be processed, and inputs the result into a first neural network (BiLSTM) for processing to obtain the corresponding second processed data.
A second processing unit 532, configured to perform information extraction on the second processed data to obtain target key information;
Specifically, the second processing unit processes the second processed data with two classifiers: the first classifier is configured to predict the start sequence index of the key information in the fourth processed data, and the second classifier is configured to predict the end sequence index of the key information in the fourth processed data.
A splicing unit 533, configured to splice the target key information and the second processed data to obtain target information.
Specifically, the splicing unit is configured to splice the target key information and the second processing data, so as to directly determine the information point to be extracted and the target key information through the target information.
Preferably, the information extraction device 5 further includes:
A replacement module 54, configured to, when the target key information in the target information is empty, obtain a synonym associated with the information point to be extracted from the synonym library as a candidate information point, and replace the information point to be extracted with the candidate information point to obtain a new information point to be extracted.
This technical scheme is based on machine learning in artificial intelligence. The preprocessing module analyzes the text to be processed to obtain data to be processed that the information extraction model can handle. The to-be-extracted information point determining module then matches against the database to obtain the information points to be extracted that correspond to the text to be processed. In the execution module, the first processing unit uses the BiLSTM network of the information extraction model to classify the text and thereby locate the paragraph in which the target key information resides, and the second processing unit uses the classifier of the information extraction model to extract the target key information from that paragraph data. Because the key information is extracted by the information extraction model, a large number of regular expressions need not be written by hand; later maintenance only requires adding labeled data to the training samples, with no rules to write, which reduces the maintenance difficulty for non-professional technicians. The training process also improves the generalization capability of the model for information locating and information extraction.
In addition, the technical scheme further provides a replacement module. When the target key information in the target information is empty, the replacement module obtains synonyms associated with the information point to be extracted from the synonym library as candidate information points and replaces the information point to be extracted with a candidate information point to obtain a new information point to be extracted; key information is then extracted from the data to be processed according to the new information point. This reduces cases in which the target key information corresponding to an information point cannot be obtained because of differences in wording, further improving the accuracy and comprehensiveness of information extraction.
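The module structure of the apparatus can be pictured with the following sketch, in which each module or unit is an injected callable. The class name, field names and toy callables are hypothetical, the replacement module 54 is left out for brevity, and nothing here reflects the actual trained model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class InformationExtractionApparatus:
    preprocess: Callable[[str], List[str]]             # preprocessing module 51
    determine_points: Callable[[str], List[str]]       # determining module 52
    locate: Callable[[str, List[str]], Optional[str]]  # first processing unit 531
    extract: Callable[[str, str], Optional[str]]       # second processing unit 532

    def run(self, text: str) -> List[Tuple[str, Optional[str], Optional[str]]]:
        """Execution module 53: locate, extract, then splice (unit 533)."""
        paragraphs = self.preprocess(text)
        results = []
        for point in self.determine_points(text):
            paragraph = self.locate(point, paragraphs)
            key_info = self.extract(point, paragraph) if paragraph else None
            results.append((point, paragraph, key_info))
        return results


apparatus = InformationExtractionApparatus(
    preprocess=lambda t: [p.strip() for p in t.splitlines() if p.strip()],
    determine_points=lambda t: ["Name"] if "Name" in t else [],
    locate=lambda point, paras: next((p for p in paras if p.startswith(point)), None),
    extract=lambda point, para: para.split(":", 1)[1].strip() or None,
)
output = apparatus.run("4. XX sales organization\nName: YY company")
```

Passing the modules in as callables mirrors the remark that the components may be distributed across different devices, since each one can be swapped independently.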
Example three:
In order to achieve the above object, the present invention further provides a computer device 6. There may be a plurality of computer devices 6, and the components of the information extraction apparatus 5 of the second embodiment may be distributed among different computer devices. The computer device may be a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack-mounted server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster formed by a plurality of servers) that executes a program, and the like. The computer device of this embodiment at least includes, but is not limited to, a memory 61, a processor 62, a network interface 63, and the information extraction apparatus 5, which are communicatively connected to each other through a system bus, as shown in Fig. 9. It should be noted that Fig. 9 only shows a computer device with certain components; not all of the shown components are required, and more or fewer components may be implemented instead.
In the present embodiment, the memory 61 includes at least one type of computer-readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 61 may be an internal storage unit of the computer device, such as a hard disk or a memory of the computer device. In other embodiments, the memory 61 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), or the like, provided on the computer device. Of course, the memory 61 may also include both internal and external storage devices of the computer device. In this embodiment, the memory 61 is generally used for storing an operating system and various application software installed in the computer device, such as the program code of the information extraction apparatus of the second embodiment. Further, the memory 61 may also be used to temporarily store various types of data that have been output or are to be output.
Processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 62 is typically used to control the overall operation of the computer device. In this embodiment, the processor 62 is configured to run the program codes stored in the memory 61 or process data, for example, run the information extraction device, so as to implement the information extraction method of the first embodiment.
The network interface 63 may comprise a wireless network interface or a wired network interface, and is typically used to establish a communication connection between the computer device 6 and other computer devices 6. For example, the network interface 63 is used to connect the computer device 6 to an external terminal via a network and to establish a data transmission channel and a communication connection between the computer device 6 and the external terminal. The network may be a wireless or wired network such as an intranet, the Internet, the Global System for Mobile Communications (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, and the like.
It is noted that fig. 9 only shows the computer device 6 with components 61-63, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
In this embodiment, the information extraction device 5 stored in the memory 61 may also be divided into one or more program modules, and the one or more program modules are stored in the memory 61 and executed by one or more processors (in this embodiment, the processor 62) to complete the present invention.
Example four:
To achieve the above objects, the present invention also provides a computer-readable storage system including a plurality of storage media, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored that implements the corresponding functions when executed by the processor 62. The computer-readable storage medium of this embodiment is used for storing the information extraction apparatus, which implements the information extraction method of the first embodiment when executed by the processor 62.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. An information extraction method, comprising:
acquiring a text to be processed, and analyzing the text to be processed to obtain data to be processed;
matching the data to be processed with keywords in a preset database, and taking the matched keywords corresponding to the text to be processed as information points to be extracted, wherein the database is pre-stored with a plurality of keywords;
and extracting key information of the data to be processed by adopting a pre-trained information extraction model based on the information points to be extracted to obtain target information.
2. The information extraction method according to claim 1, wherein analyzing the text to be processed to obtain data to be processed further comprises:
carrying out normalization processing on the text to be processed to obtain a first processed text;
and matching a corresponding analysis rule according to the type of the text to be processed, and processing the first processed text by adopting the analysis rule to obtain data to be processed.
3. The information extraction method according to claim 1, wherein before extracting key information from the data to be processed based on the information points to be extracted by using a pre-trained information extraction model, the method further comprises:
acquiring a training sample, wherein the training sample comprises a training data set of information to be extracted, sample information points to be extracted and sample extraction information;
splicing the to-be-extracted sample information with each training data to obtain first sample data corresponding to each training data;
coding each first sample data, classifying the coded first sample data by adopting a first neural network to obtain a classification result of the first sample data, and obtaining second sample data containing sample extraction information based on the classification result;
splicing sample information points to be extracted and sample data, and then coding to obtain a processing vector;
determining target sample data corresponding to the to-be-extracted sample information point by adopting a classifier based on the processing vector;
marking the first sample data according to the sample extraction information to obtain sample comparison data; and after comparing the sample comparison data with the classification result and comparing the target sample data with the sample extraction information, adjusting model parameters until the training is finished to obtain a pre-trained information extraction model.
4. The information extraction method of claim 1, wherein extracting key information from the data to be processed based on the information points to be extracted by using a pre-trained information extraction model to obtain target information comprises:
positioning target key information based on the information point to be extracted to obtain second processing data containing the target key information;
extracting information of the second processing data to obtain target key information corresponding to the information point to be extracted;
and splicing the target key information and the second processing data to obtain target information.
5. The information extraction method according to claim 4, wherein locating target key information based on the information point to be extracted to obtain second processed data including the target key information comprises:
splicing the information points to be extracted and the data to be processed and then coding to obtain third processed data;
and processing the third processing data by adopting a first neural network to obtain second processing data containing the target key information.
6. The information extraction method according to claim 4, wherein extracting information from the second processed data to obtain target key information corresponding to the information point to be extracted comprises:
splicing the information points to be extracted and the second processing data and then coding to obtain fourth processing data;
and predicting a sequence index of the target key information based on the fourth processing data by adopting a classifier, and acquiring the target key information corresponding to the information point to be extracted based on the sequence index.
7. The information extraction method according to claim 1, wherein after extracting key information from the data to be processed based on the information points to be extracted by using a pre-trained information extraction model to obtain target information, the method further comprises:
providing a preset synonym library, wherein the synonym library comprises a plurality of synonyms related to each keyword in the database;
when the target key information in the target information is empty, obtaining synonyms associated with the information points to be extracted from the synonym library as candidate information points;
replacing the information points to be extracted with the candidate information points to obtain new information points to be extracted;
and extracting key information of the data to be processed by adopting an information extraction model based on the new information point to be extracted to obtain target information.
8. An information extraction apparatus characterized by comprising:
the preprocessing module is used for acquiring a text to be processed, analyzing the text to be processed and acquiring data to be processed;
the information point to be extracted determining module is used for matching the text to be processed in the database according to the text to be processed to obtain a keyword corresponding to the text to be processed as an information point to be extracted;
and the processing module is used for extracting key information of the data to be processed by adopting a pre-trained information extraction model based on the information points to be extracted to obtain target information.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the information extraction method of any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium comprising a plurality of storage media, each storage medium having a computer program stored thereon, wherein the computer programs stored in the storage media, when executed by a processor, collectively implement the steps of the information extraction method of any one of claims 1 to 7.
CN202010909654.1A 2020-09-02 2020-09-02 Information extraction method and device, computer equipment and readable storage medium Pending CN112052305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010909654.1A CN112052305A (en) 2020-09-02 2020-09-02 Information extraction method and device, computer equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010909654.1A CN112052305A (en) 2020-09-02 2020-09-02 Information extraction method and device, computer equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN112052305A true CN112052305A (en) 2020-12-08

Family

ID=73606772

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010909654.1A Pending CN112052305A (en) 2020-09-02 2020-09-02 Information extraction method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN112052305A (en)

Similar Documents

Publication Publication Date Title
CN110704633B (en) Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium
CN112052305A (en) Information extraction method and device, computer equipment and readable storage medium
CN110020424B (en) Contract information extraction method and device and text information extraction method
CN112380343B (en) Problem analysis method, device, electronic equipment and storage medium
US10839207B2 (en) Systems and methods for predictive analysis reporting
CN111695439A (en) Image structured data extraction method, electronic device and storage medium
CN113707300B (en) Search intention recognition method, device, equipment and medium based on artificial intelligence
WO2022048363A1 (en) Website classification method and apparatus, computer device, and storage medium
CN111814482B (en) Text key data extraction method and system and computer equipment
CN111858843B (en) Text classification method and device
CN109977014B (en) Block chain-based code error identification method, device, equipment and storage medium
CN111191275A (en) Sensitive data identification method, system and device
US20220083772A1 (en) Identifying matching fonts utilizing deep learning
WO2019075967A1 (en) Enterprise name recognition method, electronic device, and computer-readable storage medium
CN111221936B (en) Information matching method and device, electronic equipment and storage medium
CN115080750B (en) Weak supervision text classification method, system and device based on fusion prompt sequence
CN110674250A (en) Text matching method, text matching device, computer system and readable storage medium
CN113868419B (en) Text classification method, device, equipment and medium based on artificial intelligence
CN110750637B (en) Text abstract extraction method, device, computer equipment and storage medium
KR102468975B1 (en) Method and apparatus for improving accuracy of recognition of precedent based on artificial intelligence
CN115481599A (en) Document processing method and device, electronic equipment and storage medium
CN114611495A (en) Text comparison method, device, equipment and medium
CN113936130A (en) Document information intelligent acquisition and error correction method, system and equipment based on OCR technology
CN113627173A (en) Manufacturer name identification method and device, electronic equipment and readable medium
CN113962196A (en) Resume processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination