CN114201582A - Text data intelligent extraction method and device based on BiLSTM-CRF model - Google Patents


Info

Publication number
CN114201582A
CN114201582A
Authority
CN
China
Prior art keywords
data
model
training
bilstm
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111481294.0A
Other languages
Chinese (zh)
Inventor
杨细勇
王毅宏
刘树锋
陈贵民
李剑煜
林山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Anscen Network Technology Co ltd
Original Assignee
Xiamen Anscen Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Anscen Network Technology Co ltd filed Critical Xiamen Anscen Network Technology Co ltd
Priority to CN202111481294.0A priority Critical patent/CN114201582A/en
Publication of CN114201582A publication Critical patent/CN114201582A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method first uses Flink to extract streaming data from a data source at regular intervals and write it into ClickHouse. It then pulls the data to be processed from ClickHouse and performs batch classification, labeling, merging and dictionary generation on it to form pre-training data. The pre-training data is imported into a BiLSTM-CRF model for training to form a prediction model, on which a prediction model service API is built. Finally, streaming data pulled from the data source is passed through the prediction model service API to obtain a prediction result, which is mapped to a specific entity and written into a service database for storage, thereby realizing entity recognition, extraction and storage of unstructured and irregular text content. The application also relates to an intelligent text data extraction device based on the BiLSTM-CRF model, which achieves the same effects of identifying, extracting and storing unstructured and irregular text content.

Description

Text data intelligent extraction method and device based on BiLSTM-CRF model
Technical Field
The application relates to the technical field of unstructured data processing, and in particular to an intelligent text data extraction method and device based on a BiLSTM-CRF model.
Background
Unstructured data refers to data whose field lengths are variable and whose records may be composed of repeatable or non-repeatable sub-fields; it covers not only structured information (numeric, symbolic, etc.) but also full text, images, sound, video, hypermedia and similar content. Traditional structured-data extraction works by analyzing the content and structure of the data to pull out key elements, generally through manually defined parsing rules such as regular expressions or fixed labels. Irregular, unstructured content, however, is difficult to identify, extract and store by such manual means.
Disclosure of Invention
To address the problem that irregular and unstructured content is difficult to identify, extract and store by manual means, the application provides an intelligent text data extraction method and device based on a BiLSTM-CRF model.
In a first aspect, the application provides an intelligent text data extraction method based on a BiLSTM-CRF model, comprising the following steps:
S1: extracting stream data from a data source at regular intervals using Flink and writing the stream data into ClickHouse;
S2: pulling data to be processed from ClickHouse and dividing it into training data (80%) and test data (20%); labeling the training data and the test data respectively and merging multiple files to form a training data corpus and a test data corpus; then reading each corpus, constructing a two-dimensional array of (word id, word frequency), and pickle-dumping the array into a .pkl dictionary file to form a training data dictionary and a test data dictionary;
S3: importing the training data corpus, test data corpus, training data dictionary and test data dictionary as pre-training data into a BiLSTM-CRF model for training to form a prediction model;
S4: initializing the prediction model with TensorFlow, calling the model's prediction interface, and extracting the custom tag values from the interface's return data to obtain a prediction model service API;
S5: pulling stream data from the data source through the prediction model service API to obtain a prediction result, mapping the prediction result to a specific entity and writing it into a service database for storage, wherein the service database comprises a plurality of output database components interfacing with Flink.
By adopting the above technical scheme, Flink first extracts stream data from a data source at regular intervals and writes it into ClickHouse; the data to be processed is then pulled from ClickHouse and subjected to batch classification, labeling, merging and dictionary generation to form pre-training data; the pre-training data is imported into a BiLSTM-CRF model for training to form a prediction model, on which a prediction model service API is built; finally, stream data pulled from the data source is passed through the prediction model service API to obtain a prediction result, which is mapped to a specific entity and written into a service database for storage, thereby realizing entity recognition, extraction and storage of unstructured and irregular text content.
Preferably, S1 specifically includes: extracting stream data from a data source at regular intervals using Flink, writing the stream data into ClickHouse, and recording an execution log to a MySQL database.
By adopting this technical scheme, while stream data is extracted from the data source at regular intervals and written into ClickHouse, an execution log is recorded in the MySQL database, so that the status of Flink's timed data extraction can be queried and the execution period adjusted accordingly.
Preferably, in S1, quartz is bound to the execution entry class of Flink, so that Flink extracts stream data from the data source at regular intervals and writes it into ClickHouse.
By adopting this technical scheme, quartz executes tasks on a timed, rule-based schedule. A front-end display page provides selection of an execution period, for example 11 pm to 12 pm every Friday in month X of year X, executed once every 10 minutes. Quartz binds this scheduling rule to the Flink execution entry class, so the Flink timed extraction task is triggered as soon as the time defined by the rule arrives.
Preferably, in S1, stream data containing the key elements is extracted from the data source using a Flink Connector and written into ClickHouse, the Connector performing the extraction of the key-element stream data via the SQL API.
By adopting this technical scheme, the SQL API can define the source from which the input data is extracted and the sink of the output data source, thereby realizing the extraction of key-element stream data.
Preferably, in S2, after the data to be processed is divided into training data and test data, the training data is further divided into a plurality of batches.
By adopting this technical scheme, dividing the training data evenly into several batches facilitates the subsequent labeling and merging of the training data.
Preferably, S3 specifically includes: importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model, and recording an execution log to a MySQL database.
By adopting this technical scheme, an execution log is recorded in the MySQL database while the pre-training data is imported into the BiLSTM-CRF model for training, so the training parameters can be tuned according to the actual situation to form a model with better accuracy.
Preferably, in S5, stream data is pulled from the data source through the prediction model service API using Flink, either in real time or in timed batches.
By adopting this technical scheme, Flink can pull stream data from the data source efficiently, whether in real time or in timed batches.
Preferably, the output database components in S5 include ElasticSearch, MinIO, ClickHouse or HDFS.
By adopting this technical scheme, with ElasticSearch, MinIO, ClickHouse or HDFS as the output database components interfacing with Flink, the data of this layer can be provided as a data service to third-party systems or other upper-layer applications for querying.
In a second aspect, the present application further provides an intelligent text data extraction device based on a BiLSTM-CRF model, including:
the data acquisition module, used for extracting stream data from a data source at regular intervals and writing the stream data into ClickHouse;
the data processing module, used for pulling data to be processed from ClickHouse and subjecting the data to batch classification, labeling, merging and dictionary generation to form pre-training data;
the model training module, used for importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model;
the prediction task module, which provides a prediction model service API based on the prediction model; stream data input into the prediction model service API yields a prediction result, which is mapped to a specific entity;
the data pulling module, used for pulling stream data from a data source in real time or in timed batches through Flink and inputting the stream data into the prediction model service API;
the timing task module, used for configuring the execution cycle of a timed acquisition task or a timed batch pulling task;
the element storage module, used for storing the data after the prediction result output by the prediction model service API has been mapped to a specific entity.
By adopting this technical scheme, the data acquisition module first extracts stream data from a data source at regular intervals and writes it into ClickHouse; the data processing module then pulls the data to be processed from ClickHouse and subjects it to batch classification, labeling, merging and dictionary generation to form pre-training data; the model training module imports the pre-training data into a BiLSTM-CRF model for training to form a prediction model, on which a prediction model service API is provided; the data pulling module uses Flink to pull stream data from the data source in real time or in timed batches and inputs it into the prediction model service API to output a prediction result; the prediction result is mapped to a specific entity and stored in the element storage module, thereby realizing entity recognition, extraction and storage of unstructured and irregular text content.
In a third aspect, the present application also proposes a computer-readable storage medium on which a computer program is stored, the program, when executed by a processor, implementing the method according to the first aspect.
In summary, the application provides an intelligent text data extraction method based on a BiLSTM-CRF model: Flink extracts streaming data from a data source at regular intervals and writes it into ClickHouse; the data to be processed is pulled from ClickHouse and subjected to batch classification, labeling, merging and dictionary generation to form pre-training data; the pre-training data is imported into the BiLSTM-CRF model for training to form a prediction model, on which a prediction model service API is built; finally, streaming data pulled from the data source is passed through the prediction model service API to obtain a prediction result, which is mapped to a specific entity and written into a service database for storage, thereby realizing entity recognition, extraction and storage of unstructured and irregular text content. The application also provides an intelligent text data extraction device based on the BiLSTM-CRF model, which achieves the same effects of entity recognition, extraction and storage through this method.
Drawings
The accompanying drawings are included to provide a further understanding of the embodiments and are incorporated in and constitute a part of this specification. The drawings illustrate embodiments and together with the description serve to explain the principles of the invention. Other embodiments and many of the intended advantages of embodiments will be readily appreciated as they become better understood by reference to the following detailed description. The elements of the drawings are not necessarily to scale relative to each other. Like reference numerals designate corresponding similar parts.
FIG. 1 is a flowchart of the intelligent text data extraction method based on a BiLSTM-CRF model disclosed in an embodiment of the application.
FIG. 2 is a schematic diagram of an embodiment of the intelligent text data extraction method based on a BiLSTM-CRF model according to the application.
FIG. 3a is a diagram illustrating a data file before annotation merging in an embodiment of the present application.
FIG. 3b is a diagram of a corpus file after annotation merging according to an embodiment of the present application.
FIG. 4 is a schematic block diagram of an intelligent text data extraction apparatus based on a BiLSTM-CRF model according to an embodiment of the present application.
FIG. 5 is a schematic block diagram of an intelligent text data extraction device based on the BiLSTM-CRF model in another embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 shows a flowchart of an intelligent text data extraction method based on a BiLSTM-CRF model in the present application, and fig. 2 shows a schematic diagram of an embodiment of the intelligent text data extraction method based on the BiLSTM-CRF model in the present application, with reference to fig. 1 and fig. 2, the method includes the following steps:
s1: extracting stream data from a data source by using the Flink timing and writing the stream data into the ClickHouse;
the Flink is an open source stream processing framework developed by the Apache software foundation, and the core is a distributed stream data processing engine written by Java and Scale. Flink executes arbitrary stream data programs in a data parallel and pipelined manner, and Flink's pipelined runtime system can execute batch and stream processing programs. ClickHouse is a tubular storage database sourced in 2016 by Yandex in russia, mainly used for online analytical processing queries (OLAP), and is capable of generating analytical data reports in real time using SQL queries. In particular embodiments, the data source is Kafka, hbase, hdfs, or the like.
In a specific embodiment, in step S1, quartz is bound to the execution entry class of Flink so that Flink extracts stream data from the data source at regular intervals and writes it into ClickHouse. Quartz executes tasks on a timed, rule-based schedule; a front-end display page provides selection of an execution period, for example 11 pm to 12 pm every Friday in month X of year X, executed once every 10 minutes. Quartz binds this scheduling rule to the Flink execution entry class, so the Flink timed extraction task is triggered as soon as the time defined by the rule arrives.
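The quartz binding above is a Java-side mechanism; as a minimal Python stdlib sketch of the scheduling idea only (the window and interval values are illustrative, not taken from the patent), enumerating the fire times of a "once every 10 minutes inside a Friday-evening window" rule looks like:

```python
from datetime import datetime, timedelta

def trigger_times(window_start, window_end, interval_minutes):
    """Enumerate the fire times of a timed task inside an execution
    window, e.g. every 10 minutes between 23:00 and 24:00 on a Friday."""
    times = []
    t = window_start
    while t < window_end:
        times.append(t)
        t += timedelta(minutes=interval_minutes)
    return times

# A Friday-evening window (2021-12-03 is a Friday), once every
# 10 minutes, which yields six fire times inside the hour.
start = datetime(2021, 12, 3, 23, 0)
end = datetime(2021, 12, 4, 0, 0)
fires = trigger_times(start, end, 10)
print(len(fires))  # 6
```

In the actual system each fire time would trigger the Flink extraction entry class rather than a Python callback.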
In a further embodiment, S1 specifically includes the following steps: stream data is extracted from the data source at regular intervals using Flink and written into ClickHouse, and an execution log is recorded in a MySQL database. MySQL is an open source relational database management system whose first version was released in January 1998; it uses the most common database management language, Structured Query Language (SQL).
In a further embodiment, in S1, stream data containing the key elements is extracted from the data source using a Flink Connector and written into ClickHouse, the Connector extracting the key-element stream data via the SQL API. The SQL API in Flink can define the source of the input data to be extracted and the sink of the output data source, then collect the data through SQL statements such as INSERT INTO my_sink_table SELECT id, user_name, msg, create_time FROM mysql_source_table.
In a further embodiment, step S1 is configured with a data collection table for storing collected data, which needs to be created according to the actual service, such as: user_msg (id int, create_time, update_time, create_id varchar, user_name varchar, msg varchar, … key element fields), i.e., (id, creation time, update time, creator, updater, user id, user name, short message content, … key element fields).
In a further embodiment, step S1 is further configured with a timed task information table for recording timed tasks, such as: schedule_task (id int, task_name varchar, task_no varchar, task_type int, task_freq varchar, task_status int, data_num long, model_no int, create_time timestamp, update_time timestamp, creator varchar, updater varchar), that is, (id, task name, task batch number, task type (collection, pull), task period, status, data amount, model used, creation time, update time, creator, updater).
S2: pulling data to be processed from the ClickHouse, and carrying out batch classification, labeling and merging and dictionary generation processing on the data to be processed to form pre-training data;
In a specific embodiment, step S2 specifically includes: pulling the data to be processed from ClickHouse and dividing it into training data (80%) and test data (20%); labeling the training data and the test data respectively and merging multiple files to form a training data corpus and a test data corpus; then reading each corpus, constructing a two-dimensional array of (word id, word frequency), and pickle-dumping the array into a .pkl dictionary file to form a training data dictionary and a test data dictionary.
In a further embodiment, after the data to be processed is divided into training data and test data in step S2, the training data is divided into several batches, for example: train_data_batch_0, train_data_batch_50, test_data_batch_0, and test_data_batch_50.
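The 80/20 split and batching step can be sketched as follows; the batch size of 50 and the record contents are assumptions, inferred only from the example batch names above:

```python
def split_and_batch(records, train_ratio=0.8, batch_size=50):
    """Split records 80/20 into training and test data, then cut the
    training portion into fixed-size batches named by record offset
    (cf. train_data_batch_0, train_data_batch_50, ...)."""
    cut = int(len(records) * train_ratio)
    train, test = records[:cut], records[cut:]
    batches = {
        f"train_data_batch_{i}": train[i:i + batch_size]
        for i in range(0, len(train), batch_size)
    }
    return train, test, batches

records = [f"msg_{i}" for i in range(100)]
train, test, batches = split_and_batch(records)
print(len(train), len(test), sorted(batches))
# 80 20 ['train_data_batch_0', 'train_data_batch_50']
```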
In a specific embodiment, a data file before labeling and merging is shown in fig. 3a and the corpus file after labeling and merging in fig. 3b, where the K value is a service-specific indicator annotated with a custom entity tag such as B-KPARAM-T or I-KPARAM-T (B stands for begin, the start of the tag; T stands for title, the entity title). Multiple files are then merged to form the corpus file. The custom entity tags guide model learning in the subsequent training process.
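A minimal sketch of the BIO-style labeling described above; the tokens and the entity span are invented examples, and only the B-/I-/O tag shapes follow the text:

```python
def bio_tag(tokens, entity_span, entity_type="KPARAM-T"):
    """Attach BIO-style tags to a token sequence: B- marks the start of
    the custom entity, I- its continuation, O everything else.  The
    sample tokens and span below are illustrative, not patent corpus data."""
    start, end = entity_span  # half-open [start, end) token range
    tags = []
    for i, _tok in enumerate(tokens):
        if i == start:
            tags.append(f"B-{entity_type}")
        elif start < i < end:
            tags.append(f"I-{entity_type}")
        else:
            tags.append("O")
    return list(zip(tokens, tags))

# One "token tag" pair per character, as in a merged corpus file.
print(bio_tag(["电", "压", "值", "为", "5"], (0, 3)))
```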
Dictionary generation then processes the labeled and merged corpus files from the previous step: the training data corpus and test data corpus are each read, a two-dimensional array of (word id, word frequency) is constructed, and the array is pickle-dumped into a .pkl file, forming the training data dictionary and the test data dictionary.
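The dictionary-generation step can be sketched as below; the id-assignment order (descending frequency) and the sample tokens are assumptions, since the patent only specifies a (word id, word frequency) array pickle-dumped to a .pkl file:

```python
import os
import pickle
import tempfile
from collections import Counter

def build_dictionary(corpus_tokens, path):
    """Build the two-dimensional (word id, word frequency) array
    described above and pickle-dump it into a .pkl dictionary file."""
    freq = Counter(corpus_tokens)
    # Assign word ids by descending frequency (ties broken by codepoint);
    # this ordering is an assumption for illustration.
    vocab = sorted(freq, key=lambda w: (-freq[w], w))
    table = [[word_id, freq[word]] for word_id, word in enumerate(vocab)]
    with open(path, "wb") as f:
        pickle.dump((vocab, table), f)
    return vocab, table

path = os.path.join(tempfile.gettempdir(), "train_dict.pkl")
vocab, table = build_dictionary(["模", "型", "模", "数", "模"], path)
print(vocab[0], table[0])  # 模 [0, 3] -- '模' appears three times
```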
S3: importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model;
in a specific embodiment, the pre-training data is embodied as a training data corpus, a test data corpus, a training data dictionary, and a test data dictionary.
CRF is a common sequence labeling algorithm that can be used for tasks such as part-of-speech tagging, word segmentation and named entity recognition. BiLSTM + CRF is currently a popular sequence labeling algorithm; combining BiLSTM with CRF lets the model both account for the dependencies between adjacent labels in the sequence, as the CRF does, and retain the feature extraction and fitting capability of the LSTM.
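The CRF side of BiLSTM-CRF chooses the best tag sequence by Viterbi decoding over per-token emission scores (produced by the BiLSTM) plus tag-transition scores; a stdlib sketch with made-up scores, not output from a trained model:

```python
def viterbi_decode(emissions, transitions, tags):
    """Find the highest-scoring tag sequence given per-token emission
    scores (the BiLSTM part) and tag-transition scores (the CRF part).
    All scores here are illustrative."""
    n_tags = len(tags)
    # best[i][t]: best score of any path ending at position i with tag t
    best = [emissions[0][:]]
    back = []
    for i in range(1, len(emissions)):
        row, ptr = [], []
        for t in range(n_tags):
            cands = [best[-1][p] + transitions[p][t] for p in range(n_tags)]
            p_best = max(range(n_tags), key=lambda p: cands[p])
            row.append(cands[p_best] + emissions[i][t])
            ptr.append(p_best)
        best.append(row)
        back.append(ptr)
    # Trace the best path back from the final position.
    t = max(range(n_tags), key=lambda t: best[-1][t])
    path = [t]
    for ptr in reversed(back):
        t = ptr[t]
        path.append(t)
    return [tags[t] for t in reversed(path)]

tags = ["O", "B-KPARAM-T", "I-KPARAM-T"]
# Transition scores penalize O -> I (an I tag cannot start an entity).
trans = [[0, 0, -10], [0, -10, 1], [0, 0, 1]]
emis = [[0.1, 2.0, 0.0], [0.2, 0.0, 1.5], [2.0, 0.0, 0.1]]
print(viterbi_decode(emis, trans, tags))
# ['B-KPARAM-T', 'I-KPARAM-T', 'O']
```

This is why the CRF layer avoids invalid label sequences that a per-token softmax could emit.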
In a further embodiment, the training data corpus, test data corpus, training data dictionary and test data dictionary are input as pre-training data into the BiLSTM-CRF model, the training parameters are estimated, and the BiLSTM-CRF model is trained; the training parameters are tuned according to the actual situation to form a model with better accuracy, and an execution log is recorded in the MySQL database.
In a further embodiment, step S3 is configured with a model table for recording generated model information, such as: ner_model (id int, model_name varchar, model_no int, train_param varchar, cost_time int, model_status int, create_time, update_time, creator, updater), i.e., (id, model name, model number, training parameters, elapsed time, status, creation time, update time, creator, updater).
S4: initializing the prediction model with TensorFlow, calling the model's prediction interface, and extracting the custom tag values from the interface's return data to obtain the prediction model service API;
tensorflow is a powerful open source software library developed by the Google Brain team for Deep Neural Networks (DNN). It allows the deployment of deep neural network computations onto servers, PCs or mobile devices of any number of CPUs or GPUs. The method can automatically derive, support various CPUs/GPUs, have a pre-training model, and support common NN architectures such as a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN) and a Deep Belief Network (DBN).
S5: pulling stream data from the data source through the prediction model service API to obtain a prediction result, mapping the prediction result to a specific entity and writing it into a service database for storage, the service database comprising a plurality of output database components interfacing with Flink.
In a specific embodiment, the output prediction result is a set of K values: after the stream data or batch data to be predicted passes through the prediction model service API, a plurality of K values can be extracted; the prediction result is traversed, each output K value is mapped to its entity tag, and the result is written into the service database for storage.
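Mapping a traversed prediction back to concrete entities can be sketched as follows; the tokens and tag names are illustrative, reusing the B-/I- convention from the corpus description rather than actual service data:

```python
def tags_to_entities(tokens, tags):
    """Traverse a predicted tag sequence and map each contiguous
    B-/I- run back to a concrete (entity type, entity string) pair,
    ready to be written to the service database."""
    entities, current, current_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((current_type, "".join(current)))
            current, current_type = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:  # "O" or a dangling I- tag ends any open entity
            if current:
                entities.append((current_type, "".join(current)))
            current, current_type = [], None
    if current:
        entities.append((current_type, "".join(current)))
    return entities

tokens = ["电", "压", "值", "为", "5"]
tags = ["B-KPARAM-T", "I-KPARAM-T", "I-KPARAM-T", "O", "O"]
print(tags_to_entities(tokens, tags))  # [('KPARAM-T', '电压值')]
```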
In a further embodiment, step S5 uses Flink to pull streaming data from the data source through the prediction model service API, in real time or in timed batches. The output database components in step S5 include ElasticSearch, MinIO, ClickHouse or HDFS, and the data in the service database can be provided as a data service to third-party systems or other upper-layer applications for querying.
In a further embodiment, step S5 is configured with an element table for storing the recognized and extracted prediction results, which needs to be created according to the actual service, such as: user_msg_indicator (id int, create_time timestamp, update_time timestamp, creator, updater, user_id varchar, user_name varchar, msg varchar, … key element fields), i.e. (id, creation time, update time, creator, updater, user id, user name, package contained in the short message, service related to the short message, region related to the short message, … other key element fields).
With further reference to fig. 4, as an implementation of the foregoing method, the application provides an embodiment of an intelligent text data extraction apparatus based on a BiLSTM-CRF model; this apparatus embodiment corresponds to the method embodiment shown in fig. 1, and the apparatus may be applied to various electronic devices. The apparatus comprises:
the data acquisition module 101, used for extracting stream data from a data source at regular intervals and writing the stream data into ClickHouse;
the data processing module 102, used for pulling data to be processed from ClickHouse and subjecting the data to batch classification, labeling, merging and dictionary generation to form pre-training data;
the model training module 103, used for importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model;
the prediction task module 104, which provides a prediction model service API based on the prediction model; stream data input into the prediction model service API yields a prediction result, which is mapped to a specific entity;
the data pulling module 105, which uses Flink to pull stream data from a data source in real time or in timed batches and input the stream data into the prediction model service API;
the timing task module 106, used for configuring the execution cycle of a timed acquisition task or a timed batch pulling task;
the element storage module 107, used for storing the data after the prediction result output by the prediction model service API has been mapped to a specific entity.
In a further embodiment, as shown in fig. 5, the text data intelligent extraction device based on the BiLSTM-CRF model comprises a data acquisition layer 100, a NER data model layer 200 and an element storage layer 300.
The data collection layer 100 includes the data acquisition module 101, the data pulling module 105, the timing task module 106 and ClickHouse. The data acquisition module 101 extracts stream data from a data source at regular intervals and writes it into ClickHouse; the data pulling module 105 pulls the data to be extracted from the data source through Flink, in real time or in timed batches, so that the data can be updated as needed and the subsequent prediction result updated correspondingly, ensuring a more real-time and accurate prediction; the timing task module 106 configures the execution cycle of a timed collection task or a timed batch pull task.
The NER data model layer 200 includes a data processing module 102, a model training module 103, and a prediction task module 104. The data processing module 102 is configured to pull data to be processed from ClickHouse and perform batch classification, labeling, merging and dictionary generation on it to form pre-training data; the model training module 103 is used for importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model; the prediction task module 104 is used for providing a prediction model service API based on the prediction model, wherein stream data input into the prediction model service API yields a prediction result, and the prediction result is mapped to a specific entity.
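The dictionary generation step performed by the data processing module 102 can be sketched as follows: each word in the corpus receives an id in order of first appearance, word frequencies are counted, and the resulting two-dimensional array is written with `pickle.dump` into a `.pkl` dictionary file, as in step S2 of claim 1. Whitespace tokenization and the function names are simplifying assumptions (Chinese text would typically be segmented per character or with a tokenizer).

```python
import pickle
import tempfile
from collections import Counter

def build_dictionary(corpus_lines):
    """Build the two-dimensional array of [word id, word frequency];
    ids are assigned in order of first appearance in the corpus."""
    freq = Counter()
    word_id = {}
    for line in corpus_lines:
        for word in line.split():
            word_id.setdefault(word, len(word_id))
            freq[word] += 1
    return [[word_id[w], freq[w]] for w in word_id]

corpus = ["the cat sat", "the dog sat"]
array = build_dictionary(corpus)
print(array)  # → [[0, 2], [1, 1], [2, 2], [3, 1]]

# pickle.dump the array into a .pkl dictionary file, as in step S2.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(array, f)
    dict_path = f.name

with open(dict_path, "rb") as f:
    restored = pickle.load(f)  # the dictionary round-trips intact
```

The training and test corpora would each be processed this way to produce the separate training and test data dictionaries described in the method.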
The element storage layer 300 consists of a variety of output database components interfaced through Flink, including ElasticSearch, MinIO, ClickHouse, and HDFS. The data of this layer can be provided as a data service to third-party systems or other upper-layer applications for query and use.
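Mapping the prediction result to a specific entity before it reaches the element storage layer can be sketched as follows, assuming the prediction model service API returns one BIO tag per token; the tag set and the record layout written to the output database components are illustrative assumptions.

```python
def map_bio_to_entities(tokens, tags):
    """Map a BIO-tagged prediction result onto specific entities.

    Contiguous runs such as B-PER followed by I-PER are merged into one
    entity record; O tags and type mismatches break the current run.
    """
    entities = []
    current = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            current = {"type": tag[2:], "text": token}
            entities.append(current)
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"] += token
        else:
            current = None
    return entities

tokens = ["张", "三", "在", "厦", "门"]
tags = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(map_bio_to_entities(tokens, tags))
# → [{'type': 'PER', 'text': '张三'}, {'type': 'LOC', 'text': '厦门'}]
```

Each resulting record could then be written to ElasticSearch, ClickHouse, or another configured sink via a Flink connector.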
According to embodiments disclosed herein, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program containing program code for performing the method illustrated in fig. 1. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU).
It should be noted that the computer readable medium of the present application can be a computer readable signal medium or a computer readable storage medium or any combination of the two. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable signal medium, by contrast, may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present application may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor includes an acquisition module, an analysis module, and an output module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
While the principles of the invention have been described in detail in connection with the preferred embodiments thereof, it will be understood by those skilled in the art that the foregoing embodiments are merely illustrative of exemplary implementations of the invention and are not limiting of its scope. Any obvious changes based on the technical solution of the invention, such as equivalent alterations and simple substitutions, remain within the spirit and scope of the invention.

Claims (10)

1. A text data intelligent extraction method based on a BiLSTM-CRF model, characterized in that the method comprises the following steps:
S1: extracting stream data from a data source at regular intervals using Flink and writing the stream data into ClickHouse;
S2: pulling data to be processed from ClickHouse and dividing it into training data and test data, wherein the training data accounts for 80% and the test data for 20%; labeling the training data and the test data respectively and merging multiple files to form a training data corpus and a test data corpus; reading the training data corpus and the test data corpus respectively, constructing a two-dimensional array of word ids and word frequencies, and writing the two-dimensional array with pickle.dump into a .pkl dictionary file to form a training data dictionary and a test data dictionary;
S3: importing the training data corpus, the test data corpus, the training data dictionary and the test data dictionary as pre-training data into a BiLSTM-CRF model for training to form a prediction model;
S4: initializing the prediction model with TensorFlow, calling the prediction interface of the model, and extracting the self-defined tag values from the return data of the interface to obtain a prediction model service API;
S5: pulling stream data from a data source through the prediction model service API to obtain a prediction result, mapping the prediction result to a specific entity, and writing it into a service database for storage, wherein the service database comprises a plurality of output database components interfaced through Flink.
2. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in claim 1, wherein S1 specifically includes: extracting stream data from a data source at regular intervals using Flink, writing the stream data into ClickHouse, and recording an execution log to the MySQL database.
3. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in claim 1, wherein in S1, Quartz is bound to the execution entry class of Flink so that Flink extracts the stream data from the data source at regular intervals and writes it into ClickHouse.
4. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in claim 1, wherein in S1, a Connector of Flink is used to extract stream data containing the key elements from the data source and write it into ClickHouse, the Connector extracting the stream data of the key elements using the SQL API.
5. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in claim 1, wherein in S2, after the data to be processed is divided into training data and test data, the training data is divided into a plurality of batches.
6. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in claim 1, wherein S3 specifically comprises: importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model, and recording an execution log to a MySQL database.
7. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in claim 1, wherein in S5, stream data is pulled from a data source through the prediction model service API using Flink in real time or in timed batches.
8. The intelligent text data extraction method based on the BiLSTM-CRF model as claimed in any one of claims 1-7, wherein the output database components in S5 include ElasticSearch, MinIO, ClickHouse, or HDFS.
9. A text data intelligent extraction device based on a BiLSTM-CRF model, characterized in that the device comprises:
the data collection module is used for extracting stream data from a data source at regular intervals and writing the stream data into ClickHouse;
the data processing module is used for pulling data to be processed from ClickHouse and performing batch classification, labeling, merging and dictionary generation on the data to form pre-training data;
the model training module is used for importing the pre-training data into a BiLSTM-CRF model for training to form a prediction model;
the prediction task module is used for providing a prediction model service API based on the prediction model, wherein stream data input into the prediction model service API yields a prediction result, and the prediction result is mapped to a specific entity;
the data pulling module is used for pulling stream data from a data source in real time or in timed batches using Flink and inputting the stream data into the prediction model service API;
the timing task module is used for configuring an execution cycle of a timing acquisition task or a timing batch pulling task;
and an element storage module, used for storing the data obtained after the prediction result output through the prediction model service API is mapped to a specific entity.
10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 8.
CN202111481294.0A 2021-12-06 2021-12-06 Text data intelligent extraction method and device based on BilSTM-CRF model Pending CN114201582A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111481294.0A CN114201582A (en) 2021-12-06 2021-12-06 Text data intelligent extraction method and device based on BilSTM-CRF model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111481294.0A CN114201582A (en) 2021-12-06 2021-12-06 Text data intelligent extraction method and device based on BilSTM-CRF model

Publications (1)

Publication Number Publication Date
CN114201582A true CN114201582A (en) 2022-03-18

Family

ID=80650841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111481294.0A Pending CN114201582A (en) 2021-12-06 2021-12-06 Text data intelligent extraction method and device based on BilSTM-CRF model

Country Status (1)

Country Link
CN (1) CN114201582A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114466362A (en) * 2022-04-11 2022-05-10 武汉卓鹰世纪科技有限公司 Method and device for filtering junk short messages under 5G communication based on BilSTM


Similar Documents

Publication Publication Date Title
Mehmood et al. Implementing big data lake for heterogeneous data sources
CN109255031B (en) Data processing method based on knowledge graph
CN107491547B (en) Search method and device based on artificial intelligence
CN107679039B (en) Method and device for determining statement intention
CN110298019A (en) Name entity recognition method, device, equipment and computer readable storage medium
CN106663037A (en) Feature processing tradeoff management
US11204957B2 (en) Multi-image input and sequenced output based image search
CN108491421B (en) Method, device and equipment for generating question and answer and computing storage medium
US11093857B2 (en) Method and apparatus for generating information
CN110516077A (en) Knowledge mapping construction method and device towards enterprise's market conditions
CN112100401B (en) Knowledge graph construction method, device, equipment and storage medium for science and technology services
CN105843793B (en) The method and system of appropriate rows concept is detected and created during automodel generates
CN109684354A (en) Data query method and apparatus
CN110046231A (en) A kind of customer service information processing method, server and system
CN116108194A (en) Knowledge graph-based search engine method, system, storage medium and electronic equipment
CN114201582A (en) Text data intelligent extraction method and device based on BilSTM-CRF model
CN113468196B (en) Method, apparatus, system, server and medium for processing data
CN113011126A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN117290481A (en) Question and answer method and device based on deep learning, storage medium and electronic equipment
CN115858822B (en) Time sequence knowledge graph construction method and system
CN112749325A (en) Training method and device for search ranking model, electronic equipment and computer medium
CN117033649A (en) Training method and device for text processing model, electronic equipment and storage medium
CN112487154B (en) Intelligent search method based on natural language
AU2019290658B2 (en) Systems and methods for identifying and linking events in structured proceedings
US11989217B1 (en) Systems and methods for real-time data processing of unstructured data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination