CN116166801A - Target text field label identification method and device - Google Patents


Info

Publication number
CN116166801A
CN116166801A
Authority
CN
China
Prior art keywords
data
identified
text
target text
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310096604.XA
Other languages
Chinese (zh)
Inventor
李信
马跃
娄竞
邢宁哲
陈重韬
王艺霏
梁东
王骏
王畅
尚芳剑
李欣怡
温馨
张海明
梁潇
刘卫卫
姚艳丽
王森
庞思睿
苏丹
那琼澜
周子阔
姜蕴洲
曲洪泽
王晓慧
黄复鹏
安宁钰
雷舒娅
张文思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Smart Grid Research Institute Co ltd
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Original Assignee
State Grid Smart Grid Research Institute Co ltd
State Grid Corp of China SGCC
Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Smart Grid Research Institute Co ltd, State Grid Corp of China SGCC, Information and Telecommunication Branch of State Grid Jibei Electric Power Co Ltd filed Critical State Grid Smart Grid Research Institute Co ltd
Priority to CN202310096604.XA priority Critical patent/CN116166801A/en
Publication of CN116166801A publication Critical patent/CN116166801A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a target text field tag identification method and device, wherein the method comprises the following steps: acquiring metadata information of a target database; screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information into an object list to be identified; according to the object list to be identified, acquiring a plurality of pieces of text data to be identified of a target text field from a target database; classifying each text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each text data to be identified of the target text field; and determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field. The method is suitable for analyzing massive and complex text field data, and can improve the efficiency of data analysis.

Description

Target text field label identification method and device
Technical Field
The present disclosure relates to the field of data identification, and in particular, to a method and apparatus for identifying a target text field tag.
Background
With the rapid development of technologies such as the Internet, big data, artificial intelligence and digital twins, enterprises face the management and application of massive data, which brings great challenges. In particular, as enterprises push forward digital transformation, the users of data are no longer limited to professional data analysts, data warehouse engineers or database administrators; increasingly, business personnel without a technical background need to probe data flexibly, that is, to identify data type labels, in a way they can understand.
However, in this data-probing task, identifying the labels of text class field data is especially complex. Various data are stored scattered across business system databases or gathered in data warehouses and data lakes, so the attributes, relations and types of text data are complex and diverse; moreover, most data are directly tied to the business, giving text data a certain business threshold. Existing text fields also contain errors, for example a single field may store data with different labels. How to analyze massive text field data quickly and intelligently, and thereby help users read and understand the state of their data comprehensively and quickly, is therefore one of the core problems in the current enterprise digital transformation process.
In the prior art, one text data label identification approach matches text data with regular expressions or fuzzy search; when facing complex text data, this approach suffers from a low identification rate and poor flexibility.
Another text data label identification approach searches text data by means of natural language processing, performing identification based on static general-domain word vectors; it places high requirements on the lexicon that stores those word vectors and still yields a poor identification effect.
In addition, when the text data stored in a database contains errors, for example digits stored in fields meant for Chinese or English text, existing text data label identification methods do not verify the data actually stored in the database, so under abnormal storage conditions identification efficiency is low and computing resources are wasted. Existing methods also do not distinguish text types (such as Chinese text and English text), so they cannot perform targeted label identification on a target text class field.
Disclosure of Invention
The method and apparatus herein address the problems in the prior art that, when identifying the labels of text field data in a database, identification efficiency is low, the identification effect is poor, computing resources are wasted, and identification cannot be performed according to user requirements.
In order to solve the above technical problem, an aspect herein provides a target text class field tag identification method, including:
acquiring metadata information of a target database;
screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information into an object list to be identified;
according to the object list to be identified, acquiring a plurality of pieces of text data to be identified of a target text field from a target database;
classifying each piece of text data to be identified of the target text class field by using a pre-trained classification model of the target text class field to obtain a label of each piece of text data to be identified; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
and determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field.
As a further embodiment herein, the metadata information of the target database includes: database address, database port, database identification, data table identification in database, field identification in data table and field type.
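The metadata record enumerated above can be sketched as a simple record type; the attribute names below are illustrative assumptions, not field names from the patent:

```python
from dataclasses import dataclass

@dataclass
class FieldMetadata:
    """One metadata record for a candidate field (attribute names are illustrative)."""
    db_host: str       # database address
    db_port: int       # database port
    db_name: str       # database identification
    table_name: str    # data table identification in the database
    field_name: str    # field identification in the data table
    field_type: str    # field type, e.g. "varchar2", "number", "date"

# Hypothetical example record for one text-typed field
meta = FieldMetadata("10.0.0.1", 3306, "marketing", "customers", "remark", "varchar2")
```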
In a further embodiment, the step of screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information to the to-be-identified object list includes:
screening metadata information with a field type of a character string from the acquired metadata information;
For each piece of the screened metadata information, the following analysis is performed:
establishing a connection to the data source corresponding to the screened metadata information using a JDBC driver;
querying data over the connection using SQL;
counting, among the queried data, the amount of data containing the target text;
and judging whether the amount of target text data is larger than a preset value; if so, taking the screened metadata information as target text metadata information and adding it to the object list to be identified.
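The screening decision above can be sketched as follows. The JDBC/SQL sampling is replaced here by an in-memory list of sampled column values, and both the Chinese-character predicate and the threshold of 10 are illustrative assumptions rather than values from the patent:

```python
import re

# Illustrative predicate: treat a value as "target text" if it contains a
# CJK ideograph (i.e. the user chose Chinese as the target text type).
_CJK = re.compile(r"[\u4e00-\u9fff]")

def looks_like_target_text(value: str) -> bool:
    return bool(_CJK.search(value or ""))

def screen_field(sample_rows, min_count: int = 10) -> bool:
    """Return True if the sampled column contains more than min_count
    target-text rows, i.e. the field should join the to-be-identified list."""
    hits = sum(1 for v in sample_rows if looks_like_target_text(v))
    return hits > min_count

rows = ["用户投诉记录"] * 15 + ["12345"] * 5
print(screen_field(rows))  # True: 15 Chinese-text rows exceed the threshold
```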
As a further embodiment herein, the coding module is a BERT model, the classification module is a LightGBM model, and the classification model training process includes:
determining the target text type and the output labels of the classification model according to the business requirements;
according to the output labels of the classification model, input text data corresponding to each output label and input text data of non-output labels are obtained from a database;
performing data checking and sample balancing processing on the acquired data to obtain a final training data set, wherein the training data set comprises a plurality of samples, and each sample comprises input text data and an output label;
training a pre-established classification model with a training dataset.
As a further embodiment herein, after the training of the classification model is completed, the method further includes:
evaluating the classification model by using the test set;
calculating evaluation indexes according to the evaluation result, the evaluation indexes including: accuracy, precision, recall, and the comprehensive index F1 value;
if the evaluation indexes are not satisfied, adjusting the classification model and retraining until a classification model satisfying the evaluation indexes is obtained.
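The evaluation indexes named above (accuracy, precision, recall, F1) can be computed as in this minimal sketch, shown for a single positive label; a per-label (macro-averaged) variant would loop this over all labels:

```python
def evaluate(y_true, y_pred, positive):
    """Compute accuracy, precision, recall and F1 for one positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```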
In a further embodiment, according to the object to be identified list, after obtaining the plurality of pieces of text data to be identified in the target text class field from the target database, the method further includes:
and deleting irrelevant information in the acquired pieces of text data to be identified.
As a further embodiment herein, determining the tag of the target text class field from the tags of the plurality of text data to be identified of the target text class field includes:
counting the data quantity of the text data to be identified of each label according to the labels of the text data to be identified of the target text field;
and taking the label with the highest data volume as the label of the target text type field.
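The majority-vote rule above, counting the per-row labels and taking the most frequent one as the field label, can be sketched as:

```python
from collections import Counter

def field_label(row_labels):
    """Majority vote: the most frequent per-row label becomes the field label."""
    return Counter(row_labels).most_common(1)[0][0]

print(field_label(["address", "address", "name", "address"]))  # address
```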
A second aspect of the present invention provides a target text class field tag identifying apparatus, comprising:
A metadata information acquisition unit configured to acquire metadata information of a target database;
the metadata information screening unit is used for screening metadata information corresponding to the target text type field from the acquired metadata information and adding the metadata information into the object list to be identified;
the text data acquisition unit to be identified is used for acquiring a plurality of pieces of text data to be identified of the target text type field from the target database according to the object list to be identified;
the classification unit is used for classifying each piece of text data to be identified of the target text class field by using a pre-trained classification model of the target text class field to obtain a label of each piece of text data to be identified; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
and the label determining unit is used for determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field.
A third aspect herein provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding embodiments when the computer program is executed.
A fourth aspect herein provides a computer storage medium having stored thereon a computer program which, when executed by a processor of a computer device, implements a method as described in any of the previous embodiments.
According to the target text class field label identification method and apparatus herein, the text data to be identified containing the target text can be screened out by analyzing the metadata information of the target database; filtering out the data of non-target text fields saves label identification time and improves identification efficiency. The text data to be identified of the target text class field is classified by the pre-trained classification model of the target text class field to obtain the label of each piece of text data to be identified, and the label of the target text class field is then determined from the labels of the plurality of pieces of text data to be identified, which further improves the identification efficiency of the target text class field. In implementation, the target text can be specified by the user, so the method also meets personalized user requirements, such as performing label identification on Chinese text or on English text. In summary, the method is suitable for analyzing massive and complex text field data, improves the efficiency of data analysis, and provides technical support for users to quickly understand and master their data assets and advance enterprise digital transformation.
The foregoing and other objects, features and advantages will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments herein or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments herein and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 illustrates a block diagram of a target text class field tag identification system of embodiments herein;
FIG. 2 illustrates a flow chart of a target text class field tag identification method of embodiments herein;
FIG. 3 is a flow chart illustrating a metadata information determination process corresponding to a target text class field of embodiments herein;
FIG. 4 illustrates a flow chart of a classification model training process of embodiments herein;
FIG. 5 illustrates a flow chart of a classification model evaluation process of embodiments herein;
FIG. 6 illustrates a flow diagram of a target text class field tag determination process of an embodiment herein;
FIG. 7 illustrates a block diagram of a target text class field tag identification apparatus of an embodiment herein;
FIG. 8 illustrates a block diagram of a BERT model of an embodiment herein;
fig. 9 shows a block diagram of a computer device of embodiments herein.
Description of the drawings:
101. a client;
102. a server;
701. a metadata information acquisition unit;
702. a metadata information screening unit;
703. a text data acquisition unit to be identified;
704. a classification unit;
705. a tag determination unit;
902. a computer device;
904. a processor;
906. a memory;
908. a driving mechanism;
910. an input/output module;
912. an input device;
914. an output device;
916. a presentation device;
918. a graphical user interface;
920. a network interface;
922. a communication link;
924. a communication bus.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the disclosure. All other embodiments, based on the embodiments herein, which a person of ordinary skill in the art would obtain without undue burden, are within the scope of protection herein.
It should be noted that the terms "first," "second," and the like in the description and claims herein and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.
The present specification provides method operation steps as described in the embodiments or flowcharts, but more or fewer operation steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one of many possible execution orders and does not represent the only order of execution. When an actual system or apparatus product executes, it may execute sequentially or in parallel according to the methods shown in the embodiments or drawings.
It should be noted that, the method and the device for identifying the target text field tag are applicable to analysis of various text field data, and the application fields of the method and the device for identifying the target text field tag are not limited herein.
The data (including, but not limited to, data for analysis, data stored, data displayed, etc.) referred to in this application are information and data authorized by the user or sufficiently authorized by each party.
In one embodiment, a target text class field tag identification system is provided herein, as shown in fig. 1, including: client 101 and server 102.
The client 101 allows the user to set the target database and the target text. The target database may include a plurality of databases, each of which may store a plurality of data tables, each table including a plurality of fields. In a specific implementation, the user can designate the target database from among the business system databases. Taking a power grid as an example, business systems such as marketing, mining, PMS and scheduling store production and management data for different domains of the grid, and the storage formats of these data differ.
The target text includes text in different languages such as Chinese and English. In some embodiments of the present description, the client may be a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device, or the like, where smart wearable devices include smart bracelets, smart watches, smart glasses, smart helmets, etc. Of course, the client is not limited to an electronic device with a physical entity and may also be software running in an electronic device.
The server 102 is configured to obtain metadata information of the target database set by the user, screen out the metadata information corresponding to the target text class field from the obtained metadata information, and add it to the object list to be identified; obtain, according to the object list to be identified, a plurality of pieces of text data to be identified of the target text class field from the target database; classify each piece of text data to be identified using the pre-trained classification model of the target text class field to obtain the label of each piece; the classification model comprises a coding module and a classification module, wherein the coding module encodes each piece of text data to be identified to obtain its semantic vector, and the classification module classifies the semantic vector to obtain the label of the text data to be identified; and determine the label of the target text class field according to the labels of the plurality of pieces of text data to be identified.
In some embodiments, after determining the tag of the target text class field, the tag of the target text class field is also sent to the client 101, so that the user can view the recognition result.
In detail, the metadata information of a database is data that describes the data (information resources) themselves. Its purposes are: identifying resources, evaluating resources, tracking changes of resources during use, realizing simple and efficient management of large amounts of networked data, and realizing effective discovery, searching, integrated organization and management of the information resources in use.
In the prior art, the metadata of a database includes: management attributes (e.g., creator, application, business line, business owner, etc.), lifecycle (e.g., creation time, DDL time, last update time, version information, etc.), storage attributes (e.g., location, space size, physical size, etc.), data characteristics (e.g., data skew, average length, etc.), usage characteristics (e.g., DML, refresh rate, etc.), data structure tables/partitions (e.g., name, type, remarks, etc.), columns (e.g., name, type, length, precision, etc.), indexes (e.g., name, type, field, etc.), and constraints (e.g., type, field, etc.). In this application, the metadata information of the database includes: database address, database port, database identification, data table identification in the database, field identification in the data table, and field type, wherein the field types include character types (varchar2, nvarchar2, char, nchar, long), numeric types (number, float) and date types (date, timestamp). In particular implementations, the metadata information of the database also includes an application system identification.
The metadata information in the object list to be identified is processed in first-in, first-out order.
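The first-in, first-out processing of the object list can be sketched with a double-ended queue; the `db.table.column` entry format below is an illustrative assumption:

```python
from collections import deque

to_identify = deque()                # object list to be identified
to_identify.append("db1.t1.colA")    # enqueued first
to_identify.append("db2.t9.colB")

first = to_identify.popleft()        # processed first: first-in, first-out
```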
In some embodiments, the coding module is a BERT model, the classification module is a LightGBM model, and recognition of text field tags is completed by combining the BERT model and the LightGBM (Light Gradient Boosting Machine) model.
The BERT model is implemented on the basis of a bidirectional Transformer encoder; the bidirectional mechanism and the Transformer encoder are two core features of the model and also its innovations. Specifically, the bidirectional mechanism adopted in the BERT model differs from a conventional bidirectional model in that, beyond considering context, the BERT model also fuses the jointly relied-upon context information in all layer structures. Compared with an LSTM model structure, the Transformer encoder fuses and feeds back the context information of words at greater depth. Meanwhile, the Attention mechanism adopted in conventional models generally needs to rely on a recurrent neural network (RNN) or a convolutional neural network (CNN), whereas the Transformer framework is built solely on the Attention mechanism. In terms of main structure, the model follows an Encoder-Decoder organization; both the Encoder and the Decoder are multi-layer structures, connected layer to layer with residual values to transfer information. The word vectors output by the Decoder pass through a linear map that projects them into the whole dictionary space. Finally, a Softmax classifier outputs a probability distribution over dictionary words, i.e. the score of each word vector, and the word vector with the highest score is selected as the final model output.
The input representation of the BERT model consists mainly of three parts: word vectors, position vectors, and segment vectors. In particular, the head and tail of the text are marked with the two special symbols [CLS] and [SEP], which are mainly used to distinguish different sentences. The output of the model is, for each word in the text, a semantic representation that has passed through the multi-layer encoder and fused the information of its context. The BERT model structure is shown in fig. 8.
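The three-part input described above can be illustrated with a small helper that assembles [CLS]/[SEP]-delimited tokens together with their segment and position indices; this is a simplified sketch of the input layout, not the real BERT tokenizer:

```python
def bert_input(tokens_a, tokens_b=None):
    """Assemble BERT-style input: [CLS] a... [SEP] (b... [SEP]) with
    segment ids (0 for sentence A, 1 for sentence B) and position indices."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segments = [0] * len(tokens)
    if tokens_b:
        tokens += list(tokens_b) + ["[SEP]"]
        segments += [1] * (len(tokens_b) + 1)
    positions = list(range(len(tokens)))  # indices feeding the position vectors
    return tokens, segments, positions
```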
LightGBM (Light Gradient Boosting Machine) is a framework implementing the GBDT (Gradient Boosting Decision Tree) algorithm. It supports efficient parallel training and has the advantages of faster training speed, lower memory consumption, better accuracy, distributed support, and the ability to process massive data quickly. The principles of the algorithm are as follows:
(1) histogram algorithm
The basic idea of the histogram algorithm is to discretize continuous floating-point feature values into k integers while constructing a histogram of width k. When traversing the data, statistics are accumulated in the histogram using the discretized values as indices; after one traversal of the data the histogram has accumulated the needed statistics, and the optimal split point is then found by traversing the discrete values of the histogram. The histogram algorithm has many advantages, the most obvious being reduced memory consumption: it does not need to additionally store pre-sorted results and needs only to store the discretized feature values.
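A minimal sketch of the discretize-and-accumulate step described above, assuming the value range [lo, hi] is known in advance (real LightGBM chooses bin boundaries from the data):

```python
def build_histogram(values, k, lo, hi):
    """Discretize continuous values into k equal-width bins and count them."""
    hist = [0] * k
    width = (hi - lo) / k
    for v in values:
        b = min(int((v - lo) / width), k - 1)  # clamp hi into the last bin
        hist[b] += 1
    return hist

# build_histogram([0.1, 0.2, 0.9], 2, 0.0, 1.0) -> [2, 1]
```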
(2) Histogram difference acceleration of LightGBM
An easily observed phenomenon: the histogram of a leaf can be obtained as the difference between the histogram of its parent node and the histogram of its sibling. Constructing a histogram normally requires traversing all the data on that leaf, whereas the histogram difference needs only to traverse the k bins of the histogram. With this approach, after constructing the histogram of one leaf (the parent's histogram having been computed in the previous round), LightGBM can obtain the histogram of the sibling leaf at very small cost, doubling the speed.
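The histogram difference is then a single pass over the k bins rather than over the leaf's data; a count-only sketch (real LightGBM also subtracts gradient sums per bin):

```python
def sibling_histogram(parent_hist, leaf_hist):
    """Histogram difference: sibling = parent - leaf, one pass over k bins."""
    return [p - l for p, l in zip(parent_hist, leaf_hist)]
```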
(3) Leaf-wise Leaf growth strategy with depth limitation
Level-wise growth traverses the data once and can split all leaves of the same layer simultaneously; it is easy to optimize with multithreading, controls model complexity well, and is not prone to overfitting. In practice, however, Level-wise is an inefficient algorithm because it treats the leaves of the same layer indiscriminately, introducing much unnecessary overhead: many leaves have low splitting gain and need not be searched and split at all.
Leaf-wise is a more efficient strategy: each time, it finds the one leaf with the greatest splitting gain among all current leaves, splits it, and repeats. Compared with Level-wise, Leaf-wise therefore reduces more error and obtains better precision for the same number of splits. Its disadvantage is that it may grow a rather deep decision tree, causing overfitting; LightGBM therefore adds a maximum depth limit on top of Leaf-wise, preventing overfitting while maintaining high efficiency.
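The depth-limited Leaf-wise selection can be sketched as follows; representing a leaf as a (gain, depth) pair is an illustrative simplification:

```python
def pick_leaf(leaves, max_depth):
    """Leaf-wise growth: among leaves still below the depth limit,
    pick the one with the largest splitting gain; None if none qualify.

    Each leaf is a (gain, depth) pair; leaves at max_depth are excluded,
    which is how the depth limit curbs overfitting."""
    eligible = [leaf for leaf in leaves if leaf[1] < max_depth]
    return max(eligible, key=lambda leaf: leaf[0]) if eligible else None
```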
(4) Direct support of category features (i.e. without one-hot coding)
In fact, most machine learning tools cannot directly handle categorical features; such features generally must be converted into multidimensional one-hot encoded features, which reduces space and time efficiency, yet categorical features are common in practice. Based on this consideration, LightGBM optimizes its support for categorical features, which can be input directly without additional one-hot expansion, and adds decision rules for categorical features to the decision tree algorithm.
(5) Direct support of efficient parallelism
LightGBM also has the advantage of supporting efficient parallelism: it natively supports parallel learning, currently offering both feature parallelism and data parallelism.
First, the main idea of feature parallelism is to find optimal split points on different feature sets on different machines, and then synchronize the optimal split points among the machines.
Second, data parallelism lets different machines first construct histograms locally, then performs a global merge, and finally finds the optimal split point on the merged histogram.
According to the method and apparatus herein, the text data to be identified containing the target text can be screened out by analyzing the metadata information of the target database; filtering out the data of non-target text fields saves label identification time and improves identification efficiency. The text data to be identified of the target text class field is classified by the pre-trained classification model of the target text class field to obtain the label of each piece of text data to be identified, and the label of the target text class field is then determined from the labels of the plurality of pieces of text data to be identified, which further improves the identification efficiency of the target text class field. In implementation, the target text can be specified by the user, so the method also meets personalized user requirements, such as performing label identification on Chinese text or on English text. In summary, the method is suitable for analyzing massive and complex text field data, improves the efficiency of data analysis, and provides technical support for users to quickly understand and master their data assets and advance enterprise digital transformation.
In one embodiment herein, as shown in fig. 2, there is provided a target text class field tag identification method, including:
step 201, obtaining metadata information of a target database.
Step 202, the metadata information corresponding to the target text field is screened from the obtained metadata information and added to the object list to be identified.
Step 203, according to the object list to be identified, a plurality of pieces of text data to be identified of the target text class field are obtained from the target database.
Specifically, in order to improve recognition accuracy, step 203 further includes, after the text data to be identified is obtained: deleting irrelevant information from the obtained pieces of text data to be identified. Irrelevant information includes, but is not limited to, characters such as line feeds and spaces.
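The cleanup in this step might be sketched as follows. This is a minimal example; treating all whitespace (line feeds, tabs, spaces) as the irrelevant characters is an assumption based on the examples named above:

```python
import re

def clean_text(text):
    """Remove irrelevant characters (newlines, tabs, spaces) from one piece
    of text data to be identified before classification."""
    return re.sub(r"\s+", "", text)

samples = ["  State Grid\n HQ ", "label\ttext\r\n"]
cleaned = [clean_text(s) for s in samples]
print(cleaned)  # → ['StateGridHQ', 'labeltext']
```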
Step 204, classifying each piece of text data to be identified of the target text class field by using a pre-trained classification model of the target text class field, to obtain a label of each piece of text data to be identified; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified.
Step 205, determining the label of the target text class field according to the labels of the plurality of text data to be identified of the target text class field.
Further, as shown in fig. 3, step 202 of screening metadata information corresponding to the target text field from the obtained metadata information and adding the metadata information to the object list to be identified includes:
step 301, the metadata information with the field type being a character string is screened from the acquired metadata information.
Step 302, for each piece of the screened metadata information, performing the following analysis:
in step 3021, a JDBC driver is used to establish a connection to a data source corresponding to the filtered metadata information.
Step 3022, query data over the connection using the SQL language.
Step 3023, counting the data amount containing the target text from the queried data.
When this step is implemented, a regular expression is used to determine whether the queried data contains the target text.
Step 3024, judging whether the proportion of data containing the target text is larger than a predetermined value. If so, the data corresponding to the screened metadata information is valid and no large-area abnormality has occurred, so the screened metadata information is taken as target text metadata information and added to the object list to be identified. If not, the data corresponding to the screened metadata information is invalid (a large-area abnormality has occurred), and the screened metadata information is discarded.
In this step, the predetermined value is, for example, 80%; it can be set according to actual requirements and is not limited herein.
According to this embodiment, by screening string-type data, metadata information that does not contain the target text due to abnormal storage can be filtered out, and metadata information that does contain the target text can be accurately determined.
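Steps 3021 to 3024 can be sketched end to end as follows. This is a minimal sketch using Python's built-in sqlite3 in place of a JDBC data source, with a Chinese-character pattern as a hypothetical target-text regular expression and the 80% figure above as the predetermined value:

```python
import re
import sqlite3

TARGET_PATTERN = re.compile(r"[\u4e00-\u9fa5]")  # hypothetical: Chinese chars
PREDETERMINED_VALUE = 0.8                        # the 80% threshold above

def field_contains_target_text(conn, table, field):
    """Query a string field over the connection and decide whether enough
    rows contain the target text for the field's metadata to be added to
    the object list to be identified (step 3024)."""
    # Illustration only; identifiers should be validated in real code.
    rows = conn.execute(f"SELECT {field} FROM {table}").fetchall()
    if not rows:
        return False
    hits = sum(1 for (value,) in rows
               if value and TARGET_PATTERN.search(value))
    return hits / len(rows) > PREDETERMINED_VALUE

# Toy in-memory data source standing in for the JDBC connection (step 3021).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (addr TEXT)")
conn.executemany("INSERT INTO t VALUES (?)",
                 [("北京市",), ("上海市",), ("天津市",),
                  ("河北省",), ("山东省",), ("n/a",)])
conn.commit()
print(field_contains_target_text(conn, "t", "addr"))  # → True (5/6 ≈ 83%)
```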
In one embodiment herein, as shown in FIG. 4, the classification model training process includes:
step 401, determining the output label of the target text type and the classification model according to the service requirement.
In a specific implementation, the output tags may be set according to actual requirements. In one embodiment, the tags include: other, event, organization, and address, denoted by 0, 1, 2, and 3, respectively.
Step 402, according to the output labels of the classification model, input text data corresponding to each output label and input text data of non-output labels are obtained from the database.
Step 403, performing data checking and sample balancing processing on the obtained data to obtain a final training data set, wherein the training data set comprises a plurality of samples, and each sample comprises input text data and an output label.
Data checking ensures the accuracy of each piece of data and provides high-quality training data for model training. Sample balancing avoids the poor model performance caused by class imbalance. In practice, the ratio of data sizes between any two tags may be constrained to the interval 1:1 to 1:1.5.
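The balancing constraint just mentioned (per-tag data sizes within roughly 1:1 to 1:1.5) can be sketched by downsampling the larger classes; the ratio cap, the random seed, and the label/sample values here are illustrative assumptions:

```python
import random

def balance_samples(samples_by_label, max_ratio=1.5, seed=42):
    """Downsample each label so no label exceeds max_ratio times the
    smallest label's data size (keeps ratios within 1:1 .. 1:1.5)."""
    rng = random.Random(seed)
    smallest = min(len(v) for v in samples_by_label.values())
    cap = int(smallest * max_ratio)
    return {label: rng.sample(items, min(len(items), cap))
            for label, items in samples_by_label.items()}

# Toy training data: label -> list of input text samples (stand-in ints).
data = {0: list(range(100)), 1: list(range(40)), 2: list(range(90))}
balanced = balance_samples(data)
print({k: len(v) for k, v in balanced.items()})  # → {0: 60, 1: 40, 2: 60}
```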
Step 404, training a pre-established classification model using the training data set.
Further, as shown in fig. 5, in order to ensure the accuracy of the classification model, after the embodiment shown in fig. 4 trains the classification model, the method further includes:
step 501, a classification model is evaluated using a test set.
Step 502, calculating an evaluation index according to the evaluation result, wherein the evaluation index comprises: accuracy, precision and recall, and a comprehensive index F1 value.
Here, accuracy refers to the number of correct label identifications in the test set divided by the total number of samples in the test set. Precision refers to the ratio of the number of labels correctly determined by the classifier to the total number of labels determined by the classifier. Recall is the ratio of the number of labels correctly determined by the classifier to the total number of true labels. The F1 value is the harmonic mean of precision and recall, with a maximum value of 1 and a minimum value of 0. The specific calculation formulas are as follows:
accuracy rate:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
precision ratio:
Precision = TP / (TP + FP)
recall rate:
Recall = TP / (TP + FN)
F1 value:
F1 = 2 × Precision × Recall / (Precision + Recall)
TP, FN, FP, and TN in the above formulas are defined in Table 1 below:
TABLE 1
                     Predicted positive      Predicted negative
Actual positive      TP (true positive)      FN (false negative)
Actual negative      FP (false positive)     TN (true negative)
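The four evaluation indices can be computed directly from the confusion-matrix counts; a minimal sketch with toy counts (the numbers are illustrative, not the embodiment's results):

```python
def evaluation_indices(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and the comprehensive F1 value
    from confusion-matrix counts (Table 1)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Toy counts for illustration only.
acc, p, r, f1 = evaluation_indices(tp=80, fn=10, fp=4, tn=6)
print(round(acc, 3), round(p, 3), round(r, 3), round(f1, 3))
```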
If the evaluation index is not satisfied, step 503 is performed: the classification model is adjusted and retrained until a classification model satisfying the evaluation index is obtained.
After the index evaluation in steps 501 to 503 is completed, the final classification model is persisted, providing model support for subsequent text field label recognition.
In one embodiment, based on real data, a training set and a test set are formed through data processing and labeling. After training, the model achieves an accuracy of 0.940 on the test set, a precision of 0.951, a recall of 0.888, and a comprehensive F1 value of 0.915.
In one embodiment, as shown in fig. 6, step 205 of determining the tag of the target text class field according to the tags of the plurality of text data to be identified of the target text class field includes:
step 601, counting the data amount of the text data to be identified of each label according to the labels of the text data to be identified of the target text class field.
Step 602, taking the label with the highest data volume as the label of the target text class field.
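Steps 601 and 602 amount to a majority vote over the per-row labels, which can be sketched with the standard library (the per-row labels below are hypothetical, using the example tag set 0–3 above):

```python
from collections import Counter

def field_label(row_labels):
    """Return the tag with the highest data volume among the labels of the
    pieces of text data to be identified for one target text class field."""
    return Counter(row_labels).most_common(1)[0][0]

# Hypothetical per-row labels (3 = address, per the example tag set above).
rows = [3, 3, 1, 3, 2, 3, 0, 3]
print(field_label(rows))  # → 3
```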
In an embodiment of the present disclosure, after the tag of the target text class field is identified, the text attribute corresponding to the tag is determined from a predetermined association between tags and text attributes, and the tag and text attribute of the target text class field are sent to the client, so that the user can determine whether the content corresponding to the target text field is useful by checking them.
By identifying the target text type field label, the attribute and the usefulness of the data content in the target text type field can be determined through the label, and a technical basis is provided for subsequent data analysis.
Based on the same inventive concept, a target text class field tag recognition device is also provided herein, as described in the following embodiments. Because the principle by which the target text class field tag recognition device solves the problem is similar to that of the target text class field tag recognition method, the implementation of the device can refer to that of the method; repeated description is omitted.
Specifically, as shown in fig. 7, the target text class field tag recognition apparatus includes:
a metadata information acquisition unit 701, configured to acquire metadata information of a target database;
a metadata information screening unit 702, configured to screen metadata information corresponding to the target text field from the acquired metadata information and add the metadata information to the to-be-identified object list;
a text data to be identified obtaining unit 703, configured to obtain, from the target database, a plurality of pieces of text data to be identified in the target text class field according to the list of objects to be identified;
A classification unit 704, configured to classify each piece of text data to be identified in the target text class field by using a pre-trained classification model of the target text class field, so as to obtain a label of each piece of text data to be identified; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
the tag determining unit 705 is configured to determine a tag of the target text class field according to tags of a plurality of text data to be identified of the target text class field.
In an embodiment herein, a computer device is also provided, as shown in fig. 9, the computer device 902 may include one or more processors 904, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 902 may also comprise any memory 906 for storing any kind of information, such as code, settings, data, etc., in some embodiments a computer program is stored in the memory 906, which, when executed in the processor 904, implements the method described in any of the previous embodiments. For example, and without limitation, the memory 906 may include any one or more of the following combinations: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may store information using any technique. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 902. In one case, when the processor 904 executes associated instructions stored in any memory or combination of memories, the computer device 902 can perform any of the operations of the associated instructions. The computer device 902 also includes one or more drive mechanisms 908 for interacting with any memory, such as a hard disk drive mechanism, optical disk drive mechanism, and the like.
The computer device 902 may also include an input/output module 910 (I/O) for receiving various inputs (via an input device 912) and for providing various outputs (via an output device 914). One particular output mechanism may include a presentation device 916 and an associated graphical user interface 918 (GUI). In other embodiments, input/output module 910 (I/O), input device 912, and output device 914 may not be included, but merely as a computer device in a network. The computer device 902 may also include one or more network interfaces 920 for exchanging data with other devices via one or more communication links 922. One or more communication buses 924 couple the above-described components together.
The communication link 922 may be implemented in any manner, for example, through a local area network, a wide area network (e.g., the internet), a point-to-point connection, etc., or any combination thereof. Communication link 922 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Embodiments herein also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
Embodiments herein also provide computer readable instructions, wherein, when executed by a processor, the instructions cause the processor to perform the above method.
It should be understood that, in the various embodiments herein, the sequence number of each process described above does not mean the sequence of execution, and the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments herein.
It should also be understood that, in embodiments herein, the term "and/or" merely describes an association relationship between objects, indicating that three relationships may exist. For example, "A and/or B" may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided herein, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the elements may be selected according to actual needs to achieve the objectives of the embodiments herein.
In addition, each functional unit in the embodiments herein may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions herein — essentially, the portions contributing to the prior art, or all or part of the technical solutions — may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments herein. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Specific examples are set forth herein to illustrate the principles and embodiments herein; the above description is merely intended to aid understanding of the methods herein and their core ideas. Meanwhile, for those of ordinary skill in the art, there may be variations in the specific embodiments and the scope of application according to the ideas herein. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. A method for identifying a target text field tag, comprising:
acquiring metadata information of a target database;
screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information into an object list to be identified;
according to the object list to be identified, acquiring a plurality of pieces of text data to be identified of a target text field from a target database;
classifying each piece of text data to be identified of the target text class field by utilizing a pre-trained classification model of the target text class field to obtain a label of each piece of text data to be identified of the target text class field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
And determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field.
2. The method of claim 1, wherein the metadata information of the target database comprises: database address, database port, database identification, data table identification in database, field identification in data table and field type.
3. The method of claim 2, wherein the step of screening the metadata information corresponding to the target text class field from the acquired metadata information and adding the metadata information to the object list to be identified includes:
screening metadata information with a field type of a character string from the acquired metadata information;
for each piece of the screened metadata information, the following analysis is performed:
adopting a JDBC driver to establish the connection of a data source corresponding to the screened metadata information;
querying data through the connection by using SQL language;
counting the data quantity containing the target text from the inquired data;
and judging whether the data amount of the target text is larger than a preset value, if so, taking the screened metadata information as target text metadata information and adding the target text metadata information into an object list to be identified.
4. The method of claim 1, wherein the encoding module is a BERT model, the classification module is a LightGBM model, and the classification model training process comprises:
determining the output label of the target text type and the classification model according to the service demand;
according to the output labels of the classification model, input text data corresponding to each output label and input text data of non-output labels are obtained from a database;
performing data checking and sample balancing processing on the acquired data to obtain a final training data set, wherein the training data set comprises a plurality of samples, and each sample comprises input text data and an output label;
training a pre-established classification model with a training dataset.
5. The method of claim 4, wherein after training is completed, further comprising:
evaluating the classification model by using the test set;
calculating an evaluation index according to the evaluation result, wherein the evaluation index comprises: accuracy, precision and recall, and comprehensive index F1 value;
if the evaluation index is not satisfied, the classification model is adjusted and then retraining is carried out until the classification model satisfying the evaluation index is obtained.
6. The method of claim 1, wherein the step of obtaining the plurality of pieces of text data to be recognized in the target text class field from the target database according to the list of objects to be recognized further comprises:
And deleting irrelevant information in the acquired pieces of text data to be identified.
7. The method of claim 1, wherein determining the tag of the target text class field based on the tags of the plurality of text data to be identified of the target text class field comprises:
counting the data quantity of the text data to be identified of each label according to the labels of the text data to be identified of the target text field;
and taking the label with the highest data volume as the label of the target text type field.
8. A target text class field tag recognition apparatus, comprising:
a metadata information acquisition unit configured to acquire metadata information of a target database;
the metadata information screening unit is used for screening metadata information corresponding to the target text type field from the acquired metadata information and adding the metadata information into the object list to be identified;
the text data acquisition unit to be identified is used for acquiring a plurality of pieces of text data to be identified of the target text type field from the target database according to the object list to be identified;
the classification unit is used for classifying each piece of text data to be identified of the target text class field by utilizing a pre-trained classification model of the target text class field to obtain a label of each piece of text data to be identified of the target text class field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
And the label determining unit is used for determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 7 when executing the computer program.
10. A computer storage medium having stored thereon a computer program, which when executed by a processor of a computer device implements the method of any of claims 1 to 7.
CN202310096604.XA 2023-01-19 2023-01-19 Target text field label identification method and device Pending CN116166801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310096604.XA CN116166801A (en) 2023-01-19 2023-01-19 Target text field label identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310096604.XA CN116166801A (en) 2023-01-19 2023-01-19 Target text field label identification method and device

Publications (1)

Publication Number Publication Date
CN116166801A true CN116166801A (en) 2023-05-26

Family

ID=86415936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310096604.XA Pending CN116166801A (en) 2023-01-19 2023-01-19 Target text field label identification method and device

Country Status (1)

Country Link
CN (1) CN116166801A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination