CN116166801A - Target text field label identification method and device - Google Patents
- Publication number
- CN116166801A (application number CN202310096604.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- identified
- text
- target text
- field
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a target text field tag identification method and device, wherein the method comprises the following steps: acquiring metadata information of a target database; screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information into an object list to be identified; according to the object list to be identified, acquiring a plurality of pieces of text data to be identified of a target text field from a target database; classifying each text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each text data to be identified of the target text field; and determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field. The method is suitable for analyzing massive and complex text field data, and can improve the efficiency of data analysis.
Description
Technical Field
The present disclosure relates to the field of data identification, and in particular, to a method and apparatus for identifying a target text field tag.
Background
With the vigorous development of technologies such as the Internet, big data, artificial intelligence, and digital twins, enterprises face great challenges in managing and applying massive data. In particular, as enterprises push forward digital transformation, data users are no longer limited to professional data analysts, data warehouse engineers, or database administrators; increasingly, business personnel without a technical background need to explore data flexibly, in a manner they can understand. This data exploration task amounts to identifying data type labels.
However, within this data exploration task, identifying labels for text class field data is comparatively complex. Specifically, data are either scattered across the databases of various business systems or gathered into data warehouses and data lakes, so the attributes, relationships, and types of text data are complex and varied; moreover, most data are directly tied to the business, so understanding the text data requires a certain amount of business knowledge. Existing text fields may also contain errors, for example a single field storing data with different labels. How to analyze massive text field data quickly and intelligently, and thereby help users read and understand the state of their data comprehensively and quickly, is therefore one of the core problems in the current enterprise digital transformation process.
In the prior art, one text data tag identification approach matches text data using regular expressions or fuzzy search; this approach suffers from a low identification rate and poor flexibility when facing complex text data.
Another text data tag identification approach searches text data by means of natural language processing. Because it performs identification based on static general-domain word vectors, it places high demands on the lexicon that stores those word vectors and yields a poor identification effect.
In addition, text data stored in a database may contain errors, for example digits stored in fields intended for Chinese or English text. Existing text data tag identification methods do not verify the data actually stored in the database, so when data are stored abnormally, identification efficiency is low and computing resources are wasted. The prior art also does not recognize the text type (such as Chinese text or English text), so targeted tag identification cannot be performed on target text class fields.
Disclosure of Invention
The present disclosure aims to solve the problems in the prior art that, when labels of text field data in a database are identified, identification efficiency is low, the identification effect is poor, computing resources are wasted, and identification cannot be tailored to user requirements.
In order to solve the above technical problem, an aspect herein provides a target text class field tag identification method, including:
acquiring metadata information of a target database;
screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information into an object list to be identified;
according to the object list to be identified, acquiring a plurality of pieces of text data to be identified of a target text field from a target database;
classifying each piece of text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each piece of text data to be identified of the target text field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
and determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field.
As a further embodiment herein, the metadata information of the target database includes: database address, database port, database identification, data table identification in database, field identification in data table and field type.
In a further embodiment, the step of screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information to the to-be-identified object list includes:
screening metadata information with a field type of a character string from the acquired metadata information;
For each piece of the screened metadata information, the following analysis is performed:
adopting a JDBC driver to establish the connection of a data source corresponding to the screened metadata information;
querying data through the connection by using SQL language;
counting the data quantity containing the target text from the inquired data;
and judging whether the data amount of the target text is larger than a preset value, if so, taking the screened metadata information as target text metadata information and adding the target text metadata information into an object list to be identified.
As a further embodiment herein, the coding module is a BERT model, the classification module is a LightGBM model, and the classification model training process includes:
determining the target text type and the output labels of the classification model according to the business requirements;
according to the output labels of the classification model, input text data corresponding to each output label and input text data of non-output labels are obtained from a database;
performing data checking and sample balancing processing on the acquired data to obtain a final training data set, wherein the training data set comprises a plurality of samples, and each sample comprises input text data and an output label;
training a pre-established classification model with a training dataset.
As a further embodiment herein, after the training of the classification model is completed, the method further includes:
evaluating the classification model by using the test set;
calculating an evaluation index according to the evaluation result, wherein the evaluation index comprises: accuracy, precision and recall, and comprehensive index F1 value;
if the evaluation index is not satisfied, the classification model is adjusted and then retraining is carried out until the classification model satisfying the evaluation index is obtained.
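The evaluation indices listed above can be sketched in a few lines. The following is a minimal illustration, not the patent's implementation: the function name `evaluate` and the choice to compute precision, recall, and F1 for a single label treated as the positive class are assumptions made for the example.

```python
def evaluate(y_true, y_pred, positive):
    """Return accuracy over all samples and precision/recall/F1 for one label.

    `positive` is the label treated as the positive class (an assumption of
    this sketch; the source does not say how multi-class metrics are averaged).
    """
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    predicted_pos = sum(p == positive for p in y_pred)
    actual_pos = sum(t == positive for t in y_true)

    accuracy = correct / len(pairs)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

A caller could compare the returned values against whatever thresholds the business requires and retrain when any of them falls short.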
In a further embodiment, according to the object to be identified list, after obtaining the plurality of pieces of text data to be identified in the target text class field from the target database, the method further includes:
and deleting irrelevant information in the acquired pieces of text data to be identified.
As a further embodiment herein, determining the tag of the target text class field from the tags of the plurality of text data to be identified of the target text class field includes:
counting the data quantity of the text data to be identified of each label according to the labels of the text data to be identified of the target text field;
and taking the label with the highest data volume as the label of the target text type field.
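The two steps above amount to a majority vote over the per-record labels. A minimal sketch follows; the function name `field_label` is hypothetical, and the tie-breaking behaviour (the label seen first wins) is an artifact of `Counter.most_common`, not something the source specifies.

```python
from collections import Counter


def field_label(record_labels):
    """Return the label carried by the largest number of records."""
    if not record_labels:
        raise ValueError("no labels to aggregate")
    counts = Counter(record_labels)          # label -> number of records
    label, _count = counts.most_common(1)[0]
    return label
```

For example, `field_label([3, 3, 0, 1, 3])` yields `3`, so the field as a whole would be tagged with label 3 even though a few records were classified differently.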
A second aspect of the present invention provides a target text class field tag identifying apparatus, comprising:
A metadata information acquisition unit configured to acquire metadata information of a target database;
the metadata information screening unit is used for screening metadata information corresponding to the target text type field from the acquired metadata information and adding the metadata information into the object list to be identified;
the text data acquisition unit to be identified is used for acquiring a plurality of pieces of text data to be identified of the target text type field from the target database according to the object list to be identified;
the classification unit is used for classifying each piece of text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each piece of text data to be identified of the target text field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
and the label determining unit is used for determining the labels of the target text type field according to the labels of the plurality of text data to be identified of the target text type field.
A third aspect herein provides a computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of the preceding embodiments when the computer program is executed.
A fourth aspect herein provides a computer storage medium having stored thereon a computer program which, when executed by a processor of a computer device, implements a method as described in any of the previous embodiments.
According to the target text class field tag identification method and device provided herein, the metadata information corresponding to target text class fields can be screened out by analyzing the metadata information of the target database; filtering out the data of non-target text class fields saves tag identification time and improves identification efficiency. Each piece of text data to be identified of the target text class field is classified by the pre-trained classification model of the target text class field to obtain its label, and the label of the target text class field is determined from the labels of the plurality of pieces of text data to be identified, which improves the identification efficiency for the target text class field. In implementation, the target text can be specified by the user, so that personalized user requirements can be met, such as tag identification for Chinese text or for English text. In summary, the method is suitable for analyzing massive and complex text field data, improves the efficiency of data analysis, and provides technical support for users to quickly understand and master data assets and to advance enterprise digital transformation.
The foregoing and other objects, features and advantages will be apparent from the following more particular description of preferred embodiments, as illustrated in the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments herein or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments herein and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 illustrates a block diagram of a target text class field tag identification system of embodiments herein;
FIG. 2 illustrates a flow chart of a target text class field tag identification method of embodiments herein;
FIG. 3 is a flow chart illustrating a metadata information determination process corresponding to a target text class field of embodiments herein;
FIG. 4 illustrates a flow chart of a classification model training process of embodiments herein;
FIG. 5 illustrates a flow chart of a classification model evaluation process of embodiments herein;
FIG. 6 illustrates a flow diagram of a target text class field tag determination process of an embodiment herein;
FIG. 7 illustrates a block diagram of a target text class field tag identification apparatus of an embodiment herein;
FIG. 8 illustrates a block diagram of a BERT model of an embodiment herein;
FIG. 9 illustrates a block diagram of a computer device of embodiments herein.
Description of reference numerals:
101. a client;
102. a server;
701. a metadata information acquisition unit;
702. a metadata information screening unit;
703. a text data acquisition unit to be identified;
704. a classification unit;
705. a tag determination unit;
902. a computer device;
904. a processor;
906. a memory;
908. a driving mechanism;
910. an input/output module;
912. an input device;
914. an output device;
916. a presentation device;
918. a graphical user interface;
920. a network interface;
922. a communication link;
924. a communication bus.
Detailed Description
The following description of the embodiments of the present disclosure will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the disclosure. All other embodiments, based on the embodiments herein, which a person of ordinary skill in the art would obtain without undue burden, are within the scope of protection herein.
It should be noted that the terms "first," "second," and the like in the description and claims herein and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, apparatus, article, or device that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or device.
The present specification provides method operational steps as described in the examples or flowcharts, but may include more or fewer operational steps based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When a system or apparatus product in practice is executed, it may be executed sequentially or in parallel according to the method shown in the embodiments or the drawings.
It should be noted that, the method and the device for identifying the target text field tag are applicable to analysis of various text field data, and the application fields of the method and the device for identifying the target text field tag are not limited herein.
The data (including, but not limited to, data for analysis, data stored, data displayed, etc.) referred to in this application are information and data authorized by the user or sufficiently authorized by each party.
In one embodiment, a target text class field tag identification system is provided herein, as shown in fig. 1, including: client 101 and server 102.
The client 101 is operated by the user to set the target database and the target text. The target database may include a plurality of databases, each of which may store a plurality of data tables, each of which includes a plurality of fields. In a specific implementation, the user can specify the target database from among the business system databases. Taking a power grid as an example, business systems such as marketing, mining, PMS, and scheduling each store production and management data for different areas of the power grid, and the storage formats of the data differ.
The target text includes text in different languages, such as Chinese and English. In some embodiments of the present description, the client may be a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, a smart wearable device, or the like. The smart wearable device may include a smart bracelet, a smart watch, smart glasses, a smart helmet, and so on. Of course, the client is not limited to a physical electronic device and may also be software running on an electronic device.
The server 102 is configured to obtain metadata information of the target database set by the user, screen metadata information corresponding to the target text field from the obtained metadata information, and add the metadata information to a to-be-identified object list; according to the object list to be identified, acquire a plurality of pieces of text data to be identified of the target text field from the target database; classify each piece of text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each piece of text data to be identified of the target text field, wherein the classification model comprises a coding module and a classification module, the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified; and determine the label of the target text type field according to the labels of the plurality of pieces of text data to be identified of the target text type field.
In some embodiments, after determining the tag of the target text class field, the tag of the target text class field is also sent to the client 101, so that the user can view the recognition result.
In detail, the metadata information of a database is data that describes objects such as information resources or data. Its purposes include identifying resources, evaluating resources, tracking changes to resources during use, managing large amounts of networked data simply and efficiently, and enabling effective discovery, search, integrated organization, and management of information resources.
In the prior art, the metadata of a database includes: management attributes (e.g., creator, application, business line, business owner, etc.), lifecycle (e.g., creation time, DDL time, last update time, version information, etc.), storage attributes (e.g., location, space size, physical size, etc.), data characteristics (e.g., data skew, average length, etc.), usage characteristics (e.g., DML, refresh rate, etc.), data structure tables/partitions (e.g., name, type, remarks, etc.), columns (e.g., name, type, length, precision, etc.), indexes (e.g., name, type, field, etc.), and constraints (e.g., type, field, etc.). In this application, the metadata information of the database includes: database address, database port, database identification, data table identification in the database, field identification in the data table, and field type, wherein the field type comprises character types (varchar2, nvarchar2, char, nchar, long), numeric types (number, float), and date types (date, timestamp). In particular implementations, the metadata information of the database also includes an application system identification.
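To make the metadata record concrete, the following minimal sketch models the six items listed above and keeps only the entries whose field type is a character type. The class name `FieldMetadata`, its attribute names, and the helper `string_fields` are hypothetical names chosen for the example; only the list of character type names comes from the text.

```python
from dataclasses import dataclass

# Character (string) field types as enumerated in the description.
STRING_TYPES = {"varchar2", "nvarchar2", "char", "nchar", "long"}


@dataclass(frozen=True)
class FieldMetadata:
    db_address: str   # database address
    db_port: int      # database port
    db_id: str        # database identification
    table_id: str     # data table identification
    field_id: str     # field identification
    field_type: str   # e.g. "varchar2", "number", "date"


def string_fields(records):
    """Keep only metadata records whose field type is a character type."""
    return [r for r in records if r.field_type.lower() in STRING_TYPES]
```

Records that pass this filter are only candidates; per the description they still have to be checked against the stored data before entering the object list to be identified.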
The metadata information in the object list to be identified is executed in a first-in first-out order.
In some embodiments, the coding module is a BERT model, the classification module is a LightGBM model, and recognition of text field tags is completed by combining the BERT model and the LightGBM (Light Gradient Boosting Machine) model.
The BERT model is built on a bidirectional Transformer encoder; bidirectionality and the Transformer encoder are the two core features of the model and also its innovations. Specifically, the bidirectional mechanism in BERT differs from conventional bidirectional models, which only combine left and right context shallowly: BERT fuses the left and right context jointly in all layers. Compared with the LSTM structure, the Transformer encoder fuses and feeds back the context information of words more deeply. Moreover, the attention mechanism in conventional models generally depends on a recurrent neural network (RNN) or a convolutional neural network (CNN), whereas the Transformer framework is built on the attention mechanism alone. Structurally, the Transformer framework on which BERT is based follows an Encoder-Decoder organization; both the Encoder and the Decoder are multi-layer structures whose layers are linked by residual connections. The word vector output by the Decoder is passed through a linear mapping onto the whole dictionary space. Finally, a Softmax classifier outputs the probability distribution over dictionary words, that is, the score of each word, and the word with the highest score is selected as the final model output.
The input representation of the BERT model consists mainly of three parts: word vectors, position vectors, and segment vectors. The special symbols [CLS] and [SEP] mark the head and tail of the text and are mainly used to distinguish different sentences. The output of the model is the semantic representation of each word in the text after it has passed through the multi-layer encoder and fused information from its context. The BERT model structure is shown in fig. 8.
LightGBM (Light Gradient Boosting Machine) is a framework that implements the GBDT (Gradient Boosting Decision Tree) algorithm. It supports efficient parallel training and offers faster training speed, lower memory consumption, better accuracy, support for distributed training, and the ability to process massive data quickly. The principles of the algorithm are as follows:
(1) histogram algorithm
The basic idea of the histogram algorithm is to first discretize the continuous floating-point feature values into k integers and, at the same time, construct a histogram with k bins. While traversing the data, statistics are accumulated in the histogram using the discretized value as the index; after one pass over the data, the histogram holds all the required statistics, and the optimal split point is then found by traversing the discretized values of the histogram. The histogram algorithm has many advantages. The most obvious is reduced memory consumption: the algorithm does not need to store the pre-sorted result and only needs to store the discretized feature values.
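The idea can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions, not LightGBM's actual implementation: equal-width binning, the per-bin statistics (gradient sum and count), and the score `G_L^2/n_L + G_R^2/n_R` are choices made for the example, and the function names are hypothetical.

```python
def build_histogram(values, gradients, k):
    """Discretize `values` into k equal-width bins and accumulate,
    per bin, the sum of gradients and the number of samples."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0          # guard against a constant feature
    grad_sum = [0.0] * k
    count = [0] * k
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), k - 1)   # clip the maximum into bin k-1
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split(grad_sum, count):
    """Scan the k-1 bin boundaries and return the bin index after which
    splitting maximizes G_L^2/n_L + G_R^2/n_R (None if no valid split)."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_score = None, float("-inf")
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_g = total_g - left_g
        score = left_g ** 2 / left_n + right_g ** 2 / right_n
        if score > best_score:
            best_bin, best_score = b, score
    return best_bin
```

Note that the split search touches only the k bins, not the individual samples, which is where the speed and memory benefits described above come from.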
(2) Histogram difference acceleration of LightGBM
One easily observed fact is that the histogram of a leaf can be obtained as the difference between the histogram of its parent node and the histogram of its sibling. Constructing a histogram normally requires traversing all the data on the leaf, whereas histogram subtraction only needs to traverse the k bins of the histogram. With this method, after constructing the histogram of one leaf (the histogram of the parent node was already computed in the previous round), LightGBM can obtain the histogram of its sibling leaf at very small cost, roughly doubling the speed.
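A short sketch of the subtraction trick, assuming samples have already been mapped to bin indices; the helper names `count_histogram` and `sibling_histogram` are hypothetical and only per-bin counts are tracked for brevity.

```python
def count_histogram(bin_indices, k):
    """Count how many samples fall into each of the k bins (one pass over data)."""
    hist = [0] * k
    for b in bin_indices:
        hist[b] += 1
    return hist


def sibling_histogram(parent_hist, child_hist):
    """Derive the sibling's histogram from parent and one child in O(k),
    without touching the sibling's samples."""
    return [p - c for p, c in zip(parent_hist, child_hist)]
```

In practice the directly constructed histogram would typically be the one for the smaller child, so the larger child's histogram comes almost for free.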
(3) Leaf-wise Leaf growth strategy with depth limitation
A Level-wise strategy can split all leaves of the same layer in one pass over the data, which makes multithreaded optimization easy, keeps model complexity under control, and is not prone to overfitting. In practice, however, Level-wise is inefficient because it treats all leaves on the same level indiscriminately, which introduces much unnecessary overhead: many leaves have low splitting gain and need not be searched or split.
Leaf-wise is a more efficient strategy: each time, it finds the leaf with the greatest splitting gain among all current leaves, splits it, and repeats. Therefore, compared with Level-wise, Leaf-wise can reduce more error and obtain better precision for the same number of splits. The disadvantage of Leaf-wise is that it may grow a relatively deep decision tree, resulting in overfitting. LightGBM therefore adds a maximum depth limit on top of Leaf-wise, preventing overfitting while preserving high efficiency.
(4) Direct support of category features (i.e. without one-hot coding)
In fact, most machine learning tools cannot directly support category features; such features generally have to be converted into multidimensional one-hot encoded features, which reduces space and time efficiency. Yet category features are common in practice. For this reason, LightGBM optimizes its support for category features: they can be input directly without additional one-hot expansion, and decision rules for category features are added to the decision tree algorithm.
(5) Direct support of efficient parallelism
LightGBM also has the advantage of supporting efficient parallelism. LightGBM natively supports parallel learning and currently supports both feature parallelism and data parallelism.
First, the main idea of feature parallelism is to find the optimal segmentation points on different feature sets on different machines, and then synchronize the optimal segmentation points among the machines.
Secondly, the data parallelism is to make different machines construct histograms locally first, then to carry out global merging and finally to find the optimal dividing point on the merged histograms.
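The merge step of data parallelism can be sketched as an element-wise sum of the locally built histograms. This single-process illustration stands in for the cross-machine communication, which is not modelled here; the function name `merge_histograms` is hypothetical.

```python
def merge_histograms(local_histograms):
    """Sum per-bin statistics from each worker into one global histogram.

    Every local histogram must use the same k bins so that bin i on one
    worker means the same value range as bin i on another.
    """
    return [sum(bins) for bins in zip(*local_histograms)]
```

The optimal split point is then searched on the merged histogram exactly as it would be on a histogram built from all the data on one machine.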
According to the embodiments herein, the metadata information corresponding to target text class fields can be screened out by analyzing the metadata information of the target database; filtering out the data of non-target text class fields saves tag identification time and improves identification efficiency. Each piece of text data to be identified of the target text class field is classified by the pre-trained classification model of the target text class field to obtain its label, and the label of the target text class field is determined from the labels of the plurality of pieces of text data to be identified, which improves the identification efficiency for the target text class field. In implementation, the target text can be specified by the user, so that personalized user requirements can be met, such as tag identification for Chinese text or for English text. In summary, the method is suitable for analyzing massive and complex text field data, improves the efficiency of data analysis, and provides technical support for users to quickly understand and master data assets and to advance enterprise digital transformation.
In one embodiment herein, as shown in fig. 2, there is provided a target text class field tag identification method, including:
In particular, in order to improve recognition accuracy, step 203 further includes, after obtaining the text data to be recognized: and deleting irrelevant information in the acquired pieces of text data to be identified. Irrelevant information includes, but is not limited to, characters such as line feed, space, etc.
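A minimal sketch of this cleaning step, assuming that "irrelevant information" means whitespace only (line feeds, spaces, tabs, and full-width spaces); the function name `clean_text` and the `replacement` parameter are hypothetical.

```python
import re

# \s matches Unicode whitespace for str patterns, including "\n" and U+3000.
_WHITESPACE = re.compile(r"\s+")


def clean_text(value, replacement=""):
    """Remove runs of whitespace from one piece of text data.

    Pass replacement=" " to collapse runs to a single space instead,
    which may suit English text better than deleting spaces outright.
    """
    return _WHITESPACE.sub(replacement, value).strip()
```

Deleting whitespace entirely fits Chinese text, where spaces carry no meaning; for space-delimited languages the collapsing variant avoids gluing words together.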
Further, as shown in fig. 3, step 202 of screening metadata information corresponding to the target text field from the obtained metadata information and adding the metadata information to the object list to be identified includes:
Step 3021: a JDBC driver is used to establish a connection to the data source corresponding to the screened metadata information.
When the step is implemented, a regular expression is adopted to judge whether the queried data contains a target text.
Step 3024: judging whether the data amount of the target text is larger than a preset value. If so, the data corresponding to the screened metadata information is valid and no large-scale abnormality has occurred; the screened metadata information is taken as target text metadata information and added to the to-be-identified object list. If not, the data corresponding to the screened metadata information is invalid and a large-scale abnormality has occurred; the screened metadata information is discarded.
In this step, the predetermined value is, for example, 80%, and can be set according to the actual requirement, which is not limited herein.
According to the embodiment, through screening the data of the character string type, the metadata information which does not contain the target text due to abnormal storage can be filtered, and the metadata information which contains the target text can be accurately determined.
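The counting and threshold check can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the values are assumed to have already been fetched by the SQL query (the JDBC connection is not shown, since the sketch is in Python), the target text is assumed to be Chinese and is detected by a single CJK character, and the preset value is read as a proportion because the example given is 80%. The names `target_text_ratio` and `is_target_text_field` are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumption: the target text is Chinese; one CJK ideograph is enough to match.
CHINESE = re.compile(r"[\u4e00-\u9fff]")


def target_text_ratio(values: Iterable[Optional[str]],
                      pattern: re.Pattern = CHINESE) -> float:
    """Fraction of queried values that contain the target text."""
    total = 0
    hits = 0
    for value in values:
        total += 1
        # NULL column values count toward the total but never match.
        if value is not None and pattern.search(value):
            hits += 1
    return hits / total if total else 0.0


def is_target_text_field(values: Iterable[Optional[str]],
                         pattern: re.Pattern = CHINESE,
                         threshold: float = 0.8) -> bool:
    """Keep the field only if the ratio is larger than the preset value."""
    return target_text_ratio(values, pattern) > threshold
```

A field that passes this check would have its metadata added to the to-be-identified object list; one that fails would be discarded.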
In one embodiment herein, as shown in FIG. 4, the classification model training process includes:
In a specific implementation, the output labels may be set according to actual requirements. In one embodiment, the labels are other, event, organization, and address, denoted by 0, 1, 2, and 3, respectively.
Checking the data ensures the accuracy of each piece of data and provides high-quality training data for model training. Sample balancing avoids the poor model performance caused by sample imbalance. In practice, the ratio of data volumes between labels may be constrained to the interval 1:1 to 1:1.5.
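One way to reach such a ratio is to downsample the larger classes. The text does not say how balance is achieved, so the sketch below is only an assumed approach: the hypothetical `balance_samples` caps every label at `max_ratio` times the size of the rarest label, using a seeded random sample so the result is reproducible.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, int]  # (input text, output label)


def balance_samples(samples: List[Sample],
                    max_ratio: float = 1.5,
                    seed: int = 0) -> List[Sample]:
    """Downsample each label to at most max_ratio times the rarest label."""
    by_label: Dict[int, List[Sample]] = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    if not by_label:
        return []
    # Largest class size allowed so that no pair of labels exceeds 1:max_ratio.
    cap = int(max_ratio * min(len(group) for group in by_label.values()))
    rng = random.Random(seed)
    balanced: List[Sample] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, min(cap, len(group))))
    return balanced
```

Oversampling the rare labels or collecting more data for them would serve the same constraint; downsampling is shown only because it is the simplest to express.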
Further, as shown in fig. 5, in order to ensure the accuracy of the classification model, after the embodiment shown in fig. 4 trains the classification model, the method further includes:
Accuracy is the number of correctly identified labels in the test set divided by the total number of samples in the test set. Precision is the ratio of the number of labels correctly determined by the classifier to the total number of labels determined by the classifier. Recall is the ratio of the number of labels correctly determined by the classifier to the total number of labels. The F1 value is the harmonic mean of precision and recall, with a maximum value of 1 and a minimum value of 0. The specific calculation formula is as follows:
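The formulas themselves are not reproduced in this text. Assuming the standard confusion-matrix definitions, which match the verbal descriptions above (TP, FN, FP, and TN being the counts of true positives, false negatives, false positives, and true negatives), they can be written as:

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall}    &= \frac{TP}{TP + FN} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```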
TP, FN, FP, and TN in the above formula are defined in Table 1 below:
TABLE 1
If the evaluation index is not satisfied, step 503 is performed: the classification model is adjusted and then retrained until a classification model satisfying the evaluation index is obtained.
After the index evaluation in steps 501 to 503 is completed, the final classification model is persisted to storage, providing model support for subsequent text class field label recognition.
In one embodiment, a training set and a test set for the model are formed from real data through data processing and labeling. After training, the model achieves an accuracy of 0.940, a precision of 0.951, a recall of 0.888, and a comprehensive index F1 value of 0.915 on the test set.
In one embodiment, as shown in fig. 6, step 205 of determining the tag of the target text class field according to the tags of the plurality of text data to be identified of the target text class field includes:
In an embodiment herein, after the tag of the target text class field is identified, the text attribute corresponding to the tag is determined from a predetermined association between tags and text attributes, and the tag and text attribute of the target text class field are sent to the client, so that the user can judge whether the content of the target text class field is useful by checking its tag and text attribute.
By identifying the target text type field label, the attribute and the usefulness of the data content in the target text type field can be determined through the label, and a technical basis is provided for subsequent data analysis.
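The field-level decision of step 205, as spelled out in claim 7, counts how many records carry each label and takes the label with the highest count. A minimal sketch follows; the function name `field_label` is hypothetical, and the tie-breaking behavior is an assumption, since the text does not say what happens when two labels have equal counts.

```python
from collections import Counter
from typing import Iterable, Optional

# Label encoding from the embodiment: 0 other, 1 event, 2 organization, 3 address.
LABEL_NAMES = {0: "other", 1: "event", 2: "organization", 3: "address"}


def field_label(record_labels: Iterable[int]) -> Optional[int]:
    """Return the label carried by the most records, or None for no records."""
    counts = Counter(record_labels)
    if not counts:
        return None
    # Ties are not addressed in the source; most_common keeps first-seen order.
    label, _ = counts.most_common(1)[0]
    return label
```

For example, a field whose sampled records were labeled `[3, 3, 1, 0, 3]` would be labeled 3, that is, address.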
Based on the same inventive concept, a target text class field tag recognition device is also provided herein, as described in the following embodiments. Because the principle of solving the problem of the target text field tag recognition device is similar to that of the target text field tag recognition method, the implementation of the target text field tag recognition device can refer to the target text field tag recognition method, and the repetition is omitted.
Specifically, as shown in fig. 7, the target text class field tag recognition apparatus includes:
a metadata information acquisition unit 701, configured to acquire metadata information of a target database;
a metadata information screening unit 702, configured to screen metadata information corresponding to the target text field from the acquired metadata information and add the metadata information to the to-be-identified object list;
a text data to be identified obtaining unit 703, configured to obtain, from the target database, a plurality of pieces of text data to be identified in the target text class field according to the list of objects to be identified;
a classification unit 704, configured to classify each piece of text data to be identified in the target text class field by using a pre-trained classification model of the target text class field, so as to obtain a label of each piece of text data to be identified in the target text class field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
the tag determining unit 705 is configured to determine a tag of the target text class field according to tags of a plurality of text data to be identified of the target text class field.
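The split between the coding module and the classification module in unit 704 can be pictured as two composable functions: one maps a text to a vector, the other maps a vector to a label. The sketch below is illustrative only. `toy_encode` and `toy_classify` are hypothetical stand-ins chosen so the example runs without model files; they are not the BERT encoder or LightGBM classifier named in claim 4, and the digit rule is an invented toy.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def label_texts(texts: Sequence[str],
                encode: Callable[[str], Vector],
                classify: Callable[[Vector], int]) -> List[int]:
    """Encode each text to a vector, then classify the vector to a label."""
    return [classify(encode(text)) for text in texts]


def toy_encode(text: str) -> Vector:
    # Stand-in for the coding module: text length and digit count as features.
    digits = sum(ch.isdigit() for ch in text)
    return [float(len(text)), float(digits)]


def toy_classify(vector: Vector) -> int:
    # Stand-in for the classification module.
    # Assumed toy rule: any digit -> 3 ("address"), otherwise 0 ("other").
    return 3 if vector[1] > 0 else 0


labels = label_texts(["88 Main Street", "hello"], toy_encode, toy_classify)
```

Keeping the two stages behind plain callables means a real encoder and classifier could be swapped in without changing `label_texts`.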
In an embodiment herein, a computer device is also provided. As shown in fig. 9, the computer device 902 may include one or more processors 904, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 902 may also comprise any memory 906 for storing any kind of information, such as code, settings, and data. In some embodiments, a computer program is stored in the memory 906 and, when executed by the processor 904, implements the method described in any of the previous embodiments. For example, and without limitation, the memory 906 may include any one or a combination of the following: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may store information using any technique. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 902. In one case, when the processor 904 executes associated instructions stored in any memory or combination of memories, the computer device 902 can perform any of the operations of the associated instructions. The computer device 902 also includes one or more drive mechanisms 908 for interacting with any memory, such as a hard disk drive mechanism, optical disk drive mechanism, and the like.
The computer device 902 may also include an input/output module 910 (I/O) for receiving various inputs (via an input device 912) and for providing various outputs (via an output device 914). One particular output mechanism may include a presentation device 916 and an associated graphical user interface 918 (GUI). In other embodiments, the input/output module 910 (I/O), input device 912, and output device 914 may be omitted, and the computer device 902 serves merely as a computer device in a network. The computer device 902 may also include one or more network interfaces 920 for exchanging data with other devices via one or more communication links 922. One or more communication buses 924 couple the above-described components together.
The communication link 922 may be implemented in any manner, for example, through a local area network, a wide area network (e.g., the internet), a point-to-point connection, etc., or any combination thereof. Communication link 922 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.
Embodiments herein also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
Embodiments herein also provide computer readable instructions which, when executed by a processor, cause the processor to perform the above method.
It should be understood that, in the various embodiments herein, the sequence number of each process described above does not mean the sequence of execution, and the execution sequence of each process should be determined by its functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments herein.
It should also be understood that in embodiments herein, the term "and/or" merely describes an association between associated objects, indicating that three relationships may exist. For example, A and/or B may represent: A exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided herein, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices, or elements, or may be an electrical, mechanical, or other form of connection.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the elements may be selected according to actual needs to achieve the objectives of the embodiments herein.
In addition, each functional unit in the embodiments herein may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions herein, in essence, or the portions contributing over the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments herein. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Specific examples are set forth herein to illustrate the principles and embodiments herein and are merely illustrative of the methods herein and their core ideas; also, as will be apparent to those of ordinary skill in the art in light of the teachings herein, many variations are possible in the specific embodiments and in the scope of use, and nothing in this specification should be construed as a limitation on the invention.
Claims (10)
1. A method for identifying a target text field tag, comprising:
acquiring metadata information of a target database;
screening metadata information corresponding to the target text field from the acquired metadata information and adding the metadata information into an object list to be identified;
according to the object list to be identified, acquiring a plurality of pieces of text data to be identified of a target text field from a target database;
classifying each piece of text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each piece of text data to be identified of the target text field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
and determining the label of the target text type field according to the labels of the plurality of pieces of text data to be identified of the target text type field.
2. The method of claim 1, wherein the metadata information of the target database comprises: database address, database port, database identification, data table identification in database, field identification in data table and field type.
3. The method of claim 2, wherein the step of screening the metadata information corresponding to the target text class field from the acquired metadata information and adding the metadata information to the object list to be identified includes:
screening metadata information with a field type of a character string from the acquired metadata information;
for each piece of the screened metadata information, the following analysis is performed:
adopting a JDBC driver to establish the connection of a data source corresponding to the screened metadata information;
querying data through the connection by using SQL language;
counting the data quantity containing the target text from the inquired data;
and judging whether the data amount of the target text is larger than a preset value, if so, taking the screened metadata information as target text metadata information and adding the target text metadata information into an object list to be identified.
4. The method of claim 1, wherein the encoding module is a BERT model, the classification module is a LightGBM model, and the classification model training process comprises:
determining the target text type and the output labels of the classification model according to the service demand;
according to the output labels of the classification model, input text data corresponding to each output label and input text data of non-output labels are obtained from a database;
performing data checking and sample balancing processing on the acquired data to obtain a final training data set, wherein the training data set comprises a plurality of samples, and each sample comprises input text data and an output label;
training a pre-established classification model with a training dataset.
5. The method of claim 4, wherein after training is completed, further comprising:
evaluating the classification model by using the test set;
calculating an evaluation index according to the evaluation result, wherein the evaluation index comprises: accuracy, precision and recall, and comprehensive index F1 value;
if the evaluation index is not satisfied, the classification model is adjusted and then retraining is carried out until the classification model satisfying the evaluation index is obtained.
6. The method of claim 1, wherein the step of obtaining the plurality of pieces of text data to be recognized in the target text class field from the target database according to the list of objects to be recognized further comprises:
deleting irrelevant information in the acquired pieces of text data to be identified.
7. The method of claim 1, wherein determining the tag of the target text class field based on the tags of the plurality of text data to be identified of the target text class field comprises:
counting the data quantity of the text data to be identified of each label according to the labels of the text data to be identified of the target text field;
and taking the label with the highest data volume as the label of the target text type field.
8. A target text class field tag recognition apparatus, comprising:
a metadata information acquisition unit configured to acquire metadata information of a target database;
the metadata information screening unit is used for screening metadata information corresponding to the target text type field from the acquired metadata information and adding the metadata information into the object list to be identified;
the text data acquisition unit to be identified is used for acquiring a plurality of pieces of text data to be identified of the target text type field from the target database according to the object list to be identified;
the classification unit is used for classifying each piece of text data to be identified of the target text field by utilizing a pre-trained classification model of the target text field to obtain a label of each piece of text data to be identified of the target text field; the classification model comprises a coding module and a classification module, wherein the coding module is used for coding each piece of text data to be identified to obtain a semantic vector of the text data to be identified, and the classification module is used for classifying the semantic vector of the text data to be identified to obtain the label of the text data to be identified;
and the label determining unit is used for determining the label of the target text type field according to the labels of the plurality of pieces of text data to be identified of the target text type field.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 7 when executing the computer program.
10. A computer storage medium having stored thereon a computer program, which when executed by a processor of a computer device implements the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310096604.XA CN116166801A (en) | 2023-01-19 | 2023-01-19 | Target text field label identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116166801A (en) | 2023-05-26 |
Family
ID=86415936
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310096604.XA Pending CN116166801A (en) | 2023-01-19 | 2023-01-19 | Target text field label identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116166801A (en) |
- 2023-01-19: CN application CN202310096604.XA (publication CN116166801A), status: active, pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||