CN116910279A - Label extraction method, apparatus and computer readable storage medium - Google Patents

Label extraction method, apparatus and computer readable storage medium

Info

Publication number
CN116910279A
CN116910279A (application CN202311178989.0A)
Authority
CN
China
Prior art keywords
semantic
vector
tag
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311178989.0A
Other languages
Chinese (zh)
Other versions
CN116910279B (en)
Inventor
郭峻宁 (Guo Junning)
陈晓锋 (Chen Xiaofeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhicheng Software Technology Service Co ltd
Shenzhen Smart City Technology Development Group Co ltd
Original Assignee
Shenzhen Zhicheng Software Technology Service Co ltd
Shenzhen Smart City Technology Development Group Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhicheng Software Technology Service Co ltd, Shenzhen Smart City Technology Development Group Co ltd filed Critical Shenzhen Zhicheng Software Technology Service Co ltd
Priority to CN202311178989.0A (priority, critical)
Publication of CN116910279A
Application granted
Publication of CN116910279B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/353: Clustering; Classification into predefined classes
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a label extraction method, apparatus, and computer-readable storage medium, belonging to the technical field of information processing. The method comprises the following steps: determining a digital vector based on unstructured text to be extracted, wherein the digital vector characterizes the context information and semantic information of the unstructured text; inputting the digital vector into a trained semantic model to obtain corresponding semantic features; and matching the semantic features against a tag pool and determining a target tag according to the matching result. The present application is directed to extracting the tag information of unstructured text.

Description

Label extraction method, apparatus and computer readable storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a tag extraction method, a tag extraction device, and a computer readable storage medium.
Background
To acquire needed information quickly, more and more users choose automatic extraction, i.e., using computer technology to extract refined, valuable information from massive data, which greatly improves the efficiency of information extraction.
In the related art, regular expressions are generally used to extract the required information quickly. The specific steps are as follows: finding the pattern, i.e., determining the pattern of the information to be extracted, such as a date or a number; constructing a regular expression, i.e., building a suitable regular expression based on the required pattern; compiling the regular expression, i.e., compiling it into an executable form in code using a suitable programming language; matching the text, i.e., feeding the text containing the information to be extracted into the compiled regular expression to find matching patterns; and extracting the information and processing the results, i.e., extracting the required information according to the matched pattern and performing any necessary processing on it.
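For ease of understanding, the following minimal Python sketch walks through this workflow; the sample text and the date/number patterns are hypothetical and chosen only for illustration.

    import re

    # Hypothetical notice text containing a document number and dates.
    text = "Notice No. 2023-15, issued on 2023-09-13, takes effect on 2023-10-01."

    # Construct and compile regular expressions for the required patterns.
    date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")
    number_pattern = re.compile(r"No\.\s*(\d{4}-\d+)")

    # Match the text and extract the required information.
    dates = date_pattern.findall(text)          # ['2023-09-13', '2023-10-01']
    doc_numbers = number_pattern.findall(text)  # ['2023-15']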
However, regular expressions can only handle pattern-based matching and cannot understand the context of text, and therefore, for unstructured text that requires understanding of semantics, such as natural language text, regular expressions cannot perform information extraction.
The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.
Disclosure of Invention
The main purpose of the present application is to provide a label extraction method, a label extraction device, and a computer-readable storage medium, aiming to solve the technical problem that existing information extraction approaches are limited in the breadth of information they can extract.
In order to achieve the above object, the present application provides a tag extraction method comprising the steps of:
determining a digital vector based on unstructured text to be extracted, wherein the digital vector characterizes context information and semantic information of the unstructured text;
inputting the digital vector into a trained semantic model to obtain corresponding semantic features;
and matching the semantic features with the tag pool, and determining a target tag according to a matching result.
Optionally, the digital vector includes a weighted value vector, and the step of determining the digital vector based on the unstructured text to be extracted includes:
according to a word segmentation algorithm, cutting long sentences in the unstructured text to be extracted into a plurality of word segments of fixed length;
determining a query vector, a key vector and a value vector of each word segment, and calculating the attention weight of each word segment according to the query vector and the key vector;
and carrying out weighted summation on the attention weight and the value vector to obtain a weighted value vector of each word segment.
Optionally, the trained semantic model includes a trained masked language model and a trained sentence prediction model, and the step of inputting the digital vector into the trained semantic model to obtain the corresponding semantic features includes:
randomly selecting a portion of the digital vectors and replacing the selected digital vectors with mask tokens;
inputting the digital vectors and the mask tokens into the trained masked language model, and predicting the word segments hidden by the mask tokens;
inputting the digital vectors into the trained sentence prediction model to predict the adjacency relations of the long sentences in which the word segments are located;
and understanding the semantic information according to the predicted word segments and the predicted adjacency relations, and determining the semantic features corresponding to the semantic information;
dividing all the semantic features into a plurality of similar semantic feature groups, and carrying out pooling operation on each similar semantic feature group to obtain corresponding representative semantic features.
Optionally, the step of matching the semantic feature with the tag pool and determining the target tag according to the matching result includes:
obtaining the plurality of labels in the tag pool, and comparing the vector similarity between the representative semantic features and each label;
and determining the label with the highest vector similarity as the target label.
Optionally, before the step of determining the digital vector based on the unstructured text to be extracted, the method includes:
crawling data on the webpage by using a crawler technology to obtain a semi-structured text;
dividing the semi-structured text into a structured text and an unstructured text according to the attribute of the body;
after the step of matching the semantic features with the tag pool and determining the target tag according to the matching result, the method comprises the following steps:
analyzing and extracting the structured text according to a standard format during text storage, and directly generating key information;
and carrying out data fusion on the target label and the key information to obtain a corresponding target structured text.
Optionally, before the step of determining the digital vector based on the unstructured text to be extracted, the method includes:
acquiring a training sample and corresponding standard semantic features, and determining a training digital vector based on the training sample, wherein the training digital vector characterizes context information and semantic information of the training sample;
inputting the training digital vector into a pre-trained semantic model to obtain corresponding training semantic features;
and calculating a loss function value between the training semantic feature and the standard semantic feature according to a preset loss function, and adjusting the pre-trained semantic model according to the loss function value until the minimum loss function value is reached.
Optionally, the step of calculating a loss function value between the training semantic features and the standard semantic features according to a preset loss function, and adjusting the pre-trained semantic model according to the loss function value until the minimum loss function value is reached, includes:
calculating the accuracy, recall, and balance index of the pre-trained semantic model by evaluating the model with test samples;
calculating the comprehensive index value of the pre-trained semantic model according to the calculation results and the corresponding weight values;
and if the comprehensive index value meets the index threshold, exporting the trained semantic model.
Optionally, the step of matching the semantic feature with the tag pool and determining the target tag according to the matching result includes:
and matching the semantic features with a tag pool, and if the matching rate does not meet the lowest matching rate, taking the semantic features as new tags and adding the new tags into the tag pool.
In addition, in order to achieve the above object, the present application also provides a tag extraction apparatus comprising: the system comprises a memory, a processor and a label extraction program stored on the memory and capable of running on the processor, wherein the label extraction program is configured to realize the steps of the label extraction method.
In addition, in order to achieve the above object, the present application also provides a computer-readable storage medium having stored thereon a tag extraction program which, when executed by a processor, implements the steps of the tag extraction method.
In one technical solution provided by the application, a digital vector of the unstructured text is determined, the digital vector is then input into a trained semantic model to obtain corresponding semantic features, and finally the target label of the text is determined according to the result of matching the semantic features against a tag pool. Through data processing and the semantic model, the application can understand the context information and semantic information of unstructured text, accurately extract abstract information that is otherwise difficult to interpret, and match a target label to that abstract information, thereby achieving automatic label extraction for unstructured text; text classification can then be performed quickly based on the target label of each text. Moreover, the semantic model adopted by this solution is highly general, so it does not require frequent iteration and updating when new policies appear, which greatly reduces the maintenance and update cost of the information extraction tool.
Drawings
FIG. 1 is a schematic flow chart of a first embodiment of a label extraction method according to the present application;
FIG. 2 is a flowchart of a second embodiment of a label extraction method according to the present application;
FIG. 3 is a flowchart of a label extraction method according to a third embodiment of the present application;
fig. 4 is a schematic structural diagram of a tag extraction device of a hardware running environment according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
Automatic information extraction generally requires constructing regular expressions, building relational word libraries, and the like.
However, regular expressions can only handle pattern-based matching and cannot understand the context of text; therefore, for unstructured text that requires semantic understanding, such as natural language text, regular expressions cannot perform information extraction. Moreover, files of different types and from different periods correspond to different information patterns, so regular expressions must be continuously constructed and compiled for them, making the cost of information extraction high and its universality low.
As for relational word libraries, their construction consumes resources, and as more and more text is input the library must be continuously updated and optimized, so the operating cost is high.
To solve these problems, the present application first extracts a digital vector from the unstructured text and inputs it into a trained semantic model to obtain semantic features; it then matches the semantic features against a tag pool and determines a target tag, thereby achieving information understanding and label extraction for unstructured text and expanding the breadth of information extraction.
In order that the above-described aspects may be better understood, exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the application to those skilled in the art.
An embodiment of the present application provides a tag extraction method, referring to fig. 1, fig. 1 is a schematic flow chart of a first embodiment of a tag extraction method of the present application.
In this embodiment, the tag extraction method includes:
step S11: determining a digital vector based on unstructured text to be extracted, wherein the digital vector characterizes context information and semantic information of the unstructured text;
the application scenarios of the scheme include, but are not limited to: the government department carries out relevant works such as policy interpretation, public affair management, investigation and updating of policies based on the existing policies or information, legal relevant units or institutions process works such as legal documents, auxiliary judicial decisions and the like, works such as enterprise screening information, refined information, public opinion monitoring, market investigation and the like, works such as financial relevant units or institutions screening information, formulating directions, carrying out compliance checking and the like. For ease of understanding, this embodiment will be described with reference to extracting labels from policy documents.
It will be appreciated that unstructured text refers to text data without an explicit format or organization; such text generally does not follow a particular pattern or rule, is difficult to understand directly, and must be converted into a numerical form for further analysis and processing. A digital vector is such a numerical form of unstructured text, capable of characterizing its context and semantics.
Optionally, a bag-of-words model may be used: the unstructured text to be extracted is regarded as a collection of words, and the number of occurrences or the frequency of each word in the text is counted. Specifically, a tool such as CountVectorizer can convert the text into a numerical vector under the bag-of-words representation.
Optionally, term frequency-inverse document frequency (TF-IDF) may be used: importance weights for words are introduced on top of the bag-of-words model, where TF denotes the frequency of a word in the text and IDF denotes the inverse document frequency of the word across the whole text collection. Multiplying TF by IDF gives the TF-IDF weight of the word. Specifically, the TfidfVectorizer tool can convert the text into a numerical vector under the TF-IDF representation.
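As an illustrative sketch of these two representations (assuming scikit-learn is available; the mini-corpus is hypothetical):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    # Hypothetical mini-corpus standing in for unstructured policy text.
    corpus = [
        "green building energy saving policy",
        "building waste recycling policy notice",
    ]

    # Bag-of-words: raw term counts per document.
    bow = CountVectorizer()
    bow_vectors = bow.fit_transform(corpus)

    # TF-IDF: term counts reweighted by inverse document frequency.
    tfidf = TfidfVectorizer()
    tfidf_vectors = tfidf.fit_transform(corpus)

    print(bow.get_feature_names_out())
    print(bow_vectors.toarray())
    print(tfidf_vectors.toarray())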
Optionally, the digital vector includes a weighted value vector, and step S11 includes:
step S111: according to a word segmentation algorithm, cutting long sentences in the unstructured text to be extracted into a plurality of word fragments with fixed lengths;
it will be appreciated that policy documents are typically constructed from long sentences in paragraphs, while the input format required by the BERT model is a fixed length word segment and requires conversion of text to numbers. Thus, when policy document preprocessing is performed, steps such as word segmentation, encoding, and mask addition are required to enable the policy document to adapt to the BERT model.
Optionally, long sentences in the policy text are cut into individual words or vocabulary units, collectively referred to as word segments, using a word segmentation algorithm, such as rule-based word segmentation algorithms, statistical word segmentation algorithms, machine-learning-based word segmentation algorithms, deep-learning-based word segmentation algorithms, and the like.
Illustratively, using the WordPiece algorithm, the policy text is cut into fixed-length word segments through the steps of initializing the vocabulary, counting word frequencies, merging character fragments, and cutting the text. For example, for the sentence "Bert is a powerful NLP model", the word segments obtained after segmentation are: "Bert", "is", "a", "powerful", "NLP", "model".
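For illustration, the HuggingFace transformers library exposes BERT's WordPiece tokenizer directly; the sketch below assumes the English bert-base-uncased vocabulary because the example sentence is English, whereas the embodiment itself would use a Chinese vocabulary.

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # WordPiece cuts the sentence into word segments; words outside the
    # vocabulary are split into subword pieces prefixed with "##" (the
    # exact pieces depend on the vocabulary).
    tokens = tokenizer.tokenize("Bert is a powerful NLP model")
    ids = tokenizer.convert_tokens_to_ids(tokens)  # encode segments as numbers
    print(tokens)
    print(ids)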
Step S112: determining a query vector, a key vector and a value vector of each word segment, and calculating the attention weight of each word segment according to the query vector and the key vector;
it will be appreciated that the encoding process involves a self-attention mechanism of a converter-based bi-directional encoding model (Bidirectional Encoder Representations from Transformer, BERT), particularly a transducer model, which allows each word segment to interact with other word segments to effectively capture the semantic relationships of the context.
Alternatively, in the transducer model, each word segment is mapped to a query vector, a key vector, and a value vector, respectively. This step is typically implemented by a linear transformation, specifically, for each word segment input vector, the three weight matrices (query weight matrix, key weight matrix, and value weight matrix) are multiplied to obtain the corresponding query vector, key vector, and value vector, which are used to calculate the attention weight to focus the context-related information on the current word segment.
Further, by calculating the dot product of the query vector and the key vectors of all word segments, and then scaling and softmax operations on the dot product, normalized attention weights are obtained, which represent the similarity between the current word segment and other word segments, i.e., the importance in context.
Step S113: and carrying out weighted summation on the attention weight and the value vector to obtain a weighted value vector of each word segment.
Optionally, the value vectors of the word segments are summed, weighted by the attention weights, to obtain the final representation, i.e., the weighted value vector of each word segment.
On the one hand, because the attention weights are computed, the weighted value vector captures the relationships between a word segment and the other word segments; the model can focus on the segments related to the current one and thereby capture the semantic and grammatical relationships in the context, so the weighted value vector can characterize the context information of the unstructured text.
On the other hand, through the attention mechanism's weighted summation of the value vectors, the semantic information of different word segments is fused into the final representation, so the weighted value vector contains the comprehensive semantic information of the word segments and can characterize the semantic information of the unstructured text.
It is worth noting that the BERT model employs a multi-head attention mechanism: the model learns several different sets of attention weights and attends to the contextual information from different angles. Finally, the outputs of these heads are concatenated, and the final vector representation is obtained through a linear transformation.
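The following numpy sketch condenses steps S112 and S113 into a single-head form; the toy dimensions and random matrices stand in for the learned weight matrices, and BERT itself stacks multiple such heads.

    import numpy as np

    def scaled_dot_product_attention(X, W_q, W_k, W_v):
        # X holds one row per word segment (n x d).
        Q, K, V = X @ W_q, X @ W_k, X @ W_v             # query/key/value vectors
        d_k = K.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # scaled dot products
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax -> attention weights
        return weights @ V                              # weighted value vectors

    # Toy setup: 6 word segments, embedding size 8, random "learned" matrices.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 8))
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    weighted_values = scaled_dot_product_attention(X, W_q, W_k, W_v)
    print(weighted_values.shape)  # (6, 8): one weighted value vector per segment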
It should be noted that text feature extraction methods such as term frequency-inverse document frequency (TF-IDF) are very sensitive to article length, and a corpus may contain many short notices, such as policy postponement notices and policy revocation notices; for such policy documents, TF-IDF keyword extraction is prone to inaccuracy. In this solution, the BERT model is not limited by the length of the unstructured text: whether it is short text, long text, or even an entire document, the corresponding word segmentation is performed and each segment is encoded, so both local and global information in the text can be captured.
Step S12: inputting the digital vector into a trained semantic model to obtain corresponding semantic features;
it can be appreciated that the trained semantic model is a model that can be trained through a large-scale dataset to understand and represent semantics, and the model can understand and process natural language by learning semantic relationships and context information in text data.
Optionally, ERNIE (Enhanced Representation through Knowledge Integration) is a transducer-based pre-training language model, and semantic features corresponding to the digital vectors are obtained through the forward propagation process of the ERNIE model.
Optionally, using the BERT-Base, chinese model, a pre-training model specific to Chinese language processing, the trained semantic model comprising a trained mask language model and a trained sentence prediction model, step S12 comprises:
step S121: randomly selecting a portion of the digital vector and replacing the selected digital vector with a mask mark;
step S122: inputting the digital vector and the mask mark into a trained mask language model, and predicting word fragments shielded by the mask mark;
step S123: inputting the digital vector into a trained sentence prediction model to predict the adjacent relation of long sentences where each word segment is located;
step S124: according to the predicted word segments and the predicted adjacent relations, semantic information is understood, and semantic features corresponding to the semantic information are determined;
step S125: dividing all the semantic features into a plurality of similar semantic feature groups, and carrying out pooling operation on each similar semantic feature group to obtain corresponding representative semantic features.
It will be appreciated that the processed policy text data is input into the BERT model in the form of fixed vectors, and BERT obtains the semantic features by executing a masked language model (MLM) and a next sentence prediction model (NSP).
Optionally, in the MLM, the model randomly masks the digital vectors of some word segments in the text and replaces them with the special mask token "[MASK]". The digital vectors and mask tokens are input into the trained masked language model, which predicts the word segments hidden by the mask tokens based on the relationships between words learned during prior training. Mask tokens help the model learn the context information better and avoid over-reliance on particular words in the input.
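As an illustrative sketch of this masked prediction, the HuggingFace fill-mask pipeline can be used; the public bert-base-chinese checkpoint and the sample sentence are assumptions here, and the patent's own fine-tuned model would be loaded the same way.

    from transformers import pipeline

    # Fill-mask pipeline: predicts the word segment hidden by [MASK].
    fill_mask = pipeline("fill-mask", model="bert-base-chinese")

    # Hypothetical policy sentence with one masked character.
    for candidate in fill_mask("进一步提高建筑废弃物综合利用[MASK]平。")[:3]:
        print(candidate["token_str"], candidate["score"])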
Further, in the NSP, after receiving the input digital vectors (including those of the predicted word segments), the model analyzes, based on the inter-sentence relationships learned during prior training, whether the long sentences in which the word segments (including the predicted ones) are located are adjacent in the original text, i.e., whether the two sentences are consecutive, so as to grasp the semantics and continuity between sentences.
At this point, based on the word segments predicted by the MLM and the adjacency relations predicted by the NSP, the specific semantic information of each digital vector is understood, and the semantic features corresponding to that semantic information, such as 'communication', 'network', or 'computer', are determined.
Then, all the semantic features are divided according to their degree of semantic similarity into several groups of similar semantic features; for example, group A is a construction class containing semantic features such as 'building', 'construction', and 'real estate', while group B is a green class containing semantic features such as 'green', 'saving', and 'environmental protection'.
After grouping is completed, a pooling operation is performed on each group of similar semantic features; by retaining the most salient features, pooling extracts the important feature information. Pooling operations fall mainly into two types: max pooling and average pooling. Max pooling selects the maximum value in each feature dimension of the vectors and uses it as the output for that dimension, i.e., the word vector with the highest weight or strongest representation is selected in each feature dimension; specifically, the vector values of the semantic features within a group are compared, and the semantic feature corresponding to the maximum vector value is taken as the group's representative semantic feature. Average pooling computes the mean of each feature dimension of the input vectors and uses it as the output for that dimension, which can be regarded as an average representation of the overall semantics of the word vectors; specifically, the mean vector value of all semantic features in a group is computed, and the semantic feature corresponding to it is taken as the group's representative semantic feature.
It should be noted that which pooling to choose is determined during subsequent model optimization and iteration. In either case, the final goal of pooling is to compress the semantic information of the entire text sequence into a fixed-length vector, making it convenient to feed the data into subsequent neural network layers or classifiers to complete the policy text classification task.
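A minimal numpy sketch of the two pooling variants over one group of similar semantic feature vectors (toy values):

    import numpy as np

    # One group of similar semantic feature vectors, e.g. 'building',
    # 'construction', 'real estate'; three features, three dimensions.
    group_a = np.array([
        [0.9, 0.1, 0.3],
        [0.8, 0.2, 0.1],
        [0.7, 0.3, 0.2],
    ])

    max_pooled = group_a.max(axis=0)   # strongest value per feature dimension
    avg_pooled = group_a.mean(axis=0)  # average representation of the group

    print(max_pooled)  # [0.9 0.3 0.3]
    print(avg_pooled)  # [0.8 0.2 0.2]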
Step S13: and matching the semantic features with the tag pool, and determining a target tag according to a matching result.
It will be appreciated that in a multi-label learning task each unstructured text may belong to multiple labels, so a tag pool is needed to represent all possible labels. Taking an industry tag pool as an example, the pool contains labels such as the construction engineering industry, green-building-related industries, building-energy-saving-related industries, and building-waste-related industries.
Optionally, rules or rule sets are predefined to determine the matching relationships between semantic features and labels; these rules may be summarized from prior knowledge, domain expertise, or experience. Matching is then performed according to the rules, and a label that satisfies a rule is matched to the semantic feature.
Optionally, step S13 includes:
step S131: obtaining a plurality of labels of a label pool, and comparing the representative semantic features with vector similarity among all the labels;
step S132: and determining the label with the highest vector similarity as the target label.
Optionally, the semantic features are matched against the tag pool, and the target label is selected according to the matching results. The specific process is as follows: all labels in the tag pool are obtained, a text similarity calculation method (such as cosine similarity or edit distance) is used to compare the vector similarity between the semantic features and each label, and the label with the highest similarity is selected as the target label, i.e., the label information of the unstructured text.
It should be noted that if the matching rates between the semantic features and all existing labels are lower than a preset minimum matching rate, i.e., the vector similarity is below the minimum similarity, then no sufficiently related label exists in the current tag pool; the semantic feature can therefore be added to the tag pool as a new label, which facilitates label extraction for subsequent texts.
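The matching and new-label logic can be sketched as follows; cosine similarity is used, and the tag vectors and the minimum-similarity threshold are illustrative assumptions.

    import numpy as np

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def match_label(feature_name, feature_vec, tag_pool, min_similarity=0.5):
        # Compare the semantic feature against every tag in the pool.
        best_tag, best_sim = None, -1.0
        for tag, tag_vec in tag_pool.items():
            sim = cosine_similarity(feature_vec, tag_vec)
            if sim > best_sim:
                best_tag, best_sim = tag, sim
        if best_sim < min_similarity:             # no sufficiently related tag
            tag_pool[feature_name] = feature_vec  # add the feature as a new tag
            return feature_name, best_sim
        return best_tag, best_sim

    # Toy tag pool with illustrative two-dimensional vectors.
    tag_pool = {"green building": np.array([0.9, 0.1]),
                "building waste": np.array([0.2, 0.8])}
    print(match_label("energy saving", np.array([0.85, 0.2]), tag_pool))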
In one technical solution provided by this embodiment, a digital vector of the unstructured text is determined, the digital vector is then input into a trained semantic model to obtain corresponding semantic features, and finally the target label of the text is determined according to the result of matching the semantic features against the tag pool. Through data processing and the semantic model, the context information and semantic information of unstructured text can be understood, abstract information that is otherwise difficult to interpret can be accurately extracted, and a target label can be matched to that abstract information, thereby achieving automatic label extraction for unstructured text; text classification can then be performed quickly based on the target label of each text. Moreover, the semantic model adopted by this solution is highly general, so it does not require frequent iteration and updating when new policies appear, which greatly reduces the maintenance and update cost of the information extraction tool.
Further, referring to fig. 2, a second embodiment of the tag extraction method of the present application is presented. Based on the embodiment shown in fig. 1, before the step of determining the digital vector based on the unstructured text to be extracted, the method includes:
step S21: crawling data on the webpage by using a crawler technology to obtain a semi-structured text;
step S22: dividing the semi-structured text into a structured text and an unstructured text according to the attribute of the body;
alternatively, a crawler is an automated program that simulates human operations, accesses web pages, and extracts the required data. The method specifically comprises the steps of determining a target webpage, sending an HTTP request, analyzing HTML content, cleaning and sorting data, storing data and the like, and crawling the data on the webpage to obtain a semi-structured text.
It will be appreciated that an ontology attribute is a sign that describes an object; in the examples of the present application, ontology attributes mainly refer to certain characteristics of a policy document (e.g., policy release time, policy-related industry, etc.). Because semi-structured data has some structured characteristics but also contains unstructured data, it is refined according to the ontology attributes.
Optionally, if an ontology attribute is key information contained in the policy content (e.g., the policy issuing authority, the policy validity period, etc.), that part of the content is divided into structured text; if the ontology attribute is abstract information (e.g., the related industry, the object of the document, etc.), that part of the content is divided into unstructured text.
After the step of matching the semantic features with the tag pool and determining the target tag according to the matching result, the method comprises the following steps:
step S23: analyzing and extracting the structured text according to a standard format during text storage, and directly generating key information;
step S24: and carrying out data fusion on the target label and the key information to obtain a corresponding target structured text.
It will be appreciated that structured text, when stored, is suitably kept according to standards such as forms, rows, and columns, with each row representing an independent data instance and each column representing a data type established at data-model design time. Therefore, when extracting information from structured text, the key information can be generated directly by parsing and extracting according to the standard storage format, for example the positions of reference words, contextual relationships, and grammar rules.
Optionally, the target label and the key information are concatenated or combined according to a common field or identifier to create a comprehensive structured text containing both, or the information is extracted and processed using text mining and natural language processing techniques; this embodiment is not particularly limited in this respect.
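A minimal sketch of this fusion step; the field names and values are hypothetical.

    # Key information generated directly from the structured text.
    key_information = {
        "doc_id": "policy-2023-001",
        "issuing_authority": "Municipal Housing Bureau",
        "validity_period": "2023-10-01 to 2026-09-30",
    }

    # Target labels extracted from the unstructured text.
    target_tags = ["green building", "building energy saving"]

    # Fuse on the common identifier to form the target structured text.
    target_structured_text = {**key_information, "tags": target_tags}
    print(target_structured_text)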
In the technical solution provided by this embodiment, the semi-structured text is divided into structured text and unstructured text; a semantic model performs semantic understanding on the unstructured text and extracts the target labels, the structured text is parsed and extracted according to its standard format, and finally the two results are fused. Processing structured and unstructured data separately improves data-processing efficiency, allows precise computation and analysis that exploits the characteristics of each, better satisfies different application requirements, and yields more accurate and effective data analysis results.
Further, referring to fig. 3, a third embodiment of the tag extraction method of the present application is presented. Based on the embodiment shown in fig. 1, before the step of determining the digital vector based on the unstructured text to be extracted, the method includes:
step S31: acquiring a training sample and corresponding standard semantic features, and determining a training digital vector based on the training sample, wherein the training digital vector characterizes context information and semantic information of the training sample;
step S32: inputting the training digital vector into a pre-trained semantic model to obtain corresponding training semantic features;
step S33: and calculating a loss function value between the training semantic feature and the standard semantic feature according to a preset loss function, and adjusting the pre-trained semantic model according to the loss function value until the minimum loss function value is reached.
It can be understood that a technician can prepare training samples in advance and manually annotate the corresponding standard semantic features; for example, for the training text "standardize the identification of building-waste comprehensive-utilization products and further improve the city's level of comprehensive building-waste utilization", the corresponding standard semantic features are "green", "energy saving", and the like.
Optionally, after the training samples and the standard semantic features are obtained, data preprocessing is performed, that is, training digital vectors are determined based on the training samples, where the digital vectors represent context information and semantic information of the training samples, and the specific process is the same as step S11, and is not described herein.
Further, the training digital vector is input into the pre-training semantic model to obtain the corresponding training semantic feature, and the specific process is the same as step S12, which is not described herein.
Further, a preset loss function is used, such as the mean squared error or the cross-entropy loss. Taking the mean squared error as an example, assume the vector of training semantic features is x, the vector of standard semantic features is y, and the loss function is L(x, y). The mean squared error is computed as follows: compute the difference, i.e., subtract the standard vector y from the training vector x element by element to obtain the difference vector diff; compute the squares, i.e., square each element of diff to obtain the squared-difference vector diff_squared; compute the mean, i.e., sum diff_squared and divide by the vector length to obtain mean_squared; and take mean_squared as the loss function value. The formula is: L(x, y) = (1/n) × Σ diff_squared, where n is the length of the vectors and Σ denotes summation.
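This computation can be written directly as follows (toy vectors; in practice x and y are the training and standard semantic feature vectors):

    import numpy as np

    def mse_loss(x, y):
        diff = x - y                         # element-wise difference
        diff_squared = diff ** 2             # element-wise square
        return diff_squared.sum() / len(x)   # mean over the vector length

    x = np.array([0.2, 0.7, 0.1])  # training semantic features (toy values)
    y = np.array([0.0, 1.0, 0.0])  # standard semantic features (toy values)
    print(mse_loss(x, y))          # (0.04 + 0.09 + 0.01) / 3 ≈ 0.0467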
Alternatively, in the BERT model two main loss functions are used for training: the MLM loss and the NSP loss. The MLM loss trains the prediction ability of the BERT model: a portion of the input words is masked, the model predicts the masked words, the predictions are compared with the masked words of the original sentence, and the cross-entropy loss is computed. The NSP loss trains the semantic understanding ability of the BERT model: the model predicts whether two sentences are consecutive, the prediction is compared with the actual label, and the cross-entropy loss is computed.
Finally, the gradients of the model parameters are computed from the loss function value using the back-propagation algorithm, and the parameters are updated from the gradients using an optimization algorithm (such as gradient descent) so that the loss function value gradually decreases. The previous steps are repeated until a preset stopping condition, i.e., the minimum loss function value, is reached.
It should be noted that training in this solution is the process of exposing the model to training data so that it gradually adapts to the data; the goal is for the model to learn the patterns, features, and relationships in the training data so that it performs well on unseen data. During training, the model continuously adjusts its parameters by comparison with the training data to minimize the defined loss function, which helps the model gradually tune itself so that its predictions come closer to the actual labels.
Step S34: calculating the accuracy, recall, and balance index of the trained semantic model by evaluating the model with test samples;
Step S35: calculating the comprehensive index value of the pre-trained semantic model according to the calculation results and the corresponding weight values;
Step S36: if the comprehensive index value meets the index threshold, exporting the trained semantic model.
It will be appreciated that once the model has completed training it may perform well on the training data, but this does not mean that it performs well on real-world unknown data. Therefore, this solution fine-tunes and optimizes the trained model to ensure its generalization ability, i.e., that the model also performs well on unseen data.
Optionally, the data samples are divided into training samples and test samples in advance, ensuring that the training samples and test samples are independent of each other. After training the model by using the training sample, testing the model by using the test sample to obtain a test result.
Further, the accuracy is calculated: accuracy is the proportion of correctly predicted samples among all samples, i.e., accuracy = correctly predicted samples / total samples. The recall is calculated: recall is the proportion of samples correctly predicted as positive among the truly positive samples, i.e., recall = correctly predicted positive samples / truly positive samples. The balance index is calculated: it considers precision and recall together; for example, the F1 value is the harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall).
Further, the comprehensive index value of the pre-trained semantic model is obtained from the calculation results and the corresponding weight values; for example, with weights of 0.3, 0.3, and 0.4 for the accuracy, recall, and balance index respectively, each metric is multiplied by its weight and the products are summed.
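A sketch of steps S34 to S36 with illustrative counts and weights; the index threshold is an assumption.

    # Illustrative evaluation counts from the test samples.
    correct, total = 90, 100
    true_positive, predicted_positive, actual_positive = 40, 50, 45

    accuracy = correct / total
    precision = true_positive / predicted_positive
    recall = true_positive / actual_positive
    f1 = 2 * precision * recall / (precision + recall)  # balance index

    # Weighted comprehensive index value (weights 0.3 / 0.3 / 0.4).
    composite = 0.3 * accuracy + 0.3 * recall + 0.4 * f1

    INDEX_THRESHOLD = 0.85  # assumed threshold
    if composite >= INDEX_THRESHOLD:
        print("export the trained semantic model")
    else:
        print("fine-tune the model further")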
If the comprehensive index value meets the index threshold, the model can be expected to perform well when facing unknown data, so the trained semantic model is exported and put into formal use; if the comprehensive index value does not meet the index threshold, the model performs well on the training data but not on the test data, and therefore requires fine-tuning.
In one technical solution provided by this embodiment, a pre-trained model is trained with training samples, the model is adjusted based on the loss function, and after the minimum loss function value is reached the model is evaluated with test samples. The training set is the data from which the model learns and adjusts its parameters, while the test set evaluates the model's performance on unseen data; this setup helps detect and avoid overfitting, i.e., the case where the model performs well on the training set but poorly on the test set because it has overfit the specific features of the training set and cannot generalize to new data.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a tag extraction apparatus of a hardware running environment according to an embodiment of the present application.
As shown in fig. 4, the tag extraction apparatus may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and the optional user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wireless Fidelity (Wi-Fi) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable nonvolatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device separate from the processor 1001.
It will be appreciated by those skilled in the art that the structure shown in fig. 4 is not limiting of the tag extraction apparatus and may include more or fewer components than shown, or certain components may be combined, or a different arrangement of components.
As shown in fig. 4, an operating system, a data storage module, a network communication module, a user interface module, and a tag extraction program may be included in the memory 1005 as one type of storage medium.
In the tag extraction apparatus shown in fig. 4, the network interface 1004 is mainly used for data communication with other apparatuses, and the user interface 1003 is mainly used for data interaction with the user. The processor 1001 and the memory 1005 may be provided in the tag extraction apparatus, which invokes the tag extraction program stored in the memory 1005 through the processor 1001 and performs the tag extraction method provided by the embodiments of the present application.
An embodiment of the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the embodiments of the tag extraction method described above.
Since the embodiments of the computer readable storage medium portion and the embodiments of the method portion correspond to each other, the embodiments of the computer readable storage medium portion are referred to the description of the embodiments of the method portion, and are not repeated herein.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises that element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to a person skilled in the art that the methods of the above embodiments may be implemented by software plus a necessary general hardware platform, or of course by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium (e.g., ROM/RAM, magnetic disk, or optical disk) as described above, including instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present application.
The foregoing description covers only preferred embodiments of the present application and does not thereby limit the scope of the application; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present application, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (10)

1. A label extraction method, characterized in that the label extraction method comprises the steps of:
determining a digital vector based on unstructured text to be extracted, wherein the digital vector characterizes context information and semantic information of the unstructured text;
inputting the digital vector into a trained semantic model to obtain corresponding semantic features;
and matching the semantic features with the tag pool, and determining a target tag according to a matching result.
2. The method of tag extraction of claim 1, wherein the digital vector comprises a weighted value vector, and the step of determining the digital vector based on unstructured text to be extracted comprises:
according to a word segmentation algorithm, cutting long sentences in the unstructured text to be extracted into a plurality of word segments of fixed length;
determining a query vector, a key vector and a value vector of each word segment, and calculating the attention weight of each word segment according to the query vector and the key vector;
and carrying out weighted summation on the attention weight and the value vector to obtain a weighted value vector of each word segment.
3. The method of claim 2, wherein the trained semantic model comprises a trained masked language model and a trained sentence prediction model, and wherein the step of inputting the digital vector into the trained semantic model to obtain the corresponding semantic features comprises:
randomly selecting a portion of the digital vectors and replacing the selected digital vectors with mask tokens;
inputting the digital vectors and the mask tokens into the trained masked language model, and predicting the word segments hidden by the mask tokens;
inputting the digital vectors into the trained sentence prediction model to predict the adjacency relations of the long sentences in which the word segments are located;
understanding the semantic information according to the predicted word segments and the predicted adjacency relations, and determining the semantic features corresponding to the semantic information;
dividing all the semantic features into a plurality of similar semantic feature groups, and carrying out pooling operation on each similar semantic feature group to obtain corresponding representative semantic features.
4. The tag extraction method of claim 3, wherein the step of matching the semantic features with a tag pool and determining the target tag based on the matching result comprises:
obtaining the plurality of labels in the tag pool, and comparing the vector similarity between the representative semantic features and each label;
and determining the label with the highest vector similarity as the target label.
5. The method of tag extraction of claim 1, wherein before the step of determining the digital vector based on unstructured text to be extracted, the method comprises:
crawling data on the webpage by using a crawler technology to obtain a semi-structured text;
dividing the semi-structured text into a structured text and an unstructured text according to the attribute of the body;
after the step of matching the semantic features with the tag pool and determining the target tag according to the matching result, the method comprises the following steps:
analyzing and extracting the structured text according to a standard format during text storage, and directly generating key information;
and carrying out data fusion on the target label and the key information to obtain a corresponding target structured text.
6. The method of tag extraction of claim 1, wherein before the step of determining the digital vector based on unstructured text to be extracted, the method comprises:
acquiring a training sample and corresponding standard semantic features, and determining a training digital vector based on the training sample, wherein the training digital vector characterizes context information and semantic information of the training sample;
inputting the training digital vector into a pre-trained semantic model to obtain corresponding training semantic features;
and calculating a loss function value between the training semantic feature and the standard semantic feature according to a preset loss function, and adjusting the pre-trained semantic model according to the loss function value until the minimum loss function value is reached.
7. The tag extraction method of claim 6, wherein the step of calculating a loss function value between the training semantic feature and the standard semantic feature according to a preset loss function, and adjusting the pre-trained semantic model according to the loss function value until a minimum loss function value is reached, comprises:
calculating the accuracy, recall, and balance index of the pre-trained semantic model by evaluating the model with test samples;
calculating the comprehensive index value of the pre-trained semantic model according to the calculation results and the corresponding weight values;
and if the comprehensive index value meets the index threshold, exporting the trained semantic model.
8. The tag extraction method of claim 1, wherein the step of matching the semantic features with a tag pool and determining the target tag based on the matching result comprises:
and matching the semantic features with a tag pool, and if the matching rate does not meet the lowest matching rate, taking the semantic features as new tags and adding the new tags into the tag pool.
9. A tag extraction apparatus, characterized in that the tag extraction apparatus comprises: a memory, a processor and a tag extraction program stored on the memory and executable on the processor, the tag extraction program being configured to implement the steps of the tag extraction method of any one of claims 1 to 8.
10. A computer-readable storage medium, wherein a label extraction program is stored on the computer-readable storage medium, which when executed by a processor, implements the steps of the label extraction method according to any one of claims 1 to 8.
CN202311178989.0A 2023-09-13 2023-09-13 Label extraction method, apparatus and computer readable storage medium Active CN116910279B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311178989.0A CN116910279B (en) 2023-09-13 2023-09-13 Label extraction method, apparatus and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311178989.0A CN116910279B (en) 2023-09-13 2023-09-13 Label extraction method, apparatus and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN116910279A (en) 2023-10-20
CN116910279B (en) 2024-01-05

Family

ID=88355081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311178989.0A Active CN116910279B (en) 2023-09-13 2023-09-13 Label extraction method, apparatus and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116910279B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093897A (en) * 2024-04-28 2024-05-28 浙江大华技术股份有限公司 Data element matching method, electronic equipment and computer readable storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115203421A (en) * 2022-08-02 2022-10-18 中国平安人寿保险股份有限公司 Method, device and equipment for generating label of long text and storage medium
CN115269834A (en) * 2022-06-28 2022-11-01 国家计算机网络与信息安全管理中心 High-precision text classification method and device based on BERT
CN115374771A (en) * 2022-07-12 2022-11-22 北京沃东天骏信息技术有限公司 Text label determination method and device
WO2023278070A1 (en) * 2021-06-29 2023-01-05 Microsoft Technology Licensing, Llc Automatic labeling of text data
CN115658906A (en) * 2022-11-08 2023-01-31 浙江大学 Large-scale multi-label text classification method based on label self-adaptive text representation
CN115687625A (en) * 2022-11-14 2023-02-03 五邑大学 Text classification method, device, equipment and medium
CN116108133A (en) * 2022-12-09 2023-05-12 广州仰望星空云科技有限公司 Text data processing method and device based on bert model
US20230161952A1 (en) * 2021-11-22 2023-05-25 Adobe Inc. Automatic semantic labeling of form fields with limited annotations

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023278070A1 (en) * 2021-06-29 2023-01-05 Microsoft Technology Licensing, Llc Automatic labeling of text data
US20230161952A1 (en) * 2021-11-22 2023-05-25 Adobe Inc. Automatic semantic labeling of form fields with limited annotations
CN115269834A (en) * 2022-06-28 2022-11-01 国家计算机网络与信息安全管理中心 High-precision text classification method and device based on BERT
CN115374771A (en) * 2022-07-12 2022-11-22 北京沃东天骏信息技术有限公司 Text label determination method and device
CN115203421A (en) * 2022-08-02 2022-10-18 中国平安人寿保险股份有限公司 Method, device and equipment for generating label of long text and storage medium
CN115658906A (en) * 2022-11-08 2023-01-31 浙江大学 Large-scale multi-label text classification method based on label self-adaptive text representation
CN115687625A (en) * 2022-11-14 2023-02-03 五邑大学 Text classification method, device, equipment and medium
CN116108133A (en) * 2022-12-09 2023-05-12 广州仰望星空云科技有限公司 Text data processing method and device based on bert model

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093897A (en) * 2024-04-28 2024-05-28 浙江大华技术股份有限公司 Data element matching method, electronic equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN116910279B (en) 2024-01-05

Similar Documents

Publication Publication Date Title
Vijayakumar et al. Automated risk identification using NLP in cloud based development environments
Hong et al. Comparing natural language processing methods to cluster construction schedules
CN109871688B (en) Vulnerability threat degree evaluation method
CN116910279B (en) Label extraction method, apparatus and computer readable storage medium
CN112036168B (en) Event main body recognition model optimization method, device, equipment and readable storage medium
CN112036842B (en) Intelligent matching device for scientific and technological service
Moreo et al. Learning regular expressions to template-based FAQ retrieval systems
Loyola et al. UNSL at eRisk 2021: A Comparison of Three Early Alert Policies for Early Risk Detection.
Gunaseelan et al. Automatic extraction of segments from resumes using machine learning
CN113761875B (en) Event extraction method and device, electronic equipment and storage medium
US20230047800A1 (en) Artificial intelligence-assisted non-pharmaceutical intervention data curation
Zheng et al. Named entity recognition in electric power metering domain based on attention mechanism
CN113780471A (en) Data classification model updating and application method, device, storage medium and product
CN117435718A (en) Science and technology information recommendation method and system
CN116342167A (en) Intelligent cost measurement method and device based on sequence labeling named entity recognition
CN116305257A (en) Privacy information monitoring device and privacy information monitoring method
Li et al. Automatic classification algorithm for multisearch data association rules in wireless networks
Corpuz An application method of long short-term memory neural network in classifying english and tagalog-based customer complaints, feedbacks, and commendations
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
Lopardo et al. Faithful and Robust Local Interpretability for Textual Predictions
CN113157892A (en) User intention processing method and device, computer equipment and storage medium
Chen et al. Location extraction from Twitter messages using a bidirectional long short-term memory neural network with conditional random field model
Wang et al. Interpretable machine learning-based text classification method for construction quality defect reports
Hauser et al. An improved assessing requirements quality with ML methods
Cheng et al. Double-weight LDA extracting keywords for financial fraud detection system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant