CN117131876A - Text processing method, device, computer equipment and storage medium - Google Patents

Text processing method, device, computer equipment and storage medium

Info

Publication number
CN117131876A
Authority
CN
China
Prior art keywords
text
features
node
knowledge graph
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311141812.3A
Other languages
Chinese (zh)
Inventor
王笑
刘智
翟毅腾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202311141812.3A
Publication of CN117131876A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Animal Behavior & Ethology (AREA)

Abstract

The application relates to a text processing method, apparatus, computer device and storage medium, in which a first text is input into a trained semantic learning model to obtain semantic information. Training the semantic learning model includes: obtaining target entities in a second text and a first knowledge graph, determining the corresponding nodes in the first knowledge graph according to the target entities, and generating a second knowledge graph; obtaining instance features, including text features and graph features, according to the second text and the second knowledge graph; predicting partial vectors of the instance features according to a first self-supervision task to obtain a first loss function; predicting the connection relations of the second knowledge graph in the instance features according to a second self-supervision task to obtain a second loss function; and adjusting the parameters of the semantic learning model according to the first loss function and the second loss function until the model converges. The semantic learning model trained on the instance features and the two self-supervision tasks generalizes well, and the semantic information obtained by processing text with this model is highly accurate.

Description

Text processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a text processing method, a text processing device, a computer device, and a storage medium.
Background
When the scenario behind the text data spans multiple fields, and those fields are highly interrelated and complex, conventional techniques struggle to understand and analyze every element of the scenario comprehensively and accurately, which hurts both the efficiency and the precision of text processing tasks. For example, with the vigorous development of the green industry in recent years, the green industry involves numerous financial products and services, including bonds, stocks, funds and insurance. The green industry scenario therefore spans multiple fields and produces many kinds of text data that are voluminous and semantically complex, and making decisions and assessing risk on the basis of such text data is a challenging task.
Conventional techniques rely on deep learning and natural language processing: a large number of texts are used for training, key words are extracted from the texts, and a semantic learning model is built so as to analyze and judge the corresponding scenario intelligently. However, a scenario contains entities of different dimensions, perspectives and granularities, and there are relations among these entities that go beyond sequential text semantics. Existing mainstream semantic learning methods cannot model such entity-relation information, so the accuracy of their text semantic analysis results is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text processing method, apparatus, computer device, and storage medium capable of improving accuracy of text semantic analysis results.
In a first aspect, the present application provides a text processing method, the method comprising:
inputting the first text into a trained semantic learning model to obtain semantic information; wherein training the semantic learning model comprises:
acquiring a target entity and a first knowledge graph in a second text, determining a node corresponding to the target entity in the first knowledge graph, and generating a second knowledge graph according to the corresponding node;
obtaining example features according to the second text and the second knowledge graph, wherein the example features comprise text features and graph features;
predicting partial vectors in the example features according to a first self-supervision task to obtain a first loss function;
predicting the connection relation of the second knowledge graph in the example characteristics according to a second self-supervision task to obtain a second loss function;
and adjusting parameters of the semantic learning model according to the first loss function and the second loss function until the semantic learning model reaches a convergence condition.
In one embodiment, obtaining the example feature according to the second text and the second knowledge-graph includes:
vectorizing the second knowledge graph and the second text to obtain an instance;
adding noise elements into the text vector of the instance, and adding interaction nodes into the second knowledge-graph vector of the instance;
and extracting text features and graph features of the instance, and fusing the text features and the graph features to obtain instance features.
In one embodiment, extracting text features and graph features of the instance, and fusing the text features and the graph features to obtain instance features, includes:
generating a word symbol according to the text vector of the instance;
generating node characterization of each node in the second knowledge-graph according to the second knowledge-graph vector of the example;
and carrying out fusion coding on the word symbol and the node representation to obtain the example feature.
In one embodiment, generating the second knowledge-graph includes:
acquiring a corresponding first node and a second node associated with the first node from the first knowledge graph according to the target entity in the second text;
and respectively connecting the interaction node with the first node, and generating the second knowledge graph according to the interaction node, the first node and the second node.
In one embodiment, predicting, according to a first self-supervision task, a partial vector in the example feature, to obtain a first loss function, including:
masking a portion of the vectors in the example feature, predicting a value of the portion of the vectors;
the first loss function is derived from the prediction of the partial vector.
In one embodiment, predicting, according to a second self-supervision task, a connection relationship of a second knowledge graph in the example feature, to obtain a second loss function, including:
mapping the nodes of the second knowledge graph in the example features and the node relation to obtain a mapping vector;
constructing a scoring function according to the mapping vector;
constructing a positive triplet and a negative triplet based on the mapping vector and the scoring function;
and predicting the connection relation of a second knowledge graph in the example characteristic according to the positive triplet and the negative triplet, and obtaining the second loss function.
In one embodiment, obtaining the second text includes:
acquiring a file to be processed, and extracting a text in the file to be processed;
modifying the text according to a cleaning rule;
classifying the modified text according to the category and the theme of the file to be processed to obtain a plurality of corpora;
and obtaining the second text according to the corpus.
In one embodiment, inputting the first text into the trained semantic learning model to obtain semantic information includes:
generating a third knowledge graph according to the first text and the first knowledge graph;
the first text and the first knowledge graph are coded in a combined mode, and a word symbol of the first text and vector characterization of each node in the third knowledge graph are obtained;
according to the word symbol of the first text, carrying out pooling operation on the vector characterization of each node;
and fusing the word symbol of the noise element in the first text, the vector representation of the interaction node in the knowledge graph and the vector representation after pooling.
In a second aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the text processing method according to the first aspect.
In a third aspect, the present application also provides a computer-readable storage medium. The computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the text processing method described in the first aspect.
In the text processing method, apparatus, computer device and computer-readable storage medium, a first text is input into a trained semantic learning model to obtain semantic information. Training the semantic learning model includes: obtaining target entities in a second text and a first knowledge graph, determining the corresponding nodes in the first knowledge graph according to the target entities, and generating a second knowledge graph according to the corresponding nodes; obtaining instance features, including text features and graph features, according to the second text and the second knowledge graph; predicting partial vectors of the instance features according to the first self-supervision task to obtain a first loss function; predicting the connection relations of the second knowledge graph in the instance features according to the second self-supervision task to obtain a second loss function; and adjusting the parameters of the semantic learning model according to the first loss function and the second loss function until the semantic learning model reaches the convergence condition. Because the instance features are derived from two modalities of data and the model is trained with two self-supervision tasks that realize bidirectional interaction between those modalities, the trained semantic learning model generalizes well, the semantic information obtained by processing text with it is highly accurate, and the problem of low accuracy in text semantic analysis results is solved.
Drawings
FIG. 1 is a flow diagram of a text processing method in one embodiment;
FIG. 2 is a flow diagram of training a semantic learning model in one embodiment;
FIG. 3 is a schematic diagram of the training process of a green finance scenario semantic learning model in one embodiment;
FIG. 4 is a schematic diagram of a knowledge graph of the green finance domain in one embodiment;
FIG. 5 is a schematic diagram of a green finance scenario text processing method in one embodiment;
FIG. 6 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In one embodiment, as shown in FIG. 1, a text processing method is provided. The method is described here as applied to a terminal; it is understood that it may also be applied to a server, or to a system including a terminal and a server and implemented through interaction between them. In this embodiment, the method includes: inputting the first text into the trained semantic learning model to obtain semantic information. The first text is a natural-language text, the semantic learning model analyzes and learns from the first text, and the semantic information includes the prediction results output for the first text, such as text classification results or multiple-choice question-answering results. Optionally, when the application scenario of the semantic learning model is the green finance field, the semantic information may be a judgment of whether a company's business qualifies as green finance. FIG. 2 is a flowchart of training the semantic learning model in this embodiment, which includes the following steps.
Step S201, a target entity and a first knowledge graph in a second text are obtained, a node corresponding to the target entity is determined in the first knowledge graph, and a second knowledge graph is generated according to the corresponding node.
The first knowledge graph is generated from standard data of the field to which the second text belongs. The second text is unstructured text data used to train the semantic learning model, and it is generated from the large body of texts, other than the standard data, in the field corresponding to the first text.
Illustratively, the semantic learning model is used to process green finance-related text. Unstructured text data is extracted from books, articles, announcements, news and documents related to the green finance field to obtain the second text. Industry, financial and national standard data of the green finance field, such as enterprise business-registration data, announcement data and the industry classification standard of the Green Industry Guidance Catalogue (2019), are obtained, and a knowledge graph about green finance is constructed from these data, yielding the first knowledge graph.
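As a rough illustration of this step, the sketch below builds a small knowledge graph from (head, relation, tail) triples using networkx. The triples, entity names and relation names are invented placeholders, not actual catalogue entries; a real first knowledge graph would be built from the normalized standard data described above.

```python
# Illustrative only: build a first knowledge graph from standard-data triples.
# The triples, entity names and relation names here are hypothetical placeholders.
import networkx as nx

triples = [
    ("Company A", "belongs_to_industry", "Solar Power Generation"),
    ("Solar Power Generation", "listed_in", "Green Industry Guidance Catalogue (2019)"),
    ("Company A", "issued", "Green Bond X"),
]

def build_knowledge_graph(triples):
    """Store each (head, relation, tail) triple as a labeled directed edge."""
    graph = nx.MultiDiGraph()
    for head, relation, tail in triples:
        graph.add_edge(head, tail, relation=relation)
    return graph

first_kg = build_knowledge_graph(triples)
print(first_kg.number_of_nodes(), first_kg.number_of_edges())
```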
The target entities are obtained from the words that make up the second text; optionally, the words of the second text are extracted to obtain the target entities corresponding to those words. Nodes associated with the target entities are selected from the first knowledge graph, and the target entities are connected with these nodes of the first knowledge graph to obtain the second knowledge graph.
Step S202, instance features are obtained according to the second text and the second knowledge graph, wherein the instance features include text features and graph features.
The instance is used to train the semantic learning model; the text features are the text-sequence semantics of the instance, and the graph features are the knowledge-graph structure of the instance.
The second text and the second knowledge graph are combined to obtain an instance. By combining the text features and the graph features, the text features supplement the semantic information of the second knowledge graph and the graph features supplement the relation structure of the target entities in the second text, realizing cross-modal interaction between the two different modalities of text and graph structure and yielding the instance features.
Step S203, partial vectors of the instance features are predicted according to the first self-supervision task to obtain a first loss function. By predicting partial vectors of the instance features, the first self-supervision task improves the trained semantic learning model's ability to understand the first text.
Step S204, the connection relations of the second knowledge graph in the instance features are predicted according to the second self-supervision task to obtain a second loss function. By predicting these connection relations, the second self-supervision task completes the missing nodes and node connections of the second knowledge graph.
Step S205, the parameters of the semantic learning model are adjusted according to the first loss function and the second loss function until the semantic learning model reaches the convergence condition. The semantic learning model is trained with the two self-supervision tasks, and the model is adjusted by combining the first and second loss functions. Taking the self-supervision tasks as objectives realizes bidirectional interaction between the two cross-modal sources of information in the instance, namely the knowledge-graph structure and the text-sequence semantics, and improves the generalization of the semantic learning model.
In this text processing method, the input first text is processed by the trained semantic learning model. The training data input into the semantic learning model are instance features obtained by fusing two modalities of data, text and graph structure. Because the instance features are obtained through interaction between the two modalities, the relationships among the entities they contain are more accurate, providing better data support for the semantic learning model. The training objectives of the semantic learning model are two self-supervision tasks, one for improving text-understanding ability and one for completing the connection structure of the graph. Training the semantic learning model with the combined loss functions generated by the two self-supervision tasks promotes bidirectional information flow between the text and the knowledge graph, so the trained model can reason jointly and accurately over both. As a result, the accuracy of the semantic information output by the trained semantic learning model is high, which solves the problem of low accuracy in text semantic analysis results.
In one embodiment, generating the second knowledge-graph includes: according to the target entity in the second text, a corresponding first node and a second node associated with the first node are obtained from the first knowledge graph; and generating a second knowledge graph according to the first node, the second node and the edges among the nodes.
The first node is a node directly corresponding to the target entity in the first knowledge graph, and the second node associated with the first node may be a second-order neighborhood node of the first node.
Illustratively, the second text is a text segment W. The target entities in the second text are linked to the corresponding entity nodes in the first knowledge graph to obtain the first nodes V_el, and the second-order neighborhood (bridge) nodes of these initial nodes are retrieved from the first knowledge graph to obtain the second nodes. The first nodes and the second nodes together form the total node set V of the second knowledge-graph structure, the edges E connecting these nodes are extracted from the first knowledge graph, and the second knowledge graph G = (V, E) is finally obtained.
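A minimal sketch of this subgraph extraction is given below, reusing the first_kg graph from the earlier sketch. Linking target entities by exact node-name match and expanding by a two-hop neighborhood are simplifying assumptions about the procedure; a real system would use entity linking.

```python
# Sketch: build the second knowledge graph G = (V, E) from the target entities.
# Exact-name matching and the two-hop expansion are assumptions for this sketch.
import networkx as nx

def extract_second_kg(first_kg: nx.MultiDiGraph, target_entities):
    first_nodes = {e for e in target_entities if e in first_kg}   # first nodes V_el
    undirected = first_kg.to_undirected(as_view=True)
    node_set = set(first_nodes)
    for node in first_nodes:
        for hop1 in undirected.neighbors(node):
            node_set.add(hop1)
            node_set.update(undirected.neighbors(hop1))           # second-order (bridge) nodes
    # total node set V and the edges E it induces in the first knowledge graph
    return first_kg.subgraph(node_set).copy()

second_kg = extract_second_kg(first_kg, ["Company A", "Green Bond X"])  # reuses first_kg from the earlier sketch
```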
In one embodiment, obtaining the instance features according to the second text and the second knowledge graph includes: vectorizing the second knowledge graph and the second text to obtain an instance; adding a noise element into the text vector of the instance, and adding an interaction node into the second knowledge-graph vector of the instance; and extracting the text features and graph features of the instance, and fusing the text features and the graph features to obtain the instance features.
Vectorizing the second text W and the second knowledge graph G yields the instance x = (W, G). The noise element w_int serves as the information pooling point of the instance's text vector and is added to the instance text vector w = {w_1, ..., w_i}. Because the text features extracted from the instance after w_int is added contain the noise element, the trained semantic learning model acquires the ability to resist data noise and interference, which improves the accuracy of text processing.
The interaction node v_int is the information pooling point of the instance's second knowledge-graph vector. It is added to the instance's second knowledge-graph vector G = (V, E) and connected to every node of the second knowledge graph through a new relation type r_el. From the instance augmented with v_int, the graph features are extracted, including the node features and the connection relations between nodes in the second knowledge graph.
After the noise element is added to the text vector and the interaction node is added to the second knowledge-graph vector, the instance consists of the text sequence information {w_int, w_1, ..., w_i} and the graph structure information {v_int, v_1, ..., v_j}, where w denotes the target-entity vectors of the second text in the instance and v denotes the second knowledge-graph node vectors in the instance.
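The fragment below illustrates one plausible way to assemble such an instance: a reserved noise-element id is prepended to the text ids, and an interaction node is connected to every graph node through the new relation type. The id conventions (NOISE_ID, INTERACTION_NODE, R_EL) are assumptions made for the sketch, not values from the patent.

```python
# Hypothetical instance assembly: the ids and the reserved-slot convention are assumptions.
import torch

NOISE_ID = 1          # assumed vocabulary id for the noise element w_int
INTERACTION_NODE = 0  # assumed node id reserved for the interaction node v_int
R_EL = 0              # assumed relation id for the new relation type r_el

def build_instance(text_ids, node_ids, edges):
    """Return ({w_int, w_1, ..., w_i}, {v_int, v_1, ..., v_j}, edge list)."""
    text_ids = [NOISE_ID] + list(text_ids)
    shifted = [n + 1 for n in node_ids]                            # shift ids so 0 is free for v_int
    new_edges = [(h + 1, r, t + 1) for h, r, t in edges]
    new_edges += [(INTERACTION_NODE, R_EL, v) for v in shifted]    # link v_int to every node via r_el
    return torch.tensor(text_ids), torch.tensor([INTERACTION_NODE] + shifted), new_edges

text, nodes, edges = build_instance([101, 102, 103], [7, 8], [(7, 2, 8)])
```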
Optionally, obtaining the instance features includes exchanging and jointly encoding the text sequence information and the graph structure information in the instance with a cross-modal encoder f_encoder:
(H_int, H_1, ..., H_i), (V_int, V_1, ..., V_j) = f_encoder({w_int, w_1, ..., w_i}, {v_int, v_1, ..., v_j})
The cross-modal encoder f_encoder exchanges the two modalities of data: the sequence information supplements the semantics of the graph structure, and the graph structure supplements the entity-relationship structure of the sequence information, yielding the instance features (H_int, H_1, ..., H_i), (V_int, V_1, ..., V_j). Because the instance features are obtained by interaction and joint modeling of two different modalities of information, they contain entity relationships beyond sequential text semantics, so they can be applied to different scenario tasks; a semantic learning model trained on such instance features generalizes well and can learn and predict text in a new scenario quickly.
Extracting the text features and graph features of the instance and fusing them to obtain the instance features includes: generating word symbols according to the text vector of the instance; generating node representations of each node in the second knowledge graph according to the second knowledge-graph vector of the instance; and fusion-encoding the word symbols and node representations to obtain the instance features.
Text features are modeled with a Transformer, graph features are obtained with a graph neural network, and the two are then combined. Obtaining the text features includes converting the text into initial token representations with an N-layer Transformer:
(h_int^0, h_1^0, ..., h_i^0) = LM({w_int, w_1, ..., w_i})
where LM is the language model, h_int^0 is the initial token representation of the noise element w_int, h_i^0 is the initial token representation of the target entity w_i in the instance, and i is the index of the target entity.
Obtaining the graph features includes converting the knowledge-graph nodes of the input instance into initial node representations with a node embedding layer (node_embedding):
(e_int^0, e_1^0, ..., e_j^0) = node_embedding({v_int, v_1, ..., v_j})
where e_int^0 is the initial node representation of the interaction node v_int, e_j^0 is the initial node representation of the j-th second-knowledge-graph node in the instance, and j is the index of the node.
Fusing the text features and graph features to obtain the instance features includes: jointly encoding the token representations and node representations through an M-layer fusion module to obtain the instance features (H_int, H_1, ..., H_i), (V_int, V_1, ..., V_j).
Optionally, the noise element and the interaction node serve as the two interfaces for modal interaction. The word symbol generated from the noise element and the node representation generated from the interaction node are obtained and fused to produce the instance features. Word symbols are generated by the Transformer, node representations by a GNN (Graph Neural Network), and the word symbol corresponding to the noise element and the node representation corresponding to the interaction node are fused through an L-layer fusion module built from an MLP (Multilayer Perceptron):
[h_int^l, v_int^l] = MLP([t_int^l, g_int^l])
where h_int^l is the representation of the noise element w_int at fusion layer l, v_int^l is the representation of the interaction node v_int at fusion layer l, t_int^l is the token representation of the noise element produced by Transformer layer l, g_int^l is the representation of the interaction node in the knowledge graph produced by GNN layer l, and [h_int^l, v_int^l] is the representation of the noise element and the graph interaction node after fusion by the multi-layer perceptron.
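A compact PyTorch sketch of one such fusion layer is shown below. It is our reading of the described structure, not the patented implementation; in particular the plain linear layer stands in for a real GNN layer, and the dimensions are arbitrary.

```python
# Sketch of a single fusion layer: a Transformer layer updates the word symbols,
# a stand-in "GNN" layer updates the node representations, and an MLP exchanges
# information between the noise-element slot and the interaction-node slot.
import torch
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.text_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.graph_layer = nn.Linear(dim, dim)   # placeholder for an actual GNN layer
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 2 * dim))

    def forward(self, tokens, nodes):
        tokens = self.text_layer(tokens)              # (B, i+1, dim); position 0 is w_int
        nodes = torch.relu(self.graph_layer(nodes))   # (B, j+1, dim); position 0 is v_int
        fused = self.mlp(torch.cat([tokens[:, 0], nodes[:, 0]], dim=-1))
        h_int, v_int = fused.chunk(2, dim=-1)         # exchanged pool-point representations
        tokens = torch.cat([h_int.unsqueeze(1), tokens[:, 1:]], dim=1)
        nodes = torch.cat([v_int.unsqueeze(1), nodes[:, 1:]], dim=1)
        return tokens, nodes

layer = FusionLayer()
toks, nds = layer(torch.randn(2, 12, 256), torch.randn(2, 6, 256))
```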
In one embodiment, predicting a portion of vectors in the instance feature according to a first self-supervision task to obtain a first loss function includes: masking a portion of the vectors in the instance feature, predicting a value of the portion of the vectors; a first loss function is derived from the prediction of the partial vector.
The first self-supervision task is a natural language processing task: part of the instance content is masked, the masked partial vectors are marked, and their values are predicted from the text adjacent to the masked positions. Training on this task teaches the semantic learning model to predict the true values of the masked partial vectors, which helps the model understand and generate text.
Illustratively, the first self-supervision task is masked language modeling (Masked Language Modeling), a natural language processing task in which parts of the instance are masked and labeled [MASK], and the true values are predicted from the vectors at the corresponding masked positions so that the model can learn the relationships among entities.
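For concreteness, a minimal sketch of the masking and loss computation is given below. The mask probability and the [MASK] id are assumptions; a real tokenizer's special-token ids would be used instead.

```python
# Sketch of the first self-supervision task (masked language modeling).
import torch
import torch.nn.functional as F

MASK_ID = 103        # assumed [MASK] vocabulary id
IGNORE_INDEX = -100  # unmasked positions contribute no loss

def mask_tokens(input_ids, mask_prob=0.15):
    """Randomly mask positions and return (masked inputs, labels for the loss)."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mask_prob
    labels[~masked] = IGNORE_INDEX
    corrupted = input_ids.clone()
    corrupted[masked] = MASK_ID
    return corrupted, labels

def mlm_loss(logits, labels):
    # logits: (B, seq_len, vocab); predict the true value of each masked vector
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1),
                           ignore_index=IGNORE_INDEX)
```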
According to a second self-supervision task, predicting the connection relation of a second knowledge graph in the example features to obtain a second loss function, wherein the method comprises the following steps: mapping the nodes of the second knowledge graph in the example features and the node relation to obtain a mapping vector; constructing a scoring function according to the mapping vector; constructing a positive triplet and a negative triplet based on the mapping vector and the scoring function; and predicting the connection relation of the second knowledge graph in the example characteristic according to the positive triplet and the negative triplet, and obtaining a second loss function.
Each entity node (h or t) of the knowledge graph and each relation (r) between nodes are mapped to obtain the vectors h, t and r, and a scoring function Φ_r(h, t) is defined; the scoring function converts a prediction result (for example, whether a relation exists between two entity nodes) into a concrete score that measures the quality of the prediction. Positive/negative triplets are obtained by modeling with the scoring function, and the missing links in the knowledge graph are predicted by completing the positive/negative triplets. Optionally, the second self-supervision task is a knowledge-graph link prediction task. Optionally, the scoring function is constructed with the TransE method: Φ_r(h, t) = -||h + r - t||.
The losses of training the semantic learning model with the first self-supervision task and with the second self-supervision task are computed separately, the two loss functions are added to obtain the total training loss of the semantic learning model, and back-propagation and model-parameter optimization are performed according to the total loss to obtain the trained semantic learning model. Optionally, the cross-entropy loss Loss_MLM of the masked-language-modeling task and the loss Loss_LinkPred of the knowledge-graph link-prediction process are calculated separately; Loss_MLM and Loss_LinkPred are added to obtain the total training loss for back-propagation and model-parameter optimization: Loss = Loss_LinkPred + Loss_MLM.
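The scoring function and the combined loss can be sketched as follows. The patent specifies Φ_r(h, t) = -||h + r - t|| and Loss = Loss_LinkPred + Loss_MLM; the margin-ranking form of the link-prediction loss over positive and negative triplets is a common TransE choice assumed here, not stated in the source.

```python
# Sketch: TransE scoring, a margin-based link-prediction loss over positive and
# negative triplets, and the combined training loss.
import torch
import torch.nn.functional as F

def transe_score(h, r, t):
    """phi_r(h, t) = -||h + r - t||; higher means a more plausible triplet."""
    return -torch.norm(h + r - t, p=2, dim=-1)

def link_pred_loss(pos, neg, margin=1.0):
    """pos / neg are (h, r, t) tuples of embedding tensors for positive / negative triplets."""
    pos_score = transe_score(*pos)
    neg_score = transe_score(*neg)
    return F.relu(margin - pos_score + neg_score).mean()

def total_loss(loss_mlm, loss_link_pred):
    return loss_link_pred + loss_mlm   # Loss = Loss_LinkPred + Loss_MLM
```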
In one embodiment, obtaining the second text includes: acquiring a file to be processed and extracting the text in it; modifying the text according to cleaning rules; classifying the modified text according to the category and topic of the file to be processed to obtain a plurality of corpora; and obtaining the second text from the corpora. The cleaning rules include: unifying Chinese and English punctuation marks, deleting garbled characters, and deleting stop words. A plurality of corpora are built from the cleaned texts, and the second text is a corpus entry drawn from one of them. Training the semantic learning model on corpora from the corpus database helps improve the generalization of the model.
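A small cleaning pass consistent with these rules might look like the following. The punctuation map and stop-word list are illustrative placeholders, and a real pipeline would segment the Chinese text (for example with a word segmenter) before removing stop words.

```python
# Illustrative text cleaning: unify punctuation, drop garbled characters, remove stop words.
import re

PUNCT_MAP = {"，": ",", "。": ".", "；": ";", "：": ":", "！": "!", "？": "?"}
STOP_WORDS = ["的", "了", "和"]   # hypothetical stop-word list

def clean_text(text: str) -> str:
    for zh, en in PUNCT_MAP.items():
        text = text.replace(zh, en)
    # keep CJK, ASCII letters/digits, whitespace and common punctuation; drop the rest
    text = re.sub(r"[^\u4e00-\u9fffA-Za-z0-9\s,.;:!?()%\-]", "", text)
    for word in STOP_WORDS:                      # crude stop-word removal for the sketch
        text = text.replace(word, "")
    return text.strip()

print(clean_text("该公司发行了绿色债券，规模约10亿元！"))
```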
In one embodiment, inputting the first text into the trained semantic learning model to obtain semantic information includes: generating a third knowledge graph according to the first text and the first knowledge graph; the first text and the first knowledge graph are jointly coded, and a word symbol of the first text and vector characterization of each node in the third knowledge graph are obtained; according to the word symbol of the first text, carrying out pooling operation on the vector representation of each node; and fusing the word symbol of the noise element in the first text, the vector representation of the interaction node in the knowledge graph and the vector representation after pooling.
The first text is processed by the trained semantic learning model to obtain a vector representation X for executing the downstream task, where X = MLP(H_int, V_int, G). A noise element is added to the first text, and H_int is the word symbol corresponding to that noise element. An interaction node is added to the third knowledge graph, and V_int is the representation corresponding to that interaction node. G is obtained by letting H_int query the nodes {v_j | v_j ∈ {v_1, ..., v_j}} of the third knowledge graph with an attention-based pooling operation; H_int, V_int and G are then fused to obtain the vector representation X.
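A hedged sketch of this downstream readout is given below: H_int attends over the third-knowledge-graph node representations to produce the pooled vector G, and an MLP fuses H_int, V_int and G into X. The dimensions and the scaled-dot-product form of the attention are assumptions for the sketch.

```python
# Sketch of the downstream readout X = MLP(H_int, V_int, G) with attention-based pooling.
import torch
import torch.nn as nn

class DownstreamReadout(nn.Module):
    def __init__(self, dim=256, out_dim=2):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, out_dim))

    def forward(self, h_int, v_int, node_reprs):
        # h_int, v_int: (B, dim); node_reprs: (B, j, dim)
        scores = node_reprs @ h_int.unsqueeze(-1) / node_reprs.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=1)              # H_int queries the graph nodes
        g = (attn * node_reprs).sum(dim=1)               # attention-pooled graph vector G
        return self.fuse(torch.cat([h_int, v_int, g], dim=-1))

readout = DownstreamReadout()
x = readout(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 10, 256))
```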
Optionally, the semantic learning model includes an application module, which fine-tunes the semantic learning model according to the downstream task so that the model can be applied to different downstream tasks.
In one embodiment, a training flow for a green finance scenario semantic learning model is provided, as shown in FIG. 3; the steps include:
Step S301: obtain green finance-related text, build a corpus from it, and clean and preprocess the text. The corpus consists of key words and semantic features of the green finance-related text;
Step S302: construct a knowledge graph about green finance based on the green finance-related data. The green finance-related data include enterprise business-registration data, announcement data, the industry classification standard of the Green Industry Guidance Catalogue (2019), and the like; FIG. 4 is a schematic diagram of the green finance domain knowledge graph;
Step S303: extract a corpus entry from the corpus database, extract the subgraph related to that entry from the knowledge graph, and vectorize the entry and the subgraph to create an input instance for the model;
Step S304: based on the input instance, use a cross-modal encoder combining sequence and graph to obtain the encoded instance features, and pre-train the model with two self-supervised reasoning tasks to build the semantic learning model. The two self-supervised reasoning tasks are masked language modeling and knowledge-graph link prediction.
Step S305: generalize and fine-tune the semantic model. The trained semantic model is applied to a new green finance scenario to realize rapid learning and prediction in the new scenario, for example judging whether a company's business qualifies as green finance.
FIG. 5 is a schematic diagram of the green finance scenario text processing method of this embodiment. As shown in FIG. 5, the text "[start] xxxxxxxx, xxxxx, [mask]" is input into the trained semantic learning model to obtain semantic information. Here "xxxxxxxx, xxxxx" is a natural sentence, [start] and [mask] are marker information for the natural sentence, [start] marks the start position of the natural sentence, and [mask] covers and replaces part of the words in the natural sentence. Specifically, pre-training the semantic learning model includes: inputting the text ([start] xxxxxxx, xxxxxxx, [mask]) into the LM layer (language-model layer), and obtaining the text features output by the LM layer together with the green finance knowledge graph. The text features and the knowledge graph are input into a fusion layer, which exchanges information between the two different modalities of text and graph structure and outputs the instance features. To ensure information flow between the instance features, the instance features are trained according to the two self-supervised reasoning tasks, masked language modeling and knowledge-graph link prediction, to build the semantic learning model. The semantic learning model is fine-tuned according to the downstream task to obtain the trained semantic learning model.
In this embodiment, key words and semantic features are automatically extracted by training on a large volume of financial texts, and a knowledge graph related to green finance is constructed according to industry, financial and national standards. The information of the two different modalities, text and graph structure, is made to interact and is jointly modeled, yielding instance features that can be applied to specific downstream financial scenario tasks. The model is pre-trained according to the instance features and the two self-supervised reasoning tasks, which ensures information flow between the input instances, and is further generalized according to the scenario requirements, realizing intelligent learning and modeling of semantic information in the green finance scenario.
It should be understood that, although the steps in the flowcharts related to the embodiments described above are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used to store a semantic learning model. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text processing method.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processor referred to in the embodiments provided in the present application may be a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic unit, a data processing logic unit based on quantum computing, or the like, but is not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A method of text processing, the method comprising:
inputting the first text into a trained semantic learning model to obtain semantic information; wherein training the semantic learning model comprises:
acquiring a target entity and a first knowledge graph in a second text, determining a node corresponding to the target entity in the first knowledge graph, and generating a second knowledge graph according to the corresponding node;
obtaining example features according to the second text and the second knowledge graph, wherein the example features comprise text features and graph features;
predicting partial vectors in the example features according to a first self-supervision task to obtain a first loss function;
predicting the connection relation of the second knowledge graph in the example characteristics according to a second self-supervision task to obtain a second loss function;
and adjusting parameters of the semantic learning model according to the first loss function and the second loss function until the semantic learning model reaches a convergence condition.
2. The method of claim 1, wherein deriving instance features from the second text and the second knowledge-graph comprises:
vectorizing the second knowledge graph and the second text to obtain an instance;
adding noise elements into the text vector of the instance, and adding interaction nodes into the second knowledge-graph vector of the instance;
and extracting text features and graph features of the instance, and fusing the text features and the graph features to obtain instance features.
3. The method of claim 2, wherein extracting text features and graph features of the instance, fusing the text features and the graph features, obtaining instance features, comprises:
generating a word symbol according to the text vector of the instance;
generating node characterization of each node in the second knowledge-graph according to the second knowledge-graph vector of the example;
and carrying out fusion coding on the word symbol and the node representation to obtain the example feature.
4. The method of claim 1, wherein generating a second knowledge-graph comprises:
acquiring a corresponding first node and a second node associated with the first node from the first knowledge graph according to the target entity in the second text;
and generating the second knowledge graph according to the first node, the second node and the edges among the nodes.
5. The method of claim 1, wherein predicting the partial vectors in the instance feature according to a first self-supervising task results in a first loss function, comprising:
masking a portion of the vectors in the example feature, predicting a value of the portion of the vectors;
the first loss function is derived from the prediction of the partial vector.
6. The method of claim 1, wherein predicting the connection relationship of the second knowledge-graph in the example feature according to the second self-supervision task, to obtain the second loss function, comprises:
mapping the nodes of the second knowledge graph in the example features and the node relation to obtain a mapping vector;
constructing a scoring function according to the mapping vector;
constructing a positive triplet and a negative triplet based on the mapping vector and the scoring function;
and predicting the connection relation of a second knowledge graph in the example characteristic according to the positive triplet and the negative triplet, and obtaining the second loss function.
7. The method of claim 1, wherein obtaining the second text comprises:
acquiring a file to be processed, and extracting a text in the file to be processed;
modifying the text according to a cleaning rule;
classifying the modified text according to the category and the theme of the file to be processed to obtain a plurality of corpora;
and obtaining the second text according to the corpus.
8. The method of claim 1, wherein inputting the first text into the trained semantic learning model to obtain the semantic information comprises:
generating a third knowledge graph according to the first text and the first knowledge graph;
the first text and the first knowledge graph are coded in a combined mode, and a word symbol of the first text and vector characterization of each node in the third knowledge graph are obtained;
according to the word symbol of the first text, carrying out pooling operation on the vector characterization of each node;
and fusing the word symbol of the noise element in the first text, the vector representation of the interaction node in the knowledge graph and the vector representation after pooling.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 8.
CN202311141812.3A 2023-09-05 2023-09-05 Text processing method, device, computer equipment and storage medium Pending CN117131876A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311141812.3A CN117131876A (en) 2023-09-05 2023-09-05 Text processing method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311141812.3A CN117131876A (en) 2023-09-05 2023-09-05 Text processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117131876A true CN117131876A (en) 2023-11-28

Family

ID=88862696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311141812.3A Pending CN117131876A (en) 2023-09-05 2023-09-05 Text processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117131876A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination